Live Interview

Real-time AI interview via WebSocket audio streaming — architecture, proctoring, and combined recording.

The live interview is a real-time audio conversation between the candidate and Maya, the AI interviewer powered by the Agno Interview Agent. Audio streams bidirectionally through the Vert.x WebSocket edge. The candidate's voice is sent as self-contained WebM/Opus blobs and transcribed by Google Cloud STT. The transcript is passed to a fine-tuned Vertex AI Gemini model for response generation, and the reply is synthesized by Gemini 2.5 Flash TTS (raw PCM L16, 24 kHz mono, with emotion tags) and played back in the browser.

Page: /interview/:sessionId

The interview page connects to Vert.x on mount, streams microphone audio, plays back TTS responses, and runs background proctoring via the FaceDetector API.

| UI Element | Description |
| --- | --- |
| Maya avatar (AudioVisualizer) | Animated bar chart driven by the agent TTS AnalyserNode; active while Maya speaks |
| Candidate camera preview | Bottom-left overlay. Auto-starts on mount; shows "No camera" if permission is denied |
| Candidate mic visualizer | Mini bar chart in the camera card, driven by the mic AnalyserNode |
| Live transcript panel | Right sidebar showing candidate STT results and Maya's text responses |
| STT indicator | Header badge: "STT Active" (emerald, animated) for 4 s after each transcription arrives |
| Proctoring badge | Header: green shield (clean) or amber shield with violation count |
| Mute button | Toggles mic track.enabled; does not stop recording or the WebSocket |
| End Interview | Stops the recorder, closes the WebSocket, releases all streams, shows the completion screen |
| Theme toggle | Dark / light mode, persisted in localStorage (interview-theme) |

WebSocket Connection

The service connects to Vert.x on mount using the tenant ID from sessionStorage. The URL is controlled by the NEXT_PUBLIC_WS_URL env var.

typescript
// InterviewWebSocketService.connect()
const wsUrl = process.env.NEXT_PUBLIC_WS_URL;
const url = `${wsUrl}/ws?sessionId=${encodeURIComponent(sessionId)}&tenantId=${encodeURIComponent(tenantId)}`;

const ws = new WebSocket(url);

ws.onopen = () => {
  startAudioRecording(ws);   // AudioStreamService.startRecording()
  fireGreeting();            // speakText() via backend /tts/synthesize
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  // AGENT_RESPONSE with audio metadata → handleAudioResponse()
  // AGENT_RESPONSE text-only → speakText() via TTS, then handleTextResponse()
  // TRANSCRIPTION → addCandidateMessage() + light STT indicator for 4 s
};

Set NEXT_PUBLIC_WS_URL in your deployment environment (e.g. wss://your-edge-host) to connect to the Vert.x WebSocket edge for your environment.
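
The URL construction above can be sketched as a standalone helper that fails early when the environment variable is missing. This is a minimal sketch, not the service's actual code: the name buildInterviewWsUrl and the trailing-slash handling are assumptions, while the /ws path and the sessionId and tenantId query parameters come from the snippet above.

```typescript
// Hypothetical helper: builds the Vert.x WebSocket URL from its parts.
function buildInterviewWsUrl(
  baseUrl: string | undefined,
  sessionId: string,
  tenantId: string,
): string {
  if (!baseUrl) {
    // Fail early instead of opening a socket to "undefined/ws?...".
    throw new Error("NEXT_PUBLIC_WS_URL is not set");
  }
  // Drop trailing slashes so the result never contains "//ws".
  const base = baseUrl.replace(/\/+$/, "");
  const query =
    `sessionId=${encodeURIComponent(sessionId)}` +
    `&tenantId=${encodeURIComponent(tenantId)}`;
  return `${base}/ws?${query}`;
}
```

Calling it with process.env.NEXT_PUBLIC_WS_URL keeps the misconfiguration error close to its cause rather than surfacing later as a failed WebSocket handshake.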

WebSocket Message Types

| Type (Vert.x enum) | Direction | Handled By |
| --- | --- | --- |
| AUDIO | Browser → Vert.x | Forwarded to Kafka audio.candidate.spoken |
| VIDEO | Browser → Vert.x | Forwarded to Kafka video.candidate.stream |
| AGENT_RESPONSE | Vert.x → Browser | AgentResponseHandler (audio or text) |
| AGENT_SPEAKING_START | Vert.x → Browser | Mic muted immediately, before the first audio packet arrives |
| TRANSCRIPTION | Vert.x → Browser | Final STT result, appended to the transcript panel |
| TRANSCRIPTION_INTERIM | Vert.x → Browser | Live partial transcript shown while the candidate is still speaking |
| INTERVIEW_ENDED | Vert.x → Browser | Session auto-close; shows the completion screen |
| CONTROL | Vert.x → Browser | Connection lifecycle (connected / closed) |
| ERROR | Vert.x → Browser | setConnectionError() in UI |
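
One way to make the server-to-browser types above checkable at compile time is a literal union plus a defensive parser. The following is a sketch under assumptions: only the type field is documented here, so the rest of the envelope is treated as opaque, and parseServerMessage is a hypothetical name.

```typescript
// Server-to-browser message types, taken from the table above.
const SERVER_MESSAGE_TYPES = [
  "AGENT_RESPONSE",
  "AGENT_SPEAKING_START",
  "TRANSCRIPTION",
  "TRANSCRIPTION_INTERIM",
  "INTERVIEW_ENDED",
  "CONTROL",
  "ERROR",
] as const;

type ServerMessageType = (typeof SERVER_MESSAGE_TYPES)[number];

// Assumed envelope: only `type` is documented; other fields stay opaque.
interface ServerMessage {
  type: ServerMessageType;
  [field: string]: unknown;
}

// Parses a raw frame; returns null for malformed JSON or unknown types.
function parseServerMessage(raw: string): ServerMessage | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const type = (parsed as { type?: unknown }).type;
  const known = SERVER_MESSAGE_TYPES.some((t) => t === type);
  return known ? (parsed as ServerMessage) : null;
}
```

Returning null instead of throwing lets the onmessage handler ignore frames it does not understand without tearing down the connection.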

Proactive Mic Muting

When Maya is about to speak, the server publishes an AGENT_SPEAKING_START message before the first audio packet. The browser mutes the microphone immediately on receiving this signal — typically 50–100 ms before audio begins playing. This prevents candidate audio from being sent while Maya is speaking.

| Event | Behaviour |
| --- | --- |
| AGENT_SPEAKING_START received | Mic track disabled immediately; AudioStreamService marks isAgentSpeaking = true |
| Agent audio finishes playing | Mic track re-enabled automatically; AudioStreamService marks isAgentSpeaking = false |
| Candidate speaks during playback | Audio chunks skipped (isAgentSpeaking guard); no echo is sent to STT |

Without this signal the browser would not know the agent was about to speak until the first audio chunk arrived (1–2 s later), which could result in candidate echo being transcribed as an answer.
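
The muting behaviour above can be modelled as a small state holder. This is a sketch only: the class name AgentSpeakingGate and its methods are hypothetical, and the separate userMuted flag is an assumption added so that re-enabling the mic after playback does not override a candidate who pressed Mute.

```typescript
// Sketch of the mute gate; mirrors the isAgentSpeaking flag described above.
class AgentSpeakingGate {
  private isAgentSpeaking = false;
  private userMuted = false; // assumption: Mute button state is tracked separately

  // Call when AGENT_SPEAKING_START arrives.
  onAgentSpeakingStart(): void {
    this.isAgentSpeaking = true;
  }

  // Call when the last queued agent audio buffer finishes playing.
  onAgentPlaybackFinished(): void {
    this.isAgentSpeaking = false;
  }

  // Call from the Mute button; independent of the agent state.
  setUserMuted(muted: boolean): void {
    this.userMuted = muted;
  }

  // The mic track should be enabled only when neither side has muted it.
  get micEnabled(): boolean {
    return !this.isAgentSpeaking && !this.userMuted;
  }

  // Guard applied before each recorded chunk is sent over the WebSocket.
  shouldSendChunk(): boolean {
    return this.micEnabled;
  }
}
```

In the browser, micEnabled would be copied onto the mic track's enabled property after every state change.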

Audio Format — WebM/Opus Stop-Start Cycle

Candidate audio is NOT streamed as raw PCM. The AudioStreamService runs a stop-start MediaRecorder cycle every 5 seconds. Each stop produces one complete, self-contained WebM/Opus blob (EBML header + cluster data) that the Python agent can decode as a valid WebM file without chunk assembly.

text
MediaStream (getUserMedia: echoCancellation + noiseSuppression + autoGainControl)
    │
    ├── AudioContext → AnalyserNode (fftSize=256) → mic waveform visualizer (never stopped)
    │
    └── MediaRecorder (audio/webm;codecs=opus OR audio/webm OR audio/mp4)
          │
          ├── Start new recorder immediately (gap-free)
          ├── After 5 s: stop old recorder → dataavailable fires with ONE complete WebM blob
          ├── Skip if blob < 100 bytes (silence)
          ├── Skip if muted (mic track disabled)
          └── arrayBuffer() → sendAudioChunk(buffer, chunkIndex)
                                  │
                                  ▼
                   WebSocket: { type: "AUDIO", data: base64, sampleRate: 16000,
                                encoding: "LINEAR16", chunkIndex, sessionId, tenantId }

| Direction | Container | Codec | Cycle |
| --- | --- | --- | --- |
| Candidate → Agent | WebM | Opus (48 kHz browser default) | One complete blob per 5 s |
| Agent → Candidate | Raw PCM | L16 signed 16-bit 24 kHz mono | Per sentence (base64 in AGENT_RESPONSE) |

Chrome throttles setInterval in background tabs. The service listens to visibilitychange and forces a cycle rotation when the tab becomes visible again if more than 5 s has elapsed.
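
The two skip rules and the visibility check described above reduce to a pair of pure predicates. This is a minimal sketch; the function and constant names are hypothetical, and only the 5 s cycle and the 100-byte threshold come from the text.

```typescript
const CYCLE_MS = 5_000;     // recorder rotation interval
const MIN_BLOB_BYTES = 100; // smaller blobs are treated as silence

// Decides whether a finished blob is worth sending.
function shouldSendBlob(sizeBytes: number, micMuted: boolean): boolean {
  return sizeBytes >= MIN_BLOB_BYTES && !micMuted;
}

// Intended for a visibilitychange handler once the tab is visible again:
// true when a throttled timer has let the current cycle run past 5 s.
function needsForcedRotation(lastRotationMs: number, nowMs: number): boolean {
  return nowMs - lastRotationMs > CYCLE_MS;
}
```

Keeping these checks free of MediaRecorder and DOM references makes them straightforward to unit test outside the browser.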

Audio Pipeline (End-to-End)

text
Candidate Browser
    │
    ├── MediaRecorder (WebM/Opus, 5-second blobs)
    │     └── base64 → WebSocket { type: "AUDIO" }
    │
    ▼
Vert.x WebSocket Edge (port 8080)
    │     └── Kafka → audio.candidate.spoken (key=sessionId, snappy)
    │
    ▼
Agno Interviewer Agent "Maya" (port 7778)
    ├── Kafka consumer (ThreadPoolExecutor background thread)
    ├── Google Cloud Speech-to-Text (STT)
    │     └── Config: LINEAR16, en-US, enhanced model, auto-punctuation
    │           Google awaits SILENCE_TIMEOUT (~2-3 s) for complete utterance
    ├── Vertex AI Gemini 2.0 Flash (fine-tuned) — response generation with interview plan context
    │     └── Only responds if utterance >= 10 words (short answers buffered)
    │           Confusion / off-topic detection built in
    └── Gemini 2.5 Flash TTS (Vertex AI, voice: Kore), PCM L16 24 kHz mono
          ├── Supports emotion tags: [encouraging], [thoughtful], [pause], [laugh], etc.
          └── Kafka → audio.agent.spoken (key=sessionId)
                  │
                  ▼
          Vert.x Kafka Consumer → WebSocket { type: "AGENT_RESPONSE", data: base64_pcm }
                  │
                  ▼
          AgentResponseHandler
              ├── pcmToAudioBuffer(): Int16 → Float32 → AudioBuffer (24kHz, zero decode overhead)
              ├── Sequential audio queue (never overlapping)
              ├── AnalyserNode → Maya waveform visualizer
              └── MediaStreamDestinationNode → combined interview recording track
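
The Int16 → Float32 step named in pcmToAudioBuffer() can be sketched as follows. This is an illustration rather than the handler's actual code: pcm16ToFloat32 and pcmDurationSeconds are hypothetical names, and little-endian byte order is an assumption about the TTS payload.

```typescript
const AGENT_SAMPLE_RATE_HZ = 24_000; // agent audio rate from the pipeline above

// Converts raw 16-bit signed PCM bytes to Float32 samples in [-1, 1).
// Little-endian is assumed; pass false if the payload is big-endian.
function pcm16ToFloat32(bytes: Uint8Array, littleEndian = true): Float32Array {
  const sampleCount = Math.floor(bytes.byteLength / 2); // ignore a trailing odd byte
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    out[i] = view.getInt16(i * 2, littleEndian) / 32768;
  }
  return out;
}

// Playback duration of a mono PCM payload, in seconds.
function pcmDurationSeconds(sampleCount: number): number {
  return sampleCount / AGENT_SAMPLE_RATE_HZ;
}
```

In the browser the resulting samples can be copied into an AudioBuffer created with createBuffer(1, samples.length, 24000) via copyToChannel(samples, 0), which avoids a decodeAudioData round trip.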

Greeting (Client-Side)

Maya's opening greeting is fired from the browser, not by Vert.x or a Kafka control message. Once the WebSocket connects, the page waits 800 ms and then calls speakText(), which synthesizes the greeting through the backend TTS endpoint.

typescript
// Fires once per session mount (greetingFiredRef guards against double-fire)
if (!greetingFiredRef.current) {
  greetingFiredRef.current = true;
  setTimeout(() => {
    responseHandlerRef.current
      ?.speakText(
        "Hello! Welcome to your AI interview. I'm your interviewer today. " +
          "Please take a moment to get comfortable, then tell me about yourself " +
          "and what excites you about this role.",
      )
      .catch(() => {});
  }, 800);
}

speakText() calls POST /api/v1/tts/synthesize on the backend with voice Kore (Gemini 2.5 Flash TTS), decodes the returned PCM L16 24kHz into an AudioBuffer, and queues it through the same audio queue used for agent WebSocket responses — so greeting and live responses never overlap.
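
The shared queue that keeps the greeting and live responses from overlapping might look like the sketch below. It is an assumption about the shape of that queue, not its actual implementation: SequentialPlaybackQueue is a hypothetical name, and the caller is expected to invoke onPlaybackEnded() from the audio source's ended event.

```typescript
// Plays queued items strictly one at a time, in arrival order.
class SequentialPlaybackQueue<T> {
  private readonly pending: T[] = [];
  private playing = false;
  private readonly startPlayback: (item: T) => void;

  constructor(startPlayback: (item: T) => void) {
    this.startPlayback = startPlayback;
  }

  // Adds an item and starts it immediately if nothing is playing.
  enqueue(item: T): void {
    this.pending.push(item);
    this.playNextIfIdle();
  }

  // Call when the current item has finished playing.
  onPlaybackEnded(): void {
    this.playing = false;
    this.playNextIfIdle();
  }

  private playNextIfIdle(): void {
    if (this.playing || this.pending.length === 0) return;
    const next = this.pending.shift() as T;
    this.playing = true;
    this.startPlayback(next);
  }
}
```

Because both speakText() output and AGENT_RESPONSE audio would pass through enqueue(), a late greeting simply waits behind whatever is already playing.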

Combined Interview Recording

The page records the entire interview as a single WebM file combining three streams: camera video, candidate mic audio, and agent TTS audio. 30-second chunks are sent via the WebSocket as VIDEO type messages.

typescript
// Starts when camera + WebSocket + mic recording are all active (isRecording)
const tracks = [
  ...cameraStream.getVideoTracks(),      // camera from getUserMedia
  ...micStream.getAudioTracks(),          // mic from AudioStreamService.getMicStream()
  ...(agentStream?.getAudioTracks() ?? []), // TTS from AgentResponseHandler.getAgentAudioStream()
];

const combined = new MediaStream(tracks);
const recorder = new MediaRecorder(combined, {
  mimeType: 'video/webm;codecs=vp8,opus',
  videoBitsPerSecond: 500_000,
  audioBitsPerSecond: 64_000,
});

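// The handler must be async because it awaits arrayBuffer()
recorder.ondataavailable = async (e) => {
  if (e.data.size === 0) return; // nothing captured in this timeslice
  // Sends as { type: "VIDEO", data: base64 }
  wsService.sendVideoChunk(await e.data.arrayBuffer());
};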

recorder.start(30_000); // One chunk every 30 seconds

| From | To | Via |
| --- | --- | --- |
| Browser MediaRecorder (VIDEO) | Vert.x WebSocket edge | WebSocket { type: "VIDEO" } |
| Vert.x edge | Kafka video.candidate.stream | Kafka producer (10 partitions, 1h retention) |
| Kafka | VideoStorageConsumerService (NestJS) | Kafka consumer → disk |
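
For sizing WebSocket frames and Kafka messages, the recorder settings above imply a rough per-chunk payload. The arithmetic below is an estimate only, since MediaRecorder treats the configured bitrates as targets; the helper names are hypothetical.

```typescript
// Approximate raw size of one recorder chunk at the configured bitrates.
function estimateChunkBytes(videoBps: number, audioBps: number, seconds: number): number {
  return ((videoBps + audioBps) * seconds) / 8;
}

// Base64 encodes every 3 input bytes as 4 output characters (with padding).
function base64Length(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4;
}

const rawBytes = estimateChunkBytes(500_000, 64_000, 30); // 2_115_000 bytes
const wireChars = base64Length(rawBytes);                 // 2_820_000 characters
```

At these settings each VIDEO message carries roughly 2.1 MB of media, or about 2.8 MB once base64-encoded, so any frame or message size limits between the browser and Kafka would need to allow for that.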

Proctoring

Basic automated proctoring runs during the live session. A proctoring badge appears in the header showing violation count.

| Check | Mechanism | Frequency | Violation Trigger |
| --- | --- | --- | --- |
| No face detected | FaceDetector API (Chrome 70+) | Every 5 s | 0 faces in canvas snapshot |
| Multiple people | FaceDetector API (Chrome 70+) | Every 5 s | 2 or more faces |
| Right-click prevention | contextmenu event listener | On every right-click | Always blocked |

typescript
// FaceDetector runs on a canvas snapshot of the video element.
// Feature-detect first so the check is skipped where the API is missing.
if ('FaceDetector' in window) {
  const detector = new FaceDetector({ maxDetectedFaces: 5, fastMode: true });

  setInterval(async () => {
    ctx.drawImage(videoEl, 0, 0, canvas.width, canvas.height);
    const faces = await detector.detect(canvas);

    if (faces.length === 0) recordViolation('No face detected...');
    else if (faces.length >= 2) recordViolation(`Multiple faces detected (${faces.length})...`);
  }, 5000);
}

FaceDetector is a Chrome-only experimental API (Chromium 70+). The check silently skips in Firefox and Safari without affecting the interview session. Violation count is tracked locally; it is not yet persisted server-side.
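
The two face-count rules can be separated from the detector itself, which keeps them testable in browsers and environments where FaceDetector does not exist. This is a sketch; the FaceViolation labels and the ViolationCounter class are hypothetical stand-ins for recordViolation().

```typescript
type FaceViolation = "NO_FACE" | "MULTIPLE_FACES";

// Maps a detected face count to a violation, or null when exactly one face is present.
function classifyFaceCount(faceCount: number): FaceViolation | null {
  if (faceCount === 0) return "NO_FACE";
  if (faceCount >= 2) return "MULTIPLE_FACES";
  return null;
}

// Local tally only, matching the note that violations are not persisted server-side.
class ViolationCounter {
  private count = 0;

  // Records one detection result and returns the violation, if any.
  record(faceCount: number): FaceViolation | null {
    const violation = classifyFaceCount(faceCount);
    if (violation !== null) this.count++;
    return violation;
  }

  // Value shown next to the amber shield in the header badge.
  get total(): number {
    return this.count;
  }
}
```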

Interview Completion

| Trigger | Who | State After |
| --- | --- | --- |
| All plan sections covered | Agno Interview Agent | COMPLETED → ASSESSMENT_PENDING |
| Candidate clicks End Interview | Browser (WebSocket close) | COMPLETED |
| Admin cancels | API Gateway | CANCELLED |
| Session timeout (2 h) | Vert.x TTL | COMPLETED |

Reconnection

If the WebSocket disconnects, the service attempts up to 5 reconnections with linear backoff. The Redis session remains active for 2 hours (7200 s TTL).

typescript
// InterviewWebSocketService — linear backoff: 1s, 2s, 3s, 4s, 5s
private reconnect(): void {
  this.reconnectAttempts++;
  const delay = this.reconnectDelay * this.reconnectAttempts; // 1000 ms * attempt
  setTimeout(() => this.connect().catch(() => {}), delay);
}

Interview session state (plan section, transcript, greeting flag) is stored in Redis with a 2-hour TTL. Reconnecting candidates resume from the exact point of disconnection. The Python agent checks greeting_sent in Redis to avoid re-greeting on reconnect.
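
The linear backoff and the five-attempt cap can be written as one pure function. This is a sketch with hypothetical constant and function names; the 1000 ms base delay and the limit of 5 attempts are the values stated in this section.

```typescript
const RECONNECT_BASE_DELAY_MS = 1_000;
const MAX_RECONNECT_ATTEMPTS = 5;

// Returns the delay before the given 1-based attempt, or null once attempts are exhausted.
function reconnectDelayMs(attempt: number): number | null {
  if (attempt < 1 || attempt > MAX_RECONNECT_ATTEMPTS) return null;
  return RECONNECT_BASE_DELAY_MS * attempt;
}

// Full schedule: 1 s, 2 s, 3 s, 4 s, 5 s, for 15 s of waiting in total.
const reconnectSchedule = [1, 2, 3, 4, 5].map((attempt) => reconnectDelayMs(attempt));
```

A null result is the signal to stop retrying and surface a connection error. The total wait of 15 s is far shorter than the 2-hour Redis TTL, so the session state is still available on any successful retry.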