Live Interview

Real-time AI interview via WebSocket audio streaming — architecture, proctoring, and combined recording.

The live interview is a real-time audio conversation between the candidate and Maya, the AI interviewer powered by the Agno Interview Agent. Audio streams bidirectionally through the Vert.x WebSocket edge. Candidate voice is sent as self-contained WebM/Opus blobs and transcribed by Google Cloud STT. The transcript is fed to Vertex AI Gemini (fine-tuned) for response generation, and the response is played back via Gemini 2.5 Flash TTS (raw PCM L16, 24 kHz mono, with emotion tags).

Page: /interview/:sessionId

The interview page connects to Vert.x on mount, streams microphone audio, plays back TTS responses, and runs background proctoring via the FaceDetector API.

UI Element	Description
Maya avatar (AudioVisualizer)	Animated bar chart driven by agent TTS AnalyserNode — active while Maya speaks
Candidate camera preview	Bottom-left overlay. Auto-starts on mount; shows "No camera" on permission deny
Candidate mic visualizer	Mini bar chart in the camera card, driven by mic AnalyserNode
Live transcript panel	Right sidebar — candidate STT results and Maya text responses
STT indicator	Header badge — "STT Active" (emerald, animated) for 4 s after each transcription arrives
Proctoring badge	Header — green shield (clean) or amber shield with violation count
Mute button	Toggle mic track.enabled — does not stop recording or WebSocket
End Interview	Stops recorder, closes WebSocket, releases all streams, shows completion screen
Theme toggle	Dark / light mode, persisted in localStorage (interview-theme)

WebSocket Connection

The service connects to Vert.x on mount using the tenant ID from sessionStorage. The URL is controlled by the NEXT_PUBLIC_WS_URL env var. Set it in your deployment environment (e.g. wss://your-edge-host) to point at the Vert.x WebSocket edge.

typescript

// InterviewWebSocketService.connect()
const wsUrl = process.env.NEXT_PUBLIC_WS_URL;
const url = `${wsUrl}/ws?sessionId=${encodeURIComponent(sessionId)}&tenantId=${encodeURIComponent(tenantId)}`;

const ws = new WebSocket(url);

ws.onopen = () => {
  startAudioRecording(ws);   // AudioStreamService.startRecording()
  fireGreeting();            // speakText() via backend /tts/synthesize
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  // AGENT_RESPONSE with audio metadata → handleAudioResponse()
  // AGENT_RESPONSE text-only → speakText() via TTS, then handleTextResponse()
  // TRANSCRIPTION → addCandidateMessage() + light STT indicator for 4 s
};
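
The URL construction above can be isolated into a small pure function. The sketch below is illustrative only: the name buildWsUrl is hypothetical and not part of InterviewWebSocketService as documented, and the fail-fast check and trailing-slash handling are assumptions about sensible behaviour rather than what the service is known to do.

```typescript
// Hypothetical helper (not from the documented service): builds the Vert.x
// WebSocket URL from the base URL, session ID, and tenant ID.
function buildWsUrl(
  baseUrl: string | undefined,
  sessionId: string,
  tenantId: string,
): string {
  // Fail fast instead of silently producing "undefined/ws?...".
  if (!baseUrl) {
    throw new Error("NEXT_PUBLIC_WS_URL is not set");
  }
  // Drop trailing slashes so the path does not become "//ws".
  const trimmed = baseUrl.replace(/\/+$/, "");
  const query =
    `sessionId=${encodeURIComponent(sessionId)}` +
    `&tenantId=${encodeURIComponent(tenantId)}`;
  return `${trimmed}/ws?${query}`;
}
```

In the page this would be called with process.env.NEXT_PUBLIC_WS_URL as the first argument, so a missing env var surfaces as an explicit error before a socket is opened.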

WebSocket Message Types

Type (Vert.x enum)	Direction	Handled By
AUDIO	Browser → Vert.x	Forwarded to Kafka audio.candidate.spoken
VIDEO	Browser → Vert.x	Forwarded to Kafka video.candidate.stream
AGENT_RESPONSE	Vert.x → Browser	AgentResponseHandler (audio or text)
AGENT_SPEAKING_START	Vert.x → Browser	Mic muted immediately before first audio packet arrives
TRANSCRIPTION	Vert.x → Browser	Final STT result — appended to transcript panel
TRANSCRIPTION_INTERIM	Vert.x → Browser	Live partial transcript shown while candidate is still speaking
INTERVIEW_ENDED	Vert.x → Browser	Session auto-close — shows completion screen
CONTROL	Vert.x → Browser	Connection lifecycle (connected / closed)
ERROR	Vert.x → Browser	setConnectionError() in UI
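
One way to keep the browser-side handling of these types exhaustive is a string-literal union with a switch that fails to compile when a type is missed. This is a sketch under assumptions: the type names come from the table above, while ServerMessage, UiAction, and routeServerMessage are hypothetical names introduced for illustration.

```typescript
// Server → browser message types, taken from the table above.
type ServerMessageType =
  | "AGENT_RESPONSE"
  | "AGENT_SPEAKING_START"
  | "TRANSCRIPTION"
  | "TRANSCRIPTION_INTERIM"
  | "INTERVIEW_ENDED"
  | "CONTROL"
  | "ERROR";

// Assumed envelope shape; the real payload fields may differ.
interface ServerMessage {
  type: ServerMessageType;
  data?: string;
}

// Hypothetical UI actions, one per table row.
type UiAction =
  | "playAgentResponse"
  | "muteMic"
  | "appendTranscript"
  | "showInterimTranscript"
  | "showCompletion"
  | "updateConnection"
  | "showError";

function routeServerMessage(msg: ServerMessage): UiAction {
  switch (msg.type) {
    case "AGENT_RESPONSE": return "playAgentResponse";
    case "AGENT_SPEAKING_START": return "muteMic";
    case "TRANSCRIPTION": return "appendTranscript";
    case "TRANSCRIPTION_INTERIM": return "showInterimTranscript";
    case "INTERVIEW_ENDED": return "showCompletion";
    case "CONTROL": return "updateConnection";
    case "ERROR": return "showError";
    default: {
      // Compile-time exhaustiveness check: adding a new type breaks this line.
      const unreachable: never = msg.type;
      throw new Error(`Unhandled message type: ${String(unreachable)}`);
    }
  }
}
```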

Proactive Mic Muting

When Maya is about to speak, the server publishes an AGENT_SPEAKING_START message before the first audio packet. The browser mutes the microphone immediately on receiving this signal — typically 50–100 ms before audio begins playing. This prevents candidate audio from being sent while Maya is speaking.

Event	Behaviour
AGENT_SPEAKING_START received	Mic track disabled immediately; AudioStreamService marks isAgentSpeaking = true
Agent audio finishes playing	Mic track re-enabled automatically; AudioStreamService marks isAgentSpeaking = false
Candidate speaks during playback	Audio chunks skipped (isAgentSpeaking guard) — no echo sent to STT

Without this signal the browser would not know the agent was about to speak until the first audio chunk arrived (1–2 s later), which could result in candidate echo being transcribed as an answer.
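
The muting rules above amount to a small piece of state. The sketch below models it as a hypothetical MicGate class; only the isAgentSpeaking flag is taken from the description. How the manual Mute button interacts with the automatic re-enable is an assumption here (the sketch keeps the mic off if the candidate muted it themselves).

```typescript
// Hypothetical gate mirroring the isAgentSpeaking guard described above.
class MicGate {
  private isAgentSpeaking = false;
  private userMuted = false;

  // Call when AGENT_SPEAKING_START arrives.
  onAgentSpeakingStart(): void {
    this.isAgentSpeaking = true;
  }

  // Call when the last queued agent audio has finished playing.
  onAgentAudioFinished(): void {
    this.isAgentSpeaking = false;
  }

  // Call from the Mute button (assumed to be independent of agent state).
  setUserMuted(muted: boolean): void {
    this.userMuted = muted;
  }

  // Value to assign to the mic track's `enabled` property.
  get micEnabled(): boolean {
    return !this.isAgentSpeaking && !this.userMuted;
  }

  // Chunks recorded while the gate is closed are skipped, not sent to STT.
  shouldSendChunk(): boolean {
    return this.micEnabled;
  }
}
```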

Audio Format — WebM/Opus Stop-Start Cycle

Audio is NOT streamed as raw PCM. The AudioStreamService uses a stop-start MediaRecorder cycle every 5 seconds. Each stop produces one complete, self-contained WebM/Opus blob (EBML header + cluster data) that the Python agent can decode as a valid WebM file without chunk assembly.

text

MediaStream (getUserMedia: echoCancellation + noiseSuppression + autoGainControl)
    │
    ├── AudioContext → AnalyserNode (fftSize=256) → mic waveform visualizer (never stopped)
    │
    └── MediaRecorder (audio/webm;codecs=opus OR audio/webm OR audio/mp4)
          │
          ├── Start new recorder immediately (gap-free)
          ├── After 5 s: stop old recorder → dataavailable fires with ONE complete WebM blob
          ├── Skip if blob < 100 bytes (silence)
          ├── Skip if muted (mic track disabled)
          └── arrayBuffer() → sendAudioChunk(buffer, chunkIndex)
                                  │
                                  ▼
                   WebSocket: { type: "AUDIO", data: base64, sampleRate: 16000,
                                encoding: "LINEAR16", chunkIndex, sessionId, tenantId }

Direction	Container	Codec	Cycle
Candidate → Agent	WebM	Opus (48 kHz browser default)	One complete blob per 5 s
Agent → Candidate	Raw PCM	L16 signed 16-bit 24kHz mono	Per sentence (base64 in AGENT_RESPONSE)
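
The recorder's MIME fallback and the two skip rules from the diagram can be expressed as pure functions. This is a minimal sketch: the candidate list and the 100-byte threshold come from the diagram, while the function names are hypothetical and the support check is injected so the logic can run outside a browser.

```typescript
// Preference order from the diagram above.
const AUDIO_MIME_CANDIDATES = [
  "audio/webm;codecs=opus",
  "audio/webm",
  "audio/mp4",
] as const;

// Returns the first supported MIME type, or undefined if none is supported.
function pickAudioMimeType(
  isSupported: (mime: string) => boolean,
): string | undefined {
  return AUDIO_MIME_CANDIDATES.find((mime) => isSupported(mime));
}

// Blobs under 100 bytes are treated as silence; muted chunks are never sent.
function shouldSendBlob(blobSizeBytes: number, micMuted: boolean): boolean {
  const MIN_BLOB_BYTES = 100;
  return blobSizeBytes >= MIN_BLOB_BYTES && !micMuted;
}
```

In the browser the predicate would typically be (mime) => MediaRecorder.isTypeSupported(mime), and shouldSendBlob would be checked inside the dataavailable handler before the blob is converted and sent.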

Chrome throttles setInterval in background tabs. The service listens to visibilitychange and forces a cycle rotation when the tab becomes visible again if more than 5 s has elapsed.
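
The catch-up decision described above reduces to one comparison. The sketch below is an assumption about how that check might look; shouldForceRotation and CYCLE_MS are hypothetical names, and timestamps are passed in so the function stays deterministic.

```typescript
// Length of one stop-start cycle, from the section above.
const CYCLE_MS = 5_000;

// True when the tab is visible and the last rotation is overdue,
// i.e. a throttled timer has missed at least one cycle.
function shouldForceRotation(
  lastRotationAtMs: number,
  nowMs: number,
  tabVisible: boolean,
): boolean {
  return tabVisible && nowMs - lastRotationAtMs > CYCLE_MS;
}
```

A visibilitychange listener would call this with Date.now() and document.visibilityState === "visible", then stop the current recorder and start a new one when it returns true.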

Audio Pipeline (End-to-End)

text

Candidate Browser
    │
    ├── MediaRecorder (WebM/Opus, 5-second blobs)
    │     └── base64 → WebSocket { type: "AUDIO" }
    │
    ▼
Vert.x WebSocket Edge (port 8080)
    │     └── Kafka → audio.candidate.spoken (key=sessionId, snappy)
    │
    ▼
Agno Interviewer Agent "Maya" (port 7778)
    ├── Kafka consumer (ThreadPoolExecutor background thread)
    ├── Google Cloud Speech-to-Text (STT)
    │     └── Config: LINEAR16, en-US, enhanced model, auto-punctuation
    │           Google awaits SILENCE_TIMEOUT (~2-3 s) for complete utterance
    ├── Vertex AI Gemini 2.0 Flash (fine-tuned) — response generation with interview plan context
    │     └── Only responds if utterance >= 10 words (short answers buffered)
    │           Confusion / off-topic detection built in
    └── Gemini 2.5 Flash TTS (Vertex AI, voice: Kore), PCM L16 24 kHz mono
          ├── Supports emotion tags: [encouraging], [thoughtful], [pause], [laugh], etc.
          └── Kafka → audio.agent.spoken (key=sessionId)
                  │
                  ▼
          Vert.x Kafka Consumer → WebSocket { type: "AGENT_RESPONSE", data: base64_pcm }
                  │
                  ▼
          AgentResponseHandler
              ├── pcmToAudioBuffer(): Int16 → Float32 → AudioBuffer (24kHz, zero decode overhead)
              ├── Sequential audio queue (never overlapping)
              ├── AnalyserNode → Maya waveform visualizer
              └── MediaStreamDestinationNode → combined interview recording track
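
The Int16 → Float32 step of pcmToAudioBuffer() can be sketched as follows. This is not the actual implementation: the function name pcm16ToFloat32 is hypothetical, and little-endian sample order is an assumption that should be verified against the real TTS output.

```typescript
// Converts raw signed 16-bit PCM bytes into Float32 samples in [-1, 1).
// ASSUMPTION: samples are little-endian; flip the flag if the stream is not.
function pcm16ToFloat32(bytes: Uint8Array): Float32Array {
  const sampleCount = Math.floor(bytes.byteLength / 2); // ignore a trailing odd byte
  const view = new DataView(bytes.buffer, bytes.byteOffset, sampleCount * 2);
  const samples = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // 32768 maps the Int16 range [-32768, 32767] onto [-1, 1).
    samples[i] = view.getInt16(i * 2, true) / 32768;
  }
  return samples;
}
```

In the browser the result would be copied into a mono buffer created with audioContext.createBuffer(1, samples.length, 24000) via copyToChannel(samples, 0), which avoids a decodeAudioData() round trip.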

Greeting (Client-Side)

Maya's opening greeting is fired from the browser, not by Vert.x or a Kafka control message. On WebSocket connect, the page waits 800 ms and then calls speakText(), which synthesizes the greeting through the backend TTS endpoint.

typescript

// Fires once per session mount (greetingFiredRef guards against double-fire)
if (!greetingFiredRef.current) {
  greetingFiredRef.current = true;
  setTimeout(() => {
    responseHandlerRef.current
      ?.speakText(
        "Hello! Welcome to your AI interview. I'm your interviewer today. " +
          "Please take a moment to get comfortable, then tell me about yourself " +
          "and what excites you about this role.",
      )
      .catch(() => {});
  }, 800);
}

speakText() calls POST /api/v1/tts/synthesize on the backend with voice Kore (Gemini 2.5 Flash TTS), decodes the returned PCM L16 24kHz into an AudioBuffer, and queues it through the same audio queue used for agent WebSocket responses — so greeting and live responses never overlap.
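
The shared audio queue can be pictured as a first-in, first-out list with a single "playing" flag. The class below is a hypothetical sketch of that idea, not the AgentResponseHandler code; the start callback stands in for whatever actually begins playback of a clip.

```typescript
// Hypothetical sequential queue: at most one clip plays at a time.
class PlaybackQueue<T> {
  private readonly pending: T[] = [];
  private playing = false;
  private readonly start: (clip: T) => void;

  constructor(start: (clip: T) => void) {
    this.start = start;
  }

  // Adds a clip and starts it immediately if nothing is playing.
  enqueue(clip: T): void {
    this.pending.push(clip);
    this.playNext();
  }

  // Call from the clip's "ended" callback to advance the queue.
  onClipEnded(): void {
    this.playing = false;
    this.playNext();
  }

  private playNext(): void {
    if (this.playing) return;
    const next = this.pending.shift();
    if (next === undefined) return;
    this.playing = true;
    this.start(next);
  }
}
```

Because the greeting and WebSocket responses would both go through enqueue(), a reply that arrives while the greeting is still playing waits until onClipEnded() fires.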

Combined Interview Recording

The page records the entire interview as a single WebM file combining three streams: camera video, candidate mic audio, and agent TTS audio. 30-second chunks are sent via the WebSocket as VIDEO type messages.

typescript

// Starts when camera + WebSocket + mic recording are all active (isRecording)
const tracks = [
  ...cameraStream.getVideoTracks(),      // camera from getUserMedia
  ...micStream.getAudioTracks(),          // mic from AudioStreamService.getMicStream()
  ...(agentStream?.getAudioTracks() ?? []), // TTS from AgentResponseHandler.getAgentAudioStream()
];

const combined = new MediaStream(tracks);
const recorder = new MediaRecorder(combined, {
  mimeType: 'video/webm;codecs=vp8,opus',
  videoBitsPerSecond: 500_000,
  audioBitsPerSecond: 64_000,
});

// The handler must be async because it awaits the Blob → ArrayBuffer conversion
recorder.ondataavailable = async (e) => {
  // Sends as { type: "VIDEO", data: base64 }
  wsService.sendVideoChunk(await e.data.arrayBuffer());
};

recorder.start(30_000); // One chunk every 30 seconds

From	To	Via
Browser MediaRecorder (VIDEO)	Vert.x WebSocket edge	WebSocket { type: "VIDEO" }
Vert.x edge	Kafka video.candidate.stream	Kafka producer (10 partitions, 1h retention)
Kafka	VideoStorageConsumerService (NestJS)	Kafka consumer → disk

Proctoring

Basic automated proctoring runs during the live session. A proctoring badge appears in the header showing violation count.

Check	Mechanism	Frequency	Violation Trigger
No face detected	FaceDetector API (Chrome 70+)	Every 5 s	0 faces in canvas snapshot
Multiple people	FaceDetector API (Chrome 70+)	Every 5 s	2 or more faces
Right-click prevention	contextmenu event listener	On every right-click	Always blocked

typescript

// FaceDetector runs on a canvas snapshot of the video element
const detector = new FaceDetector({ maxDetectedFaces: 5, fastMode: true });

setInterval(async () => {
  ctx.drawImage(videoEl, 0, 0, canvas.width, canvas.height);
  const faces = await detector.detect(canvas);

  if (faces.length === 0) recordViolation('No face detected...');
  else if (faces.length >= 2) recordViolation(`Multiple faces detected (${faces.length})...`);
}, 5000);

FaceDetector is a Chrome-only experimental API (Chromium 70+). In Firefox and Safari the check is silently skipped without affecting the interview session. The violation count is tracked locally; it is not yet persisted server-side.
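
The violation rules in the table, plus the browser-support guard, can be sketched as two small functions. The names classifyFaceCount and supportsFaceDetector are hypothetical; the support check takes a generic scope object (window in the browser) so it can run in any environment.

```typescript
type FaceCheck =
  | { violation: false }
  | { violation: true; reason: string };

// Applies the table's rules: 0 faces or 2+ faces is a violation.
function classifyFaceCount(faceCount: number): FaceCheck {
  if (faceCount === 0) {
    return { violation: true, reason: "No face detected" };
  }
  if (faceCount >= 2) {
    return { violation: true, reason: `Multiple faces detected (${faceCount})` };
  }
  return { violation: false };
}

// Feature detection: true only where the FaceDetector constructor exists.
function supportsFaceDetector(scope: object): boolean {
  return "FaceDetector" in scope;
}
```

Calling supportsFaceDetector(window) before constructing the detector is one way to get the silent skip in unsupported browsers.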

Interview Completion

Trigger	Who	State After
All plan sections covered	Agno Interview Agent	COMPLETED → ASSESSMENT_PENDING
Candidate clicks End Interview	Browser (WebSocket close)	COMPLETED
Admin cancels	API Gateway	CANCELLED
Session timeout (2 h)	Vert.x TTL	COMPLETED

Reconnection

If the WebSocket disconnects, the service attempts up to 5 reconnections with linear backoff. The Redis session remains active for 2 hours (7200 s TTL).

typescript

// InterviewWebSocketService — linear backoff: 1s, 2s, 3s, 4s, 5s
private reconnect(): void {
  this.reconnectAttempts++;
  const delay = this.reconnectDelay * this.reconnectAttempts; // 1000 ms * attempt
  setTimeout(() => this.connect().catch(() => {}), delay);
}

Interview session state (plan section, transcript, greeting flag) is stored in Redis with a 2-hour TTL. Reconnecting candidates resume from the exact point of disconnection. The Python agent checks greeting_sent in Redis to avoid re-greeting on reconnect.
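
The retry budget and linear backoff described in this section can be written as a pure function. This is a sketch under assumptions: the constants mirror the documented values (5 attempts, 1 s base), but reconnectDelayMs and the constant names are hypothetical and not taken from InterviewWebSocketService.

```typescript
const MAX_RECONNECT_ATTEMPTS = 5;
const BASE_RECONNECT_DELAY_MS = 1_000;

// Returns the delay before the given 1-based attempt,
// or null once the retry budget is exhausted.
function reconnectDelayMs(attempt: number): number | null {
  if (attempt < 1 || attempt > MAX_RECONNECT_ATTEMPTS) {
    return null;
  }
  return BASE_RECONNECT_DELAY_MS * attempt; // 1 s, 2 s, 3 s, 4 s, 5 s
}
```

A caller would stop scheduling reconnects and surface a connection error when the function returns null. The five delays total 15 s, far inside the 2-hour Redis TTL, so session state is still available when a retry succeeds.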
