Live Interview
Real-time AI interview via WebSocket audio streaming — architecture, proctoring, and combined recording.
The live interview is a real-time audio conversation between the candidate and Maya, the AI interviewer powered by the Agno Interview Agent. Audio streams bidirectionally through the Vert.x WebSocket edge. Candidate voice is sent as self-contained WebM/Opus blobs and transcribed by Google Cloud STT. The transcript is fed to Vertex AI Gemini (fine-tuned) for response generation, and the reply is synthesized by Gemini 2.5 Flash TTS (raw PCM L16, 24 kHz mono, with emotion tags) and played back in the browser.
Page: /interview/:sessionId
The interview page connects to Vert.x on mount, streams microphone audio, plays back TTS responses, and runs background proctoring via the FaceDetector API.
| UI Element | Description |
|---|---|
| Maya avatar (AudioVisualizer) | Animated bar chart driven by agent TTS AnalyserNode — active while Maya speaks |
| Candidate camera preview | Bottom-left overlay. Auto-starts on mount; shows "No camera" on permission deny |
| Candidate mic visualizer | Mini bar chart in the camera card, driven by mic AnalyserNode |
| Live transcript panel | Right sidebar — candidate STT results and Maya text responses |
| STT indicator | Header badge — "STT Active" (emerald, animated) for 4 s after each transcription arrives |
| Proctoring badge | Header — green shield (clean) or amber shield with violation count |
| Mute button | Toggle mic track.enabled — does not stop recording or WebSocket |
| End Interview | Stops recorder, closes WebSocket, releases all streams, shows completion screen |
| Theme toggle | Dark / light mode, persisted in localStorage (interview-theme) |
WebSocket Connection
The service connects to Vert.x on mount using the tenant ID from sessionStorage. The WebSocket base URL comes from the NEXT_PUBLIC_WS_URL environment variable, which must point at the Vert.x WebSocket edge for the target deployment.
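Because process.env.NEXT_PUBLIC_WS_URL is undefined when the variable is not set, the URL-building step can be isolated and made to fail fast. The following is a minimal sketch under that assumption; buildInterviewWsUrl is a hypothetical helper, not part of InterviewWebSocketService, and the /ws path and query parameters are taken from the connect code in this section.

```typescript
// Hypothetical helper: builds the interview WebSocket URL and fails fast
// when the base URL is missing or is not a ws:// or wss:// address.
function buildInterviewWsUrl(
  baseUrl: string | undefined,
  sessionId: string,
  tenantId: string,
): string {
  if (!baseUrl) {
    throw new Error("NEXT_PUBLIC_WS_URL is not set");
  }
  if (!/^wss?:\/\//.test(baseUrl)) {
    throw new Error("Expected a ws:// or wss:// URL, got: " + baseUrl);
  }
  // Drop trailing slashes so the result never contains "//ws".
  const trimmed = baseUrl.replace(/\/+$/, "");
  return (
    trimmed +
    "/ws?sessionId=" + encodeURIComponent(sessionId) +
    "&tenantId=" + encodeURIComponent(tenantId)
  );
}
```

Calling buildInterviewWsUrl(process.env.NEXT_PUBLIC_WS_URL, sessionId, tenantId) before new WebSocket(...) would surface a missing variable as an explicit error instead of a connection attempt to "undefined/ws".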
// InterviewWebSocketService.connect()
const wsUrl = process.env.NEXT_PUBLIC_WS_URL;
const url = `${wsUrl}/ws?sessionId=${encodeURIComponent(sessionId)}&tenantId=${encodeURIComponent(tenantId)}`;
const ws = new WebSocket(url);
ws.onopen = () => {
  startAudioRecording(ws); // AudioStreamService.startRecording()
  fireGreeting(); // speakText() via backend /tts/synthesize
};
ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  // AGENT_RESPONSE with audio metadata → handleAudioResponse()
  // AGENT_RESPONSE text-only → speakText() via TTS, then handleTextResponse()
  // TRANSCRIPTION → addCandidateMessage() + light STT indicator for 4 s
};
Set NEXT_PUBLIC_WS_URL in your deployment environment (e.g. wss://your-edge-host) to connect to the Vert.x WebSocket edge.
WebSocket Message Types
| Type (Vert.x enum) | Direction | Handled By |
|---|---|---|
| AUDIO | Browser → Vert.x | Forwarded to Kafka audio.candidate.spoken |
| VIDEO | Browser → Vert.x | Forwarded to Kafka video.candidate.stream |
| AGENT_RESPONSE | Vert.x → Browser | AgentResponseHandler (audio or text) |
| AGENT_SPEAKING_START | Vert.x → Browser | Mic muted immediately before first audio packet arrives |
| TRANSCRIPTION | Vert.x → Browser | Final STT result — appended to transcript panel |
| TRANSCRIPTION_INTERIM | Vert.x → Browser | Live partial transcript shown while candidate is still speaking |
| INTERVIEW_ENDED | Vert.x → Browser | Session auto-close — shows completion screen |
| CONTROL | Vert.x → Browser | Connection lifecycle (connected / closed) |
| ERROR | Vert.x → Browser | setConnectionError() in UI |
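One way to handle the server-to-browser types on the client is a discriminated union with an exhaustive switch. The sketch below is illustrative only: the type names come from the table above, but every payload field other than type, and the describeAction helper itself, are assumptions rather than the actual Vert.x message schema.

```typescript
// Message type names come from the table above; payload fields other than
// `type` are assumptions made for illustration.
type ServerMessage =
  | { type: "AGENT_RESPONSE"; text?: string; data?: string }
  | { type: "AGENT_SPEAKING_START" }
  | { type: "TRANSCRIPTION"; text: string }
  | { type: "TRANSCRIPTION_INTERIM"; text: string }
  | { type: "INTERVIEW_ENDED" }
  | { type: "CONTROL"; status: "connected" | "closed" }
  | { type: "ERROR"; message: string };

// Returns a short label describing the action the UI would take.
function describeAction(msg: ServerMessage): string {
  switch (msg.type) {
    case "AGENT_RESPONSE":
      // Audio payload present: play it. Text only: synthesize via TTS first.
      return msg.data ? "play-audio" : "speak-text";
    case "AGENT_SPEAKING_START":
      return "mute-mic";
    case "TRANSCRIPTION":
      return "append-transcript";
    case "TRANSCRIPTION_INTERIM":
      return "show-partial";
    case "INTERVIEW_ENDED":
      return "show-completion";
    case "CONTROL":
      return "connection-" + msg.status;
    case "ERROR":
      return "show-error";
    default: {
      // Exhaustiveness check: fails to compile if a message type is unhandled.
      const unreachable: never = msg;
      return unreachable;
    }
  }
}
```

The never assignment in the default branch means that adding a new message type to the union without a matching case becomes a compile-time error.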
Proactive Mic Muting
When Maya is about to speak, the server publishes an AGENT_SPEAKING_START message before the first audio packet. The browser mutes the microphone immediately on receiving this signal — typically 50–100 ms before audio begins playing. This prevents candidate audio from being sent while Maya is speaking.
| Event | Behaviour |
|---|---|
| AGENT_SPEAKING_START received | Mic track disabled immediately; AudioStreamService marks isAgentSpeaking = true |
| Agent audio finishes playing | Mic track re-enabled automatically; AudioStreamService marks isAgentSpeaking = false |
| Candidate speaks during playback | Audio chunks skipped (isAgentSpeaking guard) — no echo sent to STT |
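The three rows above amount to a small two-state gate. The following sketch shows that logic in isolation; AgentSpeakingGate and its method names are hypothetical, and only the isAgentSpeaking flag is taken from the description of AudioStreamService.

```typescript
// Hypothetical gate mirroring the isAgentSpeaking guard described above.
class AgentSpeakingGate {
  private isAgentSpeaking = false;

  // Called when an AGENT_SPEAKING_START message arrives.
  onAgentSpeakingStart(): void {
    this.isAgentSpeaking = true;
  }

  // Called when the last queued agent audio buffer finishes playing.
  onAgentAudioFinished(): void {
    this.isAgentSpeaking = false;
  }

  // The mic track's enabled flag would follow this value.
  micEnabled(): boolean {
    return !this.isAgentSpeaking;
  }

  // Chunks recorded while the agent is speaking are dropped, not sent.
  shouldSendChunk(): boolean {
    return !this.isAgentSpeaking;
  }
}
```

In a real page, micEnabled() would be applied to the MediaStreamTrack (track.enabled = gate.micEnabled()) and shouldSendChunk() checked before each chunk is sent.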
Audio Format — WebM/Opus Stop-Start Cycle
Audio is NOT streamed as raw PCM. The AudioStreamService uses a stop-start MediaRecorder cycle every 5 seconds. Each stop produces one complete, self-contained WebM/Opus blob (EBML header + cluster data) that the Python agent can decode as a valid WebM file without chunk assembly.
MediaStream (getUserMedia: echoCancellation + noiseSuppression + autoGainControl)
│
├── AudioContext → AnalyserNode (fftSize=256) → mic waveform visualizer (never stopped)
│
└── MediaRecorder (audio/webm;codecs=opus OR audio/webm OR audio/mp4)
│
├── Start new recorder immediately (gap-free)
├── After 5 s: stop old recorder → dataavailable fires with ONE complete WebM blob
├── Skip if blob < 100 bytes (silence)
├── Skip if muted (mic track disabled)
└── arrayBuffer() → sendAudioChunk(buffer, chunkIndex)
│
▼
WebSocket: { type: "AUDIO", data: base64, sampleRate: 16000,
encoding: "LINEAR16", chunkIndex, sessionId, tenantId }
| Direction | Container | Codec | Cycle |
|---|---|---|---|
| Candidate → Agent | WebM | Opus (48 kHz browser default) | One complete blob per 5 s |
| Agent → Candidate | Raw PCM | L16 signed 16-bit 24kHz mono | Per sentence (base64 in AGENT_RESPONSE) |
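The skip rules in the recorder cycle (blobs under 100 bytes, muted mic) and the AUDIO envelope can be sketched as two small pure functions. The names decideBlob and buildAudioMessage are hypothetical; the threshold and the sampleRate and encoding values simply mirror what this section shows and are not verified against the real service.

```typescript
const MIN_BLOB_BYTES = 100; // threshold from the recorder cycle above

interface BlobDecision {
  send: boolean;
  reason: "ok" | "too-small" | "muted";
}

// Decides whether a finished recorder blob should be sent to the server.
function decideBlob(sizeBytes: number, micMuted: boolean): BlobDecision {
  if (sizeBytes < MIN_BLOB_BYTES) return { send: false, reason: "too-small" };
  if (micMuted) return { send: false, reason: "muted" };
  return { send: true, reason: "ok" };
}

interface AudioMessage {
  type: "AUDIO";
  data: string; // base64-encoded WebM/Opus blob
  sampleRate: number;
  encoding: string;
  chunkIndex: number;
  sessionId: string;
  tenantId: string;
}

// Wraps an already base64-encoded blob in the AUDIO envelope.
function buildAudioMessage(
  base64: string,
  chunkIndex: number,
  sessionId: string,
  tenantId: string,
): AudioMessage {
  return {
    type: "AUDIO",
    data: base64,
    sampleRate: 16000,
    encoding: "LINEAR16",
    chunkIndex: chunkIndex,
    sessionId: sessionId,
    tenantId: tenantId,
  };
}
```

A dataavailable handler would call decideBlob(blob.size, muted) first and only encode and send the blob when send is true.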
Browsers throttle setInterval in background tabs. The service listens to visibilitychange and forces a cycle rotation when the tab becomes visible again if more than 5 s has elapsed.
Audio Pipeline (End-to-End)
Candidate Browser
│
├── MediaRecorder (WebM/Opus, 5-second blobs)
│ └── base64 → WebSocket { type: "AUDIO" }
│
▼
Vert.x WebSocket Edge (port 8080)
│ └── Kafka → audio.candidate.spoken (key=sessionId, snappy)
│
▼
Agno Interviewer Agent "Maya" (port 7778)
├── Kafka consumer (ThreadPoolExecutor background thread)
├── Google Cloud Speech-to-Text (STT)
│ └── Config: LINEAR16, en-US, enhanced model, auto-punctuation
│ Google awaits SILENCE_TIMEOUT (~2-3 s) for complete utterance
├── Vertex AI Gemini 2.0 Flash (fine-tuned) — response generation with interview plan context
│ └── Only responds if utterance >= 10 words (short answers buffered)
│ Confusion / off-topic detection built in
└── Gemini 2.5 Flash TTS (Vertex AI, voice: Kore), PCM L16 24 kHz mono
└── Supports emotion tags: [encouraging], [thoughtful], [pause], [laugh], etc.
└── Kafka → audio.agent.spoken (key=sessionId)
│
▼
Vert.x Kafka Consumer → WebSocket { type: "AGENT_RESPONSE", data: base64_pcm }
│
▼
AgentResponseHandler
├── pcmToAudioBuffer(): Int16 → Float32 → AudioBuffer (24kHz, zero decode overhead)
├── Sequential audio queue (never overlapping)
├── AnalyserNode → Maya waveform visualizer
└── MediaStreamAudioDestinationNode → combined interview recording track
Greeting (Client-Side)
Maya's opening greeting is fired from the browser, not by Vert.x or a Kafka control message. On WebSocket connect, the page waits 800 ms and then calls speakText(), which synthesizes the greeting through the backend TTS endpoint.
// Fires once per session mount (greetingFiredRef guards against double-fire)
if (!greetingFiredRef.current) {
  greetingFiredRef.current = true;
  setTimeout(() => {
    responseHandlerRef.current
      ?.speakText(
        "Hello! Welcome to your AI interview. I'm your interviewer today. " +
          "Please take a moment to get comfortable, then tell me about yourself " +
          "and what excites you about this role.",
      )
      .catch(() => {});
  }, 800);
}
speakText() calls POST /api/v1/tts/synthesize on the backend with voice Kore (Gemini 2.5 Flash TTS), decodes the returned PCM L16 24 kHz audio into an AudioBuffer, and queues it through the same audio queue used for agent WebSocket responses, so the greeting and live responses never overlap.
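The PCM decoding step (Int16 to Float32) is simple enough to show on its own. This is a minimal sketch, not the actual pcmToAudioBuffer implementation: pcm16ToFloat32 is a hypothetical name, and little-endian byte order is an assumption that should be checked against what the TTS endpoint actually returns.

```typescript
// Decodes signed 16-bit PCM bytes into Float32 samples in the range [-1, 1).
// Assumption: samples are little-endian; a trailing odd byte is ignored.
function pcm16ToFloat32(bytes: Uint8Array): Float32Array {
  const sampleCount = Math.floor(bytes.byteLength / 2);
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // Divide by 32768 so that -32768 maps to exactly -1.
    out[i] = view.getInt16(i * 2, true) / 32768;
  }
  return out;
}

// Duration in seconds of a mono buffer at the given sample rate.
function durationSeconds(samples: Float32Array, sampleRate: number): number {
  return samples.length / sampleRate;
}
```

In the browser, the resulting samples could be placed into a buffer created with audioContext.createBuffer(1, samples.length, 24000) using copyToChannel(samples, 0), which avoids a decodeAudioData round trip.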
Combined Interview Recording
The page records the entire interview as a single WebM file combining three streams: camera video, candidate mic audio, and agent TTS audio. 30-second chunks are sent via the WebSocket as VIDEO type messages.
// Starts when camera + WebSocket + mic recording are all active (isRecording)
const tracks = [
  ...cameraStream.getVideoTracks(), // camera from getUserMedia
  ...micStream.getAudioTracks(), // mic from AudioStreamService.getMicStream()
  ...(agentStream?.getAudioTracks() ?? []), // TTS from AgentResponseHandler.getAgentAudioStream()
];
const combined = new MediaStream(tracks);
const recorder = new MediaRecorder(combined, {
  mimeType: 'video/webm;codecs=vp8,opus',
  videoBitsPerSecond: 500_000,
  audioBitsPerSecond: 64_000,
});
// The handler must be async because it awaits Blob.arrayBuffer()
recorder.ondataavailable = async (e) => {
  // Sends as { type: "VIDEO", data: base64 }
  wsService.sendVideoChunk(await e.data.arrayBuffer());
};
recorder.start(30_000); // One chunk every 30 seconds
| From | To | Via |
|---|---|---|
| Browser MediaRecorder (VIDEO) | Vert.x WebSocket edge | WebSocket { type: "VIDEO" } |
| Vert.x edge | Kafka video.candidate.stream | Kafka producer (10 partitions, 1h retention) |
| Kafka | VideoStorageConsumerService (NestJS) | Kafka consumer → disk |
Proctoring
Basic automated proctoring runs during the live session. A proctoring badge appears in the header showing violation count.
| Check | Mechanism | Frequency | Violation Trigger |
|---|---|---|---|
| No face detected | FaceDetector API (Chrome 70+) | Every 5 s | 0 faces in canvas snapshot |
| Multiple people | FaceDetector API (Chrome 70+) | Every 5 s | 2 or more faces |
| Right-click prevention | contextmenu event listener | On every right-click | Always blocked |
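The face-count rules in the table and the header badge behaviour (green when clean, amber with a count otherwise) can be expressed as pure functions, which makes them easy to test without a camera. All names in this sketch (classifyFaces, countViolations, badgeFor) are hypothetical and are not taken from the page's source.

```typescript
type FaceCheck = "ok" | "no-face" | "multiple-faces";

// Maps the number of detected faces in one snapshot to a check result.
function classifyFaces(faceCount: number): FaceCheck {
  if (faceCount === 0) return "no-face";
  if (faceCount >= 2) return "multiple-faces";
  return "ok";
}

// Counts how many snapshots in a sequence would be recorded as violations.
function countViolations(faceCounts: number[]): number {
  let violations = 0;
  for (let i = 0; i < faceCounts.length; i++) {
    if (classifyFaces(faceCounts[i]) !== "ok") violations++;
  }
  return violations;
}

interface BadgeState {
  color: "green" | "amber";
  violations: number;
}

// Green shield when clean, amber shield with the violation count otherwise.
function badgeFor(violations: number): BadgeState {
  return { color: violations === 0 ? "green" : "amber", violations: violations };
}
```

For example, five snapshots with face counts 1, 0, 2, 1, 3 contain three violations, so the badge would be amber with a count of 3.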
// FaceDetector runs on a canvas snapshot of the video element
const detector = new FaceDetector({ maxDetectedFaces: 5, fastMode: true });
setInterval(async () => {
  ctx.drawImage(videoEl, 0, 0, canvas.width, canvas.height);
  const faces = await detector.detect(canvas);
  if (faces.length === 0) recordViolation('No face detected...');
  else if (faces.length >= 2) recordViolation(`Multiple faces detected (${faces.length})...`);
}, 5000);
Interview Completion
| Trigger | Who | State After |
|---|---|---|
| All plan sections covered | Agno Interview Agent | COMPLETED → ASSESSMENT_PENDING |
| Candidate clicks End Interview | Browser (WebSocket close) | COMPLETED |
| Admin cancels | API Gateway | CANCELLED |
| Session timeout (2 h) | Vert.x TTL | COMPLETED |
Reconnection
If the WebSocket disconnects, the service attempts up to 5 reconnections with linear backoff. The Redis session remains active for 2 hours (7200 s TTL).
// InterviewWebSocketService: linear backoff of 1 s, 2 s, 3 s, 4 s, 5 s
private reconnect(): void {
  this.reconnectAttempts++;
  const delay = this.reconnectDelay * this.reconnectAttempts; // 1000 ms * attempt
  setTimeout(() => this.connect().catch(() => {}), delay);
}
greeting_sent is stored in Redis to avoid re-greeting on reconnect.
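The linear backoff schedule and the attempt cap can be worked through with a small standalone sketch. The function names here are hypothetical; the 1000 ms base delay and the limit of 5 attempts are the values stated in this section.

```typescript
// Delay before the given attempt (1-based), or null once the cap is exceeded.
function nextReconnectDelay(
  attempt: number,
  baseDelayMs: number,
  maxAttempts: number,
): number | null {
  if (attempt < 1 || attempt > maxAttempts) return null;
  return baseDelayMs * attempt;
}

// Full schedule of delays for all permitted attempts.
function reconnectSchedule(baseDelayMs: number, maxAttempts: number): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    delays.push(baseDelayMs * attempt);
  }
  return delays;
}

// Sum of all delays: the waiting time spent before giving up.
function totalBackoffMs(baseDelayMs: number, maxAttempts: number): number {
  let total = 0;
  const delays = reconnectSchedule(baseDelayMs, maxAttempts);
  for (let i = 0; i < delays.length; i++) total += delays[i];
  return total;
}
```

With a 1000 ms base and 5 attempts, the schedule is 1000, 2000, 3000, 4000 and 5000 ms, or 15 s of waiting in total (excluding connection time), which is far shorter than the 7200 s session TTL.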