Multi-tenant AI interview platform architecture with event-driven microservices running on GKE.
The AI Interview System is an event-driven, multi-tenant platform deployed on Google Kubernetes Engine. At its core is a NestJS API gateway that orchestrates three subsystems: the Agno Python agents for AI reasoning, the Vert.x WebSocket edge for real-time interview audio, and the HITL workflow engine for human oversight. All services communicate asynchronously via Google Managed Kafka.
Architecture Layers
| Layer | Technology | Responsibility |
|---|---|---|
| API Gateway | NestJS 10 + TypeScript | REST Integration API, JWT/API-key auth, multi-tenant routing, webhook delivery |
| Agno Planner Agent | Python + Agno + Vertex AI | Skill validation via pgvector, interview plan generation (Gemini 2.0 Flash fine-tuned) |
| Agno Interviewer Agent — Maya | Python + Agno + Vertex AI | Real-time interview conduction, Google STT, LLM response, Google TTS synthesis |
| Agno Assessor Agent | Python + Agno + Vertex AI | Post-interview transcript evaluation, structured hiring recommendation |
| WebSocket Edge | Vert.x 4.x (JVM) | Real-time audio streaming, 50K+ WebSocket connections per node, Redis session routing |
| Database | PostgreSQL 16 + pgvector | Interview data, workflow states, skill embeddings with Row-Level Security |
| Event Bus | Google Managed Kafka (SASL_SSL) | Async event streaming — 100 partitions per audio topic |
| Cache / Session Store | Redis 7 | Interview session state, per-session audio buffers, WebSocket pod routing registry |
Production Infrastructure
The platform runs on GKE cluster teamcast-ai-clust in Google Cloud us-central1.
Node Pools
| Node Pool | Nodes | vCPU | RAM | Hosts |
|---|---|---|---|---|
| standard-8-pool | 3x n2-standard-8 | 8 each (24 total) | 32GB each (96GB total) | All services (current POC) |
External Services
| Service | Provider | Notes |
|---|---|---|
| Speech-to-Text (STT) | Google Cloud Speech — Chirp3 HD | Streaming + batch modes, 48kHz WebM/Opus |
| Text-to-Speech (TTS) | Gemini 2.5 Flash TTS (Vertex AI) | Raw PCM L16 24kHz mono, sentence-level streaming, emotion tags |
| LLM (all agents) | Vertex AI fine-tuned endpoints | Gemini 2.0 Flash with LoRA adapters (rank 8) |
| Kafka | Google Managed Kafka (SASL_SSL) | 3 brokers, 100 partitions per audio topic |
| Redis | Single node | Session state + WebSocket pod routing registry |
| Video Storage | Google Cloud Storage | WebM video chunks, per-session recordings |
Auto-Scaling (HPA)
All Python agent services scale horizontally. The Interviewer uses memory-based HPA because the workload is IO-bound — CPU stays at 3–5% even under full load while memory grows linearly with active audio sessions (~40MB per active session).
| Service | Min Pods | Max Pods | Scale Trigger | Threshold |
|---|---|---|---|---|
| Interviewer (Maya) | 2 | 20 | Memory utilization | 70% of 1536Mi request (~1075Mi) |
| Planner | 2 | 8 | CPU utilization | 60% |
| Assessor | 2 | 8 | CPU utilization | 60% |
The HPA scales out when active audio sessions per pod exceed ~27, which is the ~1075Mi threshold divided by ~40MB per session. Each audio session holds a Google STT streaming context, audio buffers, and a growing LLM conversation history. At 20 pods (HPA max), the system supports approximately 540–700 simultaneous live audio interviews.
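The capacity figures above follow from simple arithmetic on numbers already given. The sketch below reproduces them. The constant and function names are illustrative, and it makes two simplifying assumptions: Mi and MB are treated as interchangeable, and each pod's baseline memory is ignored.

```typescript
// Figures taken from the HPA table and the surrounding text.
const MEMORY_REQUEST_MI = 1536;     // per-pod memory request
const HPA_TARGET_UTILIZATION = 0.7; // 70% memory utilization
const MB_PER_SESSION = 40;          // approximate memory per active audio session
const MAX_PODS = 20;                // HPA max for the Interviewer

// Memory level at which the HPA starts adding pods.
function scaleOutThresholdMi(requestMi: number, utilization: number): number {
  return Math.floor(requestMi * utilization);
}

// Sessions a pod can hold before crossing the threshold
// (simplified: baseline memory is not subtracted).
function sessionsPerPodAtThreshold(thresholdMi: number, mbPerSession: number): number {
  return Math.round(thresholdMi / mbPerSession);
}

const thresholdMi = scaleOutThresholdMi(MEMORY_REQUEST_MI, HPA_TARGET_UTILIZATION);
const sessionsPerPod = sessionsPerPodAtThreshold(thresholdMi, MB_PER_SESSION);
const lowerBoundCapacity = sessionsPerPod * MAX_PODS;

console.log({ thresholdMi, sessionsPerPod, lowerBoundCapacity });
```

This yields a 1075Mi threshold, 27 sessions per pod, and a capacity of 540 at 20 pods, matching the lower bound quoted above. How the 700 upper bound is derived is not stated here.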
Multi-tenancy
Every database query is filtered by tenantId. Row-Level Security policies at the PostgreSQL level enforce isolation even if application code has a bug.
// Every controller extracts tenantId from the JWT
@Get()
async findAll(@TenantId() tenantId: string) {
  return this.interviewService.findAll(tenantId);
}

// Every service filters by tenantId
async findAll(tenantId: string) {
  return this.prisma.interview.findMany({
    where: { tenantId }, // Required: never omit
  });
}
Never query the database without a tenantId filter. All services enforce this via the @TenantId() decorator and RLS policies.
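One way to make the tenantId rule harder to break is to build every query filter through a helper that refuses an empty tenant. The sketch below is framework-free and purely illustrative: `tenantScopedWhere` is a hypothetical name, not part of the platform's codebase or of Prisma.

```typescript
// Hypothetical helper: builds a query filter that always carries tenantId.
function tenantScopedWhere<T extends Record<string, unknown>>(
  tenantId: string | undefined,
  filters?: T,
): T & { tenantId: string } {
  if (tenantId === undefined || tenantId.trim() === "") {
    throw new Error("tenantId is required for every database query");
  }
  // tenantId is spread last so caller-supplied filters cannot override it.
  return { ...(filters ?? ({} as T)), tenantId };
}

// Usage: the result could be passed as the `where` argument of a query.
const where = tenantScopedWhere("tenant-a", { status: "PENDING" });
console.log(where);
```

A helper like this only guards application code. The Row-Level Security policies remain the backstop if a query bypasses it.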
Kafka Event Topics
| Topic | Published By | Consumed By | Trigger |
|---|---|---|---|
| interview.info_needed | API Gateway | Notification Service | Data validation fails — HITL required |
| skill.validation.requested | API Gateway | Agno Planner (7777) | Data complete or info-needed resolved |
| interview.plan.created | Agno Planner (7777) | API Gateway | Plan generation complete |
| interview.approved | API Gateway | Agno Interviewer (7778) | Plan approved — create session |
| audio.candidate.spoken | Vert.x Edge (8080) | Agno Interviewer (7778) | Candidate mic audio — 100 partitions, key=sessionId |
| audio.agent.spoken | Agno Interviewer (7778) | Vert.x Edge (8080) | Agent TTS audio (PCM L16 24kHz) — 100 partitions, key=sessionId |
| audio.candidate.transcribed | Agno Interviewer (7778) | Vert.x Edge (8080) | Live interim transcripts for real-time display |
| video.candidate.stream | Vert.x Edge (8080) | VideoStorageConsumerService | Combined WebM chunks (30s), uploaded to GCS |
| interview.completed | Agno Interviewer (7778) | API Gateway + Agno Assessor (7779) | Interview session ends |
| interview.assessment.ready | Agno Assessor (7779) | API Gateway | Assessment generated — HITL gate |
| interview.assessment.completed | Agno Assessor (7779) | API Gateway | Recruiter approved verdict — final webhook |
Audio topics use sessionId as the Kafka message key. This guarantees all messages for one session land on the same partition and are processed in order by the same consumer thread — streaming=start always before streaming=chunk before streaming=end before the batch audio message.
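To illustrate the ordering that keying by sessionId is meant to preserve, the sketch below checks that one session's messages arrive as start, then chunks, then end, then the batch message. The `AudioEventKind` values and the `SessionOrderChecker` class are assumptions made for illustration, since the real message schema is not shown here. It also assumes the batch message closes a turn so that the next turn can begin with a new start.

```typescript
// Assumed event kinds, derived from the streaming flags named in the text.
type AudioEventKind = "start" | "chunk" | "end" | "batch";
type SessionState = "idle" | "streaming" | "ended";

// Allowed transitions: start, zero or more chunks, end, batch, then back to idle.
const TRANSITIONS: Record<SessionState, Partial<Record<AudioEventKind, SessionState>>> = {
  idle: { start: "streaming" },
  streaming: { chunk: "streaming", end: "ended" },
  ended: { batch: "idle" },
};

class SessionOrderChecker {
  private readonly states = new Map<string, SessionState>();

  // Returns true if the event is in order for its session, false otherwise.
  accept(sessionId: string, kind: AudioEventKind): boolean {
    const current = this.states.get(sessionId) ?? "idle";
    const next = TRANSITIONS[current][kind];
    if (next === undefined) {
      return false; // out-of-order event; the session state is left unchanged
    }
    this.states.set(sessionId, next);
    return true;
  }
}

// Usage: sessions are tracked independently, mirroring per-key ordering.
const checker = new SessionOrderChecker();
console.log(checker.accept("session-1", "start"));
console.log(checker.accept("session-2", "chunk"));
```

If the same session's messages were spread across partitions, a consumer could see a chunk before its start, which is the case the second call models.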
Service Ports
| Service | Port | Protocol | Health Endpoint |
|---|---|---|---|
| API Gateway (NestJS) | 3009 | HTTP/REST + Kafka | GET /api/v1/health/live |
| Agno Planner Agent | 7777 | HTTP + Kafka | GET /health/live |
| Agno Interviewer Agent — Maya | 7778 | HTTP + Kafka | GET /health/live |
| Agno Assessor Agent | 7779 | HTTP + Kafka | GET /health/live |
| WebSocket Edge (Vert.x) | 8080 | WebSocket + Kafka + Redis | GET /health |
| PostgreSQL | 5432 | TCP | — |
| Kafka | 9092 | SASL_SSL TCP | — |
| Redis | 6379 | TCP | — |
Interview Data Flow
1. Plan Generation
External System
  │
  ▼
POST /api/v1/integration/interviews (API Gateway)
  │
  ├── Validate data completeness (CRITICAL / HIGH / MEDIUM fields)
  ├── Save to PostgreSQL (state: RECEIVED → VALIDATING_SKILLS)
  └── Publish: skill.validation.requested (Kafka)
        │
        ▼
Agno Planner Agent
  │
  ├── Validate skills via pgvector similarity search
  ├── Generate interview plan (Vertex AI Gemini 2.0 Flash fine-tuned)
  └── Publish: interview.plan.created (Kafka)
        │
        ▼
API Gateway → state: PENDING
  └── Webhook: interview.plan_generated → callbackUrl
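The first step of the flow above, the data completeness check, can be sketched as a small severity-based validator that decides which Kafka topic to publish to. The field names (`jobTitle`, `requiredSkills`, `seniority`, `department`) and the rule that only missing CRITICAL fields trigger `interview.info_needed` are assumptions for illustration. The actual field list and thresholds are not given here.

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM";

interface FieldRule {
  name: string; // hypothetical field name
  severity: Severity;
}

// Hypothetical rules; the real field list is not part of this document.
const RULES: FieldRule[] = [
  { name: "jobTitle", severity: "CRITICAL" },
  { name: "requiredSkills", severity: "CRITICAL" },
  { name: "seniority", severity: "HIGH" },
  { name: "department", severity: "MEDIUM" },
];

// A value counts as missing if it is absent, a blank string, or an empty array.
function isMissing(value: unknown): boolean {
  if (value === undefined || value === null) return true;
  if (typeof value === "string") return value.trim() === "";
  if (Array.isArray(value)) return value.length === 0;
  return false;
}

// Groups the names of missing fields by severity.
function findMissing(
  payload: Record<string, unknown>,
  rules: FieldRule[],
): Record<Severity, string[]> {
  const missing: Record<Severity, string[]> = { CRITICAL: [], HIGH: [], MEDIUM: [] };
  for (const rule of rules) {
    if (isMissing(payload[rule.name])) missing[rule.severity].push(rule.name);
  }
  return missing;
}

// Assumption: only missing CRITICAL fields block the flow and require HITL input.
function nextTopic(
  missing: Record<Severity, string[]>,
): "interview.info_needed" | "skill.validation.requested" {
  return missing.CRITICAL.length > 0 ? "interview.info_needed" : "skill.validation.requested";
}

const missing = findMissing({ jobTitle: "Backend Engineer", requiredSkills: ["TypeScript"] }, RULES);
console.log(nextTopic(missing), missing);
```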
2. Live Interview — Audio Pipeline
Candidate Browser
  │
  ├── WebSocket connect: wss://mayaedge.teamcast.ai/ws?sessionId=...&tenantId=...
  │     Vert.x registers sessionId in Redis (pod routing registry)
  │
  ├── Audio frames (JSON): { type:"AUDIO", sessionId, streaming:"start|chunk|end" }
  │     Vert.x → Kafka: audio.candidate.spoken (key=sessionId, 100 partitions)
  │
  └── Audio frames (batch): { type:"AUDIO", sessionId, data:"<base64 LINEAR16 PCM>" }
        Vert.x → Kafka: audio.candidate.spoken
        │
        ▼
Agno Interviewer Agent
  4 consumer threads × 2 pods = 8 workers
  Per-session ThreadPoolExecutor queue (FIFO)
  │
  ├── streaming=start → open Google STT streaming session (48kHz Chirp3 HD)
  ├── streaming=chunk → feed bytes into STT (live interim transcripts)
  ├── streaming=end → close STT, cache final transcript
  └── batch chunk → debounce → batch STT → LLM → TTS (per sentence)
        │
        ├── Google STT        200–500ms
        ├── Vertex AI LLM     800–2000ms
        └── Gemini Flash TTS  100–300ms
        │
        ▼
Publish: audio.agent.spoken (AGENT_RESPONSE + base64 PCM L16 24kHz)
  │
  ▼
Vert.x reads Kafka → Redis lookup → delivers to WebSocket
  └── Browser decodes PCM Int16→Float32 via AudioContext
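The last step of the pipeline above, decoding 16-bit PCM into Float32 samples, can be sketched as follows. The function name `pcm16ToFloat32` is hypothetical, and little-endian byte order is an assumption: the byte order of the agent audio is not stated here, so the flag is exposed as a parameter. Base64 decoding of the payload is left out.

```typescript
// Converts raw signed 16-bit PCM bytes into Float32 samples in the range [-1, 1).
// Assumption: little-endian samples; pass littleEndian=false for big-endian data.
function pcm16ToFloat32(bytes: Uint8Array, littleEndian: boolean = true): Float32Array {
  const sampleCount = Math.floor(bytes.byteLength / 2); // a trailing odd byte is ignored
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // Dividing by 32768 maps the Int16 range [-32768, 32767] onto [-1, 1).
    out[i] = view.getInt16(i * 2, littleEndian) / 32768;
  }
  return out;
}

// Example: three samples (zero, maximum positive, minimum negative), little-endian.
const samples = pcm16ToFloat32(new Uint8Array([0x00, 0x00, 0xff, 0x7f, 0x00, 0x80]));
console.log(Array.from(samples));
```

In a browser, the resulting array could then be copied into a mono `AudioBuffer` created at a 24000 Hz sample rate and played through an `AudioContext`.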
Benchmarked Performance
Measured on the production GKE cluster (2 interviewer pods, 3x n2-standard-8 nodes) using the Locust HTTP benchmark and a WebSocket audio pipeline benchmark with real candidate audio (60s, 16kHz LINEAR16 PCM).
HTTP Layer — Session Creation
| Concurrent Users | Throughput | p50 | p95 | Error Rate |
|---|---|---|---|---|
| 20 users | 11.4 req/s | 380ms | 970ms | 0% |
| 100 users | 54.1 req/s | 400ms | 1400ms | 0% |
| 200 users (mixed) | 117.8 req/s | 880ms | 2400ms | 0% |
WebSocket Audio Pipeline — Live Sessions
| Concurrent Sessions | Success Rate | WS Connect (median) | Greeting Latency (median) | Greeting (p95) |
|---|---|---|---|---|
| 1 | 100% | 602ms | 2901ms | — |
| 10 | 100% | 634ms | 3529ms | 4110ms |
| 20 | 90% | 598ms | 4462ms | 5311ms |
Greeting latency is the full path: LLM generates welcome text → Gemini Flash TTS synthesizes PCM → Kafka → Vert.x → WebSocket → browser receives first audio. At 10 concurrent sessions the system delivers 100% success with a sub-4s median greeting latency (3529ms median, 4110ms at p95). At 20 concurrent sessions the success rate drops to 90%.
Per-Turn Audio Round-Trip (Production Observed)
| Stage | Typical | Notes |
|---|---|---|
| Browser → Vert.x | 20–50ms | GCP network |
| Vert.x → Kafka | 5–15ms | SASL_SSL managed Kafka |
| Kafka poll | 10–30ms | 100ms poll interval |
| Google STT | 200–500ms | Chirp3 HD, streaming mode |
| Vertex AI LLM | 800–2000ms | Fine-tuned endpoint |
| Gemini Flash TTS | 100–300ms | Gemini 2.5 Flash TTS, per sentence, PCM L16 24kHz |
| Kafka → Vert.x → browser | 20–50ms | Return path |
| Total per turn | 1.4s – 3.2s | End-to-end |
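Summing the itemized stage ranges gives a quick sanity check on the per-turn budget. The sketch below copies the typical ranges from the table. The type and function names are illustrative.

```typescript
interface StageLatency {
  stage: string;
  minMs: number;
  maxMs: number;
}

// Typical per-stage ranges copied from the table above (the total row is excluded).
const STAGES: StageLatency[] = [
  { stage: "Browser → Vert.x", minMs: 20, maxMs: 50 },
  { stage: "Vert.x → Kafka", minMs: 5, maxMs: 15 },
  { stage: "Kafka poll", minMs: 10, maxMs: 30 },
  { stage: "Google STT", minMs: 200, maxMs: 500 },
  { stage: "Vertex AI LLM", minMs: 800, maxMs: 2000 },
  { stage: "Gemini Flash TTS", minMs: 100, maxMs: 300 },
  { stage: "Kafka → Vert.x → browser", minMs: 20, maxMs: 50 },
];

// Adds up the lower and upper bounds of all stages.
function totalRange(stages: StageLatency[]): { minMs: number; maxMs: number } {
  return stages.reduce(
    (acc, s) => ({ minMs: acc.minMs + s.minMs, maxMs: acc.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 },
  );
}

// Fraction of the summed upper bound taken by one stage.
function shareOfMax(stages: StageLatency[], stageName: string): number {
  const total = totalRange(stages).maxMs;
  const stage = stages.filter((s) => s.stage === stageName)[0];
  return stage === undefined ? 0 : stage.maxMs / total;
}

const total = totalRange(STAGES);
console.log(total, shareOfMax(STAGES, "Vertex AI LLM").toFixed(2));
```

The itemized stages add up to 1155–2945ms, somewhat below the 1.4s–3.2s total reported in the table. One reading is that the observed total includes overhead that is not itemized, though that is an inference rather than something stated here. The LLM call accounts for roughly two thirds of the summed upper bound.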
Permissions
| Permission | Roles | Description |
|---|---|---|
| interview:create | ADMIN, RECRUITER | Create new interviews via Integration API or admin UI |
| interview:read | ADMIN, RECRUITER, VIEWER | Read interview details and status |
| interview:update | ADMIN, RECRUITER | Update interview data (HITL completion) |
| interview:approve | ADMIN, RECRUITER | Approve/reject plans and assessments |
| interview:delete | ADMIN | Delete interview records |
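The permissions table maps directly onto a lookup structure. The sketch below encodes it as data with a single check function. The names `PERMISSION_ROLES` and `hasPermission` are illustrative and do not describe how the platform enforces permissions.

```typescript
type Role = "ADMIN" | "RECRUITER" | "VIEWER";
type Permission =
  | "interview:create"
  | "interview:read"
  | "interview:update"
  | "interview:approve"
  | "interview:delete";

// Role assignments copied from the permissions table above.
const PERMISSION_ROLES: Record<Permission, readonly Role[]> = {
  "interview:create": ["ADMIN", "RECRUITER"],
  "interview:read": ["ADMIN", "RECRUITER", "VIEWER"],
  "interview:update": ["ADMIN", "RECRUITER"],
  "interview:approve": ["ADMIN", "RECRUITER"],
  "interview:delete": ["ADMIN"],
};

// Returns true if the given role is listed for the permission.
function hasPermission(role: Role, permission: Permission): boolean {
  return PERMISSION_ROLES[permission].indexOf(role) !== -1;
}

console.log(hasPermission("RECRUITER", "interview:approve"));
console.log(hasPermission("VIEWER", "interview:delete"));
```

Keeping the mapping as typed data means that adding a permission without assigning roles fails at compile time, because `Record<Permission, ...>` requires every key.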