Architecture

System Overview

Multi-tenant AI interview platform architecture with event-driven microservices running on GKE.

The AI Interview System is an event-driven, multi-tenant platform deployed on Google Kubernetes Engine. At its core is a NestJS API gateway that orchestrates three subsystems: the Agno Python agents for AI reasoning, the Vert.x WebSocket edge for real-time interview audio, and the HITL workflow engine for human oversight. All services communicate asynchronously via Google Managed Kafka.

Architecture Layers

| Layer | Technology | Responsibility |
| --- | --- | --- |
| API Gateway | NestJS 10 + TypeScript | REST Integration API, JWT/API-key auth, multi-tenant routing, webhook delivery |
| Agno Planner Agent | Python + Agno + Vertex AI | Skill validation via pgvector, interview plan generation (Gemini 2.0 Flash fine-tuned) |
| Agno Interviewer Agent (Maya) | Python + Agno + Vertex AI | Real-time interview conduction, Google STT, LLM response, Google TTS synthesis |
| Agno Assessor Agent | Python + Agno + Vertex AI | Post-interview transcript evaluation, structured hiring recommendation |
| WebSocket Edge | Vert.x 4.x (JVM) | Real-time audio streaming, 50K+ WebSocket connections per node, Redis session routing |
| Database | PostgreSQL 16 + pgvector | Interview data, workflow states, skill embeddings with Row-Level Security |
| Event Bus | Google Managed Kafka (SASL_SSL) | Async event streaming, 100 partitions per audio topic |
| Cache / Session Store | Redis 7 | Interview session state, per-session audio buffers, WebSocket pod routing registry |

Production Infrastructure

The platform runs on GKE cluster teamcast-ai-clust in Google Cloud us-central1.

Node Pools

| Node Pool | Nodes | vCPU | RAM | Hosts |
| --- | --- | --- | --- | --- |
| standard-8-pool | 3x n2-standard-8 | 8 each (24 total) | 32GB each (96GB total) | All services (current POC) |

External Services

| Service | Provider | Notes |
| --- | --- | --- |
| Speech-to-Text (STT) | Google Cloud Speech (Chirp3 HD) | Streaming + batch modes, 48kHz WebM/Opus |
| Text-to-Speech (TTS) | Gemini 2.5 Flash TTS (Vertex AI) | Raw PCM L16 24kHz mono, sentence-level streaming, emotion tags |
| LLM (all agents) | Vertex AI fine-tuned endpoints | Gemini 2.0 Flash with LoRA adapters (rank 8) |
| Kafka | Google Managed Kafka (SASL_SSL) | 3 brokers, 100 partitions per audio topic |
| Redis | Single node | Session state + WebSocket pod routing registry |
| Video Storage | Google Cloud Storage | WebM video chunks, per-session recordings |

Auto-Scaling (HPA)

All Python agent services scale horizontally. The Interviewer uses memory-based HPA because the workload is IO-bound — CPU stays at 3–5% even under full load while memory grows linearly with active audio sessions (~40MB per active session).

| Service | Min Pods | Max Pods | Scale Trigger | Threshold |
| --- | --- | --- | --- | --- |
| Interviewer (Maya) | 2 | 20 | Memory utilization | 70% of 1536Mi request (~1075Mi) |
| Planner | 2 | 8 | CPU utilization | 60% |
| Assessor | 2 | 8 | CPU utilization | 60% |

The Interviewer HPA starts adding pods once the active audio sessions per pod exceed ~27, which is roughly where ~40MB per session reaches the ~1075Mi threshold. Each audio session holds a Google STT streaming context, audio buffers, and a growing LLM conversation history. At 20 pods (HPA max), the system supports approximately 540–700 simultaneous live audio interviews.
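
The capacity figures above follow from simple arithmetic. The sketch below reproduces them and adds the standard Kubernetes HPA replica rule (without its tolerance band). All constant and function names are illustrative rather than taken from the platform's code, and the calculation ignores each pod's baseline memory and treats MB and Mi as interchangeable, as the rounded figures in the text appear to.

```typescript
// Figures taken from the text; names are illustrative only.
const MEMORY_REQUEST_MI = 1536;     // Interviewer pod memory request
const HPA_TARGET_UTILIZATION = 0.7; // 70% memory utilization target
const MB_PER_SESSION = 40;          // approximate memory per active audio session
const MAX_PODS = 20;                // HPA maximum for the Interviewer

// Memory level (Mi) at which the HPA starts adding pods.
function scaleOutThresholdMi(requestMi: number, target: number): number {
  return requestMi * target;
}

// Approximate number of sessions a pod holds when it reaches the threshold.
function sessionsPerPod(thresholdMi: number, mbPerSession: number): number {
  return Math.round(thresholdMi / mbPerSession);
}

// Kubernetes HPA rule: desired = ceil(current * currentMetric / targetMetric),
// clamped to the configured min/max. The tolerance band is omitted here.
function desiredReplicas(
  current: number,
  currentUtil: number,
  targetUtil: number,
  min: number,
  max: number,
): number {
  const raw = Math.ceil(current * (currentUtil / targetUtil));
  return Math.min(max, Math.max(min, raw));
}

const thresholdMi = scaleOutThresholdMi(MEMORY_REQUEST_MI, HPA_TARGET_UTILIZATION);
const perPod = sessionsPerPod(thresholdMi, MB_PER_SESSION);
const capacityAtMaxPods = perPod * MAX_PODS;
```

With these inputs the threshold is about 1075Mi, which gives about 27 sessions per pod and 540 sessions at 20 pods, the lower end of the 540–700 range quoted above.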

Multi-tenancy

Every database query is filtered by tenantId. Row-Level Security policies at the PostgreSQL level enforce isolation even if application code has a bug.

```typescript
// Controller: every handler extracts tenantId from the JWT
@Get()
async findAll(@TenantId() tenantId: string) {
  return this.interviewService.findAll(tenantId);
}

// Service: every query filters by tenantId
async findAll(tenantId: string) {
  return this.prisma.interview.findMany({
    where: { tenantId },  // Required: never omit!
  });
}
```

Never query the database without a tenantId filter. All services enforce this via the @TenantId() decorator and RLS policies.
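
One way to make this rule harder to break is to build every query filter through a small helper that refuses a missing tenant. The sketch below is illustrative only: `withTenant` and `MissingTenantError` are hypothetical names, not part of the platform's codebase, and such a helper would complement the RLS policies rather than replace them.

```typescript
// Hypothetical guard: every `where` filter passes through here.
class MissingTenantError extends Error {
  constructor() {
    super('tenantId is required for every database query');
    this.name = 'MissingTenantError';
  }
}

function withTenant<T extends Record<string, unknown>>(
  tenantId: string | undefined | null,
  where?: T,
): T & { tenantId: string } {
  if (typeof tenantId !== 'string' || tenantId.trim() === '') {
    throw new MissingTenantError();
  }
  // tenantId is spread last so a caller-supplied filter cannot override it.
  return { ...(where ?? ({} as T)), tenantId };
}
```

A service method could then pass `withTenant(tenantId, { status: 'PENDING' })` as the `where` argument; the `status` field here is assumed purely for illustration.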

Kafka Event Topics

| Topic | Published By | Consumed By | Trigger |
| --- | --- | --- | --- |
| interview.info_needed | API Gateway | Notification Service | Data validation fails, HITL required |
| skill.validation.requested | API Gateway | Agno Planner (7777) | Data complete or info-needed resolved |
| interview.plan.created | Agno Planner (7777) | API Gateway | Plan generation complete |
| interview.approved | API Gateway | Agno Interviewer (7778) | Plan approved, create session |
| audio.candidate.spoken | Vert.x Edge (8080) | Agno Interviewer (7778) | Candidate mic audio; 100 partitions, key=sessionId |
| audio.agent.spoken | Agno Interviewer (7778) | Vert.x Edge (8080) | Agent TTS audio (PCM L16 24kHz); 100 partitions, key=sessionId |
| audio.candidate.transcribed | Agno Interviewer (7778) | Vert.x Edge (8080) | Live interim transcripts for real-time display |
| video.candidate.stream | Vert.x Edge | VideoStorageConsumerService | Combined WebM chunks (30s), uploaded to GCS |
| interview.completed | Agno Interviewer (7778) | API Gateway + Assessor (7779) | Interview session ends |
| interview.assessment.ready | Agno Assessor (7779) | API Gateway | Assessment generated, HITL gate |
| interview.assessment.completed | Agno Assessor (7779) | API Gateway | Recruiter approved verdict, final webhook |

Audio topics use sessionId as the Kafka message key. All messages for one session therefore land on the same partition and are processed in order by the same consumer thread: streaming=start always arrives before streaming=chunk, which arrives before streaming=end, which arrives before the batch audio message.
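
The ordering guarantee rests on a deterministic mapping from message key to partition. The sketch below illustrates the idea with a simple multiply-by-31 string hash. That hash is a stand-in chosen for brevity, not the one Kafka clients use (the default Java partitioner hashes keys with murmur2), so the partition numbers it produces will not match a real cluster. `AudioFrame`, `partitionFor`, and `routeFrames` are hypothetical names.

```typescript
const AUDIO_TOPIC_PARTITIONS = 100; // partition count from the topic table

interface AudioFrame {
  sessionId: string;
  streaming: 'start' | 'chunk' | 'end';
}

// Illustrative unsigned 32-bit string hash (NOT Kafka's murmur2).
function simpleHash(key: string): number {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// The same sessionId always maps to the same partition.
function partitionFor(sessionId: string, partitions: number = AUDIO_TOPIC_PARTITIONS): number {
  return simpleHash(sessionId) % partitions;
}

// Appends each frame to its partition's log, preserving arrival order
// within a partition, which is the property the audio pipeline relies on.
function routeFrames(frames: AudioFrame[]): Record<number, AudioFrame[]> {
  const logs: Record<number, AudioFrame[]> = {};
  for (const frame of frames) {
    const p = partitionFor(frame.sessionId);
    if (!logs[p]) logs[p] = [];
    logs[p].push(frame);
  }
  return logs;
}
```

Because every frame of a session shares one key, interleaving frames from many sessions does not disturb the start, chunk, end order seen by whichever consumer owns that session's partition.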

Service Ports

| Service | Port | Protocol | Health Endpoint |
| --- | --- | --- | --- |
| API Gateway (NestJS) | 3009 | HTTP/REST + Kafka | GET /api/v1/health/live |
| Agno Planner Agent | 7777 | HTTP + Kafka | GET /health/live |
| Agno Interviewer Agent (Maya) | 7778 | HTTP + Kafka | GET /health/live |
| Agno Assessor Agent | 7779 | HTTP + Kafka | GET /health/live |
| WebSocket Edge (Vert.x) | 8080 | WebSocket + Kafka + Redis | GET /health |
| PostgreSQL | 5432 | TCP | |
| Kafka | 9092 | SASL_SSL TCP | |
| Redis | 6379 | TCP | |

Interview Data Flow

1. Plan Generation

```text
External System
    │
    ▼
POST /api/v1/integration/interviews   (API Gateway)
    │
    ├── Validate data completeness (CRITICAL / HIGH / MEDIUM fields)
    ├── Save to PostgreSQL  (state: RECEIVED → VALIDATING_SKILLS)
    └── Publish: skill.validation.requested  (Kafka)
              │
              ▼
        Agno Planner Agent
              │
              ├── Validate skills via pgvector similarity search
              ├── Generate interview plan  (Vertex AI Gemini 2.0 Flash fine-tuned)
              └── Publish: interview.plan.created  (Kafka)
                          │
                          ▼
                  API Gateway → state: PENDING
                          └── Webhook: interview.plan_generated → callbackUrl
```
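
The state changes in this flow (RECEIVED → VALIDATING_SKILLS → PENDING) can be guarded with a small transition table. The following is a minimal sketch, not the gateway's actual workflow engine: the type and function names are hypothetical, and PENDING is treated as terminal only because this section does not name the states that follow it.

```typescript
// States named in the plan-generation flow; later states are not covered here.
type InterviewState = 'RECEIVED' | 'VALIDATING_SKILLS' | 'PENDING';

// Allowed next states for each state, per the flow diagram.
const ALLOWED_TRANSITIONS: Record<InterviewState, InterviewState[]> = {
  RECEIVED: ['VALIDATING_SKILLS'],  // after the interview is saved
  VALIDATING_SKILLS: ['PENDING'],   // after interview.plan.created is consumed
  PENDING: [],                      // follow-on states are outside this sketch
};

function canTransition(from: InterviewState, to: InterviewState): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new state, or throws if the move skips or reverses a step.
function applyTransition(current: InterviewState, next: InterviewState): InterviewState {
  if (!canTransition(current, next)) {
    throw new Error('Invalid state transition: ' + current + ' -> ' + next);
  }
  return next;
}
```

Rejecting out-of-order moves at this level means a duplicated or late Kafka event cannot push an interview backwards in the workflow.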

2. Live Interview — Audio Pipeline

```text
Candidate Browser
    │
    ├── WebSocket connect: wss://mayaedge.teamcast.ai/ws?sessionId=...&tenantId=...
    │     Vert.x registers sessionId in Redis (pod routing registry)
    │
    ├── Audio frames (JSON): { type:"AUDIO", sessionId, streaming:"start|chunk|end" }
    │     Vert.x → Kafka: audio.candidate.spoken  (key=sessionId, 100 partitions)
    │
    └── Audio frames (batch): { type:"AUDIO", sessionId, data:"<base64 LINEAR16 PCM>" }
              Vert.x → Kafka: audio.candidate.spoken
                    │
                    ▼
              Agno Interviewer Agent
              4 consumer threads × 2 pods = 8 workers
              Per-session ThreadPoolExecutor queue (FIFO)
                    │
                    ├── streaming=start  → open Google STT streaming session (48kHz Chirp3 HD)
                    ├── streaming=chunk  → feed bytes into STT (live interim transcripts)
                    ├── streaming=end    → close STT, cache final transcript
                    └── batch chunk      → debounce → batch STT → LLM → TTS (per sentence)
                                │
                                ├── Google STT          200–500ms
                                ├── Vertex AI LLM       800–2000ms
                                └── Gemini Flash TTS    100–300ms
                                          │
                                          ▼
                              Publish: audio.agent.spoken  (AGENT_RESPONSE + base64 PCM L16 24kHz)
                                          │
                                          ▼
                              Vert.x reads Kafka → Redis lookup → delivers to WebSocket
                                          └── Browser decodes PCM Int16→Float32 via AudioContext
```
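
The last step of this pipeline, decoding PCM Int16 to Float32 in the browser, is a per-sample scaling by 1/32768. A minimal sketch follows. The function name is hypothetical, and little-endian byte order is an assumption (exposed as a parameter) because this section does not state the byte order of the PCM payload.

```typescript
// Converts raw 16-bit signed PCM bytes into Float32 samples in [-1, 1).
// `littleEndian` defaults to true; that default is an assumption, not a documented fact.
function pcm16ToFloat32(bytes: Uint8Array, littleEndian: boolean = true): Float32Array {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const sampleCount = Math.floor(bytes.byteLength / 2); // a trailing odd byte is ignored
  const samples = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // Int16 range is -32768..32767, so dividing by 32768 maps it to [-1, 1).
    samples[i] = view.getInt16(i * 2, littleEndian) / 32768;
  }
  return samples;
}
```

In a browser, the resulting samples could be copied into a mono `AudioBuffer` created with a 24000 Hz sample rate (matching the PCM L16 24kHz format above) and played through an `AudioBufferSourceNode`.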

Benchmarked Performance

Measured on the production GKE cluster (2 interviewer pods, 3x n2-standard-8 nodes) using the Locust HTTP benchmark and a WebSocket audio pipeline benchmark with real candidate audio (60s, 16kHz LINEAR16 PCM).

HTTP Layer — Session Creation

| Concurrent Users | Throughput | p50 | p95 | Error Rate |
| --- | --- | --- | --- | --- |
| 20 users | 11.4 req/s | 380ms | 970ms | 0% |
| 100 users | 54.1 req/s | 400ms | 1400ms | 0% |
| 200 users (mixed) | 117.8 req/s | 880ms | 2400ms | 0% |

WebSocket Audio Pipeline — Live Sessions

| Concurrent Sessions | Success Rate | WS Connect (median) | Greeting Latency (median) | Greeting (p95) |
| --- | --- | --- | --- | --- |
| 1 | 100% | 602ms | 2901ms | |
| 10 | 100% | 634ms | 3529ms | 4110ms |
| 20 | 90% | 598ms | 4462ms | 5311ms |

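Greeting latency covers the full path: LLM generates welcome text → Gemini Flash TTS synthesizes PCM → Kafka → Vert.x → WebSocket → browser receives first audio. At 10 concurrent sessions the system delivers 100% success with a median greeting latency under 4s (3529ms) and a p95 of 4110ms. At 20 concurrent sessions the success rate drops to 90% and the median greeting latency rises to 4462ms.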

Per-Turn Audio Round-Trip (Production Observed)

| Stage | Typical | Notes |
| --- | --- | --- |
| Browser → Vert.x | 20–50ms | GCP network |
| Vert.x → Kafka | 5–15ms | SASL_SSL managed Kafka |
| Kafka poll | 10–30ms | 100ms poll interval |
| Google STT | 200–500ms | Chirp3 HD, streaming mode |
| Vertex AI LLM | 800–2000ms | Fine-tuned endpoint |
| Gemini Flash TTS | 100–300ms | Gemini 2.5 Flash TTS, per sentence, PCM L16 24kHz |
| Kafka → Vert.x → browser | 20–50ms | Return path |
| Total per turn | 1.4s – 3.2s | End-to-end |
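
As a cross-check on the per-turn budget, the sketch below sums the per-stage ranges from the table. The itemized stages add up to roughly 1.16s to 2.95s, somewhat below the observed 1.4s – 3.2s total. The gap presumably reflects overhead that is not itemized (for example debouncing or serialization), but that is an assumption, not something this section states. Type and function names are illustrative.

```typescript
interface StageLatency {
  stage: string;
  minMs: number;
  maxMs: number;
}

// Typical per-stage ranges, copied from the table above.
const TURN_STAGES: StageLatency[] = [
  { stage: 'Browser → Vert.x', minMs: 20, maxMs: 50 },
  { stage: 'Vert.x → Kafka', minMs: 5, maxMs: 15 },
  { stage: 'Kafka poll', minMs: 10, maxMs: 30 },
  { stage: 'Google STT', minMs: 200, maxMs: 500 },
  { stage: 'Vertex AI LLM', minMs: 800, maxMs: 2000 },
  { stage: 'Gemini Flash TTS', minMs: 100, maxMs: 300 },
  { stage: 'Kafka → Vert.x → browser', minMs: 20, maxMs: 50 },
];

// Sums the best-case and worst-case latency across all stages.
function sumRange(stages: StageLatency[]): { minMs: number; maxMs: number } {
  let minMs = 0;
  let maxMs = 0;
  for (const s of stages) {
    minMs += s.minMs;
    maxMs += s.maxMs;
  }
  return { minMs, maxMs };
}

// Finds the stage with the largest worst-case latency.
function dominantStage(stages: StageLatency[]): StageLatency {
  let worst = stages[0];
  for (const s of stages) {
    if (s.maxMs > worst.maxMs) worst = s;
  }
  return worst;
}

const turnTotal = sumRange(TURN_STAGES);      // { minMs: 1155, maxMs: 2945 }
const slowest = dominantStage(TURN_STAGES);   // the Vertex AI LLM stage
```

The LLM call accounts for roughly two thirds of the itemized budget at both ends of the range, so it is the stage where latency work is likely to pay off most.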

Permissions

| Permission | Roles | Description |
| --- | --- | --- |
| interview:create | ADMIN, RECRUITER | Create new interviews via Integration API or admin UI |
| interview:read | ADMIN, RECRUITER, VIEWER | Read interview details and status |
| interview:update | ADMIN, RECRUITER | Update interview data (HITL completion) |
| interview:approve | ADMIN, RECRUITER | Approve/reject plans and assessments |
| interview:delete | ADMIN | Delete interview records |
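
The permission table maps directly onto a lookup structure. The sketch below shows one way to encode and check it. The `PERMISSION_ROLES` map and `can` function are hypothetical names, and the platform's real authorization guard may be structured differently.

```typescript
type Role = 'ADMIN' | 'RECRUITER' | 'VIEWER';

type Permission =
  | 'interview:create'
  | 'interview:read'
  | 'interview:update'
  | 'interview:approve'
  | 'interview:delete';

// Roles allowed for each permission, copied from the table above.
const PERMISSION_ROLES: Record<Permission, Role[]> = {
  'interview:create': ['ADMIN', 'RECRUITER'],
  'interview:read': ['ADMIN', 'RECRUITER', 'VIEWER'],
  'interview:update': ['ADMIN', 'RECRUITER'],
  'interview:approve': ['ADMIN', 'RECRUITER'],
  'interview:delete': ['ADMIN'],
};

// Returns true when the given role is granted the permission.
function can(role: Role, permission: Permission): boolean {
  return PERMISSION_ROLES[permission].indexOf(role) !== -1;
}
```

For example, `can('VIEWER', 'interview:read')` is true while `can('RECRUITER', 'interview:delete')` is false, matching the table.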