Architecture

System Overview

Multi-tenant AI interview platform architecture with event-driven microservices running on GKE.

The AI Interview System is an event-driven, multi-tenant platform deployed on Google Kubernetes Engine. At its core is a NestJS API gateway that orchestrates three subsystems: the Agno Python agents for AI reasoning, the Vert.x WebSocket edge for real-time interview audio, and the HITL workflow engine for human oversight. All services communicate asynchronously via Google Managed Kafka.

Architecture Layers

Layer	Technology	Responsibility
API Gateway	NestJS 10 + TypeScript	REST Integration API, JWT/API-key auth, multi-tenant routing, webhook delivery
Agno Planner Agent	Python + Agno + Vertex AI	Skill validation via pgvector, interview plan generation (Gemini 2.0 Flash fine-tuned)
Agno Interviewer Agent — Maya	Python + Agno + Vertex AI	Conducts the live interview: Google STT, LLM response, Google TTS synthesis
Agno Assessor Agent	Python + Agno + Vertex AI	Post-interview transcript evaluation, structured hiring recommendation
WebSocket Edge	Vert.x 4.x (JVM)	Real-time audio streaming, 50K+ WebSocket connections per node, Redis session routing
Database	PostgreSQL 16 + pgvector	Interview data, workflow states, skill embeddings with Row-Level Security
Event Bus	Google Managed Kafka (SASL_SSL)	Async event streaming — 100 partitions per audio topic
Cache / Session Store	Redis 7	Interview session state, per-session audio buffers, WebSocket pod routing registry

Production Infrastructure

The platform runs on GKE cluster teamcast-ai-clust in Google Cloud us-central1.

Node Pools

Node Pool	Nodes	vCPU	RAM	Hosts
standard-8-pool	3x n2-standard-8	8 each (24 total)	32GB each (96GB total)	All services (current POC)

External Services

Service	Provider	Notes
Speech-to-Text (STT)	Google Cloud Speech — Chirp3 HD	Streaming + batch modes, 48kHz WebM/Opus
Text-to-Speech (TTS)	Gemini 2.5 Flash TTS (Vertex AI)	Raw PCM L16 24kHz mono, sentence-level streaming, emotion tags
LLM (all agents)	Vertex AI fine-tuned endpoints	Gemini 2.0 Flash with LoRA adapters (rank 8)
Kafka	Google Managed Kafka (SASL_SSL)	3 brokers, 100 partitions per audio topic
Redis	Single node	Session state + WebSocket pod routing registry
Video Storage	Google Cloud Storage	WebM video chunks, per-session recordings

Auto-Scaling (HPA)

All Python agent services scale horizontally. The Interviewer uses memory-based HPA because its workload is I/O-bound: CPU stays at 3–5% even under full load, while memory grows linearly with the number of active audio sessions (~40MB per active session).

Service	Min Pods	Max Pods	Scale Trigger	Threshold
Interviewer (Maya)	2	20	Memory utilization	70% of 1536Mi request (~1075Mi)
Planner	2	8	CPU utilization	60%
Assessor	2	8	CPU utilization	60%

The HPA scales out once a pod holds more than ~27 active audio sessions, which is the ~1075Mi threshold divided by ~40MB per session. Each audio session holds a Google STT streaming context, audio buffers, and a growing LLM conversation history. At the HPA maximum of 20 pods, the system supports approximately 540–700 simultaneous live audio interviews.
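
The capacity estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in this section. It assumes Mi and MB can be treated as interchangeable and that baseline process memory is ignored, which matches the rough estimate in the text. It derives the lower bound of 540 sessions only; the source does not say how the 700 upper bound was obtained.

```typescript
// Figures taken from the HPA table and the paragraph above.
const MEMORY_REQUEST_MI = 1536;      // per-pod memory request
const HPA_TARGET_UTILIZATION = 0.7;  // scale out at 70% of the request
const MEMORY_PER_SESSION_MB = 40;    // approximate cost of one live audio session
const MAX_PODS = 20;                 // HPA ceiling for the Interviewer

// Memory level (Mi) at which the HPA starts adding pods.
function scaleOutThresholdMi(requestMi: number, utilization: number): number {
  return Math.floor(requestMi * utilization);
}

// Sessions a pod can hold before crossing the threshold.
// Assumption: Mi and MB are treated as equal and baseline memory is ignored.
function sessionsPerPod(thresholdMi: number, perSessionMb: number): number {
  return Math.round(thresholdMi / perSessionMb);
}

const thresholdMi = scaleOutThresholdMi(MEMORY_REQUEST_MI, HPA_TARGET_UTILIZATION); // 1075
const perPod = sessionsPerPod(thresholdMi, MEMORY_PER_SESSION_MB);                  // 27
const clusterFloor = perPod * MAX_PODS;                                             // 540
```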

Multi-tenancy

Every database query is filtered by tenantId. Row-Level Security policies at the PostgreSQL level enforce isolation even if application code has a bug.

typescript

// Every controller extracts tenantId from the JWT
@Get()
async findAll(@TenantId() tenantId: string) {
  return this.interviewService.findAll(tenantId);
}

// Every service filters by tenantId
async findAll(tenantId: string) {
  return this.prisma.interview.findMany({
    where: { tenantId },  // Required — never omit!
  });
}

Never query the database without a tenantId filter. All services enforce this via the @TenantId() decorator and RLS policies.
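
One way to make the rule harder to break is to build every filter through a helper that rejects a missing tenantId. The sketch below is hypothetical: `tenantWhere` and `TenantScoped` are illustrative names, not part of the platform's codebase, and the helper only produces a Prisma-style `where` object. It complements the RLS policies and does not replace them.

```typescript
// Hypothetical helper (an assumption, not the platform's actual code):
// builds a Prisma-style `where` object and refuses to run without a tenantId.
type TenantScoped<T extends object> = T & { tenantId: string };

function tenantWhere<T extends object>(tenantId: string, filter: T): TenantScoped<T> {
  if (typeof tenantId !== "string" || tenantId.trim() === "") {
    throw new Error("tenantId is required for every database query");
  }
  // tenantId is spread last so a caller-supplied filter can never override it.
  return { ...filter, tenantId };
}

// Usage: the result can be passed as `where` to a findMany-style call.
const pendingForTenant = tenantWhere("tenant-123", { status: "PENDING" });
// { status: "PENDING", tenantId: "tenant-123" }
```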


Kafka Event Topics

Topic	Published By	Consumed By	Trigger
interview.info_needed	API Gateway	Notification Service	Data validation fails — HITL required
skill.validation.requested	API Gateway	Agno Planner (7777)	Data complete or info-needed resolved
interview.plan.created	Agno Planner (7777)	API Gateway	Plan generation complete
interview.approved	API Gateway	Agno Interviewer (7778)	Plan approved — create session
audio.candidate.spoken	Vert.x Edge (8080)	Agno Interviewer (7778)	Candidate mic audio — 100 partitions, key=sessionId
audio.agent.spoken	Agno Interviewer (7778)	Vert.x Edge (8080)	Agent TTS audio (PCM L16 24kHz) — 100 partitions, key=sessionId
audio.candidate.transcribed	Agno Interviewer (7778)	Vert.x Edge (8080)	Live interim transcripts for real-time display
video.candidate.stream	Vert.x Edge (8080)	VideoStorageConsumerService	Combined WebM chunks (30s), uploaded to GCS
interview.completed	Agno Interviewer (7778)	API Gateway + Assessor (7779)	Interview session ends
interview.assessment.ready	Agno Assessor (7779)	API Gateway	Assessment generated — HITL gate
interview.assessment.completed	Agno Assessor (7779)	API Gateway	Recruiter approved verdict — final webhook

Audio topics use sessionId as the Kafka message key, so all messages for one session land on the same partition and are processed in order by the same consumer thread: streaming=start first, then streaming=chunk, then streaming=end, and finally the batch audio message.
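
The ordering guarantee follows from how a keyed message is mapped to a partition: the key is hashed and reduced modulo the partition count, so the same sessionId always yields the same partition. The sketch below illustrates that property with a 32-bit FNV-1a-style hash as a stand-in. The Kafka Java client's default partitioner uses murmur2, so real partition numbers will differ; `fnv1a` and `partitionFor` are illustrative names only.

```typescript
const AUDIO_TOPIC_PARTITIONS = 100; // from the topic table above

// 32-bit FNV-1a-style hash over UTF-16 code units.
// Stand-in for illustration only: Kafka's Java client uses murmur2 instead.
function fnv1a(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value unsigned
  }
  return hash;
}

// Same key in, same partition out. That determinism is what keeps a
// session's start, chunk, and end messages on one ordered partition.
function partitionFor(sessionId: string, partitions: number = AUDIO_TOPIC_PARTITIONS): number {
  return fnv1a(sessionId) % partitions;
}
```

The mapping is stable only while the partition count stays fixed; changing the count of an existing topic would move keys to different partitions.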

Service Ports

Service	Port	Protocol	Health Endpoint
API Gateway (NestJS)	3009	HTTP/REST + Kafka	GET /api/v1/health/live
Agno Planner Agent	7777	HTTP + Kafka	GET /health/live
Agno Interviewer Agent — Maya	7778	HTTP + Kafka	GET /health/live
Agno Assessor Agent	7779	HTTP + Kafka	GET /health/live
WebSocket Edge (Vert.x)	8080	WebSocket + Kafka + Redis	GET /health
PostgreSQL	5432	TCP	—
Kafka	9092	SASL_SSL TCP	—
Redis	6379	TCP	—

Interview Data Flow

1. Plan Generation

text

External System
    │
    ▼
POST /api/v1/integration/interviews   (API Gateway)
    │
    ├── Validate data completeness (CRITICAL / HIGH / MEDIUM fields)
    ├── Save to PostgreSQL  (state: RECEIVED → VALIDATING_SKILLS)
    └── Publish: skill.validation.requested  (Kafka)
              │
              ▼
        Agno Planner Agent
              │
              ├── Validate skills via pgvector similarity search
              ├── Generate interview plan  (Vertex AI Gemini 2.0 Flash fine-tuned)
              └── Publish: interview.plan.created  (Kafka)
                          │
                          ▼
                  API Gateway → state: PENDING
                          └── Webhook: interview.plan_generated → callbackUrl
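
The state changes in the flow above can be read as a small transition table. The sketch below models only the three states named in the diagram and treats each transition as triggered by the associated Kafka event, which is a simplification: the real workflow engine presumably has more states (for example the info-needed path), and `PlanState`, `PlanEvent`, and `applyEvent` are illustrative names, not the platform's API.

```typescript
// Only the states and events that appear in the plan-generation diagram.
type PlanState = "RECEIVED" | "VALIDATING_SKILLS" | "PENDING";
type PlanEvent = "skill.validation.requested" | "interview.plan.created";

// Allowed transitions: current state -> event -> next state.
const TRANSITIONS: Record<PlanState, Partial<Record<PlanEvent, PlanState>>> = {
  RECEIVED: { "skill.validation.requested": "VALIDATING_SKILLS" },
  VALIDATING_SKILLS: { "interview.plan.created": "PENDING" },
  PENDING: {}, // plan awaits approval; later states are not covered here
};

// Returns the next state, or throws if the event is not valid in this state.
function applyEvent(state: PlanState, event: PlanEvent): PlanState {
  const next = TRANSITIONS[state][event];
  if (next === undefined) {
    throw new Error("Event " + event + " is not allowed in state " + state);
  }
  return next;
}
```

Rejecting unexpected events explicitly makes duplicate or out-of-order Kafka deliveries visible instead of silently corrupting the interview state.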

2. Live Interview — Audio Pipeline

text

Candidate Browser
    │
    ├── WebSocket connect: wss://mayaedge.teamcast.ai/ws?sessionId=...&tenantId=...
    │     Vert.x registers sessionId in Redis (pod routing registry)
    │
    ├── Audio frames (JSON): { type:"AUDIO", sessionId, streaming:"start|chunk|end" }
    │     Vert.x → Kafka: audio.candidate.spoken  (key=sessionId, 100 partitions)
    │
    └── Audio frames (batch): { type:"AUDIO", sessionId, data:"<base64 LINEAR16 PCM>" }
              Vert.x → Kafka: audio.candidate.spoken
                    │
                    ▼
              Agno Interviewer Agent
              4 consumer threads × 2 pods = 8 workers
              Per-session ThreadPoolExecutor queue (FIFO)
                    │
                    ├── streaming=start  → open Google STT streaming session (48kHz Chirp3 HD)
                    ├── streaming=chunk  → feed bytes into STT (live interim transcripts)
                    ├── streaming=end    → close STT, cache final transcript
                    └── batch chunk      → debounce → batch STT → LLM → TTS (per sentence)
                                │
                                ├── Google STT          200–500ms
                                ├── Vertex AI LLM       800–2000ms
                                └── Gemini Flash TTS    100–300ms
                                          │
                                          ▼
                              Publish: audio.agent.spoken  (AGENT_RESPONSE + base64 PCM L16 24kHz)
                                          │
                                          ▼
                              Vert.x reads Kafka → Redis lookup → delivers to WebSocket
                                          └── Browser decodes PCM Int16→Float32 via AudioContext
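
The last step of the pipeline, converting 16-bit PCM to the Float32 samples that the Web Audio API expects, comes down to dividing each sample by 32768. The sketch below shows only that conversion and assumes the base64 payload has already been decoded into an `Int16Array` with the correct byte order. The function names are illustrative, and the 24kHz mono sample rate comes from the TTS format listed in this document.

```typescript
const AGENT_AUDIO_SAMPLE_RATE_HZ = 24000; // PCM L16 24kHz mono, per the TTS row

// Converts signed 16-bit PCM samples to floats in the range [-1, 1).
function pcm16ToFloat32(samples: Int16Array): Float32Array {
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    out[i] = samples[i] / 32768; // Int16 spans -32768..32767
  }
  return out;
}

// Playback duration of a mono chunk, in seconds.
function chunkDurationSeconds(samples: Int16Array, sampleRateHz: number = AGENT_AUDIO_SAMPLE_RATE_HZ): number {
  return samples.length / sampleRateHz;
}
```

In a browser, the resulting `Float32Array` could be copied into an `AudioBuffer` created with a matching sample rate before playback.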

Benchmarked Performance

Measured on the production GKE cluster (2 interviewer pods, 3x n2-standard-8 nodes) using the Locust HTTP benchmark and a WebSocket audio pipeline benchmark with real candidate audio (60s, 16kHz LINEAR16 PCM).

HTTP Layer — Session Creation

Concurrent Users	Throughput	p50	p95	Error Rate
20 users	11.4 req/s	380ms	970ms	0%
100 users	54.1 req/s	400ms	1400ms	0%
200 users (mixed)	117.8 req/s	880ms	2400ms	0%

WebSocket Audio Pipeline — Live Sessions

Concurrent Sessions	Success Rate	WS Connect (median)	Greeting Latency (median)	Greeting (p95)
1	100%	602ms	2901ms	—
10	100%	634ms	3529ms	4110ms
20	90%	598ms	4462ms	5311ms

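Greeting latency covers the full path: the LLM generates the welcome text, Gemini Flash TTS synthesizes PCM, and the audio travels through Kafka, Vert.x, and the WebSocket until the browser receives the first audio. At 10 concurrent sessions the system delivers 100% success with a median greeting latency under 4s (p95: 4.1s). At 20 concurrent sessions the success rate drops to 90%.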

Per-Turn Audio Round-Trip (Production Observed)

Stage	Typical	Notes
Browser → Vert.x	20–50ms	GCP network
Vert.x → Kafka	5–15ms	SASL_SSL managed Kafka
Kafka poll	10–30ms	100ms poll interval
Google STT	200–500ms	Chirp3 HD, streaming mode
Vertex AI LLM	800–2000ms	Fine-tuned endpoint
Gemini Flash TTS	100–300ms	Gemini 2.5 Flash TTS, per sentence, PCM L16 24kHz
Kafka → Vert.x → browser	20–50ms	Return path
Total per turn	1.4s – 3.2s	End-to-end
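
As a cross-check, the per-stage ranges in the table can be summed directly. The sketch below does only that arithmetic, with illustrative type and function names. The itemized stages add up to roughly 1.16–2.95s, somewhat below the 1.4–3.2s total reported in the table. The source does not explain the difference; one possible reading, and it is only an assumption, is that the observed total includes overhead not itemized here, such as debouncing or queueing.

```typescript
type Stage = { name: string; minMs: number; maxMs: number };

// Typical ranges copied from the per-turn table above.
const STAGES: Stage[] = [
  { name: "Browser → Vert.x", minMs: 20, maxMs: 50 },
  { name: "Vert.x → Kafka", minMs: 5, maxMs: 15 },
  { name: "Kafka poll", minMs: 10, maxMs: 30 },
  { name: "Google STT", minMs: 200, maxMs: 500 },
  { name: "Vertex AI LLM", minMs: 800, maxMs: 2000 },
  { name: "Gemini Flash TTS", minMs: 100, maxMs: 300 },
  { name: "Kafka → Vert.x → browser", minMs: 20, maxMs: 50 },
];

// Sums the best-case and worst-case figures across all stages.
function totalRange(stages: Stage[]): { minMs: number; maxMs: number } {
  return stages.reduce(
    (acc, s) => ({ minMs: acc.minMs + s.minMs, maxMs: acc.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 },
  );
}

const itemizedTotal = totalRange(STAGES); // { minMs: 1155, maxMs: 2945 }
```

The LLM stage dominates both ends of the range, so it is the first place to look when reducing per-turn latency.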

Permissions

Permission	Roles	Description
interview:create	ADMIN, RECRUITER	Create new interviews via Integration API or admin UI
interview:read	ADMIN, RECRUITER, VIEWER	Read interview details and status
interview:update	ADMIN, RECRUITER	Update interview data (HITL completion)
interview:approve	ADMIN, RECRUITER	Approve/reject plans and assessments
interview:delete	ADMIN	Delete interview records
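
The table above maps naturally onto a lookup from permission to allowed roles. The sketch below is one possible encoding, assuming roles and permissions are plain strings. `PERMISSION_ROLES` and `can` are illustrative names, and the platform's actual guard implementation is not shown in this document.

```typescript
type Role = "ADMIN" | "RECRUITER" | "VIEWER";
type Permission =
  | "interview:create"
  | "interview:read"
  | "interview:update"
  | "interview:approve"
  | "interview:delete";

// Mirrors the permissions table row by row.
const PERMISSION_ROLES: Record<Permission, Set<Role>> = {
  "interview:create": new Set<Role>(["ADMIN", "RECRUITER"]),
  "interview:read": new Set<Role>(["ADMIN", "RECRUITER", "VIEWER"]),
  "interview:update": new Set<Role>(["ADMIN", "RECRUITER"]),
  "interview:approve": new Set<Role>(["ADMIN", "RECRUITER"]),
  "interview:delete": new Set<Role>(["ADMIN"]),
};

// True when the given role is allowed to perform the action.
function can(role: Role, permission: Permission): boolean {
  return PERMISSION_ROLES[permission].has(role);
}
```

Typing `Permission` as a union means a misspelled permission string fails at compile time rather than silently denying access at runtime.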
