Model Specifications

Per-agent inference configuration, input/output schemas, and architecture details for Teamcast Maya's three fine-tuned agents on Gemini 2.5 Flash.

All three agents run on dedicated Google Vertex AI endpoints in us-central1. Endpoint identifiers are managed centrally via k8s/configmap.yaml and injected at deployment time — no hardcoded values in application code.
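The injection pattern described above can be sketched as follows. This is a minimal illustration that assumes the ConfigMap values reach the application as environment variables; the variable names in `ENV_KEYS` are hypothetical, and the real keys are whatever k8s/configmap.yaml defines.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names for illustration only; the actual keys
# are defined in k8s/configmap.yaml.
ENV_KEYS = {
    "planner": "PLANNER_ENDPOINT_ID",
    "assessor": "ASSESSOR_ENDPOINT_ID",
    "interviewer": "INTERVIEWER_ENDPOINT_ID",
}


@dataclass(frozen=True)
class EndpointConfig:
    agent: str
    endpoint_id: str
    region: str = "us-central1"  # region stated in this page


def load_endpoint(agent: str, env=os.environ) -> EndpointConfig:
    """Read one agent's endpoint ID from the environment, failing loudly if absent."""
    key = ENV_KEYS[agent]
    endpoint_id = env.get(key)
    if not endpoint_id:
        raise RuntimeError(f"{key} is not set; check the ConfigMap injection")
    return EndpointConfig(agent=agent, endpoint_id=endpoint_id)
```

Failing at startup when a value is missing keeps a misconfigured deployment from silently falling back to a stale or hardcoded endpoint.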

Planner Agent

Inference Configuration

Field	Value
Base model	Gemini 2.5 Flash (fine-tuned, v3)
LoRA adapter rank	16
Temperature	0.7
Max output tokens	8,192 (core plan) + 2,048 (supplementary)
Response format	JSON (response_mime_type)
Concurrency	Two parallel async calls

Split-Call Architecture

The planner issues two independent Vertex AI calls concurrently, so total latency is bounded by the slower call rather than the sum of both. Call A generates the core interview plan; Call B generates supplementary content.

Call	Outputs	Max Tokens
A — Core	Competencies with rubric sub-dimensions (1-5 scale indicators), questions, skills coverage map	8,192
B — Supplementary	Greeting script, candidate outreach draft, must-have and nice-to-have criteria	2,048
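A minimal sketch of the split-call pattern, using `asyncio.gather` to run both calls concurrently and merge the results. `call_planner` is a hypothetical stand-in that returns canned data in place of the real Vertex AI request, and merging the two responses into one dictionary is an assumption based on the combined output schema.

```python
import asyncio

CORE_MAX_TOKENS = 8192           # Call A limit from the table above
SUPPLEMENTARY_MAX_TOKENS = 2048  # Call B limit from the table above


async def call_planner(prompt: str, max_output_tokens: int) -> dict:
    """Hypothetical stand-in for the Vertex AI request; returns canned JSON."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if max_output_tokens == CORE_MAX_TOKENS:
        return {"competencies": [], "skills_coverage": {}}
    return {
        "greeting_script": "",
        "inmail_draft": "",
        "must_have": [],
        "nice_to_have": [],
    }


async def generate_plan(core_prompt: str, supplementary_prompt: str) -> dict:
    # Both coroutines are scheduled together; gather returns results in call order.
    core, supplementary = await asyncio.gather(
        call_planner(core_prompt, CORE_MAX_TOKENS),
        call_planner(supplementary_prompt, SUPPLEMENTARY_MAX_TOKENS),
    )
    # The two calls produce disjoint keys, so a plain merge is safe here.
    return {**core, **supplementary}
```

Calling `asyncio.run(generate_plan(core_prompt, supplementary_prompt))` returns a single dictionary with the fields of both calls.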

Input Schema

json

{
  "position": "Senior Backend Engineer",
  "level": "senior",
  "job_description": "We are looking for...",
  "skills": ["Python", "PostgreSQL", "Redis"],
  "candidate_name": "Jane Doe",
  "resume_text": "5 years experience at...",
  "duration_minutes": 45
}

Output Schema (v3)

json

{
  "competencies": [
    {
      "name": "System Design",
      "weight": 0.3,
      "is_required": true,
      "minimum_acceptable_score": 3.0,
      "sub_dimensions": [
        {
          "name": "Architecture Thinking",
          "indicators": {
            "1": "Cannot articulate basic system components",
            "2": "Identifies components but misses interactions",
            "3": "Describes reasonable architecture with trade-offs",
            "4": "Strong architecture with clear scaling strategy",
            "5": "Exceptional design with novel approaches"
          }
        }
      ],
      "questions": ["Design a rate limiter at scale.", "..."]
    }
  ],
  "skills_coverage": { "Python": true, "PostgreSQL": true },
  "greeting_script": "Welcome Jane, thanks for joining today...",
  "inmail_draft": "Hi Jane, I reviewed your profile...",
  "must_have": ["3+ years Python", "distributed systems experience"],
  "nice_to_have": ["Kubernetes", "Go"]
}
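Because the planner returns JSON, a consumer can check the structure before using it. The sketch below is illustrative only: `validate_plan` is a hypothetical helper, and the rule that `weight` lies in (0, 1] is an assumption inferred from the sample value, not a documented constraint. The check that each sub-dimension carries indicators for levels 1 through 5 follows from the 1-5 rubric scale described above.

```python
EXPECTED_LEVELS = {"1", "2", "3", "4", "5"}


def validate_plan(plan: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the plan passed."""
    problems = []
    for comp in plan.get("competencies", []):
        name = comp.get("name", "<unnamed>")
        # Assumption: weights are fractions in (0, 1].
        if not 0.0 < comp.get("weight", 0.0) <= 1.0:
            problems.append(f"{name}: weight must be in (0, 1]")
        for sub in comp.get("sub_dimensions", []):
            levels = set(sub.get("indicators", {}))
            if levels != EXPECTED_LEVELS:
                problems.append(
                    f"{name}/{sub.get('name')}: indicators must cover levels 1-5"
                )
        if not comp.get("questions"):
            problems.append(f"{name}: no questions")
    return problems
```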

Assessor Agent

Inference Configuration

Field	Value
Base model	Gemini 2.5 Flash (fine-tuned, v3)
LoRA adapter rank	16
Temperature	0.1 (near-deterministic)
Max output tokens	8,192
Response format	JSON (response_mime_type)
Post-processing	Two-layer scoring + recommendation enforcement

Two-Layer Scoring Architecture

Layer 1 (LLM) scores five evidence components per sub-dimension on a 1-5 scale. Layer 2 (deterministic, implemented in scoring_engine.py) aggregates those components into agent_score, aggregated_score, and overall_score. The LLM emits overall_score as a 0.0 placeholder; Layer 2 always recomputes it.
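The sketch below illustrates what a deterministic Layer 2 pass could look like. It is not the logic in scoring_engine.py, which this page does not describe. The aggregation rules here are assumptions chosen for illustration: a sub-dimension score is the unweighted mean of its five evidence components, sub-dimensions with `exclusion_triggered` set are skipped, `aggregated_score` is the mean of the remaining sub-dimension scores, and `overall_score` is the weight-normalized mean across competencies. The computation of agent_score is not covered.

```python
from statistics import mean

COMPONENTS = (
    "completeness",
    "reasoning_clarity",
    "outcome_strength",
    "ownership",
    "evidence_confidence",
)


def sub_dimension_score(sub: dict) -> float:
    # Assumption: all five evidence components count equally.
    return float(mean(sub["evidence_components"][c] for c in COMPONENTS))


def recompute_scores(assessment: dict) -> dict:
    """Overwrite the 0.0 placeholders with values derived from evidence components."""
    weighted_sum = 0.0
    total_weight = 0.0
    for comp in assessment["competency_scores"]:
        usable = [
            s for s in comp["sub_dimension_scores"]
            if not s.get("exclusion_triggered")
        ]
        if not usable:
            continue  # nothing to aggregate for this competency
        comp["aggregated_score"] = round(
            mean(sub_dimension_score(s) for s in usable), 2
        )
        weighted_sum += comp["weight"] * comp["aggregated_score"]
        total_weight += comp["weight"]
    assessment["overall_score"] = (
        round(weighted_sum / total_weight, 2) if total_weight else 0.0
    )
    return assessment
```

Keeping the arithmetic outside the model means the same evidence components always yield the same scores, regardless of sampling temperature.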

Recommendation Thresholds (1-5 Scale)

The assessor outputs a hiring recommendation label. A post-processing step overrides the model label if it conflicts with the overall_score computed by Layer 2.

Recommendation	Overall Score	Additional
STRONG_HIRE	>= 4.5	No required competency failures
HIRE	>= 3.5	No required competency failures
MAYBE	>= 2.5 and < 3.5	Or any higher score with required competency failures
NO_HIRE	< 2.5
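The threshold table can be expressed as a small post-processing function. This is a sketch under one stated assumption: a score below 2.5 yields NO_HIRE even when required competency failures are present, since the table does not say which rule wins in that case. `enforce_recommendation` is a hypothetical name.

```python
def enforce_recommendation(overall_score: float, required_failures: list) -> str:
    """Map a Layer 2 overall_score to a label, per the threshold table."""
    # Assumption: the NO_HIRE floor takes precedence over the failure rule.
    if overall_score < 2.5:
        return "NO_HIRE"
    # Any required competency failure caps the label at MAYBE.
    if required_failures:
        return "MAYBE"
    if overall_score >= 4.5:
        return "STRONG_HIRE"
    if overall_score >= 3.5:
        return "HIRE"
    return "MAYBE"
```

A caller would compare the result with the model's `recommendation` field and replace the model label whenever the two differ.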

Output Schema (v3)

json

{
  "overall_score": 0.0,
  "competency_scores": [
    {
      "competency": "System Design",
      "weight": 0.3,
      "sub_dimension_scores": [
        {
          "sub_dimension": "Architecture Thinking",
          "evidence_components": {
            "completeness": 4,
            "reasoning_clarity": 3,
            "outcome_strength": 4,
            "ownership": 5,
            "evidence_confidence": 4
          },
          "transcript_turns": [5, 6, 12],
          "exclusion_triggered": false,
          "exclusion_reason": "",
          "observations": "Candidate described a rate limiter with clear outcomes..."
        }
      ],
      "aggregated_score": 0.0
    }
  ],
  "question_scores": [
    {
      "question_id": "q1",
      "question_text": "Design a rate limiter at scale.",
      "score": 4,
      "max_score": 5,
      "feedback": "Strong architecture with clear scaling strategy",
      "evidence_turns": [5, 6, 7]
    }
  ],
  "strengths": ["Clear communication", "Strong problem-solving"],
  "weaknesses": ["Limited experience with advanced topics"],
  "recommendation": "HIRE",
  "required_competency_failures": [],
  "summary": "Overall assessment addressing three core evaluation questions...",
  "must_have_evaluations": [
    { "criterion": "5+ years Python", "met": true, "confidence": 0.9, "evidence": "..." }
  ],
  "nice_to_have_evaluations": [
    { "criterion": "Kubernetes experience", "met": false, "confidence": 0.8, "evidence": "..." }
  ]
}

Note: overall_score and aggregated_score are 0.0 placeholders in the LLM output. Layer 2 (scoring_engine.py) computes them deterministically from evidence_components.

Interviewer Agent

Inference Configuration

Field	Value
Base model	Gemini 2.5 Flash (fine-tuned, v3 — 500 examples across 12 modes)
LoRA adapter rank	16
Temperature	0.7
Max output tokens	512
Response format	Plain text
System prompt	Company/job context, progress block, evidence requirements, communication techniques

12 Interaction Modes

The interviewer receives a [SYSTEM HINT] with each request indicating the interaction mode. Each mode triggers different response behavior:

Mode	Behavior
greeting	Open the interview with context and first question
answer	Acknowledge candidate answer and transition to next question
follow_up	Probe deeper on a partial or surface-level answer
evidence_probe	Request specific evidence (metrics, outcomes, examples)
rephrase	Rephrase question when candidate seems confused
off_topic	Redirect candidate back to the interview topic
prompt_elaboration	Encourage candidate to elaborate on a brief answer
closing	Wrap up the interview professionally
time_warning	Alert candidate about remaining time
time_up	End the interview when time expires
candidate_stop	Handle candidate requesting to stop
interruption	Handle mid-answer transitions
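One way to keep mode names consistent between the caller and the training data is to enumerate them. The twelve values below come from the table above; the exact text layout of the `[SYSTEM HINT]` line is not specified in this page, so the format produced by `with_mode_hint` is a hypothetical placeholder.

```python
from enum import Enum


class Mode(str, Enum):
    GREETING = "greeting"
    ANSWER = "answer"
    FOLLOW_UP = "follow_up"
    EVIDENCE_PROBE = "evidence_probe"
    REPHRASE = "rephrase"
    OFF_TOPIC = "off_topic"
    PROMPT_ELABORATION = "prompt_elaboration"
    CLOSING = "closing"
    TIME_WARNING = "time_warning"
    TIME_UP = "time_up"
    CANDIDATE_STOP = "candidate_stop"
    INTERRUPTION = "interruption"


def with_mode_hint(mode: Mode, candidate_text: str) -> str:
    # Hypothetical hint layout; the real format is defined by the training data.
    return f"[SYSTEM HINT] mode={mode.value}\n{candidate_text}"
```

Constructing the enum from a string, for example `Mode("follow_up")`, raises `ValueError` for an unknown mode, which catches typos before a request is sent.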

Real-Time Audio Pipeline

The interviewer operates in real-time during a live audio session. Each candidate utterance flows through the full pipeline before a response is delivered:

Stage	Technology	Notes
Candidate audio capture	Browser audio (WebM/Opus)	48 kHz sample rate
Transport	WebSocket edge layer	Low-latency binary streaming
Speech-to-text	Google Cloud Speech-to-Text	Streaming recognition
Response generation	Vertex AI fine-tuned endpoint	Full conversation history + mode hint + evidence requirements
Text-to-speech	Gemini 2.5 Flash TTS (Vertex AI)	Kore voice, PCM L16 24 kHz mono, emotion tags
Audio delivery	WebSocket to browser	Binary audio frames
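The TTS output format determines how large each binary audio frame is. PCM L16 at 24 kHz mono is 16-bit samples on one channel, so one second of audio is 24,000 × 2 × 1 = 48,000 bytes. The sketch below splits a PCM buffer into fixed-duration frames; the 20 ms frame length is an assumption for illustration, since the page does not state the frame size actually used.

```python
SAMPLE_RATE_HZ = 24_000  # from the pipeline table
BYTES_PER_SAMPLE = 2     # L16 means 16-bit linear PCM
CHANNELS = 1             # mono


def frame_bytes(frame_ms: int) -> int:
    """Size in bytes of one frame of the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * frame_ms // 1000


def chunk_pcm(pcm: bytes, frame_ms: int = 20) -> list[bytes]:
    """Split raw PCM into frames; the last frame may be shorter."""
    size = frame_bytes(frame_ms)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]
```

With these values a 20 ms frame is 960 bytes, and one second of audio splits into 50 frames.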

Output Behaviour

The interviewer outputs plain text — the next question or follow-up. The model is trained to:

Use professional interview techniques (labeling, mirroring, calibrated questions, specificity probing, tactical empathy)
Keep responses to 2-6 sentences (100% compliance in v3 evaluation)
Never give advice, hints, or reveal correct answers (100% safety rate)
Probe incomplete answers with targeted follow-up questions
Transition naturally between competency areas
Track evidence requirements and probe for missing evidence signals
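The 2-6 sentence constraint above can be spot-checked with a simple heuristic. This is a rough sketch, not the evaluation method behind the compliance figure: it splits on terminal punctuation followed by whitespace, so abbreviations such as "e.g." can be miscounted. Both function names are hypothetical.

```python
import re


def count_sentences(text: str) -> int:
    # Split after ., ! or ? when followed by whitespace; drop empty pieces.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def within_length_limit(text: str, lo: int = 2, hi: int = 6) -> bool:
    """True when the response has between lo and hi sentences, inclusive."""
    return lo <= count_sentences(text) <= hi
```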

Deprecated Endpoints

The following endpoints are deprecated and scheduled for decommissioning. All production traffic has been migrated to v3 endpoints.

Agent	Version	Base Model	Status
Planner	v2	Gemini 2.0 Flash	Deprecated — replaced by v3
Assessor	v2	Gemini 2.0 Flash	Deprecated — replaced by v3
Interviewer	v1	Gemini 2.0 Flash	Deprecated — replaced by v3
