AI Models

Model Specifications

Per-agent inference configuration, input/output schemas, and architecture details for Teamcast Maya's three fine-tuned agents on Gemini 2.5 Flash.

All three agents run on dedicated Google Vertex AI endpoints in us-central1. Endpoint identifiers are managed centrally in k8s/configmap.yaml and injected at deployment time, so application code contains no hardcoded values.
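
As a rough illustration of the injection pattern described above, the sketch below resolves an endpoint identifier at startup. It assumes the ConfigMap values are exposed to the container as environment variables; the variable names and the `resolve_endpoint` helper are hypothetical and not taken from k8s/configmap.yaml.

```python
import os

# Hypothetical variable names; the real keys are defined in k8s/configmap.yaml.
ENDPOINT_ENV_VARS = {
    "planner": "PLANNER_ENDPOINT_ID",
    "assessor": "ASSESSOR_ENDPOINT_ID",
    "interviewer": "INTERVIEWER_ENDPOINT_ID",
}

REGION = "us-central1"  # region stated in this document


def resolve_endpoint(agent: str, env=None) -> str:
    """Return the endpoint ID for an agent, failing fast if it was not injected."""
    env = os.environ if env is None else env
    var = ENDPOINT_ENV_VARS[agent]
    endpoint_id = env.get(var, "").strip()
    if not endpoint_id:
        raise RuntimeError(f"{var} is not set; check the ConfigMap entry for {agent}")
    return endpoint_id
```

Failing at startup rather than at first request makes a missing ConfigMap entry visible during rollout instead of mid-interview.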

Planner Agent

Inference Configuration

| Field | Value |
| --- | --- |
| Base model | Gemini 2.5 Flash (fine-tuned, v3) |
| LoRA adapter rank | 16 |
| Temperature | 0.7 |
| Max output tokens | 8,192 (core plan) + 2,048 (supplementary) |
| Response format | JSON (response_mime_type) |
| Concurrency | Two parallel async calls |

Split-Call Architecture

The planner makes two independent Vertex AI calls concurrently to reduce latency. Call A generates the core interview plan; Call B generates supplementary content. Because the calls run in parallel, total latency is bounded by the slower of the two rather than by their sum.

| Call | Outputs | Max Tokens |
| --- | --- | --- |
| A (Core) | Competencies with rubric sub-dimensions (1-5 scale indicators), questions, skills coverage map | 8,192 |
| B (Supplementary) | Greeting script, candidate outreach draft, must-have and nice-to-have criteria | 2,048 |
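
The split-call pattern can be sketched with `asyncio.gather`. This is a minimal illustration, not the production planner: `call_model` is a hypothetical stand-in that returns canned JSON instead of calling Vertex AI, and the prompt prefixes are invented for the example. Only the token limits and the output keys come from this page.

```python
import asyncio
import json

CALL_A_MAX_TOKENS = 8192  # core plan
CALL_B_MAX_TOKENS = 2048  # supplementary content


async def call_model(prompt: str, max_output_tokens: int) -> str:
    """Hypothetical stand-in for the Vertex AI request; returns canned JSON."""
    await asyncio.sleep(0.01)  # simulate network latency
    if max_output_tokens == CALL_A_MAX_TOKENS:
        return json.dumps({"competencies": [], "skills_coverage": {}})
    return json.dumps({"greeting_script": "", "inmail_draft": "",
                       "must_have": [], "nice_to_have": []})


async def build_plan(request: dict) -> dict:
    """Run Call A and Call B concurrently and merge their JSON outputs."""
    payload = json.dumps(request)
    core_raw, supp_raw = await asyncio.gather(
        call_model("core:" + payload, CALL_A_MAX_TOKENS),
        call_model("supplementary:" + payload, CALL_B_MAX_TOKENS),
    )
    # The two calls produce disjoint keys, so a plain dict merge is enough here.
    return {**json.loads(core_raw), **json.loads(supp_raw)}


plan = asyncio.run(build_plan({"position": "Senior Backend Engineer"}))
```

`asyncio.gather` returns results in argument order, so the core and supplementary payloads cannot be swapped even if Call B finishes first.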

Input Schema

```json
{
  "position": "Senior Backend Engineer",
  "level": "senior",
  "job_description": "We are looking for...",
  "skills": ["Python", "PostgreSQL", "Redis"],
  "candidate_name": "Jane Doe",
  "resume_text": "5 years experience at...",
  "duration_minutes": 45
}
```

Output Schema (v3)

```json
{
  "competencies": [
    {
      "name": "System Design",
      "weight": 0.3,
      "is_required": true,
      "minimum_acceptable_score": 3.0,
      "sub_dimensions": [
        {
          "name": "Architecture Thinking",
          "indicators": {
            "1": "Cannot articulate basic system components",
            "2": "Identifies components but misses interactions",
            "3": "Describes reasonable architecture with trade-offs",
            "4": "Strong architecture with clear scaling strategy",
            "5": "Exceptional design with novel approaches"
          }
        }
      ],
      "questions": ["Design a rate limiter at scale.", "..."]
    }
  ],
  "skills_coverage": { "Python": true, "PostgreSQL": true },
  "greeting_script": "Welcome Jane, thanks for joining today...",
  "inmail_draft": "Hi Jane, I reviewed your profile...",
  "must_have": ["3+ years Python", "distributed systems experience"],
  "nice_to_have": ["Kubernetes", "Go"]
}
```
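
A consumer of this schema may want to confirm that every sub-dimension carries a complete 1-5 indicator set before the plan is used. The check below is a sketch under that assumption; `rubric_problems` is a hypothetical helper, and the rule that each competency needs at least one question is inferred from the example rather than stated anywhere on this page.

```python
REQUIRED_LEVELS = {"1", "2", "3", "4", "5"}


def rubric_problems(plan: dict) -> list:
    """Return human-readable problems found in a planner output; empty if none."""
    problems = []
    for comp in plan.get("competencies", []):
        name = comp.get("name", "<unnamed>")
        if not comp.get("questions"):
            problems.append(f"{name}: no questions")
        for sub in comp.get("sub_dimensions", []):
            levels = set(sub.get("indicators", {}))
            if levels != REQUIRED_LEVELS:
                problems.append(
                    f"{name}/{sub.get('name')}: expected levels 1-5, got {sorted(levels)}"
                )
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller log every defect in a single pass.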

Assessor Agent

Inference Configuration

| Field | Value |
| --- | --- |
| Base model | Gemini 2.5 Flash (fine-tuned, v3) |
| LoRA adapter rank | 16 |
| Temperature | 0.1 (near-deterministic) |
| Max output tokens | 8,192 |
| Response format | JSON (response_mime_type) |
| Post-processing | Two-layer scoring + recommendation enforcement |

Two-Layer Scoring Architecture

Layer 1 (LLM): scores 5 evidence components per sub-dimension on a 1-5 scale. Layer 2 (deterministic): aggregates evidence components into agent_score, aggregated_score, and overall_score using scoring_engine.py. The LLM emits overall_score and aggregated_score as 0.0 placeholders; Layer 2 always recomputes them.
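
The Layer 2 step can be pictured with the sketch below. The actual formulas in scoring_engine.py are not documented on this page, so every aggregation rule here is an assumption: a plain mean over the five evidence components, a plain mean over sub-dimensions, and a weight-normalised average across competencies. The function names are hypothetical, and agent_score is left out because its definition is not given.

```python
from statistics import mean


def sub_dimension_score(evidence_components: dict) -> float:
    """Collapse the five 1-5 evidence components into one score (assumed: plain mean)."""
    return float(mean(list(evidence_components.values())))


def recompute_scores(assessment: dict) -> dict:
    """Overwrite the 0.0 placeholders with deterministically computed scores."""
    weighted_total = 0.0
    weight_sum = 0.0
    for comp in assessment["competency_scores"]:
        subs = [sub_dimension_score(s["evidence_components"])
                for s in comp["sub_dimension_scores"]]
        comp["aggregated_score"] = round(mean(subs), 2)  # assumed: mean of sub-dimensions
        weighted_total += comp["weight"] * comp["aggregated_score"]
        weight_sum += comp["weight"]
    # Normalise by the weight sum so the result stays on the 1-5 scale
    # even when the weights do not add up to 1.0.
    assessment["overall_score"] = round(weighted_total / weight_sum, 2)
    return assessment
```

Because the arithmetic is pure Python with no model call, the same evidence components always produce the same scores, which is the property the two-layer split is meant to provide.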

Recommendation Thresholds (1-5 Scale)

The assessor outputs a hiring recommendation label. A post-processing step overrides the model label if it conflicts with the overall_score computed by Layer 2.

| Recommendation | Overall Score | Additional |
| --- | --- | --- |
| STRONG_HIRE | >= 4.5 | No required competency failures |
| HIRE | >= 3.5 | No required competency failures |
| MAYBE | >= 2.5 | or has failures |
| NO_HIRE | < 2.5 | |
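
The thresholds above, together with the override step, can be sketched as follows. The table does not say what happens when a score below 2.5 also has required competency failures; this sketch assumes NO_HIRE wins in that case. `recommend` and `enforce_recommendation` are hypothetical names.

```python
def recommend(overall_score: float, required_failures: list) -> str:
    """Map a Layer 2 overall_score (1-5 scale) to a recommendation label."""
    if overall_score < 2.5:
        return "NO_HIRE"  # assumption: low scores are NO_HIRE even with failures
    if required_failures:
        return "MAYBE"  # a required competency failure rules out HIRE and STRONG_HIRE
    if overall_score >= 4.5:
        return "STRONG_HIRE"
    if overall_score >= 3.5:
        return "HIRE"
    return "MAYBE"


def enforce_recommendation(assessment: dict) -> dict:
    """Replace the model's label when it disagrees with the computed one."""
    expected = recommend(
        assessment["overall_score"],
        assessment.get("required_competency_failures", []),
    )
    if assessment.get("recommendation") != expected:
        assessment["recommendation"] = expected
    return assessment
```

Run this only after overall_score has been recomputed; applied to the raw LLM output, the 0.0 placeholder would map every interview to NO_HIRE.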

Output Schema (v3)

```json
{
  "overall_score": 0.0,
  "competency_scores": [
    {
      "competency": "System Design",
      "weight": 0.3,
      "sub_dimension_scores": [
        {
          "sub_dimension": "Architecture Thinking",
          "evidence_components": {
            "completeness": 4,
            "reasoning_clarity": 3,
            "outcome_strength": 4,
            "ownership": 5,
            "evidence_confidence": 4
          },
          "transcript_turns": [5, 6, 12],
          "exclusion_triggered": false,
          "exclusion_reason": "",
          "observations": "Candidate described a rate limiter with clear outcomes..."
        }
      ],
      "aggregated_score": 0.0
    }
  ],
  "question_scores": [
    {
      "question_id": "q1",
      "question_text": "Design a rate limiter at scale.",
      "score": 4,
      "max_score": 5,
      "feedback": "Strong architecture with clear scaling strategy",
      "evidence_turns": [5, 6, 7]
    }
  ],
  "strengths": ["Clear communication", "Strong problem-solving"],
  "weaknesses": ["Limited experience with advanced topics"],
  "recommendation": "HIRE",
  "required_competency_failures": [],
  "summary": "Overall assessment addressing three core evaluation questions...",
  "must_have_evaluations": [
    { "criterion": "5+ years Python", "met": true, "confidence": 0.9, "evidence": "..." }
  ],
  "nice_to_have_evaluations": [
    { "criterion": "Kubernetes experience", "met": false, "confidence": 0.8, "evidence": "..." }
  ]
}
```

Note: overall_score and aggregated_score are placeholders (0.0) in the LLM output. The evaluation engine computes them deterministically from evidence_components.

Interviewer Agent

Inference Configuration

| Field | Value |
| --- | --- |
| Base model | Gemini 2.5 Flash (fine-tuned, v3; 500 examples across 12 modes) |
| LoRA adapter rank | 16 |
| Temperature | 0.7 |
| Max output tokens | 512 |
| Response format | Plain text |
| System prompt | Company/job context, progress block, evidence requirements, communication techniques |

12 Interaction Modes

The interviewer receives a [SYSTEM HINT] with each request indicating the interaction mode. Each mode triggers different response behavior:

| Mode | Behavior |
| --- | --- |
| greeting | Open the interview with context and first question |
| answer | Acknowledge candidate answer and transition to next question |
| follow_up | Probe deeper on a partial or surface-level answer |
| evidence_probe | Request specific evidence (metrics, outcomes, examples) |
| rephrase | Rephrase question when candidate seems confused |
| off_topic | Redirect candidate back to the interview topic |
| prompt_elaboration | Encourage candidate to elaborate on a brief answer |
| closing | Wrap up the interview professionally |
| time_warning | Alert candidate about remaining time |
| time_up | End the interview when time expires |
| candidate_stop | Handle candidate requesting to stop |
| interruption | Handle mid-answer transitions |
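
One way to keep the mode names consistent between caller and model is an enumeration. The sketch below lists the twelve modes from the table; the exact text layout of the [SYSTEM HINT] line is not specified on this page, so the `mode=<name>` format and the `with_mode_hint` helper are assumptions for illustration only.

```python
from enum import Enum


class Mode(str, Enum):
    GREETING = "greeting"
    ANSWER = "answer"
    FOLLOW_UP = "follow_up"
    EVIDENCE_PROBE = "evidence_probe"
    REPHRASE = "rephrase"
    OFF_TOPIC = "off_topic"
    PROMPT_ELABORATION = "prompt_elaboration"
    CLOSING = "closing"
    TIME_WARNING = "time_warning"
    TIME_UP = "time_up"
    CANDIDATE_STOP = "candidate_stop"
    INTERRUPTION = "interruption"


def with_mode_hint(candidate_text: str, mode: Mode) -> str:
    """Prefix the candidate turn with a mode hint (the hint layout is assumed)."""
    return f"[SYSTEM HINT] mode={mode.value}\n{candidate_text}"
```

Constructing `Mode("follow_up")` from an untrusted string raises `ValueError` for unknown names, which catches typos before they reach the endpoint.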

Real-Time Audio Pipeline

The interviewer operates in real-time during a live audio session. Each candidate utterance flows through the full pipeline before a response is delivered:

| Stage | Technology | Notes |
| --- | --- | --- |
| Candidate audio capture | Browser audio (WebM/Opus) | 48 kHz sample rate |
| Transport | WebSocket edge layer | Low-latency binary streaming |
| Speech-to-text | Google Cloud Speech-to-Text | Streaming recognition |
| Response generation | Vertex AI fine-tuned endpoint | Full conversation history + mode hint + evidence requirements |
| Text-to-speech | Gemini 2.5 Flash TTS (Vertex AI) | Kore voice, PCM L16 24 kHz mono, emotion tags |
| Audio delivery | WebSocket to browser | Binary audio frames |
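
The TTS output format fixes the byte rate of the delivery stage: 16-bit linear PCM at 24 kHz mono is 24,000 × 2 × 1 = 48,000 bytes per second. The sketch below works through that arithmetic and splits a PCM buffer into fixed-duration frames. The 20 ms frame duration and the helper names are assumptions; this page does not state how the binary frames are sized.

```python
SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2  # L16 = 16-bit linear PCM
CHANNELS = 1  # mono

BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


def frame_size_bytes(frame_ms: int = 20) -> int:
    """Number of bytes in one frame of the given duration."""
    return BYTES_PER_SECOND * frame_ms // 1000


def split_frames(pcm: bytes, frame_ms: int = 20):
    """Yield fixed-duration frames; the last frame may be shorter."""
    size = frame_size_bytes(frame_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def duration_seconds(pcm: bytes) -> float:
    """Playback duration of a raw PCM buffer in this format."""
    return len(pcm) / BYTES_PER_SECOND
```

At 20 ms per frame each WebSocket message would carry 960 bytes, so a 10-second response amounts to 500 frames.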

Output Behaviour

The interviewer outputs plain text — the next question or follow-up. The model is trained to:

  • Use professional interview techniques (labeling, mirroring, calibrated questions, specificity probing, tactical empathy)
  • Keep responses to 2-6 sentences (100% compliance in v3 evaluation)
  • Never give advice, hints, or reveal correct answers (100% safety rate)
  • Probe incomplete answers with targeted follow-up questions
  • Transition naturally between competency areas
  • Track evidence requirements and probe for missing evidence signals
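
The 2-6 sentence constraint lends itself to an automated check. The sketch below is one naive way to measure it and is not the evaluation harness behind the compliance figure quoted above; the regular expression and function names are assumptions, and abbreviations such as "e.g." will be overcounted.

```python
import re

# A sentence ends at a run of terminal punctuation followed by whitespace or end of text.
_SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")


def count_sentences(text: str) -> int:
    """Naive sentence count based on terminal punctuation."""
    return len(_SENTENCE_END.findall(text.strip()))


def within_length_limit(text: str, low: int = 2, high: int = 6) -> bool:
    """True when the response has between `low` and `high` sentences inclusive."""
    return low <= count_sentences(text) <= high
```

For example, `within_length_limit("Thanks for walking me through that. What metrics did you track?")` holds, while a one-word acknowledgement such as `"Okay."` does not.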

Deprecated Endpoints

The following endpoints are deprecated and scheduled for decommissioning. All production traffic has been migrated to v3 endpoints.

| Agent | Version | Base Model | Status |
| --- | --- | --- | --- |
| Planner | v2 | Gemini 2.0 Flash | Deprecated, replaced by v3 |
| Assessor | v2 | Gemini 2.0 Flash | Deprecated, replaced by v3 |
| Interviewer | v1 | Gemini 2.0 Flash | Deprecated, replaced by v3 |