AI Models
Fine-Tuning Methodology
How Teamcast Maya's three AI agents are trained, validated, and continuously improved using Google Vertex AI Supervised Fine-Tuning.
Overview
Maya uses Supervised Fine-Tuning (SFT) via Google Vertex AI to adapt Gemini 2.5 Flash for the specific task of conducting and evaluating technical interviews. Fine-tuning is performed independently for each agent using purpose-built datasets that mirror the exact prompt format used in production.
| Parameter | Value |
|---|---|
| Base model | Gemini 2.5 Flash (gemini-2.5-flash) |
| Fine-tuning method | Supervised Fine-Tuning (SFT) with LoRA |
| LoRA adapter rank | 16 (maximum supported for Gemini 2.5 Flash) |
| Epochs | 5 |
| Learning rate multiplier | 1.0 |
| Infrastructure | Google Vertex AI — us-central1 |
| Data version | v3 — production-aligned with evidence components |
Training Data Pipeline
v3 training data is synthetically generated using the base Gemini 2.5 Flash model guided by the exact production system prompts. This ensures perfect alignment between what the model learns during fine-tuning and what it receives at inference time.
Parallel Generation
Training examples are generated in parallel using 12 concurrent workers, each making independent Vertex AI API calls. This approach reduces generation time by approximately 10x compared to sequential generation.
Assessor — Evidence Components (v3)
The v3 assessor uses a two-layer scoring model. Layer 1 (the LLM) scores 5 evidence components per sub-dimension on a 1-5 scale. Layer 2 (deterministic math in service.py) aggregates evidence components into agent_score, aggregated_score, and overall_score. This separation ensures reproducible scoring while leveraging the LLM for nuanced evidence evaluation.
| Evidence Component | Scale | What it measures |
|---|---|---|
| Completeness | 1-5 | Are context, action, and outcome all present? |
| Reasoning Clarity | 1-5 | How clearly did the candidate articulate thinking and tradeoffs? |
| Outcome Strength | 1-5 | How concrete and measurable was the stated outcome? |
| Ownership | 1-5 | Did they use "I" showing personal agency vs vague "we"? |
| Evidence Confidence | 1-5 | How confident that evidence is real vs fabricated? |
Assessor — Recommendation Thresholds (1-5 Scale)
| Tier | Score Range | Recommendation | Condition |
|---|---|---|---|
| Exceptional | 4.5 - 5.0 | STRONG_HIRE | No required competency failures |
| Strong | 3.5 - 4.5 | HIRE | No required competency failures |
| Moderate | 2.5 - 3.5 | MAYBE | Or has required competency failures |
| Weak/Insufficient | 1.0 - 2.5 | NO_HIRE |
Interviewer — 12 Interaction Modes
The v3 interviewer training data covers 12 distinct interaction modes, each with a weighted distribution to match production usage patterns:
| Mode | Weight | Purpose |
|---|---|---|
| answer | 30% | Standard response to candidate answer — acknowledge and transition |
| follow_up | 15% | Probe deeper on a partial or surface-level answer |
| evidence_probe | 10% | Request specific evidence (metrics, outcomes, examples) |
| greeting | 10% | Opening the interview with context and first question |
| rephrase | 8% | Rephrase question when candidate seems confused |
| off_topic | 5% | Redirect candidate back to the interview topic |
| prompt_elaboration | 5% | Encourage candidate to elaborate on a brief answer |
| closing | 5% | Wrap up the interview professionally |
| time_warning | 4% | Alert candidate about remaining time |
| time_up | 3% | End the interview when time expires |
| candidate_stop | 3% | Handle candidate requesting to stop |
| interruption | 2% | Handle mid-answer transitions or topic changes |
Data Splits (v3)
| Agent | Train | Val | Test | Total |
|---|---|---|---|---|
| Planner | 240 | 30 | 30 | 300 |
| Assessor | 516 | 64 | 64 | 644 |
| Interviewer | 400 | 50 | 50 | 500 |
All v3 datasets are synthetically generated using production system prompts. The assessor dataset includes examples across all recommendation tiers with evidence_components and transcript_turns. The interviewer dataset covers all 12 interaction modes with weighted distribution.
Production Alignment
A critical lesson from v1: training data must use the exact same system prompt and output schema that runs in production. v1 data used an older schema with normalized 0-1 scores and RECOMMENDED / NOT_RECOMMENDED labels. v2 moved to 0-4 scores with STRONG_HIRE / HIRE / MAYBE / NO_HIRE. v3 uses the current 1-5 evidence component scale with transcript_turns and evidence_turns arrays.
Inference-Time Safeguards
Fine-tuning improves the base model but does not guarantee perfect calibration. Maya applies additional safeguards at inference time:
| Safeguard | Applied to | Effect |
|---|---|---|
| Temperature 0.1 | Assessor | Near-deterministic scoring — reduces variance on borderline candidates |
| Two-layer scoring | Assessor | Layer 1 (LLM) generates evidence components; Layer 2 (deterministic) computes aggregated scores — ensures reproducible overall scores |
| Recommendation enforcement | Assessor | Post-processes model output: if the generated recommendation conflicts with the overall score and explicit tier thresholds, it is overridden |
| Structured JSON output mode | Planner, Assessor | Uses response_mime_type="application/json" — forces valid JSON output, eliminates markdown fence wrapping |
| systemInstruction | All agents | System prompt passed via native Vertex AI systemInstruction parameter, not injected as user message |
Version History
| Version | Base Model | LoRA Rank | Scale | Key Changes | Status |
|---|---|---|---|---|---|
| v1 | Gemini 2.0 Flash | 4 | 0-1 | Initial extraction-based data | Removed |
| v2 | Gemini 2.0 Flash | 8 | 0-4 | Synthetic data, borderline calibration | Deprecated |
| v3 | Gemini 2.5 Flash | 16 | 1-5 | Evidence components, transcript_turns, 12 interviewer modes | Live |
Retraining Workflow
The full fine-tuning cycle runs in three stages. First, new training data is generated in parallel (~10-15 minutes for 500 examples). Second, the data is uploaded to Google Cloud Storage and Vertex AI fine-tuning jobs are launched, completing in approximately 2-3 hours. Third, the new endpoint IDs are updated in the deployment configuration (k8s/configmap.yaml) and applied via rolling restart with zero downtime.