AI Models

Fine-Tuning Methodology

How Teamcast Maya's three AI agents are trained, validated, and continuously improved using Google Vertex AI Supervised Fine-Tuning.

Overview

Maya uses Supervised Fine-Tuning (SFT) via Google Vertex AI to adapt Gemini 2.5 Flash for the specific task of conducting and evaluating technical interviews. Fine-tuning is performed independently for each agent using purpose-built datasets that mirror the exact prompt format used in production.

ParameterValue
Base modelGemini 2.5 Flash (gemini-2.5-flash)
Fine-tuning methodSupervised Fine-Tuning (SFT) with LoRA
LoRA adapter rank16 (maximum supported for Gemini 2.5 Flash)
Epochs5
Learning rate multiplier1.0
InfrastructureGoogle Vertex AI — us-central1
Data versionv3 — production-aligned with evidence components
v3 upgraded from Gemini 2.0 Flash (LoRA rank 8) to Gemini 2.5 Flash (LoRA rank 16). The higher adapter rank and newer base model provide better structured output adherence and scoring calibration.

Training Data Pipeline

v3 training data is synthetically generated using the base Gemini 2.5 Flash model guided by the exact production system prompts. This ensures perfect alignment between what the model learns during fine-tuning and what it receives at inference time.

Parallel Generation

Training examples are generated in parallel using 12 concurrent workers, each making independent Vertex AI API calls. This approach reduces generation time by approximately 10x compared to sequential generation.

Assessor — Evidence Components (v3)

The v3 assessor uses a two-layer scoring model. Layer 1 (the LLM) scores 5 evidence components per sub-dimension on a 1-5 scale. Layer 2 (deterministic math in service.py) aggregates evidence components into agent_score, aggregated_score, and overall_score. This separation ensures reproducible scoring while leveraging the LLM for nuanced evidence evaluation.

Evidence ComponentScaleWhat it measures
Completeness1-5Are context, action, and outcome all present?
Reasoning Clarity1-5How clearly did the candidate articulate thinking and tradeoffs?
Outcome Strength1-5How concrete and measurable was the stated outcome?
Ownership1-5Did they use "I" showing personal agency vs vague "we"?
Evidence Confidence1-5How confident that evidence is real vs fabricated?

Assessor — Recommendation Thresholds (1-5 Scale)

TierScore RangeRecommendationCondition
Exceptional4.5 - 5.0STRONG_HIRENo required competency failures
Strong3.5 - 4.5HIRENo required competency failures
Moderate2.5 - 3.5MAYBEOr has required competency failures
Weak/Insufficient1.0 - 2.5NO_HIRE

Interviewer — 12 Interaction Modes

The v3 interviewer training data covers 12 distinct interaction modes, each with a weighted distribution to match production usage patterns:

ModeWeightPurpose
answer30%Standard response to candidate answer — acknowledge and transition
follow_up15%Probe deeper on a partial or surface-level answer
evidence_probe10%Request specific evidence (metrics, outcomes, examples)
greeting10%Opening the interview with context and first question
rephrase8%Rephrase question when candidate seems confused
off_topic5%Redirect candidate back to the interview topic
prompt_elaboration5%Encourage candidate to elaborate on a brief answer
closing5%Wrap up the interview professionally
time_warning4%Alert candidate about remaining time
time_up3%End the interview when time expires
candidate_stop3%Handle candidate requesting to stop
interruption2%Handle mid-answer transitions or topic changes

Data Splits (v3)

AgentTrainValTestTotal
Planner2403030300
Assessor5166464644
Interviewer4005050500

All v3 datasets are synthetically generated using production system prompts. The assessor dataset includes examples across all recommendation tiers with evidence_components and transcript_turns. The interviewer dataset covers all 12 interaction modes with weighted distribution.

Production Alignment

A critical lesson from v1: training data must use the exact same system prompt and output schema that runs in production. v1 data used an older schema with normalized 0-1 scores and RECOMMENDED / NOT_RECOMMENDED labels. v2 moved to 0-4 scores with STRONG_HIRE / HIRE / MAYBE / NO_HIRE. v3 uses the current 1-5 evidence component scale with transcript_turns and evidence_turns arrays.

Training data must be regenerated whenever the production system prompt or output schema changes. Maya's data generation tooling imports prompts directly from the live agent configuration to prevent drift.

Inference-Time Safeguards

Fine-tuning improves the base model but does not guarantee perfect calibration. Maya applies additional safeguards at inference time:

SafeguardApplied toEffect
Temperature 0.1AssessorNear-deterministic scoring — reduces variance on borderline candidates
Two-layer scoringAssessorLayer 1 (LLM) generates evidence components; Layer 2 (deterministic) computes aggregated scores — ensures reproducible overall scores
Recommendation enforcementAssessorPost-processes model output: if the generated recommendation conflicts with the overall score and explicit tier thresholds, it is overridden
Structured JSON output modePlanner, AssessorUses response_mime_type="application/json" — forces valid JSON output, eliminates markdown fence wrapping
systemInstructionAll agentsSystem prompt passed via native Vertex AI systemInstruction parameter, not injected as user message

Version History

VersionBase ModelLoRA RankScaleKey ChangesStatus
v1Gemini 2.0 Flash40-1Initial extraction-based dataRemoved
v2Gemini 2.0 Flash80-4Synthetic data, borderline calibrationDeprecated
v3Gemini 2.5 Flash161-5Evidence components, transcript_turns, 12 interviewer modesLive

Retraining Workflow

The full fine-tuning cycle runs in three stages. First, new training data is generated in parallel (~10-15 minutes for 500 examples). Second, the data is uploaded to Google Cloud Storage and Vertex AI fine-tuning jobs are launched, completing in approximately 2-3 hours. Third, the new endpoint IDs are updated in the deployment configuration (k8s/configmap.yaml) and applied via rolling restart with zero downtime.

Was this page helpful?