AI Models

Fine-Tuning Methodology

How Teamcast Maya's three AI agents are trained, validated, and continuously improved using Google Vertex AI Supervised Fine-Tuning.

Overview

Maya uses Supervised Fine-Tuning (SFT) via Google Vertex AI to adapt Gemini 2.5 Flash for the specific task of conducting and evaluating technical interviews. Fine-tuning is performed independently for each agent using purpose-built datasets that mirror the exact prompt format used in production.

Parameter	Value
Base model	Gemini 2.5 Flash (gemini-2.5-flash)
Fine-tuning method	Supervised Fine-Tuning (SFT) with LoRA
LoRA adapter rank	16 (maximum supported for Gemini 2.5 Flash)
Epochs	5
Learning rate multiplier	1.0
Infrastructure	Google Vertex AI — us-central1
Data version	v3 — production-aligned with evidence components

v3 upgraded from Gemini 2.0 Flash (LoRA rank 8) to Gemini 2.5 Flash (LoRA rank 16). The higher adapter rank and newer base model provide better structured output adherence and scoring calibration.

Training Data Pipeline

v3 training data is synthetically generated using the base Gemini 2.5 Flash model guided by the exact production system prompts. This ensures perfect alignment between what the model learns during fine-tuning and what it receives at inference time.

Parallel Generation

Training examples are generated in parallel using 12 concurrent workers, each making independent Vertex AI API calls. This approach reduces generation time by approximately 10x compared to sequential generation.

Assessor — Evidence Components (v3)

The v3 assessor uses a two-layer scoring model. Layer 1 (the LLM) scores 5 evidence components per sub-dimension on a 1-5 scale. Layer 2 (deterministic math in service.py) aggregates evidence components into agent_score, aggregated_score, and overall_score. This separation ensures reproducible scoring while leveraging the LLM for nuanced evidence evaluation.

Evidence Component	Scale	What it measures
Completeness	1-5	Are context, action, and outcome all present?
Reasoning Clarity	1-5	How clearly did the candidate articulate thinking and tradeoffs?
Outcome Strength	1-5	How concrete and measurable was the stated outcome?
Ownership	1-5	Did they use "I" showing personal agency vs vague "we"?
Evidence Confidence	1-5	How confident that evidence is real vs fabricated?

Assessor — Recommendation Thresholds (1-5 Scale)

Tier	Score Range	Recommendation	Condition
Exceptional	4.5 - 5.0	STRONG_HIRE	No required competency failures
Strong	3.5 - 4.5	HIRE	No required competency failures
Moderate	2.5 - 3.5	MAYBE	Or has required competency failures
Weak/Insufficient	1.0 - 2.5	NO_HIRE

Interviewer — 12 Interaction Modes

The v3 interviewer training data covers 12 distinct interaction modes, each with a weighted distribution to match production usage patterns:

Mode	Weight	Purpose
answer	30%	Standard response to candidate answer — acknowledge and transition
follow_up	15%	Probe deeper on a partial or surface-level answer
evidence_probe	10%	Request specific evidence (metrics, outcomes, examples)
greeting	10%	Opening the interview with context and first question
rephrase	8%	Rephrase question when candidate seems confused
off_topic	5%	Redirect candidate back to the interview topic
prompt_elaboration	5%	Encourage candidate to elaborate on a brief answer
closing	5%	Wrap up the interview professionally
time_warning	4%	Alert candidate about remaining time
time_up	3%	End the interview when time expires
candidate_stop	3%	Handle candidate requesting to stop
interruption	2%	Handle mid-answer transitions or topic changes

Data Splits (v3)

Agent	Train	Val	Test	Total
Planner	240	30	30	300
Assessor	516	64	64	644
Interviewer	400	50	50	500

All v3 datasets are synthetically generated using production system prompts. The assessor dataset includes examples across all recommendation tiers with evidence_components and transcript_turns. The interviewer dataset covers all 12 interaction modes with weighted distribution.

Production Alignment

A critical lesson from v1: training data must use the exact same system prompt and output schema that runs in production. v1 data used an older schema with normalized 0-1 scores and RECOMMENDED / NOT_RECOMMENDED labels. v2 moved to 0-4 scores with STRONG_HIRE / HIRE / MAYBE / NO_HIRE. v3 uses the current 1-5 evidence component scale with transcript_turns and evidence_turns arrays.

Training data must be regenerated whenever the production system prompt or output schema changes. Maya's data generation tooling imports prompts directly from the live agent configuration to prevent drift.

Inference-Time Safeguards

Fine-tuning improves the base model but does not guarantee perfect calibration. Maya applies additional safeguards at inference time:

Safeguard	Applied to	Effect
Temperature 0.1	Assessor	Near-deterministic scoring — reduces variance on borderline candidates
Two-layer scoring	Assessor	Layer 1 (LLM) generates evidence components; Layer 2 (deterministic) computes aggregated scores — ensures reproducible overall scores
Recommendation enforcement	Assessor	Post-processes model output: if the generated recommendation conflicts with the overall score and explicit tier thresholds, it is overridden
Structured JSON output mode	Planner, Assessor	Uses response_mime_type="application/json" — forces valid JSON output, eliminates markdown fence wrapping
systemInstruction	All agents	System prompt passed via native Vertex AI systemInstruction parameter, not injected as user message

Version History

Version	Base Model	LoRA Rank	Scale	Key Changes	Status
v1	Gemini 2.0 Flash	4	0-1	Initial extraction-based data	Removed
v2	Gemini 2.0 Flash	8	0-4	Synthetic data, borderline calibration	Deprecated
v3	Gemini 2.5 Flash	16	1-5	Evidence components, transcript_turns, 12 interviewer modes	Live

Retraining Workflow

The full fine-tuning cycle runs in three stages. First, new training data is generated in parallel (~10-15 minutes for 500 examples). Second, the data is uploaded to Google Cloud Storage and Vertex AI fine-tuning jobs are launched, completing in approximately 2-3 hours. Third, the new endpoint IDs are updated in the deployment configuration (k8s/configmap.yaml) and applied via rolling restart with zero downtime.

Was this page helpful?

Previous← Overview

NextModel Specifications →