Evaluation Results

Benchmark metrics for Teamcast Maya's v3 fine-tuned agents on Gemini 2.5 Flash, comparing fine-tuned endpoints against the base model on held-out test sets.

Each of the three agents was evaluated on its own held-out v3 test set of 20 examples. v3 uses the 1-5 evidence component scale with transcript_turns arrays.

Accuracy Overview

The bar chart compares key accuracy metrics across all three agents. The fine-tuned models outperform the base model on most structured-output metrics shown. ROUGE-L is multiplied by 100 so that it shares the percentage axis with the other metrics.

Accuracy metrics by agent — v3 on Gemini 2.5 Flash (%)

Quality Radar

Six quality dimensions are mapped onto a 0-100 scale. Evidence Completeness is the percentage of outputs with all 5 evidence component keys per sub-dimension. Score Precision is computed as 100 - (MAE / 5 × 100), where MAE is measured on the 1-5 scale.
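The Score Precision formula can be sketched in a few lines of Python. The function name `score_precision` and the `scale_max` parameter are illustrative assumptions, not names from the evaluation code.

```python
def score_precision(mae: float, scale_max: float = 5.0) -> float:
    """Map an MAE on the 1-5 scale to a 0-100 score: 100 - (MAE / 5 x 100)."""
    if mae < 0:
        raise ValueError("MAE cannot be negative")
    return 100.0 - (mae / scale_max * 100.0)


# The fine-tuned Assessor's reported MAE of 2.83 maps to roughly 43.4.
print(round(score_precision(2.83), 1))
```

Because the largest possible MAE on a 1-5 scale is 4, this formula bottoms out at 20 rather than 0.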

Overall model quality — v3 (0 - 100 scale)

Inference Latency

Average wall-clock time per request for fine-tuned endpoints. The Interviewer operates in a real-time audio pipeline; its 2.9s covers the LLM call only, with speech-to-text (STT) and text-to-speech (TTS) adding further latency.

Average inference latency — v3 (seconds)

Planner Agent

Metric	Fine-Tuned	Base	Delta
JSON valid (strict)	90.0%	50.0%	+40.0pp
JSON valid (after repair)	100.0%	100.0%	0
ROUGE-L	0.332	0.0*	+0.332
Section count match	15.0%	N/A	N/A
Section types covered	12.1%	0.0%	+12.1pp
Avg latency	29.2s	4.0s**	N/A
Errors	0/20	18/20	-18

The fine-tuned Planner produces 90% strictly valid JSON (100% after lightweight auto-repair) with zero errors across all 20 test examples. The base model failed on 18 of 20 examples due to timeouts and format errors, which makes direct comparison impossible on most metrics. The ROUGE-L of 0.332 suggests the model has learned the v3 competency-and-rubric structure.

*Base ROUGE-L is 0.0 because most examples errored.
**Base latency reflects only the 2 successful examples.
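The difference between strict and after-repair validity can be illustrated with a minimal sketch. The repair rules below (inserting a comma between lines, dropping trailing commas) are assumptions chosen for illustration; the actual auto-repair step used in the evaluation is not described on this page, and the function names are hypothetical.

```python
import json
import re


def parses_strict(text: str) -> bool:
    """Return True if the text parses as JSON with no fixup applied."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def light_repair(text: str) -> str:
    """Hypothetical repair pass; not the pipeline's documented rules."""
    # Insert a comma where a value ends a line and the next line starts a key.
    repaired = re.sub(r'(["\d\]}]|true|false|null)[ \t]*\n(\s*")', r"\1,\n\2", text)
    # Drop trailing commas before a closing brace or bracket.
    repaired = re.sub(r",\s*([}\]])", r"\1", repaired)
    return repaired


def validity_rates(outputs: list[str]) -> tuple[float, float]:
    """Return (strict %, after-repair %) over a batch of raw model outputs."""
    if not outputs:
        raise ValueError("need at least one output")
    strict = sum(parses_strict(o) for o in outputs)
    repaired = sum(parses_strict(light_repair(o)) for o in outputs)
    return 100.0 * strict / len(outputs), 100.0 * repaired / len(outputs)


samples = [
    '{"sections": []}',        # valid as-is
    '{"sections": [1, 2,]}',   # trailing comma
    '{"a": 1\n"b": 2}',        # missing comma between lines
    "not json",                # unrecoverable
]
print(validity_rates(samples))
```

Regex-based repair is deliberately shallow: it can misfire on commas inside string values, so a production pipeline would likely prefer a tolerant parser.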

Assessor Agent

Metric	Fine-Tuned	Base	Delta
JSON valid	100%	100%	0
Overall score MAE	2.83	0.0*	N/A
Recommendation off-by	1.0 tiers	N/A	N/A
Score in valid range	100%	100%	0
Evidence components complete	85.0%	15.0%	+70.0pp
Has transcript_turns	40.0%	100%	-60.0pp
Question evidence_turns	25.0%	100%	-75.0pp
Avg latency	29.6s	21.4s	+8.2s

Both the fine-tuned and base models achieve 100% JSON validity when systemInstruction and response_mime_type are configured correctly. The fine-tuned model excels at evidence_components completeness (85% vs 15%), the critical v3 metric for two-layer scoring. The base model generates transcript_turns and evidence_turns more consistently but uses a different output structure than the v3 schema expects. Recommendation distribution for the fine-tuned model: Strong Hire 30%, Hire 15%, No Hire 20%, mixed labels 35%.

*Base MAE is 0.0 because the base model outputs overall_score as the placeholder value 0.0. The fine-tuned model actually computes scores, so its MAE of 2.83 reflects scoring attempts, not quality, and the two values are not directly comparable.
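A completeness check of this kind might look like the sketch below. The five key names come from the metric definition on this page; the surrounding output shape (`sub_dimensions` as a list of objects, each with an `evidence_components` mapping) is an assumption made for illustration and may not match the real v3 schema.

```python
# The five evidence component keys named in the metric definition.
REQUIRED_KEYS = {
    "completeness",
    "reasoning_clarity",
    "outcome_strength",
    "ownership",
    "evidence_confidence",
}


def is_evidence_complete(output: dict) -> bool:
    """True if every sub-dimension carries all five evidence component keys."""
    # "sub_dimensions" and "evidence_components" are assumed field names.
    sub_dims = output.get("sub_dimensions") or []
    if not sub_dims:
        return False
    return all(
        REQUIRED_KEYS <= set(sd.get("evidence_components") or {})
        for sd in sub_dims
    )


def completeness_rate(outputs: list[dict]) -> float:
    """Percentage of outputs that pass the completeness check."""
    if not outputs:
        raise ValueError("need at least one output")
    return 100.0 * sum(is_evidence_complete(o) for o in outputs) / len(outputs)


complete = {"sub_dimensions": [{"evidence_components": {k: 3 for k in REQUIRED_KEYS}}]}
partial = {"sub_dimensions": [{"evidence_components": {"ownership": 4}}]}
print(completeness_rate([complete, partial]))
```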

Interviewer Agent

Metric	Fine-Tuned	Base	Delta
ROUGE-L	0.523	0.453	+0.070
Techniques per response	0.55	1.10	-0.55
Response length ok	100.0%	95.0%	+5.0pp
No advice leak	100.0%	100.0%	0
Avg latency	2.9s	3.8s	-0.9s (22% faster)

The fine-tuned Interviewer achieves higher ROUGE-L overlap with expected responses (0.523 vs 0.453), full response length compliance (100% vs 95%), and 22% faster inference. Neither model leaked advice in any response (100% on the no-advice-leak check). The lower technique count (0.55 vs 1.10) may indicate that the fine-tuned model integrates techniques into the conversational flow rather than applying multiple techniques per turn, although the detected count alone cannot confirm this.
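The response length check (2-6 sentences per response) could be approximated as follows. The sentence splitter here is a naive assumption that breaks on terminal punctuation followed by whitespace; the evaluation's actual sentence segmentation is not specified, and the function names are hypothetical.

```python
import re

MIN_SENTENCES, MAX_SENTENCES = 2, 6  # target range from the metric definition


def count_sentences(text: str) -> int:
    """Naive sentence count: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return sum(1 for part in parts if part)


def length_ok(text: str) -> bool:
    """True if the response falls inside the 2-6 sentence target range."""
    return MIN_SENTENCES <= count_sentences(text) <= MAX_SENTENCES


def length_ok_rate(responses: list[str]) -> float:
    """Percentage of responses inside the target range."""
    if not responses:
        raise ValueError("need at least one response")
    return 100.0 * sum(length_ok(r) for r in responses) / len(responses)


responses = [
    "Thanks for sharing that. What was your specific role on the project?",
    "Why?",
]
print(length_ok_rate(responses))
```

A splitter this simple miscounts abbreviations such as "e.g." and decimal numbers, so a real check would likely use a proper sentence tokenizer.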

Version Comparison

Key improvements across fine-tuning versions:

Metric	v1 (2.0 Flash)	v2 (2.0 Flash)	v3 (2.5 Flash)
Planner JSON valid (strict / after repair)	~30%	70.6% / 94.1%	90.0% / 100.0%
Planner ROUGE-L	0.028	0.234	0.332
Assessor JSON valid	~80%	100%	100%
Assessor recommendation match	~30%	68.6%	N/A (schema change)
Assessor Evidence complete	N/A	N/A	85.0%
Interviewer ROUGE-L	N/A	0.298	0.523
Interviewer Safety	N/A	99.7%	100.0%
Base model	gemini-2.0-flash	gemini-2.0-flash	gemini-2.5-flash
LoRA rank	4	8	16

Metric Definitions

Metric	Description
JSON valid (strict) %	% of outputs that parse as valid JSON without any fixup applied
JSON valid (after repair) %	% of outputs valid after auto-correcting minor formatting issues such as missing commas — reflects production-realistic validity
ROUGE-L	F1 score of the longest common subsequence between generated and expected text. Measures structural and lexical similarity on a 0-1 scale.
Overall score MAE	Mean absolute error of the numeric overall_score field vs the ground-truth score. Lower is better; scale is 1-5.
Recommendation off-by	Average tier distance when the label does not match. Tiers ordered: NO_HIRE, MAYBE, HIRE, STRONG_HIRE. Off-by 1 = adjacent tier.
Evidence components complete %	% of outputs with all 5 evidence component keys (completeness, reasoning_clarity, outcome_strength, ownership, evidence_confidence) per sub-dimension
Has transcript_turns %	% of outputs with transcript_turns arrays (0-based turn indices) on sub-dimension scores
Question evidence_turns %	% of outputs with evidence_turns arrays on question scores
Techniques per response	Average count of professional interviewing techniques detected per response: labeling, calibrated questions, mirroring, probing, and summarization
Response length ok %	% of responses containing 2-6 sentences — the target conciseness range for live interview questions
No advice leak %	% of responses that do not contain forbidden phrases that would coach or hint to the candidate
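Two of the definitions above are compact enough to sketch directly: ROUGE-L as an F1 over the longest common subsequence, and the recommendation off-by distance. Whitespace tokenization with lowercasing is an assumption, as are the function names; the evaluation may use a different tokenizer or an existing ROUGE library.

```python
from typing import Optional

TIERS = ["NO_HIRE", "MAYBE", "HIRE", "STRONG_HIRE"]  # order from the definition


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(generated: str, expected: str) -> float:
    """ROUGE-L F1 on a 0-1 scale, using whitespace tokens (an assumption)."""
    gen, ref = generated.lower().split(), expected.lower().split()
    if not gen or not ref:
        return 0.0
    lcs = lcs_length(gen, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(gen), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def avg_off_by(pairs: list[tuple[str, str]]) -> Optional[float]:
    """Average tier distance over mismatched (predicted, expected) labels only."""
    distances = [
        abs(TIERS.index(pred) - TIERS.index(truth))
        for pred, truth in pairs
        if pred != truth
    ]
    return sum(distances) / len(distances) if distances else None


print(rouge_l_f1("a b c d", "a c e"))
print(avg_off_by([("HIRE", "HIRE"), ("NO_HIRE", "HIRE"), ("STRONG_HIRE", "HIRE")]))
```

Returning `None` when every label matches keeps "no mismatches" distinct from an off-by value of zero, which the definition does not otherwise cover.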
