Evaluation Results

Benchmark metrics for Teamcast Maya's v3 fine-tuned agents on Gemini 2.5 Flash, comparing fine-tuned endpoints against the base model on held-out test sets.

Each of the three agents (Planner, Assessor, Interviewer) was evaluated on its own held-out v3 test set of 20 examples. v3 uses the 1-5 evidence component scale with transcript_turns arrays.

Accuracy Overview

The bar chart compares key accuracy metrics across all three agents. Fine-tuned models consistently outperform the base model on structured output quality. ROUGE-L values are multiplied by 100 so they share the percentage axis with the other metrics.

Accuracy metrics by agent — v3 on Gemini 2.5 Flash (%)

Quality Radar

Six quality dimensions mapped on a 0-100 scale. Evidence Completeness measures % of outputs with all 5 evidence component keys per sub-dimension. Score Precision is computed as 100 - (MAE / 5 x 100) on the 1-5 scale.
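As a worked example of the Score Precision formula above, here is a minimal sketch. The function name `score_precision` and its `scale_max` parameter are illustrative, not taken from the evaluation code, and the Assessor's reported MAE of 2.83 is used only as a sample input.

```python
def score_precision(mae: float, scale_max: float = 5.0) -> float:
    """Map a mean absolute error on the 1-5 scale to a 0-100 value.

    Implements 100 - (MAE / 5 x 100) as stated in the text; the name and
    signature are illustrative assumptions.
    """
    return 100.0 - (mae / scale_max * 100.0)


# Sample input: the fine-tuned Assessor's reported MAE.
print(round(score_precision(2.83), 1))  # 43.4
```

A perfect MAE of 0 maps to 100, and an MAE of 5 maps to 0.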

Overall model quality — v3 (0 - 100 scale)

Inference Latency

Average wall-clock time per request for fine-tuned endpoints. The Interviewer operates in a real-time audio pipeline; its 2.9s average covers the LLM call only, and speech-to-text (STT) and text-to-speech (TTS) add further latency.

Average inference latency — v3 (seconds)

Planner Agent

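| Metric | Fine-Tuned | Base | Delta |
| --- | --- | --- | --- |
| JSON valid (strict) | 90.0% | 50.0% | +40.0pp |
| JSON valid (after repair) | 100.0% | 100.0% | 0 |
| ROUGE-L | 0.332 | 0.0* | +0.332 |
| Section count match | 15.0% | N/A | |
| Section types covered | 12.1% | 0.0% | +12.1pp |
| Avg latency | 29.2s | 4.0s** | |
| Errors | 0/20 | 18/20 | -18 |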

The fine-tuned Planner produces 90% strictly valid JSON (100% after lightweight auto-repair) with zero errors across all 20 test examples. The base model failed on 18 of 20 examples due to timeouts and format errors, so most metrics cannot be compared directly. A ROUGE-L of 0.332 suggests the model has learned the v3 competency-and-rubric structure. *Base ROUGE-L is 0.0 because most examples errored. **Base latency reflects only the 2 successful examples.
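The difference between strict and after-repair validity can be sketched as follows. The page does not document the actual repair step, so the `repair` function here is a hypothetical stand-in that only removes trailing commas; the real auto-repair may handle other issues, such as missing commas.

```python
import json
import re


def is_strict_json(text: str) -> bool:
    """Return True if the text parses as JSON with no fixup applied."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def repair(text: str) -> str:
    """Hypothetical lightweight repair: drop trailing commas before } or ].

    Naive by design: it would also alter such a sequence inside a string value.
    """
    return re.sub(r",\s*([}\]])", r"\1", text.strip())


def validity_rates(outputs: list[str]) -> tuple[float, float]:
    """Return (strict %, after-repair %) over a list of raw model outputs."""
    n = len(outputs)
    strict = sum(is_strict_json(o) for o in outputs)
    repaired = sum(is_strict_json(repair(o)) for o in outputs)
    return 100.0 * strict / n, 100.0 * repaired / n


# Two made-up outputs: one valid, one with a trailing comma.
samples = ['{"sections": []}', '{"sections": [1, 2,]}']
print(validity_rates(samples))  # (50.0, 100.0)
```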

Assessor Agent

| Metric | Fine-Tuned | Base | Delta |
| --- | --- | --- | --- |
| JSON valid | 100% | 100% | 0 |
| Overall score MAE | 2.83 | 0.0* | |
| Recommendation off-by | 1.0 tiers | N/A | |
| Score in valid range | 100% | 100% | 0 |
| Evidence components complete | 85.0% | 15.0% | +70.0pp |
| Has transcript_turns | 40.0% | 100% | -60.0pp |
| Question evidence_turns | 25.0% | 100% | -75.0pp |
| Avg latency | 29.6s | 21.4s | +8.2s |

Both fine-tuned and base models achieve 100% JSON validity with proper systemInstruction and response_mime_type configuration. The fine-tuned model excels at evidence_components completeness (85% vs 15%), the critical v3 metric for two-layer scoring. *Base MAE is 0.0 because the base model outputs overall_score as the placeholder value 0.0; the fine-tuned model actually computes scores, so its MAE of 2.83 reflects scoring attempts, not quality. The base model generates transcript_turns and evidence_turns more consistently, but uses a different output structure than the v3 schema expects. Recommendation distribution for fine-tuned: Strong Hire 30%, Hire 15%, No Hire 20%, mixed labels 35%.
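The evidence-completeness check could look roughly like the sketch below. The five key names come from the metric definition on this page. The surrounding structure, a `sub_dimensions` list whose items carry an `evidence_components` mapping, is an assumption made for illustration and may not match the real v3 schema.

```python
REQUIRED_KEYS = {
    "completeness",
    "reasoning_clarity",
    "outcome_strength",
    "ownership",
    "evidence_confidence",
}


def is_evidence_complete(output: dict) -> bool:
    """True if every sub-dimension carries all five evidence component keys.

    Assumes a hypothetical layout: output["sub_dimensions"] is a list of
    dicts, each with an "evidence_components" dict.
    """
    subs = output.get("sub_dimensions", [])
    if not subs:
        return False
    return all(
        REQUIRED_KEYS <= set(sub.get("evidence_components", {})) for sub in subs
    )


def completeness_rate(outputs: list[dict]) -> float:
    """Percentage of outputs that pass the completeness check."""
    return 100.0 * sum(is_evidence_complete(o) for o in outputs) / len(outputs)


# Made-up outputs: the first is complete, the second has only one key.
full = {key: 3 for key in REQUIRED_KEYS}
outputs = [
    {"sub_dimensions": [{"evidence_components": full}]},
    {"sub_dimensions": [{"evidence_components": {"ownership": 4}}]},
]
print(completeness_rate(outputs))  # 50.0
```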

Interviewer Agent

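| Metric | Fine-Tuned | Base | Delta |
| --- | --- | --- | --- |
| ROUGE-L | 0.523 | 0.453 | +0.070 |
| Techniques per response | 0.55 | 1.10 | -0.55 |
| Response length ok | 100.0% | 95.0% | +5.0pp |
| No advice leak | 100.0% | 100.0% | 0 |
| Avg latency | 2.9s | 3.8s | -0.9s (22% faster) |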

The fine-tuned Interviewer achieves higher ROUGE-L overlap with expected responses (0.523 vs 0.453), perfect response length compliance (100%), and 22% faster inference. Both models maintain perfect safety (0% advice leak). The lower technique count (0.55 vs 1.10) may indicate that the fine-tuned model integrates techniques into the conversational flow instead of applying multiple techniques per turn, although the count alone does not establish this.
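For readers unfamiliar with ROUGE-L, the sketch below computes the F1 form from the longest common subsequence (LCS) of two token sequences. Lowercasing and whitespace tokenization are assumptions made here; the evaluation may use a different tokenizer or a library implementation, so absolute values need not match the table.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on a 0-1 scale, using lowercase whitespace tokens (an assumption)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up pair: the LCS is "tell me about project" (4 tokens of 5 and 6).
score = rouge_l_f1("Tell me about that project", "Tell me more about the project")
print(round(score, 3))  # 0.727
```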

Version Comparison

Key improvements across fine-tuning versions:

| Metric | v1 (2.0 Flash) | v2 (2.0 Flash) | v3 (2.5 Flash) |
| --- | --- | --- | --- |
| Planner JSON valid (strict / after repair) | ~30% | 70.6% / 94.1% | 90.0% / 100.0% |
| Planner ROUGE-L | 0.028 | 0.234 | 0.332 |
| Assessor JSON valid | ~80% | 100% | 100% |
| Assessor Rec match | ~30% | 68.6% | N/A (schema change) |
| Assessor Evidence complete | N/A | N/A | 85.0% |
| Interviewer ROUGE-L | N/A | 0.298 | 0.523 |
| Interviewer Safety | N/A | 99.7% | 100.0% |
| Base model | gemini-2.0-flash | gemini-2.0-flash | gemini-2.5-flash |
| LoRA rank | 4 | 8 | 16 |

Metric Definitions

| Metric | Description |
| --- | --- |
| JSON valid (strict) % | % of outputs that parse as valid JSON without any fixup applied |
| JSON valid (after repair) % | % of outputs valid after auto-correcting minor formatting issues such as missing commas; reflects production-realistic validity |
| ROUGE-L | F1 score of the longest common subsequence between generated and expected text. Measures structural and lexical similarity on a 0-1 scale. |
| Overall score MAE | Mean absolute error of the numeric overall_score field vs the ground-truth score. Lower is better; scale is 1-5. |
| Recommendation off-by | Average tier distance when the label does not match. Tiers ordered: NO_HIRE, MAYBE, HIRE, STRONG_HIRE. Off-by 1 = adjacent tier. |
| Evidence components complete % | % of outputs with all 5 evidence component keys (completeness, reasoning_clarity, outcome_strength, ownership, evidence_confidence) per sub-dimension |
| Has transcript_turns % | % of outputs with transcript_turns arrays (0-based turn indices) on sub-dimension scores |
| Question evidence_turns % | % of outputs with evidence_turns arrays on question scores |
| Techniques per response | Average count of professional interviewing techniques detected per response: labeling, calibrated questions, mirroring, probing, and summarization |
| Response length ok % | % of responses containing 2-6 sentences, the target conciseness range for live interview questions |
| No advice leak % | % of responses that do not contain forbidden phrases that would coach or hint to the candidate |