AI Models
Evaluation Results
Benchmark metrics for Teamcast Maya's v3 fine-tuned agents on Gemini 2.5 Flash, comparing fine-tuned endpoints against the base model on held-out test sets.
Accuracy Overview
The bar chart compares key accuracy metrics across all three agents. Fine-tuned models outperform the base model on most structured-output metrics, notably Planner JSON validity and Assessor evidence completeness. ROUGE-L values are multiplied by 100 so they share the percentage axis.
Accuracy metrics by agent — v3 on Gemini 2.5 Flash (%)
Quality Radar
Six quality dimensions mapped on a 0-100 scale. Evidence Completeness is the percentage of outputs that contain all 5 evidence component keys for every sub-dimension. Score Precision is computed as 100 - (MAE / 5 × 100), where MAE is measured on the 1-5 score scale.
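The Score Precision formula above can be sketched in a few lines of Python. The function names are illustrative and not taken from the evaluation code; the only thing assumed from this report is the formula itself and the divisor of 5.

```python
def mean_absolute_error(predicted, expected):
    """Average absolute difference between predicted and ground-truth scores."""
    if len(predicted) != len(expected) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, expected)) / len(predicted)


def score_precision(mae, divisor=5.0):
    """Map an MAE on the 1-5 scale to the 0-100 radar scale: 100 - (MAE / 5 x 100)."""
    return 100.0 - (mae / divisor * 100.0)


# Plugging in the Assessor MAE of 2.83 reported in this document:
print(round(score_precision(2.83), 1))  # 43.4
```

A perfect MAE of 0 maps to 100, and larger errors lower the score linearly.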
Overall model quality — v3 (0 - 100 scale)
Inference Latency
Average wall-clock time per request for fine-tuned endpoints. The Interviewer operates in a real-time audio pipeline; its 2.9s figure covers the LLM call only, and speech-to-text (STT) and text-to-speech (TTS) add further latency.
Average inference latency — v3 (seconds)
Planner Agent
| Metric | Fine-Tuned | Base | Delta |
|---|---|---|---|
| JSON valid (strict) | 90.0% | 50.0% | +40.0pp |
| JSON valid (after repair) | 100.0% | 100.0% | 0 |
| ROUGE-L | 0.332 | 0.0* | +0.332 |
| Section count match | 15.0% | N/A | |
| Section types covered | 12.1% | 0.0% | +12.1pp |
| Avg latency | 29.2s | 4.0s** | |
| Errors | 0/20 | 18/20 | -18 |
The fine-tuned Planner produces 90% strictly valid JSON (100% after lightweight auto-repair) with zero errors across all 20 test examples. The base model failed on 18 of 20 examples due to timeouts and format errors, so most metrics cannot be compared directly. A ROUGE-L of 0.332 suggests the model has learned the v3 competency-and-rubric structure, although section count match (15.0%) and section type coverage (12.1%) remain low.
*Base ROUGE-L is 0.0 because most examples errored.
**Base latency reflects only the 2 successful examples.
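The distinction between strict and after-repair validity can be illustrated with a minimal sketch. The repair step here is an assumption made for illustration: it trims text outside the outermost braces and removes trailing commas, which is not necessarily what the actual auto-repair does (the report mentions missing commas as one example of what it fixes).

```python
import json
import re


def is_valid_json(text):
    """Strict check: the raw output must parse with no fixup."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def light_repair(text):
    """Hypothetical repair: keep the outermost {...} span and drop trailing commas.

    Note: the regex is naive and would also touch a ',}' sequence inside a string.
    """
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start:end + 1]
    return re.sub(r",\s*([}\]])", r"\1", text)


def validity_rates(outputs):
    """Return (strict %, after-repair %) over a list of raw model outputs."""
    if not outputs:
        return 0.0, 0.0
    strict = sum(is_valid_json(o) for o in outputs)
    repaired = sum(is_valid_json(o) or is_valid_json(light_repair(o)) for o in outputs)
    return 100.0 * strict / len(outputs), 100.0 * repaired / len(outputs)


print(validity_rates(['{"sections": []}', '{"sections": [1, 2,]}']))  # (50.0, 100.0)
```

Reporting both rates separates what the model emits from what a production parser with a repair pass would accept.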
Assessor Agent
| Metric | Fine-Tuned | Base | Delta |
|---|---|---|---|
| JSON valid | 100% | 100% | 0 |
| Overall score MAE | 2.83 | 0.0* | |
| Recommendation off-by | 1.0 tiers | N/A | |
| Score in valid range | 100% | 100% | 0 |
| Evidence components complete | 85.0% | 15.0% | +70.0pp |
| Has transcript_turns | 40.0% | 100% | -60.0pp |
| Question evidence_turns | 25.0% | 100% | -75.0pp |
| Avg latency | 29.6s | 21.4s | +8.2s |
Both fine-tuned and base models achieve 100% JSON validity when systemInstruction and response_mime_type are configured correctly. The fine-tuned model excels at evidence_components completeness (85% vs 15%), the critical v3 metric for two-layer scoring. The base model generates transcript_turns and evidence_turns more consistently but uses a different output structure than the v3 schema expects. Recommendation distribution for the fine-tuned model: Strong Hire 30%, Hire 15%, No Hire 20%, mixed labels 35%.
*Base MAE is reported as 0.0 because the base model outputs overall_score as the placeholder value 0.0, so it is not comparable. The fine-tuned model actually computes scores; its MAE of 2.83 shows that it attempts scoring, not that the scores are accurate.
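A minimal sketch of the evidence-completeness check follows. The five key names come from the metric definitions in this report; the surrounding layout (a `sub_dimensions` list whose items carry an `evidence_components` object) is an assumed shape for illustration and may differ from the real v3 schema.

```python
# The five required keys, as listed in the metric definitions.
REQUIRED_KEYS = frozenset({
    "completeness",
    "reasoning_clarity",
    "outcome_strength",
    "ownership",
    "evidence_confidence",
})


def has_complete_evidence(output):
    """True if every sub-dimension carries all five evidence component keys."""
    sub_dims = output.get("sub_dimensions", [])  # assumed field name
    if not sub_dims:
        return False
    return all(
        REQUIRED_KEYS <= set(sd.get("evidence_components", {}))  # assumed field name
        for sd in sub_dims
    )


def completeness_rate(outputs):
    """Percentage of parsed outputs that pass the completeness check."""
    if not outputs:
        return 0.0
    return 100.0 * sum(has_complete_evidence(o) for o in outputs) / len(outputs)


complete = {"sub_dimensions": [{"evidence_components": {k: 3 for k in REQUIRED_KEYS}}]}
partial = {"sub_dimensions": [{"evidence_components": {"completeness": 3}}]}
print(completeness_rate([complete, partial]))  # 50.0
```

An output counts as complete only if all of its sub-dimensions pass, which matches the "per sub-dimension" wording of the metric.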
Interviewer Agent
| Metric | Fine-Tuned | Base | Delta |
|---|---|---|---|
| ROUGE-L | 0.523 | 0.453 | +0.070 |
| Techniques per response | 0.55 | 1.10 | -0.55 |
| Response length ok | 100.0% | 95.0% | +5.0pp |
| No advice leak | 100.0% | 100.0% | 0 |
| Avg latency | 2.9s | 3.8s | -0.9s (22% faster) |
The fine-tuned Interviewer achieves higher ROUGE-L overlap with expected responses (0.523 vs 0.453), perfect response length compliance (100%), and 22% faster inference. Both models maintain perfect safety (0% advice leak). The fine-tuned model applies half as many detected techniques per response (0.55 vs 1.10); this may reflect techniques being integrated more naturally into the conversational flow rather than stacked within a single turn, but the count alone does not establish that.
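For readers unfamiliar with ROUGE-L, the sketch below computes the F1 of the longest common subsequence (LCS) between a generated and an expected response. It assumes lowercased whitespace tokenization, which may differ from the tokenizer used to produce the numbers in this report, so scores are not guaranteed to match.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(generated, expected):
    """ROUGE-L F1 on a 0-1 scale, using whitespace tokens (an assumption)."""
    gen, ref = generated.lower().split(), expected.lower().split()
    if not gen or not ref:
        return 0.0
    lcs = lcs_length(gen, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(gen), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("tell me more about that project", "tell me about the project")
print(round(score, 3))  # 0.727
```

In this example the LCS is "tell me about project" (4 tokens), giving precision 4/6 and recall 4/5.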
Version Comparison
Key improvements across fine-tuning versions:
| Metric | v1 (2.0 Flash) | v2 (2.0 Flash) | v3 (2.5 Flash) |
|---|---|---|---|
| Planner JSON valid (strict / after repair) | ~30% | 70.6% / 94.1% | 90.0% / 100.0% |
| Planner ROUGE-L | 0.028 | 0.234 | 0.332 |
| Assessor JSON valid | ~80% | 100% | 100% |
| Assessor Rec match | ~30% | 68.6% | N/A (schema change) |
| Assessor Evidence complete | N/A | N/A | 85.0% |
| Interviewer ROUGE-L | N/A | 0.298 | 0.523 |
| Interviewer Safety | N/A | 99.7% | 100.0% |
| Base model | gemini-2.0-flash | gemini-2.0-flash | gemini-2.5-flash |
| LoRA rank | 4 | 8 | 16 |
Metric Definitions
| Metric | Description |
|---|---|
| JSON valid (strict) % | % of outputs that parse as valid JSON without any fixup applied |
| JSON valid (after repair) % | % of outputs valid after auto-correcting minor formatting issues such as missing commas — reflects production-realistic validity |
| ROUGE-L | F1 score of the longest common subsequence between generated and expected text. Measures structural and lexical similarity on a 0-1 scale. |
| Overall score MAE | Mean absolute error of the numeric overall_score field vs the ground-truth score. Lower is better; scale is 1-5. |
| Recommendation off-by | Average tier distance when the label does not match. Tiers ordered: NO_HIRE, MAYBE, HIRE, STRONG_HIRE. Off-by 1 = adjacent tier. |
| Evidence components complete % | % of outputs with all 5 evidence component keys (completeness, reasoning_clarity, outcome_strength, ownership, evidence_confidence) per sub-dimension |
| Has transcript_turns % | % of outputs with transcript_turns arrays (0-based turn indices) on sub-dimension scores |
| Question evidence_turns % | % of outputs with evidence_turns arrays on question scores |
| Techniques per response | Average count of professional interviewing techniques detected per response: labeling, calibrated questions, mirroring, probing, and summarization |
| Response length ok % | % of responses containing 2-6 sentences — the target conciseness range for live interview questions |
| No advice leak % | % of responses that do not contain forbidden phrases that would coach or hint to the candidate |
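The "Recommendation off-by" definition above can be sketched as follows. The tier order is taken from the table; skipping labels outside the four tiers (for example, mixed labels) is an assumption of this sketch, since the report does not say how such outputs are handled.

```python
# Tier order as given in the metric definition.
TIERS = ["NO_HIRE", "MAYBE", "HIRE", "STRONG_HIRE"]


def recommendation_off_by(predicted, expected):
    """Average tier distance over mismatched pairs; None if there are no mismatches.

    Pairs containing a label outside TIERS are skipped (an assumption).
    """
    distances = [
        abs(TIERS.index(p) - TIERS.index(e))
        for p, e in zip(predicted, expected)
        if p in TIERS and e in TIERS and p != e
    ]
    if not distances:
        return None
    return sum(distances) / len(distances)


predicted = ["HIRE", "NO_HIRE", "STRONG_HIRE"]
expected = ["HIRE", "HIRE", "HIRE"]
print(recommendation_off_by(predicted, expected))  # 1.5
```

Because only mismatches enter the average, the metric describes how far off wrong labels are, not how often labels are wrong.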