Vantage: Executive LLM Scores Durable Skills Like Humans

Google's Vantage uses a single Executive LLM to coordinate AI teammates, eliciting collaboration evidence in 92.4% (Project Management) and 85% (Conflict Resolution) of conversations while matching human raters' inter-rater agreement (Cohen's Kappa 0.45–0.64).

Executive LLM Coordinates Realistic Interactions for Skill Elicitation

Vantage resolves the tradeoff between ecological validity and psychometric rigor by using a single Executive LLM (Gemini 2.5 Pro for collaboration) to generate all AI teammate responses, guided by pedagogical rubrics. Like a computerized adaptive test, it steers conversations dynamically to provoke specific sub-skills: for Conflict Resolution (CR), it sustains disagreements until the human demonstrates resolution strategies; for Project Management (PM), it prompts planning behaviors. Independent Agents (separate, uncoordinated LLMs) fail here, as their conversations often lack skill-relevant evidence (e.g., no conflict arises if the agents simply agree). Across 373 transcripts from 188 participants on 30-minute tasks (science design or debate), the skill-matched Executive LLM achieved conversation-level evidence rates of 92.4% for PM and 85% for CR, far above the baselines. Simply instructing humans to focus on the skills yielded no boost (p > 0.6), indicating that AI-side steering is what matters. This setup simulates authentic group dynamics at scale, unlike PISA 2015's scripted multiple-choice format or costly human-human assessments.
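
To make the coordination concrete, here is a minimal sketch of one Executive-LLM turn, assuming a generic `call_llm` chat-completion function; the rubric text, teammate names, and prompt format are illustrative assumptions, not Vantage's actual prompts.

```python
# Minimal sketch of Executive-LLM turn coordination (illustrative, not Vantage's code).
# A single model sees the full conversation plus the skill rubric and writes every
# teammate's next line, so it can deliberately sustain disagreement (CR) or demand
# planning (PM). `call_llm` is a stand-in for any chat-completion API.

from typing import Callable, List, Dict

RUBRIC_CR = (
    "Goal: elicit conflict-resolution evidence. Keep at least one teammate "
    "disagreeing with the human until they propose a concrete resolution strategy."
)
TEAMMATES = ["Ana", "Ben", "Chris"]  # AI teammate personas (hypothetical names)

def executive_turn(call_llm: Callable[[str], str],
                   history: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """One Executive step: generate all teammate replies from a single shared prompt."""
    transcript = "\n".join(f'{t["speaker"]}: {t["text"]}' for t in history)
    prompt = (
        f"{RUBRIC_CR}\n\nConversation so far:\n{transcript}\n\n"
        f"Write the next reply for each teammate ({', '.join(TEAMMATES)}), "
        "one per line, formatted as 'Name: text'. Steer toward the rubric goal."
    )
    replies = []
    for line in call_llm(prompt).splitlines():
        if ":" in line:
            name, text = line.split(":", 1)
            if name.strip() in TEAMMATES:
                replies.append({"speaker": name.strip(), "text": text.strip()})
    return replies

# By contrast, an Independent-Agents baseline would call the LLM once per teammate
# with only that persona's view and no shared steering objective, so conversations
# can converge amicably and never surface conflict-resolution evidence.
```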

Matches Human Raters with Transparent, Regression-Based Scoring

Scoring uses a separate AI Evaluator (Gemini 3.0) that rates each human turn 20 times and takes the mode of those ratings (NA if any rating is NA); linear/logistic regression models trained with leave-one-out cross-validation then produce conversation-level scores. Inter-rater agreement reaches human-human levels: Cohen's Kappa of 0.45–0.64 across CR/PM and both turn- and conversation-level tasks. For creativity, a Gemini autorater on 280 high schoolers' multimedia tasks (e.g., designing a news segment) achieved an item-level Kappa of 0.66 and an overall Pearson correlation of 0.88 with expert raters, rare for such subjective tasks, after prompt and rubric refinement on 100 samples and evaluation on 180 held-out samples. Results surface via skills maps with excerpted evidence, enabling actionable feedback.
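
A minimal sketch of that aggregation-and-regression pipeline, assuming scikit-learn and a 1–4 score scale; the feature construction and variable names are illustrative, not taken from the paper.

```python
# Illustrative sketch: each human turn gets 20 Evaluator ratings; the per-turn score
# is their mode, or NA if any single rating is NA. Conversation-level scores then come
# from a regression fit and evaluated with leave-one-out cross-validation.

from collections import Counter
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import cohen_kappa_score

NA = None

def aggregate_turn(ratings: list):
    """Mode of the 20 ratings for one turn; NA if any rating is NA."""
    if any(r is NA for r in ratings):
        return NA
    return Counter(ratings).most_common(1)[0][0]

def loocv_conversation_scores(features: np.ndarray, human_scores: np.ndarray) -> np.ndarray:
    """Predict each conversation's score from its turn-level features,
    training on all other conversations (leave-one-out)."""
    preds = np.empty_like(human_scores, dtype=float)
    for train_idx, test_idx in LeaveOneOut().split(features):
        model = LinearRegression().fit(features[train_idx], human_scores[train_idx])
        preds[test_idx] = model.predict(features[test_idx])
    return preds

# Agreement with human scores can then be measured, e.g.:
# preds_rounded = np.clip(np.round(preds), 1, 4).astype(int)
# kappa = cohen_kappa_score(human_scores, preds_rounded)
# pearson = np.corrcoef(human_scores, preds)[0, 1]
```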

LLM Simulation De-Risks Development Across 8 Dimensions

Vantage simulates humans at known rubric levels (1–4) with Gemini and measures recovery error, the mean absolute difference between true and inferred levels: the Executive LLM cuts this error versus Independent Agents for CR/PM, with patterns mirroring the real data, which validates cheap iteration before human studies. The approach extends to all six creativity dimensions (Fluidity, Originality, Quality, Building on Ideas, Elaborating, Selecting) and two critical-thinking dimensions (Interpret/Analyze; Evaluate/Judge), all showing statistically significantly higher evidence rates. Human ratings for creativity and critical thinking are still ongoing, but the simulations confirm that the approach generalizes beyond collaboration.
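
A minimal sketch of the recovery-error check under those assumptions; the level values shown are illustrative, not reported results.

```python
# Simulated participants are assigned true rubric levels 1-4, the pipeline infers a
# level for each, and recovery error is the mean absolute difference between the two.

import numpy as np

def recovery_error(true_levels: np.ndarray, inferred_levels: np.ndarray) -> float:
    """Mean absolute difference between true and inferred rubric levels."""
    return float(np.mean(np.abs(true_levels - inferred_levels)))

# Example: compare Executive vs. Independent-Agents configurations on the same
# set of simulated participants (all values hypothetical).
true_levels = np.array([1, 2, 3, 4, 2, 3])
executive_inferred = np.array([1, 2, 3, 3, 2, 3])
independent_inferred = np.array([2, 3, 2, 3, 1, 4])

print(recovery_error(true_levels, executive_inferred))    # 0.1666...
print(recovery_error(true_levels, independent_inferred))  # 1.0
```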
