Vantage: GenAI Matches Human Experts in Skills Assessment

Vantage uses an Executive LLM to steer AI-avatar conversations that elicit evidence of future-ready skills such as collaboration. In an NYU study with 188 testers, the AI Evaluator's scores matched those of human experts: AI-human agreement (Cohen's Kappa) equaled human-human agreement.

Steering Conversations to Elicit Skill Evidence

Vantage simulates real-world team interactions by placing learners in open-ended tasks, such as debate prep or creative pitches, with AI avatars. An Executive LLM analyzes the conversation in real time against a rubric and dynamically introduces challenges (for example, conflicts or pushback) to ensure high-density evidence for specific skills like conflict resolution or project management. This adaptive steering produces more skill-relevant information than independent avatars: in tests, steered conversations yielded sufficient data for scoring in a statistically significantly higher fraction of cases. Upon task completion, an AI Evaluator scores the transcript against the same rubric and delivers a visual skill map with qualitative feedback on sub-skills, making progress visible and actionable.
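The steering loop described above can be sketched roughly as follows. This is an illustrative assumption, not Vantage's actual implementation: the rubric contents, the prompt format, and the `call_llm` stub are all hypothetical stand-ins.

```python
import json

# Hypothetical rubric; Vantage's real rubrics are not published here.
RUBRIC = {
    "conflict_resolution": "Acknowledges disagreement and proposes a path forward",
    "project_management": "Breaks work into tasks and assigns owners",
}

def call_llm(prompt: str) -> str:
    """Stand-in for a real Executive-LLM call; returns a canned decision."""
    return json.dumps({"inject_challenge": True,
                       "challenge": "Avatar B pushes back on the timeline"})

def steer_turn(transcript: list[str]) -> list[str]:
    """One steering step: check the transcript against the rubric and
    optionally inject a challenge to elicit more skill evidence."""
    prompt = (f"Rubric: {json.dumps(RUBRIC)}\n"
              f"Transcript so far: {transcript}\n"
              "Should a challenge be injected? Reply as JSON.")
    decision = json.loads(call_llm(prompt))
    if decision["inject_challenge"]:
        transcript.append(f"[steered] {decision['challenge']}")
    return transcript

transcript = steer_turn(["Learner: Let's split the debate prep evenly."])
```

In a real system the loop would run once per conversational turn, with the Evaluator scoring the accumulated transcript against the same rubric at the end.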

This approach overcomes the limitations of rigid tests by creating standardized yet authentic scenarios that scale to high school and college students without requiring real human groups, which are resource-intensive and inconsistent.

Validation: AI Reliability Matches Human Experts

In a joint NYU study with 188 US testers aged 18-25, Vantage assessed collaboration skills. Human raters at NYU used identical rubrics; AI-human agreement (quadratic-weighted Cohen's Kappa) matched human-human agreement for both conflict resolution and project management, confirming the AI Evaluator's accuracy. Separately, in a partnership with OpenMic covering 180 students' creative tasks (e.g., character interviews), AI scores showed a Pearson correlation of 0.88 with human experts, validating the approach even on complex creative work.
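The two agreement statistics named above can be computed as follows. This is a minimal pure-Python sketch; the rating data below are illustrative, not the study's actual scores.

```python
def quadratic_weighted_kappa(a, b, k):
    """Cohen's kappa with quadratic weights for ratings on a 0..k-1 scale."""
    n = len(a)
    O = [[0.0] * k for _ in range(k)]      # observed rating-pair counts
    for x, y in zip(a, b):
        O[x][y] += 1
    row = [sum(O[i]) for i in range(k)]    # rater A marginals
    col = [sum(O[i][j] for i in range(k)) for j in range(k)]  # rater B marginals
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2   # quadratic disagreement weight
            num += w * O[i][j]                 # observed weighted disagreement
            den += w * row[i] * col[j] / n     # expected under independence
    return 1 - num / den

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

ai    = [0, 1, 2, 2, 3, 1, 0, 3]   # AI Evaluator scores (illustrative)
human = [0, 1, 2, 3, 3, 1, 1, 3]   # human expert scores (illustrative)
kappa = quadratic_weighted_kappa(ai, human, k=4)   # 0.9 on this toy data
r = pearson(ai, human)
```

Quadratic weighting penalizes large rating disagreements more than near-misses, which is why it is the standard choice for ordinal rubric scores.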

These results establish Vantage as a reliable automated assessor, aligned with OECD and WEF frameworks prioritizing critical thinking, collaboration, and creativity—skills automation can't replace.

Scaling Skills Assessment for Education

Vantage, now available on Google Labs, enables a 'skills layer' atop existing curricula: students debate social topics or lead lab planning with avatars, receiving dual feedback on both knowledge and skills. This scalably supplements traditional group projects. Future work will test skill transfer to real-world interactions, cultural inclusivity, and growth through repeated practice, supporting research on pedagogical impact.


© 2026 Edge