Executive LLMs Unlock Scalable Durable Skills Assessment
Google's Vantage uses a single Executive LLM to control multiple AI teammates, steering natural human-AI chats toward evidence of collaboration, creativity, and critical thinking. AI evaluators match human raters (Cohen's Kappa 0.45-0.64), enabling psychometric rigor at scale.
Executive LLM Bridges Natural Interaction and Controlled Assessment
Durable skills like collaboration, creativity, and critical thinking drive workplace success but evade measurement because of conflicting needs: ecological validity (real-world-like human interaction) versus psychometric rigor (scalable, reproducible evidence). Traditional approaches fall short: PISA 2015 used scripted AI agents and multiple-choice responses, limiting authenticity, while ATC21S relied on human-human dyads in digital environments, introducing uncontrollable variance. LLMs resolve this tension by simulating open-ended group work in Vantage, a chat-based platform where humans (ages 18-25; 188 Prolific-recruited participants generating 373 conversations) tackle classroom-like tasks with 3 AI teammates over 30 minutes via text or voice.
A single Executive LLM (Gemini 2.5 Pro) generates all AI teammate responses and is prompted with skill rubrics to maximize evidence density. Unlike 'Independent Agents' (a separate LLM per teammate, yielding unfocused chats), the Executive actively steers: for Conflict Resolution, it provokes disputes via one teammate until resolution behaviors emerge; for Project Management, it introduces delays or scope issues. This orchestration elicits roughly twice as many skill-related turns: about 0.4-0.6 of turns show evidence versus about 0.2 for Independent Agents (p ≤ 0.05, Fisher exact test, Figure 6). Focus instructions to humans (e.g., 'pay attention to Conflict Resolution') further boost evidence without making the interaction feel artificial.
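To make the orchestration concrete, here is a minimal sketch of how a single Executive LLM could be prompted to play every teammate while steering toward one target skill. The prompt wording, persona names, and the `call_llm` helper are illustrative assumptions, not the paper's actual prompts or API.

```python
# Sketch of single-Executive orchestration (assumed prompt structure, not the paper's).
# `call_llm` stands in for any chat-completion API that takes a prompt and returns text.

PERSONAS = ["Ava", "Ben", "Chris"]  # three AI teammates; names are hypothetical

EXECUTIVE_PROMPT = """You control ALL AI teammates ({personas}) in a group chat.
Target skill: {skill}. Rubric: {rubric}.
Steer the conversation so the human must exhibit rubric-relevant behavior
(e.g., have one teammate introduce a disagreement that needs resolving),
while keeping the dialogue natural and on-task.
Given the transcript so far, reply as the next teammate in the form
"<name>: <message>"."""

def next_teammate_turn(call_llm, transcript: str, skill: str, rubric: str) -> str:
    """Ask the single Executive LLM for the next AI teammate message."""
    prompt = EXECUTIVE_PROMPT.format(
        personas=", ".join(PERSONAS), skill=skill, rubric=rubric
    )
    return call_llm(prompt + "\n\nTranscript:\n" + transcript)
```

The key design point is that one model sees the whole conversation and the rubric, so each teammate turn can be chosen for evidence value rather than generated independently.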
"Measurement is a compromise in the name of efficiency since the 'long lasting observation of a person in real life until (s)he spontaneously exhibits the behavior of interest... would take too much time before enough evidence was collected." (Sijtsma 23, cited to justify steering for efficiency over passive observation). This quote underscores why unstructured chats fail—Executive LLM acts as an adaptive test, preserving natural flow while guaranteeing observability.
Rubrics, derived from the literature and refined via expert ratings on sample transcripts, score each dimension on a 1-4 scale (NA when evidence is insufficient). Tasks mirror classroom activities: collaboration (Debate, Planning Event); creativity (Invent gadget, Design poster); critical thinking (Analyze evidence). The appendix details the full rubrics, e.g., Conflict Resolution axes such as 'Identifies underlying issues' (levels ranging from ignoring the issue to deeply analyzing it).
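One way such a rubric dimension might be represented in code is shown below; only the level-1 and level-4 anchors echo the quoted example, and the middle-level wording is an assumption.

```python
from dataclasses import dataclass

@dataclass
class RubricDimension:
    name: str
    levels: dict[int, str]  # 1 (lowest) .. 4 (highest); NA is handled outside the scale

# Example axis from the Conflict Resolution rubric; level 2 and 3 wording is assumed.
IDENTIFIES_UNDERLYING_ISSUES = RubricDimension(
    name="Identifies underlying issues",
    levels={
        1: "Ignores the underlying issue behind the disagreement",
        2: "Acknowledges the issue but does not explore it",      # assumed
        3: "Explores the issue and names contributing factors",   # assumed
        4: "Deeply analyzes the underlying issue and its causes",
    },
)

NA = None  # a turn with insufficient evidence receives no level at all
```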
AI Evaluator Delivers Human-Level Scoring at Scale
Post-conversation, a Gemini 3.0 AI Evaluator scores each human turn in the transcript: 20 repeated ratings per turn, with the turn marked NA if any rating is NA and otherwise assigned the mode. Conversation-level scores come from linear/logistic regression fit to human-rated data (leave-one-out cross-validation). Inter-rater agreement between 2 NYU pedagogical experts is moderate (Cohen's Kappa 0.45-0.64 for binary NA/not-NA and quadratic-weighted scores, Figure 5), which shows the task is challenging even for calibrated humans. LLM-human agreement matches this range, supporting scalability: one LLM can stand in for costly expert raters.
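A sketch of the described aggregation rule (20 ratings per turn, NA veto, mode vote) and of the quadratic-weighted agreement check follows; `rate_turn_once` is a placeholder for a single evaluator call, and the sklearn usage is the standard library API, not necessarily the authors' tooling.

```python
from statistics import mode
from sklearn.metrics import cohen_kappa_score

NA = None

def score_turn(rate_turn_once, turn_text: str, n: int = 20):
    """Repeat the evaluator n times; veto with NA if any rating is NA, else take the mode."""
    ratings = [rate_turn_once(turn_text) for _ in range(n)]  # each in {1, 2, 3, 4} or NA
    if any(r is NA for r in ratings):
        return NA
    return mode(ratings)

def agreement(scores_a, scores_b):
    """Quadratic-weighted Kappa between two raters, computed on non-NA turns only."""
    pairs = [(a, b) for a, b in zip(scores_a, scores_b) if a is not NA and b is not NA]
    ya, yb = zip(*pairs)
    return cohen_kappa_score(ya, yb, weights="quadratic")
```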
Feedback in Vantage is actionable: a skills map quantifies competencies (overall plus sub-dimensions) and expands into transcript excerpts such as 'You excelled in prioritizing tasks here: "Let's tackle the budget first."' (Figure 3). Holistic scores aggregate turn-level evidence while handling NAs robustly.
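The skills map could plausibly be backed by a simple nested payload like the one below; field names and scores are illustrative, with only the excerpt echoing the Figure 3 example.

```python
# Hypothetical shape of the feedback behind the skills map (Figure 3); values are made up.
skills_map = {
    "skill": "Collaboration",
    "overall": 3,  # holistic 1-4 score
    "dimensions": {
        "Prioritizing tasks": {
            "score": 4,
            "excerpt": 'You excelled in prioritizing tasks here: "Let\'s tackle the budget first."',
        },
        "Conflict Resolution": {"score": 2, "excerpt": None},  # assumed
    },
}
```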
"LLMs can bridge the gap between unstructured student collaboration, which more closely emulates classroom practice, and standardized assessment, which, while artificial, attempts to isolate the behaviors needed for valid inference." (Authors, core thesis on LLM's dual role in authenticity and isolation).
Simulations validate the pipeline further: Gemini is prompted to play a human at a fixed rubric level (e.g., level-3 Conflict Resolution; 50 turns x 100 repetitions), and the evaluator recovers the true level accurately. Simulations of unskilled participants yield little evidence, confirming sensitivity.
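A rough sketch of that validation loop: prompt an LLM to stand in for a participant at a fixed rubric level, run the conversation, score it with the same pipeline, and check how often the true level is recovered. The prompt text and the `run_conversation` / `evaluate_conversation` helpers are assumptions.

```python
def simulate_participant_turn(call_llm, transcript: str, skill: str, level: int) -> str:
    """Have an LLM play the human participant at a fixed target rubric level."""
    prompt = (
        f"You are a study participant in a group chat. For the skill '{skill}', "
        f"behave consistently at rubric level {level} (1 = lowest, 4 = highest). "
        "Reply with your next chat message only.\n\nTranscript:\n" + transcript
    )
    return call_llm(prompt)

def validate_recovery(run_conversation, evaluate_conversation,
                      true_level: int = 3, n_turns: int = 50, n_reps: int = 100):
    """E.g., level-3 Conflict Resolution, 50 turns x 100 repetitions."""
    recovered = []
    for _ in range(n_reps):
        transcript = run_conversation(n_turns, true_level)   # simulated human + AI teammates
        recovered.append(evaluate_conversation(transcript))  # conversation-level score
    return sum(r == true_level for r in recovered) / n_reps  # recovery rate
```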
Proven Efficacy Across Skills, Including Real Students
In collaboration tasks (4-member groups), the Executive LLM doubled elicited evidence versus baselines. Creativity and critical-thinking tasks used Gemini 3; tasks like 'Invent a gadget for remote learning' (Figure 4) elicited ideation fluency and originality. On high-school creativity submissions (complex, open-ended tasks), the Gemini autorater scored on par with experts, showing reliability on unstructured outputs.
Vantage also evolves protocols cheaply via simulation before human trials, e.g., testing evidence density (Figures 9-10). Tradeoffs: LLMs risk hallucination (mitigated by rubric grounding and repeated ratings), and steering might feel contrived if overdone (participants were unaware of it). Still, the approach outperforms prior designs, eliciting more evidence than PISA or ATC21S without their rigidity or variance.
"The Executive LLM generates the responses for all of the AI teammates in the conversation and is designed to steer the conversation toward maximal information and assessment accuracy." (Authors, on single-LLM control versus multi-agent chaos).
This isn't hype: the metrics show that orchestrated LLMs can quantify supposedly 'unmeasurable' skills and make them teachable via feedback loops. What fails: passive, independent agents (low evidence). What works: rubric-driven steering plus repeated LLM voting.
Key Takeaways
- Prompt a single Executive LLM with rubrics to control multiple AI personas, steering chats to elicit specific skill evidence (e.g., provoke conflicts for resolution testing).
- For scoring, run 20 LLM ratings per turn (Gemini 3.0) and take the mode after an NA veto; this matches human Kappa (0.45-0.64) and scales cheaply.
- Design classroom-mirroring tasks (e.g., Debate for collaboration) with 1-4 rubrics refined by expert pilots.
- Simulate humans (prompt Gemini at fixed skill levels) to iterate protocols pre-deployment, recovering true levels accurately.
- Add user focus instructions ('attend to Project Management') and a voice/text UI for 30-min sessions; this boosts evidence by 20-40%.
- Tradeoff: Executive steering roughly doubles evidence versus independent agents (see the sketch after this list) but requires careful prompting to keep the interaction natural.
- For creativity, autoraters handle open student outputs reliably—deploy for high-school grading.
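To make the "doubles evidence" comparison concrete, one could test the difference in evidence-bearing turns between conditions with Fisher's exact test, as cited for Figure 6. The counts below are made-up illustrations consistent with the reported fractions, not the paper's data.

```python
from scipy.stats import fisher_exact

# Illustrative counts only: turns with vs. without skill evidence per condition.
#                      evidence  no evidence
executive_counts   = [       50,          50]  # ~0.5 of turns show evidence
independent_counts = [       20,          80]  # ~0.2 of turns show evidence

odds_ratio, p_value = fisher_exact([executive_counts, independent_counts])
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3g}")  # well below 0.05 for these counts
```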
"Our analysis shows that the use of the Executive LLM significantly increases elicited evidence, compared to non-steered interactions." (Authors, empirical win on core hypothesis).
"In addition, we show that LLM-automated scoring of conversations largely agrees with that of expert annotators." (Authors, on interrater parity).