#research
Every summary, chronological.
GLARE: Natural Language Interfaces for Global Model Explanations
GLARE provides a natural language interface for querying global model explanations, allowing users to interpret complex AI behavior through conversational prompts rather than static visualizations.
Moving Beyond Static Leaderboards for LLM Agent Evaluation
Static benchmarks often fail to predict real-world performance for LLM agents; the authors propose a framework focused on predictive validity to better align evaluation with practical utility.
The Symbiotic Evolution of AI and Software Engineering
The intersection of AI and Software Engineering (AI4SE and SE4AI) has matured over the last decade, shifting from experimental research to essential production-grade methodologies for building, testing, and maintaining complex systems.
Optimizing LLM Post-Training Through Pairwise Comparison Selection
The paper investigates how the selection of response pairs in preference-based post-training (such as DPO or PPO-based RLHF) affects model performance, suggesting that strategic pair selection matters as much as the training algorithm itself.
Detecting LLM Epistemic Blind Spots via Cross-Model Attribution
LLMs often express unwarranted confidence in clinical settings. This paper introduces Cross-Model Attribution Divergence (CMAD), a method for identifying when models rely on unreliable features, which flags epistemic uncertainty on tabular data.
Deontic Policies for Runtime Governance of Agentic AI
The paper proposes using deontic logic—a system of formal rules defining obligations, permissions, and prohibitions—to govern the runtime behavior of autonomous AI agents.
SciRisk-Bench: Evaluating Safety in AI for Science
SciRisk-Bench is a new benchmark designed to evaluate the safety risks of AI models specifically applied to scientific research, focusing on multi-dimensional risk assessment.
Improving AI Scientist Reliability via Research Harnesses
The paper proposes a 'Research Harness' to externalize synthesis and validation, addressing the reliability issues inherent in autonomous AI research agents.
DeFAb: A New Benchmark for Defeasible Abduction in LLMs
DeFAb is a new, verifiable benchmark designed to test how well foundation models handle defeasible abduction—the ability to form logical explanations that can be retracted or revised in light of new, contradictory information.
CEO-Bench: Measuring Long-Term Strategic Reasoning in AI Agents
CEO-Bench is a new evaluation framework designed to test whether AI agents can maintain strategic coherence and decision-making over extended, multi-step business scenarios.
DeepInsight: Evaluating the Physical AI Stack
DeepInsight proposes a unified infrastructure for evaluating AI systems across the entire physical stack, addressing the fragmentation in current performance assessment methodologies.
Verbal Reinforcement Learning: Closing the Feedback Loop
The paper introduces a framework for 'Verbal Reinforcement Learning' (VRL), shifting from raw reward signals to structured insight governance by extracting and managing verbal feedback from world interactions.
Foundation Model Orchestrated Workflows for Engineering Design
This research introduces a surrogate-assisted design workflow for pedestrian protection systems, using foundation models to orchestrate complex simulation and optimization tasks.
Analyzing AI Model Behavior via Agent Trajectories
This 106-page paper presents a framework for evaluating LLM behavior by analyzing the sequential decision-making paths (trajectories) agents take when solving complex tasks, rather than looking only at final outputs.
Benchmarking LLM Strategic Decision-Making in Corporate Simulations
This research evaluates the efficacy of LLMs in executive leadership roles by simulating multi-role corporate environments to test their ability to perform strategic resource reallocation.
Architecting Distributed General-Purpose Agent Networks
The paper proposes a framework for distributed agent networks, shifting from monolithic AI systems to decentralized, collaborative architectures that improve scalability and task specialization.
MemTrace: Beyond Final Accuracy in LLM Long-Term Memory
MemTrace is a diagnostic framework designed to evaluate LLM long-term memory beyond simple accuracy metrics, focusing on the underlying mechanisms of information retention and retrieval over time.
SpeechDx: A Multi-Task Benchmark for Clinical Speech AI
SpeechDx is a new multi-task benchmark designed to evaluate AI models on clinical speech analysis, addressing the need for standardized, robust performance metrics in medical diagnostics.
Incumbent Advantage: Brand Bias in LLM Recommendation Systems
LLMs exhibit significant brand bias, disproportionately recommending incumbent products regardless of quality, creating a 'rich-get-richer' feedback loop that threatens market competition.
MiniMax Sparse Attention: Scaling Long Context with Block-Sparsity
MiniMax Sparse Attention (MSA) reduces the quadratic cost of long-context attention with a two-branch, block-sparse approach that selects key-value blocks via a learned indexer, maintaining performance while holding per-query attention cost at O(k·B_k).
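The block selection idea in the summary above can be illustrated with a small sketch. This is not the MSA implementation: the learned indexer is replaced by a hypothetical stand-in (scoring each block by the query's dot product with the block's mean key), the two-branch design is not modeled, and the function name, parameters, and toy data are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend from one query to only the top_k highest-scoring key/value blocks."""
    # Split the sequence into contiguous blocks of position indices.
    blocks = [list(range(s, min(s + block_size, len(keys))))
              for s in range(0, len(keys), block_size)]

    # Stand-in indexer (assumption, not the paper's learned indexer):
    # score a block by q . mean(keys in that block).
    def block_score(idx):
        mean_key = [sum(col) / len(idx) for col in zip(*(keys[i] for i in idx))]
        return dot(q, mean_key)

    chosen = sorted(blocks, key=block_score, reverse=True)[:top_k]
    selected = sorted(i for idx in chosen for i in idx)

    # Ordinary softmax attention, restricted to the selected positions.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in selected]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, selected)) / total
           for d in range(dim)]
    return out, selected

# Toy data (made up): 8 positions, 2-d keys, 1-d values, blocks of 2.
q = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [3.0, 0.0], [3.0, 0.0],
        [-1.0, 0.0], [-1.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
values = [[float(i)] for i in range(8)]

out, selected = block_sparse_attention(q, keys, values, block_size=2, top_k=2)
```

In this sketch the softmax step touches only `top_k * block_size` positions, regardless of sequence length. The stand-in scoring pass still reads every key, so the sketch shows the selection pattern rather than the full cost saving.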
CogGuard: Proactive Monitoring for Edge Intelligent Services
CogGuard is a framework designed to improve the reliability of edge-based AI services by integrating cognitive and operational profiling to predict and mitigate system failures before they occur.
Visual-Seeker: Active Visual Reasoning for Multimodal Agents
Visual-Seeker introduces a visual-native agentic search framework that moves beyond text-based retrieval by employing active visual reasoning to navigate and interpret complex multimodal environments.
Mask-Proof: Automated Data Curation for Mathematical Proofs
Mask-Proof is an LLM-based pipeline designed to automate the curation of high-quality mathematical proof data, addressing the scarcity of reliable training sets for formal reasoning models.
CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG
CONCORD is a framework for device-cloud Retrieval-Augmented Generation that optimizes performance under document isolation by using asynchronous sparse aggregation to balance local privacy with cloud-scale retrieval.
Evaluating LLM Judge Reliability via Subset Selection
The 'Metric Match' approach improves LLM judge evaluation by using subset selection to identify high-fidelity data samples, ensuring that automated metrics better correlate with human preferences.
Cognitive Debt: The Hidden Fragility of AI-Augmented Systems
The paper introduces 'Cognitive Debt' as a framework to explain how AI-driven intellectual leverage creates systemic fragility by offloading critical reasoning to models, leading to a loss of human oversight and domain expertise.
Measuring Trust Dynamics in Multi-Agent AI Systems
This research provides a framework for quantifying how AI agents form, break, and recover trust, offering essential insights for the governance of autonomous multi-agent systems.
Revisiting the Link Between AI Literacy and Usage
The paper challenges the assumption that lower digital literacy correlates with higher AI usage, suggesting instead that 'adoption breadth'—the variety of tools used—is a more accurate metric for understanding AI engagement.
Comparing Diff-in-Means and INLP for LLM Refusal Mechanisms
This paper evaluates two common techniques—Difference-in-Means and Iterative Null-space Projection (INLP)—for identifying and mitigating refusal behaviors in LLMs, highlighting the limitations of assuming refusal is a single, linear direction.
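For readers unfamiliar with Difference-in-Means, the following is a minimal sketch of the general technique, not the paper's code. It assumes activations are available as plain lists of floats; the function names and toy numbers are hypothetical. INLP differs in that it repeatedly trains a linear classifier and projects onto its null space, and it is not shown here.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_in_means_direction(refused, complied):
    """Unit vector pointing from the 'complied' mean toward the 'refused' mean."""
    mu_r = mean_vector(refused)
    mu_c = mean_vector(complied)
    diff = [a - b for a, b in zip(mu_r, mu_c)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate(vec, direction):
    """Remove the component of vec that lies along a unit direction."""
    proj = sum(v * d for v, d in zip(vec, direction))
    return [v - proj * d for v, d in zip(vec, direction)]

# Toy 3-d "activations" (made-up numbers, not taken from the paper).
refused = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.0]]
complied = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.0]]

direction = diff_in_means_direction(refused, complied)
cleaned = ablate([1.5, 0.3, -0.4], direction)
```

After ablation, `cleaned` has no component along `direction`. Removing a single vector this way only helps if the behavior really is one linear direction, which is the assumption the summary says the paper calls into question.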
Hybrid Open-Ended Tri-Evolution for Deep Research Agents
The paper introduces a 'Hybrid Open-Ended Tri-Evolution' framework to improve the performance of deep research AI agents by optimizing their exploration and reasoning capabilities.