#machine-learning
Every summary, chronological.
GLARE: Natural Language Interfaces for Global Model Explanations
GLARE provides a natural language interface for querying global model explanations, allowing users to interpret complex AI behavior through conversational prompts rather than static visualizations.
Configurable Clinical Information Extraction with Agentic RAG
Agentic RAG systems for clinical data require modular configuration to balance precision and recall, as monolithic pipelines often fail to handle the high variability of medical documentation.
Optimizing LLM Post-Training Through Pairwise Comparison Selection
The paper investigates how the selection of response pairs in preference-based post-training (such as DPO or PPO) affects model performance, suggesting that strategic pair selection is as critical as the training algorithm itself.
Detecting LLM Epistemic Blind Spots via Cross-Model Attribution
LLMs often hallucinate confidence in clinical settings. This paper introduces a method using Cross-Model Attribution Divergence (CMAD) to identify when models rely on unreliable features, effectively flagging epistemic uncertainty in tabular data.
Perplexity Brain: Self-Improving Memory for AI Agents
Perplexity's 'Brain' system shifts AI memory from user-centric profiles to agent-centric performance, using an overnight context graph to learn from past tasks, failures, and corrections to improve future efficiency.
RODS: Improving Multi-Turn Tool-Use Agents via Reward-Driven Synthesis
RODS (Reward-Driven Online Data Synthesis) improves multi-turn tool-use agents by generating high-quality synthetic training data through iterative reward-based filtering, addressing the scarcity of complex, multi-step interaction data.
Skill-Guided Continuation Distillation for GUI Agents
The paper introduces a method to improve GUI agent performance by distilling complex task trajectories into modular, skill-based sub-tasks, enhancing generalization and execution reliability.
SciRisk-Bench: Evaluating Safety in AI for Science
SciRisk-Bench is a new benchmark designed to evaluate the safety risks of AI models specifically applied to scientific research, focusing on multi-dimensional risk assessment.
Improving AI Scientist Reliability via Research Harnesses
The paper proposes a 'Research Harness' to externalize synthesis and validation, addressing the reliability issues inherent in autonomous AI research agents.
WorldLines: Benchmarking Long-Horizon Stateful Embodied Agents
WorldLines introduces a new benchmark and modeling framework designed to evaluate how embodied AI agents maintain state and execute complex, long-horizon tasks over extended periods.
DeFAb: A New Benchmark for Defeasible Abduction in LLMs
DeFAb is a new, verifiable benchmark designed to test how well foundation models handle defeasible abduction—the ability to form logical explanations that can be retracted or revised in light of new, contradictory information.
The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache
KV cache compression is the new frontier for scaling LLM inference, with TurboQuant, OSCAR, and EpiCache offering distinct strategies to balance memory footprint against model accuracy.
6 Habits That Elevate Data Science Projects Beyond Model Selection
Exceptional data science outcomes depend less on complex algorithms and more on disciplined fundamentals like data auditing, version control, and rigorous documentation.
DeepInsight: Evaluating the Physical AI Stack
DeepInsight proposes a unified infrastructure for evaluating AI systems across the entire physical stack, addressing the fragmentation in current performance assessment methodologies.
Foundation Model Orchestrated Workflows for Engineering Design
This research introduces a surrogate-assisted design workflow for pedestrian protection systems, using foundation models to orchestrate complex simulation and optimization tasks.
SEAGym: A Benchmark for Self-Evolving LLM Agents
SEAGym provides a standardized evaluation environment designed to measure the capabilities of self-evolving LLM agents, focusing on their ability to autonomously improve performance over time.
Analyzing AI Model Behavior via Agent Trajectories
This paper provides a comprehensive 106-page framework for evaluating LLM behavior by analyzing the sequential decision-making paths (trajectories) agents take when solving complex tasks, rather than just looking at final outputs.
MemTrace: Beyond Final Accuracy in LLM Long-Term Memory
MemTrace is a diagnostic framework designed to evaluate LLM long-term memory beyond simple accuracy metrics, focusing on the underlying mechanisms of information retention and retrieval over time.
SpeechDx: A Multi-Task Benchmark for Clinical Speech AI
SpeechDx is a new multi-task benchmark designed to evaluate AI models on clinical speech analysis, addressing the need for standardized, robust performance metrics in medical diagnostics.
MiniMax Sparse Attention: Scaling Long Context with Block-Sparsity
MiniMax Sparse Attention (MSA) reduces the quadratic cost of long-context attention by using a two-branch, block-sparse approach that selects key-value blocks via a learned indexer, maintaining performance while fixing compute costs at O(kBk).
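To make the block-selection idea concrete, here is a minimal single-query sketch in plain Python. It is not the MSA implementation: the "indexer" below is a hypothetical stand-in (query dotted with each block's mean key) rather than a learned module, the two-branch design is not modeled, and the names `select_blocks`, `sparse_attention`, `block_size`, and `top_k` are illustrative assumptions.

```python
import math

def select_blocks(query, keys, block_size, top_k):
    """Return the key indices belonging to the top_k highest-scoring blocks."""
    blocks = [range(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]

    def score(block):
        # Assumption: a learned indexer would go here. As a stand-in we
        # score a block by the dot product of the query and its mean key.
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(query))]
        return sum(q * m for q, m in zip(query, mean_key))

    ranked = sorted(range(len(blocks)), key=lambda b: score(blocks[b]), reverse=True)
    chosen = sorted(ranked[:top_k])  # restore positional order
    return [i for b in chosen for i in blocks[b]]

def sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to the selected key-value blocks."""
    idx = select_blocks(query, keys, block_size, top_k)
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, keys[i])) / scale for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
            for d in range(dim)]

# Synthetic example: 8 keys in 4 blocks of 2; keep the best 2 blocks.
query = [1.0, 0.0]
keys = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [5.0, 0.0],
        [0.0, 0.0], [0.0, 0.0], [3.0, 0.0], [3.0, 0.0]]
values = [[float(i), 1.0] for i in range(8)]
selected = select_blocks(query, keys, block_size=2, top_k=2)
output = sparse_attention(query, keys, values, block_size=2, top_k=2)
```

In this sketch each query attends to at most `top_k * block_size` keys, so the attention cost per query does not grow with sequence length; only the block scoring does.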
Optimizing Video Diffusion for Real-Time Generation
Achieve real-time video generation by stacking quantization, caching, and step distillation to reduce the standard 50-step denoising process to as few as 1-8 steps.
Mask-Proof: Automated Data Curation for Mathematical Proofs
Mask-Proof is an LLM-based pipeline designed to automate the curation of high-quality mathematical proof data, addressing the scarcity of reliable training sets for formal reasoning models.
CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG
CONCORD is a framework for device-cloud Retrieval-Augmented Generation that optimizes performance under document isolation by using asynchronous sparse aggregation to balance local privacy with cloud-scale retrieval.
Verifiable Agentic Data Science via Tool-Grounded Reasoning
To solve complex, irregular Time-Series Question Answering (TSQA), agents must move beyond pure generation toward tool-grounded reasoning that enforces verifiable, step-by-step execution.
Evaluating LLM Judge Reliability via Subset Selection
The 'Metric Match' approach improves LLM judge evaluation by using subset selection to identify high-fidelity data samples, ensuring that automated metrics better correlate with human preferences.
Measuring Trust Dynamics in Multi-Agent AI Systems
This research provides a framework for quantifying how AI agents form, break, and recover trust, offering essential insights for the governance of autonomous multi-agent systems.
Why Accuracy Metrics Hide ML Model Failures
High accuracy scores in automated systems like résumé classifiers often mask systemic biases and data quality issues that lead to unfair rejection patterns.
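A small worked example shows how this can happen. The data below is synthetic and the group names are hypothetical; the point is only that an aggregate accuracy figure can look strong while one group's qualified candidates are mostly rejected.

```python
from collections import defaultdict

def accuracy(records):
    """Fraction of records where the prediction matches the label."""
    return sum(r["label"] == r["pred"] for r in records) / len(records)

def false_rejection_rate_by_group(records):
    """Per group: share of qualified candidates (label 1) predicted as 0."""
    qualified = defaultdict(int)
    rejected = defaultdict(int)
    for r in records:
        if r["label"] == 1:
            qualified[r["group"]] += 1
            if r["pred"] == 0:
                rejected[r["group"]] += 1
    return {g: rejected[g] / qualified[g] for g in qualified}

# Synthetic data: group A dominates the sample, group B is small.
records = (
    [{"group": "A", "label": 1, "pred": 1}] * 45
    + [{"group": "A", "label": 0, "pred": 0}] * 45
    + [{"group": "B", "label": 1, "pred": 0}] * 4
    + [{"group": "B", "label": 1, "pred": 1}] * 1
    + [{"group": "B", "label": 0, "pred": 0}] * 5
)

overall = accuracy(records)                        # 96 of 100 correct
by_group = false_rejection_rate_by_group(records)  # A: 0 of 45, B: 4 of 5
```

Here overall accuracy is 0.96, yet four of the five qualified candidates in group B are rejected. Reporting error rates per group alongside the aggregate makes that pattern visible.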
Revisiting the Link Between AI Literacy and Usage
The paper challenges the assumption that lower digital literacy correlates with higher AI usage, suggesting instead that 'adoption breadth'—the variety of tools used—is a more accurate metric for understanding AI engagement.
Comparing Diff-in-Means and INLP for LLM Refusal Mechanisms
This paper evaluates two common techniques—Difference-in-Means and Iterative Null-space Projection (INLP)—for identifying and mitigating refusal behaviors in LLMs, highlighting the limitations of assuming refusal is a single, linear direction.
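For readers unfamiliar with the first technique, the sketch below shows difference-in-means in plain Python on toy two-dimensional vectors. It is an illustration under assumptions, not the paper's code: the activations are made up, and the function names are hypothetical. It also bakes in the single-direction assumption that the paper questions; INLP instead removes several directions iteratively.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def diff_in_means_direction(refused, complied):
    """Unit vector from the mean 'complied' activation to the mean 'refused' one."""
    d = [a - b for a, b in zip(mean(refused), mean(complied))]
    norm = sum(x * x for x in d) ** 0.5
    return [x / norm for x in d]

def ablate(hidden, direction):
    """Remove the component of `hidden` along the unit vector `direction`."""
    dot = sum(a * b for a, b in zip(hidden, direction))
    return [a - dot * b for a, b in zip(hidden, direction)]

# Toy activations (assumed): refusals differ from compliances along axis 0.
refused = [[2.0, 1.0], [4.0, 1.0]]
complied = [[0.0, 1.0], [0.0, 1.0]]

direction = diff_in_means_direction(refused, complied)
ablated = ablate([5.0, 2.0], direction)
```

With these toy inputs the direction is the first axis, and ablation zeroes that coordinate while leaving the other untouched. If refusal is spread over several directions, a single projection like this leaves part of the behavior in place.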
Hybrid Open-Ended Tri-Evolution for Deep Research Agents
The paper introduces a 'Hybrid Open-Ended Tri-Evolution' framework to improve the performance of deep research AI agents by optimizing their exploration and reasoning capabilities.