№ 02 / SUMMARIES

#research

Every summary, in reverse chronological order. Filter by category, tag, or source from the rail.

Tag · #research
DAY 01 · Friday, JUN 19, 2026 · 6 SUMMARIES
arXiv cs.AI · AI & LLMs

GLARE: Natural Language Interfaces for Global Model Explanations

GLARE provides a natural language interface for querying global model explanations, allowing users to interpret complex AI behavior through conversational prompts rather than static visualizations.

arXiv cs.AI · AI & LLMs

Moving Beyond Static Leaderboards for LLM Agent Evaluation

Static benchmarks often fail to predict real-world performance for LLM agents; the authors propose a framework focused on predictive validity to better align evaluation with practical utility.

arXiv cs.AI · AI & LLMs

The Symbiotic Evolution of AI and Software Engineering

The intersection of AI and Software Engineering (AI4SE and SE4AI) has matured over the last decade, shifting from experimental research to essential production-grade methodologies for building, testing, and maintaining complex systems.

arXiv cs.AI · AI & LLMs

Optimizing LLM Post-Training Through Pairwise Comparison Selection

The paper investigates how the selection of response pairs in preference-based post-training (like DPO or PPO) impacts model performance, suggesting that strategic pair selection is as critical as the training algorithm itself.

arXiv cs.AI · AI & LLMs

Detecting LLM Epistemic Blind Spots via Cross-Model Attribution

LLMs are often overconfident in clinical settings. This paper introduces a method using Cross-Model Attribution Divergence (CMAD) to identify when models rely on unreliable features, effectively flagging epistemic uncertainty in tabular data.

arXiv cs.AI · AI & LLMs

Deontic Policies for Runtime Governance of Agentic AI

The paper proposes using deontic logic—a system of formal rules defining obligations, permissions, and prohibitions—to govern the runtime behavior of autonomous AI agents.
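The three deontic modalities named in this summary can be illustrated with a small runtime check. This is only a sketch under stated assumptions, not the paper's formalism: the `DeonticPolicy` class, the action names, and the default-deny rule are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DeonticPolicy:
    """Hypothetical policy: one set of action names per deontic modality."""
    permitted: set = field(default_factory=set)
    prohibited: set = field(default_factory=set)
    obligated: set = field(default_factory=set)

    def allows(self, action: str) -> bool:
        # A prohibition overrides any permission. Actions that are neither
        # permitted nor obligated are denied (default-deny is an assumption
        # of this sketch, not something the summary states).
        if action in self.prohibited:
            return False
        return action in self.permitted or action in self.obligated

    def unmet_obligations(self, performed: list) -> set:
        # Obligations the agent has not yet discharged in this run.
        return self.obligated - set(performed)


policy = DeonticPolicy(
    permitted={"read_file", "send_email"},
    prohibited={"delete_database"},
    obligated={"log_action"},
)
```

A runtime governor would call `allows` before each agent action and `unmet_obligations` before the run is allowed to finish.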

DAY 02 · Thursday, JUN 18, 2026 · 4 SUMMARIES
arXiv cs.AI · AI & LLMs

SciRisk-Bench: Evaluating Safety in AI for Science

SciRisk-Bench is a new benchmark designed to evaluate the safety risks of AI models specifically applied to scientific research, focusing on multi-dimensional risk assessment.

arXiv cs.AI · AI & LLMs

Improving AI Scientist Reliability via Research Harnesses

The paper proposes a 'Research Harness' to externalize synthesis and validation, addressing the reliability issues inherent in autonomous AI research agents.

arXiv cs.AI · AI & LLMs

DeFAb: A New Benchmark for Defeasible Abduction in LLMs

DeFAb is a new, verifiable benchmark designed to test how well foundation models handle defeasible abduction—the ability to form logical explanations that can be retracted or revised in light of new, contradictory information.

arXiv cs.AI · AI & LLMs

CEO-Bench: Measuring Long-Term Strategic Reasoning in AI Agents

CEO-Bench is a new evaluation framework designed to test whether AI agents can maintain strategic coherence and decision-making over extended, multi-step business scenarios.

DAY 03 · Wednesday, JUN 17, 2026 · 10 SUMMARIES
arXiv cs.AI · AI & LLMs

DeepInsight: Evaluating the Physical AI Stack

DeepInsight proposes a unified infrastructure for evaluating AI systems across the entire physical stack, addressing the fragmentation in current performance assessment methodologies.

arXiv cs.AI · AI & LLMs

Verbal Reinforcement Learning: Closing the Feedback Loop

The paper introduces a framework for 'Verbal Reinforcement Learning' (VRL), shifting from raw reward signals to structured insight governance by extracting and managing verbal feedback from world interactions.

arXiv cs.AI · AI & LLMs

Foundation Model Orchestrated Workflows for Engineering Design

This research introduces a surrogate-assisted design workflow for pedestrian protection systems, using foundation models to orchestrate complex simulation and optimization tasks.

arXiv cs.AI · AI & LLMs

Analyzing AI Model Behavior via Agent Trajectories

This 106-page paper presents a framework for evaluating LLM behavior by analyzing the sequential decision-making paths (trajectories) agents take when solving complex tasks, rather than looking only at final outputs.

arXiv cs.AI · AI & LLMs

Benchmarking LLM Strategic Decision-Making in Corporate Simulations

This research evaluates the efficacy of LLMs in executive leadership roles by simulating multi-role corporate environments to test their ability to perform strategic resource reallocation.

arXiv cs.AI · AI & LLMs

Architecting Distributed General-Purpose Agent Networks

The paper proposes a framework for distributed agent networks, shifting from monolithic AI systems to decentralized, collaborative architectures that improve scalability and task specialization.

arXiv cs.AI · AI & LLMs

MemTrace: Beyond Final Accuracy in LLM Long-Term Memory

MemTrace is a diagnostic framework designed to evaluate LLM long-term memory beyond simple accuracy metrics, focusing on the underlying mechanisms of information retention and retrieval over time.

arXiv cs.AI · AI & LLMs

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

SpeechDx is a new multi-task benchmark designed to evaluate AI models on clinical speech analysis, addressing the need for standardized, robust performance metrics in medical diagnostics.

arXiv cs.AI · AI & LLMs

Incumbent Advantage: Brand Bias in LLM Recommendation Systems

LLMs exhibit significant brand bias, disproportionately recommending incumbent products regardless of quality, creating a 'rich-get-richer' feedback loop that threatens market competition.

MarkTechPost · AI & LLMs

MiniMax Sparse Attention: Scaling Long Context with Block-Sparsity

MiniMax Sparse Attention (MSA) reduces the quadratic cost of long-context attention by using a two-branch, block-sparse approach that selects key-value blocks via a learned indexer, maintaining performance while fixing compute costs at O(k·B_k).
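The general idea of block-sparse key-value selection can be sketched for a single query in plain Python. This is a hedged illustration, not MSA's actual design: the function name `block_sparse_attention` is hypothetical, the block scorer is a simple mean-key dot product standing in for the learned indexer, and the two-branch structure is not modeled.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend from one query to only the top_k highest-scoring key/value blocks."""
    # Split key positions into contiguous blocks.
    blocks = [range(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]

    # Stand-in indexer (assumption): score a block by the query's dot
    # product with the block's mean key. A learned indexer would replace this.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(q))]
        return dot(q, mean_key)

    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])
    idx = [i for b in selected for i in blocks[b]]

    # Scaled dot-product softmax over the selected positions only.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in idx]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
           for d in range(len(values[0]))]
    return out, idx
```

The softmax touches at most `top_k * block_size` positions regardless of sequence length. The stand-in scorer here still scans every key, so a practical version would cache per-block summaries.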

DAY 04 · Tuesday, JUN 16, 2026 · 7 SUMMARIES
arXiv cs.AI · AI & LLMs

CogGuard: Proactive Monitoring for Edge Intelligent Services

CogGuard is a framework designed to improve the reliability of edge-based AI services by integrating cognitive and operational profiling to predict and mitigate system failures before they occur.

arXiv cs.AI · AI & LLMs

Visual-Seeker: Active Visual Reasoning for Multimodal Agents

Visual-Seeker introduces a visual-native agentic search framework that moves beyond text-based retrieval by employing active visual reasoning to navigate and interpret complex multimodal environments.

arXiv cs.AI · AI & LLMs

Mask-Proof: Automated Data Curation for Mathematical Proofs

Mask-Proof is an LLM-based pipeline designed to automate the curation of high-quality mathematical proof data, addressing the scarcity of reliable training sets for formal reasoning models.

arXiv cs.AI · AI & LLMs

CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG

CONCORD is a framework for device-cloud Retrieval-Augmented Generation that optimizes performance under document isolation by using asynchronous sparse aggregation to balance local privacy with cloud-scale retrieval.

arXiv cs.AI · AI & LLMs

Evaluating LLM Judge Reliability via Subset Selection

The 'Metric Match' approach improves LLM judge evaluation by using subset selection to identify high-fidelity data samples, ensuring that automated metrics better correlate with human preferences.

arXiv cs.AI · AI & LLMs

Cognitive Debt: The Hidden Fragility of AI-Augmented Systems

The paper introduces 'Cognitive Debt' as a framework to explain how AI-driven intellectual leverage creates systemic fragility by offloading critical reasoning to models, leading to a loss of human oversight and domain expertise.

arXiv cs.AI · AI & LLMs

Measuring Trust Dynamics in Multi-Agent AI Systems

This research provides a framework for quantifying how AI agents form, break, and recover trust, offering essential insights for the governance of autonomous multi-agent systems.

DAY 05 · Monday, JUN 15, 2026 · 3 SUMMARIES
arXiv cs.AI · AI & LLMs

Revisiting the Link Between AI Literacy and Usage

The paper challenges the assumption that lower digital literacy correlates with higher AI usage, suggesting instead that 'adoption breadth'—the variety of tools used—is a more accurate metric for understanding AI engagement.

arXiv cs.AI · AI & LLMs

Comparing Diff-in-Means and INLP for LLM Refusal Mechanisms

This paper evaluates two common techniques—Difference-in-Means and Iterative Null-space Projection (INLP)—for identifying and mitigating refusal behaviors in LLMs, highlighting the limitations of assuming refusal is a single, linear direction.
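For readers unfamiliar with Difference-in-Means, the technique can be sketched in a few lines: take the difference between the mean activation on refusal prompts and the mean on compliant prompts, normalize it, and project it out. The helper names and toy activations below are hypothetical, and this is not the paper's experimental setup. INLP differs in that it repeatedly fits a linear classifier and projects onto its null space, removing several directions rather than one.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def diff_in_means_direction(refusal_acts, compliant_acts):
    """Unit vector pointing from the compliant-class mean to the refusal-class mean."""
    mu_r, mu_c = mean(refusal_acts), mean(compliant_acts)
    diff = [a - b for a, b in zip(mu_r, mu_c)]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]


def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]


# Toy 2-D activations (made up for illustration only).
refusal = [[2.0, 0.0], [4.0, 0.0]]
compliant = [[0.0, 0.0], [0.0, 0.0]]
direction = diff_in_means_direction(refusal, compliant)
ablated = ablate([5.0, 7.0], direction)
```

If refusal is spread over more than one direction, which is the limitation this summary highlights, a single projection like `ablate` leaves the remaining components untouched.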

arXiv cs.AI · AI & LLMs

Hybrid Open-Ended Tri-Evolution for Deep Research Agents

The paper introduces a 'Hybrid Open-Ended Tri-Evolution' framework to improve the performance of deep research AI agents by optimizing their exploration and reasoning capabilities.
