arXiv cs.AI

Parallel Context Compaction for Long-Horizon LLM Agent Serving

The paper proposes parallel context compaction for serving long-horizon LLM agents, reducing the computational overhead of maintaining very large context windows during extended agent interactions.

DART: Improving Agent Reliability via Semantic Recoverability

DART (Dynamic Agent Recovery Technique) introduces a framework for structured tool agents to detect and recover from execution failures by leveraging semantic feedback loops, significantly reducing task abandonment.

Foundation Protocol: Coordination for Agentic Systems

The Foundation Protocol proposes a standardized coordination layer designed to enable interoperability, trust, and resource allocation between autonomous AI agents in a decentralized society.

Inductive Deductive Synthesis for Formally Verified AI Systems

Inductive Deductive Synthesis (IDS) combines inductive AI generation with deductive formal verification to ensure AI-generated code is mathematically correct and reliable.

GENSTRAT: A Framework for Strategic Reasoning in LLMs

GENSTRAT provides a structured approach to evaluating and improving how Large Language Models perform in strategic, multi-agent environments, moving beyond simple pattern matching to formal strategic reasoning.

EVE-Agent: Improving Self-Evolving Agents with Evidence Verification

EVE-Agent improves self-evolving search agents by requiring them to provide verifiable evidence for their answers, ensuring training data is grounded and auditable without human labels.

Energy per Successful Goal: A New Metric for Agentic AI Efficiency

The paper introduces 'Energy per Successful Goal' (ESG) as a critical metric for evaluating AI agent efficiency, shifting focus from raw compute costs to the energy required to complete specific, actionable objectives.
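As a rough illustration of how such a metric could be computed, the sketch below divides the total energy spent across all attempts by the number of goals completed. The `Episode` type, the field names, and the choice to charge failed attempts to the total are assumptions made for this example, not the paper's definition.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Episode:
    """One agent attempt at a goal (hypothetical record layout)."""
    energy_joules: float  # energy consumed by this attempt
    succeeded: bool       # whether the goal was completed


def energy_per_successful_goal(episodes: Iterable[Episode]) -> float:
    """Total energy over all attempts divided by the number of successes.

    Assumption: energy spent on failed attempts still counts, so an
    unreliable agent is penalized even if each attempt is cheap.
    """
    episodes = list(episodes)
    total_energy = sum(e.energy_joules for e in episodes)
    successes = sum(1 for e in episodes if e.succeeded)
    if successes == 0:
        # No goal was completed, so the cost per success is unbounded.
        return float("inf")
    return total_energy / successes
```

Under this reading, an agent that uses little energy per call but fails often can score worse than a costlier agent that succeeds on the first attempt.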

BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems

BOHM introduces a method for attributing performance in compound AI systems without the computational overhead of traditional evaluation methods.

DAY 02 · Friday MAY 22 · 2026 · 14 SUMMARIES

Governance by Construction for Generalist Agents

The paper proposes 'Governance by Construction' as a paradigm for AI safety, shifting from post-hoc monitoring to embedding constraints directly into the agent's architecture and execution environment.

Conflict-Aware Additive Guidance for Flow Models

This paper introduces a method to manage conflicting compositional rewards in flow-based generative models by dynamically adjusting guidance to prevent performance degradation.

VBFDD-Agent: Translating Battery Signals into Descriptive Text

The VBFDD-Agent framework improves electric vehicle battery diagnostics by converting raw digital sensor signals into descriptive text, enabling LLMs to perform more accurate fault detection and diagnosis.

HANA: A Hierarchical Agent-native Network Architecture

HANA transitions network management from static automation to autonomous operation by utilizing a hierarchical agent-based framework that enables decentralized decision-making and self-optimization.

Optimizing Agentic Pipelines with Temporal Semantic Caching

The paper introduces a framework for improving agentic plan-execute pipelines by implementing temporal semantic caching, which reduces redundant LLM calls and latency by caching execution results based on semantic similarity and temporal relevance.
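The general idea of caching on both semantic similarity and temporal relevance can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the cosine-similarity threshold, the fixed time-to-live, and the linear scan over entries are all assumptions chosen for brevity.

```python
import math
import time
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TemporalSemanticCache:
    """Hypothetical cache keyed by embedding similarity with a freshness limit."""

    def __init__(
        self,
        min_similarity: float = 0.9,   # assumed similarity threshold
        ttl_seconds: float = 300.0,    # assumed freshness window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.min_similarity = min_similarity
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries: List[Tuple[List[float], str, float]] = []

    def put(self, embedding: Sequence[float], result: str) -> None:
        """Store an execution result under the embedding of its request."""
        self._entries.append((list(embedding), result, self._clock()))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the most similar fresh result, or None on a cache miss."""
        now = self._clock()
        # Temporal relevance: discard entries older than the TTL.
        self._entries = [
            e for e in self._entries if now - e[2] <= self.ttl_seconds
        ]
        best, best_sim = None, self.min_similarity
        for stored, result, _ in self._entries:
            sim = cosine(embedding, stored)
            # Semantic relevance: keep the closest match above the threshold.
            if sim >= best_sim:
                best, best_sim = result, sim
        return best
```

A pipeline step would call `get` with the embedding of the pending request and only invoke the LLM on a miss, then `put` the fresh result. The embedding function itself is left out here and would come from whatever model the pipeline already uses.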

Personality Engineering: A New Framework for AI Negotiation Agents

Researchers propose 'personality engineering' using the interpersonal circumplex model to parameterize and test AI agent behavior in controlled negotiation experiments.

COAgents: A Multi-Agent Framework for Routing Optimization

COAgents is a multi-agent framework designed to navigate complex search spaces in routing problems by combining collaborative agent intelligence with optimization techniques.

Open-World Evaluations for Frontier AI Capabilities

The paper proposes shifting AI benchmarking from static, closed-set datasets to open-world evaluations, which better measure true agentic capability and generalization in unpredictable environments.

AgentAtlas: Moving Beyond Outcome-Only LLM Agent Evaluation

AgentAtlas shifts the focus of LLM agent evaluation from simple success/failure leaderboards to granular, process-oriented analysis of agent behavior and decision-making patterns.

Evaluating Uncertainty in AI Systems with ECUAS_n Metrics

The ECUAS_n family of metrics provides a principled, unified framework for evaluating AI systems that output uncertainty estimates, addressing the lack of standardized benchmarking for uncertainty-augmented models.

AgentCo-op: Retrieval-Based Synthesis of Multi-Agent Workflows

AgentCo-op introduces a retrieval-based framework to dynamically synthesize interoperable multi-agent workflows, moving beyond static agent orchestration to modular, reusable task execution.

SOLAR: Self-Optimizing Agents for Lifelong Learning

SOLAR introduces a framework for autonomous agents that perform continuous, self-directed learning and adaptation in open-ended environments, addressing the limitations of static model training.

OSCToM: Advancing High-Order Theory of Mind via RL-Guided Adversarial Generation

OSCToM improves AI's ability to model complex, recursive mental states (Theory of Mind) by using reinforcement learning to guide adversarial data generation, addressing the scarcity of high-order social reasoning datasets.

COSMO-Agent: Automating CAD-CAE Design Loops with LLMs

COSMO-Agent is a reinforcement learning framework that enables LLMs to bridge the CAD-CAE semantic gap by orchestrating external tools to perform iterative, constraint-driven geometric design.

DAY 03 · Wednesday MAY 20 · 2026 · 8 SUMMARIES

SimGym: Simulating E-Commerce A/B Tests with VLM Agents

SimGym is a framework that uses traffic-grounded Vision-Language Model (VLM) agents to simulate user behavior in e-commerce environments, enabling faster and more accurate A/B test predictions.

Evaluating the Feasibility of Autonomous AI Research Systems

The article provides a framework for assessing how close current AI systems are to performing end-to-end scientific research, highlighting the gap between task-specific automation and true autonomous discovery.

Formalizing Agentic Knowledge Graphs for LLM Discoverability

The paper proposes a formal framework for 'Agentic KG Affordances,' enabling AI agents to programmatically discover and interact with knowledge graphs by standardizing how knowledge is exposed and queried.

Distinguishing Uncertainty Types for Better AI Exploration

Effective AI exploration requires distinguishing between aleatoric uncertainty (stochasticity) and epistemic uncertainty (volatility), as treating them identically leads to suboptimal learning behaviors.
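One common way to separate the two kinds of uncertainty in practice, which may differ from the paper's approach, is to use an ensemble of predictors that each output a mean and a variance. By the law of total variance, the average of the predicted variances estimates the aleatoric part and the spread of the predicted means estimates the epistemic part. The function name and the ensemble setup below are assumptions for illustration.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def decompose_uncertainty(
    member_means: Sequence[float],
    member_variances: Sequence[float],
) -> Tuple[float, float]:
    """Split predictive variance for one input into (aleatoric, epistemic).

    member_means[i] and member_variances[i] are the mean and variance
    predicted by ensemble member i (hypothetical inputs).
    """
    if len(member_means) != len(member_variances) or not member_means:
        raise ValueError("need one mean and one variance per ensemble member")
    # Aleatoric: noise each member expects even if its parameters were right.
    aleatoric = fmean(member_variances)
    # Epistemic: disagreement between members about the mean.
    epistemic = pvariance(member_means)
    return aleatoric, epistemic
```

An exploration policy could then direct effort toward states with high epistemic uncertainty, where more data can help, and avoid chasing states that are merely noisy.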

Hallucination as Exploit: Security Risks in Multimodal AI Agents

Multimodal AI agents are vulnerable to 'evidence-carrying' attacks, in which attackers use manipulated visual inputs to induce hallucinations that lead the agent to execute malicious code or leak sensitive data.

DecisionBench: Measuring Agentic Delegation in Long-Horizon Tasks

DecisionBench provides a standardized framework for evaluating how AI agents delegate sub-tasks in complex, long-horizon workflows, addressing a critical gap in multi-agent system performance measurement.

Optimizing System Prompts via Embedding by Elicitation

The paper introduces 'Embedding by Elicitation,' a method that uses Bayesian Optimization to dynamically refine system prompts by learning latent representations, overcoming the limitations of static prompt engineering.