arXiv cs.AI
Parallel Context Compaction for Long-Horizon LLM Agent Serving
The paper proposes a method to optimize long-horizon LLM agent performance by using parallel context compaction, reducing the computational overhead of maintaining massive context windows during extended agent interactions.
DART: Improving Agent Reliability via Semantic Recoverability
DART (Dynamic Agent Recovery Technique) introduces a framework for structured tool agents to detect and recover from execution failures by leveraging semantic feedback loops, significantly reducing task abandonment.
Foundation Protocol: Coordination for Agentic Systems
The Foundation Protocol proposes a standardized coordination layer designed to enable interoperability, trust, and resource allocation between autonomous AI agents in a decentralized society.
Inductive Deductive Synthesis for Formally Verified AI Systems
Inductive Deductive Synthesis (IDS) combines inductive AI generation with deductive formal verification to ensure AI-generated code is mathematically correct and reliable.
GENSTRAT: A Framework for Strategic Reasoning in LLMs
GENSTRAT provides a structured approach to evaluating and improving how Large Language Models perform in strategic, multi-agent environments, moving beyond simple pattern matching to formal strategic reasoning.
EVE-Agent: Improving Self-Evolving Agents with Evidence Verification
EVE-Agent improves self-evolving search agents by requiring them to provide verifiable evidence for their answers, ensuring training data is grounded and auditable without human labels.
Energy per Successful Goal: A New Metric for Agentic AI Efficiency
The paper introduces 'Energy per Successful Goal' (ESG) as a critical metric for evaluating AI agent efficiency, shifting focus from raw compute costs to the energy required to complete specific, actionable objectives.
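One way to read the metric, sketched here as an assumption rather than the paper's definition: divide the total energy spent across all attempts by the number of goals actually achieved, so that energy burned on failed attempts still counts against the agent. The `Episode` type, the joule unit, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One agent attempt at a goal (hypothetical record format)."""
    energy_joules: float  # energy consumed over the whole attempt
    succeeded: bool       # whether the goal was achieved


def energy_per_successful_goal(episodes):
    """Total energy over all attempts divided by the number of successes.

    Assumption: failed attempts are charged to the numerator, so an
    unreliable agent scores worse than its per-attempt cost suggests.
    """
    total_energy = sum(e.energy_joules for e in episodes)
    successes = sum(1 for e in episodes if e.succeeded)
    if successes == 0:
        return float("inf")  # no goal achieved: cost per success is unbounded
    return total_energy / successes


episodes = [
    Episode(energy_joules=120.0, succeeded=True),
    Episode(energy_joules=300.0, succeeded=False),
    Episode(energy_joules=180.0, succeeded=True),
]
esg = energy_per_successful_goal(episodes)  # 600 J over 2 successes
```

Under this reading, the failed 300 J attempt raises the score from 150 J to 300 J per successful goal, which is the kind of difference a raw compute figure would not show.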
BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems
BOHM introduces a method for attributing performance in compound AI systems without the computational overhead of traditional evaluation methods.
Governance by Construction for Generalist Agents
The paper proposes 'Governance by Construction' as a paradigm for AI safety, shifting from post-hoc monitoring to embedding constraints directly into the agent's architecture and execution environment.
Conflict-Aware Additive Guidance for Flow Models
This paper introduces a method to manage conflicting compositional rewards in flow-based generative models by dynamically adjusting guidance to prevent performance degradation.
VBFDD-Agent: Translating Battery Signals into Descriptive Text
The VBFDD-Agent framework improves electric vehicle battery diagnostics by converting raw digital sensor signals into descriptive text, enabling LLMs to perform more accurate fault detection and diagnosis.
HANA: A Hierarchical Agent-native Network Architecture
HANA transitions network management from static automation to autonomous operation by utilizing a hierarchical agent-based framework that enables decentralized decision-making and self-optimization.
Optimizing Agentic Pipelines with Temporal Semantic Caching
The paper introduces a framework for improving agentic plan-execute pipelines by implementing temporal semantic caching, which reduces redundant LLM calls and latency by caching execution results based on semantic similarity and temporal relevance.
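The caching idea in this summary can be illustrated with a minimal sketch: a cached result is reused only if the new query is similar enough to a stored one and the stored entry is still fresh. Everything here is an assumption for illustration, not the paper's design: the class name `TemporalSemanticCache`, the thresholds, and the toy bag-of-words `embed` function, which stands in for a real embedding model.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TemporalSemanticCache:
    """Hypothetical cache: a hit needs semantic similarity AND freshness."""

    def __init__(self, min_similarity=0.8, ttl_seconds=60.0):
        self.min_similarity = min_similarity
        self.ttl_seconds = ttl_seconds
        self.entries = []  # list of (embedding, result, stored_at)

    def put(self, query, result, now):
        self.entries.append((embed(query), result, now))

    def get(self, query, now):
        """Return the most similar fresh result, or None on a miss."""
        query_vec = embed(query)
        best_result, best_sim = None, self.min_similarity
        for vec, result, stored_at in self.entries:
            if now - stored_at > self.ttl_seconds:
                continue  # too old: temporal relevance check fails
            sim = cosine(query_vec, vec)
            if sim >= best_sim:
                best_result, best_sim = result, sim
        return best_result


cache = TemporalSemanticCache(min_similarity=0.8, ttl_seconds=60.0)
cache.put("weather in paris today", "sunny", now=0.0)

hit = cache.get("today weather in paris", now=10.0)       # same words, fresh
miss_topic = cache.get("stock price of acme", now=10.0)   # unrelated query
miss_stale = cache.get("weather in paris today", now=100.0)  # entry expired
```

Timestamps are passed in explicitly so the behavior is deterministic; a real pipeline would more likely read a clock and evict stale entries rather than skip them on each lookup.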
Personality Engineering: A New Framework for AI Negotiation Agents
Researchers propose 'personality engineering' using the interpersonal circumplex model to parameterize and test AI agent behavior in controlled negotiation experiments.
COAgents: A Multi-Agent Framework for Routing Optimization
COAgents is a multi-agent framework designed to navigate complex search spaces in routing problems by combining collaborative agent intelligence with optimization techniques.
Open-World Evaluations for Frontier AI Capabilities
The paper proposes shifting AI benchmarking from static, closed-set datasets to open-world evaluations, which better measure true agentic capability and generalization in unpredictable environments.
AgentAtlas: Moving Beyond Outcome-Only LLM Agent Evaluation
AgentAtlas shifts the focus of LLM agent evaluation from simple success/failure leaderboards to granular, process-oriented analysis of agent behavior and decision-making patterns.
Evaluating Uncertainty in AI Systems with ECUAS_n Metrics
The ECUAS_n family of metrics provides a principled, unified framework for evaluating AI systems that output uncertainty estimates, addressing the lack of standardized benchmarking for uncertainty-augmented models.
AgentCo-op: Retrieval-Based Synthesis of Multi-Agent Workflows
AgentCo-op introduces a retrieval-based framework to dynamically synthesize interoperable multi-agent workflows, moving beyond static agent orchestration to modular, reusable task execution.
SOLAR: Self-Optimizing Agents for Lifelong Learning
SOLAR introduces a framework for autonomous agents that perform continuous, self-directed learning and adaptation in open-ended environments, addressing the limitations of static model training.
OSCToM: Advancing High-Order Theory of Mind via RL-Guided Adversarial Generation
OSCToM improves AI's ability to model complex, recursive mental states (Theory of Mind) by using reinforcement learning to guide adversarial data generation, addressing the scarcity of high-order social reasoning datasets.
COSMO-Agent: Automating CAD-CAE Design Loops with LLMs
COSMO-Agent is a reinforcement learning framework that enables LLMs to bridge the CAD-CAE semantic gap by orchestrating external tools to perform iterative, constraint-driven geometric design.
SimGym: Simulating E-Commerce A/B Tests with VLM Agents
SimGym is a framework that uses traffic-grounded Vision-Language Model (VLM) agents to simulate user behavior in e-commerce environments, enabling faster and more accurate A/B test predictions.
Evaluating the Feasibility of Autonomous AI Research Systems
The article provides a framework for assessing how close current AI systems are to performing end-to-end scientific research, highlighting the gap between task-specific automation and true autonomous discovery.
Formalizing Agentic Knowledge Graphs for LLM Discoverability
The paper proposes a formal framework for 'Agentic KG Affordances,' enabling AI agents to programmatically discover and interact with knowledge graphs by standardizing how knowledge is exposed and queried.
Distinguishing Uncertainty Types for Better AI Exploration
Effective AI exploration requires distinguishing between aleatoric uncertainty (stochasticity) and epistemic uncertainty (volatility), as treating them identically leads to suboptimal learning behaviors.
Hallucination as Exploit: Security Risks in Multimodal AI Agents
Multimodal AI agents are vulnerable to 'evidence-carrying' attacks, where attackers use hallucination to force models into executing malicious code or leaking sensitive data via manipulated visual inputs.
DecisionBench: Measuring Agentic Delegation in Long-Horizon Tasks
DecisionBench provides a standardized framework for evaluating how AI agents delegate sub-tasks in complex, long-horizon workflows, addressing a critical gap in multi-agent system performance measurement.
Optimizing System Prompts via Embedding by Elicitation
The paper introduces 'Embedding by Elicitation,' a method that uses Bayesian Optimization to dynamically refine system prompts by learning latent representations, overcoming the limitations of static prompt engineering.
Developing Data Probes to Quantify LLM Data Impact
The authors propose 'data probes' as a diagnostic framework to move beyond black-box training, enabling developers to measure how specific data characteristics influence model performance and behavior.