#agents
Every summary, chronological.
SpatialClaw: Using Code as an Action Interface for Spatial Reasoning
SpatialClaw is a training-free agent framework that improves spatial reasoning in VLMs by treating Python code—rather than structured tool calls—as the primary interface for perception and geometric tasks.
Building Complex Software with Long-Running AI Agents
Long-running AI agents can execute multi-day, complex engineering pipelines—such as building an OS or optimizing 3D web scenes—by self-correcting through dependent tasks rather than relying on single-prompt generation.
Google Cloud Tech
Governing AI Agents with Looker and MCP
By using the Model Context Protocol (MCP) to connect AI agents to Looker's semantic layer, developers can replace fragile raw SQL generation with governed, model-aware data interactions.
Scale Your Expertise, Not Your Job Titles
Instead of using AI to perform roles you aren't trained for, use it to encode your unique professional expertise into systems, allowing your specific skills to scale across an entire project.
The New Software Lifecycle: From Vibe Coding to Agentic Engineering
AI has shifted the software development bottleneck from implementation to specification and verification. Success now depends on 'harness engineering'—the 90% of an agent's architecture that isn't the model—and treating context management as a versioned, architectural decision.
Moving Beyond Static Leaderboards for LLM Agent Evaluation
Static benchmarks often fail to predict real-world performance for LLM agents; the authors propose a framework focused on predictive validity to better align evaluation with practical utility.
Configurable Clinical Information Extraction with Agentic RAG
Agentic RAG systems for clinical data require modular configuration to balance precision and recall, as monolithic pipelines often fail to handle the high variability of medical documentation.
Deontic Policies for Runtime Governance of Agentic AI
The paper proposes using deontic logic—a system of formal rules defining obligations, permissions, and prohibitions—to govern the runtime behavior of autonomous AI agents.
The Rise of Agentic Traffic and Microsoft's Model Strategy
Agentic AI bots now dominate web traffic, signaling a shift in how we interact with information. Meanwhile, Microsoft is pivoting to first-party models, prioritizing safety and cost-efficiency for enterprise users.
Architecting Long-Running AI Agents for Multi-Day Workflows
Move beyond stateless chatbots by implementing event-driven dormancy, durable checkpointing, and decoupled evaluation to manage complex, multi-day workflows.
Google Cloud Tech
The Production AI Playbook: Deploying Agents at Enterprise Scale
Moving AI from demo to production requires shifting focus from model selection to five pillars: evaluation, observability, data foundation, orchestration, and governance.
RODS: Improving Multi-Turn Tool-Use Agents via Reward-Driven Synthesis
RODS (Reward-Driven Online Data Synthesis) improves multi-turn tool-use agents by generating high-quality synthetic training data through iterative reward-based filtering, addressing the scarcity of complex, multi-step interaction data.
Skill-Guided Continuation Distillation for GUI Agents
The paper introduces a method to improve GUI agent performance by distilling complex task trajectories into modular, skill-based sub-tasks, enhancing generalization and execution reliability.
Decoupling Search from Reasoning in LLM Agents
Native search grounding in LLMs creates rigid, expensive, and opaque agent architectures. Moving to a Decoupled Search Grounding (DSG) layer allows for vendor-agnostic control over retrieval, caching, and cost, while maintaining accuracy.
Improving AI Scientist Reliability via Research Harnesses
The paper proposes a 'Research Harness' to externalize synthesis and validation, addressing the reliability issues inherent in autonomous AI research agents.
CEO-Bench: Measuring Long-Term Strategic Reasoning in AI Agents
CEO-Bench is a new evaluation framework designed to test whether AI agents can maintain strategic coherence and decision-making over extended, multi-step business scenarios.
Building AI Agents with Google's Agent Development Kit (ADK)
A practical walkthrough on using Google's Agent Development Kit (ADK) to build autonomous agents that can interact with text-based environments, specifically demonstrated through a retro-inspired adventure game.
Google Cloud Tech
Predicting AI Model Behavior via Deployment Simulation
OpenAI uses 'Deployment Simulation'—replaying real, de-identified user conversations with new models—to predict safety risks and undesired behaviors before public release, outperforming traditional synthetic evaluations.
SEAGym: A Benchmark for Self-Evolving LLM Agents
SEAGym provides a standardized evaluation environment designed to measure the capabilities of self-evolving LLM agents, focusing on their ability to autonomously improve performance over time.
Analyzing AI Model Behavior via Agent Trajectories
This paper provides a comprehensive 106-page framework for evaluating LLM behavior by analyzing the sequential decision-making paths (trajectories) agents take when solving complex tasks, rather than just looking at final outputs.
Benchmarking LLM Strategic Decision-Making in Corporate Simulations
This research evaluates the efficacy of LLMs in executive leadership roles by simulating multi-role corporate environments to test their ability to perform strategic resource reallocation.
Architecting Distributed General-Purpose Agent Networks
The paper proposes a framework for distributed agent networks, shifting from monolithic AI systems to decentralized, collaborative architectures that improve scalability and task specialization.
Improving Agentic Search via Diverse Query Initialization
The paper proposes moving beyond simple parallel sampling in agentic search by implementing diverse query initialization, which improves retrieval performance by covering a broader semantic space.
Qwen-RobotSuite: Three Foundation Models for Embodied AI
The Qwen team has released a suite of three specialized foundation models—RobotManip, RobotWorld, and RobotNav—designed to address data fragmentation in robotics through unified action representations, language-conditioned world modeling, and scalable navigation interfaces.
Pinterest Pivots to Conversational AI Shopping
Pinterest is testing 'Ask Pinterest,' a standalone AI-powered shopping app that uses its 'Taste Graph' data to provide personalized, conversational recommendations for complex, multi-step consumer queries.
Building Long-Running, Event-Driven AI Agents with ADK
The Agent Development Kit (ADK) enables event-driven AI agents that run statelessly between events yet preserve their state across weeks of dormancy without token bloat, using a state-machine approach rather than traditional chat-based memory.
Google Cloud Tech
Building Multi-Agent Systems with ADK and A2A
The Agent Development Kit (ADK) and Agent2Agent (A2A) protocol enable specialized AI agents to collaborate on complex tasks, using an orchestration layer to resolve conflicts and incorporate human-in-the-loop decision-making.
Visual-Seeker: Active Visual Reasoning for Multimodal Agents
Visual-Seeker introduces a visual-native agentic search framework that moves beyond text-based retrieval by employing active visual reasoning to navigate and interpret complex multimodal environments.
Verifiable Agentic Data Science via Tool-Grounded Reasoning
To solve complex, irregular Time-Series Question Answering (TSQA), agents must move beyond pure generation toward tool-grounded reasoning that enforces verifiable, step-by-step execution.
PrologMCP: Standardizing Logic-Based Tooling for LLM Agents
PrologMCP provides a standardized interface for LLM agents to interact with Prolog knowledge bases, enabling more reliable symbolic reasoning and complex constraint satisfaction in AI workflows.