AI & LLMs
The deepest channel on Edge. Foundation models, agent architectures, retrieval systems, evals, and the moving line between research and production.
This pillar covers the work that determines what AI products can actually do. New model releases get filed here when they shift capability or cost in a meaningful way, alongside the harder material from the labs and the practitioners who turn it into shipping software. Read it for primary sources rather than recap blogs: lab papers and notes, retrieval benchmarks, agent traces, eval methodology, and the long-form essays that hold up six months later.
Two threads run through everything filed here. The first is what is genuinely new at the model layer: capability cliffs, training recipes, alignment work, the shape of the next deployment cycle. The second is what works in production: which patterns of context engineering and tool use compound across teams, where retrieval beats fine-tuning and where it loses, what the operational tax of running an agentic system actually looks like.
The summaries below are sorted by recency. The pillar refreshes as new entries land.
Filed under AI & LLMs
In the Weights: Measuring Your Digital Presence in AI Models
In the Weights is a new tool that evaluates how well various LLMs recall specific individuals without web search, effectively serving as a modern, AI-centric vanity search.
VibeThinker-3B: High-Performance Reasoning at 3B Parameters
VibeThinker-3B is a compact, open-source reasoning model that achieves performance comparable to far larger models on math and coding tasks by using a specialized 'Spectrum-to-Signal' post-training pipeline.
SpatialClaw: Using Code as an Action Interface for Spatial Reasoning
SpatialClaw is a training-free agent framework that improves spatial reasoning in VLMs by treating Python code—rather than structured tool calls—as the primary interface for perception and geometric tasks.
Building Complex Software with Long-Running AI Agents
Long-running AI agents can execute multi-day, complex engineering pipelines—such as building an OS or optimizing 3D web scenes—by self-correcting through dependent tasks rather than relying on single-prompt generation.
Optimizing AI Apps with LLM Routing
Stop relying on a single 'best' model. Implementing an LLM router allows you to dynamically match requests to models based on cost, latency, and task complexity, ensuring production stability and efficiency.
Governing AI Agents with Looker and MCP
By using the Model Context Protocol (MCP) to connect AI agents to Looker's semantic layer, developers can replace fragile raw SQL generation with governed, model-aware data interactions.
GLARE: Natural Language Interfaces for Global Model Explanations
GLARE provides a natural language interface for querying global model explanations, allowing users to interpret complex AI behavior through conversational prompts rather than static visualizations.
Moving Beyond Static Leaderboards for LLM Agent Evaluation
Static benchmarks often fail to predict real-world performance for LLM agents; the authors propose a framework focused on predictive validity to better align evaluation with practical utility.
The Symbiotic Evolution of AI and Software Engineering
The intersection of AI and Software Engineering (AI4SE and SE4AI) has matured over the last decade, shifting from experimental research to essential production-grade methodologies for building, testing, and maintaining c…
Toten: Ontological Tokenization for Technical Portuguese
Toten is a knowledge-based tokenization framework designed to accurately parse physical quantities and technical notation in Brazilian Portuguese, addressing common failures in standard NLP tokenizers.
Optimizing LLM Post-Training Through Pairwise Comparison Selection
The paper investigates how the selection of response pairs in preference-based post-training (like DPO or PPO) impacts model performance, suggesting that strategic pair selection is as critical as the training algorithm …
Configurable Clinical Information Extraction with Agentic RAG
Agentic RAG systems for clinical data require modular configuration to balance precision and recall, as monolithic pipelines often fail to handle the high variability of medical documentation.
Detecting LLM Epistemic Blind Spots via Cross-Model Attribution
LLMs often hallucinate confidence in clinical settings. This paper introduces a method using Cross-Model Attribution Divergence (CMAD) to identify when models rely on unreliable features, effectively flagging epistemic u…
Deontic Policies for Runtime Governance of Agentic AI
The paper proposes using deontic logic—a system of formal rules defining obligations, permissions, and prohibitions—to govern the runtime behavior of autonomous AI agents.
Building Reliable AI Code Generation Pipelines with Salesforce CodeGen
To move AI-generated code from prototype to production, implement a multi-stage pipeline that includes automated unit testing, safety sandboxing, and model-based reranking to filter out hallucinated or insecure outputs.
Perplexity Brain: Self-Improving Memory for AI Agents
Perplexity's 'Brain' system shifts AI memory from user-centric profiles to agent-centric performance, using an overnight context graph to learn from past tasks, failures, and corrections to improve future efficiency.
Liquid AI's New 350M Multilingual Retrieval Models
Liquid AI has released LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M, two efficient, bidirectional retrieval models optimized for multilingual search across 11 languages.
The Rise of Agentic Traffic and Microsoft's Model Strategy
Agentic AI bots now dominate web traffic, signaling a shift in how we interact with information. Meanwhile, Microsoft is pivoting to first-party models, prioritizing safety and cost-efficiency for enterprise users.
Architecting Long-Running AI Agents for Multi-Day Workflows
Move beyond stateless chatbots by implementing event-driven dormancy, durable checkpointing, and decoupled evaluation to manage complex, multi-day workflows.
Building AI Agents with Model Context Protocol (MCP)
The Model Context Protocol (MCP) acts as a universal adapter, allowing AI agents to securely interact with external tools and live data via a standardized input/output interface, decoupling agent logic from tool implemen…
The Production AI Playbook: Deploying Agents at Enterprise Scale
Moving AI from demo to production requires shifting focus from model selection to five pillars: evaluation, observability, data foundation, orchestration, and governance.
RODS: Improving Multi-Turn Tool-Use Agents via Reward-Driven Synthesis
RODS (Reward-Driven Online Data Synthesis) improves multi-turn tool-use agents by generating high-quality synthetic training data through iterative reward-based filtering, addressing the scarcity of complex, multi-step i…
Decoupling Search from Reasoning in LLM Agents
Native search grounding in LLMs creates rigid, expensive, and opaque agent architectures. Moving to a Decoupled Search Grounding (DSG) layer allows for vendor-agnostic control over retrieval, caching, and cost, while mai…
SciRisk-Bench: Evaluating Safety in AI for Science
SciRisk-Bench is a new benchmark designed to evaluate the safety risks of AI models specifically applied to scientific research, focusing on multi-dimensional risk assessment.
Skill-Guided Continuation Distillation for GUI Agents
The paper introduces a method to improve GUI agent performance by distilling complex task trajectories into modular, skill-based sub-tasks, enhancing generalization and execution reliability.
Improving AI Scientist Reliability via Research Harnesses
The paper proposes a 'Research Harness' to externalize synthesis and validation, addressing the reliability issues inherent in autonomous AI research agents.
WorldLines: Benchmarking Long-Horizon Stateful Embodied Agents
WorldLines introduces a new benchmark and modeling framework designed to evaluate how embodied AI agents maintain state and execute complex, long-horizon tasks over extended periods.
DeFAb: A New Benchmark for Defeasible Abduction in LLMs
DeFAb is a new, verifiable benchmark designed to test how well foundation models handle defeasible abduction—the ability to form logical explanations that can be retracted or revised in light of new, contradictory inform…
CEO-Bench: Measuring Long-Term Strategic Reasoning in AI Agents
CEO-Bench is a new evaluation framework designed to test whether AI agents can maintain strategic coherence and decision-making over extended, multi-step business scenarios.
Vercel's Eve: A Filesystem-First Framework for AI Agents
Vercel has released Eve, an open-source framework that treats AI agents as directories of files, mapping specific capabilities like tools, skills, and schedules to file paths to eliminate boilerplate and production plumb…
The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache
KV cache compression is the new frontier for scaling LLM inference, with TurboQuant, OSCAR, and EpiCache offering distinct strategies to balance memory footprint against model accuracy.
The Shift Toward User-Controlled AI Recommendation Algorithms
Major social platforms are moving from opaque, one-size-fits-all algorithms to user-tunable systems, leveraging LLMs to allow granular control over feed content.
Building AI Agents with Google's Agent Development Kit (ADK)
A practical walkthrough on using Google's Agent Development Kit (ADK) to build autonomous agents that can interact with text-based environments, specifically demonstrated through a retro-inspired adventure game.
Solving the Physical AI Data Bottleneck
XDOF is building the infrastructure for physical AI by providing the high-fidelity, large-scale training data that robotics models currently lack, moving beyond the limitations of low-quality video data.
Pramaana Labs Uses Formal Verification to Secure Enterprise AI
Pramaana Labs raised $27M to integrate formal verification, using the Lean programming language, with LLMs to ensure deterministic, error-free outputs in high-stakes fields like tax, law, and drug discovery.
Predicting AI Model Behavior via Deployment Simulation
OpenAI uses 'Deployment Simulation'—replaying real, de-identified user conversations with new models—to predict safety risks and undesired behaviors before public release, outperforming traditional synthetic evaluations.
Verbal Reinforcement Learning: Closing the Feedback Loop
The paper introduces a framework for 'Verbal Reinforcement Learning' (VRL), shifting from raw reward signals to structured insight governance by extracting and managing verbal feedback from world interactions.
DeepInsight: Evaluating the Physical AI Stack
DeepInsight proposes a unified infrastructure for evaluating AI systems across the entire physical stack, addressing the fragmentation in current performance assessment methodologies.
Foundation Model Orchestrated Workflows for Engineering Design
This research introduces a surrogate-assisted design workflow for pedestrian protection systems, using foundation models to orchestrate complex simulation and optimization tasks.
SEAGym: A Benchmark for Self-Evolving LLM Agents
SEAGym provides a standardized evaluation environment designed to measure the capabilities of self-evolving LLM agents, focusing on their ability to autonomously improve performance over time.
Benchmarking LLM Strategic Decision-Making in Corporate Simulations
This research evaluates the efficacy of LLMs in executive leadership roles by simulating multi-role corporate environments to test their ability to perform strategic resource reallocation.
Analyzing AI Model Behavior via Agent Trajectories
This paper provides a comprehensive 106-page framework for evaluating LLM behavior by analyzing the sequential decision-making paths (trajectories) agents take when solving complex tasks, rather than just looking at fina…
Incumbent Advantage: Brand Bias in LLM Recommendation Systems
LLMs exhibit significant brand bias, disproportionately recommending incumbent products regardless of quality, creating a 'rich-get-richer' feedback loop that threatens market competition.
Architecting Distributed General-Purpose Agent Networks
The paper proposes a framework for distributed agent networks, shifting from monolithic AI systems to decentralized, collaborative architectures that improve scalability and task specialization.
MemTrace: Beyond Final Accuracy in LLM Long-Term Memory
MemTrace is a diagnostic framework designed to evaluate LLM long-term memory beyond simple accuracy metrics, focusing on the underlying mechanisms of information retention and retrieval over time.
SpeechDx: A Multi-Task Benchmark for Clinical Speech AI
SpeechDx is a new multi-task benchmark designed to evaluate AI models on clinical speech analysis, addressing the need for standardized, robust performance metrics in medical diagnostics.
Improving Agentic Search via Diverse Query Initialization
The paper proposes moving beyond simple parallel sampling in agentic search by implementing diverse query initialization, which improves retrieval performance by covering a broader semantic space.
Qwen-RobotSuite: Three Foundation Models for Embodied AI
The Qwen team has released a suite of three specialized foundation models—RobotManip, RobotWorld, and RobotNav—designed to address data fragmentation in robotics through unified action representations, language-condition…
OpenAI's Deployment Simulation for Agentic Coding Risk Assessment
OpenAI has introduced a deployment simulation framework that uses simulated tool calls to evaluate the safety and reliability of agentic coding systems before they are deployed in real-world environments.
MiniMax Sparse Attention: Scaling Long Context with Block-Sparsity
MiniMax Sparse Attention (MSA) reduces the quadratic cost of long-context attention by using a two-branch, block-sparse approach that selects key-value blocks via a learned indexer, maintaining performance while fixing c…
Pinterest Pivots to Conversational AI Shopping
Pinterest is testing 'Ask Pinterest,' a standalone AI-powered shopping app that uses its 'Taste Graph' data to provide personalized, conversational recommendations for complex, multi-step consumer queries.
AI Agents, Patch Avalanches, and the New Era of Cyber Resilience
As AI agents begin automating password management and vulnerability discovery, security teams must shift from a mindset of total prevention to one of risk-based prioritization and cyber resilience.
Building Multi-Agent Systems with ADK and A2A
The Agent Development Kit (ADK) and Agent2Agent (A2A) protocol enable specialized AI agents to collaborate on complex tasks, using an orchestration layer to resolve conflicts and incorporate human-in-the-loop decision-ma…
Reducing AI Hallucinations via Harness Engineering
Startup 'Probably' raised $9M to move AI reliability work from a model-centric to a harness-centric approach, using deterministic validators to enable smaller, cheaper, and more accurate models.
Optimizing Video Diffusion for Real-Time Generation
Achieve real-time video generation by stacking quantization, caching, and step distillation to cut the standard 50-step denoising process to between 1 and 8 steps.
Mask-Proof: Automated Data Curation for Mathematical Proofs
Mask-Proof is an LLM-based pipeline designed to automate the curation of high-quality mathematical proof data, addressing the scarcity of reliable training sets for formal reasoning models.
Visual-Seeker: Active Visual Reasoning for Multimodal Agents
Visual-Seeker introduces a visual-native agentic search framework that moves beyond text-based retrieval by employing active visual reasoning to navigate and interpret complex multimodal environments.
CogGuard: Proactive Monitoring for Edge Intelligent Services
CogGuard is a framework designed to improve the reliability of edge-based AI services by integrating cognitive and operational profiling to predict and mitigate system failures before they occur.
CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG
CONCORD is a framework for device-cloud Retrieval-Augmented Generation that optimizes performance under document isolation by using asynchronous sparse aggregation to balance local privacy with cloud-scale retrieval.
Verifiable Agentic Data Science via Tool-Grounded Reasoning
To solve complex, irregular Time-Series Question Answering (TSQA) tasks, agents must move beyond pure generation toward tool-grounded reasoning that enforces verifiable, step-by-step execution.
Cognitive Debt: The Hidden Fragility of AI-Augmented Systems
The paper introduces 'Cognitive Debt' as a framework to explain how AI-driven intellectual leverage creates systemic fragility by offloading critical reasoning to models, leading to a loss of human oversight and domain e…
Evaluating LLM Judge Reliability via Subset Selection
The 'Metric Match' approach improves LLM judge evaluation by using subset selection to identify high-fidelity data samples, ensuring that automated metrics better correlate with human preferences.
Measuring Trust Dynamics in Multi-Agent AI Systems
This research provides a framework for quantifying how AI agents form, break, and recover trust, offering essential insights for the governance of autonomous multi-agent systems.
PrologMCP: Standardizing Logic-Based Tooling for LLM Agents
PrologMCP provides a standardized interface for LLM agents to interact with Prolog knowledge bases, enabling more reliable symbolic reasoning and complex constraint satisfaction in AI workflows.
Scaling Agentic Search with Dynamic Workspace Expansion
DR-DCI improves agentic search by combining retriever-based scalability with local terminal-style operations, allowing agents to dynamically pull documents into a workspace for precise analysis.
Sakana Marlin: Autonomous Enterprise Research via AB-MCTS
Sakana AI's Marlin is an enterprise research agent that uses Adaptive Branching Monte Carlo Tree Search (AB-MCTS) to autonomously generate 60–100 page research reports over 8-hour sessions.
Building Dynamic Experiences with GenUI and Agentic Workflows
GenUI (Agent-to-UI) enables applications to generate custom user interfaces on-demand using Gemini, allowing for real-time personalization that goes beyond static design.
Revisiting the Link Between AI Literacy and Usage
The paper challenges the assumption that lower digital literacy correlates with higher AI usage, suggesting instead that 'adoption breadth'—the variety of tools used—is a more accurate metric for understanding AI engagem…
Comparing Diff-in-Means and INLP for LLM Refusal Mechanisms
This paper evaluates two common techniques—Difference-in-Means and Iterative Null-space Projection (INLP)—for identifying and mitigating refusal behaviors in LLMs, highlighting the limitations of assuming refusal is a si…
Hybrid Open-Ended Tri-Evolution for Deep Research Agents
The paper introduces a 'Hybrid Open-Ended Tri-Evolution' framework to improve the performance of deep research AI agents by optimizing their exploration and reasoning capabilities.
Orchestra-o1: A Framework for Omnimodal Agent Orchestration
Orchestra-o1 introduces a specialized architecture for coordinating omnimodal AI agents, enabling them to process and act across diverse data modalities in complex, multi-step tasks.
Hands-On Guide to FineWeb Corpus Processing and Analytics
Learn to stream, filter, deduplicate, and analyze large-scale web datasets like FineWeb using Python, MinHash, and tiktoken to prepare high-quality data for LLM training.
Z.ai Releases GLM-5.2 with 1M-Token Context for Coding Agents
Z.ai's new GLM-5.2 model introduces a 1M-token context window and variable 'thinking-effort' levels, enabling coding agents to process entire mid-sized repositories without needing constant summarization.
Building Functional Personas with AI for User-Centric Decisions
Move beyond static, demographic-heavy personas by using AI to synthesize research into 'functional' personas focused on user goals, tasks, and objections, then making them interactive via custom chatbots.
Leveraging Public AI Incident Databases for Risk Mitigation
Organizations deploying AI can proactively identify and mitigate risks by querying nine public databases that catalog documented AI failures, ranging from deepfake fraud to algorithmic bias.
Scaling RAG Pipelines to 10M+ Documents with High Accuracy
To minimize hallucinations at scale, implement a multi-stage RAG pipeline that combines hybrid indexing, reciprocal rank fusion, and a strict 'retrieve, constrain, verify, abstain' workflow that forces the model to cite …
Integrating Design Systems with AI via Model Context Protocol
By using the Model Context Protocol (MCP) to feed design system rules into AI agents, developers can ensure AI-generated code remains consistent, brand-compliant, and architecturally sound.
Scaling AI Adoption Through Structured Workforce Training
OpenAI has launched three new Academy courses designed to move organizations from basic AI experimentation to building repeatable, agent-assisted workflows.
Google's Gemini-SQL2 Sets New BIRD Benchmark Record
Google's Gemini-SQL2, powered by Gemini 3.1 Pro, achieved 80.04% execution accuracy on the BIRD text-to-SQL benchmark, outperforming all other single-model entries.
Moonshot AI Releases Kimi K2.7-Code: Agentic Coding Model
Moonshot AI's new K2.7-Code model improves coding benchmarks by up to 31.5% over its predecessor while reducing reasoning-token usage by 30%, optimizing both performance and cost for long-horizon software engineering tas…
OpenAI Acquires Ona to Enable Persistent AI Agent Workflows
OpenAI is acquiring Ona to integrate secure, cloud-based execution environments into Codex, allowing AI agents to perform long-running, autonomous tasks within customer-controlled infrastructure.
The Containment Gap in Agentic AI Frameworks
Current agentic AI frameworks lack the necessary architectural guardrails to meet public-facing safety requirements, creating a 'containment gap' between development environments and production deployment.
A Tutorial on World Models and Physical AI
A tutorial on world models and physical AI, covering how AI agents can learn internal representations of the physical world to improve decision-making and interaction.
Formalizing Theory of Mind for AI Agents
The article proposes a formal mathematical specification for a 'Theory of Mind' (ToM) mechanism, enabling AI agents to model and predict the mental states of other agents to improve collaborative decision-making.
Predicting Query-Level Rejection Risk in Clinical LLM Systems
A framework for deployment-centered evaluation that identifies high-risk clinical queries, allowing systems to proactively reject unsafe or unreliable LLM outputs before they reach the user.
Evoflux: Optimizing Agent Workflows via Inference-Time Evolution
Evoflux improves compact AI agent performance by evolving executable tool workflows at inference time, allowing smaller models to solve complex tasks without massive parameter counts.
ToolSense: A Diagnostic Framework for Auditing LLM Tool Knowledge
ToolSense provides a structured diagnostic framework to audit how effectively LLMs understand and utilize external tools, moving beyond simple prompt testing to evaluate parametric tool knowledge.
Arbor: Enhancing Agent Cognition via Tree Search
Arbor introduces a tree search-based cognition layer for autonomous agents, enabling more robust decision-making by systematically exploring action paths rather than relying solely on single-step inference.
Perplexity Integrates Deep Research into 'Computer' Orchestration
Perplexity has moved its Deep Research feature into 'Computer,' a multi-model orchestration system that breaks complex queries into subtasks and routes them across 20+ frontier models to generate reports, decks, and dash…
Zamba2-VL: Hybrid Mamba2-Transformer Vision-Language Models
Zyphra's Zamba2-VL models use a hybrid Mamba2-Transformer architecture to achieve near-linear time prefill and significantly lower time-to-first-token compared to dense Transformer-based VLMs.
Claude Fable 5, Agentic Payments, and Self-Improving Products
Anthropic's Fable 5 model sets new benchmarks in coding, while emerging agentic payment protocols and self-improving product loops like Amplitude Wave signal a shift toward autonomous software development.
The Reality Check: AI Costs, Routing, and Cloud Shifts
As AI moves from hype to production, companies are shifting toward tiered routing to manage costs and capacity, while hardware limitations are forcing a pivot from pure on-device AI to hybrid cloud architectures.
Avataar AI's Varya: A Low-Cost, Culturally Aware Video Model
Avataar AI has launched Varya, a distilled, high-speed video generation model optimized for the Indian market, offering a 20x price reduction compared to global competitors by focusing on efficiency and cultural relevanc…
Building Agent-Ready Websites with WebMCP
WebMCP is a proposed web standard that allows developers to expose site functionality as structured tools for AI agents, replacing brittle screen-scraping with direct, reliable API-like interactions.
OpenAI's Multi-Layered Approach to AI Content Provenance
OpenAI is adopting the EU Code of Practice on Transparency of AI-Generated Content, utilizing a multi-layered strategy that combines C2PA metadata, watermarking, and public verification tools to improve digital content t…
SVoT: Enhancing Spatial Reasoning via State-Aware Visualization
SVoT improves spatial reasoning in LLMs by using reinforcement learning to generate state-aware visual representations of thought, allowing models to track complex spatial relationships more accurately than text-only cha…
Recursive Reasoning for Theory of Mind in AI
The paper proposes that improving AI's Theory of Mind requires recursive perspective-taking, so that models represent the mental states of others rather than relying on static pattern matching.
Securing Continuous Data Summarization Against Adversarial Attacks
This paper addresses vulnerabilities in continuous data summarization systems by identifying multi-target adversarial attack vectors and proposing robust defense mechanisms to ensure AI trustworthiness.
Hierarchical Memory Navigation for Efficient AI Agents
The paper introduces a hierarchical memory structure that improves agent efficiency by organizing information before retrieval, moving beyond simple flat vector search.
Lung-R1: Enhancing Pulmonary Diagnostics with Knowledge Graphs
Lung-R1 improves diagnostic accuracy in pulmonary medicine by integrating structured knowledge graphs with LLMs, reducing hallucinations and improving clinical reasoning.