TOPIC · 764 summaries

AI & LLMs

The deepest channel on Edge. Foundation models, agent architectures, retrieval systems, evals, and the moving line between research and production.

This pillar covers the work that determines what AI products can actually do. New model releases get filed here when they shift capability or cost in a meaningful way, alongside the harder material from the labs and the practitioners who turn it into shipping software. Read it for primary sources rather than recap blogs: lab papers and notes, retrieval benchmarks, agent traces, eval methodology, and the long-form essays that hold up six months later.

Two threads run through everything filed here. The first is what is genuinely new at the model layer: capability cliffs, training recipes, alignment work, the shape of the next deployment cycle. The second is what works in production: which patterns of context engineering and tool use compound across teams, where retrieval beats fine-tuning and where it loses, what the operational tax of running an agentic system actually looks like.

The summaries below are sorted by recency. The pillar refreshes as new entries land.

TechCrunch — AI

In the Weights: Measuring Your Digital Presence in AI Models

In the Weights is a new tool that evaluates how well various LLMs recall specific individuals without web search, effectively serving as a modern, AI-centric vanity search.

MarkTechPost

VibeThinker-3B: High-Performance Reasoning at 3B Parameters

VibeThinker-3B is a compact, open-source reasoning model that achieves performance comparable to massive models on math and coding tasks by using a specialized 'Spectrum-to-Signal' post-training pipeline.

MarkTechPost

SpatialClaw: Using Code as an Action Interface for Spatial Reasoning

SpatialClaw is a training-free agent framework that improves spatial reasoning in VLMs by treating Python code—rather than structured tool calls—as the primary interface for perception and geometric tasks.

Google Cloud Tech

Building Complex Software with Long-Running AI Agents

Long-running AI agents can execute multi-day, complex engineering pipelines—such as building an OS or optimizing 3D web scenes—by self-correcting through dependent tasks rather than relying on single-prompt generation.

Level Up Coding

Optimizing AI Apps with LLM Routing

Stop relying on a single 'best' model. Implementing an LLM router allows you to dynamically match requests to models based on cost, latency, and task complexity, ensuring production stability and efficiency.
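As a rough illustration of the routing idea in this summary, the sketch below picks the cheapest model that can handle a request within a latency budget. The model names, prices, latencies, and the keyword-based complexity heuristic are all hypothetical and are not taken from the source article; a production router would likely use a learned classifier and measured figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str            # hypothetical model identifier
    cost_per_1k: float   # assumed price per 1K tokens (arbitrary units)
    latency_ms: int      # assumed typical latency
    max_complexity: int  # highest task complexity (1-3) it handles well


# Hypothetical model catalogue; all numbers are illustrative only.
MODELS = [
    ModelOption("small-fast", 0.1, 200, 1),
    ModelOption("mid-tier", 0.5, 600, 2),
    ModelOption("large-reasoning", 3.0, 2500, 3),
]


def estimate_complexity(prompt: str) -> int:
    """Crude stand-in for a complexity classifier: keywords, then length."""
    text = prompt.lower()
    if any(k in text for k in ("refactor", "debug", "multi-step")):
        return 3
    if len(text.split()) > 50:
        return 2
    return 1


def route(prompt: str, latency_budget_ms: int = 5000) -> ModelOption:
    """Return the cheapest model that meets the complexity and latency needs."""
    need = estimate_complexity(prompt)
    candidates = [
        m for m in MODELS
        if m.max_complexity >= need and m.latency_ms <= latency_budget_ms
    ]
    if not candidates:
        # Nothing fits the budget: fall back to the most capable model.
        return max(MODELS, key=lambda m: m.max_complexity)
    return min(candidates, key=lambda m: m.cost_per_1k)
```

Calling `route("What is the capital of France?")` selects the cheapest tier, while a prompt containing "debug" is sent to the most capable one.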

Google Cloud Tech

Governing AI Agents with Looker and MCP

By using the Model Context Protocol (MCP) to connect AI agents to Looker's semantic layer, developers can replace fragile raw SQL generation with governed, model-aware data interactions.

arXiv cs.AI

GLARE: Natural Language Interfaces for Global Model Explanations

GLARE provides a natural language interface for querying global model explanations, allowing users to interpret complex AI behavior through conversational prompts rather than static visualizations.

arXiv cs.AI

Moving Beyond Static Leaderboards for LLM Agent Evaluation

Static benchmarks often fail to predict real-world performance for LLM agents; the authors propose a framework focused on predictive validity to better align evaluation with practical utility.

arXiv cs.AI

The Symbiotic Evolution of AI and Software Engineering

The intersection of AI and Software Engineering (AI4SE and SE4AI) has matured over the last decade, shifting from experimental research to essential production-grade methodologies for building, testing, and maintaining c…

arXiv cs.AI

Toten: Ontological Tokenization for Technical Portuguese

Toten is a knowledge-based tokenization framework designed to accurately parse physical quantities and technical notation in Brazilian Portuguese, addressing common failures in standard NLP tokenizers.

arXiv cs.AI

Optimizing LLM Post-Training Through Pairwise Comparison Selection

The paper investigates how the selection of response pairs in preference-based post-training (like DPO or PPO) impacts model performance, suggesting that strategic pair selection is as critical as the training algorithm …

arXiv cs.AI

Configurable Clinical Information Extraction with Agentic RAG

Agentic RAG systems for clinical data require modular configuration to balance precision and recall, as monolithic pipelines often fail to handle the high variability of medical documentation.

arXiv cs.AI

Detecting LLM Epistemic Blind Spots via Cross-Model Attribution

LLMs often hallucinate confidence in clinical settings. This paper introduces a method using Cross-Model Attribution Divergence (CMAD) to identify when models rely on unreliable features, effectively flagging epistemic u…

arXiv cs.AI

Deontic Policies for Runtime Governance of Agentic AI

The paper proposes using deontic logic—a system of formal rules defining obligations, permissions, and prohibitions—to govern the runtime behavior of autonomous AI agents.

MarkTechPost

Building Reliable AI Code Generation Pipelines with Salesforce CodeGen

To move AI-generated code from prototype to production, implement a multi-stage pipeline that includes automated unit testing, safety sandboxing, and model-based reranking to filter out hallucinated or insecure outputs.

MarkTechPost

Perplexity Brain: Self-Improving Memory for AI Agents

Perplexity's 'Brain' system shifts AI memory from user-centric profiles to agent-centric performance, using an overnight context graph to learn from past tasks, failures, and corrections to improve future efficiency.

MarkTechPost

Liquid AI's New 350M Multilingual Retrieval Models

Liquid AI has released LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M, two efficient, bidirectional retrieval models optimized for multilingual search across 11 languages.

IBM Technology

The Rise of Agentic Traffic and Microsoft's Model Strategy

Agentic AI bots now dominate web traffic, signaling a shift in how we interact with information. Meanwhile, Microsoft is pivoting to first-party models, prioritizing safety and cost-efficiency for enterprise users.

Google Cloud Tech

Architecting Long-Running AI Agents for Multi-Day Workflows

Move beyond stateless chatbots by implementing event-driven dormancy, durable checkpointing, and decoupled evaluation to manage complex, multi-day workflows.

Google Cloud Tech

Building AI Agents with Model Context Protocol (MCP)

The Model Context Protocol (MCP) acts as a universal adapter, allowing AI agents to securely interact with external tools and live data via a standardized input/output interface, decoupling agent logic from tool implemen…

AI Engineer

The Production AI Playbook: Deploying Agents at Enterprise Scale

Moving AI from demo to production requires shifting focus from model selection to five pillars: evaluation, observability, data foundation, orchestration, and governance.

arXiv cs.AI

RODS: Improving Multi-Turn Tool-Use Agents via Reward-Driven Synthesis

RODS (Reward-Driven Online Data Synthesis) improves multi-turn tool-use agents by generating high-quality synthetic training data through iterative reward-based filtering, addressing the scarcity of complex, multi-step i…

arXiv cs.AI

Decoupling Search from Reasoning in LLM Agents

Native search grounding in LLMs creates rigid, expensive, and opaque agent architectures. Moving to a Decoupled Search Grounding (DSG) layer allows for vendor-agnostic control over retrieval, caching, and cost, while mai…

arXiv cs.AI

SciRisk-Bench: Evaluating Safety in AI for Science

SciRisk-Bench is a new benchmark designed to evaluate the safety risks of AI models specifically applied to scientific research, focusing on multi-dimensional risk assessment.

arXiv cs.AI

Skill-Guided Continuation Distillation for GUI Agents

The paper introduces a method to improve GUI agent performance by distilling complex task trajectories into modular, skill-based sub-tasks, enhancing generalization and execution reliability.

arXiv cs.AI

Improving AI Scientist Reliability via Research Harnesses

The paper proposes a 'Research Harness' to externalize synthesis and validation, addressing the reliability issues inherent in autonomous AI research agents.

arXiv cs.AI

WorldLines: Benchmarking Long-Horizon Stateful Embodied Agents

WorldLines introduces a new benchmark and modeling framework designed to evaluate how embodied AI agents maintain state and execute complex, long-horizon tasks over extended periods.

arXiv cs.AI

DeFAb: A New Benchmark for Defeasible Abduction in LLMs

DeFAb is a new, verifiable benchmark designed to test how well foundation models handle defeasible abduction—the ability to form logical explanations that can be retracted or revised in light of new, contradictory inform…

arXiv cs.AI

CEO-Bench: Measuring Long-Term Strategic Reasoning in AI Agents

CEO-Bench is a new evaluation framework designed to test whether AI agents can maintain strategic coherence and decision-making over extended, multi-step business scenarios.

MarkTechPost

Vercel's Eve: A Filesystem-First Framework for AI Agents

Vercel has released Eve, an open-source framework that treats AI agents as directories of files, mapping specific capabilities like tools, skills, and schedules to file paths to eliminate boilerplate and production plumb…

MarkTechPost

The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache

KV cache compression is the new frontier for scaling LLM inference, with TurboQuant, OSCAR, and EpiCache offering distinct strategies to balance memory footprint against model accuracy.

TechCrunch — AI

The Shift Toward User-Controlled AI Recommendation Algorithms

Major social platforms are moving from opaque, one-size-fits-all algorithms to user-tunable systems, leveraging LLMs to allow granular control over feed content.

Google Cloud Tech

Building AI Agents with Google's Agent Development Kit (ADK)

A practical walkthrough on using Google's Agent Development Kit (ADK) to build autonomous agents that can interact with text-based environments, specifically demonstrated through a retro-inspired adventure game.

TechCrunch — AI

Solving the Physical AI Data Bottleneck

XDOF is building the infrastructure for physical AI by providing the high-fidelity, large-scale training data that robotics models currently lack, moving beyond the limitations of low-quality video data.

TechCrunch — AI

Pramaana Labs Uses Formal Verification to Secure Enterprise AI

Pramaana Labs raised $27M to integrate formal verification, using the Lean programming language, with LLMs to ensure deterministic, error-free outputs in high-stakes fields like tax, law, and drug discovery.

OpenAI News

Predicting AI Model Behavior via Deployment Simulation

OpenAI uses 'Deployment Simulation'—replaying real, de-identified user conversations with new models—to predict safety risks and undesired behaviors before public release, outperforming traditional synthetic evaluations.

arXiv cs.AI

Verbal Reinforcement Learning: Closing the Feedback Loop

The paper introduces a framework for 'Verbal Reinforcement Learning' (VRL), shifting from raw reward signals to structured insight governance by extracting and managing verbal feedback from world interactions.

arXiv cs.AI

DeepInsight: Evaluating the Physical AI Stack

DeepInsight proposes a unified infrastructure for evaluating AI systems across the entire physical stack, addressing the fragmentation in current performance assessment methodologies.

arXiv cs.AI

Foundation Model Orchestrated Workflows for Engineering Design

This research introduces a surrogate-assisted design workflow for pedestrian protection systems, using foundation models to orchestrate complex simulation and optimization tasks.

arXiv cs.AI

SEAGym: A Benchmark for Self-Evolving LLM Agents

SEAGym provides a standardized evaluation environment designed to measure the capabilities of self-evolving LLM agents, focusing on their ability to autonomously improve performance over time.

arXiv cs.AI

Benchmarking LLM Strategic Decision-Making in Corporate Simulations

This research evaluates the efficacy of LLMs in executive leadership roles by simulating multi-role corporate environments to test their ability to perform strategic resource reallocation.

arXiv cs.AI

Analyzing AI Model Behavior via Agent Trajectories

This paper provides a comprehensive 106-page framework for evaluating LLM behavior by analyzing the sequential decision-making paths (trajectories) agents take when solving complex tasks, rather than just looking at fina…

arXiv cs.AI

Incumbent Advantage: Brand Bias in LLM Recommendation Systems

LLMs exhibit significant brand bias, disproportionately recommending incumbent products regardless of quality, creating a 'rich-get-richer' feedback loop that threatens market competition.

arXiv cs.AI

Architecting Distributed General-Purpose Agent Networks

The paper proposes a framework for distributed agent networks, shifting from monolithic AI systems to decentralized, collaborative architectures that improve scalability and task specialization.

arXiv cs.AI

MemTrace: Beyond Final Accuracy in LLM Long-Term Memory

MemTrace is a diagnostic framework designed to evaluate LLM long-term memory beyond simple accuracy metrics, focusing on the underlying mechanisms of information retention and retrieval over time.

arXiv cs.AI

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

SpeechDx is a new multi-task benchmark designed to evaluate AI models on clinical speech analysis, addressing the need for standardized, robust performance metrics in medical diagnostics.

arXiv cs.AI

Improving Agentic Search via Diverse Query Initialization

The paper proposes moving beyond simple parallel sampling in agentic search by implementing diverse query initialization, which improves retrieval performance by covering a broader semantic space.

MarkTechPost

Qwen-RobotSuite: Three Foundation Models for Embodied AI

The Qwen team has released a suite of three specialized foundation models—RobotManip, RobotWorld, and RobotNav—designed to address data fragmentation in robotics through unified action representations, language-condition…

MarkTechPost

OpenAI's Deployment Simulation for Agentic Coding Risk Assessment

OpenAI has introduced a deployment simulation framework that uses simulated tool calls to evaluate the safety and reliability of agentic coding systems before they are deployed in real-world environments.

MarkTechPost

MiniMax Sparse Attention: Scaling Long Context with Block-Sparsity

MiniMax Sparse Attention (MSA) reduces the quadratic cost of long-context attention by using a two-branch, block-sparse approach that selects key-value blocks via a learned indexer, maintaining performance while fixing c…

TechCrunch — AI

Pinterest Pivots to Conversational AI Shopping

Pinterest is testing 'Ask Pinterest,' a standalone AI-powered shopping app that uses its 'Taste Graph' data to provide personalized, conversational recommendations for complex, multi-step consumer queries.

IBM Technology

AI Agents, Patch Avalanches, and the New Era of Cyber Resilience

As AI agents begin automating password management and vulnerability discovery, security teams must shift from a mindset of total prevention to one of risk-based prioritization and cyber resilience.

Google Cloud Tech

Building Multi-Agent Systems with ADK and A2A

The Agent Development Kit (ADK) and Agent2Agent (A2A) protocol enable specialized AI agents to collaborate on complex tasks, using an orchestration layer to resolve conflicts and incorporate human-in-the-loop decision-ma…

TechCrunch — AI

Reducing AI Hallucinations via Harness Engineering

Startup 'Probably' raised $9M to shift AI reliability from model-centric to harness-centric, using deterministic validators to enable smaller, cheaper, and more accurate models.

AI Engineer

Optimizing Video Diffusion for Real-Time Generation

Achieve real-time video generation by stacking quantization, caching, and step distillation to cut the standard 50-step denoising process to between 1 and 8 steps.

arXiv cs.AI

Mask-Proof: Automated Data Curation for Mathematical Proofs

Mask-Proof is an LLM-based pipeline designed to automate the curation of high-quality mathematical proof data, addressing the scarcity of reliable training sets for formal reasoning models.

arXiv cs.AI

Visual-Seeker: Active Visual Reasoning for Multimodal Agents

Visual-Seeker introduces a visual-native agentic search framework that moves beyond text-based retrieval by employing active visual reasoning to navigate and interpret complex multimodal environments.

arXiv cs.AI

CogGuard: Proactive Monitoring for Edge Intelligent Services

CogGuard is a framework designed to improve the reliability of edge-based AI services by integrating cognitive and operational profiling to predict and mitigate system failures before they occur.

arXiv cs.AI

CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG

CONCORD is a framework for device-cloud Retrieval-Augmented Generation that optimizes performance under document isolation by using asynchronous sparse aggregation to balance local privacy with cloud-scale retrieval.

arXiv cs.AI

Verifiable Agentic Data Science via Tool-Grounded Reasoning

To solve complex, irregular Time-Series Question Answering (TSQA), agents must move beyond pure generation toward tool-grounded reasoning that enforces verifiable, step-by-step execution.

arXiv cs.AI

Cognitive Debt: The Hidden Fragility of AI-Augmented Systems

The paper introduces 'Cognitive Debt' as a framework to explain how AI-driven intellectual leverage creates systemic fragility by offloading critical reasoning to models, leading to a loss of human oversight and domain e…

arXiv cs.AI

Evaluating LLM Judge Reliability via Subset Selection

The 'Metric Match' approach improves LLM judge evaluation by using subset selection to identify high-fidelity data samples, ensuring that automated metrics better correlate with human preferences.

arXiv cs.AI

Measuring Trust Dynamics in Multi-Agent AI Systems

This research provides a framework for quantifying how AI agents form, break, and recover trust, offering essential insights for the governance of autonomous multi-agent systems.

arXiv cs.AI

PrologMCP: Standardizing Logic-Based Tooling for LLM Agents

PrologMCP provides a standardized interface for LLM agents to interact with Prolog knowledge bases, enabling more reliable symbolic reasoning and complex constraint satisfaction in AI workflows.

arXiv cs.AI

Scaling Agentic Search with Dynamic Workspace Expansion

DR-DCI improves agentic search by combining retriever-based scalability with local terminal-style operations, allowing agents to dynamically pull documents into a workspace for precise analysis.

MarkTechPost

Sakana Marlin: Autonomous Enterprise Research via AB-MCTS

Sakana AI's Marlin is an enterprise research agent that uses Adaptive Branching Monte Carlo Tree Search (AB-MCTS) to autonomously generate 60–100 page research reports over 8-hour sessions.

Google Cloud Tech

Building Dynamic Experiences with GenUI and Agentic Workflows

GenUI (Agent-to-UI) enables applications to generate custom user interfaces on-demand using Gemini, allowing for real-time personalization that goes beyond static design.

arXiv cs.AI

Revisiting the Link Between AI Literacy and Usage

The paper challenges the assumption that lower digital literacy correlates with higher AI usage, suggesting instead that 'adoption breadth'—the variety of tools used—is a more accurate metric for understanding AI engagem…

arXiv cs.AI

Comparing Diff-in-Means and INLP for LLM Refusal Mechanisms

This paper evaluates two common techniques—Difference-in-Means and Iterative Null-space Projection (INLP)—for identifying and mitigating refusal behaviors in LLMs, highlighting the limitations of assuming refusal is a si…

arXiv cs.AI

Hybrid Open-Ended Tri-Evolution for Deep Research Agents

The paper introduces a 'Hybrid Open-Ended Tri-Evolution' framework to improve the performance of deep research AI agents by optimizing their exploration and reasoning capabilities.

arXiv cs.AI

Orchestra-o1: A Framework for Omnimodal Agent Orchestration

Orchestra-o1 introduces a specialized architecture for coordinating omnimodal AI agents, enabling them to process and act across diverse data modalities in complex, multi-step tasks.

MarkTechPost

Hands-On Guide to FineWeb Corpus Processing and Analytics

Learn to stream, filter, deduplicate, and analyze large-scale web datasets like FineWeb using Python, MinHash, and tiktoken to prepare high-quality data for LLM training.

MarkTechPost

Z.ai Releases GLM-5.2 with 1M-Token Context for Coding Agents

Z.ai's new GLM-5.2 model introduces a 1M-token context window and variable 'thinking-effort' levels, enabling coding agents to process entire mid-sized repositories without needing constant summarization.

Smashing Magazine

Building Functional Personas with AI for User-Centric Decisions

Move beyond static, demographic-heavy personas by using AI to synthesize research into 'functional' personas focused on user goals, tasks, and objections, then making them interactive via custom chatbots.

Level Up Coding

Leveraging Public AI Incident Databases for Risk Mitigation

Organizations deploying AI can proactively identify and mitigate risks by querying nine public databases that catalog documented AI failures, ranging from deepfake fraud to algorithmic bias.

Level Up Coding

Scaling RAG Pipelines to 10M+ Documents with High Accuracy

To minimize hallucinations at scale, implement a multi-stage RAG pipeline that combines hybrid indexing, reciprocal rank fusion, and a strict 'retrieve, constrain, verify, abstain' workflow that forces the model to cite …
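The reciprocal rank fusion step named in this summary can be sketched in a few lines. This is a minimal illustration, not the article's pipeline: the function name and document IDs are hypothetical, and `k=60` is the commonly used smoothing constant rather than a value stated in the source.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword index and a vector index.
keyword_hits = ["a", "b", "c"]
vector_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that appear in both lists ("b" and "c") rise above those found by only one retriever, which is the property that makes the fusion useful for hybrid indexing.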

IBM Technology

Integrating Design Systems with AI via Model Context Protocol

By using the Model Context Protocol (MCP) to feed design system rules into AI agents, developers can ensure AI-generated code remains consistent, brand-compliant, and architecturally sound.

OpenAI News

Scaling AI Adoption Through Structured Workforce Training

OpenAI has launched three new Academy courses designed to move organizations from basic AI experimentation to building repeatable, agent-assisted workflows.

MarkTechPost

Google's Gemini-SQL2 Sets New BIRD Benchmark Record

Google's Gemini-SQL2, powered by Gemini 3.1 Pro, achieved 80.04% execution accuracy on the BIRD text-to-SQL benchmark, outperforming all other single-model entries.

MarkTechPost

Moonshot AI Releases Kimi K2.7-Code: Agentic Coding Model

Moonshot AI's new K2.7-Code model improves coding benchmarks by up to 31.5% over its predecessor while reducing reasoning-token usage by 30%, optimizing both performance and cost for long-horizon software engineering tas…

OpenAI News

OpenAI Acquires Ona to Enable Persistent AI Agent Workflows

OpenAI is acquiring Ona to integrate secure, cloud-based execution environments into Codex, allowing AI agents to perform long-running, autonomous tasks within customer-controlled infrastructure.

arXiv cs.AI

The Containment Gap in Agentic AI Frameworks

Current agentic AI frameworks lack the necessary architectural guardrails to meet public-facing safety requirements, creating a 'containment gap' between development environments and production deployment.

arXiv cs.AI

A Tutorial on World Models and Physical AI

Only a placeholder was available for this tutorial on World Models and Physical AI, which explores how AI agents can learn internal representations of the physical world to improve decision-making and interaction.

arXiv cs.AI

Formalizing Theory of Mind for AI Agents

The article proposes a formal mathematical specification for a 'Theory of Mind' (ToM) mechanism, enabling AI agents to model and predict the mental states of other agents to improve collaborative decision-making.

arXiv cs.AI

Predicting Query-Level Rejection Risk in Clinical LLM Systems

A framework for deployment-centered evaluation that identifies high-risk clinical queries, allowing systems to proactively reject unsafe or unreliable LLM outputs before they reach the user.

arXiv cs.AI

Evoflux: Optimizing Agent Workflows via Inference-Time Evolution

Evoflux improves compact AI agent performance by evolving executable tool workflows at inference time, allowing smaller models to solve complex tasks without massive parameter counts.

arXiv cs.AI

ToolSense: A Diagnostic Framework for Auditing LLM Tool Knowledge

ToolSense provides a structured diagnostic framework to audit how effectively LLMs understand and utilize external tools, moving beyond simple prompt testing to evaluate parametric tool knowledge.

arXiv cs.AI

Arbor: Enhancing Agent Cognition via Tree Search

Arbor introduces a tree search-based cognition layer for autonomous agents, enabling more robust decision-making by systematically exploring action paths rather than relying solely on single-step inference.

MarkTechPost

Perplexity Integrates Deep Research into 'Computer' Orchestration

Perplexity has moved its Deep Research feature into 'Computer,' a multi-model orchestration system that breaks complex queries into subtasks and routes them across 20+ frontier models to generate reports, decks, and dash…

MarkTechPost

Zamba2-VL: Hybrid Mamba2-Transformer Vision-Language Models

Zyphra's Zamba2-VL models use a hybrid Mamba2-Transformer architecture to achieve near-linear time prefill and significantly lower time-to-first-token compared to dense Transformer-based VLMs.

Department of Product

Claude Fable 5, Agentic Payments, and Self-Improving Products

Anthropic's Fable 5 model sets new benchmarks in coding, while emerging agentic payment protocols and self-improving product loops like Amplitude Wave signal a shift toward autonomous software development.

IBM Technology

The Reality Check: AI Costs, Routing, and Cloud Shifts

As AI moves from hype to production, companies are shifting toward tiered routing to manage costs and capacity, while hardware limitations are forcing a pivot from pure on-device AI to hybrid cloud architectures.

TechCrunch — AI

Avataar AI's Varya: A Low-Cost, Culturally Aware Video Model

Avataar AI has launched Varya, a distilled, high-speed video generation model optimized for the Indian market, offering a 20x price reduction compared to global competitors by focusing on efficiency and cultural relevanc…

AI Engineer

Building Agent-Ready Websites with WebMCP

WebMCP is a proposed web standard that allows developers to expose site functionality as structured tools for AI agents, replacing brittle screen-scraping with direct, reliable API-like interactions.

OpenAI News

OpenAI's Multi-Layered Approach to AI Content Provenance

OpenAI is adopting the EU Code of Practice on Transparency of AI-Generated Content, utilizing a multi-layered strategy that combines C2PA metadata, watermarking, and public verification tools to improve digital content t…

arXiv cs.AI

SVoT: Enhancing Spatial Reasoning via State-Aware Visualization

SVoT improves spatial reasoning in LLMs by using reinforcement learning to generate state-aware visual representations of thought, allowing models to track complex spatial relationships more accurately than text-only cha…

arXiv cs.AI

Recursive Reasoning for Theory of Mind in AI

The paper proposes that improving AI's Theory of Mind requires recursive perspective-taking, so that models represent the mental states of others rather than relying on static pattern matching.

arXiv cs.AI

Securing Continuous Data Summarization Against Adversarial Attacks

This paper addresses vulnerabilities in continuous data summarization systems by identifying multi-target adversarial attack vectors and proposing robust defense mechanisms to ensure AI trustworthiness.

arXiv cs.AI

Hierarchical Memory Navigation for Efficient AI Agents

The paper introduces a hierarchical memory structure that improves agent efficiency by organizing information before retrieval, moving beyond simple flat vector search.

arXiv cs.AI

Lung-R1: Enhancing Pulmonary Diagnostics with Knowledge Graphs

Lung-R1 improves diagnostic accuracy in pulmonary medicine by integrating structured knowledge graphs with LLMs, reducing hallucinations and improving clinical reasoning.
