CATEGORY · 1 OF 11

AI & LLMs

The deepest channel on Edge. Foundation models, agent architectures, retrieval, evals, and the moving line between research and production.

758 SUMMARIES
+74 THIS WEEK
69 SOURCES
DAY 01 · Yesterday · JUN 19, 2026 · 12 SUMMARIES
arXiv cs.AI · AI & LLMs

GLARE: Natural Language Interfaces for Global Model Explanations

GLARE provides a natural language interface for querying global model explanations, allowing users to interpret complex AI behavior through conversational prompts rather than static visualizations.

arXiv cs.AI · AI & LLMs

Moving Beyond Static Leaderboards for LLM Agent Evaluation

Static benchmarks often fail to predict real-world performance for LLM agents; the authors propose a framework focused on predictive validity to better align evaluation with practical utility.

arXiv cs.AI · AI & LLMs

Toten: Ontological Tokenization for Technical Portuguese

Toten is a knowledge-based tokenization framework designed to accurately parse physical quantities and technical notation in Brazilian Portuguese, addressing common failures in standard NLP tokenizers.

arXiv cs.AI · AI & LLMs

The Symbiotic Evolution of AI and Software Engineering

The intersection of AI and Software Engineering (AI4SE and SE4AI) has matured over the last decade, shifting from experimental research to essential production-grade methodologies for building, testing, and maintaining complex systems.

arXiv cs.AI · AI & LLMs

Configurable Clinical Information Extraction with Agentic RAG

Agentic RAG systems for clinical data require modular configuration to balance precision and recall, as monolithic pipelines often fail to handle the high variability of medical documentation.

arXiv cs.AI · AI & LLMs

Optimizing LLM Post-Training Through Pairwise Comparison Selection

The paper investigates how the selection of response pairs in preference-based post-training (such as DPO or PPO-based RLHF) affects model performance, suggesting that strategic pair selection is as critical as the training algorithm itself.

arXiv cs.AI · AI & LLMs

Detecting LLM Epistemic Blind Spots via Cross-Model Attribution

LLMs often express unwarranted confidence in clinical settings. This paper introduces Cross-Model Attribution Divergence (CMAD), a method for identifying when models rely on unreliable features, which flags epistemic uncertainty in tabular data.

arXiv cs.AI · AI & LLMs

Deontic Policies for Runtime Governance of Agentic AI

The paper proposes using deontic logic—a system of formal rules defining obligations, permissions, and prohibitions—to govern the runtime behavior of autonomous AI agents.
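
To make the three deontic categories concrete, here is a minimal sketch of a runtime check. It is not the paper's formalism: the `Modality`, `Rule`, `check_action`, and `unmet_obligations` names, the action strings, and the "prohibitions win, default deny" resolution order are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    PERMITTED = "permitted"
    PROHIBITED = "prohibited"
    OBLIGATED = "obligated"


@dataclass(frozen=True)
class Rule:
    modality: Modality
    action: str  # hypothetical action name, e.g. "read_file"


def check_action(action, rules):
    """Return True if the agent may perform `action`.

    Assumed conflict resolution: a prohibition overrides any permission,
    and actions with no matching rule are denied by default.
    """
    modalities = {r.modality for r in rules if r.action == action}
    if Modality.PROHIBITED in modalities:
        return False
    return Modality.PERMITTED in modalities or Modality.OBLIGATED in modalities


def unmet_obligations(performed, rules):
    """List obligated actions that the agent has not yet performed."""
    return sorted(
        r.action
        for r in rules
        if r.modality is Modality.OBLIGATED and r.action not in performed
    )


# Hypothetical policy for an agent with file access.
RULES = [
    Rule(Modality.PERMITTED, "read_file"),
    Rule(Modality.PROHIBITED, "delete_file"),
    Rule(Modality.OBLIGATED, "write_audit_log"),
]
```

A governance layer along these lines would call `check_action` before each tool invocation and `unmet_obligations` before the agent is allowed to finish a task.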

MarkTechPost · AI & LLMs

Building Reliable AI Code Generation Pipelines with Salesforce CodeGen

To move AI-generated code from prototype to production, implement a multi-stage pipeline that includes automated unit testing, safety sandboxing, and model-based reranking to filter out hallucinated or insecure outputs.

MarkTechPost · AI & LLMs

Liquid AI's New 350M Multilingual Retrieval Models

Liquid AI has released LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M, two efficient, bidirectional retrieval models optimized for multilingual search across 11 languages.

MarkTechPost · AI & LLMs

Perplexity Brain: Self-Improving Memory for AI Agents

Perplexity's 'Brain' system shifts AI memory from user-centric profiles to agent-centric performance, using an overnight context graph to learn from past tasks, failures, and corrections to improve future efficiency.

IBM Technology · AI & LLMs

The Rise of Agentic Traffic and Microsoft's Model Strategy

Agentic AI bots now dominate web traffic, signaling a shift in how we interact with information. Meanwhile, Microsoft is pivoting to first-party models, prioritizing safety and cost-efficiency for enterprise users.

DAY 02 · Thursday · JUN 18, 2026 · 13 SUMMARIES
Google Cloud Tech · AI & LLMs

Architecting Long-Running AI Agents for Multi-Day Workflows

Move beyond stateless chatbots by implementing event-driven dormancy, durable checkpointing, and decoupled evaluation to manage complex, multi-day workflows.
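
As a rough illustration of durable checkpointing, the sketch below persists workflow progress to a JSON file after every step so a restarted process resumes where it left off. It is not taken from the talk: the file layout, the `save_checkpoint`, `load_checkpoint`, and `run_workflow` names, and the `stop_after` parameter (used here to simulate the agent going dormant) are assumptions.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then swap it into place.

    os.replace is atomic, so a crash mid-write does not corrupt
    the previous checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "results": []}
    with open(path) as f:
        return json.load(f)


def run_workflow(path, steps, stop_after=None):
    """Run steps in order, checkpointing after each completed step.

    `stop_after` limits how many steps run in this call, which simulates
    the process going dormant or being interrupted.
    """
    state = load_checkpoint(path)
    executed = 0
    while state["next_step"] < len(steps):
        if stop_after is not None and executed >= stop_after:
            break
        step = steps[state["next_step"]]
        state["results"].append(step())  # results must be JSON-serializable
        state["next_step"] += 1
        save_checkpoint(path, state)
        executed += 1
    return state
```

Calling `run_workflow` a second time with the same path skips the steps already recorded in the checkpoint, which is the property a multi-day workflow needs after a restart or a wake-up event.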

Google Cloud Tech · AI & LLMs

Building AI Agents with Model Context Protocol (MCP)

The Model Context Protocol (MCP) acts as a universal adapter, allowing AI agents to securely interact with external tools and live data via a standardized input/output interface, decoupling agent logic from tool implementation.

AI Engineer · AI & LLMs

The Production AI Playbook: Deploying Agents at Enterprise Scale

Moving AI from demo to production requires shifting focus from model selection to five pillars: evaluation, observability, data foundation, orchestration, and governance.

arXiv cs.AI · AI & LLMs

RODS: Improving Multi-Turn Tool-Use Agents via Reward-Driven Synthesis

RODS (Reward-Driven Online Data Synthesis) improves multi-turn tool-use agents by generating high-quality synthetic training data through iterative reward-based filtering, addressing the scarcity of complex, multi-step interaction data.

arXiv cs.AI · AI & LLMs

Skill-Guided Continuation Distillation for GUI Agents

The paper introduces a method to improve GUI agent performance by distilling complex task trajectories into modular, skill-based sub-tasks, enhancing generalization and execution reliability.

arXiv cs.AI · AI & LLMs

Decoupling Search from Reasoning in LLM Agents

Native search grounding in LLMs creates rigid, expensive, and opaque agent architectures. Moving to a Decoupled Search Grounding (DSG) layer allows for vendor-agnostic control over retrieval, caching, and cost, while maintaining accuracy.
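
One way to picture a decoupled grounding layer is a thin wrapper that owns retrieval and caching, with the model only ever seeing the assembled prompt. The sketch below is an assumption-laden illustration, not the paper's DSG design: `CachedSearch`, `build_prompt`, and the backend callable are hypothetical names, and any search provider can be plugged in as the backend.

```python
from typing import Callable, Dict, List

# Any function mapping a query string to a list of text snippets.
SearchFn = Callable[[str], List[str]]


class CachedSearch:
    """Vendor-agnostic search wrapper with a simple in-memory cache."""

    def __init__(self, backend: SearchFn):
        self.backend = backend
        self.cache: Dict[str, List[str]] = {}
        self.backend_calls = 0  # lets callers track retrieval cost

    def search(self, query: str) -> List[str]:
        key = query.strip().lower()  # normalize so near-identical queries share a cache entry
        if key not in self.cache:
            self.backend_calls += 1
            self.cache[key] = self.backend(query)
        return self.cache[key]


def build_prompt(question: str, search: CachedSearch) -> str:
    """Assemble retrieved snippets and the question into one prompt string."""
    snippets = search.search(question)
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Because the model never calls the search provider directly, swapping providers or changing the cache policy only touches `CachedSearch`.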

arXiv cs.AI · AI & LLMs

SciRisk-Bench: Evaluating Safety in AI for Science

SciRisk-Bench is a new benchmark designed to evaluate the safety risks of AI models specifically applied to scientific research, focusing on multi-dimensional risk assessment.

arXiv cs.AI · AI & LLMs

Improving AI Scientist Reliability via Research Harnesses

The paper proposes a 'Research Harness' to externalize synthesis and validation, addressing the reliability issues inherent in autonomous AI research agents.

arXiv cs.AI · AI & LLMs

WorldLines: Benchmarking Long-Horizon Stateful Embodied Agents

WorldLines introduces a new benchmark and modeling framework designed to evaluate how embodied AI agents maintain state and execute complex, long-horizon tasks over extended periods.

arXiv cs.AI · AI & LLMs

DeFAb: A New Benchmark for Defeasible Abduction in LLMs

DeFAb is a new, verifiable benchmark designed to test how well foundation models handle defeasible abduction—the ability to form logical explanations that can be retracted or revised in light of new, contradictory information.

arXiv cs.AI · AI & LLMs

CEO-Bench: Measuring Long-Term Strategic Reasoning in AI Agents

CEO-Bench is a new evaluation framework designed to test whether AI agents can maintain strategic coherence and decision-making over extended, multi-step business scenarios.

MarkTechPost · AI & LLMs

Vercel's Eve: A Filesystem-First Framework for AI Agents

Vercel has released Eve, an open-source framework that treats AI agents as directories of files, mapping specific capabilities like tools, skills, and schedules to file paths to eliminate boilerplate and production plumbing.

MarkTechPost · AI & LLMs

The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache

KV cache compression is the new frontier for scaling LLM inference, with TurboQuant, OSCAR, and EpiCache offering distinct strategies to balance memory footprint against model accuracy.

DAY 03 · Wednesday · JUN 17, 2026 · 5 SUMMARIES
TechCrunch — AI · AI & LLMs

The Shift Toward User-Controlled AI Recommendation Algorithms

Major social platforms are moving from opaque, one-size-fits-all algorithms to user-tunable systems, leveraging LLMs to allow granular control over feed content.

Google Cloud Tech · AI & LLMs

Building AI Agents with Google's Agent Development Kit (ADK)

A practical walkthrough of using Google's Agent Development Kit (ADK) to build autonomous agents that can interact with text-based environments, demonstrated through a retro-inspired adventure game.

TechCrunch — AI · AI & LLMs

Solving the Physical AI Data Bottleneck

XDOF is building the infrastructure for physical AI by providing the high-fidelity, large-scale training data that robotics models currently lack, moving beyond the limitations of low-quality video data.

TechCrunch — AI · AI & LLMs

Pramaana Labs Uses Formal Verification to Secure Enterprise AI

Pramaana Labs raised $27M to integrate formal verification (using the Lean programming language) with LLMs to ensure deterministic, error-free outputs in high-stakes fields like tax, law, and drug discovery.

OpenAI News · AI & LLMs

Predicting AI Model Behavior via Deployment Simulation

OpenAI uses 'Deployment Simulation'—replaying real, de-identified user conversations with new models—to predict safety risks and undesired behaviors before public release, outperforming traditional synthetic evaluations.

Showing 30 of 758