№ 02 / SUMMARIES

#machine-learning

DAY 01 · Friday · JUN 19 · 2026 · 5 SUMMARIES
arXiv cs.AI · AI & LLMs

GLARE: Natural Language Interfaces for Global Model Explanations

GLARE provides a natural language interface for querying global model explanations, allowing users to interpret complex AI behavior through conversational prompts rather than static visualizations.

arXiv cs.AI · AI & LLMs

Configurable Clinical Information Extraction with Agentic RAG

Agentic RAG systems for clinical data require modular configuration to balance precision and recall, as monolithic pipelines often fail to handle the high variability of medical documentation.

arXiv cs.AI · AI & LLMs

Optimizing LLM Post-Training Through Pairwise Comparison Selection

The paper investigates how the selection of response pairs in preference-based post-training (such as DPO or PPO-based RLHF) affects model performance, suggesting that strategic pair selection is as critical as the training algorithm itself.

arXiv cs.AI · AI & LLMs

Detecting LLM Epistemic Blind Spots via Cross-Model Attribution

LLMs often hallucinate confidence in clinical settings. This paper introduces a method using Cross-Model Attribution Divergence (CMAD) to identify when models rely on unreliable features, effectively flagging epistemic uncertainty in tabular data.

MarkTechPost · AI & LLMs

Perplexity Brain: Self-Improving Memory for AI Agents

Perplexity's 'Brain' system shifts AI memory from user-centric profiles to agent-centric performance, using an overnight context graph to learn from past tasks, failures, and corrections to improve future efficiency.

DAY 02 · Thursday · JUN 18 · 2026 · 7 SUMMARIES
arXiv cs.AI · AI & LLMs

RODS: Improving Multi-Turn Tool-Use Agents via Reward-Driven Synthesis

RODS (Reward-Driven Online Data Synthesis) improves multi-turn tool-use agents by generating high-quality synthetic training data through iterative reward-based filtering, addressing the scarcity of complex, multi-step interaction data.

arXiv cs.AI · AI & LLMs

Skill-Guided Continuation Distillation for GUI Agents

The paper introduces a method to improve GUI agent performance by distilling complex task trajectories into modular, skill-based sub-tasks, enhancing generalization and execution reliability.

arXiv cs.AI · AI & LLMs

SciRisk-Bench: Evaluating Safety in AI for Science

SciRisk-Bench is a new benchmark designed to evaluate the safety risks of AI models specifically applied to scientific research, focusing on multi-dimensional risk assessment.

arXiv cs.AI · AI & LLMs

Improving AI Scientist Reliability via Research Harnesses

The paper proposes a 'Research Harness' to externalize synthesis and validation, addressing the reliability issues inherent in autonomous AI research agents.

arXiv cs.AI · AI & LLMs

WorldLines: Benchmarking Long-Horizon Stateful Embodied Agents

WorldLines introduces a benchmark and modeling framework for evaluating how embodied AI agents maintain state while executing complex, long-horizon tasks.

arXiv cs.AI · AI & LLMs

DeFAb: A New Benchmark for Defeasible Abduction in LLMs

DeFAb is a new, verifiable benchmark designed to test how well foundation models handle defeasible abduction—the ability to form logical explanations that can be retracted or revised in light of new, contradictory information.

MarkTechPost · AI & LLMs

The KV Cache Compression Race: TurboQuant vs OSCAR vs EpiCache

KV cache compression is the new frontier for scaling LLM inference, with TurboQuant, OSCAR, and EpiCache offering distinct strategies to balance memory footprint against model accuracy.

DAY 03 · Wednesday · JUN 17 · 2026 · 8 SUMMARIES
Python in Plain English · Data Science & Visualization

6 Habits That Elevate Data Science Projects Beyond Model Selection

Exceptional data science outcomes depend less on complex algorithms and more on disciplined fundamentals like data auditing, version control, and rigorous documentation.

arXiv cs.AI · AI & LLMs

DeepInsight: Evaluating the Physical AI Stack

DeepInsight proposes a unified infrastructure for evaluating AI systems across the entire physical stack, addressing the fragmentation in current performance assessment methodologies.

arXiv cs.AI · AI & LLMs

Foundation Model Orchestrated Workflows for Engineering Design

This research introduces a surrogate-assisted design workflow for pedestrian protection systems, using foundation models to orchestrate complex simulation and optimization tasks.

arXiv cs.AI · AI & LLMs

SEAGym: A Benchmark for Self-Evolving LLM Agents

SEAGym provides a standardized evaluation environment designed to measure the capabilities of self-evolving LLM agents, focusing on their ability to autonomously improve performance over time.

arXiv cs.AI · AI & LLMs

Analyzing AI Model Behavior via Agent Trajectories

This paper provides a comprehensive 106-page framework for evaluating LLM behavior by analyzing the sequential decision-making paths (trajectories) agents take when solving complex tasks, rather than just looking at final outputs.

arXiv cs.AI · AI & LLMs

MemTrace: Beyond Final Accuracy in LLM Long-Term Memory

MemTrace is a diagnostic framework designed to evaluate LLM long-term memory beyond simple accuracy metrics, focusing on the underlying mechanisms of information retention and retrieval over time.

arXiv cs.AI · AI & LLMs

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

SpeechDx is a new multi-task benchmark designed to evaluate AI models on clinical speech analysis, addressing the need for standardized, robust performance metrics in medical diagnostics.

MarkTechPost · AI & LLMs

MiniMax Sparse Attention: Scaling Long Context with Block-Sparsity

MiniMax Sparse Attention (MSA) reduces the quadratic cost of long-context attention by using a two-branch, block-sparse approach that selects key-value blocks via a learned indexer, maintaining performance while fixing compute costs at O(kBk).
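
As a rough illustration of the block-selection idea only, here is a minimal pure-Python sketch for a single query vector. It is not MiniMax's implementation: the function name `block_sparse_attention`, the parameters `block_size` and `top_k`, and the mean-key scoring rule (a stand-in for the learned indexer) are all assumptions made for this example, and the two-branch design is not modeled.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend over only the top_k highest-scoring key/value blocks."""
    # Split token positions into contiguous blocks.
    starts = range(0, len(keys), block_size)
    blocks = [list(range(s, min(s + block_size, len(keys)))) for s in starts]

    # Stand-in indexer (assumption): score a block by the dot product of the
    # query with the block's mean key. A learned indexer would replace this.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(query))]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])
    positions = [i for b in selected for i in blocks[b]]

    # Ordinary scaled dot-product attention, restricted to selected positions.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in positions])
    output = [sum(w * values[i][d] for w, i in zip(weights, positions))
              for d in range(len(values[0]))]
    return output, positions
```

In this sketch the attention step touches at most `top_k * block_size` positions no matter how long the sequence is, which is the sense in which the per-query attention cost stays fixed; scoring the blocks is a separate cost that grows with the number of blocks.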

DAY 04 · Tuesday · JUN 16 · 2026 · 6 SUMMARIES
AI Engineer · AI & LLMs

Optimizing Video Diffusion for Real-Time Generation

Achieve real-time video generation by stacking quantization, caching, and step distillation to reduce the standard 50-step denoising process to as few as 1-8 steps.

arXiv cs.AI · AI & LLMs

Mask-Proof: Automated Data Curation for Mathematical Proofs

Mask-Proof is an LLM-based pipeline designed to automate the curation of high-quality mathematical proof data, addressing the scarcity of reliable training sets for formal reasoning models.

arXiv cs.AI · AI & LLMs

CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG

CONCORD is a framework for device-cloud Retrieval-Augmented Generation that optimizes performance under document isolation by using asynchronous sparse aggregation to balance local privacy with cloud-scale retrieval.

arXiv cs.AI · AI & LLMs

Verifiable Agentic Data Science via Tool-Grounded Reasoning

To solve complex, irregular Time-Series Question Answering (TSQA), agents must move beyond pure generation toward tool-grounded reasoning that enforces verifiable, step-by-step execution.

arXiv cs.AI · AI & LLMs

Evaluating LLM Judge Reliability via Subset Selection

The 'Metric Match' approach improves LLM judge evaluation by using subset selection to identify high-fidelity data samples, ensuring that automated metrics better correlate with human preferences.

arXiv cs.AI · AI & LLMs

Measuring Trust Dynamics in Multi-Agent AI Systems

This research provides a framework for quantifying how AI agents form, break, and recover trust, offering essential insights for the governance of autonomous multi-agent systems.

DAY 05 · Monday · JUN 15 · 2026 · 4 SUMMARIES
Level Up Coding · Data Science & Visualization

Why Accuracy Metrics Hide ML Model Failures

High accuracy scores in automated systems like résumé classifiers often mask systemic biases and data quality issues that lead to unfair rejection patterns.
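
A small worked example shows how this can happen. The numbers below are invented purely for illustration (they are not from the article): a hypothetical résumé classifier is scored on 90 applicants from one group and 10 from another, and the overall accuracy hides a very different false-negative rate between the two.

```python
def accuracy(pairs):
    """Fraction of (true_label, predicted_label) pairs that agree."""
    return sum(1 for y, p in pairs if y == p) / len(pairs)

def false_negative_rate(pairs):
    """Among truly qualified applicants (label 1), the fraction rejected."""
    positives = [(y, p) for y, p in pairs if y == 1]
    return sum(1 for y, p in positives if p == 0) / len(positives)

# Hypothetical data: 1 = qualified, 0 = not qualified.
group_a = [(1, 1)] * 45 + [(0, 0)] * 45               # every prediction correct
group_b = [(1, 0)] * 4 + [(1, 1)] * 1 + [(0, 0)] * 5  # 4 of 5 qualified rejected

overall = accuracy(group_a + group_b)
fnr_a = false_negative_rate(group_a)
fnr_b = false_negative_rate(group_b)

print(f"overall accuracy: {overall:.2f}")  # 0.96
print(f"group A FNR:      {fnr_a:.2f}")    # 0.00
print(f"group B FNR:      {fnr_b:.2f}")    # 0.80
```

The headline figure is 96% accurate, yet four out of five qualified applicants in the smaller group are rejected. Reporting error rates per group next to the aggregate score is one simple way to surface this kind of failure.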

arXiv cs.AI · AI & LLMs

Revisiting the Link Between AI Literacy and Usage

The paper challenges the assumption that lower digital literacy correlates with higher AI usage, suggesting instead that 'adoption breadth'—the variety of tools used—is a more accurate metric for understanding AI engagement.

arXiv cs.AI · AI & LLMs

Comparing Diff-in-Means and INLP for LLM Refusal Mechanisms

This paper evaluates two common techniques—Difference-in-Means and Iterative Null-space Projection (INLP)—for identifying and mitigating refusal behaviors in LLMs, highlighting the limitations of assuming refusal is a single, linear direction.
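
For readers unfamiliar with the first technique, the sketch below shows the basic Difference-in-Means recipe on toy vectors: average the activations for refusing and complying prompts, take the normalized difference as a candidate "refusal direction", and project it out. This is a generic illustration, not the paper's code; the function names and the toy activations are assumptions. INLP differs in that it repeatedly trains a linear classifier and projects onto its null space, so it can remove more than one direction.

```python
import math

def mean_vector(rows):
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in range(len(rows[0]))]

def difference_in_means_direction(refusing, complying):
    """Unit vector pointing from the complying mean to the refusing mean."""
    mu_r = mean_vector(refusing)
    mu_c = mean_vector(complying)
    diff = [a - b for a, b in zip(mu_r, mu_c)]
    norm = math.sqrt(sum(x * x for x in diff))
    # Assumes the two means differ; identical means would give norm == 0.
    return [x / norm for x in diff]

def ablate(activation, direction):
    """Remove the component of `activation` along the unit `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy 2-D "activations" (hypothetical values).
refusing = [[2.0, 0.0], [4.0, 0.0]]
complying = [[0.0, 0.0], [0.0, 0.0]]
direction = difference_in_means_direction(refusing, complying)
cleaned = ablate([5.0, 7.0], direction)
```

Ablating a single vector like this only removes refusal if the behavior really lives along one linear direction, which is the assumption the paper calls into question.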

arXiv cs.AI · AI & LLMs

Hybrid Open-Ended Tri-Evolution for Deep Research Agents

The paper introduces a 'Hybrid Open-Ended Tri-Evolution' framework to improve the performance of deep research AI agents by optimizing their exploration and reasoning capabilities.
