#evals
Agent-Native Immune System (ANIS): Architecture for Runtime Defense
The Agent-Native Immune System (ANIS) shifts AI security from static training-time alignment to dynamic, runtime defense, using a six-layer 'Immune Tower' to protect autonomous agents against memory poisoning and tool-chain manipulation.
Odyssey: A Categorical Framework for Verifiable Foundation Models
Odyssey uses categorical sheaf theory to compose modular 'foundries'—verifiable, truth-preserving architectural components—that allow for structured, queryable, and auditable LLM-based systems.
DysLexLens: Analyzing Dyslexic AI User Experiences via LLMs
DysLexLens is an end-to-end framework that extracts, structures, and validates insights from noisy online forum data to understand how dyslexic learners interact with AI tools.
ToE: Hierarchical Claim Verification Against Adversarial Misinformation
Tree of Evidence (ToE) is a fact-checking framework that uses a reinforcement learning-driven agent to decompose claims into hierarchical argument trees, significantly improving verification accuracy against adversarially poisoned inputs.
OpenAI's GPT-5.6 Launch: Frontier Models as Managed Assets
OpenAI released the GPT-5.6 family (Sol, Terra, Luna) as a restricted, government-mediated preview, signaling a shift where release governance is now a core component of the model specification.
Optimizing Software Workflows with AI Code Review
AI code review accelerates development by automating static and dynamic analysis, but it requires human oversight to manage context, mitigate false positives, and ensure architectural alignment.
Building Interoperable Standards for Advanced AI Systems
OpenAI is co-founding the Appia Foundation to translate high-level AI safety frameworks into modular, open technical specifications that enable consistent, third-party evaluation across the global AI supply chain.
Prototype Big, Deploy Small: A Framework for Local LLM Adoption
Stop overpaying for frontier models. By using a 'prototype big, deploy small' framework and rigorous capability evals, you can identify 'Sage' (Small and Good Enough) models that provide production-grade performance on-device, saving costs and improving latency.
The Agentic AI Engineer: Scaling Agent Development via Loops
To scale agent development, teams must move from manual iteration to an 'Agentic AI Engineer' model: a multi-agent system that automates the entire lifecycle of spec, build, eval, diagnose, and optimize.
The Prompt as a Platform: Agentic Engineering for Distributed Systems
Dominik Tornow argues that software engineering is shifting from general-purpose implementations to bespoke systems synthesized by agents from abstract specifications, using deterministic simulation as the critical feedback loop for design.
RL-Guided ETL Pipeline Remediation: Architecture and Evals
Automate ETL failure recovery using a deterministic anomaly detection layer, a Q-learning policy for action selection, and a hard-coded safety guardrail to ensure operational reliability.
Debugging Production AI Agents via Record and Replay
Stop chasing bitwise determinism in LLMs. Instead, implement a record-and-replay architecture to capture agent state transitions, enabling deterministic debugging and regression testing of non-deterministic production failures.
The Promptware Kill Chain: Understanding AI Malware
Promptware exploits the lack of separation between instructions and data in LLMs to execute a multi-stage attack, requiring a zero-trust approach where AI agents are treated as hostile runtimes.
Stress-Testing AI Agents with Simulated Digital Environments
Patronus AI is using 'digital world models' to simulate complex environments, allowing developers to stress-test autonomous agents through reinforcement learning and automated verification.
Red-Teaming and Security for Agentic AI Systems
AI security requires a shift from traditional cybersecurity to treating LLMs as untrusted, alien intelligence. As agents gain autonomy, automated red-teaming tools like Gray Swan's 'Shade' are becoming essential for identifying vulnerabilities that human testers miss.