The Trust-But-Verify Architecture

The core innovation in this release pipeline is the separation of mechanical tasks (version bumping, tagging, PyPI publishing) from judgment-based tasks (writing release notes, drafting announcements). To ensure reliability, the team implemented a 'trust-but-verify' loop:

  1. Deterministic Manifest: Before the model generates anything, a Python script extracts all PR numbers from the commit range to create a ground-truth manifest.
  2. AI Drafting: An open-weights model (currently GLM-5.2) drafts the changelog based on the PR titles and documentation diffs.
  3. Automated Validation: A script compares the model's output against the manifest. If the model drops a PR or hallucinates an extra one, the system automatically re-prompts the agent to fix the specific discrepancies.

This approach allows the team to leverage the model's ability to synthesize prose while using deterministic code to enforce completeness and accuracy.

Grounding Models with Context

To prevent the model from inventing code examples or misinterpreting API changes, the pipeline feeds actual documentation diffs (docs/*.md) into the model's context. This ensures that the generated release notes quote the exact documentation written by the PR author. The prompts are managed as 'Skills'—version-controlled Markdown files that act as onboarding instructions for the agent, defining tone, taxonomy, and structural requirements.

Security and Efficiency

The workflow is built entirely on open tools to ensure it is reproducible by other maintainers:

  • Trusted Publishing: Uses OIDC tokens and PEP 740 attestations for secure PyPI publishing, eliminating the need for long-lived API tokens.
  • Supply Chain Integrity: All agent runtimes are pinned and verified via SHA256 checksums.
  • Cost: The entire automated process costs approximately $0.25 per release using Hugging Face Inference Providers.

Human-in-the-Loop

Despite the automation, a human remains the final gatekeeper. The model generates a draft release, which a human reviewer then edits for tone and emphasis. By archiving both the raw AI draft and the human-edited version, the team creates a dataset that allows them to iteratively improve the agent's performance over time.