Pydantic Schemas Fix LLM Output Fragility

Evolve from brittle json.loads() parsers to Pydantic-validated objects using OpenAI JSON Schema modes and LangChain, enforcing types, keys, and constraints at generation time for production reliability.

Overcome Parser Debt with 4 Levels of Structured Guarantees

LLM outputs start as unreliable strings: markdown-wrapped JSON, wrong keys like 'movie_title' instead of 'title', integers returned as strings (e.g., "2014"), or a prose prefix before the payload. These crash json.loads() or silently corrupt database inserts expecting VARCHAR title, INTEGER year, VARCHAR genre. Naive fixes add 30+ lines of if-statements for stripping, normalizing, and casting, but break on the next edge case, such as XML tags or a model update.
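A hypothetical sketch of the cleanup code that accumulates this way (the fence-stripping, key renaming, and casting are illustrative, not exhaustive):

```python
import json

def brittle_parse(raw: str) -> dict:
    """Naive cleanup typical of early LLM integrations (illustrative only)."""
    text = raw.strip()
    # Strip markdown code fences the model sometimes adds.
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    data = json.loads(text)
    # Normalize a key the model sometimes renames.
    if "movie_title" in data:
        data["title"] = data.pop("movie_title")
    # Cast a year that sometimes arrives as a string.
    if isinstance(data.get("year"), str):
        data["year"] = int(data["year"])
    return data
```

Every new failure mode means another branch here, and none of it is checked against a schema.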

Advance through four levels of guarantees: (1) prompt-only instructions ("respond in JSON") rely on hope and yield no guarantees; (2) JSON mode guarantees parsable JSON but not the right shape; (3) JSON Schema mode mandates exact keys, types (string, integer), enums (e.g., "action", "comedy", "sci-fi", "drama"), and no extra properties; (4) strict=True enforces the schema during decoding itself, making invalid outputs such as string years impossible to generate. For a movie schema {"title": string, "year": integer (ge=1888, le=2030), "genre": enum}, strict mode blocks generation of non-integers, restoring the contract between probabilistic LLMs and deterministic apps.

Pydantic: Single Class for Schema, Validation, and Guidance

Define structures once in Python classes to auto-generate JSON Schema, validate/coerce at boundaries, and guide LLMs via Field descriptions:

from pydantic import BaseModel, Field
from enum import Enum

class Genre(str, Enum):
    ACTION = "action"
    COMEDY = "comedy"
    SCI_FI = "sci-fi"
    DRAMA = "drama"

class MovieRecommendation(BaseModel):
    title: str = Field(description="Full movie title, without year or parentheses")
    year: int = Field(ge=1888, le=2030, description="Release year as a 4-digit number")
    genre: Genre = Field(description="Primary genre - pick exactly one")

Benefits: model_json_schema() outputs the full schema with refs, bounds, and enums—no manual maintenance. model_validate_json('{"title": "Inception", "year": "2010", "genre": "sci-fi"}') coerces the string year to int 2010, and rejects invalid values like genre="banana" or year=99999 with a ValidationError before downstream code runs. model_dump_json() enables clean serialization. Descriptions like "pick exactly one" improve output quality over bare fields.
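A minimal check of that behavior, repeating the models above so the snippet is self-contained:

```python
from enum import Enum
from pydantic import BaseModel, Field, ValidationError

class Genre(str, Enum):
    ACTION = "action"
    COMEDY = "comedy"
    SCI_FI = "sci-fi"
    DRAMA = "drama"

class MovieRecommendation(BaseModel):
    title: str = Field(description="Full movie title, without year or parentheses")
    year: int = Field(ge=1888, le=2030, description="Release year as a 4-digit number")
    genre: Genre = Field(description="Primary genre - pick exactly one")

# Coercion: the string "2010" becomes int 2010 at the validation boundary.
movie = MovieRecommendation.model_validate_json(
    '{"title": "Inception", "year": "2010", "genre": "sci-fi"}'
)
print(type(movie.year))  # <class 'int'>

# Rejection: bad year and bad genre both fail before any downstream code runs.
try:
    MovieRecommendation.model_validate_json(
        '{"title": "Oops", "year": 99999, "genre": "banana"}'
    )
except ValidationError as e:
    print(len(e.errors()))  # one error per invalid field
```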

For support tickets:

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    URGENT = "urgent"

class SupportTicket(BaseModel):
    subject: str
    priority: Priority
    product: str
    is_billing_issue: bool
    customer_sentiment: float = Field(ge=-1.0, le=1.0)
    action_items: list[str]

This extracts a customer email into a validated object for direct DB/API use, inferring priority and sentiment without hand-written parsing logic.
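For instance, assuming the SupportTicket model above and a hypothetical JSON payload the model returned for a billing email:

```python
from enum import Enum
from pydantic import BaseModel, Field

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    URGENT = "urgent"

class SupportTicket(BaseModel):
    subject: str
    priority: Priority
    product: str
    is_billing_issue: bool
    customer_sentiment: float = Field(ge=-1.0, le=1.0)
    action_items: list[str]

# Hypothetical LLM output for a customer email.
raw = """{
  "subject": "Charged twice for Pro plan",
  "priority": "high",
  "product": "Pro plan",
  "is_billing_issue": true,
  "customer_sentiment": -0.7,
  "action_items": ["refund duplicate charge", "confirm billing cycle"]
}"""

ticket = SupportTicket.model_validate_json(raw)
row = ticket.model_dump()  # plain dict, ready for a DB insert or API call
```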

Integrate Natively or via LangChain for Typed Objects

OpenAI SDK (single-provider): Pass the Pydantic model directly—client.beta.chat.completions.parse(..., response_format=MovieRecommendation) returns .parsed as a validated object, skipping json.loads() entirely.

LangChain (chains/agents): ChatOpenAI().with_structured_output(MovieRecommendation).invoke(prompt) yields a typed instance. Use include_raw=True for observability, and method="json_schema" with strict=True (5-15% latency hit) to enforce the schema at generation time.

Both replace text-to-dict with direct domain objects, enabling composition into pipelines.

Production Rules: Reliability Over Hype

Log all ValidationErrors as signals for schema tweaks (e.g., unclear descriptions, tight bounds). Defaults: field descriptions, enums for constraints, numeric ge/le, flat schemas, strict mode. Retry strict failures with json_object fallback. This schema-first shift turns PoC hacks into systems where LLM output matches app schemas exactly, preventing bugs at DB boundaries.
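One way to wire up the logging-and-retry rule, with call_llm as a hypothetical stand-in for whichever client function actually produces the JSON string:

```python
from pydantic import BaseModel, ValidationError

class MovieRecommendation(BaseModel):
    title: str
    year: int

def get_validated(call_llm, prompt: str, retries: int = 2) -> MovieRecommendation:
    """Retry loop that treats each ValidationError as a schema-tuning signal."""
    last_error = None
    for attempt in range(retries + 1):
        raw = call_llm(prompt)  # hypothetical: returns a JSON string
        try:
            return MovieRecommendation.model_validate_json(raw)
        except ValidationError as e:
            last_error = e
            # In production, send this to structured logging instead of stdout.
            print(f"attempt {attempt}: {len(e.errors())} validation error(s)")
    raise last_error
```

A recurring error on the same field usually means the description is unclear or a bound is too tight, not that the model is broken.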

Summarized by x-ai/grok-4.1-fast via openrouter


© 2026 Edge