Pipeline Beats Prompt for Reliable Trip Planning
Replace LLM text generation with a 5-layer pipeline that parses constraints, grounds in live data, validates outputs, scores quality, and regenerates low-confidence plans to deliver realistic itineraries.
Shift to Constraint Satisfaction via Pipelines
Most AI travel apps fail because they treat planning as text generation: they ignore live data, user constraints like fitness level or traveling with kids, and any self-validation, which leads to unrealistic suggestions like a January drive on the Beartooth Highway (closed seasonally) or a 14-hour hike with toddlers. Instead, build a pipeline where the LLM handles creativity but code enforces reliability: parse inputs into structured constraints (dates, group size, budget, interests), detect contradictions (e.g., "easy but adventurous" triggers two distinct plans: Option A Easy + Scenic, Option B Adventure-Forward), and inject user context such as past visits to avoid repeats.
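The contradiction-detection step can be sketched as follows. This is a minimal illustration, not the app's real schema: the interface fields, keyword lists, and function names are assumptions.

```typescript
// Hypothetical constraint schema; field names are illustrative.
interface TripConstraints {
  dates: { start: string; end: string };
  groupSize: number;
  hasKids: boolean;
  fitness: "beginner" | "intermediate" | "advanced";
  interests: string[];
}

interface PlanVariant {
  label: string;
  emphasis: string;
}

// "Easy but adventurous" is contradictory: rather than averaging the two,
// emit two distinct plan variants and let the user choose.
function detectContradictions(interests: string[]): PlanVariant[] {
  const wantsEasy = interests.includes("easy") || interests.includes("relaxed");
  const wantsAdventure =
    interests.includes("adventurous") || interests.includes("thrill");
  if (wantsEasy && wantsAdventure) {
    return [
      { label: "Option A", emphasis: "Easy + Scenic" },
      { label: "Option B", emphasis: "Adventure-Forward" },
    ];
  }
  return [{ label: "Single Plan", emphasis: interests.join(", ") || "balanced" }];
}
```

Forking into two labeled plans surfaces the tension to the user instead of letting the model silently split the difference.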
Ground plans in real-time data fetched in parallel: NPS alerts/closures, Recreation.gov permits via the RIDB API (bridged with fuzzy matching), 3-day OpenWeatherMap forecasts at park coordinates, and web searches (Brave/Serper/Tavily) with freshness tuned per topic (past-day for wildfires). Present the results to the model as "AUTHORITATIVE real-time data" that overrides its training data, and when a source is down, admit the gap instead of letting the model hallucinate.
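The parallel-fetch-with-admitted-gaps pattern can be sketched with Promise.allSettled, so one failed source becomes an explicit gap rather than a crash or a hallucination. The fetcher names and result shape here are assumptions.

```typescript
// Each grounding source either returns data or becomes a labeled gap.
interface GroundingResult {
  source: string;
  data: unknown; // null means the source was unavailable
  note?: string;
}

async function fetchGroundingData(
  fetchers: Record<string, () => Promise<unknown>>
): Promise<GroundingResult[]> {
  const names = Object.keys(fetchers);
  // allSettled never rejects: a down API yields a "rejected" entry we can label.
  const settled = await Promise.allSettled(names.map((n) => fetchers[n]()));
  return settled.map((r, i) =>
    r.status === "fulfilled"
      ? { source: names[i], data: r.value }
      : {
          source: names[i],
          data: null,
          note: "unavailable; instruct the model to admit the gap",
        }
  );
}
```

The gap notes feed straight into the prompt, so the model says "weather data is currently unavailable" instead of inventing a forecast.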
Dual AI Voices with Structured Extraction
Use two personas for varied styles: a Claude-powered "Local" (opinionated, casual, 150–300 words, picks Zion over Bryce and says what to skip) and a GPT-powered "Planner" (time-blocked itineraries with ITINERARY_JSON for visuals, including start times, distances, and gear). Extract the JSON (days, stops, coords, durations, alternatives) via regex, structure detection, or a fallback AI call to handle truncation, markdown wrappers, and quoting issues.
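The layered extraction can be sketched like this: try a fenced-block regex first, then a brace-matching scan of the raw text, with a final fallback AI call (not shown) re-asking the model for clean JSON. This is a simplified sketch; it does not handle braces inside string values.

```typescript
// Layered ITINERARY_JSON extraction. Returns the parsed object or null.
function extractItineraryJson(text: string): unknown {
  // Layer 1: JSON inside a markdown code fence.
  const fence = text.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fence) {
    try { return JSON.parse(fence[1]); } catch { /* fall through to layer 2 */ }
  }
  // Layer 2: first balanced {...} span in the raw text.
  const start = text.indexOf("{");
  if (start !== -1) {
    let depth = 0;
    for (let i = start; i < text.length; i++) {
      if (text[i] === "{") depth++;
      else if (text[i] === "}" && --depth === 0) {
        try { return JSON.parse(text.slice(start, i + 1)); } catch { break; }
      }
    }
  }
  // Layer 3 in production: a fallback AI call asking the model to re-emit JSON.
  return null;
}
```

Each layer catches a different failure mode: fences catch markdown wrapping, brace matching catches prose-embedded JSON, and the fallback call catches truncation.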
Post-Generation Validation and Regeneration
Validate against common violations: wrong day count, strenuous trails for beginners, accommodation mismatches, schedule overflows (>10 hours/day for families), overlapping time blocks, and >4 stops/day with kids. Smart-swap violators using the alternatives array when a replacement is nearby (<30 miles), unused, and compliant; otherwise flag the plan for regeneration. Compute confidence: High (0.9+, 0–2 corrections, <25% of stops affected), Medium (0.6, 3+ corrections or 25–50% affected), Low (0.3, 5+ corrections or >50% affected). Regenerate low-confidence plans with failure feedback (e.g., "Previous plan violated beginner fitness with Angels Landing; fill the gap at 37.27, -112.95 compliantly"). Append warnings for closures the plan fails to mention, found via fuzzy matching.
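The confidence rubric above can be sketched directly in code. The thresholds mirror the text (0–2 corrections and under 25% of stops affected scores High, and so on); the function name and return shape are illustrative.

```typescript
// Map validation results to a confidence score and level per the rubric.
function computeConfidence(
  corrections: number,
  affectedFraction: number
): { score: number; level: "high" | "medium" | "low" } {
  // 5+ corrections or more than half the plan touched: regenerate.
  if (corrections >= 5 || affectedFraction > 0.5) return { score: 0.3, level: "low" };
  // 3+ corrections or a quarter to half affected: serve with caution.
  if (corrections >= 3 || affectedFraction >= 0.25) return { score: 0.6, level: "medium" };
  // 0-2 corrections, under a quarter affected: serve as-is.
  return { score: 0.9, level: "high" };
}
```

Checking the low band first matters: a plan with 6 corrections also matches the medium condition, so the strictest rule must win.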
Score quality across dimensions: Compliance (25%, % of checks passed), Interest Match (25%, synonym-mapped, e.g. photography→viewpoints), Diversity (20%, Shannon entropy of stop types), Pacing (15%, penalize <2 or >5 stops/day), and Geo-efficiency (15%, backtracking detection). Map the total to a label (Excellent/Good/Fair/Needs Improvement) that decides whether to serve or regenerate, and stream responses with source badges (NPS/Weather) via SSE.
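The Shannon-entropy diversity dimension can be sketched as below. Normalizing by the log of the category count (so the score lands in 0–1) is an assumption; the source only specifies that entropy of stop types is used.

```typescript
// Diversity score: 0 = every stop is the same type, 1 = maximally varied mix.
function diversityScore(stopTypes: string[]): number {
  if (stopTypes.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const t of stopTypes) counts.set(t, (counts.get(t) ?? 0) + 1);
  // Shannon entropy over the distribution of stop types.
  let entropy = 0;
  for (const c of counts.values()) {
    const p = c / stopTypes.length;
    entropy -= p * Math.log2(p);
  }
  // Normalize by the maximum possible entropy for this many categories.
  const max = Math.log2(counts.size);
  return max === 0 ? 0 : entropy / max;
}
```

Entropy rewards a genuine mix (hike, viewpoint, museum) over a day of five near-identical overlooks, which a simple unique-count metric would miss.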
Production Lessons: Trust No Generation
LLMs ignore instructions (suggesting strenuous hikes despite "easy"), struggle to honor constraints (move checking into code), and produce unreliable JSON (multi-layer extraction is needed). Test the full pipeline end-to-end: a 12-check suite caught prompts that went silently unused because of a fallback bug. Unbounded Maps will eventually crash your cache; use NodeCache with maxKeys and checkperiod. Symptoms can mislead (a CORS error masked a server crash). In 2026, products win on pre- and post-model engineering, blending LLM creativity with pipeline trust.
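The cache fix can be sketched with the node-cache package's documented options (the TTL and cap values here are assumptions, not the app's real settings):

```typescript
import NodeCache from "node-cache";

// Unlike a raw Map, node-cache sweeps expired entries on a schedule and caps
// total keys, so memory stays bounded under load.
const weatherCache = new NodeCache({
  stdTTL: 600,      // illustrative: entries expire after 10 minutes
  checkperiod: 120, // background sweep for expired keys every 2 minutes
  maxKeys: 1000,    // hard cap; set() raises an error once the cache is full
});

// Cache a forecast keyed by park coordinates (key format is an assumption).
weatherCache.set("zion:37.27,-112.95", { forecastHighF: 54 });
```

The maxKeys error is the point: a loud failure at the cap is recoverable, while an unbounded Map fails only when the process runs out of memory.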