Healthcare LLM Rate Limits: 2 Fail, 1 Works
Simple per-user rate limits on LLM APIs fail to stop credential stuffing attacks (causing $47K bills) and block critical clinical workflows; context-aware throttling with priority and anomaly detection is the only production-ready solution.
Rate Limiting's Hidden Vulnerabilities in Healthcare LLMs
Healthcare organizations building AI triage and decision-support tools face a dual threat from LLM rate limiting: it either fails catastrophically against attacks or disrupts life-saving workflows. One health system received a $47,832 bill after credential stuffing through 8 compromised physician accounts evaded per-user quotas of 1,000 requests/day, pushing through 94,000 requests and 847 million tokens in 72 hours. Similar incidents across six investigations reveal the core issue: LLMs have non-uniform request costs (a 200-token summary costs $0.003 vs. $0.47 for an 8,000-token analysis) and mixed traffic (clinical, research, attacks). Traditional REST-style limits assume uniform costs and legitimate traffic; LLM workloads violate both assumptions.
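A quick back-of-the-envelope calculation shows why request-count limits break down against non-uniform costs. The per-token rates below are illustrative assumptions, not any provider's actual pricing:

```python
# Hypothetical per-token rates, chosen only to illustrate the cost spread
# between a short summary and a long-context analysis.
INPUT_RATE = 0.000003    # $ per input token (assumed)
OUTPUT_RATE = 0.000015   # $ per output token (assumed)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one LLM request under the assumed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

short_summary = estimate_cost(150, 50)      # brief note summary
deep_analysis = estimate_cost(8000, 2000)   # long-context chart review
print(f"${short_summary:.4f} vs ${deep_analysis:.4f}")
```

Two requests that each count as "one" against a request quota can differ in cost by well over an order of magnitude, which is exactly the asymmetry attackers exploit with max-length prompts.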
Real-world failures compound this. During a mass casualty event, a hospital-wide 50 requests/minute limit blocked triage for 4.5 minutes amid 63 simultaneous physician queries. Shift changes spike queues to 280 requests, delaying acute MI reviews by 8.2 seconds. Credential stuffing with 15 accounts racks up $2,160 in hours while staying under limits. Radware’s 2026 Global Threat Analysis Report notes 91.8% bad bot growth in 2025, amplifying these risks.
“The rate limiting system — designed to control costs — had become the attack surface.” – Piyoosh Rai, on the $47K incident, highlighting how safety mechanisms expose new vectors.
Pattern 1: Simple Token Limits – Cheap but Bypassed and Blocking
This in-memory approach caps tokens per user per window (e.g., 100,000/hour), tying limits directly to billing. Implementation is trivial (~50 lines of Python with defaultdict and Lock) and carries essentially zero infrastructure cost.
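A minimal sketch of this pattern, assuming a fixed hourly window; the class and method names here are illustrative choices, not the incident system's code:

```python
import time
from collections import defaultdict
from threading import Lock

TOKEN_LIMIT = 100_000    # tokens per user per hour (from the text)
WINDOW_SECONDS = 3600

class SimpleTokenLimiter:
    """Fixed-window, per-user token cap held entirely in memory."""

    def __init__(self):
        self._usage = defaultdict(lambda: [0.0, 0])  # user -> [window_start, tokens_used]
        self._lock = Lock()

    def allow(self, user_id, tokens, now=None):
        """Return True and record usage if the request fits in this hour's budget."""
        now = time.time() if now is None else now
        with self._lock:
            start, used = self._usage[user_id]
            if now - start >= WINDOW_SECONDS:   # hour elapsed: reset the window
                start, used = now, 0
            if used + tokens > TOKEN_LIMIT:     # over budget: reject
                self._usage[user_id] = [start, used]
                return False
            self._usage[user_id] = [start, used + tokens]
            return True
```

Because state is keyed only by user_id, an attacker holding 15 compromised accounts gets 15 independent quotas — the bypass described below.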
It fails three ways:
- Credential Stuffing Bypass: Attackers rotate 15 compromised accounts, each sending 10 max-length (8,192-token) prompts. 150 requests consume 1.8M tokens ($54/rotation), scaling to $2,160 over 9 hours without triggering limits. Finance sees weekend bills jump from $180 to $2,300.
- Workflow Blocking: Hospital-wide 100K tokens/hour halts triage during surges. 18 patients from a crash consume 84K tokens in minutes; remaining critical cases wait 47 minutes, forcing manual fallback and delaying internal bleeding detection.
- No Anomaly Detection: Limits ignore patterns like rotation, escalation, or mimicry of normal timing.
Tradeoffs: Zero upfront cost, but $2,100+ breach costs and high clinical risk. Blocks emergencies without distinguishing priority.
Pattern 2: Tiered User Quotas – Improved but Queue-Prone
Roles get quotas: STANDARD (nurses: 50 req/hour, 50K tokens), ADVANCED (physicians: 100 req, 150K tokens), RESEARCH (200 req, 500K tokens), ADMIN (500 req, 1M tokens). Tracks requests/tokens hourly/daily plus concurrent limits (e.g., 5 for ADVANCED). Still in-memory, but needs user management (~$15-30K setup).
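A sketch of the tiered scheme under those numbers; the quota values come from the text above, while the class names and the non-ADVANCED concurrency caps are assumptions:

```python
import time
from dataclasses import dataclass
from threading import Lock

@dataclass(frozen=True)
class Tier:
    req_per_hour: int
    tokens_per_hour: int
    max_concurrent: int

# Quota numbers from the text; concurrency caps other than ADVANCED's 5 are assumed.
TIERS = {
    "STANDARD": Tier(50, 50_000, 3),
    "ADVANCED": Tier(100, 150_000, 5),
    "RESEARCH": Tier(200, 500_000, 5),
    "ADMIN":    Tier(500, 1_000_000, 10),
}

class TieredLimiter:
    def __init__(self):
        self._state = {}   # user -> hourly counters plus in-flight count
        self._lock = Lock()

    def try_acquire(self, user_id, role, tokens, now=None):
        now = time.time() if now is None else now
        tier = TIERS[role]
        with self._lock:
            s = self._state.setdefault(
                user_id, {"start": now, "reqs": 0, "tokens": 0, "in_flight": 0})
            if now - s["start"] >= 3600:       # hourly window rollover
                s.update(start=now, reqs=0, tokens=0)
            if (s["reqs"] + 1 > tier.req_per_hour
                    or s["tokens"] + tokens > tier.tokens_per_hour
                    or s["in_flight"] + 1 > tier.max_concurrent):
                return False
            s["reqs"] += 1
            s["tokens"] += tokens
            s["in_flight"] += 1
            return True

    def release(self, user_id):
        """Call when a request completes, freeing a concurrency slot."""
        with self._lock:
            self._state[user_id]["in_flight"] -= 1
```

Note that every check is still keyed per user: the tiers cap individual abuse, but nothing here sees the aggregate picture across accounts.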
It contains per-user abuse better (roughly a 60% cut in attack surface), yet gaps persist:
- Queue Cascades: 40 physicians at shift change (7:00 AM) queue 140-280 requests. Acute MI summary latencies hit 8.2s (vs. 1.4s normal), causing tool abandonment and missed interactions.
- Ignores Priority: Researcher's 94 PDFs (11.2M tokens) queues behind emergency drug checks, adding 12s delays.
- Stuffing Viable: 15 ADVANCED accounts yield 1,500 req/hour, 2.25M tokens/hour capacity.
Tradeoffs: Handles roles/concurrent better, but system-wide spikes and equal prioritization hurt clinical use. Credential stuffing reduced, not eliminated.
“Simple rate limiting can’t distinguish between attack traffic and legitimate high-priority clinical use.” – Piyoosh Rai, explaining why token caps block emergencies like mass casualties.
Pattern 3: Context-Aware Throttling – Production Winner
The effective pattern layers clinical priority (emergency > routine), behavior analysis (anomalies like stuffing), adaptive load throttling, cost circuit breakers, and attack signatures. While details are implementation-heavy, it addresses all failures: prioritizes ICU checks over batch jobs, detects rotations, scales dynamically.
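Two of those layers — priority admission and rotation detection — can be sketched together; the thresholds, field names, and urgency levels here are illustrative assumptions, not the production system described:

```python
import heapq
import itertools
from collections import defaultdict

# Clinical urgency ordering: lower number drains first.
PRIORITY = {"emergency": 0, "urgent": 1, "routine": 2, "batch": 3}

class ContextAwareThrottle:
    def __init__(self, max_in_flight=10, rotation_threshold=8):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self._queue = []                       # (priority, seq, request)
        self._seq = itertools.count()
        self._ip_accounts = defaultdict(set)   # source IP -> distinct user ids seen
        self.rotation_threshold = rotation_threshold

    def submit(self, request):
        """Queue a request by urgency, or flag it as likely credential rotation."""
        accounts = self._ip_accounts[request["ip"]]
        accounts.add(request["user"])
        # Many distinct accounts from one source resembles credential stuffing
        # even when each account is individually under quota.
        if len(accounts) >= self.rotation_threshold:
            return "flagged"
        heapq.heappush(self._queue,
                       (PRIORITY[request["urgency"]], next(self._seq), request))
        return "queued"

    def next_request(self):
        """Admit the highest-urgency queued request if capacity allows."""
        if self._queue and self.in_flight < self.max_in_flight:
            self.in_flight += 1
            return heapq.heappop(self._queue)[2]
        return None
```

The key property: under load, emergency requests always drain before batch work, and an unusual number of distinct accounts from one source trips a stuffing flag that per-user quotas can never raise.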
From six incidents and four health systems' consultations, this alone prevents bills and disruptions. It rejects uniform assumptions, treating requests by urgency, pattern, and load.
Tradeoffs: Higher complexity/cost (monitoring, ML for anomalies), but zero breaches post-implementation in cited cases. Enables reliable scaling for 200+ users at $3,200/month baseline.
“LLM rate limiting violates both assumptions: uniform cost and legitimate traffic.” – Piyoosh Rai, on the core reason healthcare needs more than traditional limits.
Key Takeaways
- Implement multi-layer throttling: prioritize clinical urgency (e.g., triage > summaries) to avoid blocking emergencies.
- Track beyond tokens/requests: monitor concurrent queues, behavior anomalies, and system load for adaptive limits.
- Tier quotas by role but add global safeguards against stuffing (e.g., IP/session patterns, not just per-user).
- Use circuit breakers for costs: hard budget caps with alerts, tested against max-token attacks.
- Audit for non-uniform costs: estimate input+output tokens accurately; attackers exploit verbose outputs.
- Simulate surges: test shift changes (40+ users) and casualties (60+ req/min) before production.
- Integrate anomaly detection early: flag credential rotation, sudden volume from breached accounts.
- Start simple, evolve: Pattern 1 for prototypes, Pattern 3 for healthcare prod.
- Measure clinical impact: latency >2s risks abandonment; aim <1.5s p95 under load.
- Budget for security: $15-30K setup beats $47K surprises.
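The cost circuit breaker from the takeaways can be sketched as a hard daily cap with an early alert; the $500 budget and 80% alert threshold are assumed values for illustration:

```python
# Assumed budget values: a $500/day hard cap with an alert at 80% spend.
class CostCircuitBreaker:
    def __init__(self, daily_budget_usd=500.0, alert_fraction=0.8):
        self.daily_budget = daily_budget_usd
        self.alert_at = daily_budget_usd * alert_fraction
        self.spent = 0.0
        self.alerted = False

    def record(self, cost_usd):
        """Record one request's cost; 'open' means stop all further LLM calls."""
        self.spent += cost_usd
        if self.spent >= self.daily_budget:
            return "open"      # hard stop: budget exhausted
        if self.spent >= self.alert_at and not self.alerted:
            self.alerted = True
            return "alert"     # one-time warning to on-call/finance
        return "closed"
```

Testing this against a simulated max-token attack (per the takeaway above) means driving record() with worst-case per-request costs and confirming it trips before the bill reaches incident territory.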