Implementing Request Scheduling and Preemption in NanoGPT

The Need for Intelligent Scheduling

In a standard First-Come-First-Serve (FCFS) inference setup, long-running requests can monopolize token budgets, forcing high-priority or short requests to wait indefinitely. To solve this, a Scheduler class must manage request admission and eviction based on a defined max_kv_tokens budget. By tracking priority and arrival_time, the system can dynamically reorder tasks to optimize throughput and latency.

Core Scheduling Mechanisms

The scheduler relies on two primary functions to enforce memory constraints:

_maybe_admit: Uses a min-heap to select the next request from the waiting queue. It only admits a request if the current kv_used plus the candidate's prompt tokens remain under the max_kv_tokens limit.
_maybe_preempt: Acts as a safety valve. If the system exceeds its memory budget (often due to active requests growing their KV caches during decoding), it identifies a 'victim'—the request with the lowest priority and most recent arrival time. The victim is evicted, its KV cache is cleared, and it is pushed back into the waiting queue to be re-prefilled from scratch later.

Implementation Strategy

This approach uses recompute preemption, which trades future GPU compute cycles for immediate memory relief. By resetting the prefill_cursor and clearing the cache, the system ensures that when the request is re-admitted, it starts the prefill process over. This maintains the integrity of the KV cache, as verified by ensuring the cache length matches the prompt plus generated tokens. When integrating this into the scheduled_generate loop, the scheduler replaces manual tracking of active requests, simplifying the lifecycle management of concurrent inferences.

The Need for Intelligent Scheduling

Core Scheduling Mechanisms

Implementation Strategy

More from Software Engineering

35B Models on RTX 4090: TurboQuant KV Compression Unlocks 32K Context

LLM-as-Judge Evaluates RAG: Keyword Beats Vector

Harmony: Render gpt-oss Response Format in Rust/Python

TurboQuant: 4-7x KV Cache Compression in vLLM