Deploying vLLM Endpoints on Hugging Face Jobs

On-Demand LLM Serving

Hugging Face Jobs provides a container-based environment to deploy LLM inference servers without the overhead of Kubernetes or manual server provisioning. By using the hf jobs run command, users can deploy the vllm/vllm-openai image directly on HF infrastructure. The service is billed per-second, making it a cost-effective solution for short-lived tasks like model evaluation, batch processing, or rapid prototyping.

Configuration and Scaling

Deploying a model requires specifying the hardware flavor and exposing the container port. For larger models, users can scale by selecting multi-GPU flavors (e.g., h200x2) and configuring vLLM's --tensor-parallel-size to match the GPU count.

Key performance tuning tips include:

Memory Management: If a model fails to start due to OOM errors, reduce --max-model-len and --max-num-seqs to fit within the GPU's memory constraints.
Debugging: Use the --ssh flag during launch to gain shell access to the container, allowing for real-time monitoring with nvidia-smi and log inspection.
Tooling: Enable features like --enable-auto-tool-choice to support agentic workflows, allowing the endpoint to function as a backend for coding agents like Pi.

Security and Lifecycle

Endpoints deployed via HF Jobs are gated by the user's Hugging Face token. Requests must include a bearer token with read access to the job's namespace, effectively using the HF jobs proxy as an API gateway. Because these jobs are billed by usage, users should explicitly cancel jobs using hf jobs cancel <job_id> when finished, though a --timeout flag serves as a safety net to prevent runaway costs.

Choosing the Right Infrastructure

HF Jobs: Best for maximum control, experimentation, and one-off tasks where you need to customize the container and flags directly.
Inference Endpoints: Best for production-ready services requiring scale-to-zero capabilities, managed uptime, and more granular access control.

On-Demand LLM Serving

Configuration and Scaling

Security and Lifecycle

Choosing the Right Infrastructure

More from Inference & Serving

ParallelKernelBench: Frontier LLMs Struggle with Multi-GPU Kernels

Prototype Big, Deploy Small: A Framework for Local LLM Adoption

Optimizing Browser AI with Cross-Origin Storage

OpenAI's GPT-5.6 Launch: Frontier Models as Managed Assets