On-Demand LLM Serving
Hugging Face Jobs provides a container-based environment to deploy LLM inference servers without the overhead of Kubernetes or manual server provisioning. By using the hf jobs run command, users can deploy the vllm/vllm-openai image directly on HF infrastructure. The service is billed per-second, making it a cost-effective solution for short-lived tasks like model evaluation, batch processing, or rapid prototyping.
Configuration and Scaling
Deploying a model requires specifying the hardware flavor and exposing the container port. For larger models, users can scale by selecting multi-GPU flavors (e.g., h200x2) and configuring vLLM's --tensor-parallel-size to match the GPU count.
Key performance tuning tips include:
- Memory Management: If a model fails to start due to OOM errors, reduce
--max-model-lenand--max-num-seqsto fit within the GPU's memory constraints. - Debugging: Use the
--sshflag during launch to gain shell access to the container, allowing for real-time monitoring withnvidia-smiand log inspection. - Tooling: Enable features like
--enable-auto-tool-choiceto support agentic workflows, allowing the endpoint to function as a backend for coding agents like Pi.
Security and Lifecycle
Endpoints deployed via HF Jobs are gated by the user's Hugging Face token. Requests must include a bearer token with read access to the job's namespace, effectively using the HF jobs proxy as an API gateway. Because these jobs are billed by usage, users should explicitly cancel jobs using hf jobs cancel <job_id> when finished, though a --timeout flag serves as a safety net to prevent runaway costs.
Choosing the Right Infrastructure
- HF Jobs: Best for maximum control, experimentation, and one-off tasks where you need to customize the container and flags directly.
- Inference Endpoints: Best for production-ready services requiring scale-to-zero capabilities, managed uptime, and more granular access control.