Self-Host Gemma 4 on Cloud Run GPUs: Ollama vs vLLM
Deploy open Gemma 4 LLM on serverless Cloud Run GPUs two ways: Ollama bakes model into container for instant cold starts; vLLM mounts from GCS FUSE for model swaps without rebuilds. Full CI/CD via Cloud Build.
Choose Open Models like Gemma 4 for Control and Cost Predictability
Self-hosting open models like Google's Gemma 4 gives you full control over customization, fine-tuning, and data privacy—critical for regulated industries like healthcare or finance where sending data to closed models like Gemini isn't viable. Closed models excel out-of-the-box with state-of-the-art performance but limit tuning beyond prompts. Open models cap costs at infrastructure levels (no per-API-call scaling) and integrate as the "brain" in agentic systems via wrappers like Google's Agent Development Kit (ADK), which supports any LLM, not just Gemini.
Key principles: Evaluate models by performance, use case, cost, and capacity. Gemma 4 (2B parameter version here) fits L4 GPUs on Cloud Run, enabling scale-to-zero serverless inference. Use Ollama for dev/POC (easy local testing, multi-GPU) or vLLM for production (PagedAttention for memory efficiency, dynamic batching, high concurrency).
"Open model like Gemma is easy to take control, you can even fine-tune it." — Annie Wang
Common mistake: Assuming agent frameworks lock you into proprietary models—ADK's LiteLLM wrapper connects any model seamlessly.
Shared GCP Foundation: Project Setup and Permissions
Start in Cloud Shell (a persistent, VS Code-like VM at console.cloud.google.com). Run the setup script to:
- Authenticate gcloud: `gcloud auth login`.
- Clone repos: `agentverse-devops-sre` (templates, Cloud Build YAMLs) and `agentverse-dungeon` (agent fight files).
- Create the project (`agentverse-guardians-<ID>`); link billing manually via Manage Resources if needed.
- Set the project: `gcloud config set project <ID>`.
- Enable APIs with `gcloud services enable`: Artifact Registry, Cloud Build, Cloud Run, Cloud Storage, Secret Manager.
- Create an Artifact Registry repo: `gcloud artifacts repositories create <repo> --repository-format=docker`.
- Grant the default service account roles: Storage Admin, Cloud Build Service Account, Logs Writer/Viewer, Secret Manager Secret Accessor.
- Run `warmup.sh` to cache GCS FUSE.
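The setup steps above can be consolidated into one script. A minimal sketch, assuming the project ID and region from the lab and a placeholder repo name (substitute your own values):

```shell
#!/bin/bash
set -euo pipefail

# Assumed values -- replace with your own project ID, region, and repo name.
PROJECT_ID="agentverse-guardians-<ID>"
REGION="us-central1"
REPO="agentverse-repo"

gcloud auth login
gcloud config set project "$PROJECT_ID"

# Enable the required APIs (no cost until you actually use them).
gcloud services enable \
  artifactregistry.googleapis.com \
  cloudbuild.googleapis.com \
  run.googleapis.com \
  storage.googleapis.com \
  secretmanager.googleapis.com

# Docker repo that will hold the Ollama and vLLM images.
gcloud artifacts repositories create "$REPO" \
  --repository-format=docker --location="$REGION"
```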
Service accounts act as "robot accounts" for granular permissions—use separate ones in production. Enabling APIs incurs no immediate cost; billing starts on usage.
"Every Google Cloud project has a default service account... that's essentially going to be like the operator behind many of your default actions." — Ayo Adedeji (IO)
Quality criteria: verify the project ID appears in yellow in the Cloud Shell prompt and `gcloud config list` shows the correct project. Refresh the page if timeouts occur (70-minute security idle).
Ollama Deployment: Bake Model for Instant Cold Starts
Ollama pulls and embeds Gemma 4 directly into the container—ideal for rapid iteration but requires rebuilds for model updates.
Dockerfile:

```dockerfile
FROM ollama/ollama
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```
`entrypoint.sh` runs `ollama serve` and pulls `gemma2:2b`.
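The script itself isn't reproduced in the source; a minimal sketch of what `entrypoint.sh` might look like (the fixed sleep is a crude stand-in for a readiness poll against Ollama's default port 11434):

```shell
#!/bin/bash
set -e

# Start the Ollama server in the background so we can pull the model.
ollama serve &
SERVE_PID=$!

# Crude wait for the server to accept connections; polling
# http://localhost:11434 until it responds would be more robust.
sleep 5

# Pull the model so it is baked into the container image/instance.
ollama pull gemma2:2b

# Re-join the server process so the container stays alive.
wait $SERVE_PID
```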
`cloudbuild-ollama.yaml` defines the CI/CD pipeline:

- Build and push: `gcloud builds submit --config=cloudbuild-ollama.yaml .` runs `docker build -t <image> .` and `docker push gcr.io/$PROJECT_ID/ollama`.
- Deploy to Cloud Run: `gcloud run deploy ollama --image=gcr.io/$PROJECT_ID/ollama --cpu=4 --memory=16Gi --gpu=1 --gpu-type=nvidia-l4 --concurrency=4 --min-instances=1 --max-instances=1 --allow-unauthenticated --region=us-central1`.
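The pipeline file itself isn't shown in the source; a minimal sketch of what `cloudbuild-ollama.yaml` might contain, assuming the image and service names used above:

```yaml
steps:
  # Build the Ollama image (model pulled by entrypoint.sh).
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/ollama', '.']
  # Push it to the registry so Cloud Run can pull it.
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/ollama']
  # Deploy to Cloud Run with an L4 GPU.
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args:
      - 'run'
      - 'deploy'
      - 'ollama'
      - '--image=gcr.io/$PROJECT_ID/ollama'
      - '--cpu=4'
      - '--memory=16Gi'
      - '--gpu=1'
      - '--gpu-type=nvidia-l4'
      - '--concurrency=4'
      - '--min-instances=1'
      - '--max-instances=1'
      - '--allow-unauthenticated'
      - '--region=us-central1'
images:
  - 'gcr.io/$PROJECT_ID/ollama'
```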
Trade-offs: 16GB RAM for 2B model; L4 GPU; concurrency=4. Scales to zero but min=1 here for lab (scale higher in prod). Build takes 15-20 mins—monitor in Cloud Build console.
Test:

```shell
curl -X POST https://ollama-<hash>-uc.a.run.app/api/generate \
  -d '{"model": "gemma2:2b", "prompt": "Why is the sky blue?"}'
```
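By default, Ollama's `/api/generate` streams newline-delimited JSON chunks, each carrying a `response` fragment and a `done` flag. A small Python sketch for stitching a captured stream back into a full answer (the sample body below is fabricated for illustration):

```python
import json

def join_ollama_stream(ndjson_text: str) -> str:
    """Concatenate the 'response' fields of an Ollama NDJSON stream."""
    parts = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk carries done: true
            break
    return "".join(parts)

# Fabricated sample of what the streamed body looks like.
sample = (
    '{"model": "gemma2:2b", "response": "Because ", "done": false}\n'
    '{"model": "gemma2:2b", "response": "of Rayleigh scattering.", "done": true}\n'
)
print(join_ollama_stream(sample))  # → Because of Rayleigh scattering.
```

Passing `"stream": false` in the request body returns a single JSON object instead.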
Before: Local Ollama testing. After: Serverless endpoint ready for agents.
"Ollama is great for development use cases. It's really easy to install and get up and running." — Ayo Adedeji
vLLM Deployment: Decouple Model via GCS FUSE for Agility
vLLM loads model from Cloud Storage FUSE mount—slower initial boot (caches on first run) but swap models by updating GCS without redeploy.
Prerequisites: Hugging Face token stored in Secret Manager (`gcloud secrets create hf-token --data-file=<token-file>`).
- Download Gemma 4 to GCS: a script pulls the weights from Hugging Face (`huggingface-cli download google/gemma-2-2b-it`) and copies them to the bucket.
- Dockerfile: base image `vllm/vllm-openai`; mounts the GCS bucket via FUSE (`gcsfuse`); serves the OpenAI-compatible API under `/v1`.
- `cloudbuild-vllm.yaml`: a similar pipeline, but the build pulls the HF token from Secret Manager.
- Deploy: same shape as the Ollama deploy, plus `--gpu=1 --gpu-type=nvidia-l4 --env-vars-file=vllm.env` (sets `HF_TOKEN`).

FUSE mounts GCS as a filesystem (`gcsfuse <bucket> /models`); the warmup script caches the model for faster boots.
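A sketch of the full vLLM deploy command, assuming a service name of `vllm` and the env file from the lab (resource flags mirror the Ollama deploy):

```shell
# Deploy the vLLM image with an L4 GPU; HF_TOKEN and model settings
# come from vllm.env. Resource flags mirror the Ollama deploy.
gcloud run deploy vllm \
  --image=gcr.io/$PROJECT_ID/vllm \
  --cpu=4 --memory=16Gi \
  --gpu=1 --gpu-type=nvidia-l4 \
  --concurrency=4 \
  --min-instances=1 --max-instances=1 \
  --env-vars-file=vllm.env \
  --allow-unauthenticated \
  --region=us-central1
```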
Test: the same curl pattern, but against `/v1/chat/completions` with an OpenAI-compatible request body.
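Because vLLM speaks the OpenAI wire format, any OpenAI-style client works against the endpoint. A small Python sketch that builds the request body and pulls the answer out of a response (the URL is a placeholder and the sample response is fabricated):

```python
import json

def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def extract_answer(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response."""
    return response["choices"][0]["message"]["content"]

payload = chat_payload("google/gemma-2-2b-it", "Why is the sky blue?")
# json.dumps(payload) is what you'd POST to
# https://vllm-<hash>-uc.a.run.app/v1/chat/completions
body = json.dumps(payload)

# Fabricated sample response for illustration.
sample_response = {
    "choices": [{"message": {"role": "assistant", "content": "Rayleigh scattering."}}]
}
print(extract_answer(sample_response))  # → Rayleigh scattering.
```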
"vLLM is great for production use cases. It comes with PagedAttention... great for memory efficiency." — Ayo Adedeji
Common mistake: forgetting the GPU allocation (L4), under-provisioning RAM (16Gi+), or skipping the FUSE warmup, all of which lead to OOM errors or slow boots.
Production Trade-offs and Agent Integration
| Aspect | Ollama | vLLM |
|---|---|---|
| Cold Start | Instant (baked model) | Slower (GCS mount) |
| Model Updates | Rebuild/deploy | GCS overwrite |
| Use Case | Dev/POC | Prod (concurrency) |
| Concurrency | Basic | Dynamic batching |
Optimize: use authenticated invocations; raise max-instances above 1; monitor costs (GPUs aren't free). Integrate as the agent "brain": ADK routes tool calls and reasoning to your Cloud Run endpoint.
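The ADK integration can be sketched roughly as follows. This assumes the `google-adk` Python package's `LiteLlm` wrapper and LiteLLM's `hosted_vllm/` provider prefix for OpenAI-compatible endpoints; treat the class names and parameters as assumptions to verify against the ADK docs:

```python
# Sketch only: verify class names and parameters against the ADK docs.
from google.adk.agents import LlmAgent
from google.adk.models.lite_llm import LiteLlm

# Point LiteLLM at the self-hosted, OpenAI-compatible vLLM endpoint.
gemma = LiteLlm(
    model="hosted_vllm/google/gemma-2-2b-it",        # provider/model for LiteLLM
    api_base="https://vllm-<hash>-uc.a.run.app/v1",  # your Cloud Run URL
)

agent = LlmAgent(
    name="guardian",
    model=gemma,  # any LLM, not just Gemini
    instruction="You are the reasoning brain of the Agentverse guardian.",
)
```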
"The model you're choosing really like can determine the upper bound, the capability of your agentic system." — Annie Wang
Exercise: Extend to boss fight in Agentverse—deploy agent vs. agent via A2A.
Key Takeaways
- Self-host Gemma 4 on Cloud Run L4 GPUs for predictable costs and privacy in agent systems.
- Use Ollama for fast dev deploys: Bake model in Dockerfile, CI/CD via Cloud Build YAML.
- Prefer vLLM for prod: Mount GCS via FUSE, update models without rebuilds.
- Always set up IAM on the default service account; enabling APIs only incurs costs on use.
- Configure Cloud Run: 4 CPU/16Gi RAM/GPU=1/concurrency=4; scale-to-zero with min=1 for labs.
- Test with curl to `/api/generate` (Ollama) or `/v1/chat/completions` (vLLM).
- Warm the GCS FUSE cache; monitor builds in the console (15-20 min).
- Integrate via ADK LiteLLM wrapper for any model as agent brain.