Self-Host Gemma 4 on Cloud Run GPUs: Ollama vs vLLM
Deploy open Gemma 4 LLM on serverless Cloud Run GPUs two ways: Ollama bakes model into container for instant cold starts; vLLM mounts from GCS FUSE for model swaps without rebuilds. Full CI/CD via Cloud Build.
Choose Open Models like Gemma 4 for Control and Cost Predictability
Self-hosting open models like Google's Gemma 4 gives you full control over customization, fine-tuning, and data privacy—critical for regulated industries like healthcare or finance where sending data to closed models like Gemini isn't viable. Closed models excel out-of-the-box with state-of-the-art performance but limit tuning beyond prompts. Open models cap costs at infrastructure levels (no per-API-call scaling) and integrate as the "brain" in agentic systems via wrappers like Google's Agent Development Kit (ADK), which supports any LLM, not just Gemini.
Key principles: Evaluate models by performance, use case, cost, and capacity. Gemma 4 (2B parameter version here) fits L4 GPUs on Cloud Run, enabling scale-to-zero serverless inference. Use Ollama for dev/POC (easy local testing, multi-GPU) or vLLM for production (PagedAttention for memory efficiency, dynamic batching, high concurrency).
"Open model like Gemma is easy to take control, you can even fine-tune it." — Annie Wang
Common mistake: Assuming agent frameworks lock you into proprietary models—ADK's LiteLLM wrapper connects any model seamlessly.
Shared GCP Foundation: Project Setup and Permissions
Start in Cloud Shell (a persistent, VS Code-like VM at console.cloud.google.com). Run the setup script to:
- Authenticate gcloud: `gcloud auth login`.
- Clone repos: `agentverse-devops-sre` (templates, Cloud Build YAMLs) and `agentverse-dungeon` (agent fight files).
- Create the project (`agentverse-guardians-<ID>`); link billing manually via Manage Resources if needed.
- Set the project: `gcloud config set project <ID>`.
- Enable APIs with `gcloud services enable`: Artifact Registry, Cloud Build, Cloud Run, Cloud Storage, Secret Manager.
- Create an Artifact Registry repo: `gcloud artifacts repositories create <repo> --repository-format=docker`.
- Grant the default service account roles: Storage Admin, Cloud Build Service Account, Logs Writer/Viewer, Secret Manager Secret Accessor.
- Run `warmup.sh` to cache GCS FUSE.
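The setup steps above can be consolidated into one script. A minimal sketch, assuming the project ID and region from the lab and a placeholder repo name (substitute your own values):

```shell
#!/bin/bash
set -euo pipefail

# Assumed values -- replace with your own project ID, region, and repo name.
PROJECT_ID="agentverse-guardians-<ID>"
REGION="us-central1"
REPO="agentverse-repo"

gcloud auth login
gcloud config set project "$PROJECT_ID"

# Enable the required APIs (no cost until you actually use them).
gcloud services enable \
  artifactregistry.googleapis.com \
  cloudbuild.googleapis.com \
  run.googleapis.com \
  storage.googleapis.com \
  secretmanager.googleapis.com

# Docker repo that will hold the Ollama and vLLM images.
gcloud artifacts repositories create "$REPO" \
  --repository-format=docker --location="$REGION"
```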
Service accounts act as "robot accounts" for granular permissions—use separate ones in production. Enabling APIs incurs no immediate cost; billing starts on usage.
"Every Google Cloud project has a default service account... that's essentially going to be like the operator behind many of your default actions." — Ayo Adedeji (IO)
Quality criteria: verify the project ID appears in yellow in the Cloud Shell prompt and `gcloud config list` shows the correct project. Refresh the page if timeouts occur (70-minute security idle).
Ollama Deployment: Bake Model for Instant Cold Starts
Ollama pulls and embeds Gemma 4 directly into the container—ideal for rapid iteration but requires rebuilds for model updates.
Dockerfile:

```dockerfile
FROM ollama/ollama
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```
`entrypoint.sh` runs `ollama serve` and pulls `gemma2:2b`.
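The script itself isn't reproduced in the source; a minimal sketch of what `entrypoint.sh` might look like (the fixed sleep is a crude stand-in for a readiness poll against Ollama's default port 11434):

```shell
#!/bin/bash
set -e

# Start the Ollama server in the background so we can pull the model.
ollama serve &
SERVE_PID=$!

# Crude wait for the server to accept connections; polling
# http://localhost:11434 until it responds would be more robust.
sleep 5

# Pull the model so it is baked into the container image/instance.
ollama pull gemma2:2b

# Re-join the server process so the container stays alive.
wait $SERVE_PID
```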
`cloudbuild-ollama.yaml` defines the CI/CD pipeline:

- Build and push: `gcloud builds submit --config=cloudbuild-ollama.yaml .` runs `docker build -t <image> .` and `docker push gcr.io/$PROJECT_ID/ollama`.
- Deploy to Cloud Run: `gcloud run deploy ollama --image=gcr.io/$PROJECT_ID/ollama --cpu=4 --memory=16Gi --gpu=1 --gpu-type=nvidia-l4 --concurrency=4 --min-instances=1 --max-instances=1 --allow-unauthenticated --region=us-central1`.
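The pipeline file itself isn't shown in the source; a minimal sketch of what `cloudbuild-ollama.yaml` might contain, assuming the image and service names used above:

```yaml
steps:
  # Build the Ollama image (model pulled by entrypoint.sh).
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/ollama', '.']
  # Push it to the registry so Cloud Run can pull it.
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/ollama']
  # Deploy to Cloud Run with an L4 GPU.
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: gcloud
    args:
      - 'run'
      - 'deploy'
      - 'ollama'
      - '--image=gcr.io/$PROJECT_ID/ollama'
      - '--cpu=4'
      - '--memory=16Gi'
      - '--gpu=1'
      - '--gpu-type=nvidia-l4'
      - '--concurrency=4'
      - '--min-instances=1'
      - '--max-instances=1'
      - '--allow-unauthenticated'
      - '--region=us-central1'
images:
  - 'gcr.io/$PROJECT_ID/ollama'
```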
Trade-offs: 16GB RAM for 2B model; L4 GPU; concurrency=4. Scales to zero but min=1 here for lab (scale higher in prod). Build takes 15-20 mins—monitor in Cloud Build console.
Test:

```shell
curl -X POST https://ollama-<hash>-uc.a.run.app/api/generate \
  -d '{"model": "gemma2:2b", "prompt": "Why is the sky blue?"}'
```
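By default, Ollama's `/api/generate` streams newline-delimited JSON chunks, each carrying a `response` fragment and a `done` flag. A small Python sketch for stitching a captured stream back into a full answer (the sample body below is fabricated for illustration):

```python
import json

def join_ollama_stream(ndjson_text: str) -> str:
    """Concatenate the 'response' fields of an Ollama NDJSON stream."""
    parts = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk carries done: true
            break
    return "".join(parts)

# Fabricated sample of what the streamed body looks like.
sample = (
    '{"model": "gemma2:2b", "response": "Because ", "done": false}\n'
    '{"model": "gemma2:2b", "response": "of Rayleigh scattering.", "done": true}\n'
)
print(join_ollama_stream(sample))  # → Because of Rayleigh scattering.
```

Passing `"stream": false` in the request body returns a single JSON object instead.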
Before: Local Ollama testing. After: Serverless endpoint ready for agents.
"Ollama is great for development use cases. It's really easy to install and get up and running." — Ayo Adedeji
vLLM Deployment: Decouple Model via GCS FUSE for Agility
vLLM loads model from Cloud Storage FUSE mount—slower initial boot (caches on first run) but swap models by updating GCS without redeploy.
Prerequisites: Hugging Face token stored in Secret Manager (`gcloud secrets create hf-token --data-file=<token-file>`).
- Download Gemma 4 to GCS: a script pulls the weights from Hugging Face (`huggingface-cli download google/gemma-2-2b-it`) and copies them to the bucket.
- Dockerfile: base image `vllm/vllm-openai`; mounts the GCS bucket via FUSE (`gcsfuse`); serves the OpenAI-compatible API under `/v1`.
- `cloudbuild-vllm.yaml`: a similar pipeline, but the build pulls the HF token from Secret Manager.
- Deploy: same shape as the Ollama deploy, plus `--gpu=1 --gpu-type=nvidia-l4 --env-vars-file=vllm.env` (sets `HF_TOKEN`).

FUSE mounts GCS as a filesystem (`gcsfuse <bucket> /models`); the warmup script caches the model for faster boots.
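A sketch of the full vLLM deploy command, assuming a service name of `vllm` and the env file from the lab (resource flags mirror the Ollama deploy):

```shell
# Deploy the vLLM image with an L4 GPU; HF_TOKEN and model settings
# come from vllm.env. Resource flags mirror the Ollama deploy.
gcloud run deploy vllm \
  --image=gcr.io/$PROJECT_ID/vllm \
  --cpu=4 --memory=16Gi \
  --gpu=1 --gpu-type=nvidia-l4 \
  --concurrency=4 \
  --min-instances=1 --max-instances=1 \
  --env-vars-file=vllm.env \
  --allow-unauthenticated \
  --region=us-central1
```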
Test: the same curl pattern, but against `/v1/chat/completions` with an OpenAI-compatible request body.
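Because vLLM speaks the OpenAI wire format, any OpenAI-style client works against the endpoint. A small Python sketch that builds the request body and pulls the answer out of a response (the URL is a placeholder and the sample response is fabricated):

```python
import json

def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def extract_answer(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response."""
    return response["choices"][0]["message"]["content"]

payload = chat_payload("google/gemma-2-2b-it", "Why is the sky blue?")
# json.dumps(payload) is what you'd POST to
# https://vllm-<hash>-uc.a.run.app/v1/chat/completions
body = json.dumps(payload)

# Fabricated sample response for illustration.
sample_response = {
    "choices": [{"message": {"role": "assistant", "content": "Rayleigh scattering."}}]
}
print(extract_answer(sample_response))  # → Rayleigh scattering.
```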
"vLLM is great for production use cases. It comes with PagedAttention... great for memory efficiency." — Ayo Adedeji
Common mistake: forgetting the GPU allocation (L4), under-provisioning RAM (16Gi+), or skipping the FUSE warmup, all of which lead to OOM errors or slow boots.
Production Trade-offs and Agent Integration
| Aspect | Ollama | vLLM |
|---|---|---|
| Cold Start | Instant (baked model) | Slower (GCS mount) |
| Model Updates | Rebuild/deploy | GCS overwrite |
| Use Case | Dev/POC | Prod (concurrency) |
| Concurrency | Basic | Dynamic batching |
Optimize: use authenticated invocations; raise max-instances above 1; monitor costs (GPUs aren't free). Integrate as the agent "brain": ADK routes tool calls and reasoning to your Cloud Run endpoint.
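The ADK integration can be sketched roughly as follows. This assumes the `google-adk` Python package's `LiteLlm` wrapper and LiteLLM's `hosted_vllm/` provider prefix for OpenAI-compatible endpoints; treat the class names and parameters as assumptions to verify against the ADK docs:

```python
# Sketch only: verify class names and parameters against the ADK docs.
from google.adk.agents import LlmAgent
from google.adk.models.lite_llm import LiteLlm

# Point LiteLLM at the self-hosted, OpenAI-compatible vLLM endpoint.
gemma = LiteLlm(
    model="hosted_vllm/google/gemma-2-2b-it",        # provider/model for LiteLLM
    api_base="https://vllm-<hash>-uc.a.run.app/v1",  # your Cloud Run URL
)

agent = LlmAgent(
    name="guardian",
    model=gemma,  # any LLM, not just Gemini
    instruction="You are the reasoning brain of the Agentverse guardian.",
)
```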
"The model you're choosing really like can determine the upper bound, the capability of your agentic system." — Annie Wang
Exercise: Extend to boss fight in Agentverse—deploy agent vs. agent via A2A.
Key Takeaways
- Self-host Gemma 4 on Cloud Run L4 GPUs for predictable costs and privacy in agent systems.
- Use Ollama for fast dev deploys: Bake model in Dockerfile, CI/CD via Cloud Build YAML.
- Prefer vLLM for prod: Mount GCS via FUSE, update models without rebuilds.
- Always set up IAM on the default service account; enabling APIs only incurs costs on use.
- Configure Cloud Run: 4 CPU/16Gi RAM/GPU=1/concurrency=4; scale-to-zero with min=1 for labs.
- Test with curl to `/api/generate` (Ollama) or `/v1/chat/completions` (vLLM).
- Warm the GCS FUSE cache; monitor builds in the console (15-20 min).
- Integrate via ADK LiteLLM wrapper for any model as agent brain.