Deploy Gemma 4 on Cloud Run GPUs: Ollama vs vLLM
Self-host open Gemma 4 on serverless Cloud Run GPUs: use Ollama for instant cold starts in dev or vLLM for model agility in prod, automated via Cloud Build CI/CD.
Open Models Unlock Agent Control and Cost Predictability
Self-hosting open models like Google's Gemma 4 (2B param version here) beats closed models like Gemini when you need data isolation in regulated fields (healthcare, finance), fine-tuning on domain data, or fixed infra costs that don't scale linearly with usage. Closed models excel out-of-box with SOTA performance and zero management, but open ones let you customize beyond prompts—key for agentic systems where the model is the "brain." Use Google's Agent Development Kit (ADK) with its LiteLLM wrapper to plug in any model, including self-hosted Gemma, for tool-calling and reasoning.
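To make the plug-in point concrete: Ollama also exposes an OpenAI-compatible `/v1/chat/completions` route, so any LiteLLM-style client, including ADK's LiteLLM wrapper, can be aimed at the self-hosted service. A minimal sketch, with a placeholder Cloud Run URL (the deployment itself comes later in this guide):

```bash
# Hypothetical service URL; the real one is printed when you deploy below.
SERVICE_URL="https://ollama-xxxxxxxxxx-uc.a.run.app"

# Self-hosted Gemma answers the same chat-completions shape a closed API would,
# which is what lets an ADK/LiteLLM agent swap models without code changes.
curl "$SERVICE_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma2:2b",
        "messages": [{"role": "user", "content": "Summarize why open models fit regulated industries."}]
      }'
```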
"A lot of industries such as healthcare or finance... running self-hosted models is a really good solution for that." — Ayo Adedeji, on why open models fit isolated scenarios.
Pick models by performance, use case, and cost: the model sets the upper bound on agent capability, so benchmark Gemma against your actual requirements. Trade-off: open models require infra ops, but enable on-prem or VPC isolation.
Baked-In Models with Ollama: Prioritize Dev Speed and Cold-Start Latency
Ollama suits POCs and dev: dead-simple install, multi-GPU ready, model baked into the container for instant cold starts (no download on boot). Downside: updating the model means rebuilding/pushing the full image—slow for prod iteration.
Step-by-step deployment:
- Prep the Cloud Shell env (a persistent, VS Code-like VM; it times out after 70 min of inactivity, so refresh the page to reauthenticate): run `gcloud auth login`, clone the repos (`agentverse-devops-sre` for templates and CI YAMLs, `agentverse-dungeon` for the agent fight assets), run the init script to create a new project (`agentverse-guardians-<id>`), link billing manually via Manage Resources if the fetch fails, set the project with `gcloud config set project`, and enable the APIs (`artifactregistry`, `run`, `cloudbuild`, `storage`, `secretmanager`; no immediate charges, you pay only on use).
- Infra scaffolding: create the Artifact Registry repo (`us-central1-docker.pkg.dev/$PROJECT_ID/ollama`), grant the default service account its IAM roles (`roles/storage.objectAdmin`, `roles/cloudbuild.builds.builder`, `roles/logging.logWriter`, `roles/secretmanager.secretAccessor`; think "robot accounts" with granular prod perms), and run `warmup.sh` (pre-caches GCS FUSE for vLLM later). The equivalent commands are sketched just below.
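A hedged sketch of those scaffolding commands, assuming the default compute service account and the `us-central1` region; the repo name and role list come from the step above:

```bash
export PROJECT_ID=$(gcloud config get-value project)
export PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)')

# Enable the required APIs (no charge until the services are actually used).
gcloud services enable artifactregistry.googleapis.com run.googleapis.com \
  cloudbuild.googleapis.com storage.googleapis.com secretmanager.googleapis.com

# Create the Docker repo that cloudbuild.yaml pushes to.
gcloud artifacts repositories create ollama \
  --repository-format=docker --location=us-central1

# Grant the default compute "robot account" the roles listed above.
for ROLE in roles/storage.objectAdmin roles/cloudbuild.builds.builder \
            roles/logging.logWriter roles/secretmanager.secretAccessor; do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
    --role="$ROLE"
done
```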
- Dockerfile (bake the model in): effectively one pull line stores the 2B Gemma weights inside the image. Note that `ollama pull` talks to a running server and Cloud Run routes traffic to port 8080, hence the two supporting lines:

```Dockerfile
FROM ollama/ollama
# Cloud Run sends requests to port 8080 by default; make Ollama listen there.
ENV OLLAMA_HOST=0.0.0.0:8080
# Briefly start the server during the build so `ollama pull` can bake the
# Gemma weights into the image (instant cold starts, no download on boot).
RUN ollama serve & sleep 5 && ollama pull gemma2:2b
```

- cloudbuild.yaml (CI/CD blueprint: build → push → deploy):
Key Cloud Run flags: 4 CPU / 16 Gi RAM (sized for the 2B model), one L4 GPU (inference acceleration), `--concurrency=4` (parallel requests per instance), min/max instances of 1 (lab cost control; scale higher in prod), and unauthenticated access (secure with IAM in prod). Builds take 15-20 min (Docker pull, build, push).

```yaml
steps:
  # Build the image with the baked-in model.
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'us-central1-docker.pkg.dev/$PROJECT_ID/ollama/ollama:latest', '.']
  # Push it to Artifact Registry.
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'us-central1-docker.pkg.dev/$PROJECT_ID/ollama/ollama:latest']
  # Deploy to Cloud Run with a GPU.
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    args:
      - gcloud
      - run
      - deploy
      - ollama
      - --image=us-central1-docker.pkg.dev/$PROJECT_ID/ollama/ollama:latest
      - --platform=managed
      - --region=us-central1
      - --allow-unauthenticated
      - --cpu=4
      - --memory=16Gi
      - --concurrency=4
      - --gpu=1
      - --gpu-type=nvidia-l4
      - --min-instances=1
      - --max-instances=1
```

- Trigger the build:
```bash
gcloud builds submit --config=cloudbuild.yaml .
```
Monitor in Console > Cloud Build (per-step logs). Then test the endpoint with a POST request; a hedged `curl` sketch follows below.
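A plausible completion of the test command, assuming Ollama's standard `/api/generate` route; the service URL is a placeholder (fetch the real one with `gcloud run services describe ollama --region=us-central1 --format='value(status.url)'`):

```bash
# Placeholder URL; substitute the one printed by the deploy step.
SERVICE_URL="https://ollama-xxxxxxxxxx-uc.a.run.app"

# Non-streaming completion against Ollama's native generate endpoint.
curl -X POST "$SERVICE_URL/api/generate" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma2:2b", "prompt": "Why is the sky blue?", "stream": false}'
```

A JSON response arriving within seconds of a cold start is the payoff of baking the model into the image.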