Deploy Gemma to Cloud Run with Ollama & vLLM
Hands-on guide to deploying open Gemma models on Google Cloud Run using Ollama for dev or vLLM for prod, covering agent system pillars like cost, scale, and model choice for custom AI agents.
Pillars of End-to-End Agent System Management
Building agentic systems requires balancing cost/capacity, model strategy, serving at scale, security/safety, and observability. Cost with closed models like Gemini scales linearly per API call, while open models like Gemma have fixed infra costs regardless of usage volume. Capacity optimization involves GPU resource allocation, such as Nvidia L4 accelerators on Cloud Run. Model strategy weighs closed models' state-of-the-art performance and ease against open models' customizability, fine-tuning, and on-premise deployment for regulated industries like healthcare or finance. Serving at scale demands frameworks supporting concurrency and batching. Security benefits from self-hosting to avoid sending sensitive data externally. Observability tracks agent reasoning, tool selection, and performance.
Agents use models as the 'brain' for reasoning and tool selection, setting the system's capability ceiling. Google's Agent Development Kit (ADK) supports any model via its LiteLLM wrapper, not just Gemini, allowing Gemma integration. Evaluate models by performance benchmarks, use case fit (e.g., domain-specific tuning), and total cost.
"Ollama as you mentioned, you can customize the model. So a lot of use cases have very um uh domain-specific data where you can kind of improve performance by tuning. Um and you can do that with an open model as opposed to closed models like Gemini."
Open vs Closed Model Trade-offs
Closed models excel out-of-the-box with general capabilities but limit customization beyond prompting. Open models enable full control, fine-tuning on proprietary data, and self-hosting for data isolation. Gemma suits agent brains needing high customization without vendor lock-in. Avoid open models if rapid prototyping without infra management is priority; choose closed for managed scaling.
Key decision framework: Match the model to the agent architecture, where the LLM reasons over tools. Test Gemma 2 2B for lighter loads; it fits in 16 GB of memory. Production pitfalls: underestimating memory leads to OOM errors; always spec at least 16 GB RAM and a GPU.
"The model you're choosing really like can determine the like the upper bound, the capability of your agentic system. That's why it's very important and you want to be smart to choose your model."
Ollama Deployment Pipeline for Development
Ollama suits local/dev workflows: simple install, multi-GPU support, model baked into images. Prerequisites: Google Cloud project with billing, Cloud Shell (persistent VS Code-like env, auto-timeout after 70min—refresh to re-auth). Assumes basic gcloud familiarity; fits AI devs building POCs.
Step-by-step:
- Environment Setup: Run gcloud auth login with a billing-linked account. Clone two repos: Agent Verse DevOps SRE (templates, YAMLs) and Agent Verse Dungeon (boss-fight assets for agent testing). Create the project AgentVerseGuardian-<ID>, then link billing manually via console.cloud.google.com > Manage Resources > select the project > Link billing account.
- Configure gcloud: gcloud config set project <ID>. Verify the project ID appears (in yellow) in the prompt and with gcloud config list.
- Enable APIs: Run the setup script to enable Cloud Storage, AI Platform, Cloud Build, Artifact Registry, and Secret Manager. Enabling APIs incurs no immediate charges; you are billed on usage. (A command sketch for this and the permissions step follows the list.)
- Artifact Registry: Create a repo for container images: gcloud artifacts repositories create <repo> --repository-format=docker.
- Permissions: Grant the default service account these roles: Storage Object Admin, Cloud Build Service Account, Logs Writer/Viewer, Secret Manager Accessor. Analogy: service accounts are 'robot users' with scoped permissions; use separate, narrowly scoped ones in prod.
- Warm-up: ./warmup.sh preps the GCS FUSE cache (used by vLLM later).
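Roughly, the API-enablement and role-grant steps reduce to gcloud commands like the following (a sketch only; the lab's scripts wrap these, and the exact service list, roles, and account should be checked against the repo):
# Enable the services the lab relies on (no charge until you use them).
gcloud services enable storage.googleapis.com aiplatform.googleapis.com \
  cloudbuild.googleapis.com artifactregistry.googleapis.com secretmanager.googleapis.com
# Grant one of the listed roles to the default compute service account (repeat per role).
gcloud projects add-iam-policy-binding <PROJECT_ID> \
  --member="serviceAccount:<PROJECT_NUMBER>-compute@developer.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"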
Dockerfile (bakes model):
FROM ollama/ollama
# Startup script from the lab repo, copied in and made executable.
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
# Start the Ollama server when the container boots.
ENTRYPOINT ["ollama", "serve"]
Pull the model during the image build (ollama pull gemma2:2b) so the weights are baked into the image; a sketch of that build step follows.
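One common way to wire this up (an assumption, not necessarily the lab's exact Dockerfile) is a RUN step that briefly starts the server inside the build, pulls the model so the weights land in the image layer, and then lets the build continue:
# Hypothetical build step: background the server, give it a moment to come up,
# then pull gemma2:2b so it is stored inside the image.
RUN ollama serve & sleep 5 && ollama pull gemma2:2b
With the weights baked in, the container only needs to start the server at runtime.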
cloudbuild.yaml (CI/CD blueprint):
- Build: docker build -t <image> .
- Push: docker push <registry>/<image>
- Deploy to Cloud Run: spec 4 CPU, 16 GB RAM, an Nvidia L4 GPU, concurrency=4, min/max instances=1 (lab-only; scale in prod), and allow-unauthenticated (secure this in prod). A sketch of the equivalent gcloud run deploy command follows this list.
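The deploy step boils down to a gcloud run deploy call roughly like the one below (a sketch with placeholder names; exact flag spellings can vary by gcloud release, and GPU deploys may require a recent or beta gcloud):
gcloud run deploy gemma-ollama \
  --image=us-central1-docker.pkg.dev/<PROJECT_ID>/<REPO>/<IMAGE> \
  --region=us-central1 \
  --cpu=4 --memory=16Gi \
  --gpu=1 --gpu-type=nvidia-l4 --no-cpu-throttling \
  --concurrency=4 --min-instances=1 --max-instances=1 \
  --allow-unauthenticated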
Run: gcloud builds submit --config=cloudbuild.yaml --region=us-central1 --substitutions=_REPO_NAME=<repo>,_PROJECT_ID=<id>,_SERVICE_NAME=gemma-ollama (15-20min). Track in Cloud Build console.
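If you'd rather stay in the terminal, recent builds can also be listed with gcloud (an optional aside; the region flag assumes the regional build used above):
gcloud builds list --region=us-central1 --limit=5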
Verify the deployment: load env vars, capture the service URL, and send a test prompt.
./set-env.sh
GEMMA_OLLAMA_URL=$(gcloud run services describe ... --format='value(status.url)')
curl -X POST $GEMMA_OLLAMA_URL/api/generate -d '{"model": "gemma2:2b", "prompt": "As a guardian of Aetherius, what is my primary duty?"}'
Expect a Gemma-generated response.
Common mistakes: forgetting to re-run set-env.sh (env vars reset after a Cloud Shell timeout); insufficient memory (OOM); leaving the endpoint unauthenticated in prod (add IAM).
Quality check: Response streams coherently, no errors in Cloud Run logs.
"Cloud Run is a really powerful serverless platform and gives us a lot of configuration capability... we're specifying four CPU um minimum for each machine of the service. Um we're specifying memory to be at least 16 GB."
vLLM Deployment for Production Scale
vLLM is optimized for prod: PagedAttention for memory efficiency, dynamic batching, and high concurrency. It differs from Ollama in model handling: weights live in GCS (not baked into the image) and are pulled from Hugging Face using an HF token stored in Secret Manager. GCS FUSE mounts the weights into the container for fast startup.
The process mirrors Ollama's, except you download the weights to a GCS bucket and mount them via FUSE inside the container (a command sketch follows). The payoff is higher throughput for multi-user agents; the trade-off is more setup than Ollama's simplicity.
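A minimal sketch of the extra plumbing, with hypothetical names for the bucket, secret, image, and service (the lab's scripts and exact flag spellings may differ, and GPU deploys may require a recent or beta gcloud):
# Bucket to hold the model weights downloaded from Hugging Face.
gcloud storage buckets create gs://<PROJECT_ID>-vllm-models --location=us-central1
# Store the Hugging Face token so the container can read it at runtime.
echo -n "<HF_TOKEN>" | gcloud secrets create hf-token --data-file=-
# Deploy vLLM with the bucket mounted via Cloud Storage FUSE.
gcloud run deploy gemma-vllm \
  --image=us-central1-docker.pkg.dev/<PROJECT_ID>/<REPO>/<VLLM_IMAGE> \
  --region=us-central1 \
  --cpu=4 --memory=16Gi --gpu=1 --gpu-type=nvidia-l4 --no-cpu-throttling \
  --set-secrets=HF_TOKEN=hf-token:latest \
  --add-volume=name=models,type=cloud-storage,bucket=<PROJECT_ID>-vllm-models \
  --add-volume-mount=volume=models,mount-path=/mnt/models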
"vLLM is great for production use cases. It comes with page attention. It's great for uh memory efficiency um and allows you to kind of do uh multiple concurrency um when it comes to calls and dynamic batching."
Integrating Deployed Models into Agents
Connect the Cloud Run endpoint to ADK agents for tool-calling and reasoning. Test via the 'boss fight': agent vs. agent over A2A (agent-to-agent) using the Dungeon repo. The setup scales to many concurrent calls; monitor via Cloud Logging.
Exercise: Deploy both runtimes and benchmark latency/concurrency on sample agent prompts (a rough benchmarking sketch follows). Extend: fine-tune Gemma on domain data, add auth.
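One rough way to eyeball latency under concurrency against the Ollama endpoint (a sketch using plain curl and xargs, not the lab's benchmarking harness; the prompt is arbitrary):
# Fire 8 identical prompts, 4 at a time, and report per-request and total time.
time seq 8 | xargs -P 4 -I{} curl -s -o /dev/null -w "request {}: %{time_total}s\n" \
  -X POST "$GEMMA_OLLAMA_URL/api/generate" \
  -d '{"model": "gemma2:2b", "prompt": "Summarize the duties of a guardian.", "stream": false}'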
"Google ADK uh comes with a light LM wrapper that allows you to kind of connect models as you see fit. Um so later on in this lab, we're going to learn how we can use Gemma 4 as the brain behind an agent."
Key Takeaways
- Prioritize model strategy: Open like Gemma for customization/cost control in agents; closed like Gemini for managed SOTA.
- Use Ollama for dev POCs (bake model, quick local test); vLLM for prod (GCS storage, batching).
- Always spec GPU (L4), 16GB+ RAM, concurrency=4+ for 2B models on Cloud Run.
- CI/CD via Cloud Build: Dockerfile → Artifact Registry → Deploy; track builds/logs.
- Secure with IAM service accounts, Secret Manager for HF keys; authenticate endpoints.
- Verify deployments with curl to /api/generate; integrate via LiteLLM in ADK.
- Refresh Cloud Shell every 70min; link billing manually if script fails.
- Benchmark: Fixed infra costs beat per-call scaling for high-volume agents.
- Fits broader workflow: After deployment, plug into agent loops for reasoning/tools.