Gemma 4 Prod Stack: Model Armor, ADK Agents, Tracing

Unifying Model Serving with Load Balancer Routing

After deploying Gemma 4 separately via vLLM (optimized for production throughput, parallelism, memory) and Ollama (suited for dev/testing) to Cloud Run services, the team routes traffic through a single regional external Application Load Balancer endpoint. This avoids managing multiple URLs in production.

Key decisions:

Network Endpoint Groups (NEGs): Serverless NEGs represent Cloud Run backends for the LB. Created via gcloud compute network-endpoint-groups create with --network-endpoint-type=SERVERLESS.
Backend Services: Defined for each Cloud Run service (gcloud compute backend-services create), attached to NEGs. Enables LB to communicate securely.
URL Map: Routes based on path—e.g., /vllm/ to vLLM backend, /ollama/ to Ollama. Switch dev/prod by path prefix without endpoint changes. Command: gcloud compute url-maps create with host/path rules.

Tradeoffs: Cloud Run scales multi-region natively, so LB adds setup overhead (NEGs, backends, proxy subnet, HTTPS certs, target proxy, forwarding rules). But it provides a single invocable HTTPS endpoint and service extensions. Without LB, use direct Cloud Run URLs, losing unified routing.

Proxy-only subnet reserves private IPs for LB-to-Cloud Run communication in the VPC. SSL certs enable HTTPS termination at the target HTTPS proxy, which consults the URL map before forwarding (port 443).

"The reason why we're doing that for this particular lab using a load balancer, it's actually acting as a very advanced URL or a traffic router. So we have two different services, but we really don't want to be maintaining two different endpoints in production."

—Ayo Adedeji, explaining single-endpoint benefits over direct Cloud Run access.

Network-Level Security with Model Armor Service Extension

Model Armor scans every prompt/response for jailbreaks, prompt injection, PII leaks (e.g., SSNs, credit cards), harassment via LB service extension—triggered before backend routing.

Integration: Attach as extension to URL map (gcloud compute url-maps add-service-extension). Configurable thresholds/actions: block malicious inputs, replace harmful outputs with defaults. Detects sensitive data in agent generations.

Alternatives considered:

SDK/API: Invoke via Python SDK or REST API in ADK callbacks (before-agent or after-model). No LB needed—e.g., filter inputs pre-agent call.
Direct in code: Embed in app logic, but network-level is zero-code-change, applies to all backends.

Why LB extension? Enforces security at ingress without app modifications; scales with traffic. For non-LB setups, callbacks provide lifecycle hooks (e.g., pre-model scan).

"Model armor is really versatile you can use it in many different ways so there's a model armor python SDK... There's also model armor API that you can call... often times... before agent call back or after model call back."

—Ayo Adedeji, on flexible Model Armor invocation beyond LB.

Results: Blocks malicious traffic pre-model; logs detections for audit. Config via templates for custom harms/PII.

Model-Agnostic Agents with ADK and vLLM on Cloud Run

Agent Development Kit (ADK) builds agents atop any LLM (Gemini, Gemma 4). Here, pairs with lightweight vLLM serving Gemma 4, deployed to Cloud Run via Cloud Build CI/CD.

Pipeline: Cloud Build triggers deploys; vLLM handles inference. Preps for "boss fight"—agent vs. cloud dungeon agent.

Why vLLM? High token throughput, GPU efficiency for prod. ADK callbacks enable Model Armor hooks.

"ADK is actually model agnostic... The trick is we're gonna using ADK with light LLM vLLM and you're gonna learn how to use that."

—Annie Wang, highlighting ADK flexibility for Gemma 4.

Production Observability: Metrics and End-to-End Tracing

Post-deploy: Prometheus sidecar scrapes vLLM metrics (token throughput, GPU utilization, TTFT, req/s, latency, output tokens/req)—feeds cost/performance monitoring.

Cloud Trace with OpenTelemetry: Traces agent flows end-to-end.

Why these? Directly tie to costs (GPU, tokens); essential for agent ops at scale. Sidecar avoids custom exporters.

"We want to track things such as time to first token... GPU utilization request per second request latency output tokens per request. The reason why we want to do this because this all factors into how we control for and monitor performance throughput and costs."

—Ayo Adedeji, on metric selection for prod serving.

Key Takeaways

Use LB + URL maps for single-endpoint routing to multiple backends (e.g., vLLM prod vs. Ollama dev); path-based switching simplifies ops.
Integrate Model Armor as LB extension for zero-code network security; fallback to SDK/API in ADK callbacks for direct Cloud Run.
Build model-agnostic agents with ADK + vLLM on Cloud Run; CI/CD via Cloud Build for rapid iteration.
Monitor vLLM via Prometheus sidecar (GPU util, latency, tokens); add OpenTelemetry for agent traces.
Skip LB if no extensions/routing needed—Cloud Run scales alone—but LB unlocks Model Armor at ingress.
Reserve proxy-only subnet for secure LB-VPC comms; provision SSL certs for HTTPS.
Test in labs: Free GCP credits (non-GPU); full stack preps for agent battles/dungeons.
Prioritize observability pillars: security/safety first, then metrics for cost control.

"When we're talking about end-to-end agent system management... there's many different pillars... observability and security and safety."

—Ayo Adedeji, framing agent ops holistically.