Gemma 4 Prod Stack: Secure Agents with Armor & Tracing

Build a production Gemma 4 agent stack on GCP: shield prompts with Model Armor via load balancer, deploy ADK agents on vLLM/Cloud Run, monitor via Prometheus/Cloud Trace for security, scale, and cost control.

Shielding Models from Attacks with Model Armor and Load Balancers

Model Armor integrates as a Load Balancer Service Extension to scan every prompt and response for jailbreaks, PII leaks, harassment, and harmful content before reaching backends. This network-level protection blocks malicious traffic automatically, configurable for custom deny responses.

Core Setup Process:

  1. Deploy Gemma 4 backends: Use vLLM (optimized for throughput/memory parallelism, production-scale) and Ollama (development-friendly) as separate Cloud Run services from prior deployment.
  2. Create Serverless Network Endpoint Groups (NEGs): Represent Cloud Run backends for load balancer (e.g., gcloud compute network-endpoint-groups create vllm-neg --network-endpoint-type=SERVERLESS --cloud-run-service=vllm-gemma-service --region=us-central1).
  3. Define Backend Services: Link NEGs to backends (e.g., gcloud compute backend-services create vllm-backend --global --network-endpoint-groups=vllm-neg --network-endpoint-groups-region=us-central1).
  4. Build URL Map for Routing: Single endpoint routes via path prefixes (e.g., /vllm/ to vLLM backend, /ollama/ to Ollama), enabling dev/prod switching without endpoint sprawl.
  5. Provision Proxy-Only Subnet: Reserves private IPs for secure load balancer access to VPC-networked Cloud Run (e.g., /28 CIDR block).
  6. Generate SSL Cert and Create HTTPS Proxy/Forwarding Rule: Enables secure HTTPS invocation (e.g., gcloud compute ssl-certificates create lb-ssl --global).
  7. Attach Model Armor Extension: Links to URL map, configuring detectors like prompt-injection, pii, harmful-content with thresholds and actions (block/log).

Key Principles: Load balancers centralize traffic management, avoiding multiple endpoints while enabling extensions like Model Armor. Without LB, invoke Model Armor via Python SDK/API in agent callbacks (e.g., before-agent or after-model in ADK) for inline protection. Trade-off: LB adds setup complexity but automates network-level filtering; direct Cloud Run suits simpler scaling without routing.

Common Pitfalls Avoided:

  • Forgetting proxy subnet: Breaks private VPC access.
  • No URL map: Can't route one endpoint to multiple backends.
  • HTTPS without cert: Forces insecure HTTP.

Quality Check: Test with adversarial prompts (e.g., jailbreak attempts); expect 403 blocks. Monitor logs for detections.

"Model Armor... detecting for malicious inputs... prompt injection, jailbreaking... sensitive data leaks like security card or social security number." — Ayo Adedeji, explaining detection scope.

Deploying Scalable Agents with ADK and vLLM on Cloud Run

Agent Development Kit (ADK) builds model-agnostic agents (works with Gemma 4, not just Gemini) powered by vLLM for high-throughput inference. Deploy via Cloud Build CI/CD to Cloud Run for serverless scaling.

Agent Pipeline Steps:

  1. Prep Dungeon (Boss Fight Setup): Run Cloud Build to deploy opponent agent (gcloud builds submit --config=cloudbuild-dungeon.yaml); monitors in Cloud Build console.
  2. Integrate ADK with vLLM: Use lightweight LLM backend in ADK config (e.g., LiteLLM wrapper for Gemma 4 endpoint).
  3. CI/CD with Cloud Build: Triggers on repo changes, builds container with vLLM/Gemma 4, deploys to Cloud Run.
  4. Invoke via Load Balancer: Agent calls routed through secured endpoint.

Principles: ADK's callbacks enable Model Armor API insertion (pre-agent for input scan, post-model for output). vLLM excels in production (parallelism, GPU efficiency) vs. Ollama (dev prototyping). Single LB endpoint simplifies client integration.

Trade-offs: Cloud Run auto-scales but incurs cold starts; LB adds latency (~50-100ms) for security. For non-LB: Direct Cloud Run endpoints with SDK-integrated safety.

Evaluation Criteria: Agent handles multi-turn interactions reliably; boss fight tests combat logic (e.g., vs. cloud monster).

"ADK is actually model agnostic... using ADK with LiteLLM and you're gonna learn how to use that." — Annie Wang, on flexibility.

Monitoring Production Metrics and End-to-End Tracing

Achieve observability with Prometheus sidecar for vLLM metrics (TTFT, GPU util, latency, tokens/sec) and OpenTelemetry/Cloud Trace for agent traces.

Metrics Setup:

  • Inject Prometheus sidecar into Cloud Run (scrapes /metrics from vLLM).
  • Key Metrics: Token throughput, GPU utilization, req/s, latency, output tokens/req — all tie to cost/performance.

Tracing Setup:

  1. Instrument ADK with OpenTelemetry (OTel) exporter to Cloud Trace.
  2. Trace spans: Prompt → Model call → Response, end-to-end via LB.
  3. View in Cloud Monitoring/Trace console.

Principles: Metrics predict costs (e.g., high GPU idle = waste); traces debug agent failures (e.g., tool call latency). Sidecar avoids app code changes.

Pitfalls: Unguarded metrics explode bills; set alerts for >80% GPU.

"Track... time to first token, GPU utilization, request per second, request latency, output tokens per request... factors into... performance throughput and costs." — Ayo Adedeji, on prod monitoring.

Boss Fight Integration: Pits your ADK agent against deployed dungeon agent; traces reveal perf bottlenecks.

"By the end of today's episode, you will have a secure observable Gemma 4 AI agent in production." — Intro takeaway.

Key Takeaways

  • Route multiple model backends (vLLM/Ollama) through one LB endpoint with URL maps for dev/prod switching.
  • Attach Model Armor as LB extension for automatic jailbreak/PII scanning; fallback to SDK in callbacks.
  • Build ADK agents with LiteLLM for Gemma 4; deploy via Cloud Build to Cloud Run.
  • Add Prometheus sidecar for vLLM metrics (GPU, tokens) and OTel for traces to control costs.
  • Reserve proxy-only subnet for secure LB-to-Cloud Run comms in VPC.
  • Test security with adversarial prompts; monitor traces for agent debugging.
  • Prefer LB for network safety at scale; direct Cloud Run for simplicity.
  • Always reconfigure env vars in lab scripts for resilience.

Summarized by x-ai/grok-4.1-fast via openrouter

8675 input / 2550 output tokens in 22474ms

© 2026 Edge