Gemma 4 Prod Stack: Secure Agents with Armor & Tracing
Build a production Gemma 4 agent stack on GCP: shield prompts with Model Armor via load balancer, deploy ADK agents on vLLM/Cloud Run, monitor via Prometheus/Cloud Trace for security, scale, and cost control.
Shielding Models from Attacks with Model Armor and Load Balancers
Model Armor integrates as a Load Balancer Service Extension to scan every prompt and response for jailbreaks, PII leaks, harassment, and harmful content before reaching backends. This network-level protection blocks malicious traffic automatically, configurable for custom deny responses.
Core Setup Process:
- Deploy Gemma 4 backends: Use vLLM (optimized for throughput/memory parallelism, production-scale) and Ollama (development-friendly) as separate Cloud Run services from prior deployment.
- Create Serverless Network Endpoint Groups (NEGs): Represent Cloud Run backends for load balancer (e.g.,
gcloud compute network-endpoint-groups create vllm-neg --network-endpoint-type=SERVERLESS --cloud-run-service=vllm-gemma-service --region=us-central1). - Define Backend Services: Link NEGs to backends (e.g.,
gcloud compute backend-services create vllm-backend --global --network-endpoint-groups=vllm-neg --network-endpoint-groups-region=us-central1). - Build URL Map for Routing: Single endpoint routes via path prefixes (e.g.,
/vllm/to vLLM backend,/ollama/to Ollama), enabling dev/prod switching without endpoint sprawl. - Provision Proxy-Only Subnet: Reserves private IPs for secure load balancer access to VPC-networked Cloud Run (e.g.,
/28CIDR block). - Generate SSL Cert and Create HTTPS Proxy/Forwarding Rule: Enables secure HTTPS invocation (e.g.,
gcloud compute ssl-certificates create lb-ssl --global). - Attach Model Armor Extension: Links to URL map, configuring detectors like
prompt-injection,pii,harmful-contentwith thresholds and actions (block/log).
Key Principles: Load balancers centralize traffic management, avoiding multiple endpoints while enabling extensions like Model Armor. Without LB, invoke Model Armor via Python SDK/API in agent callbacks (e.g., before-agent or after-model in ADK) for inline protection. Trade-off: LB adds setup complexity but automates network-level filtering; direct Cloud Run suits simpler scaling without routing.
Common Pitfalls Avoided:
- Forgetting proxy subnet: Breaks private VPC access.
- No URL map: Can't route one endpoint to multiple backends.
- HTTPS without cert: Forces insecure HTTP.
Quality Check: Test with adversarial prompts (e.g., jailbreak attempts); expect 403 blocks. Monitor logs for detections.
"Model Armor... detecting for malicious inputs... prompt injection, jailbreaking... sensitive data leaks like security card or social security number." — Ayo Adedeji, explaining detection scope.
Deploying Scalable Agents with ADK and vLLM on Cloud Run
Agent Development Kit (ADK) builds model-agnostic agents (works with Gemma 4, not just Gemini) powered by vLLM for high-throughput inference. Deploy via Cloud Build CI/CD to Cloud Run for serverless scaling.
Agent Pipeline Steps:
- Prep Dungeon (Boss Fight Setup): Run Cloud Build to deploy opponent agent (
gcloud builds submit --config=cloudbuild-dungeon.yaml); monitors in Cloud Build console. - Integrate ADK with vLLM: Use lightweight LLM backend in ADK config (e.g.,
LiteLLMwrapper for Gemma 4 endpoint). - CI/CD with Cloud Build: Triggers on repo changes, builds container with vLLM/Gemma 4, deploys to Cloud Run.
- Invoke via Load Balancer: Agent calls routed through secured endpoint.
Principles: ADK's callbacks enable Model Armor API insertion (pre-agent for input scan, post-model for output). vLLM excels in production (parallelism, GPU efficiency) vs. Ollama (dev prototyping). Single LB endpoint simplifies client integration.
Trade-offs: Cloud Run auto-scales but incurs cold starts; LB adds latency (~50-100ms) for security. For non-LB: Direct Cloud Run endpoints with SDK-integrated safety.
Evaluation Criteria: Agent handles multi-turn interactions reliably; boss fight tests combat logic (e.g., vs. cloud monster).
"ADK is actually model agnostic... using ADK with LiteLLM and you're gonna learn how to use that." — Annie Wang, on flexibility.
Monitoring Production Metrics and End-to-End Tracing
Achieve observability with Prometheus sidecar for vLLM metrics (TTFT, GPU util, latency, tokens/sec) and OpenTelemetry/Cloud Trace for agent traces.
Metrics Setup:
- Inject Prometheus sidecar into Cloud Run (scrapes
/metricsfrom vLLM). - Key Metrics: Token throughput, GPU utilization, req/s, latency, output tokens/req — all tie to cost/performance.
Tracing Setup:
- Instrument ADK with OpenTelemetry (OTel) exporter to Cloud Trace.
- Trace spans: Prompt → Model call → Response, end-to-end via LB.
- View in Cloud Monitoring/Trace console.
Principles: Metrics predict costs (e.g., high GPU idle = waste); traces debug agent failures (e.g., tool call latency). Sidecar avoids app code changes.
Pitfalls: Unguarded metrics explode bills; set alerts for >80% GPU.
"Track... time to first token, GPU utilization, request per second, request latency, output tokens per request... factors into... performance throughput and costs." — Ayo Adedeji, on prod monitoring.
Boss Fight Integration: Pits your ADK agent against deployed dungeon agent; traces reveal perf bottlenecks.
"By the end of today's episode, you will have a secure observable Gemma 4 AI agent in production." — Intro takeaway.
Key Takeaways
- Route multiple model backends (vLLM/Ollama) through one LB endpoint with URL maps for dev/prod switching.
- Attach Model Armor as LB extension for automatic jailbreak/PII scanning; fallback to SDK in callbacks.
- Build ADK agents with LiteLLM for Gemma 4; deploy via Cloud Build to Cloud Run.
- Add Prometheus sidecar for vLLM metrics (GPU, tokens) and OTel for traces to control costs.
- Reserve proxy-only subnet for secure LB-to-Cloud Run comms in VPC.
- Test security with adversarial prompts; monitor traces for agent debugging.
- Prefer LB for network safety at scale; direct Cloud Run for simplicity.
- Always reconfigure env vars in lab scripts for resilience.