Gemma 4 Prod Stack: Secure Agents with Armor & Tracing

Shielding Models from Attacks with Model Armor and Load Balancers

Model Armor integrates as a Load Balancer Service Extension to scan every prompt and response for jailbreaks, PII leaks, harassment, and harmful content before reaching backends. This network-level protection blocks malicious traffic automatically, configurable for custom deny responses.

Core Setup Process:

Deploy Gemma 4 backends: Use vLLM (optimized for throughput/memory parallelism, production-scale) and Ollama (development-friendly) as separate Cloud Run services from prior deployment.
Create Serverless Network Endpoint Groups (NEGs): Represent Cloud Run backends for load balancer (e.g., gcloud compute network-endpoint-groups create vllm-neg --network-endpoint-type=SERVERLESS --cloud-run-service=vllm-gemma-service --region=us-central1).
Define Backend Services: Link NEGs to backends (e.g., gcloud compute backend-services create vllm-backend --global --network-endpoint-groups=vllm-neg --network-endpoint-groups-region=us-central1).
Build URL Map for Routing: Single endpoint routes via path prefixes (e.g., /vllm/ to vLLM backend, /ollama/ to Ollama), enabling dev/prod switching without endpoint sprawl.
Provision Proxy-Only Subnet: Reserves private IPs for secure load balancer access to VPC-networked Cloud Run (e.g., /28 CIDR block).
Generate SSL Cert and Create HTTPS Proxy/Forwarding Rule: Enables secure HTTPS invocation (e.g., gcloud compute ssl-certificates create lb-ssl --global).
Attach Model Armor Extension: Links to URL map, configuring detectors like prompt-injection, pii, harmful-content with thresholds and actions (block/log).

Key Principles: Load balancers centralize traffic management, avoiding multiple endpoints while enabling extensions like Model Armor. Without LB, invoke Model Armor via Python SDK/API in agent callbacks (e.g., before-agent or after-model in ADK) for inline protection. Trade-off: LB adds setup complexity but automates network-level filtering; direct Cloud Run suits simpler scaling without routing.

Common Pitfalls Avoided:

Forgetting proxy subnet: Breaks private VPC access.
No URL map: Can't route one endpoint to multiple backends.
HTTPS without cert: Forces insecure HTTP.

Quality Check: Test with adversarial prompts (e.g., jailbreak attempts); expect 403 blocks. Monitor logs for detections.

"Model Armor... detecting for malicious inputs... prompt injection, jailbreaking... sensitive data leaks like security card or social security number." — Ayo Adedeji, explaining detection scope.

Deploying Scalable Agents with ADK and vLLM on Cloud Run

Agent Development Kit (ADK) builds model-agnostic agents (works with Gemma 4, not just Gemini) powered by vLLM for high-throughput inference. Deploy via Cloud Build CI/CD to Cloud Run for serverless scaling.

Agent Pipeline Steps:

Prep Dungeon (Boss Fight Setup): Run Cloud Build to deploy opponent agent (gcloud builds submit --config=cloudbuild-dungeon.yaml); monitors in Cloud Build console.
Integrate ADK with vLLM: Use lightweight LLM backend in ADK config (e.g., LiteLLM wrapper for Gemma 4 endpoint).
CI/CD with Cloud Build: Triggers on repo changes, builds container with vLLM/Gemma 4, deploys to Cloud Run.
Invoke via Load Balancer: Agent calls routed through secured endpoint.

Principles: ADK's callbacks enable Model Armor API insertion (pre-agent for input scan, post-model for output). vLLM excels in production (parallelism, GPU efficiency) vs. Ollama (dev prototyping). Single LB endpoint simplifies client integration.

Trade-offs: Cloud Run auto-scales but incurs cold starts; LB adds latency (~50-100ms) for security. For non-LB: Direct Cloud Run endpoints with SDK-integrated safety.

Evaluation Criteria: Agent handles multi-turn interactions reliably; boss fight tests combat logic (e.g., vs. cloud monster).

"ADK is actually model agnostic... using ADK with LiteLLM and you're gonna learn how to use that." — Annie Wang, on flexibility.

Monitoring Production Metrics and End-to-End Tracing

Achieve observability with Prometheus sidecar for vLLM metrics (TTFT, GPU util, latency, tokens/sec) and OpenTelemetry/Cloud Trace for agent traces.

Metrics Setup:

Inject Prometheus sidecar into Cloud Run (scrapes /metrics from vLLM).
Key Metrics: Token throughput, GPU utilization, req/s, latency, output tokens/req — all tie to cost/performance.

Tracing Setup:

Instrument ADK with OpenTelemetry (OTel) exporter to Cloud Trace.
Trace spans: Prompt → Model call → Response, end-to-end via LB.
View in Cloud Monitoring/Trace console.

Principles: Metrics predict costs (e.g., high GPU idle = waste); traces debug agent failures (e.g., tool call latency). Sidecar avoids app code changes.

Pitfalls: Unguarded metrics explode bills; set alerts for >80% GPU.

"Track... time to first token, GPU utilization, request per second, request latency, output tokens per request... factors into... performance throughput and costs." — Ayo Adedeji, on prod monitoring.

Boss Fight Integration: Pits your ADK agent against deployed dungeon agent; traces reveal perf bottlenecks.

"By the end of today's episode, you will have a secure observable Gemma 4 AI agent in production." — Intro takeaway.

Key Takeaways

Route multiple model backends (vLLM/Ollama) through one LB endpoint with URL maps for dev/prod switching.
Attach Model Armor as LB extension for automatic jailbreak/PII scanning; fallback to SDK in callbacks.
Build ADK agents with LiteLLM for Gemma 4; deploy via Cloud Build to Cloud Run.
Add Prometheus sidecar for vLLM metrics (GPU, tokens) and OTel for traces to control costs.
Reserve proxy-only subnet for secure LB-to-Cloud Run comms in VPC.
Test security with adversarial prompts; monitor traces for agent debugging.
Prefer LB for network safety at scale; direct Cloud Run for simplicity.
Always reconfigure env vars in lab scripts for resilience.