Secure Gemma AI Agent Prod Deployment on GCP
Build a production-ready Gemma 4 agent on Cloud Run with load-balanced traffic routing, Model Armor security against prompt injection/jailbreaks, and observability metrics like GPU usage and token counts.
Production Architecture: Load Balancer + Model Armor + Observability
Deploying AI agents built on models like Gemma 4 to production requires balancing serving scale, security, and monitoring. The core setup uses two Cloud Run services, one for vLLM (optimized for high throughput, parallelism, and memory efficiency in prod) and one for Ollama (flexible for dev/experimentation), routed via a single Google Cloud Load Balancer endpoint. This avoids managing multiple URLs while enabling service extensions like Model Armor for network-level input/output scanning.
Why this stack? Cloud Run handles serverless scaling natively, but load balancers add traffic control (path-based routing, e.g., /vllm vs /ollama), HTTPS termination, and integrations unavailable directly on Cloud Run. Model Armor scans for prompt injection, jailbreaks, PII leaks (e.g., SSNs, credit cards), harassment—configurable via templates. Observability via Cloud Trace captures agent-specific metrics: time-to-first-token (TTFT), GPU utilization, requests/second, latency, output tokens/request—critical for cost control (tokens/GPU drive bills).
The Agent Development Kit (ADK) keeps the agent model-agnostic: pair it with LiteLLM to invoke Gemma 4 seamlessly (see the sketch below). Principles: network-level security (a load balancer extension) protects raw model endpoints without app logic; app-level security (SDK/API calls in ADK callbacks) suits agent workflows. Trade-off: network-level is more secure and efficient for multiple backends; app-level offers lifecycle hooks (pre-agent/post-model).
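As a minimal sketch, assuming the `google-adk` and `litellm` packages and the vLLM endpoint exposed behind the load balancer (the model ID and URL are placeholders, not from the lab):

```python
# Minimal ADK + LiteLLM sketch; model ID and endpoint are assumptions.
from google.adk.agents import LlmAgent
from google.adk.models.lite_llm import LiteLlm

gemma_agent = LlmAgent(
    name="gemma_agent",
    # LiteLLM's hosted_vllm/ provider speaks the OpenAI-compatible API
    # that vLLM serves; api_base points at the /vllm path on the LB.
    model=LiteLlm(
        model="hosted_vllm/google/gemma-3-27b-it",     # assumed served name
        api_base="https://LOAD_BALANCER_IP/vllm/v1",   # assumed endpoint
    ),
    instruction="You are a helpful production agent.",
)
```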
Prerequisites: a GCP project with the Cloud Run services from the prior lab (Gemma 4 via vLLM/Ollama) and intermediate GCP/Cloud Shell familiarity. No GPU credits are included here, so use spot/preemptible capacity if scaling.
Step-by-Step: Load Balancer Setup for Unified Endpoint
Reconstruct the deployment as a repeatable, Terraform-like gcloud sequence in Cloud Shell. Each step re-exports its env vars for resilience (e.g., if the terminal refreshes); the variables assumed below are illustrative.
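A sketch of the environment this sequence assumes (all names are illustrative, not from the lab):

```bash
# Assumed variable names for this walkthrough; adjust to your project.
export GOOGLE_CLOUD_PROJECT="my-project"
export REGION="us-central1"
export VPC_NAME="default"
export VLLM_SERVICE_NAME="gemma-vllm"        # existing Cloud Run service
export OLLAMA_SERVICE_NAME="gemma-ollama"    # existing Cloud Run service
export VLLM_NEG_NAME="vllm-neg"
export OLLAMA_NEG_NAME="ollama-neg"
export VLLM_BACKEND_NAME="vllm-backend"
export OLLAMA_BACKEND_NAME="ollama-backend"
export CERT_NAME="gemma-lb-cert"
export URL_MAP_NAME="gemma-url-map"
export SUBNET_NAME="proxy-only-subnet"
export SUBNET_RANGE="10.129.0.0/23"
export PROXY_NAME="gemma-https-proxy"
export FORWARDING_RULE_NAME="gemma-fwd-rule"
export LOAD_BALANCER_IP="203.0.113.10"       # reserved IP or address name
```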
1. **Prep the Dungeon (Boss Fight Setup):** Run `gcloud builds submit` for the agent-verse-dungeon Cloud Build job. This deploys the opponent agent's Cloud Run service for the end-of-lab battle; monitor progress in the Cloud Build console.
2. **Create Serverless Network Endpoint Groups (NEGs):** NEGs represent the Cloud Run backends to the load balancer.
```bash
gcloud compute network-endpoint-groups create ${VLLM_NEG_NAME} \
  --network-endpoint-type=SERVERLESS \
  --cloud-run-service=${VLLM_SERVICE_NAME} \
  --region=us-central1
```
Repeat for the Ollama NEG (see the sketch below). The endpoint type is `SERVERLESS` for Cloud Run, as opposed to VM or storage bucket backends.
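For reference, the matching Ollama NEG uses the same flags (service and NEG names are the assumed variables from the setup above):

```bash
gcloud compute network-endpoint-groups create ${OLLAMA_NEG_NAME} \
  --network-endpoint-type=SERVERLESS \
  --cloud-run-service=${OLLAMA_SERVICE_NAME} \
  --region=us-central1
```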
3. **Define Backend Services:** Link NEGs to backends. Note that `gcloud compute backend-services create` does not attach a NEG itself; attaching is a separate `add-backend` call.
```bash
gcloud compute backend-services create ${VLLM_BACKEND_NAME} \
  --global \
  --load-balancing-scheme=EXTERNAL_MANAGED

gcloud compute backend-services add-backend ${VLLM_BACKEND_NAME} \
  --global \
  --network-endpoint-group=${VLLM_NEG_NAME} \
  --network-endpoint-group-region=us-central1
```
Repeat for Ollama. Backends are the load balancer's "buckets" that group NEGs.
4. **HTTPS Frontend:** Provision a self-signed certificate (no custom domain needed). `gcloud` does not generate self-signed certs itself, so create one with `openssl` and upload it:
```bash
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout key.pem -out cert.pem -subj "/CN=gemma-lab.example.com"

gcloud compute ssl-certificates create ${CERT_NAME} \
  --global \
  --certificate=cert.pem \
  --private-key=key.pem
```
5. **URL Map for Path Routing:** A single endpoint routes /vllm to the vLLM backend and /ollama to Ollama. The map is created with a default service, and path rules are attached via a path matcher:
```bash
gcloud compute url-maps create ${URL_MAP_NAME} \
  --default-service=${VLLM_BACKEND_NAME}

gcloud compute url-maps add-path-matcher ${URL_MAP_NAME} \
  --path-matcher-name=gemma-paths \
  --default-service=${VLLM_BACKEND_NAME} \
  --path-rules="/ollama=${OLLAMA_BACKEND_NAME},/ollama/*=${OLLAMA_BACKEND_NAME}"
```
Principle: one endpoint simplifies dev/prod switching (vLLM for prod, Ollama for dev) without endpoint sprawl. A quick routing check follows below.
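Once the map is live, a request on the /ollama path should reach Ollama's native generate API (the model name is an assumption; `-k` tolerates the self-signed cert, and whether the /ollama prefix needs stripping depends on your path-rewrite configuration):

```bash
curl -k "https://${LOAD_BALANCER_IP}/ollama/api/generate" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma3", "prompt": "Say hello.", "stream": false}'
```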
6. **Proxy-Only Subnet:** Reserves private IPs that the load balancer's managed proxies use to reach Cloud Run inside the VPC.
```bash
gcloud compute networks subnets create ${SUBNET_NAME} \
  --purpose=REGIONAL_MANAGED_PROXY \
  --role=ACTIVE \
  --network=${VPC_NAME} \
  --region=${REGION} \
  --range=${SUBNET_RANGE}
```
This enables secure intra-network communication.
7. **Target HTTPS Proxy + Forwarding Rule:** The proxy terminates TLS and consults the URL map.
```bash
gcloud compute target-https-proxies create ${PROXY_NAME} \
  --url-map=${URL_MAP_NAME} \
  --ssl-certificates=${CERT_NAME}

gcloud compute forwarding-rules create ${FORWARDING_RULE_NAME} \
  --global \
  --target-https-proxy=${PROXY_NAME} \
  --ports=443 \
  --address=${LOAD_BALANCER_IP}
```
Get the IP: `gcloud compute forwarding-rules describe ${FORWARDING_RULE_NAME} --global --format='value(IPAddress)'`.
Common pitfall: skipping the proxy-only subnet blocks load balancer-to-Cloud Run access. Smoke test: `curl https://${LOAD_BALANCER_IP}/vllm/v1/completions -H "Content-Type: application/json" -d '{...}'` (a fuller example follows below).
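A fuller test against vLLM's OpenAI-compatible completions endpoint (the model name must match what your vLLM server registered and is an assumption here; `-k` tolerates the self-signed cert):

```bash
curl -k "https://${LOAD_BALANCER_IP}/vllm/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "google/gemma-3-27b-it",
        "prompt": "Write a one-line greeting.",
        "max_tokens": 64
      }'
```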
Integrating Model Armor: Block Malicious Inputs/Outputs
Attach Model Armor as a load balancer service extension so traffic is scanned before it is routed to any backend.
1. **Create a Model Armor Template:** Define the threats to screen for (prompt injection, jailbreak, PII such as credit card numbers, harassment). Model Armor is configured through templates; a sketch using the current `gcloud model-armor` surface (filter names and levels are illustrative; check `gcloud model-armor templates create --help` for your gcloud version):
```bash
gcloud model-armor templates create ${POLICY_NAME} \
  --location=us-central1 \
  --pi-and-jailbreak-filter-settings-enforcement=enabled \
  --pi-and-jailbreak-filter-settings-confidence-level=MEDIUM_AND_ABOVE \
  --basic-config-filter-enforcement=enabled \
  --rai-settings-filters='[{"filterType": "HARASSMENT", "confidenceLevel": "MEDIUM_AND_ABOVE"}]'
```
Customize the confidence threshold (LOW_AND_ABOVE, MEDIUM_AND_ABOVE, HIGH) and the default response returned for blocked requests; the basic SDP filter covers data leaks like credit card numbers and SSNs.
2. **Service Extension Attachment:** Attach the template to the load balancer as a traffic extension. Service Extensions are declared in YAML (referencing the forwarding rule and the Model Armor service) and imported; file and resource names here are assumptions:
```bash
gcloud service-extensions lb-traffic-extensions import ${EXTENSION_NAME} \
  --source=traffic-extension.yaml \
  --location=us-central1
```
Alternatives if no load balancer is in place:
- **Python SDK:** call the Model Armor client directly, e.g. `client.sanitize_user_prompt(...)` for inputs and `client.sanitize_model_response(...)` for outputs.
- **API in ADK callbacks:** scan input in a pre-model callback and output in a post-model callback (see the sketch below).
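A hedged sketch of the app-level alternative, assuming the `google-cloud-modelarmor` and `google-adk` packages; the project, location, template, and model names are placeholders:

```python
# App-level Model Armor scanning inside an ADK before-model callback.
# All resource names below are placeholders.
from typing import Optional

from google.adk.agents import LlmAgent
from google.adk.agents.callback_context import CallbackContext
from google.adk.models.llm_request import LlmRequest
from google.adk.models.llm_response import LlmResponse
from google.cloud import modelarmor_v1
from google.genai import types

# Model Armor is a regional service; use the matching regional endpoint.
client = modelarmor_v1.ModelArmorClient(
    client_options={"api_endpoint": "modelarmor.us-central1.rep.googleapis.com"}
)
TEMPLATE = "projects/my-project/locations/us-central1/templates/gemma-policy"

def screen_input(callback_context: CallbackContext,
                 llm_request: LlmRequest) -> Optional[LlmResponse]:
    """Scan the outgoing prompt; short-circuit the model call if blocked."""
    prompt_text = " ".join(
        part.text
        for content in llm_request.contents
        for part in (content.parts or [])
        if part.text
    )
    result = client.sanitize_user_prompt(
        request=modelarmor_v1.SanitizeUserPromptRequest(
            name=TEMPLATE,
            user_prompt_data=modelarmor_v1.DataItem(text=prompt_text),
        )
    )
    # MATCH_FOUND means at least one configured filter fired on the prompt.
    if (result.sanitization_result.filter_match_state
            == modelarmor_v1.FilterMatchState.MATCH_FOUND):
        # Returning an LlmResponse from a before-model callback skips the
        # model call and sends this content back to the user instead.
        return LlmResponse(content=types.Content(
            role="model",
            parts=[types.Part(text="Request blocked by security policy.")],
        ))
    return None  # no match: proceed to the model

agent = LlmAgent(
    name="guarded_agent",
    model="gemini-2.0-flash",  # any ADK-supported model; placeholder
    instruction="You are a helpful agent.",
    before_model_callback=screen_input,
)
```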
Quality check: blocked requests show up in Cloud Logging, and metrics count blocks. Before: raw prompts can leak data or jailbreak the model. After: automatic blocking plus a custom safe response.
Observability: Track Costs and Performance
Post-deploy, enable Cloud Trace on Cloud Run for agent metrics.
- Key metrics: GPU utilization, requests/second, request latency, TTFT, output tokens per request.
- Setup: native Cloud Run metrics plus Cloud Trace, with exports to BigQuery/Logging; vLLM also exposes these counters directly (example below).
- Cost principle: tokens * rate + GPU hours = bill; alert on spikes.
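vLLM publishes Prometheus counters and histograms on its /metrics path; a quick check through the load balancer (path and metric names assume a standard vLLM build):

```bash
# TTFT, end-to-end latency, and token counters straight from vLLM.
curl -ks "https://${LOAD_BALANCER_IP}/vllm/metrics" \
  | grep -E 'vllm:(time_to_first_token_seconds|e2e_request_latency_seconds|generation_tokens_total|prompt_tokens_total)'
```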
Exercise: deploy the ADK + LiteLLM agent to Cloud Run, invoke it via the load balancer, and inspect the resulting spans in the Cloud Trace explorer (or via the Trace API) for your project. Battle the boss agent to test.
"Model Armor is detecting for malicious inputs as part of a prompt... and also looking for sensitive data leaks."
"By using this regional external application load balancer, we're going to have one load balancer endpoint and then based off of how you call that particular endpoint... it's going to route traffic."
"You can have it be triggered at the network level... or at the agent lifecycle. So it comes down to how you like to design systems."
"Track things such as time to first token, GPU utilization, request per second, request latency, output tokens per request... all factors into how we control for and monitor performance throughput and costs."
"ADK is actually model agnostic... the trick is we're gonna using ADK with LiteLLM."
Key Takeaways
- Use load balancers for single-endpoint routing + extensions like Model Armor on raw Cloud Run models without app logic.
- Configure Model Armor policies for specific threats (prompt-injection, PII); choose network vs app-level based on security needs.
- Always create NEGs and backend services for Cloud Run in load balancer setups, plus a proxy-only subnet for VPC access.
- Monitor TTFT/GPU/tokens via Cloud Trace to optimize costs—query post-deploy.
- ADK + LiteLLM enables model-agnostic agents; test in dev (Ollama) before prod (vLLM).
- Avoid exposing multiple Cloud Run services directly when you need unified security; front them with a load balancer.
- Self-signed certs suffice for lab HTTPS; production should use managed certificates on custom domains.
- Re-export env vars at each step for lab resilience; script the sequence as IaC for production.