Scaling AI Agents and Inference on Google Cloud Run

The Evolution of Cloud Run for AI Workloads

Cloud Run is shifting its focus from simple web service hosting to a robust runtime for modern AI-driven development. The platform now supports a broader range of workloads, including AI agents, long-running background tasks, and high-performance inference, while maintaining its core serverless value proposition: zero infrastructure management and pay-per-use pricing.

AI Agents and Secure Sandboxing

To support AI agents, Google has introduced several primitives designed for agentic workflows:

Cloud Run Sandboxes: Provide ephemeral, isolated, micro-VM-based environments for executing untrusted code (e.g., scripts or agent-generated code) without putting the host application at risk.
Cloud Run Instances: A new primitive that allows direct access to individual instances, supporting long-running background agents that do not require standard request-based scaling.
MCP Server Integration: A fully managed Model Context Protocol (MCP) server for Cloud Run, enabling agents to deploy and manage Cloud Run apps directly via standard protocols.
SSH Support: Enables developers to SSH into containers for advanced troubleshooting and inspection, a highly requested feature for complex production environments.

High-Performance Inference and Fine-Tuning

Cloud Run is expanding its hardware capabilities to handle more demanding AI tasks:

Blackwell GPUs: Introduction of NVIDIA RTX PRO 6000 Blackwell GPUs, optimized for AI inference and fine-tuning.
Ephemeral Disks: Allow large-scale file manipulation by moving storage off the in-memory file system onto dedicated ephemeral disks.
Job-based Fine-Tuning: Cloud Run jobs now support GPUs and delayed execution, allowing users to run fine-tuning tasks that scale to zero upon completion.

Scalability and Networking

For enterprise-grade applications, the platform is introducing more granular control:

Custom Scaling Controls: Users can now define strict minimums and maximums for instances to balance cost predictability with traffic surge handling.
Worker Pools: Now generally available, these always-on instances are designed for continuous background tasks, such as pull-based workloads (e.g., Temporal workers).
CREMA (Cloud Run External Metrics Autoscaling): Powered by KEDA, allows scaling based on external events (like message queue backlogs) without requiring a Kubernetes cluster.
Service Bindings: Currently in private preview, this feature simplifies service-to-service communication by automatically injecting JWTs for authentication, allowing internal services to connect via simple, context-aware short names.
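
To make the interaction between backlog-based scaling and strict instance bounds concrete, here is a minimal sketch of the arithmetic involved. It assumes a simple target-per-instance rule of the kind KEDA-style scalers commonly use (desired count is the backlog divided by a per-instance target, rounded up), then clamps the result to the configured minimum and maximum. The function name and parameters are hypothetical and are not part of any Cloud Run or CREMA API.

```python
import math


def desired_instances(
    backlog: int,
    target_per_instance: int,
    min_instances: int,
    max_instances: int,
) -> int:
    """Return an instance count for a queue backlog, clamped to [min, max].

    Illustrative only: the real autoscaler's algorithm is not described in
    the source, so this is an assumed target-per-instance rule.
    """
    if target_per_instance <= 0:
        raise ValueError("target_per_instance must be positive")
    if min_instances > max_instances:
        raise ValueError("min_instances must not exceed max_instances")

    # Instances needed so that each one handles at most the target backlog.
    unclamped = math.ceil(backlog / target_per_instance)

    # The minimum keeps capacity warm; the maximum bounds cost during surges.
    return max(min_instances, min(max_instances, unclamped))


if __name__ == "__main__":
    for backlog in (0, 450, 10_000):
        print(backlog, desired_instances(backlog, 100, 1, 20))
```

With a minimum of 1 and a maximum of 20, an empty queue still keeps one instance running, while a very large backlog is capped at 20 instances: this is the trade-off between cost predictability and surge handling described above.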

Key Takeaways

Budget Predictability: New spend caps allow you to automatically pause resources if costs exceed a defined monthly limit.
Dev-to-Prod Loop: The dev-sync command enables live-updating code on a remote Cloud Run instance, allowing developers to iterate in the cloud without losing application state.
Agent Governance: Flagging Cloud Run services as agents allows them to register automatically with the Gemini Enterprise Agent platform for centralized governance.
Production-Ready: Replit’s experience demonstrates that Cloud Run can handle massive scale (over 1.2 million active deployments) while maintaining high reliability.
Simplified Networking: Service bindings remove the need for manual auth logic and complex networking configurations for internal microservices.
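
For context, the manual auth logic that service bindings are meant to remove usually looks something like the sketch below: the calling service fetches an identity token from the instance metadata server and attaches it as a bearer token. The metadata endpoint and the Metadata-Flavor header are the standard Google Cloud mechanism; the helper names and the example URLs are hypothetical, and the network call only works when running on Google Cloud.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Standard metadata server endpoint for identity tokens on Google Cloud.
METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)


def build_identity_token_request(audience: str) -> Request:
    """Build the metadata server request for a token scoped to `audience`."""
    query = urlencode({"audience": audience})
    return Request(
        f"{METADATA_IDENTITY_URL}?{query}",
        headers={"Metadata-Flavor": "Google"},
    )


def call_internal_service(url: str, audience: str) -> bytes:
    """Call another service with a bearer token (hypothetical helper).

    Only works where the metadata server is reachable, e.g. on Cloud Run.
    """
    with urlopen(build_identity_token_request(audience), timeout=5) as resp:
        token = resp.read().decode("utf-8")

    request = Request(url, headers={"Authorization": f"Bearer {token}"})
    with urlopen(request, timeout=10) as resp:
        return resp.read()
```

Per the description above, a service binding would inject the JWT automatically, so the caller would address the target by its short name and none of this token handling would live in application code.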

Notable Quotes

"Cloud Run gives you on-demand computers. You can run anything on Google's world-class infrastructure with zero overhead."
"The barrier to entry has collapsed... once you are done coding, you need to deploy your application to the Cloud and you want to do that as easily as possible."
"Cloud Run actually captures both ends of that really well—it's cost-efficient... but then when you need it, the scaling is there."
"Cloud Run sandboxes allow you to do secure on-the-fly code execution from within your Cloud resources."
