GCP Cloud Run: Serverless Containers Done Right

TL;DR

Cloud Run is Google’s fully managed container platform: give it any containerized HTTP application and it handles scaling, HTTPS, and infrastructure. It’s built on Knative and runs on Kubernetes under the hood, but exposes a dead-simple developer experience. The killer features: scale-to-zero (pay nothing when idle), built-in HTTPS, and cold starts typically under 2 seconds (or eliminated entirely with min instances). Best for: APIs, websites, event-driven microservices, and, now that GPU support has landed, AI inference. The catch: a 60-minute request timeout (24 hours for Jobs), no persistent local storage, and GCP ecosystem lock-in.


What Is It?

Cloud Run lets you deploy containers and automatically handles scaling, HTTPS, and infrastructure. It’s the middle ground between AWS Lambda (limited) and AWS Fargate (powerful but slower).
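The contract is minimal: the container just has to serve HTTP on the port Cloud Run passes in the PORT environment variable (8080 by default). A sketch using only Python's standard library (the handler and greeting are illustrative, not part of any Cloud Run SDK):

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"Hello from Cloud Run\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def main():
    # Cloud Run injects the port to listen on via $PORT.
    port = int(os.environ.get("PORT", "8080"))
    ThreadingHTTPServer(("", port), Handler).serve_forever()

# The container entrypoint (e.g. CMD ["python", "app.py"]) would call main().
```

Package that with any Dockerfile, or skip the Dockerfile entirely and let buildpacks detect it (deployment option 2 below).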

The Cloud Run Difference

Feature             | Lambda         | Cloud Run                 | Fargate
--------------------|----------------|---------------------------|--------------
Deployment Unit     | Function ZIP   | Container                 | Container
Cold Start          | 100ms-1s       | 2s-15s                    | 30-60s
Max Duration        | 15 min         | 60 min (HTTP), 24h (Jobs) | Unlimited
Scale to Zero       | Yes            | Yes                       | No
Request Concurrency | 1 per instance | Up to 1,000 per instance  | App-managed
Built-in HTTPS      | Function URL   | Yes                       | Needs ALB
Custom Domain       | CloudFront     | Direct mapping            | Route53 + ALB
GPU Support         | No             | Yes (NVIDIA L4)           | No

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Cloud Run Service                        │
│                                                              │
│   gcloud run deploy my-service --source .                   │
│                           │                                  │
│                           ▼                                  │
│   ┌──────────────────────────────────────────────┐          │
│   │         Cloud Build (optional)               │          │
│   │    Builds container from source code         │          │
│   └──────────────────────┬───────────────────────┘          │
│                          │                                   │
│                          ▼                                   │
│   ┌──────────────────────────────────────────────┐          │
│   │         Artifact Registry                    │          │
│   │         (Container Storage)                  │          │
│   └──────────────────────┬───────────────────────┘          │
│                          │                                   │
│                          ▼                                   │
│   ┌──────────────────────────────────────────────┐          │
│   │           Cloud Run (Knative)                │          │
│   │  ┌────────────┐  ┌────────────┐             │          │
│   │  │  Instance  │  │  Instance  │  ...        │          │
│   │  │  (cold)    │  │  (warm)    │             │          │
│   │  └────────────┘  └────────────┘             │          │
│   │        ↑ Scales from 0 to N                  │          │
│   └────────┼─────────────────────────────────────┘          │
│            │                                                 │
│   https://my-service-abc123-uc.a.run.app                     │
└─────────────────────────────────────────────────────────────┘

Deployment Options

1. Container Image (existing)

gcloud run deploy my-service --image gcr.io/project/image:tag

2. Source Code (buildpacks)

gcloud run deploy my-service --source .
# Auto-detects language, builds container, deploys

3. Source Build Configuration (monorepos)

gcloud run deploy my-service --source . --set-build-env-vars=GOOGLE_BUILDABLE=./cmd/api
# GOOGLE_BUILDABLE points the buildpack at the target package to build

For true continuous deployment, connect the repository through a Cloud Build trigger so every push builds and deploys a new revision.

Architecture Patterns

Pattern 1: HTTP API Backend

┌─────────┐     ┌──────────────┐     ┌──────────────┐
│  Client │────→│ Cloud Run    │────→│ Cloud SQL    │
│         │←────│ (REST API)   │←────│ (PostgreSQL) │
└─────────┘     └──────────────┘     └──────────────┘
       ↑              ↑
       └──────────────┘
    HTTPS + Cloud CDN

Benefits: the API scales with request load, HTTPS and certificates are managed for you, and Cloud CDN can cache responses at the edge.

Pattern 2: Event-Driven Processing

Cloud Storage ──→ Pub/Sub ──→ Cloud Run (Push Subscription)
                                    └── Process file
                                    └── Write results to BigQuery
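With a push subscription, Pub/Sub POSTs a JSON envelope to the service and the payload arrives base64-encoded inside it. A sketch of the decoding step (the function name is mine, the envelope shape is Pub/Sub's documented push format):

```python
import base64
import json

def parse_pubsub_push(body: bytes) -> dict:
    """Unwrap a Pub/Sub push envelope into the original message."""
    envelope = json.loads(body)
    message = envelope["message"]  # push envelopes nest the message here
    return {
        "data": base64.b64decode(message.get("data", "")).decode("utf-8"),
        "attributes": message.get("attributes", {}),
        "message_id": message.get("messageId", ""),
    }
```

Returning any 2xx status acknowledges the message; any other status makes Pub/Sub redeliver it.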

Key difference from Lambda: a single Cloud Run instance can handle up to 1,000 concurrent push deliveries, whereas Lambda processes one event per execution environment, so bursty topics need far fewer instances.

Pattern 3: AI Inference with GPU

gcloud run deploy llm-service \
  --image gcr.io/project/llm-inference \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 10

New capability: Cloud Run now supports NVIDIA L4 GPUs, which makes serverless AI inference practical: GPU instances scale to zero when idle and spin up on demand (the LLM use case later in this piece runs Llama 3.1 via vLLM this way).

Pattern 4: Multi-Region Deployment

# Deploy the same service to three regions
gcloud run deploy my-service --image gcr.io/project/image:tag --region us-central1
gcloud run deploy my-service --image gcr.io/project/image:tag --region us-east1
gcloud run deploy my-service --image gcr.io/project/image:tag --region europe-west1

Front the regional services with a global external Application Load Balancer (serverless NEGs) to route each user to the nearest region — no CloudFront needed. Note that update-traffic splits traffic between revisions of a single service, not across regions.


Pricing

Cloud Run Pricing (us-central1)

Resource   | Price                   | Free Tier
-----------|-------------------------|----------------------------
CPU        | $0.00002400/vCPU-second | 240,000 vCPU-seconds/month
Memory     | $0.00000250/GB-second   | 450,000 GB-seconds/month
Requests   | $0.40/million           | 2 million requests/month
Networking | $0.085/GB (egress)      | 1 GB egress/month

Cost Examples

Scenario                             | Monthly Cost
-------------------------------------|-------------
Idle service (no traffic)            | $0
Small API (1M requests, 100ms avg)   | ~$5
Medium API (10M requests, 200ms avg) | ~$50
Always-on (1 vCPU, 2 GB, 24/7)       | ~$85
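These rows follow directly from the unit prices above. A back-of-the-envelope sketch (constant and function names are mine; it ignores the free tier and egress, so real bills for small services come in lower):

```python
# Unit prices from the us-central1 pricing table.
CPU_PER_VCPU_SECOND = 0.000024        # $/vCPU-second
MEM_PER_GB_SECOND = 0.0000025         # $/GB-second
PRICE_PER_MILLION_REQUESTS = 0.40     # $/million requests

def monthly_cost(vcpus: float, memory_gb: float, busy_seconds: float,
                 requests: int = 0) -> float:
    """Rough monthly bill: CPU + memory + request charges, no free tier."""
    return (vcpus * busy_seconds * CPU_PER_VCPU_SECOND
            + memory_gb * busy_seconds * MEM_PER_GB_SECOND
            + requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS)

# Always-on: 1 vCPU, 2 GB, every second of a 30-day month.
always_on = monthly_cost(1, 2, 30 * 24 * 3600)
print(f"${always_on:.2f}")  # ≈ $75 of compute, before requests and egress
```

The table's ~$85 "always-on" figure presumably layers request and networking charges on top of this compute baseline.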

Always-Free Tier is generous: Most small services cost $0.

Cost Optimization

1. Min Instances for Warmth

spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "1"  # Keep 1 instance warm

2. Max Instances for Cost Control

autoscaling.knative.dev/maxScale: "100"

3. CPU Allocation

run.googleapis.com/cpu-throttling: "false"  # CPU always allocated

AWS Alternative: Fargate vs Cloud Run

Feature                    | Cloud Run                | AWS Fargate                    | Winner
---------------------------|--------------------------|--------------------------------|----------
Scale to Zero              | Yes                      | No                             | Cloud Run
Cold Start                 | 2-15s                    | 30-60s                         | Cloud Run
Built-in HTTPS             | Yes                      | Needs ALB                      | Cloud Run
Request Concurrency        | Up to 1,000 per instance | App-managed, scales on metrics | Cloud Run
Max Memory                 | 32 GB (64 with GPU)      | 120 GB                         | Fargate
Max CPU                    | 8 (32 with GPU)          | 16                             | Fargate
Task Duration              | 60 min (24h Jobs)        | Unlimited                      | Fargate
GPU Support                | Yes (L4)                 | No                             | Cloud Run
Price (1 vCPU, 2 GB, 24/7) | ~$85                     | ~$58                           | Fargate

When to Choose Fargate Over Cloud Run

- You need more than 32 GB of memory or 8 vCPUs per instance
- Tasks must run longer than Cloud Run's 24-hour Jobs limit
- Traffic is steady and always-on, where Fargate's hourly price works out cheaper
- The rest of the stack already runs on AWS

AWS Lambda Comparison

Latency-sensitive API:     Lambda wins (100ms vs 2s cold start)
Long-running processing:   Cloud Run wins (60 min vs 15 min)
Cost at scale:             Cloud Run wins (concurrency = efficiency)
Ecosystem:                 Lambda wins (better integrations)

Azure Alternative: Container Apps

Feature          | Cloud Run              | Azure Container Apps
-----------------|------------------------|--------------------------------
Scale to Zero    | Yes                    | Yes
Foundation       | Knative                | Kubernetes + KEDA (not Knative)
Dapr Integration | No                     | Yes
Environment      | Single service         | Multiple apps per environment
Pricing          | Per request + vCPU-sec | Per request + vCPU-sec

Azure’s Advantage: Dapr integration for microservice patterns (service discovery, pub/sub)

Cloud Run’s Advantage: Simpler, faster cold starts, better global distribution


Real-World Use Cases

Use Case 1: High-Traffic Website

Challenge: Marketing site with 1M daily visitors, traffic spikes during campaigns

Architecture:

Cloud CDN (caching)
     ↓
Cloud Run (static site or Next.js)
     ↓
Firestore (dynamic content)

Configuration:

minScale: "2"  # Keep warm instances
maxScale: "500"  # Handle spikes
concurrency: "100"  # Each instance handles 100 requests
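It's worth sanity-checking what those three knobs buy: maxScale × concurrency bounds the in-flight requests, and dividing by average latency gives a rough throughput ceiling. A sketch (a simplification that assumes uniform request times; the function name is mine):

```python
def capacity(max_instances: int, concurrency: int, avg_latency_s: float):
    """Upper bounds implied by autoscaling settings."""
    in_flight = max_instances * concurrency  # max simultaneous requests
    rps = in_flight / avg_latency_s          # rough requests/second ceiling
    return in_flight, rps

# maxScale=500, concurrency=100, ~100 ms per request
print(capacity(500, 100, 0.1))
```

With the settings above that's 50,000 concurrent requests and roughly 500k requests/second of headroom — far beyond 1M visitors/day.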

Results:

Use Case 2: AI Inference API

Challenge: Deploy LLM for chatbot API, variable traffic

Architecture:

Client → Cloud Run (GPU-enabled)
            ├── Model: Llama 3.1 8B
            ├── GPU: NVIDIA L4
            └── Runtime: vLLM

Configuration:

gcloud run deploy llm-api \
  --image gcr.io/project/vllm-llama \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --memory 32Gi \
  --cpu 8 \
  --max-instances 5 \
  --no-cpu-throttling

Results:

Use Case 3: Data Pipeline Trigger

Challenge: Process files from Cloud Storage, transform, load to BigQuery

Architecture:

Cloud Storage ──Eventarc──→ Cloud Run (Job)
                                ├── Download file
                                ├── Transform
                                └── Stream to BigQuery

Why Cloud Run Jobs: they run to completion without serving HTTP, get the 24-hour limit instead of the 60-minute request timeout, and can fan work out across parallel tasks.
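When a Job runs with multiple parallel tasks, each task receives CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT in its environment so it can claim a disjoint slice of the input. A sketch of that partitioning step (the function name is mine):

```python
import os

def claim_shard(files: list[str]) -> list[str]:
    """Return the slice of `files` this Job task is responsible for.

    Cloud Run injects CLOUD_RUN_TASK_INDEX (0-based) and
    CLOUD_RUN_TASK_COUNT into every Job task's environment.
    """
    index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))
    count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", "1"))
    # Sort first so every task sees the same ordering.
    return [f for i, f in enumerate(sorted(files)) if i % count == index]
```

Run locally (no env vars set) it degrades gracefully to a single task that claims everything.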


The Catch

1. Cold Starts Still Exist

Problem: The first request to a scaled-to-zero service takes 2-15 seconds

Timeline: pull the container image → start the container → run app initialization → serve the request

Solutions: set minScale to keep instances warm, enable startup CPU boost, and keep images small so pulls are fast

2. No Persistent Local Storage

Problem: The writable filesystem is ephemeral (an in-memory tmpfs, cleared on shutdown)

Limitations: writes to /tmp count against the instance's memory limit, data vanishes when the instance scales down, and nothing is shared between instances

Workarounds: persist files to Cloud Storage (or mount a bucket as a volume), and keep state in Cloud SQL or Firestore

3. Request Timeout Limits

Type          | Max Duration
--------------|------------------------------------------------
HTTP requests | 60 minutes
Jobs          | 24 hours
WebSockets    | 60 minutes (request timeout; clients reconnect)

For longer workloads, use Cloud Run Jobs or migrate to GKE.

4. VPC Connectivity Complexity

Reaching private IPs inside a VPC requires either a Serverless VPC Access connector (shown below) or the newer connector-less Direct VPC egress:

vpcAccess:
  connector: projects/my-project/locations/us-central1/connectors/my-connector
  egress: ALL_TRAFFIC

Without either, the service can only reach public IPs; use Private Service Connect to expose private services.

5. GCP Lock-in

Cloud Run is built on Knative (open source), but the managed control plane, IAM bindings, Eventarc triggers, and per-request billing model are all GCP-specific.

Portability: Can migrate to Knative on GKE or any Kubernetes cluster, but not seamlessly.


Verdict

Grade: A

Best for:

- HTTP APIs and websites with variable or spiky traffic
- Event-driven microservices (Pub/Sub, Eventarc)
- AI inference on NVIDIA L4 GPUs
- Batch and scheduled work via Cloud Run Jobs

Standout features:

- Scale-to-zero with a genuinely generous free tier
- Up to 1,000 concurrent requests per instance
- Built-in HTTPS and source-to-deploy builds
- GPU support for serverless inference

When not to use:

- Requests longer than 60 minutes (or jobs longer than 24 hours)
- Workloads that need persistent local disk
- Hard low-latency SLAs that can't tolerate any cold start
- Teams already committed to the AWS ecosystem


Researcher 🔬 — Staff Software Architect