Canary Deployment

Gradually route traffic to a new version while monitoring for issues.

Canary deployment routes a small percentage (1-10%) of traffic to the new version while the majority continues on the stable version. You gradually increase traffic to the canary if metrics are healthy, eventually replacing the old version entirely.

How It Works

A canary is deployed alongside the stable production pods. Traffic splitting is managed by a service mesh (Istio, Linkerd) or an ingress controller (NGINX, Traefik). Automated or manual gates check error rates, latency, and business metrics at each stage. If any threshold is breached, all traffic is shifted back to the stable version; the change takes effect as soon as the new routing rule propagates, typically within seconds.

Tip: The name comes from "canary in a coal mine": if the canary (small deployment) dies, you know not to send more traffic. It's one of the safest ways to validate changes with real production traffic.

When to Use

  • High-traffic services where issues affect millions of users
  • You have observability (metrics, tracing, logging) in place
  • Changes could have subtle, hard-to-test impacts
  • Gradual rollout provides time to catch edge cases

When NOT to Use

  • Low-traffic services (not enough signal for metrics)
  • No monitoring or alerting infrastructure
  • You need a fast iteration cycle with many deploys per day (each canary stage needs time to collect metrics)

Real-World Examples

Facebook - News Feed Ranking

Facebook rolls out News Feed ranking changes to 1% of users first, monitors engagement metrics for 36 hours, then gradually expands. A single ranking change can affect billions of views per day, making canary deployment essential.

Netflix - Recommendation Engine

Netflix validates recommendation algorithm updates with a 2% canary. They monitor click-through rate, play rate, and member satisfaction score. The canary runs for 24-48 hours before being promoted or rolled back.

Step-by-Step Implementation (Istio)

1. Deploy canary alongside stable

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
      version: canary
  template:
    metadata:
      labels:
        app: myapp
        version: canary
    spec:
      containers:
      - name: myapp
        image: myregistry/myapp:2.0.0-canary
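For the mesh to split traffic, both versions have to sit behind the same Service. The following is a minimal sketch under assumptions the original manifests do not confirm: the stable pods also carry the label `app: myapp`, and the container listens on port 8080.

```yaml
# Hypothetical Service shared by the stable and canary pods.
# It selects on `app` only, so pods of both versions become endpoints;
# the `version` label is left for the mesh to tell them apart.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
  ports:
  - name: http        # naming the port "http" tells Istio to treat it as HTTP
    port: 80
    targetPort: 8080  # assumed container port
```

If the Service selector also included `version`, only one of the two Deployments would receive traffic, whatever weights are configured.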

2. Configure traffic split with Istio VirtualService

yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-vs
spec:
  hosts:
  - myapp
  http:
  - route:
    - destination:
        host: myapp
        subset: stable
      weight: 95
    - destination:
        host: myapp
        subset: canary
      weight: 5     # 5% canary traffic
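The VirtualService above routes to subsets named `stable` and `canary`, and Istio resolves subset names through a DestinationRule for the same host. A minimal sketch, assuming the stable pods are labeled `version: stable` (the canary label matches the Deployment in step 1; the resource name `myapp-dr` is hypothetical):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp-dr          # hypothetical name
spec:
  host: myapp
  subsets:
  - name: stable
    labels:
      version: stable     # assumed label on the stable pods
  - name: canary
    labels:
      version: canary     # matches the canary Deployment's pod label
```

Apply the DestinationRule before the VirtualService. Requests routed to a subset that is not defined will fail, typically with 503 responses.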

3. Gradually increase canary weight

bash
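# Monitor the canary's 5xx error ratio over the last 5 minutes.
# promtool needs the Prometheus server URL as its first argument;
# http://localhost:9090 assumes the default port inside the pod.
kubectl exec "$(kubectl get pod -l app=prometheus -o name | head -n 1)" -- \
  promtool query instant http://localhost:9090 \
  'sum(rate(http_requests_total{version="canary",code=~"5.."}[5m]))
   / sum(rate(http_requests_total{version="canary"}[5m]))'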

# If healthy, increase to 25%
kubectl apply -f virtualservice-25-percent.yaml

# Continue: 50% → 75% → 100%
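The file `virtualservice-25-percent.yaml` is not shown here. One plausible version, assuming it is the same VirtualService from step 2 with only the weights changed:

```yaml
# Assumed contents of virtualservice-25-percent.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-vs          # same name, so `kubectl apply` updates it in place
spec:
  hosts:
  - myapp
  http:
  - route:
    - destination:
        host: myapp
        subset: stable
      weight: 75
    - destination:
        host: myapp
        subset: canary
      weight: 25          # weights across routes should sum to 100
```

Rolling back at any stage is the same operation in reverse: re-apply a manifest with the stable weight at 100 and the canary weight at 0.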

Common Pitfalls

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Insufficient traffic to canary | Metrics are statistically insignificant | Ensure canary gets enough traffic; increase percentage or wait longer |
| No automated rollback | Bad canary runs too long, affecting users | Use Flagger or Argo Rollouts for automated canary analysis |
| Sticky sessions bypass canary | Some users never hit canary | Configure session affinity at the canary routing level |
| Monitoring wrong metrics | Canary promoted despite degraded UX | Monitor business metrics (conversion, engagement) not just error rates |
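
One way to address the "No automated rollback" pitfall is a Flagger `Canary` resource, which drives both the weight changes and the metric checks. The sketch below is a starting point under several assumptions, not a drop-in config. It assumes Flagger is installed with the Istio provider and that the workload is a single Deployment named `myapp`. Flagger creates and manages the primary and canary copies itself, which differs from the hand-made `myapp-canary` Deployment in the steps above. The ports and thresholds are illustrative.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp                    # assumed Deployment name
  service:
    port: 80                       # assumed service port
    targetPort: 8080               # assumed container port
  analysis:
    interval: 1m                   # how often the checks run
    threshold: 5                   # failed checks allowed before rollback
    stepWeight: 5                  # canary weight added per successful step
    maxWeight: 50                  # weight at which the canary is promoted
    metrics:
    - name: request-success-rate   # Flagger built-in metric
      thresholdRange:
        min: 99                    # percent of non-5xx responses
      interval: 1m
    - name: request-duration       # Flagger built-in metric, in milliseconds
      thresholdRange:
        max: 500
      interval: 1m
```

The two built-in metrics cover only error rate and latency. The business metrics from the last table row would need custom metric definitions on top of this.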