Canary Deployment
Gradually route traffic to a new version while monitoring for issues.
Canary deployment routes a small percentage (1-10%) of traffic to the new version while the majority continues on the stable version. You gradually increase traffic to the canary if metrics are healthy, eventually replacing the old version entirely.
How It Works
A canary is deployed alongside the stable production pods. Traffic splitting is managed by a service mesh (Istio, Linkerd) or an ingress controller (NGINX, Traefik). Automated or manual gates check error rates, latency, and business metrics at each stage. If any threshold is breached, the canary's traffic weight is set back to zero so that all requests return to the stable version.
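The gate described above can be sketched as a small shell function. This is only an illustration, not a feature of Istio or any other mesh: the name `canary_gate`, its arguments, and the default thresholds (1% error ratio, 500 ms p99 latency) are assumptions chosen for the example.

```shell
# Hypothetical gate: compare observed canary metrics against thresholds.
# Usage: canary_gate <error_ratio> <p99_latency_ms>
# The defaults below are illustrative and can be overridden via environment.
MAX_ERROR_RATIO="${MAX_ERROR_RATIO:-0.01}"   # assumption: 1% of requests
MAX_P99_MS="${MAX_P99_MS:-500}"              # assumption: 500 ms

canary_gate() {
  # awk performs the floating-point comparison that shell arithmetic cannot.
  awk -v e="$1" -v l="$2" -v me="$MAX_ERROR_RATIO" -v ml="$MAX_P99_MS" \
    'BEGIN {
       if (e + 0 > me + 0 || l + 0 > ml + 0) print "rollback"
       else print "promote"
     }'
}
```

With the default thresholds, `canary_gate 0.002 310` prints `promote`, while `canary_gate 0.05 310` prints `rollback`. In a real pipeline the two inputs would come from the monitoring system rather than being typed by hand.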
When to Use
- High-traffic services where issues affect millions of users
- You have observability (metrics, tracing, logging) in place
- Changes could have subtle, hard-to-test impacts
- You want a gradual rollout that gives you time to catch edge cases
When NOT to Use
- Low-traffic services (not enough signal for metrics)
- No monitoring or alerting infrastructure
- You need a fast iteration cycle with many deploys per day (staged rollouts with observation windows slow each release)
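Whether a service produces enough signal can be estimated with simple arithmetic before choosing a canary percentage. The sketch below assumes a steady request rate; the helper name `canary_requests` and the sample numbers are hypothetical.

```shell
# Hypothetical helper: how many requests will the canary receive?
# Usage: canary_requests <total_rps> <canary_percent> <window_minutes>
canary_requests() {
  # requests per second * 60 seconds * minutes * percent / 100 (integer math)
  echo $(( $1 * 60 * $3 * $2 / 100 ))
}
```

For example, `canary_requests 2 5 30` prints `180`. At a 0.1% baseline error rate that is less than one expected error in the window, too few to tell a regression from noise. By contrast, `canary_requests 200 5 30` prints `18000`.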
Real-World Examples
Facebook - News Feed Ranking
Facebook rolls out News Feed ranking changes to 1% of users first, monitors engagement metrics for 36 hours, then gradually expands. A single ranking change can affect billions of views per day, making canary deployment essential.
Netflix - Recommendation Engine
Netflix validates recommendation algorithm updates with a 2% canary. They monitor click-through rate, play rate, and member satisfaction score. The canary runs for 24-48 hours before being promoted or rolled back.
Step-by-Step Implementation (Istio)
1. Deploy canary alongside stable
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
      version: canary
  template:
    metadata:
      labels:
        app: myapp
        version: canary
    spec:
      containers:
        - name: myapp
          image: myregistry/myapp:2.0.0-canary

2. Configure traffic split with Istio VirtualService
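The `stable` and `canary` subsets referenced by the VirtualService in this step have to be defined in a DestinationRule, otherwise Istio cannot resolve them. The source does not show one, so the following is a sketch under assumptions: the resource name `myapp-dr` is hypothetical, and the stable pods are assumed to carry the label `version: stable` (the `version: canary` label matches the Deployment from step 1). The function only prints the manifest.

```shell
# Prints a DestinationRule defining the two subsets used for the traffic split.
# Assumptions: resource name "myapp-dr"; stable pods are labelled version=stable.
render_destination_rule() {
  cat <<'EOF'
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp-dr
spec:
  host: myapp
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
EOF
}

# Usage (requires cluster access):
#   render_destination_rule | kubectl apply -f -
```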
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-vs
spec:
  hosts:
    - myapp
  http:
    - route:
        - destination:
            host: myapp
            subset: stable
          weight: 95
        - destination:
            host: myapp
            subset: canary
          weight: 5  # 5% canary traffic

3. Gradually increase canary weight
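Instead of keeping one manifest per stage (such as the `virtualservice-25-percent.yaml` applied in this step), the VirtualService can be generated from the desired canary weight. A minimal sketch, assuming the same resource names as in step 2; `render_virtualservice` is a hypothetical helper, not an Istio or kubectl command.

```shell
# Prints the myapp-vs VirtualService for a given canary weight (0-100).
# The stable weight is derived so that both weights always sum to 100.
render_virtualservice() {
  local canary_weight="$1"
  # Reject empty or non-numeric input before doing arithmetic on it.
  case "$canary_weight" in
    ''|*[!0-9]*) echo "weight must be an integer 0-100" >&2; return 1 ;;
  esac
  if [ "$canary_weight" -gt 100 ]; then
    echo "weight must be an integer 0-100" >&2
    return 1
  fi
  local stable_weight=$((100 - canary_weight))
  cat <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp-vs
spec:
  hosts:
    - myapp
  http:
    - route:
        - destination:
            host: myapp
            subset: stable
          weight: ${stable_weight}
        - destination:
            host: myapp
            subset: canary
          weight: ${canary_weight}
EOF
}

# Usage (requires cluster access):
#   render_virtualservice 25 | kubectl apply -f -
```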
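# Check the canary's 5xx error ratio over the last 5 minutes
# (promtool expects the Prometheus server URL before the query expression)
kubectl exec $(kubectl get pod -l app=prometheus -o name | head -n 1) -- \
  promtool query instant http://localhost:9090 \
  'sum(rate(http_requests_total{version="canary",code=~"5.."}[5m])) / sum(rate(http_requests_total{version="canary"}[5m]))'
# If healthy, increase to 25%
kubectl apply -f virtualservice-25-percent.yaml
# Continue: 50% → 75% → 100%

Common Pitfalls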
| Pitfall | Symptom | Fix |
|---|---|---|
| Insufficient traffic to canary | Metrics are statistically insignificant | Ensure canary gets enough traffic; increase percentage or wait longer |
| No automated rollback | Bad canary runs too long, affecting users | Use Flagger or Argo Rollouts for automated canary analysis |
| Sticky sessions bypass canary | Some users never hit canary | Configure session affinity at the canary routing level |
| Monitoring wrong metrics | Canary promoted despite degraded UX | Monitor business metrics (conversion, engagement) not just error rates |
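The "No automated rollback" pitfall is best handled by the tools named in the table. For illustration only, the control loop they automate looks roughly like the sketch below. `set_weight` and `check_health` are hypothetical placeholders: in practice the first would apply the VirtualService with a new weight and the second would query the monitoring system.

```shell
# Placeholders to be replaced with real commands (assumptions for this sketch).
set_weight()   { echo "canary weight -> $1%"; }
check_health() { return 0; }   # placeholder: always reports healthy

# Step through the rollout stages; roll back on the first failed health check.
run_canary() {
  local weight
  for weight in 5 25 50 75 100; do
    set_weight "$weight"
    # In practice, wait here long enough to collect meaningful metrics.
    if ! check_health "$weight"; then
      set_weight 0   # send all traffic back to the stable version
      echo "rolled back at ${weight}%"
      return 1
    fi
  done
  echo "promoted"
}
```

Keeping the two hooks separate from the loop means the rollout logic can be exercised without a cluster, for example by redefining `check_health` to fail at a chosen stage.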