26 November 2018

Scaling Push Messaging for Millions of Netflix Devices (ARC334)

by mo

Netflix built Zuul Push, a massively scalable push messaging service that handles millions of always-on, persistent connections to proactively push time-sensitive data, like personalized movie recommendations, from the AWS Cloud to devices. This helped reduce Netflix’s notification latency and the Amazon EC2 footprint by eliminating wasteful polling requests. It also powers Netflix’s integration with Amazon Alexa. Zuul push is a high-performance async WebSocket/SSE server. In this session, we cover its design and how it delivers push notifications globally across AWS Regions. Key takeaways include how to scale to large numbers of persistent connections, differences between operating this type of service versus traditional request/response-style stateless services, and how push messaging can be used to add new, exciting features to your application.

imagine: friday night. 135 million versions of each dashboard. one per customer.

Time spent:

  • 2/3 -> picking movie
  • 1/3 -> watch movie


  • pushing out a new recommendation list to each connected device using push messaging instead of polling servers.
  • polling is wasteful and not a great delivery experience
  • server efficiency and UI freshness were in direct contrast
  • push messaging allows us to have server efficiency and ui freshness so both goals are achieved.
  • 2M requests/second
  • background push messages are great for appliations

PUSH: (P)ersist (U)ntil (S)omething (H)appens

  • server intiates push rather than client requesting data.
  • zuul push is a push messaging system
  • works everywhere across all devices not just mobile.
    • works on laptops, game consoles, blue ray devices etc.
  • websockets and sse (server sent events)
  • zuul push infrastructure
    • push servers sit on edge
    • client connect to edge push servers. (persistent connection)
    • push registry: maintains list of pushes
    • zuul push library used by recommendation engine
      • push message queue

This sounds similar to action cable.

These are best delivery effort not guaranteed delivery. If the client is not connected then the message gets dropped. 40M conncurrent connections at it’s peak. Innovation is in the zull push server in how it handles multilple persistent connections. zuul push server uses non-blocking sync io.

why sync io?

Goal: handle C10K concurrent connections on a single box.

  • spawning threads does not scale to meet C10 because it overwhelms CPU and context switches. (easy app logic)
  • async i/o model uses OS calls on a single thread. Do not need thousands of threads. (complicated app logic)

async i/o cannot use local variables to maintain state.

  • must use finite state machine (FSM).
  • they use netty: inbound/output socket API

nodejs also uses async i/o model. single thread and non-blocking.

Authenticate using session cookie, JWT or other token based authentication. They support using redis as the persistent push registry.

push registry checklist:

  • low read latency
  • automatic record expiry (phantom registration records need to be cleaned out. relies on TTL per registration)
  • sharding
  • replication


  • redis
  • cassandra
  • dynamo db

netflix uses: Dynamite (netflix open source project)

  • dynamite on top of redis
  • auto sharding
  • read/write quorum
  • cross region replication

Kafka message queues

  • Best effort delivery is mostly good enough.
  • Few needs guaranteed delivery.
  • Hive has every single push message delivery recorded.
  • Messages are delivered across regions.

Use different queues for different priority. Message of higher priority are meant to be processed before lower priority on a single message queue. Different queues for different priorities handles this.

Mantis (open source project)

Operational lessons of Zuul push cluster

  • Long lived persistent onnections make the server stateful.
  • long lived is great for client efficiency, but they are terrible for quick deploy or roll-back.

How do you upgrade persistent connections? Forcefully make them switch by killing connection then they will all immediately reconnect to new cluster at the same time.

Solution: limit client connection lifetime

Clients coded to reconnect when connection is killed. Load balancers are round-robin so clients are not sticky to one server. Tear down connections periodically. (25 - 35 minutes)

Randomize client connection lifetime to reduce them all expirying at the same time. If a network wide blip happens then everyone would reconnect at the same time and would have the same 30min lifetime without the randomization. A network blip could synchronize all clients to reconnect at the same time. Random helps here. Initial spike after network blip but then dampens out after time.

Instead of having server close connection. Ask client to close connection. Server can send message to client to close the connection.

In TCP, the party that closes the connection can end up in the TIME_WAIT state where the file descriptor cannot be released. By having client close the connection, we conserve file descriptors on the server.

rogue client?

  • timer is started on server. if client doesn’t comply, then forcefully close on server side.

optimize cluster

  • most clients are idle
  • ulimit 262144 ? A couple of Beefy servers.

Server went down and all clients stampeded back. (thundering herd)

goldilocks strategy - too hot, too cold, just right.

M4Large was the choice

Optimize for cost not instance count. More cheaper instances vs fewer bigger instances. Prefer smaller instances. Autoscaling fits better for traffic curve.

What threshold do you use to autoscale push cluster?

  • requests per second? -> not affective measures for persistent connection
  • cpu?


  • of open connections per server is the factor to measure.
  • of average open connections per server. for autoscale

Export Cloudwatch metrics and use that for autoscale policies. Classic load balancers do not work well with websockets.

Expected websocket handshake.

HEADER -> Upgrade -> Response status code 100.

Classic Load Balancers do not treat upgrade requests differently.

Run CLB as TCP load balancers instead of HTTP load balancers. Use layer 4 load balancing instead of layer 7 on CLB. TLS load balancer can still terminate SSL at CLB load balancers. Websockets are vulnerable to XSRF attacks. Server needs to verify origin header.

ALB Application Load Balancers can proxy WebSockets natively.


  • recycle connections periodically. (sticky connections)
  • randomize client connection lifetime. (network blip and thundering herd)
  • more small servers better.
  • autoscale on # of open connections.
  • use ALB or CLB in TCP mode.

TCP load balancing available in haproxy/nginx

alexa -> voice service -> speech recognition aws lambda -> zuul push server -> netflix connected device.