Traffic Management Platform

Clients: Streaming Industry
Focus: 2021
Services: Envoy, NGINX, AWS

Building a Unified Service Mesh at a Major Streaming Company

When I joined the Production Engineering team at a major streaming entertainment company, we faced a common challenge in modern microservices architectures: each team had built their own ingress solution using different technologies like NGINX or AWS API Gateway. While this autonomy promoted innovation, it also led to duplicated efforts, inconsistent implementations of critical features, and increased operational overhead. We needed a unified solution that would abstract away traffic management complexities while providing enterprise-grade reliability.

Creating a Central Traffic Management Platform

We designed and built a centralized traffic management platform based on Envoy proxy to serve as the front door for all incoming traffic to our cloud infrastructure. The system needed to handle millions of HTTP requests per second to support our growing user base of tens of millions of subscribers.

Technical Architecture

We built the platform with several key architectural components:

1. The Edge Layer

At its core, our system runs multiple Envoy proxy instances behind Network Load Balancers (NLBs) for redundancy. These proxies handle everything from TLS termination to advanced traffic management. We deploy them across multiple availability zones to ensure high availability.
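
To make the edge layer concrete, here is a minimal, illustrative Envoy listener that terminates TLS and hands requests to the HTTP connection manager. The certificate paths, cluster names, and ports are placeholders rather than our production values, and the real configuration is considerably larger.

```yaml
# Illustrative edge listener: TLS termination plus HTTP connection manager.
# All names and file paths below are hypothetical.
static_resources:
  listeners:
    - name: edge_https
      address:
        socket_address: { address: 0.0.0.0, port_value: 443 }
      filter_chains:
        - transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificates:
                  - certificate_chain: { filename: /etc/envoy/certs/tls.crt }
                    private_key: { filename: /etc/envoy/certs/tls.key }
          filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: edge
                route_config:
                  name: edge_routes
                  virtual_hosts:
                    - name: default
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          # Placeholder cluster; its definition is omitted here.
                          route: { cluster: backend-service }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```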

2. Configuration Management

One of our most interesting design decisions was adopting a "configuration-as-code" approach: all traffic routing rules are stored as YAML files in Git repositories, enabling versioning, peer review, and automated testing. This meant teams could manage their traffic rules using familiar Git workflows.
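
As a rough illustration of what a team-owned routing file might look like, the sketch below uses Envoy's RouteConfiguration schema directly; the schema we actually exposed to teams was a higher-level abstraction, and the hostnames, cluster names, and timeouts here are invented.

```yaml
# Hypothetical routing file a team might check into its Git repository.
# Hostnames, cluster names, timeouts, and retry settings are illustrative only.
name: playback_api_routes
virtual_hosts:
  - name: playback-api
    domains: ["playback.example.com"]
    routes:
      - match: { prefix: "/v1/playback" }
        route:
          cluster: playback-api-prod
          timeout: 2s
          retry_policy:
            retry_on: "5xx,reset"
            num_retries: 2
```

Because files like this live in Git, they go through the same peer review and automated validation as any other change before being rolled out.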

The configuration system supports sophisticated features, illustrated in the sketch after this list:

  • Path-based and header-based routing
  • Weighted traffic distribution for A/B testing
  • Circuit breaking to prevent cascade failures
  • Rate limiting to protect services
  • Request/response header manipulation
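
The fragments below sketch how a few of these features look in Envoy's route and cluster configuration: header-based routing, a weighted split for A/B testing, request header manipulation, and circuit-breaker thresholds. The service names, weights, and limits are invented for illustration, and rate limiting (handled by Envoy's rate limit filters) is not shown.

```yaml
# Illustrative fragments only; names, weights, and thresholds are made up.

# --- route configuration fragment ---
virtual_hosts:
  - name: search-api
    domains: ["api.example.com"]
    routes:
      # Header-based routing: pin internal test traffic to the canary.
      - match:
          prefix: "/v2/search"
          headers:
            - name: x-canary
              string_match: { exact: "true" }
        route: { cluster: search-canary }
      # Weighted traffic distribution for A/B testing.
      - match: { prefix: "/v2/search" }
        route:
          weighted_clusters:
            clusters:
              - name: search-stable
                weight: 95
              - name: search-canary
                weight: 5
        # Request header manipulation applied on this route.
        request_headers_to_add:
          - header: { key: x-edge-gateway, value: "traffic-platform" }

# --- cluster configuration fragment ---
clusters:
  - name: search-canary
    connect_timeout: 1s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    # Circuit breaking to prevent cascade failures.
    circuit_breakers:
      thresholds:
        - priority: DEFAULT
          max_connections: 1000
          max_pending_requests: 200
          max_requests: 2000
          max_retries: 3
    load_assignment:
      cluster_name: search-canary
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: search-canary.internal, port_value: 8080 }
```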

3. Control Plane Evolution

Our control plane architecture evolved through two major iterations. Initially, we used Ansible to push static configurations, which worked but made deployments slow. Later, we built a more sophisticated system in which Envoy instances pull their configuration from a central service, enabling faster updates and more dynamic behavior.
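
Conceptually, the pull-based model maps onto Envoy's xDS APIs: each proxy is bootstrapped with the address of a control-plane service and subscribes to listener and cluster updates over gRPC. The bootstrap sketch below assumes a hypothetical control-plane endpoint ("xds-control-plane:18000"); it is not our actual configuration.

```yaml
# Bootstrap sketch for pull-based (xDS) configuration.
# "xds-control-plane:18000" is a placeholder for the central config service.
dynamic_resources:
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc: { cluster_name: xds_cluster }
  lds_config: { ads: {} }
  cds_config: { ads: {} }

static_resources:
  clusters:
    - name: xds_cluster
      connect_timeout: 1s
      type: STRICT_DNS
      # The xDS connection is gRPC, so this cluster must speak HTTP/2.
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}
      load_assignment:
        cluster_name: xds_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: xds-control-plane, port_value: 18000 }
```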

Tackling Complex Challenges

1. Graceful Degradation

One of our biggest technical challenges was implementing graceful degradation under high load. We developed a multi-layered approach (see the sketch after this list):

  1. Request prioritization to ensure critical traffic always gets through
  2. Adaptive concurrency limits that automatically adjust based on backend performance
  3. Intelligent load shedding that preserves core functionality during extreme load
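
As one concrete example of the second layer, Envoy ships an adaptive concurrency HTTP filter that samples request latency and adjusts the allowed concurrency toward an ideal value. The fragment below is a generic sketch of that filter as an entry in the HTTP filter chain (placed before the router filter), not our tuned production settings.

```yaml
# Adaptive concurrency sketch; the percentile, intervals, and runtime key
# are illustrative defaults, not tuned production values.
- name: envoy.filters.http.adaptive_concurrency
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.adaptive_concurrency.v3.AdaptiveConcurrency
    gradient_controller_config:
      sample_aggregate_percentile: { value: 90 }
      concurrency_limit_params:
        concurrency_update_interval: 0.1s
      min_rtt_calc_params:
        interval: 60s
        request_count: 50
    enabled:
      default_value: true
      runtime_key: adaptive_concurrency.enabled
```

Request prioritization and load shedding were layered on top through routing and filter policies of their own, which are beyond the scope of this sketch.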

2. Observability

Making the system observable was crucial for operating at scale. We implemented comprehensive monitoring (a configuration sketch follows the list) using:

  • Detailed access logs for debugging
  • Metrics exposition for monitoring systems
  • Distributed tracing to track requests across services
  • Health checking and anomaly detection
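
In Envoy terms, most of these hooks hang off the HTTP connection manager and the upstream clusters. The fragment below sketches a structured access log, a Zipkin-style tracing provider, and an active HTTP health check; the collector cluster, log fields, and health-check path are placeholders, and metrics exposition comes from Envoy's built-in statistics via its admin endpoint.

```yaml
# Observability sketch; collector, log fields, and paths are placeholders.

# On the HTTP connection manager:
access_log:
  - name: envoy.access_loggers.stdout
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
      log_format:
        json_format:
          timestamp: "%START_TIME%"
          method: "%REQ(:METHOD)%"
          path: "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"
          status: "%RESPONSE_CODE%"
          duration_ms: "%DURATION%"
          upstream_host: "%UPSTREAM_HOST%"
tracing:
  provider:
    name: envoy.tracers.zipkin
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v3.ZipkinConfig
      collector_cluster: tracing-collector
      collector_endpoint: "/api/v2/spans"
      collector_endpoint_version: HTTP_JSON

# On an upstream cluster: active health checking.
health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 3
    healthy_threshold: 2
    http_health_check: { path: /healthz }
```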

Key Learnings

  1. Start Simple, Scale Smart: We began with basic routing capabilities and gradually added more sophisticated features based on real needs.
  2. Configuration as Code Works: Treating configuration as code and using Git workflows proved invaluable for managing complex routing rules at scale.
  3. Observability First: Building comprehensive observability into the system from the start made it much easier to diagnose issues and optimize performance.
  4. Embrace Progressive Delivery: Having mechanisms for gradual rollouts and easy rollbacks was crucial for maintaining reliability while evolving the system.

Looking Forward

The platform continues to evolve as we add new capabilities and optimize existing ones. It's become a crucial part of our infrastructure, handling millions of requests per second while providing the reliability and flexibility modern cloud applications need.

For engineers looking to build similar systems, I'd emphasize the importance of starting with a solid foundation of basic routing and gradually adding more sophisticated features as needed. The investment in good observability and configuration management will pay dividends as the system grows in complexity and scale.

This experience taught me that building reliable distributed systems isn't just about technology choices; it's about understanding user needs, planning for failure, and building systems that are both powerful and simple to use. The success of our platform showed that, with the right architecture and approach, we can solve complex problems while making life easier for our fellow engineers.