When I joined the Production Engineering team at a major streaming entertainment company, we faced a common challenge in modern microservices architectures: each team had built their own ingress solution using different technologies like NGINX or AWS API Gateway. While this autonomy promoted innovation, it also led to duplicated efforts, inconsistent implementations of critical features, and increased operational overhead. We needed a unified solution that would abstract away traffic management complexities while providing enterprise-grade reliability.
We designed and built a centralized traffic management platform based on Envoy proxy to serve as the front door for all incoming traffic to our cloud infrastructure. The system needed to handle millions of HTTP requests per second to support our growing user base of tens of millions of subscribers.
We built the platform around several key architectural components.
At its core, our system runs multiple Envoy proxy instances behind Network Load Balancers (NLBs) for redundancy. These proxies handle everything from TLS termination to advanced traffic management features. We deployed them across different availability zones to ensure high availability.
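To make this concrete, here is a minimal sketch of what a single Envoy listener in such a setup could look like: it terminates TLS and forwards matching requests to one upstream cluster. The hostnames, ports, certificate paths, and cluster names are illustrative assumptions, not our actual production configuration.

```yaml
# Minimal illustrative Envoy (v3 API) config: terminate TLS, route to one backend.
# All names, ports, and paths are hypothetical.
static_resources:
  listeners:
  - name: https_ingress
    address:
      socket_address: { address: 0.0.0.0, port_value: 443 }
    filter_chains:
    - transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
          common_tls_context:
            tls_certificates:
            - certificate_chain: { filename: /etc/envoy/certs/tls.crt }
              private_key: { filename: /etc/envoy/certs/tls.key }
      filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_https
          route_config:
            name: local_route
            virtual_hosts:
            - name: api
              domains: ["api.example.com"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: api_backend }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: api_backend
    type: STRICT_DNS
    connect_timeout: 1s
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: api_backend
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: api-backend.internal, port_value: 8080 }
```

A real deployment carries many virtual hosts and clusters per proxy, but the overall shape stays the same.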
One of our most interesting design decisions was adopting a "configuration-as-code" approach. All traffic routing rules live as YAML files in Git repositories, which gives us versioning, peer review, and automated testing. It also means teams can manage their traffic rules through familiar Git workflows.
On top of that foundation, the configuration system supports a range of sophisticated routing features beyond simple host-to-backend mapping.
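The exact schema we used is internal, but to make the idea concrete, a team-owned rule file might look something like the hypothetical sketch below; every service name, field, and value here is invented for illustration and is not the platform's actual format.

```yaml
# Hypothetical team-owned routing rule file (illustrative schema only).
service: playback-api
routes:
  - match:
      path_prefix: /v1/playback
    destinations:
      - cluster: playback-api-prod      # stable fleet gets most traffic
        weight: 95
      - cluster: playback-api-canary    # small canary split
        weight: 5
    retry_policy:
      attempts: 3
      per_try_timeout: 250ms
    timeout: 2s
```

Because files like this live in Git, a weight change or a new route goes through the same review and CI process as any other code change before it reaches the proxies.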
Our control plane architecture evolved through two major iterations. Initially, we used Ansible to push static configurations, which worked but made deployments slow. Later, we developed a more sophisticated system in which Envoy instances pull their configuration from a central service, enabling faster updates and more dynamic behavior.
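The pull mechanism isn't spelled out above, but Envoy's xDS APIs are the standard way to have proxies fetch listeners, routes, and clusters from a management server over gRPC. The bootstrap fragment below is a sketch under that assumption; the node identifiers and control-plane address are placeholders.

```yaml
# Illustrative Envoy bootstrap: fetch listeners and clusters dynamically over xDS (gRPC).
# Node IDs and the control-plane address are hypothetical placeholders.
node:
  id: edge-proxy-01
  cluster: edge-ingress
dynamic_resources:
  lds_config:
    resource_api_version: V3
    api_config_source:
      api_type: GRPC
      transport_api_version: V3
      grpc_services:
      - envoy_grpc: { cluster_name: xds_control_plane }
  cds_config:
    resource_api_version: V3
    api_config_source:
      api_type: GRPC
      transport_api_version: V3
      grpc_services:
      - envoy_grpc: { cluster_name: xds_control_plane }
static_resources:
  clusters:
  - name: xds_control_plane
    type: STRICT_DNS
    connect_timeout: 1s
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}   # xDS requires HTTP/2 for gRPC
    load_assignment:
      cluster_name: xds_control_plane
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: xds.internal.example, port_value: 18000 }
```

With this shape, a routing change merged to Git only needs to reach the management server; the proxies pick it up without a redeploy.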
One of our biggest technical challenges was implementing graceful degradation under high load, which we tackled with a multi-layered approach rather than relying on any single safeguard.
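The individual layers aren't enumerated here, but as one illustration of what a single protective layer can look like, Envoy's cluster-level circuit breakers cap in-flight work so that excess load fails fast instead of piling up in queues. The thresholds and names below are arbitrary examples, not the values we ran in production.

```yaml
# Illustration of one possible protective layer: cluster circuit breakers.
# Thresholds are arbitrary examples.
static_resources:
  clusters:
  - name: api_backend
    type: STRICT_DNS
    connect_timeout: 1s
    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 10000        # cap concurrent upstream connections
        max_pending_requests: 1000    # cap requests queued for a connection
        max_requests: 10000           # cap concurrent requests to the cluster
        max_retries: 3                # cap concurrent retries
    load_assignment:
      cluster_name: api_backend
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: api-backend.internal, port_value: 8080 }
```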
Making the system observable was crucial for operating at scale, so we invested heavily in comprehensive monitoring.
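Without going into our specific monitoring stack, Envoy itself provides much of the raw material: an admin endpoint that serves metrics in Prometheus format at /stats/prometheus, plus structured access logs. The fragments below sketch how those could be wired up; the port and log field names are assumptions.

```yaml
# Illustrative observability fragments (port and field names are examples).

# 1) Admin interface; a Prometheus scraper can pull /stats/prometheus on this port.
admin:
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }

# 2) Structured JSON access logs, attached to the HTTP connection manager.
access_log:
- name: envoy.access_loggers.stdout
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
    log_format:
      json_format:
        method: "%REQ(:METHOD)%"
        path: "%REQ(:PATH)%"
        status: "%RESPONSE_CODE%"
        duration_ms: "%DURATION%"
        upstream_cluster: "%UPSTREAM_CLUSTER%"
```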
The platform continues to evolve as we add new capabilities and optimize existing ones. It has become a crucial part of our infrastructure, handling millions of requests per second while providing the reliability and flexibility modern cloud applications need.
For engineers looking to build similar systems, I'd emphasize the importance of starting with a solid foundation of basic routing and gradually adding more sophisticated features as needed. The investment in good observability and configuration management will pay dividends as the system grows in complexity and scale.
This experience taught me that building reliable distributed systems isn't just about technology choices; it's about understanding user needs, planning for failure, and building systems that are both powerful and simple to use. The success of our platform showed that with the right architecture and approach, we can solve complex problems while making life easier for our fellow engineers.