Operating at Scale - An Inside Look at Facebook's Production Engineering Team
We connected with Facebook about their Production Engineering team in NYC. We discuss all the details on this team, the products they support in NYC, the types of profiles they look for when hiring, and more!
What is Production Engineering (PE)?
Facebook's approach to Production Engineering has evolved through the years. It started with a traditional Site Reliability Engineering (SRE) model in 2009, which morphed into Site Reliability Operations (SRO) + AppOps and then, in 2012, into PE as we know it today. Facebook’s services, and the company overall, have scaled to the point that no single team can be responsible for the health of even a single Facebook app.
Production Engineering is a form of specialization that focuses on reliability, scalability, and efficiency. This specialization is an acknowledgment that Facebook has expanded to a level where the teams are interdependent. Typically, there are only a handful of Production Engineers (PEs) working on a service along with a far greater number of Software Engineers. Without the PE teams, who hold themselves accountable for the health of Facebook, we wouldn't be running at the scale and availability that we are today.
How does the position of a Production Engineer differ from a Software Engineer and how are they similar?
PEs *are* Software Engineers, with specialization. Production Engineering is a focus, not just a role.
Like traditional Software Engineers, PEs are expected to have strong development skills and to write efficient, maintainable code. We also expect Production Engineers to be adept at designing robust services, exhibit a thorough understanding of Linux internals, and have experience with large-scale network operations.
What are the high level responsibilities for Production Engineers at Facebook?
Running services at Facebook’s scale requires a balance of features, reliability, and performance. PEs focus on service reliability and performance, partnering with Software Engineering peers, who tend to focus on service features.
Production Engineering’s obsession with reliability, performance, and automation enables PEs to cultivate these values in their Software Engineering partners.
The most successful PE team is one that’s no longer required, as the software, tooling, and engineering practices exhibit PE values.
Which teams do Production Engineers support at FBNY?
In the Facebook New York office, we currently support five teams: Instagram, Messenger Infrastructure, News Feed, Kernel, and AI.
- Instagram PEs are responsible for the reliability and disaster readiness of one of the largest Python-based front end services in the world. On the backend, we support the full lifecycle of machine learning for Instagram (e.g., feature engineering, data pipelines, training and inference), as well as all the product infrastructure that brings the Instagram experience to life.
- Messenger Infrastructure runs all the back-end services that power the Messenger product. The infrastructure we support includes things like the software connecting your device to a Facebook Datacenter, the mechanisms used to route messages between users, data synchronization systems, and our push notification pipeline.
- Facebook News Feed is a major component of the Facebook user experience and is critical to Facebook's Ads business. The challenges of scaling the ever-growing infrastructure for News Feed are many, and they require us to remain nimble and change the infrastructure to efficiently serve internal and external customers.
- The Facebook Kernel team contributes to the upstream Linux kernel the features needed to build the next generation of scalable services in our infrastructure. PEs ensure safe rollouts, build monitoring, and bring knowledge and understanding of Facebook’s production infrastructure.
- The AI Infrastructure and Applied Machine Learning teams deliver the machine intelligence to Facebook, WhatsApp, Instagram and Oculus, bringing language, computer vision, personalization, and integrity techniques to production. The Production Engineers supporting these efforts help drive the scaling, efficiency, and reliability of the full ML lifecycle from feature engineering, to model training, to inference. We are solving machine learning problems that few have thought about, at a scale few have ever seen.
What type of background does the Production Engineering team generally look for when hiring?
Ideal candidates bring some of these experiences to PE:
- Developed industrial-grade software with high-availability requirements
- Managed or supported large-scale production environments
- Diagnosed application performance in a Linux environment
- Debugged issues within a multi-regional network
- Effectively managed stakeholders - either internal or external
- Automated processes - whether they be testing, build, or deployment
Examples of roles that most often overlap with Production Engineering include Back-end Software Engineers, Site Reliability Engineers, or DevOps engineers.
If PE excites you and your experiences don’t strongly align, we offer the Discover PE Program, designed to introduce candidates from diverse backgrounds and experiences to PE work. This opportunity enables candidates to build up their PE skill-set over 12 months.
What can someone expect in terms of a career path within Production Engineering, either within this team or in other groups at Facebook?
At Facebook, there are multiple dimensions to career growth: path, level, and functional focus. You can grow as you take on additional responsibility as an individual contributor (IC), or you can switch paths and grow as a leader. We encourage mobility, so you will be able to identify opportunities to explore new functional areas for PE. You can also switch paths from leadership to IC. The choice is yours.