The TripAdvisor Technical Operations team is looking for a Senior Infrastructure Engineer to join us. Our team is responsible for the production infrastructure, and its operation, behind the world’s largest travel web site. We operate several datacenters in different geographical regions, the servers and networking in each datacenter, and the wide-area network backbone that connects these datacenters and our offices around the world. We rely heavily on automation and software systems to improve our operational efficiency and accuracy, from inventory and provisioning of systems to monitoring and auto-remediating issues in our environment. We also continuously codify the behaviors and rules that are important for the systems to operate correctly and efficiently. A successful candidate must be like-minded and bring hybrid software and systems engineering experience and insight to help build, operate, and maintain our rapidly growing infrastructure.
Responsibilities and duties:
- Responsible for the reliability, availability and security of our infrastructure, continuously improving it and sharing an on-call rotation with the team.
- Responsible for gathering and analyzing requirements from the teams that we support and helping them implement their applications or systems on our infrastructure, as well as providing support for application and system operations.
- Responsible for improving the capacity of our infrastructure through capacity planning, budgeting and forecasting, and implementation.
- Responsible for improving the reliability and resilience of our infrastructure through root-cause analysis and reviewing gaps in designs and implementations of our infrastructure.
- Responsible for spearheading the adoption of containerization and Kubernetes throughout TripAdvisor Media Group, improving the operations of our growing number of production and development clusters, and adding or enhancing the features that we support on these clusters.
Qualifications and skills:
- BS or MS in Computer Science or related technical field
- Strong understanding of data structures and algorithms
- Strong knowledge of UNIX and TCP/IP network fundamentals
- Good knowledge of common configuration management, deployment and orchestration tools
- Ability to code really well in at least one programming language, with experience using it to enhance existing software systems, and to take advantage of tools that help you code better
- Strong understanding of large-scale Internet service architectures, including load balancing, DNS, CDNs, and HTTP/HTTPS proxies
- Proven ability to pick up new technologies and tools very quickly
- Ability to take calculated risks in order to move fast, while having a plan for when things go wrong
- Experience in an operations role supporting a 24/7 production environment
- Organized, with good attention to detail, and able to work both independently and with a team
- Strong written and verbal communication skills in English