What we need
We are looking for a passionate, hard-working, and talented Senior Site Reliability Engineer to take the lead on solving some of the toughest operational challenges in some of the most sensitive and mission-critical automated warehouse solutions. The SRE team will drive the stability and sustainability of these next-generation systems and discover innovative ways to scale and operate them reliably as we expand. In this role, you will work with cross-functional teams such as Operations, Mechanical Hardware Engineering, IT Infrastructure Systems, and Software Engineering to identify and address underlying resiliency gaps.
What we do
The Site Reliability Engineering team is part of the Technology Support organization and is responsible for all Root Cause Analysis of both industrial and enterprise systems and software. We are a passionate cross-functional team solving some of the toughest challenges in some of the most sensitive and mission-critical automated warehouse solutions.
What you'll do
-
You will be part of the SRE Team, which is focused on hands-on root cause analysis of all critical production outages to improve resiliency.
-
You are responsible for analyzing various sources of metrics and dashboards, parsing logs, and articulating the findings in a fact-based, actionable Root Cause Analysis investigation, leading a group of Subject Matter Expert teams to find the actual cause.
-
Host RCA calls as chair and drive the RCA process to conclusion within tight SLAs with customer-facing deliverables.
-
Lead problem tickets and improvements to major software components, systems, and features to improve the availability, scalability, latency, and efficiency of the Symbotic System.
-
Engage in and improve the service lifecycle from inception and design to deployment, operation, and refinement based on lessons learned through deep dives.
-
Hands-on troubleshooting of VMware, Kubernetes, custom software, and infrastructure performance incidents.
-
Be a trusted technical advisor who leads complex root cause analysis investigations from beginning to end, until all achievable improvements are identified.
-
Demonstrate sound knowledge of gathering logs and facilitating a fact-based root cause analysis with cross-functional teams.
-
Assist internal teams with corrective actions and improvement tickets, and influence their timely completion.
-
Flexibility to work occasional non-standard hours, including weekends, may be required depending on criticality and workload demands.
-
Ability to travel up to 10%.
What you'll need
-
Bachelor's degree in Software Engineering, Information Systems, Computer Science, or a related field.
-
Minimum 8 years of experience working with ITSM tools such as Jira or an equivalent tool (ITIL Problem Management experience is a plus).
-
Minimum 8 years of infrastructure engineering experience with a record demonstrating the delivery of high-quality, large-scale solutions requiring planning and change control.
-
Minimum 8 years of experience in the operation of production systems, including troubleshooting, testing, and automation.
-
Minimum 5 years of experience leading technical Root Cause Analysis (software and/or industrial focus is a plus).
-
Ability to prioritize parallel RCA investigations and tasks by influencing cross-functional teams to complete actions on time and to a demanding quality standard.
-
Experience with executive incident communications and RCA report writing, with strong written communication skills for non-technical audiences.
-
Ability to apply a broad technical background to projects through excellent problem-solving and the competence to work with other technical teams. Ability to efficiently read and understand GitLab technical documentation.
-
Experience in the advanced use of tools like Prometheus, Grafana, LogicMonitor, Elastic, and VMware, and in the use of command-line interfaces (Kubernetes or Linux).
-
Knowledge of Power BI, Tableau, executive report writing, and presentation skills is a plus.
Our environment
-
Up to 10% travel may be required. Employees must have a valid driver's license and the ability to drive and/or fly to client and other customer locations.
-
The employee is responsible for owning a credit card and managing expenses personally to be reimbursed bi-weekly.