Cogito is searching for a Principal Site Reliability Engineer to join our organization. The ideal candidate has a mix of customer-facing skills, strong operational production support experience, systems know-how and leadership skills.
You have deep experience running Kubernetes infrastructure for production SaaS systems on a global scale. You have organized a team to provide responsive on-call support for both external and internal customers, and you have maintained and upgraded large-scale production systems and cloud infrastructure to meet strict security requirements while maintaining operational SLOs and customer SLAs.
You have experience in platform migrations and have successfully moved large production workloads from legacy systems to a Kubernetes-based containerized microservices architecture. You possess very strong troubleshooting skills and have led a team of SREs/operations engineers to perform troubleshooting, system maintenance and software updates in a cloud-native Kubernetes environment.
You have transformed the operations model to enable continuous deployments and high developer productivity while the company has experienced hypergrowth and international expansion. You are well organized and thrive in fast-paced environments where priorities are set based on business needs. Conversant with a wide variety of subjects, you have the ability to triage and manage a broad range of issues. You have led multiple projects to successful completion and deployment to production.
You have strong leadership skills and have been a team lead or manager of talented professionals in the past. You can delegate and collaborate smartly, leveraging the strengths of individuals while building new skills and capabilities in the organization. You have been a mentor in the past, and have found it a rewarding experience. Teaching and leading others is part of what you do naturally.
- Delight internal and external customers through responsive, well-organized on-call SRE team support and highly performant, well-maintained tools, systems and Kubernetes-based production infrastructure.
- Provide timely resolution to customer concerns and issues. Troubleshoot software and infrastructure issues as needed.
- Maintain PCI, HITRUST, HIPAA and SOC 2 status by maintaining the tools and systems you are responsible for, keeping the software updated and supporting our security team during security audits.
- Continuously improve the reliability and cost efficiency of our services and infrastructure.
- Develop and drive the SRE engagement model, conduct production readiness reviews and improve our operational processes to enable company growth and international expansion.
- Automate processes and practices to manage cloud infrastructure lifecycle and configurations to client specifications.
- Design and architect technical solutions to meet customer requirements and communicate them to a broad range of stakeholders within the business.
- Lead the team, organize on-call schedules and run scrum meetings effectively. Keep track of deliverables and provide weekly status reports on team achievements.
- Bachelor's degree in a CS/IS/IT/System Administration related field or equivalent experience
- 5+ years in a DevOps, Site Reliability Engineer or equivalent role
- Willingness to learn new technologies and skills on the fly
- Willingness to mentor junior team members
- Demonstrated history of working in environments with any of the following compliance standards: PCI, HITRUST, HIPAA, Sarbanes-Oxley, ISO 27002, CIS L1 & L2
- Extensive experience with production Kubernetes clusters and related tooling (Service Mesh, Ingress Controller, Operators). This is a critical requirement to be successful in this role.
- Extensive experience with a public cloud provider (AWS is preferred).
- Proficiency in programming/scripting languages to automate repeatable processes and develop/enhance microservice-based systems (e.g. Bash, Python, Go, Java)
- Experience with configuration management tools (e.g. Ansible, Chef, Puppet)
- Experience in building infrastructure as code (e.g. Terraform, CloudFormation)
- Deep understanding of Linux
- Extensive experience with a CI/CD tool such as Jenkins or Travis CI
- Extensive experience in troubleshooting and debugging application related network and infrastructure issues
- Experience in production SaaS environments
- Excellent communication and documentation skills
- Experience working in a company that practices both Agile and DevOps
- Your choice of comprehensive benefits for you and your dependents, effective on date of hire: health, dental, vision, flexible spending, life insurance, disability, additional voluntary supplemental life insurance
- Pet Insurance
- Employee Assistance Programs (EAP)
- 20 days vacation time, 5 days sick time, 2 floating holidays and 11 company holidays
- 2 "Be Gentle" personal days
- 401(k) retirement plan options
- Competitive pay and bonus eligibility
- Stock options via equity grants
- Ongoing professional development and cross-training
- Company paid parental leave upon hire
- Office Optional policy where Cogicians choose where they work: primarily remote, primarily in office or hybrid
- Ability to support Cogicians anywhere in the US through our Office Optional policy
- Employee Referral Bonus Program
- Employee Resource Groups
Equal Opportunity Employer
Cogito is a proud equal opportunity employer. We are committed to fair hiring practices and to creating a welcoming environment for all team members. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, disability, age, familial status or veteran status.
Authorization to Work
Applicants for employment in the US must be authorized to work in the US.