The Incident Response Team is a newly formed team that reports directly to the VP of Engineering, Security and Infrastructure and will be at the heart of Global Operations at MuleSoft. The team will be responsible for the initial response to and triage of all operational incidents and will champion the full lifecycle of these incidents, working directly with Engineering Managers to groom work backlogs and prioritize high-impact fixes.
The Director, Site Reliability Engineering will build and lead the Incident Response Team, which is responsible for making sure our services maintain the highest availability. You will lead Incident Retrospectives across the engineering teams to identify the failures in people, process, and technology that lead to incidents, develop corrective actions, and track them through to completion. This will involve communicating the status of incidents to the business and supporting outbound communication to customers. You will lead, own, develop, and refine the Change Management Process, the overall Cost Management Initiatives, and the Change Control Review Board (CCRB), and develop statistical measures of success for the CCRB. You will own the end-to-end Incident Management and Problem Management Processes, build the policies and procedures to respond to incidents and match the business needs, and partner with various groups.
Goals for your first three months:
- Collaborate with the Engineering and DevOps teams to start to understand the environments and staffing requirements for operating a 24/7 team to respond to incidents
- Build the overall Incident and Event Management Policies and Process
- Work with various stakeholders in the organization to build requirements and identify gaps in documents and runbooks
- Start to hire a team in SF, BA, or ORD (the team doesn’t need to be 24/7 to start)
- Establish and exercise the incident response plan for operational issues
- Build metrics around SLAs, MTTx (Mean Time to X), and other core KPIs for the team, and start to own the statistical reporting and data management functions for Incident (SLAs, MTTx calculations), Change (Change Induced Incident Minutes, etc.), and Problem Management (Actions, Completion %)
- Work with engineering teams to make sure that we have full coverage of operational issues across all services
- Start to build end-to-end knowledge and instrumentation of the system to identify when we have issues
- Establish the cadence of the team and have the foundational set of policies and procedures in place
- Have buy-in from all engineering management and leadership for the direction of the team
- Have the team off the ground and working incidents, the RCA process, and change management
The ideal candidate will have:
- Senior leadership experience with incident, change, and problem management in a software engineering organization with dozens of stakeholders and conflicting priorities, and the ability to build a team from the ground up
- PMO, PGM, Jira, and Agile experience
- Experience building and presenting SLA and other technical data to executive management
- Certifications involving disaster, security, incident and problem management (GIAC, SANS, ITIL, CERT, FEMA, etc.) - these are helpful but not required
What you’ll get from us:
We realize exceptional people don’t choose jobs based solely on benefits, but we do our best to make sure that you’re set up for success so you can do your best work. As a Muley, you’ll be based in our downtown Union Square HQ and receive comprehensive health benefits, life insurance, paid parental leave, 401(k), equity, and flexible vacation time. Plus the fun stuff, like a fully stocked kitchen, catered lunches, volunteer opportunities, onsite happy hours, free yoga classes, an annual rafting trip, offsite activities, and MeetUp, our annual all-company offsite in California. Check out our Life at MuleSoft page to learn more!