Description

  • Establish and lead the SRE / production engineering practice for the Common Services organization, including standards for reliability, incident management, and on-call, in partnership with the central Product Engineering organization.
  • Develop an Operational Excellence strategy that focuses not only on improving system performance but also on monitoring and reducing operational toil.
  • Partner with engineering and product teams to define SLOs, SLIs, and error budgets for critical Common Services, and ensure these become part of how teams plan and make tradeoffs.
  • Own and improve the incident management lifecycle for Common Services, including on-call rotations, escalation paths, incident tooling, post-incident reviews, and follow-through on corrective actions.
  • Drive the observability strategy (metrics, logs, traces, dashboards, alerts) for Common Services, ensuring we have actionable visibility into the health, performance, and capacity of key systems.
  • Collaborate with engineering leads to design and review architectures for reliability, scalability, resilience, and operability, including failure modes, redundancy, and graceful degradation.