Overview
Site Reliability Engineer Jobs in Dubai, UAE at Canonical
Roles and responsibilities
A Site Reliability Engineer (SRE) is responsible for ensuring that a company’s systems, services, and infrastructure are reliable, scalable, and efficient.
The role is a hybrid between software engineering and operations, with an emphasis on improving the reliability and performance of services through automation, monitoring, and proactive issue resolution.
SREs work to ensure that applications and systems are available and performant, typically using a combination of software engineering practices, system administration, and deep monitoring of system health. They also create systems to reduce manual intervention and automate processes to increase efficiency and uptime.
Key Responsibilities
1. System Reliability and Performance
Monitoring and Incident Management:
Set up and maintain monitoring tools (e.g., Prometheus, Grafana, Datadog) to track system performance, uptime, and error rates. Quickly identify issues and mitigate service outages by responding to incidents.
Service-Level Objectives (SLOs):
Define and manage Service-Level Objectives (SLOs) and Service-Level Indicators (SLIs) to measure and maintain system reliability, ensuring that services meet business and customer expectations.
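As a rough illustration of how an SLI relates to an SLO, the sketch below computes an availability SLI from request counts and the share of error budget remaining against a target. The function names, the request counts, and the 99% target are hypothetical and chosen only for the example; they do not describe any particular team's practice.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events that succeeded (1.0 when there was no traffic)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Share of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    slo_target is a fraction such as 0.99 (hypothetical example value).
    """
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target (or zero traffic) leaves no budget to spend.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical window: 1000 requests, 995 of them successful, 99% SLO.
    print(f"SLI: {availability_sli(995, 1000):.4f}")
    print(f"Budget remaining: {error_budget_remaining(0.99, 995, 1000):.2f}")
```

With these example numbers, 5 failed requests against roughly 10 allowed leaves about half of the budget.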
Incident Response:
Respond to production incidents, troubleshoot issues, and minimize downtime. After incidents, perform post-mortem analyses to identify root causes and prevent recurrence.
Capacity Planning:
Ensure that systems can scale with growing load, handle spikes in demand, and maintain performance during high-traffic periods. Plan for scaling resources based on traffic projections and historical usage patterns.
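One simple way to turn historical usage into a traffic projection is a straight-line trend fit. The sketch below assumes evenly spaced peak-load samples, a hypothetical per-instance capacity, and a 30% headroom factor; real planning would usually account for seasonality and burstiness as well.

```python
import math


def project_peak(history: list[float], periods_ahead: int) -> float:
    """Extrapolate peak load with an ordinary least-squares line.

    history holds one peak-load sample per period, oldest first.
    """
    n = len(history)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_x
    # The last observed period has index n - 1.
    return intercept + slope * (n - 1 + periods_ahead)


def instances_needed(projected_peak: float, per_instance_capacity: float,
                     headroom: float = 0.3) -> int:
    """Instances required to serve the projected peak plus headroom (assumed 30%)."""
    return math.ceil(projected_peak * (1 + headroom) / per_instance_capacity)


if __name__ == "__main__":
    # Hypothetical monthly peak requests per second.
    peaks = [100.0, 120.0, 140.0, 160.0]
    projected = project_peak(peaks, periods_ahead=2)
    print(projected, instances_needed(projected, per_instance_capacity=50.0))
```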
2. Automation and Infrastructure as Code (IaC)
Automation of Repetitive Tasks:
Write scripts and create automation tools to replace manual processes, such as deployments, monitoring, and scaling. This may involve using tools like Ansible, Terraform, or Kubernetes.
Infrastructure Management:
Implement and manage infrastructure as code (IaC) practices to provision, configure, and manage cloud infrastructure (e.g., AWS, GCP, Azure) and on-premises resources, using tools like Terraform, CloudFormation, or Kubernetes.
Continuous Integration and Continuous Delivery (CI/CD):
Build and maintain CI/CD pipelines to automate software deployments, ensuring that changes are automatically tested, validated, and pushed to production.
3. Reliability and System Optimization
Root Cause Analysis:
After an incident, conduct a thorough post-mortem and root cause analysis to understand why failures occurred and how to prevent them in the future. Share findings with stakeholders and implement corrective actions.
Performance Tuning:
Continuously optimize the performance of services by tuning servers, databases, networking, and application code to reduce latency and increase throughput.
Disaster Recovery Planning:
Design, implement, and test disaster recovery strategies to ensure that systems can quickly recover from major failures or outages.
4. Collaboration and Communication
Cross-Functional Collaboration:
Work closely with development teams to integrate reliability and performance into the development lifecycle. Provide feedback to developers on how to improve the reliability and operability of their services.
Documentation:
…
Title: Site Reliability Engineer
Company: Canonical
Location: Dubai, UAE
Category: IT/Tech, Software Development