Job Role: Azure Site Reliability Engineer
Location: Toronto, ON, Canada (Hybrid)
Job Type: Contract
Job Description:
Monitoring and Alerting
Implement and maintain monitoring systems to proactively identify potential issues and alert engineers to problems before they impact users
Incident Response
Respond to incidents and outages, diagnose problems, and implement solutions to minimize downtime and restore service
Automation
Automate repetitive tasks and processes to improve efficiency and reduce manual effort
Performance Optimization
Identify and address performance bottlenecks to ensure systems run efficiently and effectively
Infrastructure Management
Manage and maintain the underlying infrastructure including servers, networks, and cloud resources
Capacity Planning
Plan for future capacity needs to ensure systems can handle anticipated workloads
Release Engineering
Develop and maintain processes for deploying software updates and releases
Collaboration
Work closely with developers, operations teams, and other stakeholders to ensure system reliability and availability
Documentation
Maintain clear and concise documentation of systems, processes, and procedures
Continuous Improvement
Identify areas for improvement and implement changes to enhance system reliability and performance
Skills and Qualifications
Cloud Platform: Microsoft Azure
Excellent knowledge of Azure Kubernetes Service (AKS)
Monitoring tools: Dynatrace, Splunk, Grafana
Operating Systems: Windows, Linux
Scripting: Shell scripting, Python, PowerShell
Databases: MySQL, Oracle, SQL database management
Container Services: Kubernetes, Docker, Helm
An understanding of Camunda is preferable