Job Description
About the Role:
We are seeking a skilled and proactive Site Reliability Engineer (SRE) to join our infrastructure team. As an SRE, you will bridge the gap between development and operations by applying software engineering principles to infrastructure and operations problems. Your mission will be to build scalable and highly reliable systems, ensuring uptime, performance, and automation at every layer.
Responsibilities:
Design, implement, and maintain scalable, resilient infrastructure.
Develop and maintain tools and automation to support infrastructure and deployment.
Monitor systems for performance, reliability, and availability using observability tools.
Participate in incident response, root cause analysis, and post-mortems.
Implement and advocate for SLOs/SLIs to ensure service quality and reliability.
Work closely with development teams to improve CI/CD pipelines and deployment strategies.
Enhance security, compliance, and infrastructure governance.
Requirements:
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
10+ years of experience in DevOps, SRE, or related roles.
Proficiency in at least one programming or scripting language (Python, Go, Bash, etc.).
Experience with cloud platforms (AWS, GCP, or Azure).
Hands-on experience with containerization (Docker, Kubernetes).
Strong knowledge of monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK, Datadog).
Familiarity with Infrastructure as Code (Terraform, CloudFormation, or similar).
Excellent problem-solving skills and a collaborative mindset.