We are seeking a proactive Site Reliability Engineer (SRE) to drive reliability, performance, and efficiency across our systems and platforms. You'll work closely with Application Development, QA, Product, and Data Engineering teams to champion a DevOps/SRE culture rooted in automation, observability, and continuous improvement.
Key Responsibilities:
Collaborate cross-functionally to promote SRE and DevSecOps best practices across the organization.
Build and maintain reliable, scalable systems with a focus on availability, performance, and resiliency.
Establish and monitor SLOs/SLIs, and develop comprehensive dashboards to support decision-making from both technical and business perspectives.
Lead efforts to reduce toil through automation, self-healing systems, and advanced monitoring (e.g., synthetic monitoring, RUM).
Apply observability and reliability testing practices from architecture through operations, leveraging Agile and product-based models.
Drive the adoption of modern tooling for observability, automation, platform engineering, AIOps, and MLOps.
Contribute to and lead Communities of Practice (CoP) and SRE Office Hours to foster knowledge sharing and continuous improvement.
Qualifications:
SRE & DevOps Expertise:
Strong experience in observability, toil reduction, incident response, and performance optimization.
Proficient with monitoring tools such as Dynatrace, CloudWatch, and Azure Monitor.
Skilled in Infrastructure as Code (IaC), Configuration as Code (CaC), JSON, and scripting with Python, Node.js, Ruby, PowerShell, and Shell.
Deep understanding of Dynatrace advanced features: DT Guardian, RUM, Synthetic Monitoring, AI-based event correlation.
Cloud & Automation:
Expert in AWS Cloud services: CDK, Lambda, CloudWatch, EKS, EC2, ELB, S3, SSM.
Experience with log ingestion pipelines (AWS Firehose, Dynatrace OpenPipeline) and operational dashboards.
Hands-on experience with Ansible Tower, AWS SSM, Bitbucket/GitHub, and CI/CD workflows.
Orchestration & Data:
Familiarity with orchestration tools like Step Functions, Apache Airflow, and container platforms.
Knowledge of data pipelines, data lakes, and databases (Redshift, RDS, Aurora, PostgreSQL, SQL Server, Oracle).
Leadership & Communication:
Strong problem-solving and knowledge management skills.
Effective communicator who bridges technical and business teams.
Collaborative, inclusive leader who builds high-performing teams and fosters a culture of growth and recognition.