Lead Site Reliability Engineer

Company:

Pegasus Knowledge Solutions, Inc.

Location:

Concord, NC, 28027

Posted:

June 29, 2025

Description:

Title: Lead Site Reliability Engineer
Job Type: Contract
Duration: 12+ Months
Location: Concord, North Carolina

We are seeking a Lead Site Reliability Engineer (SRE) with deep expertise in AWS networking, infrastructure automation, and production system reliability.

This role demands a strong grasp of observability, operational excellence, and the ability to drive the adoption of DevOps/SRE best practices across engineering teams.

You will be instrumental in shaping SLIs/SLOs, defining our DevOps maturity roadmap, and building robust, scalable infrastructure using Terraform, Lambda, Step Functions, and more.

You will lead a team of SREs and collaborate closely with DevOps, Security, and Application teams to ensure reliable delivery and availability of services.

Responsibilities:

Lead and mentor a team of SREs in developing scalable infrastructure and operational processes.
Design and implement SLIs, SLOs, and error budgets across critical services, and evangelize them across product teams.
Architect and manage AWS networking environments, including VPCs, Transit Gateways, Route 53, VPNs, NACLs, and Security Groups.
Manage and monitor Palo Alto and FortiGate firewalls, and integrate them with cloud environments for hybrid network visibility.
Define and evolve DevOps maturity models, guiding teams toward higher automation and reliability.
Build and manage observability dashboards using Grafana, CloudWatch, and Datadog to track application and infrastructure health.
Implement and maintain Infrastructure as Code (IaC) using Terraform to automate cloud deployments across environments.
Develop and maintain serverless applications using AWS Lambda and Step Functions to support platform automation and operations.
Collaborate with developers to define GitLab CI/CD pipelines and streamline the build, test, and deployment lifecycle.
Champion incident response, blameless postmortems, and continuous improvement initiatives.
Write scripts in Python or Bash to automate tasks and integrate systems.

Required Qualifications:

7+ years in SRE, DevOps, or Systems Engineering roles with increasing responsibility.
Proven experience managing AWS production environments with a focus on networking.
In-depth knowledge of Palo Alto and/or FortiGate firewall management and troubleshooting.
Expertise in monitoring and observability tools, including Grafana and Datadog.
Hands-on experience with Terraform in managing cloud infrastructure at scale.
Experience building and deploying serverless architectures using Lambda and Step Functions.
Demonstrated understanding of SLI/SLO design, error budgets, and reliability metrics.
Strong understanding of CI/CD principles and tools like GitLab CI/CD.
Proficiency in scripting using Python or Bash.

Preferred Qualifications:

AWS Certifications (e.g., Solutions Architect, Advanced Networking, DevOps Engineer).
Familiarity with DevOps/SRE maturity models and implementing organizational transformation.
Experience with compliance frameworks (SOC 2, ISO 27001, etc.) as they pertain to infrastructure reliability.
Familiarity with container orchestration is a plus.

Soft Skills:

Strong leadership and mentoring capabilities.
Ability to translate complex technical problems into actionable initiatives.
Excellent communication and cross-functional collaboration skills.
Bias for automation and continuous improvement.

Anam Chaudhry
