Site Reliability Cloud Engineering

Location:

Chantilly, VA

Posted:

June 26, 2025


Resume:

Amardeep Sharma

****.*********@*****.*** • 313-***-****

LinkedIn • Washington, DC/Northern VA

Senior Engineering Leader • Site Reliability • DevOps • Observability • Cloud Engineering

Results-driven technology leader with extensive experience in Site Reliability Engineering (SRE), Observability, DevOps, Cloud Engineering, and Software Development, with a strong record in both financial services and enterprise environments. Adept at leading large cross-functional teams and scaling engineering capabilities to deliver high-performance systems through modern practices like cloud migration (AWS), infrastructure as code (Terraform, Puppet), and CI/CD automation. Proven success in driving digital transformation, modernizing legacy systems, and implementing enterprise-level observability, monitoring (OpenTelemetry, Dynatrace, Splunk), and incident management frameworks. Recognized for championing service reliability through metrics-driven development (SLIs, SLOs, error budgets) and pioneering chaos engineering initiatives to ensure system resiliency. A strategic and collaborative leader, skilled in aligning engineering goals with business objectives, fostering cross-functional partnerships, and mentoring teams with a focus on operational excellence, innovation, and continuous improvement.

Areas of Expertise

Site Reliability Engineering (SRE) Frameworks

AWS Cloud Migration & Lift-and-Shift

Observability & Monitoring (OpenTelemetry, Dynatrace, Splunk, Catchpoint)

Cloud Governance & Security Compliance

Service Level Indicators (SLIs), Objectives (SLOs) & Error Budgets

Chaos Engineering & Fault Injection (AWS FIS, Chaos Toolkit)

Infrastructure as Code (Terraform, GitLab CI/CD, Puppet)

DevOps Automation & Single-Click, Push-Button Deployments

Automated Shakeouts Framework

Resilience Automation Frameworks

Incident Management & Post-Incident Reviews

Self-Healing Automations

Middleware & Containerized Application Deployment (ECS, Tomcat, JBoss, Spring Boot)

Capacity Planning, Performance Optimization & Latency Reduction

Disaster Recovery Strategy & Business Continuity Planning

Total Cost of Ownership (TCO) Analysis & Vendor Management

CI/CD Pipeline Migration (UCD/Jenkins to GitLab/Terraform)

Monitoring Dashboards & Shift-Left Testing Strategies

Career Experience

Fannie Mae – Reston, VA

Senior Manager, Site Reliability Engineering 2020 – 2025

Led a 47-engineer SRE organization to enhance application resilience and availability for key enterprise initiatives. Directed on-prem legacy application refactoring and cloud migration to AWS as part of a major digital transformation. Steered adoption of AWS Resilience Hub and chaos engineering tools (AWS FIS, Chaos Toolkit) to validate system robustness. Implemented full-stack observability using OpenTelemetry, Dynatrace, Splunk, and Catchpoint for real-time performance insights. Oversaw incident management and post-incident analysis to implement durable solutions and reduce recovery times.

Established SLIs, SLOs, error budgets, and Golden Signals dashboards to drive metrics-based service reliability.

Designed and deployed an AWS-native cross-region failover/failback automation framework for improved uptime.

Migrated CI/CD pipelines from UCD/Jenkins to Terraform/GitLab to streamline software delivery and improve reliability.

Standardized logging and tracing practices across teams to enhance monitoring, diagnostics, and operational visibility.

Introduced shift-left testing, capacity validation, and performance frameworks to improve software quality and release velocity.

Software Engineering – Manager 2018 – 2020

Steered design and development of Visor application, providing real-time operational insights, inventory management, compliance reporting, and system health monitoring to over 3,000 internal users. Built and managed a cross-functional team to deliver sprint-based application development, infrastructure integration, and operational tools to enhance platform reliability. Directed containerization initiatives by delivering ECS deployment solutions and guiding teams on CI/CD adoption for scalable microservices.

Spearheaded the adoption and deployment of Dynatrace APM and BPM (Lombardi) solutions, improving system observability and business process automation across enterprise platforms.

Engineered the shift from Oracle WebLogic to open-source Tomcat and JDK, saving approximately $3M by eliminating the Oracle ULA.

Drove automation for middleware provisioning, patching, and decommissioning via ServiceNow and Script Central, reducing operational turnaround times from 15 days to under 3 days.

Implemented Infrastructure as Code (IaC) using Puppet to provision and configure web and application stacks, improving deployment speed and consistency.

Led root cause analysis and implemented robust incident response protocols for production failures, enhancing system resilience and reducing recurring outages.

Software Engineering – Product Manager 2016 – 2018

Led Total Cost of Ownership (TCO) analysis to identify cost-saving opportunities and optimize technology investments across enterprise systems. Managed vendor relationships and license agreements, ensuring compliance, favorable terms, and maximum value from enterprise software providers. Directed capacity planning and forecasting, aligning infrastructure resources with business growth and future demand. Oversaw asset lifecycle management, including decommissioning legacy systems and managing end-of-life transitions to streamline operations.

Developed technology roadmaps for database and middleware platforms, aligning upgrades and enhancements with long-term strategic objectives.

Spearheaded the "Get Current, Stay Current" initiative, standardizing patch management and system updates to strengthen performance and security.

Created pricing models and service catalogs for technology offerings, improving budgeting accuracy and cost transparency for business units.

Software Engineering – Lead 2006 – 2016

Led resolution of high-impact production issues through in-depth root cause analysis, minimizing system downtime and ensuring rapid incident recovery. Directed infrastructure maintenance, upgrades, patching, and technology refresh initiatives to enhance performance, scalability, and security. Managed installation, configuration, and support of Web and Application Servers (WebLogic, JBoss, TcServer), ensuring optimal system performance. Oversaw migration from WebLogic to JBoss/TcServer, improving system reliability and reducing operational costs.

Delivered multiple enterprise-level Middleware Operations projects, including:

- OOR Datacenter Buildout – Designed and implemented a new datacenter to support critical operations.

- Web & Application Audits – Ensured regulatory compliance and improved system efficiency.

- Datacenter ID Swap – Orchestrated a seamless infrastructure ID transition with minimal service disruption.

- JBoss EAP Implementation – Led adoption of JBoss Enterprise Application Platform across environments.

- Disaster Recovery Exercises – Planned and executed annual DR tests to validate business continuity strategies.

Additional Experience

Software Engineer

Capgemini – BellSouth

Education

Master of Science in Computer Engineering

Wayne State University, Michigan

Bachelor of Science in Technology

Punjab Technical University, India

Certifications & Training

AWS Certified Solutions Architect – Associate

AWS Certified Cloud Practitioner

Sun Certified Java Programmer (SCJP)

ITIL Foundations

Completed Microsoft Azure Training – AZ-900: Azure Fundamentals

Completed Microsoft Azure Training – DP-900: Data Fundamentals
