Site Reliability Engineer

Location:

Raleigh, NC

Posted:

July 11, 2025


Resume:

MORRISVILLE, US, ***** • ************@*****.*** • 475-***-****

AMNA ASIF

Site Reliability Engineer

PROFESSIONAL SUMMARY

Versatile IT Professional with 9+ years of experience in enterprise infrastructure and cloud technologies. Proven track record of architecting, implementing, and optimizing scalable solutions across AWS, Azure, and GCP platforms. Expert in containerization, automation, and DevOps practices, leveraging tools such as Kubernetes, Docker, Ansible, and Terraform. Skilled in designing high-availability systems, enhancing security postures, and streamlining IT operations. Adept at leading cross-functional teams and aligning technical solutions with business goals. Committed to driving innovation and efficiency in complex IT environments while ensuring robust, reliable, and cost-effective infrastructures.

EMPLOYMENT HISTORY

DIGITAL SITE RELIABILITY ENGINEER, Jun 2023 - Present

HILTON, Memphis, TN

Manage Kubernetes deployments, ensuring optimal pod performance and streamlined processes.

Implement comprehensive monitoring with Datadog, Dynatrace, Splunk, and AWS CloudWatch.

Enhance CI/CD pipelines with automated testing, integrating GitLab and Bitbucket.

Design secure AWS environments with IAM, encryption, and robust disaster recovery plans.

Optimized Kubernetes cluster performance through advanced pod management and automated deployments using Argo CD, enhancing system reliability and deployment efficiency.

Architected multi-region AWS infrastructure with comprehensive disaster recovery strategies, implementing geographic redundancy for enhanced service resilience.

Strengthened system observability by integrating Datadog and Dynatrace monitoring, creating custom service detections for improved application performance tracking.

Enhanced security posture through GraphQL resolver validation, IAM configurations, and network security protocols in AWS environments.

Streamlined CI/CD workflows by implementing automated Cypress testing and integrating version control systems, reducing deployment cycles.

Implemented advanced monitoring solutions across AWS infrastructure, integrating multiple observability tools for enhanced system performance tracking.

Tuned Akamai cache configurations to enhance response times for www.hilton.com and decrease origin traffic.

Analyzed cache hit ratio (CHR) metrics and optimized caching rules to minimize latency and backend load.

Developed and optimized Java-based monitoring integrations for Dynatrace, enhancing application observability and reliability.

Wrote Java scripts and API connectors to extract and analyze telemetry data, improving SLI tracking for client-side interactions.

Implemented deep linking strategies using Akamai query parameters to enhance Hilton’s search and shop functionality.

Optimized cache offloading using Akamai parameters to improve user experience on www.hilton.com/en/search.

Utilize scripting languages (Python, Bash) to automate tasks and reduce errors.

Managed code repositories using Git, Bitbucket, and GitLab, ensuring effective version control and collaboration.

Integrated Bamboo into CI/CD pipelines to automate builds, tests, and deployments.

SITE RELIABILITY ENGINEER, Mar 2022 - Apr 2023

Prometheus, Raleigh, NC

Led high-availability infrastructure design across AWS, Azure, GCP

Spearheaded CI/CD pipeline development for rapid feature delivery

Pioneered custom Dynatrace dashboards for real-time performance monitoring

Conducted systematic performance analysis for cost-effective resource allocation

Orchestrated seamless major infrastructure upgrades and migrations

Led disaster recovery initiatives and implemented automated backup strategies across cloud platforms, strengthening system resilience and data protection

Optimized cloud resource utilization through strategic capacity planning and performance analysis, driving cost efficiency while maintaining service quality

Orchestrated cross-functional infrastructure upgrades, ensuring seamless transitions between cloud environments while maintaining operational continuity

Developed comprehensive monitoring dashboards using Dynatrace and Pingdom, enabling real-time system performance tracking and proactive issue resolution

Engineered multi-cloud infrastructure solutions with automated failover mechanisms, enhancing system resilience and minimizing downtime across distributed environments

NETWORK/LINUX ENGINEER, Aug 2019 - Jan 2022

Verizon Business, Cary, NC

Resolved server outages and restored service quickly using the iLO console for HP servers and iDRAC for Dell servers.

Used Nagios to monitor IT infrastructure, including host resources (processor load, disk usage, system logs), applications, services, and network protocols.

Managed users using LDAP Active Directory to maintain user data and security, managing network connections and server-based security using SELinux and iptables.

Monitored TCP/UDP network traffic using tcpdump. Performed vulnerability management of operating systems with scheduled patches and security hardening through Ansible playbooks.

Configured and maintained GitLab/GitHub repository servers for code releases and application configurations.

Created AWS EC2 Instances, set up VPC, created load balancers (ELB), and used Route53 with failover and latency options for high availability and fault tolerance.

Worked closely with application and product teams in understanding their specific automation requirements and implementing them in an optimal CI/CD pipeline (Jenkins).

LINUX SYSTEM ADMINISTRATOR, Oct 2017 - Aug 2019

Dupont, Delaware, DE

Hands-on experience building, patching, and maintaining Linux systems in mission-critical bare-metal and virtualized (VMware) environments using Red Hat Satellite.

Installed and configured operating systems (RHEL 6 & 7, CentOS 6 & 7, VMware Workstation); supported virtualization on VMware ESX and ESXi hypervisors, vSphere, vMotion, and vCenter servers.

Configured and managed volume groups and logical volumes using LVM for disk management, and troubleshot LVM failures.

Troubleshot system sluggishness using tools such as top, vmstat, and sar.

Configured NIC bonding in active-backup and load-balancing modes, based on traffic speed, to prevent network latency issues.

Worked with the Splunk Universal Forwarder to collect data reliably and securely from various sources and deliver it to Splunk Enterprise or Splunk Cloud for indexing and log analysis.

LINUX SUPPORT OPERATOR, Sep 2016 - Sep 2017

Dataprise, Jersey City, NJ

Led strategic planning for future hardware needs

Delivered tier 2 and 3 technical support remotely

Ensured optimal hardware health and managed repair needs

Administered ACL and OpenLDAP for file system management

Streamlined complex tasks with bash and shell scripting

EDUCATION

BACHELOR OF ARTS, University of the Punjab, 2015

COURSES

CERTIFIED KUBERNETES ADMINISTRATOR (CKA), 2023

Kubernetes

CCIE (R&S) WRITTEN, CSCO13514713, 2020

Cisco

PROJECT MANAGEMENT PROFESSIONAL (PMP)

PMI

SKILLS

Incident Management, System Scalability, Performance Tuning, Risk Mitigation, Strategic Planning, Problem Solving, Kubernetes, Terraform, Automation, Linux, Team Collaboration.
