
Site Reliability Engineer

Location:
Innsbrook, VA, 23060
Posted:
February 24, 2025


Resume:

VaishnavThej Avvaru

Lead DevOps / Lead Site Reliability Engineer

Contact: +1-804-***-****; Email: ************.***@*******.***

https://www.linkedin.com/in/vaishnavthej/

SUMMARY:

●9 Years of IT Experience in SRE, Cloud, DevOps, Automation, CI/CD (Jenkins), Configuration Management (Ansible), & Continuous Monitoring. Experience in Linux and Windows Systems, Application Support, Containers and Orchestration (Docker & Kubernetes).

●Excellent working knowledge of application processes, support, and operations.

●Extensive hands-on experience with source code version control tools such as Git, GitHub, GitLab, and Bitbucket.

●Familiar with AWS services such as S3, Secrets Manager, EC2, Auto Scaling, AMI, VPC, Load Balancer, Lambda, API Gateway, etc.

●Experience with containerization and orchestration tools in hybrid environments, especially for application modernization, reducing application downtime, and implementing high availability.

●Leveraging modern event-driven architecture to deliver high-quality, automated solutions.

●Implemented a Continuous Delivery pipeline with Docker, Harness, and GitHub Actions.

●Expertise in Application Deployments & Environment configuration using Ansible.

●Experience in writing Ansible playbooks and Python scripts to automate tasks and deployments.

●Implement end-to-end monitoring solutions using Datadog for cloud platforms, on-premise infrastructure, and hybrid environments.

●Deploy Datadog agents across servers, containers, and services to collect telemetry data (metrics, logs, traces) for APM.

●Define and track key performance indicators (KPIs) and service level objectives (SLOs) for critical applications using Dynatrace.

●Monitored system performance and health, and scheduled alerts, using Datadog and Dynatrace.

●Create custom dashboards for engineering, operations, and leadership teams to visualize real-time metrics in DataDog and Dynatrace.

●Configure and manage alerts for metrics like CPU, memory, network latency, and error rates using anomaly detection and threshold-based methods [Splunk / Datadog / Dynatrace]; a minimal monitor-definition sketch follows this summary.

●Parse, process, and enrich logs using Datadog's Log Pipelines and Processors for actionable insights in Datadog APM.

●Designed and implemented end-to-end APM solutions using tools like Datadog, Prometheus, and Grafana to monitor microservices and cloud-based infrastructure.

●Optimized APM telemetry collection, reducing observability costs by 30% while improving metric retention and alerting capabilities.

●Integrated APM monitoring with CI/CD pipelines, enabling proactive performance analysis and reducing production incidents.

●Deliver periodic health checks and performance reviews to stakeholders and leadership.

●Formed and led various initiatives for 24/7 production support, ServiceNow integration with PagerDuty, and multiple monitoring and alert systems.

●Familiar with various ticketing tools in the market like JIRA, SNOW, CA Service Desk, etc., and adhere to documentation standards in SharePoint and Confluence.
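
For illustration, a minimal sketch of the threshold-based alerting described above, written against the official "datadog" Python client (pip install datadog). The API/app keys, metric query, service tag, thresholds, and notification handles are placeholder assumptions, not values from any client environment.

from datadog import initialize, api

# Placeholder credentials -- never hard-code real keys.
initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

# Threshold-based monitor on average CPU for a hypothetical service tag.
monitor = api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):avg:system.cpu.user{service:payments} > 90",
    name="High CPU on payments service",
    message="CPU above 90% for 5 minutes. @pagerduty @slack-ops-alerts",
    tags=["team:sre", "env:prod"],
    options={
        "thresholds": {"critical": 90, "warning": 80},
        "notify_no_data": True,
        "no_data_timeframe": 10,
    },
)
print(monitor["id"])

The same pattern extends to anomaly-detection monitors by changing the query; keeping monitor definitions in version-controlled scripts like this is what makes the alerting reproducible across environments.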

EDUCATION:

●Post Graduate Diploma in International Business Operations from IGNOU, New Delhi, India - 2021

●Bachelor of Technology in Computer Science & Engineering from Jawaharlal Nehru Technological University (JNTU), Hyderabad, India - 2015

TECHNICAL SKILLS:

Cloud: AWS [IaaS]: VPC, EC2, S3, CloudWatch, CloudTrail, Lambda, ECS, API Gateway, SQS, IAM, Security Groups, SNS, EFS, EBS.

DevOps [IaC]: Git, GitHub, Bitbucket, Ansible, Jenkins, Docker, Kubernetes, Terraform.

Programming & Scripting Languages: Python, Shell Scripting.

Operating Systems: Linux [Redhat, Ubuntu], Windows.

CI/CD Tools: Jenkins, Harness, Git/BitBucket, GitHub Actions

Web/Application Servers: Tomcat, Nginx

Artifact Repository: JFrog Artifactory.

Monitoring / Observability Tools: Splunk, Datadog, Dynatrace, EDGE Monitoring, CA Nimsoft, CloudWatch, Grafana, IBM SMARTS.

Ticketing, and Tracking Tools: ServiceNow, Jira, CA ServiceDesk.

Chat Bots: Webex chat bots, Microsoft Teams bots.

Front End Technologies: HTML, CSS, JavaScript.

Others: Microsoft Adaptive Cards.

CERTIFICATIONS:

·LinkedIn Certified - Jenkins Developer

·LinkedIn Certified - Site Reliability Engineer

·Udemy Certified - Python Developer

·Udemy Certified - AWS Associate Developer.

·Udemy Certified - AWS Solutions Architect.

·Udemy Certified - Ansible Expert.

PROFESSIONAL EXPERIENCE:

Infosys Limited June 2024 – Present

Technical Lead - Site Reliability Engineer / DevOps

Client: US Bank - Charlotte, NC, United States

Project Description: US Bank is a multinational bank and financial services company. Build, automate, and manage the cloud infrastructure and processes for clients' in-house applications serving global enterprise-scale banking and financial solutions and services.

Responsibilities:

●Responsible for automating existing manual processes into reliable, fully automated processes using the latest DevOps tools and methodologies.

●Build CI/CD pipelines for deployment into production and non-production environments.

●Build SRE Metrics Dashboards.

●Automated end-to-end alert mechanism using Harness, Python, and chatbots.

●Automated routine tasks and system optimizations through scripting and orchestration, improving operational efficiency by 25%.

●IaC provisioning using Terraform on AWS Cloud: EC2, VPC, S3, Load Balancer, etc.

●Work on Change Readiness and make the microservice ready for production deployment.

●Work on microservice non-production deployments and prepare the code package for deployment.

●Manage a team of 7 people: plan sprints, assign Jira tasks, help the team coordinate, identify roadblocks, and resolve issues.

●Worked on incidents, change records, and root cause analysis.

●Create custom dashboards for engineering, operations, and leadership teams to visualize real-time metrics in Datadog and Dynatrace.

●Configure and manage alerts for metrics like CPU, memory, network latency, and error rates using anomaly detection and threshold-based methods [Datadog / Dynatrace].

●Parse, process, and enrich logs using Datadog's Log Pipelines and Processors for actionable insights in Datadog.

●Designed and implemented end-to-end APM solutions using tools like Datadog, Prometheus, and Grafana to monitor microservices and cloud-based infrastructure.

●Optimized APM telemetry collection, reducing observability costs by 30% while improving metric retention and alerting capabilities.

●Integrated APM monitoring with CI/CD pipelines, enabling proactive performance analysis and reducing production incidents.

●Set up notification channels for alerts (e.g., Slack, PagerDuty, email) to ensure timely incident response; a minimal notification sketch follows this list.
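
As an illustration of the notification wiring above, a minimal sketch that fans an alert out to Slack (incoming webhook) and PagerDuty (Events API v2). The webhook URL, routing key, and alert text are placeholder assumptions.

import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder
PD_ROUTING_KEY = "<EVENTS_V2_ROUTING_KEY>"                         # placeholder

def notify_slack(text: str) -> None:
    # Slack incoming webhooks accept a simple {"text": ...} JSON body.
    requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=10).raise_for_status()

def trigger_pagerduty(summary: str, severity: str = "critical") -> None:
    # PagerDuty Events API v2: a "trigger" event opens an incident.
    payload = {
        "routing_key": PD_ROUTING_KEY,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": "datadog-monitor", "severity": severity},
    }
    requests.post("https://events.pagerduty.com/v2/enqueue", json=payload, timeout=10).raise_for_status()

if __name__ == "__main__":
    notify_slack(":rotating_light: High CPU on payments service")
    trigger_pagerduty("High CPU on payments service")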

Environment: Python, Jenkins, Docker, Ansible, Ansible Tower, MS Teams Chat Bots, Confluence, SharePoint, ServiceNow, Harness, Splunk, Datadog, BigPanda, SRE Principles, Jira, AWS Cloud, Java Microservices, YAML, JSON, REST API, Visual Studio Code, Notepad++, Git, GitHub, CSS, HTML.

Tata Consultancy Services [TCS], Hyderabad, India Dec 2020 – May 2024

DevOps / Senior Site Reliability Engineer

Client: HSBC & LSEG

Project Description: HSBC is a multinational bank and financial services company, and LSEG is a stock exchange company. Streamline the existing DevOps process, define and rebuild the DevOps metrics and KPIs, implement SRE principles and observability metrics, and improve the infrastructure.

Responsibilities:

●Responsible for automating existing manual processes into reliable, fully automated processes using the latest DevOps tools and methodologies.

●Automated end-to-end alert mechanism using Jenkins, Python, and chatbots.

●Built automated processes/tasks/schedulers to identify aging tickets in the queue, prepare charts, and send messages/notifications to the respective user groups and teams.

●Automated identity and authorization workflows for custom-developed in-house applications using Python.

●IaC provisioning using Terraform on AWS Cloud: EC2, Load Balancer, ECS, etc.

●Implemented alert policies on the GCP environment and deployed the policies across project environments such as dev, test, preproduction, and production.

●Automated large-scale data processing activities on GCP services using Python.

●Automated the integration of data transfers between different GCP services, within and across GCP projects.

●Converted manual tasks into IaC [Python / YAML / JSON], with GitHub as the code repository and version control system, enabling one-click triggering of deployments and flows.

●Monitored applications and servers using Datadog, and used Splunk for application log monitoring.

●Enable log collection for applications and infrastructure, ensuring all logs are tagged and searchable in Datadog.

●Created log pipelines to parse, filter, and process application logs for actionable insights in Datadog.

●Define log retention policies and ensure cost optimization for logging in Datadog.

●Deploy and configure Splunk Enterprise, Splunk Cloud, or Splunk Universal Forwarders in on-premise and cloud environments.

●Set up Splunk indexers, search heads, forwarders, and deployment servers for optimal performance and scalability.

●Worked on Splunk’s Add-ons and APIs to collect data from specialized systems or custom applications.

●Use regex, transforms, and lookups to normalize and enrich log data in Splunk; a plain-Python analogue is sketched after this list.

●Supported the infra environment on issues, fixes, and bugs in code, following ITIL best practices for incident and change management activities.
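
A plain-Python analogue of the log normalization and enrichment described above (Splunk regex/transforms/lookups, Datadog log pipelines). The log format and the service-to-team lookup table are illustrative assumptions.

import re

LOG_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<service>[\w-]+)\s+(?P<message>.*)"
)

# Enrichment lookup, like a Splunk lookup table or a Datadog
# enrichment processor (values are hypothetical).
SERVICE_OWNERS = {"payments": "team-sre", "auth": "team-identity"}

def parse_and_enrich(line):
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # unparsed lines would be routed to a catch-all index
    event = match.groupdict()
    event["level"] = event["level"].lower()  # normalize field values
    event["owner"] = SERVICE_OWNERS.get(event["service"], "unassigned")
    return event

print(parse_and_enrich("2024-05-01T12:00:00 ERROR payments timeout calling ledger"))
# -> {'ts': '2024-05-01T12:00:00', 'level': 'error', 'service': 'payments',
#     'message': 'timeout calling ledger', 'owner': 'team-sre'}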

Environment: Python, Jenkins, Docker, Ansible, Ansible Tower, MS Team Chat Bots, Confluence, Sharepoint, ServiceNow, Jira, AWS Cloud, YAML, JSON, REST API, Visual Studio Code, Notepad++, Git, Github, CSS, HTML, DataDog, Splunk, Dynatrace, BigPanda.

Automatic Data Processing [ADP], Hyderabad, India Apr 2019 – Nov 2020

DevOps / Site Reliability Engineer

Client: Automatic Data Processing [ADP]

Project Description: ADP is an American provider of human resources management software and services. Build, automate, and manage the cloud infrastructure and processes for clients' in-house applications serving global enterprise-scale HR and payroll solutions and services.

Responsibilities:

●Responsible for converting existing manual processes into reliable, fully automated processes.

●Developed Ansible playbooks to automate processes and integrated them with Jenkins to form pipelines.

●Developed Ansible templates in Ansible Tower and integrated them with existing Ansible jobs.

●Developed Jenkins pipelines using Groovy scripting.

●Responsible for identifying and fixing the issues that support teams face, and including those issues in the next sprint for automation.

●Built command-line projects in Python to script and automate tasks; a minimal CLI sketch follows this list.

●Supported the product environments and troubleshot infra- and application-related issues across them.

●Deployed applications to multiple prod and test EKS environments.

●Implemented change orders, working closely with development teams on deployment automation of applications using chatbots, Jenkins, Ansible, and EKS.

●Supported the infra environment on issues, fixes, and bugs in code, following ITIL best practices for incident and change management activities.
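
A minimal sketch of such a command-line deployment helper: argparse for the CLI, subprocess to invoke ansible-playbook. The playbook name and inventory layout are hypothetical.

import argparse
import subprocess
import sys

def main():
    parser = argparse.ArgumentParser(description="Run a deployment playbook")
    parser.add_argument("env", choices=["dev", "test", "prod"], help="target environment")
    parser.add_argument("--version", required=True, help="application version to deploy")
    args = parser.parse_args()

    # Equivalent shell command:
    #   ansible-playbook -i inventories/<env>/hosts deploy.yml -e app_version=<version>
    cmd = [
        "ansible-playbook",
        "-i", f"inventories/{args.env}/hosts",  # hypothetical inventory layout
        "deploy.yml",                           # hypothetical playbook
        "-e", f"app_version={args.version}",
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    sys.exit(main())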

Environment: Python, Jenkins, Docker, Ansible, Ansible Tower, MS Team Chat Bots, ServiceNow, Jira, WebEx Bots, Microsoft Adaptive Cards, Confluence, Sharepoint, AWS Cloud, EKS, EDGE Monitoring, Splunk, YAML, JSON, REST API, Visual Studio Code, Notepad++, Git, Github, CSS, HTML.

DXC Technology Jan 2016 – Apr 2019

Cloud / DevOps Engineer / Monitoring & Observability Engineer

Client: DXC Technology

Project Description: DXC Technology is an American multinational information technology (IT) services and consulting company. Built a solution for deploying the IBM BigFix application in a fully automated way, using DevOps tools and pipelines to deliver services to multiple DXC customers.

Responsibilities:

●Experience developing scripts in YAML (Ansible playbooks) and shell scripting.

●Designed an Ansible playbook to automate patching on the servers.

●The playbook connects to vCenter and manages (creates/deletes/reconfigures) virtual machines.

●Worked with various teams within the program to integrate code, fix defects, and deliver code and performance improvements.

●Involved in review meetings with the Project Manager and participated in technical discussions regarding the issues in the project.

●Developed Docker images and managed Docker containers; a minimal Docker SDK sketch follows this list.

●Developed scripts to deploy the application.
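
A minimal sketch of the Docker image/container handling above, using the Docker SDK for Python (pip install docker). The image tag, Dockerfile location, and environment variable are illustrative assumptions.

import docker

client = docker.from_env()  # connects to the local Docker daemon

# Build an image from a Dockerfile in the current directory (hypothetical tag).
image, build_logs = client.images.build(path=".", tag="app-deployer:latest")

# Run the container detached, then inspect its output.
container = client.containers.run(image.id, detach=True, environment={"ENV": "dev"})
print(container.logs().decode())

container.stop()
container.remove()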

Environment: Python, Jenkins, Docker, Ansible, Ansible Tower, Confluence, Sharepoint, ServiceNow, Jira, YAML, JSON, REST API, Visual Studio Code, Notepad++, Git, Github, CSS, HTML.

Client: SAS Airlines

Project Description: SAS is one of the top commercial airlines in Europe. Responsible for building, running, and maintaining client cloud environments.

Responsibilities:

●Designed and implemented solutions on the Amazon Web Services cloud.

●Installing and deploying the applications in the cloud environment.

●Launched and configured Amazon EC2 (AWS) cloud servers using AMIs (Linux/Ubuntu) and configured the servers for specified applications using Jenkins; a minimal boto3 sketch follows this list.

●Built application and database servers using AWS EC2, created AMIs, and used RDS.

●Created and maintained streams for different code repositories and promotions.

●Provided technical support to production and development environments in AWS; worked with Linux shell scripting and the AWS CLI to maintain the AWS cloud infrastructure.

●Managed user and group administration (Linux user management): user account creation and deletion, assigning special permissions, and granting required permissions to users.

●Monitored and troubleshot server performance using performance monitoring utilities like Dynatrace and CloudWatch.

●Synthetic Monitoring and Real User Monitoring (RUM) using Dynatrace.

●Deploy agents on on-prem and cloud servers and configure threshold and alert mechanism using Dynatrace.

●Enable Log Monitoring to ingest logs from various sources and correlate them with application and infrastructure metrics using Dynatrace.

●Parse and process logs for specific patterns or anomalies using Dynatrace.

●Used Dynatrace’s log analysis to troubleshoot application errors, crashes, and infrastructure failures.

●Create custom dashboards to visualize key performance indicators (KPIs) for stakeholders and engineering teams using Dynatrace.

●Design dashboards for different use cases, including application health, infrastructure performance, and end-user experience in Dynatrace.

●Schedule and automate performance reports for system uptime, SLA compliance, and service health in Dynatrace.

●Use Dynatrace’s Data Explorer to build custom queries and visualizations.

●Used Jenkins for continuous integration and continuous deployment: pipeline development and deployment of code to production.

●Worked with development teams across the globe to build the components of a software product, integrate the required modules, and back out changes/features with informed decisions by looping in the dev managers to keep up with changes and quality across product versions.
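
A minimal boto3 sketch of the AMI-based server builds above. The AMI ID, instance type, key pair, region, and tags are placeholder assumptions.

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder Ubuntu AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="ops-keypair",            # hypothetical key pair
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "app-server-01"}],
    }],
)
instance_id = response["Instances"][0]["InstanceId"]

# Wait until the instance is running before handing it to Jenkins for configuration.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
print(instance_id, "is running")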

Environment: Python, Jenkins, Docker, Ansible, Ansible Tower, AWS S3, AWS RDS, CloudWatch, Confluence, SharePoint, Dynatrace, ServiceNow, CA Service Desk, Jira, YAML, JSON, REST API, Visual Studio Code, Notepad++, Git, GitHub, CSS, HTML.

Client: DXC Technology

Project Description: DXC Technology is an American multinational information technology (IT) services and consulting company. Built a solution for deploying the IBM BigFix application in a fully automated way, using DevOps tools and pipelines to deliver services to multiple DXC customers.

Responsibilities:

●Building CAM instances (Central Administrative Machine, a DXC-developed automation engine).

●Configuring the CAM and other instances in the customer environments.

●Configuring the docker containers.

●Configuring the Ansible Playbooks.

●Deploying the BigFix Application using Ansible Playbooks.

●Attended daily review calls and tracked deadlines.

●Resolved several issues while deploying the BigFix application.

●Handed over the project to the support teams.

●Deployed over 15 BigFix application instances and helped the support teams run them.

●Handled L1 and L2 issues in day-to-day business operations.

●Creating users and providing them access based on RBAC Policies.

●Performed health checks on the application; a minimal health-check sketch follows this list.

●Fixed issues related to application patching and web reports.

●Agent installation on the client instances on different OS Environments.

●Created and generated patch compliance reports.

●Supported around 15 BigFix accounts.

●Received multiple customer appreciations.

●Upgraded application instances to the latest version using Ansible and Docker.
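
A minimal sketch of such an application health check in Python. The endpoint URLs and the 200-OK expectation are assumptions, not actual BigFix instance details.

import requests

ENDPOINTS = {
    "web-reports": "https://bigfix.example.com/webreports",  # hypothetical URL
    "rest-api": "https://bigfix.example.com/api/help",       # hypothetical URL
}

def check(name, url):
    try:
        healthy = requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        healthy = False
    print(f"{name}: {'OK' if healthy else 'FAILED'}")
    return healthy

if __name__ == "__main__":
    results = [check(name, url) for name, url in ENDPOINTS.items()]
    raise SystemExit(0 if all(results) else 1)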

Environment: Python, Jenkins, Docker, Ansible, Ansible Tower, Confluence, SharePoint, ServiceNow, IBM Smarts, CA Nimsoft, CA Service Desk, Jira, YAML, JSON, REST API, Visual Studio Code, Notepad++, Git, GitHub, CSS, HTML.


