Utkarsh Gandhi
www.linkedin.com/in/utkarshpgandhi
****************@*****.*** 910-***-****
Summary:
11 years of experience in Site Reliability Engineering (SRE), DevOps, AWS, and Build/Release Engineering, with expertise in Software Configuration Management (SCM), Build/Release Management, Continuous Integration, and Continuous Delivery using a wide range of tools.
Skilled in configuring and deploying infrastructure and applications in the cloud using AWS services such as EC2, S3, RDS, EBS, VPC, SNS, IAM, Route 53, Auto Scaling, CloudFront, CloudWatch, CloudTrail, CloudFormation, OpsWorks, and Security Groups, with a focus on fault tolerance and high availability.
Strong understanding of SCM processes, including compiling, packaging, and deploying applications.
Proficient in Continuous Integration and Deployment methodologies using Jenkins, SonarQube, and GitLab.
Used Infrastructure as Code (IaC) and CI/CD pipelines to automate deployment processes on Google Cloud Platform.
Skilled in troubleshooting production issues related to CPU resource utilization, application performance, and code logic.
Solid knowledge of Object-Oriented Design and Programming concepts in Java.
Experienced in scripting with Shell (including Bourne shell), Python, C, and Perl for developing and maintaining scripts, as well as troubleshooting.
Proficient in using build automation tools like Jenkins and Maven to implement end-to-end automation, with working experience in Dynatrace.
Hands-on experience testing web services with Postman and SOAP.
Utilized AWS CloudWatch to monitor environments for operational and performance metrics during load testing.
Extensively worked with Docker for virtualization, deploying and securing applications for streamlined Build/Release Engineering processes.
Skilled in creating Docker containers from scratch and leveraging Linux Containers and AMIs, along with Dockerfiles.
Managed Docker containers with Kubernetes, automating container maintenance and working with REST APIs.
Utilized Terraform for managing AWS Infrastructure as Code (IaC).
Designed scalable, reliable, and efficient systems on Google Cloud Platform using services such as Compute Engine, App Engine, and Kubernetes.
Integrated machine learning models into production environments using CI/CD pipelines, leveraging AWS services, Kubernetes, and Docker for automated deployment and monitoring.
Collaborated with data scientists and developers to streamline model versioning, testing, deployment, and monitoring, ensuring the smooth transition of models from development to production.
Leveraged AI-driven monitoring tools, such as Splunk and AWS CloudWatch, to automate incident detection, root cause analysis, and performance optimization, enhancing system reliability and operational efficiency.
Actively mentored junior engineers, providing guidance on best practices for DevOps, AWS infrastructure, Build/Release Engineering, and Linux environments.
Conducted training sessions and code reviews, fostering the professional growth of team members and improving the overall effectiveness of the engineering team.
Created dashboards for log analysis and visualization using Prometheus and Grafana.
Leveraged monitoring tools such as CloudWatch and Splunk for log analysis, performance monitoring, and dashboard creation during production.
Provided 24x7 production support, including on-call and weekend shifts.
Experienced in troubleshooting, backup, and recovery processes.
Skills:
Category: Skills/Tools
SRE & Cloud: AWS, Fault tolerance, High availability, Monitoring, Incident response
IaC: Terraform, CloudFormation, Pulumi
Containers & Orchestration: Docker, Kubernetes, OpenShift, Helm
CI/CD & Automation: Jenkins, GitLab CI, CircleCI, Python, Bash, Go
Version Control: Git, SVN
Monitoring & Logging: Prometheus, Dynatrace, Grafana, Datadog, Splunk, CloudWatch
Scripting & Programming: Python, Shell, Java, Go, Perl
Build/Release: Jenkins, Maven, ANT
Security: AWS IAM, KMS, TLS/SSL, Encryption
Performance: Datadog APM, Prometheus, New Relic
AI/ML: XGBoost, Anomaly detection, Forecasting
Database: PostgreSQL, Deadlock prevention
Microservices: Kubernetes, Docker, Helm, Microservices architecture
Agile: Scrum, Kanban
Professional Experience:
VISA, Atlanta, Georgia Aug 2022 - Present
Site Reliability Engineer/DevOps
Responsibilities:
·Participate in and lead the development of requirements, user stories, use cases, and test cases.
·Responsible for using the AWS Console and AWS Command Line Interface to deploy and operate AWS services, specifically VPC, EC2, S3, EBS, IAM, ELB, CloudFormation, ECS, EKS, and CloudWatch.
·Worked on a Java development project automating an internal monitoring dashboard for Visa Payable Automation that shows the status of inbound, outbound, and various other file processing.
·Spearheaded the design of 20 IaaS-based solutions on Google Cloud Platform and the delivery of proofs of concept, saving $30k in costs.
·Teamed with developers to implement an API that enabled internal analytics to increase reporting speed from 15 to 25% in two weeks.
·Design and implement monitoring systems using tools like Prometheus, Grafana, and Nagios to track application performance, system health, and resource usage (CPU, memory, disk I/O).
·Responsible for scheduling automated collection of data and performance metrics from the production environment and leveraging them for application development, such as creating machine learning models.
·After data collection and cleaning, leveraged the data to build an AI/ML pipeline for future forecasting. This pipeline includes early detection of anomalies using an unsupervised algorithm such as k-means clustering and prediction of future values of performance metrics.
·Automate repetitive tasks using scripting languages like Python, Bash, or Go to improve efficiency in deployment, monitoring, and scaling operations.
·Actively involved in patching and certificate renewal to keep systems up and running, and worked with monitoring tools such as Dynatrace.
·Perform root cause analysis (RCA) of incidents, leveraging debugging skills in stack traces, error logs, and systems like Elasticsearch or Splunk to uncover the underlying cause of failures.
·Alongside the forecasting AI/ML tool, leveraged the collected data to develop an AI/ML model using XGBoost to detect long-running queries and maximum DB connections in the PostgreSQL database, helping to prevent future deadlock failures.
·Implement Infrastructure as Code (IaC) with tools such as Terraform, CloudFormation, or Pulumi to provision and manage cloud resources automatically and consistently.
·Design and maintain CI/CD pipelines with Jenkins, GitLab CI, CircleCI, or ArgoCD to ensure continuous integration, automated testing, and seamless deployment of software updates.
·Developed a utility tool that uses AWS SES to notify operations engineers of the overall health of the production system.
·Design and manage load balancing strategies using technologies like NGINX, HAProxy, or AWS Elastic Load Balancing (ELB) to ensure optimal resource distribution and minimize service interruptions during peak traffic.
·Monitor resource utilization and prevent over-provisioning by using tools like CloudWatch (AWS) to track usage and adjust resources dynamically.
·Implement security best practices by enforcing TLS/SSL for data in transit and encryption for data at rest, and use AWS KMS for key management.
·Conduct disaster recovery drills and implement backup strategies with AWS S3 or on-prem replication to ensure the system can be restored in case of major failure.
·Created automation scripts for sanity and regression testing using Robot Framework.
·Used Kubernetes to orchestrate the deployment, scaling and management of Docker Containers.
·Used Kubernetes to manage containerized applications using nodes, config maps, and services, and deployed application containers as Pods.
·Worked with Splunk for log monitoring and dashboard generation during production.
·Orchestrated CI/CD processes by responding to Git triggers, human input, dependency chains, and environment setup.
·Worked with Docker for convenient environment setup for development and testing.
·Experience working on monitoring tools such as Dynatrace.
·Used CloudFormation Templates written in YAML to deploy infrastructure in the AWS cloud.
·Troubleshot and monitored third-party applications using Splunk and CloudWatch in the Amazon Web Services (AWS) environment.
UPS, Atlanta, Georgia Oct 2018 - Aug 2022
Site Reliability Engineer/DevOps
Responsibilities:
·Participate in and lead the development of requirements, user stories, use cases, and test cases.
·Hands-on experience implementing PaaS, IaaS, and SaaS style delivery models inside the enterprise (data center) and in public cloud such as AWS, and on Kubernetes.
·Responsible for using the AWS Console and AWS Command Line Interface to deploy and operate AWS services, specifically VPC, EC2, S3, EBS, IAM, ELB, CloudFormation, ECS, and EKS, in a Linux environment.
·Developed CloudFormation templates to automate AWS services including VPC, bastion hosts, Auto Scaling and load balancing, CloudWatch alarms, ECS clusters, Elastic Beanstalk, and AWS Backup resources.
·Write code to improve system reliability, such as developing self-healing services, automated testing tools, or reliability-focused application components using languages like Go, Python, or Java.
·Define and measure SLOs (Service Level Objectives), SLIs (Service Level Indicators), and SLAs (Service Level Agreements) using frameworks and platforms like Prometheus or Datadog to ensure systems meet availability and performance expectations.
·Identify and resolve performance bottlenecks by using profiling tools like Datadog APM, New Relic, or Prometheus to analyze application and system performance metrics.
·Expertise in using build tools like Maven and ANT to build deployable artifacts.
·Optimize resource usage (CPU, memory, disk) to improve the efficiency of applications and infrastructure, leveraging containerization with Docker or Kubernetes and auto-scaling technologies.
·Hands-on experience setting up Kubernetes (k8s) clusters for running microservices.
·Ensure systems meet compliance standards (e.g., GDPR, SOC2, PCI-DSS) by leveraging automation tools like Chef InSpec and ensuring secure software development practices.
·Took several microservices into production on Kubernetes-backed infrastructure.
·Managed Docker orchestration and Docker containerization using Kubernetes.
·Used Kubernetes to orchestrate the deployment, scaling and management of Docker Containers.
·Used Kubernetes to manage containerized applications using nodes, config maps, and services, and deployed application containers as Pods.
·Created reproducible builds of the Kubernetes applications, managed Kubernetes manifest files and managed releases of Helm packages.
·Used Jenkins pipelines to drive all microservice builds out to the Docker registry and then deploy them to Kubernetes; created and managed Pods using Kubernetes.
·Orchestrated CI/CD processes by responding to Git triggers, human input, dependency chains, and environment setup.
·Worked with Docker for convenient environment setup for development and testing.
·Implemented a build stage to build the microservice and push the Docker container image to the private Docker registry.
·Work closely with development teams to ensure that reliability is baked into the software lifecycle, from design to deployment, using agile frameworks like Scrum and Kanban to manage work.
CNSI, Rockville, Maryland Jan 2018 - Oct 2018
DevOps Engineer/SRE
Responsibilities:
·Migrated legacy applications from on-premises to the AWS cloud environment.
·Leveraged Jenkins for build and deployment automation of Terraform scripts.
·Provisioned EC2 instances using Terraform and Ansible playbooks.
·Used Docker to package applications, designed the entire application development cycle, and used virtualized platforms to deploy multiple containerized apps.
·Worked on integrating Terraform to migrate legacy and monolithic systems to AWS.
·Designed, implemented, and maintained a Splunk log collection solution for performance engineering.
·Installed and managed monitoring tools like Splunk; analyzed and reviewed system performance tuning, network configurations, CPU utilization, memory profiles, disk utilization, network connectivity, and system log files.
·Troubleshot and monitored third-party applications using Splunk and CloudWatch in the Amazon Web Services (AWS) environment.
·Configured and deployed application packages onto Apache Tomcat, WebLogic, and JBoss servers.
·Coordinated with software development teams and QA teams.
·Wrote SQL queries to implement the related changes and debugged build errors using SQL queries to make sure the database was not corrupted.
·Developed custom OpenShift templates to deploy the applications and to create OpenShift objects such as builds, deployment configs, services, routes, and persistent volumes.
·Automated the build and release management process including monitoring changes between releases. Documented the entire build and release process and provided support.
Pepsico - Plano, TX Jan 2015 – Dec 2018
DevOps Engineer/SRE
Responsibilities:
·Involved in DevOps migration/automation processes for build and deploy systems.
·Implemented a CI/CD pipeline in Jenkins using plugins such as Conditional BuildStep, and deployed to Git.
·Configured various jobs in Jenkins for deployment of Python-based applications and for running test suites using Robot Framework.
·Configured Git with Jenkins and scheduled jobs using the Poll SCM option. Responsible for the design and maintenance of the Git repositories and their access control strategies.
·Developed Python and shell scripts to automate the build and release process and to enhance the CI/CD pipeline.
·Managed MySQL Server; performed CRUD operations and wrote stored procedures and triggers to support the project.
·Performed application server builds in the AWS EC2 environment and monitored them using CloudWatch.
·Automated regular AWS tasks such as snapshot creation using Python scripts.
·Built and monitored the project continuously using CI tools like Jenkins.
·Monitored AWS instances regularly using Opsview and New Relic.
·Created post-commit and pre-push hooks using Python in SVN and Git repos. Set up the SVN and Git repos for Jenkins build jobs.
·Created multiple ANT, Maven, and shell scripts for build automation and deployment.
·Used the Jenkins AWS CodeDeploy plugin to deploy to AWS.
[24]7.ai - India Dec 2012 - Dec 2013
System Engineer
Responsibilities:
·Design, develop, and implement scalable, reliable, and secure systems.
·Create architecture blueprints and ensure systems meet operational and performance requirements.
·Collaborate with cross-functional teams to understand project requirements and align them with technical solutions.
·Implement system components based on the design and specifications.
·Install and configure system software, including operating systems, applications, and middleware.
·Ensure the system complies with relevant security policies and regulatory requirements (e.g., GDPR, HIPAA, ISO 27001).
·Implement security controls and measures, such as encryption, firewalls, and authentication protocols.
·Work with vendors and third-party service providers to resolve external system issues.
·Perform root cause analysis (RCA) to identify underlying problems and propose permanent solutions.
·Optimize system resource utilization (CPU, memory, storage) through automation.
·Implement and manage Infrastructure as Code (IaC) for system provisioning.
·Validate system functionality, performance, and scalability before production deployment.
·Provide technical support for end-users, addressing issues related to system functionality.