Site Reliability Engineer

Location:

Telford, PA

Posted:

March 03, 2025

Resume:

AKHIL G

Email: ***************@*****.*** Mobile: +1-484-***-****

7+ years of diversified experience in monitoring tools (Prometheus, Grafana, ELK, Splunk, Dynatrace), site reliability engineering, DevOps, Kubernetes engineering, build and release, deployment, and tool engineering in agile environments.

•Worked with monitoring tools such as Prometheus, Grafana, Splunk, Kibana, Dynatrace, Elasticsearch (ELK), and Datadog, alongside Kubernetes.

•Designed, deployed, and maintained highly available Kubernetes clusters across cloud and on-prem environments.

•Optimized cluster resource allocation, cost efficiency, and performance tuning to enhance scalability and reliability.

•Automated CI/CD pipelines by integrating Kubernetes with tools like ArgoCD, Helm, and GitOps workflows.

•Developed and maintained Infrastructure-as-Code (IaC) using Terraform, Ansible, and Helm for consistent and repeatable deployments.

•Troubleshot and resolved Kubernetes cluster, networking, and application deployment issues to ensure minimal downtime. Experience working with Docker containers.

•Collaborated with development and DevOps teams to optimize containerized application performance and scalability.

•Knowledge of cloud providers such as AWS, Azure, and GCP.

•Deep knowledge of Docker and Kubernetes; initiated the deployment of a few applications.

•Initiated the setup of the CI/CD pipeline using Docker and Kubernetes.

•Initiated the dockerizing of the application and deployed the containers to Kubernetes clusters.

•Excellent knowledge of monitoring tools such as Prometheus, Grafana, Splunk, and ELK.

•Good knowledge of Thanos as a data store for metrics visualized in Grafana.

•Efficient in managing end-to-end operations and working with automation and orchestration tools such as Ansible, Puppet, and Jenkins.

•Proficient with container systems such as Docker and container orchestration platforms such as EKS, ECS (EC2 Container Service), Kubernetes, and Docker Swarm.

•Worked on architecting, designing, and building infrastructure on public, private, and hybrid clouds.

•Implemented GitHub Actions for automated testing, security scans, and performance monitoring in Kubernetes environments.

•Created and managed GitHub Actions for CI/CD automation, including workflows for building, testing, and deploying Kubernetes applications.

•Wrote YAML workflows in GitHub Actions to automate Kubernetes deployments and monitoring tasks.

•Built and maintained Docker container clusters managed by Kubernetes on GCP (Google Cloud Platform), using Linux, Bash, Git, and Docker. Utilized Kubernetes and Docker as the runtime environment of the CI/CD system to build, test, and deploy.

•Designed and developed a distributed private cloud solution using Kubernetes and Docker.

•Proficient with onboarding and integrating tools such as Terraform, Ansible, and Prometheus for better logging, monitoring, tracing, and alerting.

•Created monitors using Datadog for applications that ingest data into the platforms and applications that serve customers.

•Experience in configuration management tools like Ansible, Terraform, Puppet, and IBM.

•Sound experience in branching, merging, and conflict resolution using source control tools such as Git and GitHub.

•Extensive knowledge in designing and developing applications using Java and web technologies such as JavaScript, HTML5, and CSS3.

•Sound experience in AWS services such as EC2, VPC, S3, Glacier, ELB, EBS, and RDS, deployment services (Elastic Beanstalk, OpsWorks, and CloudFormation), and security practices (IAM, CloudWatch, and CloudTrail).

•Deployed and managed scalable AWS infrastructure using EC2, S3, and RDS for high-availability applications.

•Monitored application performance using AWS CloudWatch and set up automated alerts for system health.

•Efficient experience in using Nexus and Artifactory repository managers for package management and as Docker registries.

•Conversant with Chef cookbook components and cookbook authoring workflows, including attributes, definitions, files, libraries, recipes, resources, and templates.

•Used Helm Charts for the deployment.

•Created Jenkins pipelines with several downstream and upstream job configurations, based on dependencies on other applications and on release methodologies.

•Developed and managed robust CI/CD pipelines using Jenkins, focusing on automation, zero-downtime deployments, and reliability. Implemented comprehensive monitoring and alerting systems to enhance application reliability and resiliency.

Technologies

Monitoring Tools: Splunk, Datadog, Dynatrace, ELK, Prometheus, Grafana.

Databases: Oracle, MS SQL Server, MS Access, Apache Cassandra, RDS.

Configuration Management: Ansible, Puppet, Chef.

Scripting Languages: Bash, Shell, PowerShell, Python.

Source Version Control: Bitbucket, Git, GitHub.

Cloud Technology: Google Cloud (GCP), Amazon Web Services (AWS).

CI/CD: Jenkins, GitHub Actions.

Operating Systems: Linux, Windows.

Programming Languages: Java, C, C++.

Build Tools: Maven, Gradle, Ant.

Quality/Testing Tools: HP Quality Center (HPQC), Selenium.

Education

•Bachelor’s in Computer Science and Engineering, Yogi Vemana University, India.

Certifications

•ITIL V4 Foundation

•Completed Kubernetes, Prometheus, Docker, AWS, and Microsoft Azure courses on Udemy.

Professional Experience

State Farm, Bloomington, IL

Project: NORAM Tools & Autonomics

Site Reliability Engineer April 2022 - February 2025

Responsibilities:

•Responsible for building, releasing, deploying and supporting the monitoring software stack on private cloud environments.

•Designed, implemented, and maintained monitoring and observability solutions using Prometheus for real-time metric collection and alerting.

•Developed custom PromQL queries to extract meaningful insights and optimize system performance.

•Integrated Prometheus exporters for monitoring databases, containers, Kubernetes, and cloud environments.

•Implemented alerting rules in Prometheus and Grafana for proactive incident response and root cause analysis.

•Managed and orchestrated containerized applications using Kubernetes, ensuring high availability, scalability and efficient resource utilization.

•Implemented monitoring and logging solutions (Prometheus, Grafana, ELK Stack) for Kubernetes clusters to ensure performance and availability, using Prometheus for monitoring and alerting and Grafana to visualize cluster health.

•Diagnosed and resolved issues in Kubernetes clusters, including pod failures, resource contention, and network problems.

•Installed Datadog agents for NGINX, PostgreSQL, and other services.

•Created Datadog dashboards.

•Installed Dynatrace OneAgent.

•Monitored applications with Dynatrace RUM.

•Resolved critical incidents using Kubernetes (K8s) commands.

•Wrote PromQL queries to find and filter metrics for system observability.

•Handled P1, P2, P3, and P4 incidents.

•Created Grafana dashboards to display pod and server status.

•Monitored SLAs in ServiceNow.

•Responsible for CI/CD pipelines, including build and deployment pipelines on GitLab, for all components in the software stack.

•Involved in setting SLO targets for uptime metrics on Linux and Windows servers.

•Managed source code on GitLab, including merging, branching, tagging, and maintaining versions across environments.

•Utilized Kubernetes as the platform for the monitoring stack in all environments.

•Deployed Prometheus, Alertmanager, Thanos, Splunk, and Grafana with on-premises S3 storage to monitor critical processes on bare-metal servers, virtual machines, network devices, and applications.

•Identified and planned automation opportunities within the project.

•Ticket Management (execution and monitoring)

Environment: Prometheus, Grafana, Kubernetes, Thanos, PromQL, Alertmanager, Datadog, Dynatrace, GitHub, Git Bash, Puppet, Docker, ServiceNow, Helm, Splunk, Python.

State Street Corporation, Boston, MA September 2019 – January 2022

DevOps Engineer

Responsibilities:

•Deployed Prometheus, Alertmanager, Thanos, and Grafana with on-premises S3 storage to monitor critical processes on bare-metal servers, virtual machines, network devices, and applications.

•Used Kubernetes to manage containerized applications using its nodes, ConfigMaps, and services, and deployed application containers as pods.

•Onboarded new tools and technologies such as Consul, ECS, Terraform, Ansible, Kong, Jaeger, Prometheus, and Fluent for better logging, monitoring, tracing, alerting, orchestration, management, and configuration of the environments.

•Proficient in Git branching strategies (feature, release, hotfix branches).

•Experience with git branch, git checkout, git merge, and git rebase.

•Developed GitHub Actions workflows for automated testing, builds, and deployments.

•Configured YAML-based workflows for event-driven automation (push, pull request).

•Created and managed Git tags (git tag -a v1.0.0 -m "Release v1.0") for versioning.

•Automated versioning and deployment using GitHub Actions.

•Created and managed Infrastructure as Code (IaC).

•Integrated client-provided SSL/TLS certificates into Prometheus, automating the deployment process using Puppet to ensure secure and encrypted connections.

•Managed source code on GitLab, including merging, branching, tagging, and maintaining versions across environments.

•Developed and deployed Puppet manifests to automate the installation and configuration of Prometheus, reducing manual intervention and minimizing the risk of configuration errors.

•Identified and planned automation opportunities within the project.

•Successfully configured Prometheus to use HTTPS for secure communication, ensuring data integrity and confidentiality between the monitoring server and clients.

Environment: Git, GitHub, Git Bash, Prometheus, Grafana, Alertmanager, Kubernetes, Ansible, ServiceNow.

Deutsche Banking, Chennai, India October 2017 - August 2019

Software Engineer

Responsibilities:

•Created monitors using Datadog and its API for applications that ingest data into the Equifax platform and applications that serve customers.

•Set up alerting and monitoring using Datadog.

•Automated data ingestion from client files received over SCP/SFTP, using process locks to avoid data duplication.

•Automated patching to mitigate security vulnerabilities and the use of end-of-life packages.

•Designed Terraform configurations and deployed them with Cloud Deployment Manager to spin up resources such as virtual networks and Compute Engine instances in public and private subnets, along with autoscalers, in Google Cloud Platform.

•Deployed and configured Chef Server, including bootstrapping Chef Client nodes for provisioning. Created roles, recipes, cookbooks, and data bags for server configuration.

•Created a plan and roadmap to migrate to GCP seamlessly without impacting any customers.

•Worked with Bitbucket/GitHub, Jenkins, Chef, and Nexus repository to automate build and release pipelines.

•Used Dynatrace and Sumo Logic for log aggregation and metrics.

•Built the architectural roadmap and plan for migrating from CentOS 6 to CentOS 7 as CentOS 6 approached end of life.

•Set up alerting and monitoring using Datadog in GCP.

•Designed and developed a distributed private cloud solution using Kubernetes and Docker on CoreOS. Used Rundeck to run scheduled automated jobs.

•Worked on automating the monthly billing of Equifax to customers on the data platform side.

•Worked with individuals across different teams to support changes and resolve incidents.

Environment: Git, GitHub, Git Bash, Prometheus, Grafana, Alertmanager, Kubernetes, ServiceNow.