Site Reliability Cloud Infrastructure

Location:

Iowa City, IA

Posted:

March 02, 2025

Resume:

Mani Sonti

Sr. Cloud Architect & DevOps/SRE Leader

Address Austin, TX

Phone 737-***-****

E-mail **********@*****.***

With 12 years of experience in AWS, GCP, and Azure public cloud, private cloud (OpenShift), DevOps, and Site Reliability Engineering (SRE), I specialize in improving system design, scaling operations, and maintaining high uptime for enterprise-level platforms. I have a proven track record of leading globally distributed teams to deliver secure, cost-effective solutions that enhance platform security and performance. I am highly skilled in automating workflows, reducing operational toil, and driving effective incident management. Passionate about collaborating with cross-functional teams, including Engineering, Sales, and Operations, I deliver innovative technical solutions and operational excellence. I also create guidelines and troubleshooting documentation for non-technical staff and enjoy working closely with customers and teammates to solve complex technical challenges.

Certifications

2023-25

AWS Certified Solutions Architect - Associate

https://www.credly.com/earner/earned/badge/8f69294b-5c64-4c4a-a41c-e5ff1c0c1ff9

2024-26

Certified Kubernetes Administrator (CKA)

https://www.credly.com/badges/f46cab4f-92cc-4701-b329-5bbf068c9a07/public_url

Skills

AWS, GCP, Microsoft Azure, OpenShift, Cloud Foundry, Bare Metal, VMware, Datacenter

Terraform, Ansible, Salt, Puppet, Chef, CloudFormation.

Jenkins, Harness, CircleCI, GitHub Actions, Bamboo, AWS CodePipeline, Azure DevOps, Rundeck, ArgoCD.

Python, Node.js, Java, Bash, Shell, Groovy, YAML, JSON.

MySQL, PostgreSQL, MongoDB, Cassandra, Oracle, DynamoDB

Microsoft Windows, Red Hat, CentOS, Ubuntu.

Planview Tasktop, LaunchDarkly, MABL, AWS SDK, Azure Deployment Manager, ARM.

Git, Jira, Bitbucket, Stash, GitLab.

Apache HTTP, Tomcat, Nginx, Node.js, Gunicorn, Django.

Ant, Maven, Gradle, Flyway, npm, Node, SLS, AWS CLI.

Splunk, New Relic, Dynatrace, Grafana, Prometheus, AppDynamics, ELK, Fluentd, ThousandEyes, DataDog

Docker, Podman, Containerisation, ECS, AKS, EKS, Kubernetes, Rancher

TCP/UDP, DNS, Ping, NFS, LDAP, SSH, BGP, SSL, SFTP, SMTP, Traceroute, Curl, F5 BIG-IP, ELB, ALB, NLB, Akamai, Cloudflare, Istio service mesh

Project leadership, team building, team leadership

Work History

2023-06 - Current

Sr. Cloud Architect & DevOps/SRE Leader

United Airlines

Designed and implemented AWS cloud architecture, leading the migration from on-premises platforms to OpenShift 4.1X, leveraging Kubernetes for container orchestration. Worked extensively with a wide range of AWS services, including VPC, Subnets, NAT, IGW, Route 53 (R53), ALB, Network Peering, MWAA (Managed Workflows for Apache Airflow), Batch Jobs, AWS Telemetry, CodePipeline, EC2, ECS, EKS, Auto Scaling Groups (ASG), SNS, SQS, Lambda, RDS, NoSQL Databases, ACM, Security Groups, AWS Systems Manager (SSM), AWS Landing Zone, Control Tower, KMS, S3, ElastiCache, CloudFront, EMR, Service Catalog, ETL Jobs, Glue, API Gateway, Internet Gateways, NACLs, Transit Gateway, Redshift Cluster, Apache Airflow, FSx, Step Functions, ChangeSets, ExecuteSets, CMKs, DynamoDB, CloudWatch, and CloudTrail, along with Kubernetes and IaC tools like Terraform and CloudFormation (CFN). Also built Airflow DAGs to automate and manage data workflows, ensuring a highly scalable and efficient infrastructure.

Led automation efforts to reduce operational toil by implementing Terraform, Kubernetes, and cloud-native services (AWS, GCP). Managed infrastructure using IaC tools like Terraform and CloudFormation (CFN), ensuring scalability and efficient resource management across cloud platforms.

Migrated legacy CI/CD pipelines from TeamCity to Harness, modernizing the entire pipeline process. Designed and implemented 700+ pipelines, integrating Terraform for project onboarding and managing delegates, templates, and connectors, streamlining the CI/CD workflow. Worked extensively on the Harness platform, alongside other CI/CD tools like Digital.ai, GitHub Actions, Jenkins, and Helm for Kubernetes integration, ensuring seamless orchestration and automation of workloads across various environments.

Specialized in Kubernetes orchestration and troubleshooting, with expertise in Gloo, Istio, and Kubernetes networking. Managed Helm charts for Kubernetes deployments, ensuring seamless integration and efficient workload automation across environments.

Designed and maintained comprehensive monitoring and alerting solutions using tools like Splunk, ELK, DataDog, Dynatrace, CloudWatch, CloudTrail, and Wiz. Created custom dashboards and fine-tuned alerts to ensure proactive issue detection, reducing Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).

Collaborated with cross-functional teams to design custom dashboards tailored to specific operational metrics, giving clear visibility into key performance indicators. Focused on improving system reliability, enhancing observability, and streamlining incident management in support of SRE objectives.

Environment: AWS (VPC, EC2, ECS, EKS, Lambda, DynamoDB, RDS, S3, Redshift, CloudWatch, CloudTrail, CodePipeline, Control Tower, KMS, Glue, EMR, MWAA, Batch Jobs), GCP, OpenShift 4.X, Kubernetes, Helm, Gloo, Istio, GTM, LTM, Ansible Tower, Jenkins, GitHub Actions, Digital.ai, TeamCity, Harness, Terraform, CloudFormation (CFN), Splunk, ELK Stack (Elasticsearch, Logstash, Kibana), AppDynamics, Dynatrace, DataDog, Python, Java, Spring, Node.js, JBoss, Docker, MySQL, XLR Templates, DR Architecture.

2022-05 - 2023-05

Sr. Lead DevOps & SRE Engineer

Charles Schwab Corporation, Austin, TX

Led a team of 9 engineers in the design, development, and maintenance of cloud infrastructure and DevOps processes, ensuring high performance and uptime across systems. Spearheaded the design and implementation of AWS cloud architecture while also driving the migration to GCP, utilizing resources like Cloud APIs, GCE, GKE, Cloud Functions, Cloud Spanner, BigQuery, IAM, Cloud SQL, Pub/Sub, and Cloud Storage.

Managed the end-to-end implementation of Harness Next Gen, including project onboarding, managing Harness Delegates, templates, connectors, and CI/CD pipeline architecture. Developed Node.js and Python tools to automate operational tasks and enhance efficiency.

Onboarded a suite of new DevOps tools such as LaunchDarkly, TasktopViz, TasktopHub, Harness, Digital.ai, SauceLabs, Mabl, GitHub Actions, Artifactory, AppDynamics, Splunk, and ThousandEyes, creating automated workflows to streamline operations.

Implemented Disaster Recovery Plans (DRP), integrating frontend and backend services, synthetic monitoring, and metrics collection to create comprehensive observability dashboards. Utilized Docker Compose and Kubernetes for containerized builds and deployments, creating Helm charts for seamless service deployment.

Developed KPI metrics and log traces to enhance observability, ensuring end-to-end telemetry for monitoring both application and infrastructure performance. Mentored and trained team members, fostering a collaborative and service-focused work environment.

Environment: GCP, Bamboo, Cloud Foundry, Datacenter, GitHub Actions, ThousandEyes, Splunk, Dynatrace, MySQL, Python, Node.js, Java, Docker, Kubernetes, Terraform, Harness, Digital.ai, Akamai, Bitbucket, Grafana, DR Plans, Planview Tasktop Viz and Hub, Linux, Windows, SauceLabs, LaunchDarkly, MABL, Apache Airflow.

2021-07 - 2022-05

SRE & DevOps Manager

Sony Networks

Cloud Infrastructure Design & Migration: Led the design and migration of AWS and Azure cloud architectures, including Three-Tier, Serverless, and Microservices. Migrated 100+ resources from CloudFormation to Terraform, reducing provisioning time by 40% and enabling scalability for business growth.

CI/CD Pipeline & Automation Optimization: Enhanced Jenkins for high availability and scalability, reducing build times by 30%. Automated 50+ releases, increasing deployment frequency by 25%. Integrated New Relic for real-time monitoring, improving application performance tracking and reducing incident response times.

Containerization & Kubernetes Orchestration: Designed and deployed containerized applications using ECS, EKS, and Azure Kubernetes Service (AKS), raising availability to 99.99%. Optimized auto-scaling, resource allocation, and load balancing to support 100,000+ concurrent users, improving operational efficiency by 35%.

Serverless Solutions & Cloud Automation: Developed serverless solutions with AWS Lambda and Azure Function Apps, automating 80% of application management tasks and reducing onboarding time by 50%. Worked with Python, Node.js, and TypeScript to build tools that streamlined cloud management and enhanced scalability.

Security, Compliance & Monitoring: Ensured GDPR compliance and resolved vulnerabilities using Qualys and Nessus. Improved security and deployment processes, reducing incidents by 30%. Integrated monitoring tools like Elasticsearch, CloudWatch, and DataDog, creating 50+ dashboards and reducing downtime incidents by 20%.

Environment: AWS, Azure, DataDog, Datacentre, Splunk, Kibana, Grafana, AppDynamics, MySQL, Python, Java, Node, Coralogix, Docker, Kubernetes, ECS, Nginx, Terraform, Ansible, Jenkins, ELK, Prometheus, Maven, MuleSoft deployments, DynamoDB, DR, CloudWatch, SAP, API Gateway, VPC, Transit Gateway, JIRA, Qualys, Nessus, Control Tower, SCP, WAFs, Akamai, AWS QuickSight.

2020-03 - 2021-07

Sr. Lead SRE Engineer / Manager DevOps

Aditya Birla Capital

I led a team of over 15 members across DevSecOps, DevOps, and SRE, building the SRE team from scratch and aligning technical solutions with business goals through close collaboration with the VP, CTO, Senior IT Manager, product managers, and operations teams. I scaled application traffic from 1.5 TPS to 15 TPS by configuring SLA, SLO, and KPI metrics, meeting performance objectives while driving system reliability and scalability.

I designed and implemented a highly available platform with 99.99% uptime, ensuring seamless scalability and reliability for critical business operations. I created observability and monitoring dashboards, automated alert systems, and auto-resolve automation jobs to support a 24x7 production environment. These tools enabled proactive issue resolution and significantly enhanced operational visibility.

My work also involved architecting AWS three-tier environments using Terraform and CloudFormation to automate infrastructure provisioning. I established ECS environments, API Gateways, and VPC models to streamline deployments and infrastructure management. To ensure resilience, I implemented high-availability and disaster recovery setups, enabling support for traffic loads of 30-50 TPS and reducing recovery times by 50%.

Incorporating MLOps principles, I designed and managed end-to-end Machine Learning pipelines to support business-critical AI/ML workloads. This included automating model deployment with CI/CD pipelines, streamlining data preprocessing workflows, and ensuring the reproducibility of ML experiments. I leveraged tools such as Amazon SageMaker, Kubernetes, and MLflow to optimize the delivery of machine learning models at scale.

Security was a top priority, and I addressed vulnerabilities, implemented GDPR compliance measures, and strengthened network security with WAF and encryption. By collaborating with multiple vendors, I ensured seamless integration of solutions that supported operational efficiency across diverse platforms.

Additionally, I introduced modern observability tools like Grafana, Prometheus, CloudWatch, and DataDog, which improved monitoring capabilities and reduced issue resolution times by 30%. By automating deployments and patching with Ansible, I decreased build cycle times by 40% while managing CI/CD pipelines for production releases to maintain seamless delivery processes.

These efforts consistently delivered value by scaling systems, enhancing reliability, and fostering a culture of innovation.

Environment: AWS, ELK, New Relic One, SageMaker, MLflow, Kubernetes, MySQL, Python, Java, BPM, Celery, Nginx, Customer & Ops Portals, Ansible, Akamai, Jenkins, Terraform, Prometheus, Containers, Grafana, Loan Management Services, Loan Originating Services, CloudFormation, RDS, CloudWatch, API Gateway, VPC, Transit Gateway, JIRA, DataDog, Planner.

2019-04 - 2020-02

Sr. Infrastructure Engineer

Apple Inc, Austin, TX

I led the design and migration of AWS Two-Tier and Microservices Architectures, transitioning from bare-metal environments to the cloud. This included creating and managing AWS CloudFormation templates for resource provisioning, ensuring a smooth migration of on-premises data centers to AWS.

I worked extensively with Docker and Kubernetes (ECS/EKS), managing clusters, namespaces, nodes, and pods while automating deployments to streamline microservices operations. As a release manager, I implemented blue-green deployments for high availability, coordinated releases with SRE and infrastructure teams, and ensured seamless application delivery.

I automated server configurations, builds, and deployments using Ansible, creating playbooks and managing SSL certificates with Cert-Manager and Ansible Vault. Additionally, I configured Cloudflare for CDN services to optimize performance and security.

Supporting CI/CD pipelines, I configured Jenkins and developed automation for QA, regression, and deployment tasks. I contributed to application development by integrating APIs using Node.js and maintaining Java-based applications with Tomcat and Apache servers.

I enhanced monitoring and troubleshooting capabilities by leveraging Splunk for log analysis, alert creation, and dashboard development, as well as Dynatrace for infrastructure metrics. My database responsibilities included managing Cassandra nodes, performing validations, and planning migrations to MongoDB to meet application requirements.

Environment: AWS, Datacenters, Java, Tomcat, Splunk, Dynatrace, Jenkins, Ansible, Cassandra, MongoDB, Oracle, Apache, Node.js, Microservices, Patching, Autoscaling, ECS, EKS, S3, VPC, VPC endpoint, Networking, Shield, WAF, ELB, ALB, Artifactory, Nexus, Espresso, Self Service, Cloudflare, Radar.

2017-02 - 2019-03

Sr. DevOps Engineer & Site Reliability Engineer

National Geographic Partners & Walt Disney, Washington, DC

Designed and implemented AWS hybrid and monolithic architectures, as well as GCP hybrid environments using shared VPC models, optimizing network usage, configuring subnets, and establishing firewall rules to improve security and scalability.

Migrated 400+ services from on-premises to AWS and GCP, including the SPI project, where I collaborated with leadership to move 20 TB of image data, achieving a 40% reduction in storage costs and a 50% improvement in data accessibility.

Automated infrastructure provisioning using CloudFormation templates and Ansible playbooks, later transitioning to Terraform for scalability, reducing provisioning times by 60% and standardizing deployments.

Partnered with InfoSec teams to resolve application vulnerabilities, improving security posture by 50% and ensuring compliance with organizational standards.

Designed and managed QA automation pipelines, created and optimized Dockerfiles, and implemented Kubernetes setups, enhancing deployment efficiency and reducing testing times by 40%.

Applied SRE principles post-migration to improve site reliability, reduce latency by 30%, and ensure consistent performance for cloud-hosted applications.

Managed and administered Bamboo CI/CD pipelines, optimizing workflows, creating automated build and release plans, and accelerating release cycles by 50%.

Designed and implemented disaster recovery workflows for AWS, improving recovery speed by 50% and reducing downtime for critical applications.

Environment: AWS, GCP, Datacentre, Python, Ansible, Puppet, Chef, Docker, Java/J2EE, NFS, DNS, Jenkins, Maven, Git, Splunk, New Relic, Shell script, EC2, Logstash, Kibana, CentOS, Sensu, CloudTrail, HP Fortify, Watchdog, EKS, microservices, Bamboo, CircleCI, Bitbucket, Stash, API, Akamai, Network Troubleshooting, Tunneling, IP, TCP/UDP, Traceroute, Stackdriver, CloudCheckr, GuardDuty, ParkMyCloud for cloud services, Qualys, CrowdStrike, Fortify Scan, and Trend Micro.

2013-04 - 2015-08

Build & Release Engineer / Linux Admin

Efftronics Systems Private LTD, LMD

Managed source code versioning and release processes using Git and CVS, defining branching and merging strategies to streamline collaboration.

Automated application packaging and deployment pipelines via Jenkins, improving deployment efficiency and consistency.

Administered Linux environments and facilitated data transfers across hardware components to optimize infrastructure operations.

Enhanced testing and code quality by implementing JUnit and Selenium, with detailed code coverage reports generated via SonarQube.

Developed and deployed J2EE applications on JBoss and Apache Tomcat servers, ensuring reliable performance across environments.

Created and executed SQL scripts for deployment in multiple environments, standardizing database operations.

Environment: Git, Datacentre, IBM Rational ClearQuest 7.0.1, Make, Ant, Maven, Shell (bash), Apache Tomcat Application Server, Java/J2EE, Linux, SQL, Concurrent Versions System (CVS), Perforce, Jenkins, Remedy, JBoss, UNIX, Selenium.

Education

2015-09 - 2016-12

Master of Science: Computer Science

SVU - San Jose, CA

2009-07 - 2013-02

Bachelor of Science: Electronics and Communication Engineering

JNTUK University - India