
Machine Learning Reliability Engineer

Location:
Frisco, TX, 75035
Posted:
April 30, 2025


Resume:

Naga Vardhan. K

Dallas TX •****.********@*****.***. 469-***-****

SUMMARY

Senior DevOps/Site Reliability Engineer (SRE) with approximately 12 years of experience, combining a strong analytical and programming background with a focus on Bash, Python, Flask, Django, PySpark, Golang, and JavaScript. Skilled in developing, deploying, and scaling advanced applications and machine learning models, leveraging cloud technologies and big data solutions on Azure. Proficient in using Python libraries and frameworks to drive data-driven solutions, build RESTful APIs, and enhance decision-making processes. Deep knowledge of algorithm development, data structures, and cloud-native solutions.

Experienced in SRE and DevOps practices, with a strong background in the financial and media sectors. Adept at defining and managing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to maintain system reliability and align performance with business goals. Proficient in setting up and managing CI/CD pipelines using Jenkins and GitHub Actions, streamlining deployment workflows and enhancing automation for reliable software delivery.

Highly skilled in managing and scaling Kubernetes clusters and implementing Infrastructure as Code (IaC) with Terraform. Demonstrated expertise in automating deployments, optimizing performance, and ensuring robust security in cloud environments while improving reliability and reducing downtime. Experienced in managing virtualized environments using VMware and KVM, ensuring high availability, performance, and resource efficiency across infrastructure.

Implemented comprehensive IAM solutions leveraging Okta and Azure AD to ensure secure and efficient user and application access across cloud environments. Designed and deployed Okta Single Sign-On (SSO) to enable seamless and secure access to cloud applications, improving user experience and reducing password fatigue. Developed automated workflows using Okta Lifecycle Management for provisioning and deprovisioning users, roles, and groups, minimizing manual administrative tasks and ensuring operational consistency.

SKILLS

●Languages: Python, JavaScript, ReactJS, NodeJS, Bash, Golang, PowerShell, C++, Java

●Big Data: Hadoop, Scala, Spark, Kafka, Kinesis, Zookeeper, Grafana, Data visualization

●Databases: SQL, MySQL, PostgreSQL, DynamoDB, MongoDB, Redis, Redshift, Neo4j

●CI/CD: GitHub Actions, Azure DevOps, Jenkins, Concourse CI, GitLab

●Frameworks: Torch, Django, Flask, NumPy, Pandas, Jupyter, Polars

●Configuration Management: Ansible, Puppet, Chef

●Containerization Tools: Docker, Kubernetes, Amazon ECS

●IaC: Terraform, AWS CloudFormation

●Monitoring: Datadog, Loki, Prometheus, Grafana, Splunk

PROFESSIONAL EXPERIENCE

AMDOCS - T-Mobile Plano, TX

Sr. DevOps Engineer Sep 2022 – Present

●Designed and implemented robust ETL pipelines leveraging Snowflake and orchestration tools like Apache Airflow, ensuring reliable and timely data ingestion from diverse sources.

●Developed and deployed machine learning models on Azure Machine Learning using Python, streamlining data science workflows and enabling real-time decision-making capabilities.

●Integrated CI/CD pipelines with Kubernetes using Jenkins and Argo CD, accelerating deployment cycles.

●Designed, planned, and implemented the migration of existing on-premises applications to Azure Cloud (ARM). Configured and deployed Azure Automation scripts using Azure Stack services and utilities, with a focus on automation.

●Led the transformation of a legacy monolithic application to microservices architecture on Kubernetes.

●Designed and implemented scalable microservices architecture using Golang on Azure Kubernetes Service (AKS), improving application performance and reducing latency.

●Developed RESTful APIs using Golang for Azure-based applications, optimizing data retrieval processes and reducing response times.

●Built concurrent and efficient backend systems in Golang, leveraging Azure Redis Cache and Azure Cosmos DB to handle high-volume traffic with minimal downtime.

●Created custom CI/CD pipelines for Golang applications using Azure DevOps, automating testing, deployment, and monitoring, which improved deployment efficiency by 50%.

●Utilized Golang-based scripts for Infrastructure as Code (IaC) solutions with Terraform on Azure, automating the provisioning and configuration of cloud resources.

●Developed data ingestion pipelines to efficiently load structured and unstructured data into ADLS using Azure Data Factory, Databricks, and Synapse Analytics.

●Designed and implemented scalable ETL pipelines using PySpark to process and transform large datasets (terabytes) on distributed systems, reducing data processing time.

●Deployed PySpark applications on cloud platforms like Azure Databricks ensuring efficient resource utilization and autoscaling capabilities for large-scale data workloads.

●Developed and deployed microservices using Java and Spring Boot on Azure, enhancing application scalability and maintainability.

●Designed RESTful APIs with Spring Boot, integrated with Azure Functions and Azure API Management for serverless execution.

●Used Azure Kubernetes Service (AKS) to deploy a managed Kubernetes cluster in Azure, creating the AKS cluster using the Azure portal and the Azure CLI; also used template-driven deployment options such as Resource Manager templates and Terraform.

●Implemented security best practices using Spring Security and Azure Active Directory (Azure AD), ensuring robust access control and compliance.

●Designed automated incident response workflows to quickly mitigate security incidents, leveraging Azure Functions and Azure Logic Apps.

Bank of America NY, NY

Software Engineer/SRE

Jun 2018 – Sep 2022

●Integrated React.js front-end with Python-based back-end services using RESTful APIs and GraphQL, facilitating data exchange and synchronization between client and server-side components.

●Set up Grafana dashboards to visualize key performance metrics and logs from various Azure services.

●Automated vulnerability scanning and patch management processes across cloud resources to ensure timely remediation of security vulnerabilities.

●Designed, configured, and deployed Azure Resource Manager (ARM) templates for multiple applications utilizing the Azure stack, including Compute, Web Apps, Function Apps, Blobs, Data Factory, Resource Groups, HDInsight clusters, and Azure Cosmos DB.

●Utilized Azure Monitor to implement autoscaling strategies based on predefined metrics or custom alerts.

●Integrated Azure AD with on-premises Active Directory and third-party applications, enabling seamless Single Sign-On (SSO) experience and reducing administrative overhead.

●Deployed and managed Azure Stack environments to extend Azure services to on-premises data centers, supporting hybrid cloud scenarios and enabling consistent application development.

●Created reusable Terraform modules for deploying Azure resources such as VNets, Virtual Machines, and Azure Kubernetes Service (AKS) clusters, reducing deployment time and increasing code reusability across projects.

●Used Ansible and Ansible Tower as Configuration management tools to automate repetitive tasks, quickly deploy applications and proactively manage change.

●Developed and implemented a Kafka-based solution to replace a legacy batch processing system, reducing data processing time from hours to minutes.

●Implemented monitoring solutions for Kafka clusters on Confluent Cloud using tools like Confluent Control Center, Prometheus, and Grafana.

●Conducted regular security audits and automated compliance reporting to meet regulatory requirements and industry standards.

●Recreated complex Datadog dashboards in Grafana, maintaining key metrics, visualizations, and alerts.

●Integrated Grafana with various data sources such as Prometheus, InfluxDB, and Elasticsearch to replicate Datadog’s monitoring capabilities.

●Configured Prometheus to monitor Kubernetes clusters, including node health, pod metrics, and application performance.

Fannie Mae Dallas, TX

DevOps Engineer

Feb 2016 – May 2018

●Integrated Python applications with Kubernetes APIs and resources, utilizing Kubernetes client libraries to interact with clusters, deploy workloads, and manage resources programmatically.

●Developed Terraform modules to automate the creation and management of Azure Virtual Networks (VNets), including subnets, network security groups (NSGs), and routing tables, ensuring a consistent and secure network architecture.

●Developed custom Terraform modules and Ansible playbooks in Python to abstract and encapsulate infrastructure configurations, promoting code reuse, modularity, and standardization across projects and environments.

●Created custom Ansible modules using Python to extend automation capabilities, tailored to handle complex tasks such as multi-tier application setups and dynamic scaling of resources.

●Automated CI/CD processes using Python and Ansible, integrating them with Jenkins and GitHub Actions to deploy containerized applications efficiently across staging and production environments.

●Developed custom Terraform providers using Golang to extend Terraform’s capabilities, enabling integration with proprietary APIs and third-party services not natively supported by Terraform.

●Orchestrated complex multi-cloud deployments and hybrid infrastructure setups using Terraform and Ansible, ensuring consistency and reproducibility in infrastructure provisioning and management workflows.

●Optimized existing Terraform providers written in Golang by refactoring code and enhancing API call efficiency, resulting in a significant reduction in infrastructure deployment times.

●Installed a Docker Registry for uploading and downloading Docker images locally, as well as pulling images from Docker Hub.

●Worked on the Docker ecosystem with many open-source tools like Docker Machine, Docker Compose, and Docker Swarm.

●Designed and implemented Azure AD solutions to manage identity and access for cloud-based and hybrid environments, ensuring secure and efficient authentication processes.

●Developed automated workflows using Azure AD Lifecycle Management to streamline provisioning and deprovisioning of users, roles, and groups, enhancing operational efficiency.

●Led efforts to integrate Azure AD with DevOps pipelines to enforce access controls, secure build processes, and manage secrets effectively.

AT&T Dallas, TX

DevOps Engineer Jan 2014 – Nov 2015

●Automated the build and deployment process using Jenkins for Continuous Integration and Ansible Tower for Continuous Deployment onto AWS EC2 instances.

●Installed applications on AWS EC2 instances and configured storage on EBS volumes.

●Deployed and operated AWS services, specifically VPC, EC2, S3, EBS, IAM, ELB, and CloudFormation, using the AWS Console.

●Configured multi-platform servers using Ansible roles and developed roles for patching VMs on AWS and on-premises instances.

●Used Ansible and Ansible Tower as configuration management tools to automate repetitive tasks, quickly deploy applications, and proactively manage change.

●Used YAML for AWS CloudFormation to manage all the applications located in a multi-region architecture.

●Used Jenkins for Continuous Integration and deployment onto Apache Tomcat Server.

●Built and maintained Docker container clusters managed by Kubernetes.

●Deployed applications on OpenShift using deployment configurations and Helm charts. Also managed application lifecycle, rolling updates, and automated scaling to ensure high availability and performance.

●Used Git as the source code repository and managed branching, merging, and tagging.

●Automated the complete CI jobs using Jenkins pipeline DSL and Jenkins Groovy.

●Built CI/CD pipelines using Jenkins Multibranch Pipeline, DSL scripts, and Groovy with minimal or no manual intervention for multiple applications.

●Designed and deployed SCM processes and procedures with Bitbucket, GitHub, Git, and Eclipse.

●Integrated CloudFront with S3 storage for content delivery (CDN).

Client: AIG Jun 2011 – Dec 2013

Role: System Admin

●Implemented branching strategies using Git; created branches, resolved merge conflicts, and performed the migration setup from SVN to Git using Bitbucket. Also used Git to store code, set up new development branches, merge branches, and facilitate releases.

●Responsible for Continuous Integration (CI) and Continuous Delivery (CD) process implementation using Jenkins along with shell scripts to automate routine jobs.

●Used Nexus for periodic archiving and storage of the source code for disaster recovery, sharing artifacts, and handling dependency management within the company.

●Implemented Continuous Integration using Jenkins and Git. Installed Jenkins and plugins for the Git repository, set up SCM polling for immediate builds with Ant and an Ant repository (Nexus Artifactory), and deployed EARs and WARs to the Tomcat application server using Ant scripts as part of the CI/CD process.

●Integrated automated builds with the deployment pipeline. Installed Puppet server and clients to pick up builds from the Jenkins repository and deploy them to target environments (Integration, QA, and Production).

●Worked with the Build & Release team to improve software packaging and delivery through automation using Jenkins and Puppet.

EDUCATION

Texas A&M University

M.S. in Computer Science (GPA: 3.8/4.0) Aug 2009 – Dec 2010

Coursework: Algorithms, Programming Design Paradigms, Database Management Systems, Machine Learning, Deep Learning, Large Language Models

B.Tech in Computer Science, Anna University June 2004 – June 2008

Coursework: Java, Python, NLP, Computer Networks, Cloud Computing


