Machine Learning Cloud Engineer

Location:
Dallas, TX
Posted:
June 24, 2025

Resume:

Rajesh Guthikonda

LinkedIn: rajeshguthikonda**** Gmail: ************@*****.*** Ph: +1-404-***-****

PROFESSIONAL SUMMARY:

• Google Cloud Platform (GCP) Cloud Engineer with 10+ years of experience delivering secure, scalable, and automated cloud infrastructure. Skilled in deploying and managing GCP VPCs, IAM policies, and CI/CD using Terraform and Jenkins. Proven track record of implementing GCP-native monitoring, logging, and compliance tools for production-grade infrastructure in DevOps and SRE environments.

• Built and maintained comprehensive CI/CD pipelines using DevOps tools to streamline development workflows. Automated testing, deployment, and monitoring processes while ensuring code quality and supporting agile development practices.

• Implemented Google Cloud Platform services including Compute Engine, Kubernetes Engine (GKE), Cloud Storage buckets, Artifact Registry, Secret Manager, and Cloud SQL. Set up infrastructure to maximize availability, scalability, and system reliability.

• Served as Tier-2 escalation engineer for critical GCP production incidents, diagnosing and resolving high-priority issues related to IAM, VPC, and compute workloads.

• Created complex data visualizations and reports using Looker, developing custom dashboards and LookML models to improve data analysis and decision-making processes.

• Used BigQuery ML to build and implement predictive analytics solutions and machine learning models. Integrated ML models within BigQuery to improve data processing speed, enhance performance metrics, and deliver data-driven business solutions.

• Managed complete machine learning workflows on the Vertex AI platform, implementing ML pipelines and AutoML features. Optimized model performance through hyperparameter tuning and established efficient deployment processes for production environments.

• Developed Python and Bash scripts to validate GCP security configurations, automating compliance checks against CIS benchmarks.

• Implemented GCP environments following DevOps principles and automated infrastructure provisioning using Terraform and Python, enabling repeatable and secure cloud deployments.

• Hands-on experience managing Identity and Access Management (IAM), VPC configuration, monitoring, and policy-as-code practices using Terraform and DevOps pipelines.

• Deployed and maintained Kubernetes clusters on Google Kubernetes Engine (GKE) to support large-scale container operations. Set up automatic scaling, implemented comprehensive monitoring systems, and resolved complex issues in containerized environments while ensuring high availability and optimal performance.

• Automated cloud infrastructure using Terraform and CloudFormation, resulting in 40% faster deployments. Maintained mission-critical applications with 99.9% uptime through robust infrastructure management and monitoring practices.

• Used Google Cloud’s Operations Suite (formerly Stackdriver) for log and metric ingestion, alerting, and visualization of system health.

• Managed container deployments using Docker, including Docker Hub administration and image management across multiple registries. Implemented Kubernetes clusters with Helm charts to build scalable, resilient container environments and streamline application deployments.

• Automated system configurations and deployments with Ansible, Puppet, and Chef to reduce manual interventions. Created and maintained infrastructure-as-code practices to ensure consistency across development, staging, and production environments.

• Collaborated with Cloud Architects and NetSecOps to configure secure GCP VPC networks, automate infrastructure deployment using Terraform, and monitor cloud usage with custom dashboards in Stackdriver.

• Administered Linux/Unix systems (RHEL, Ubuntu, CentOS) with focus on security and stability, including managing Linux Containers and Docker implementations within DevOps workflows for efficient deployment and scaling.

• Provided comprehensive support across Linux, Windows, and macOS platforms while maintaining consistent infrastructure operations, implementing robust monitoring systems, and ensuring system reliability across all environments.

• Managed network infrastructure including DNS, LDAP, Load Balancing, SMTP, and Firewall configurations to maintain high availability and system performance. Implemented TCP/IP protocols and VPN solutions to strengthen network security while ensuring reliable connectivity across all environments.

• Configured Azure Firewalls, NSGs, and Route Tables as part of Terraform-based network automation, ensuring secure traffic flow and compliance with internal security standards.

• Worked with VMware virtualization technologies to optimize server infrastructure and resource management. Created and maintained virtual machines and VMware appliances while implementing best practices for system performance and availability.

• Implemented and managed monitoring systems using Prometheus, Grafana, ELK Stack, and Nagios to enhance system observability. Established proactive alert mechanisms and dashboards to identify and resolve potential issues before impacting production systems.

• Used APM tools including Dynatrace, New Relic, and AppDynamics to monitor and optimize application performance. Created comprehensive monitoring strategies and troubleshooting workflows to improve system efficiency and maintain optimal response times.

• Developed automation scripts using Python, Bash, Ruby, Linux shell, and YAML to streamline routine operations. Created reusable code libraries and automation frameworks to minimize manual interventions and improve team productivity.

• Participated in on-call rotations providing 24/7 support for critical systems and infrastructure. Created and maintained detailed runbooks and Standard Operating Procedures (SOPs) to standardize incident response and ensure consistent problem resolution.

• Managed Azure cloud infrastructure including Azure Storage and Azure Virtual Machines. Implemented automated deployment pipelines using Azure DevOps while ensuring security compliance and high availability of cloud resources.

EXPERIENCE

Visa 11/2022 – Present

Site Reliability Engineer Foster City, CA

• Developed and maintained ETL pipelines using Azure Data Factory (ADF) to ingest and transform healthcare-related datasets.

• Created Python and JSON-based scripts for automation, error handling, and dynamic ADF pipeline parameterization.

• Migrated legacy workflows from Informatica ICS to ADF, improving reliability and reducing runtime by 30%.

• Designed and optimized SQL queries and stored procedures in Snowflake and Azure SQL DB for data transformation and analytics.

• Authored Interface Control Documents (ICDs) documenting source-to-target mappings, transformations, and data validation logic.

• Integrated data validation and quality checks within pipelines to ensure compliance with Medicare/Medicaid regulatory standards.

• Collaborated with DevOps team to implement CI/CD for data pipelines using GitHub Actions and Azure DevOps.

• Created and managed regional and global load balancers in GCP to optimize traffic distribution and achieve faster response times across different geographic locations.

• Set up and maintained failover DNS records to ensure continuous service by automatically routing traffic to backup servers during main server outages. Implemented regular testing of failover mechanisms and documented recovery procedures to minimize potential downtime.

• Implemented GCP security controls through Identity and Access Management (IAM) to enforce role-based access restrictions and safeguard sensitive data across cloud environments. Created detailed access policies and conducted regular audits to maintain security compliance across development, staging, and production environments.

• Applied GCP security best practices by configuring VPC Service Controls and firewall rules to establish protected network boundaries and limit access to critical cloud resources. Monitored security logs and implemented automated alerts to detect and respond to potential security incidents.

• Collaborated with NetSecOps teams to define and automate GCP network security controls using Terraform and Python scripts for firewall rules, IAM role boundaries, and VPC Service Controls.

• Participated in CloudOps initiatives by defining SOPs for IaaS, PaaS, and SaaS usage on GCP, helping standardize cloud consumption and enforce compliance.

• Managed GCP IAM users, organizations, and groups by interpreting access policies and automation scripts aligned with security policies defined by IAM and CISO teams.

• Implemented security controls for DNS records management to ensure only authorized personnel could modify DNS settings and configurations. Created detailed change management procedures and maintained audit logs of all DNS modifications to track and verify authorized changes.

• Defined and documented SOPs for IaaS/PaaS/SaaS consumption on GCP to enforce compliance and standardize cloud usage.

• Collaborated with .NET development teams to deploy and troubleshoot APIs hosted on AKS clusters, integrating deployment hooks in GitLab pipelines.

• Created and managed CI/CD pipelines with Jenkins and GitHub Actions to deploy Java, .NET, and Python applications, using SonarQube for code quality checks to increase deployment speed and enforce code standards.

• Set up GCP integration with CI/CD pipelines to automate code deployment to GKE and App Engine, resulting in quicker and more stable releases.

• Implemented Docker containerization and Kubernetes orchestration for microservices, using Helm charts and ArgoCD for automated deployment and scaling across cloud environments.

• Built and ran applications in containers on Google Kubernetes Engine (GKE) to support business-critical services. Implemented microservices architecture that improved system reliability, simplified scaling, and strengthened security controls across the platform.

• Set up automatic scaling rules and policies for GKE clusters to handle varying workload demands throughout the day. Created resource optimization strategies that reduced cloud costs by 30% while maintaining consistent performance levels.

• Tracked GKE cluster health and performance using Prometheus and Grafana monitoring tools. Set up alerts and built custom monitoring dashboards that helped identify and resolve system issues before they impacted users.

• Created detailed analytics dashboards in Looker that transformed complex data into clear visual insights for different teams. Developed custom visualizations and LookML models that helped stakeholders understand trends and make data-driven decisions.

• Connected Looker with BigQuery to create a seamless data analytics pipeline for the organization. Built efficient data models and optimized queries that reduced report loading times from minutes to seconds, allowing teams to analyze data directly in the Looker interface.

• Configured and deployed VPCs, subnets, firewall rules, and NAT gateways in GCP using Terraform, ensuring secure and scalable network architectures across multiple projects.

• Collaborated with NetSecOps to automate GCP firewall rule and IAM role management using Python and Terraform, enforcing least-privilege access and compliance with CIS benchmarks.

• Managed IAM roles, org policies, and group permissions in GCP, integrating with Terraform for repeatable and auditable access control.

• Created GCP Stackdriver dashboards and alert policies to monitor health of GCP workloads, integrating logs and metrics with centralized monitoring platforms.

• Wrote and maintained SOPs for deploying IaaS, PaaS, and SaaS services in GCP, establishing standardized onboarding workflows and access provisioning.

• Supported tier-2 on-call escalations, resolving production incidents in GCP infrastructure, including DNS failover, load balancer configurations, and VM auto-healing workflows.

• Developed monitoring dashboards and alerting strategies using GCP’s Operations Suite (Stackdriver), integrated with custom Prometheus exporters and Grafana visualizations to ensure platform health and proactive incident resolution.

• Designed and developed scalable backend APIs using Node.js to support growing user demands. Implemented caching strategies and performance optimizations that improved web application response times.

• Implemented BigQuery ML models to detect anomalies in system performance and user behavior patterns. Created automated alerting systems based on ML predictions that reduced incident response time.

• Developed end-to-end machine learning pipelines on Vertex AI for automated model training and deployment. Created monitoring systems to track model performance and trigger automatic retraining when accuracy dropped below thresholds.

• Enhanced feature engineering processes in BigQuery ML through automated data transformation pipelines. Implemented feature selection algorithms that improved model accuracy while reducing training time.

• Designed secure hybrid cloud networking architecture using VPNs and Interconnect for multi-environment connectivity. Implemented automated failover mechanisms that ensured high uptime for critical cross-network services.

• Designed robust deployment strategies including blue-green and canary releases on GCP infrastructure. Implemented automated health checks and progressive traffic shifting that eliminated deployment-related downtime.

• Built and deployed machine learning models with TensorFlow, PyTorch, and Scikit-Learn frameworks for business insights. Implemented solutions for predictive analysis and customer grouping that improved decision-making processes.

• Monitored and enhanced SLA/SLO metrics through systematic performance analysis and improvements. Created automated monitoring systems that streamlined incident detection and response procedures.

• Set up SSL/TLS termination configurations on Load Balancers to enhance security measures. Optimized system performance by efficiently managing encryption processes at the edge layer.

• Managed and optimized RHEL production server environments through continuous monitoring and performance tracking. Implemented automated alert systems and response protocols that improved system reliability, resulting in 99.9% uptime across 50+ servers.

• Developed comprehensive Ansible playbooks for automated system management and wrote modular Bash scripts for RHEL maintenance tasks including patching, backups, and log rotation.

• Established standardized ServiceNow workflows for incident response and implemented detailed Change Management procedures for infrastructure modifications. Created and maintained documentation for incident handling procedures, reducing average incident resolution time from 4 hours to 45 minutes and decreasing unplanned downtime by 60%.

• Implemented Nagios monitoring across multiple data centers to track critical infrastructure metrics and system performance bottlenecks. Developed and deployed custom Python-based Nagios alerts for specialized monitoring requirements.

• Conducted comprehensive quarterly security audits and thorough penetration tests across all system layers. Implemented security improvements for application, network, and infrastructure vulnerabilities, resulting in a 70% reduction in security incidents.

• Deployed GCP monitoring infrastructure using Stackdriver for real-time performance tracking and anomaly detection. Created custom dashboards and alert policies to monitor resource utilization, application performance, and system health, enabling rapid response to potential issues.

• Developed automated shell scripts integrated with CLI to enhance monitoring capabilities through custom CloudWatch metrics and SNS notifications. Created escalation procedures and alert routing rules to ensure critical alerts reached appropriate teams, reducing system downtime by 45% through faster incident response.

Environment: GCP, GKE, AWS, Terraform, Docker, Kubernetes, Lambda, Groovy, CloudWatch, CloudFormation, ELK, Splunk, Ansible, VPCs, Git, GitHub, VS Code, Agile Methodology, JIRA, Maven build, Apache, Nagios, Python, Ruby, Linux, VMware, ServiceNow.

BNY Mellon 01/2020 – 10/2022

DevOps Engineer New York, NY

• Supported data ingestion and transformation workflows on Azure, focusing on healthcare claims and financial records.

• Utilized Python for automating recurring data loads, metadata tagging, and post-load validations.

• Built reusable Terraform modules for ADF-linked services (e.g., storage, networking, triggers).

• Maintained Snowflake objects (warehouses, schemas, views) and scheduled tasks for curated datasets.

• Created Looker dashboards integrated with BigQuery and Snowflake for stakeholder reporting.

• Participated in change management and incident response for critical data pipelines.

• Created automated systems for data ingestion and transformation workflows, reducing manual data handling time by 50% and improving pipeline efficiency. Streamlined data processing by implementing automated quality checks and validation procedures.

• Deployed and managed monitoring and reporting stacks for GCP services using Stackdriver (now Cloud Operations Suite), automating alerts and visualization for executive-level reporting.

• Acted as the Tier-2 escalation point for GCP environment production incidents, performing post-mortem analysis and break-fix operations across compute, IAM, and networking issues.

• Built and managed Azure infrastructure (AKS, SQL, VMs, VNets) using Terraform modules executed via GitLab CI/CD pipelines, enabling fully automated IaC deployments with reusable templates and secure credential management.

• Used VPC Flow Logs and Cloud Monitoring to track and enhance network performance, successfully reducing packet loss and improving system response times. Implemented proactive monitoring alerts to identify and address potential network bottlenecks.

• Championed internal adoption of GCP services by developing enablement guides, conducting brown bag sessions, and hands-on demos for developers.

• Built and deployed custom LookML models in Looker to enhance query speed and establish standardized metrics, enabling consistent reporting capabilities across different teams. Created modular LookML code to improve maintainability and reduce duplicate development efforts.

• Integrated Looker's embedded analytics capabilities to build interactive dashboards for external stakeholders, ensuring secure data access while meeting compliance requirements. Developed customized visualization templates to meet specific reporting needs across different business units.

• Identified and resolved network issues including IP address conflicts, firewall rule misconfigurations, and VPC routing problems to maintain reliable service communication. Established systematic troubleshooting procedures to minimize service disruptions and improve recovery time.

• Created SOPs and documentation for infrastructure provisioning, access control, DNS failover management, and CI/CD workflows to support GCP platform adoption.

• Maintained and resolved critical issues in GKE workloads on GCP through comprehensive monitoring and diagnostics. Provided rapid incident response and implemented proactive system checks to enhance operational stability and overall system reliability metrics.

• Set up and managed GCP security through Cloud Armor implementation, focusing on robust DDoS attack prevention and mitigation. Strengthened multi-layer security controls across GCP security infrastructure while establishing automated threat response protocols.

• Implemented Puppet for configuration management across hybrid cloud environments, automating the provisioning and management of infrastructure resources on Azure and on-premises servers.

• Developed Puppet manifests and modules to define and enforce consistent server configurations, ensuring compliance with company policies and industry best practices for security and performance.

• Developed automation scripts in Python (core) using Puppet to deploy and manage Java applications across Linux servers.

• Built and optimized GCP security monitoring systems using Security Command Center for comprehensive threat visibility. Configured automated security responses and alert mechanisms throughout GCP security infrastructure, leading to faster incident detection and response times.

• Connected GCP Load Balancers with Managed Instance Groups to enable intelligent traffic distribution and handling. Implemented dynamic auto-scaling rules based on real-time traffic patterns and resource utilization metrics to optimize performance and cost efficiency.

• Set up and operated GKE clusters on GCP with emphasis on high availability and fault tolerance. Implemented comprehensive stability measures including automated proxy management, failover strategies, and blue-green deployment methods to ensure seamless application updates with minimal service impact.

• Created and refined application failover systems within GKE to handle service transitions during critical incidents and maintenance windows. Established automated health checks and failover triggers that improved system resilience and achieved 99.9% uptime across all services.

• Monitored and improved network performance for GCP VPC, meticulously configuring proxy settings, routing tables, and VPC peering to enhance cross-region communication efficiency and establish robust network redundancy.

• Implemented automated DNS record management through Terraform, creating consistent configurations across multiple environments and accelerating domain provisioning processes.

• Conducted comprehensive monitoring of DNS performance and usage, proactively identifying latency bottlenecks and optimizing DNS configurations to achieve faster, more reliable domain lookups.

• Developed operational resilience strategies using Load Balancing within GKE, ensuring optimal traffic distribution and intelligent resource allocation for high-traffic applications.

• Deployed Load Balancers for hybrid cloud infrastructure, effectively routing traffic between on-premise and GCP environments to create seamless, integrated network connectivity.

• Optimized incident response times through strategic release management, incident management, and production troubleshooting processes, deploying advanced monitoring tools to proactively detect and resolve critical system issues.

• Created and maintained comprehensive process documentation and detailed runbooks for operational tasks, standardizing troubleshooting procedures and establishing consistent incident response protocols, resulting in a 30% reduction in MTTR.

• Executed a complex migration from On-Prem to GCP, ensuring smooth data transitions and implementing automated monitoring processes to detect, track, and recover from potential failures throughout the migration lifecycle.

• Dramatically improved system observability by integrating monitoring and logging tools like Stackdriver, guaranteeing 24/7 availability for mission-critical applications and implementing robust SLOs/SLIs to comprehensively track service reliability.

• Conducted thorough post-incident reviews and implemented rigorous Root Cause Analysis (RCA) methodologies, driving continuous system improvements and systematically reducing issue recurrence.

• Configured GKE Ingress controllers and network policies to manage external traffic with advanced security measures and operational efficiency.

• Managed automated GKE cluster deployments using Helm charts and Terraform, creating consistent and repeatable infrastructure across multiple development and production environments.

• Developed and automated model training and evaluation processes in BigQuery ML, continuously improving machine learning model accuracy and performance through systematic testing and refinement.

• Connected BigQuery ML with Looker to visualize machine learning model outcomes, enabling stakeholders to extract actionable insights from predictive analytics.

• Implemented comprehensive health monitoring by setting up automated alerts and dashboards using Prometheus, Grafana, and Cloud Monitoring, enabling rapid detection and response to anomalies within GKE clusters.

• Worked directly with development teams to enhance system reliability, integrating continuous monitoring across CI/CD pipelines and production environments to improve overall system resilience.

• Collaborated with security teams to maintain data protection standards and configured BGP routes for ExpressRoute connections between on-premises data centers and Azure cloud infrastructure.

• Deployed and managed GitHub Enterprise instances, handling complete installation, configuration, and administrative responsibilities to support collaborative software development processes.

• Managed container deployments using Docker, creating Dockerfiles, establishing automated builds on Docker Hub, and utilizing Docker Compose for efficient multi-container provisioning.

• Used Puppet to define infrastructure as code for Azure compute resources, and automated backup and deployment processes on Kubernetes clusters running on AKS and GKE using Chef.

• Configured load balancers to manage traffic routing between Blue-Green and canary deployment environments, ensuring seamless and minimal-disruption user experiences during software updates.

• Selected and implemented NoSQL databases like MongoDB and DynamoDB to achieve high scalability, flexibility, and performance, while developing Python scripts using Pymongo to identify and alert stakeholders about long-running database queries.

• Optimized Google BigQuery data warehouses to ensure low-latency, high-performance queries for large-scale data analytics operations.

• Integrated PostgreSQL with cloud services, developing scalable database solutions for microservices and cloud-native applications, and ensuring high availability through robust replication and failover configurations.

• Connected Cloud Composer with Dataproc and BigQuery to automate complex ETL workflows, reducing manual intervention by 40% and improving overall data processing efficiency.

• Created automated monitoring and alerting systems for BigQuery and Dataproc jobs using Prometheus and Grafana, improving incident response times by 25% and ensuring timely execution of critical data pipelines.

• Maintained disaster recovery and high availability for Vertex AI deployments by managing failover mechanisms and implementing continuous monitoring to achieve 99.9% uptime for machine learning workflows.

• Wrote Python and Bash scripts to automate the creation of Amazon Machine Images (AMIs) and database backup processes.

• Implemented Kafka as the primary streaming data platform for handling high-throughput event data.

• Configured Splunk OpenTelemetry (OTel) collectors for comprehensive logging and monitoring across GCP and Azure platforms, and deployed Splunk agents as sidecar containers for Kubernetes services.

• Established centralized monitoring and logging systems for cloud and on-premises environments using New Relic and Azure Monitor, tracking changes in CI/CD pipelines to provide comprehensive insights and enable proactive system management.

• Automated infrastructure deployment using Terraform, developing custom Terraform modules that enable scalable GCP resource management and enforce infrastructure-as-code best practices to minimize operational risks.

Environment: Azure, GCP, GKE, Chef, Puppet, Docker, Kubernetes, Istio, Jenkins, PowerShell, Python, Bash, GitHub, CI/CD, AppDynamics, Grafana, Prometheus, Dynatrace, ELK, MongoDB, VPCs, DNS, ARO, IAM, ServiceNow, Terraform, New Relic, Shell/Perl Scripts, Bitbucket, TFS, SCM, API, GI, Tomcat, Java, Azure TFS, Azure VSTS, Visual Studio, Visual Studio Code, Checkmarx, Git Bash.

Walgreens 03/2017 – 12/2019

DevOps Engineer Deerfield, IL

• Designed and implemented highly available and fault-tolerant cloud architectures on GCP using services like Compute Engine, Cloud Storage, and Cloud SQL, optimizing performance and cost-efficiency for large-scale deployments.

• Automated infrastructure provisioning and management on GCP using Terraform, creating reusable modules for VPC networks, subnets, firewalls, IAM roles, and other resources, ensuring consistent and reliable environments.

• Developed and maintained CI/CD pipelines using Jenkins and Google Cloud Build, integrating with GCP services to automate build, test, and deployment processes for applications and infrastructure, enhancing deployment speed and reliability.

• Deployed and managed GKE clusters, automating container orchestration, scaling, and management of microservices architectures, ensuring high availability and efficient resource utilization.

• Created custom Terraform providers and plugins to extend Terraform functionality and integrate with internal systems and APIs, enhancing automation capabilities and supporting unique infrastructure requirements.

• Managed the source code repositories of multiple applications under development using SVN and Git version control...


