
Site Reliability Engineer

Location:
Dallas, TX
Posted:
February 22, 2024



Vamsi Krishna Email: ad3uc3@r.postjobfree.com

Sr. Site Reliability Engineer Phone: 209-***-****

Professional Summary:

Dedicated and highly skilled Site Reliability Engineer with a proven track record of 10+ years ensuring the availability, performance, and scalability of complex, distributed systems across multiple cloud environments.

Adept at blending deep technical expertise with strong problem-solving abilities to design, implement, and maintain robust and resilient infrastructure solutions.

Committed to delivering exceptional reliability and optimizing system performance while reducing downtime and mitigating risks.

Proficient in managing and optimizing cloud services on major providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

Expertise in using tools like Terraform and Ansible to define and automate infrastructure deployment, configuration, and maintenance.

Skilled in container technologies (Docker) and orchestrators like Kubernetes for building and managing scalable and portable applications.

Proficient in setting up monitoring solutions with tools like Prometheus, Grafana, and Elasticsearch, and creating intelligent alerting systems.

Experience with continuous integration and continuous delivery (CI/CD) pipelines using tools like Jenkins, GitLab CI/CD, and CircleCI.

Proficient in scripting languages (Python, Bash) and capable of writing code to automate routine tasks and improve system reliability.

Expertise in configuring load balancers and implementing auto-scaling strategies for ensuring high availability and optimal resource utilization.

Knowledge of cloud security best practices, identity and access management, and compliance standards (e.g., HIPAA, PCI DSS).

Skilled in incident response, root cause analysis, and post-incident reviews to minimize downtime and improve system resilience.

Effective communicator and team player, capable of collaborating with cross-functional teams to ensure smooth operations and timely resolution of issues.

Proficient in creating comprehensive documentation for system configurations, procedures, and best practices to facilitate knowledge sharing and onboarding.

Proficient in devising disaster recovery plans and implementing backup and restoration strategies to ensure business continuity in case of unexpected outages or data loss.

Skilled at optimizing cloud resource utilization and reducing operational costs through monitoring, right-sizing instances, and implementing cost-effective solutions.

Experienced in defining, tracking, and meeting SLOs and SLIs to maintain high service reliability and user satisfaction.

Capable of identifying and addressing performance bottlenecks in applications and infrastructure components to enhance system responsiveness and efficiency.
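The SLO/SLI work summarized above can be made concrete with a small sketch. Below is a minimal, illustrative Python example (the function names and the 99.9% target are assumptions for illustration, not taken from this resume) of computing an availability error budget:

```python
# Illustrative sketch: error-budget math for an availability SLO.
# A 99.9% SLO over a 30-day window allows 43.2 minutes of downtime.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for a given SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative = SLO breached)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget
```

Tracking the remaining budget is what turns an SLO into an operational signal: when the fraction approaches zero, reliability work takes priority over feature releases.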

Technical Experience:

Cloud

AWS, Azure, GCP

Configuration Management tools

AWS CloudFormation, Ansible, Jenkins, Terraform, Chef

Build Tools

ANT, Maven, Docker, Google Cloud Build

Container Tools

Docker, AWS Elastic Kubernetes Service (EKS), AWS Elastic Container Service (ECS), Azure Kubernetes Service (AKS), Kubernetes

Version Control Tools

Git, GitHub, SVN

Log and monitoring tools

AWS CloudWatch, AWS CloudTrail, Azure Monitor, Application Insights, Azure Log Analytics, Google Cloud Monitoring, Google Cloud Logging, Prometheus

Scripting

Perl, Python, Java, Bash, PowerShell, Shell scripting

Databases

SQL, PL/SQL, Amazon RDS, DynamoDB, Redshift, Azure Databricks, Azure Data Factory, Azure SQL Database, Google Cloud Storage, MySQL

Application Servers

Apache, Tomcat, Azure VMs, WebLogic

Operating Systems

Linux, Unix, Ubuntu

CI/CD Tools

Jenkins, Azure DevOps, Google Cloud Build, AWS CodePipeline, AWS CodeBuild

Build Automation Tools

Jenkins, Git, Google Cloud Functions, Puppet, Google Cloud Run, Azure Automation, AWS Lambda

Professional Experience:

US Courts Administrative Office, Washington, DC Mar ’21 – Present

Sr. Site Reliability Engineer

Responsibilities:

●Created, configured, and maintained core Amazon Web Services (AWS) resources, including EC2 instances, storage, and databases, ensuring they perform well and scale on demand.

●Designed and implemented pipelines that automatically build, test, and deliver software using AWS tooling, enabling quick and reliable software updates.

●Expert in defining AWS infrastructure as code, making it easy to provision and maintain resources consistently.

●Experienced in using containers (like Docker) and tools like AWS Elastic Kubernetes Service (EKS) or Amazon Elastic Container Service (ECS) for deploying and managing applications efficiently.

●Set up real-time monitoring and alerting with AWS CloudWatch to track resource health and security, and AWS CloudTrail to audit account activity.

●Proficient in securing AWS environments through encryption, network protection, and fine-grained access control with AWS Identity and Access Management (IAM).

●Balanced responsibilities on AWS projects while providing crucial support to a parallel team's Azure deployments, demonstrating adaptability and multi-cloud proficiency in a hybrid cloud environment.

●Spearheaded automation initiatives for seamless deployment and monitoring of Hadoop clusters using tools like Ansible and Puppet. Enhanced system reliability and reduced downtime by implementing proactive alerting and monitoring solutions, ensuring optimal performance of Big Data infrastructure.

●Collaborated with cross-functional teams to identify and implement optimizations in Big Data workflows. Worked closely with data engineers and scientists to fine-tune algorithms, improving data processing speed and resource utilization in the Hadoop environment.

●Proficient in ERM (Enterprise Risk Management), integrating risk identification and mitigation practices into DevOps processes to strengthen operational stability and security.

●Used PySpark for data ingestion, processing, and analysis, optimizing data pipelines, improving scalability, and enabling data-driven decisions that improved system performance.

●Collaborated with finance and accounting teams to integrate FinOps principles into budgeting and financial reporting processes, ensuring alignment with organizational goals.

●Experience in Golang Development, with expertise in writing clean and efficient Go code for building AWS infrastructure automation tools and microservices.

●Strong experience in using Go to interact with AWS services through AWS SDKs and APIs, automating infrastructure provisioning and management.

●Implemented robust CI/CD pipelines using Go and AWS CodePipeline, AWS CodeDeploy and container technologies Docker, Kubernetes to ensure seamless software delivery and deployment on AWS cloud environments.

●Orchestrated Kubernetes clusters for the client's infrastructure, optimizing performance and scalability and achieving a 30% reduction in downtime.

●Implemented automated monitoring and alerting solutions, enhancing system reliability and enabling proactive issue resolution, minimizing client disruptions by 40%.

●Handled migration of legacy systems to Kubernetes, streamlining deployment processes and reducing resource overhead, resulting in a 50% improvement in infrastructure efficiency for the client.

●Continuously tracked and reported on FinOps KPIs, such as Cost-to-Run (C2R) ratios, to measure the financial efficiency of DevOps initiatives and drive cost-conscious decision-making.

●Experienced in Azure cloud services, managing and deploying applications, contributing to the organization's Azure projects, and maintaining a strong Azure skill set.

●Skilled in AWS cost-optimization strategies, including right-sizing resources, setting budgets, and acting on cost-saving recommendations.

●Created and maintained Helm charts to streamline and automate Kubernetes deployments, ensuring efficient container orchestration and seamless application scaling.

●Worked on enhancing deployment workflows by leveraging Helm to manage the lifecycle of containerized applications, optimizing infrastructure resources and minimizing downtime.

●Implemented Helm charts as part of Infrastructure as Code initiatives, enabling version-controlled, repeatable deployments and fostering collaboration between development and operations teams for improved project agility.

●Proficient in conducting Failure Mode Analysis to assess application behavior during hardware failures, ensuring robustness and continuity of processing.

●Used Failure Mode Analysis techniques to identify potential points of failure, implementing preemptive measures to enhance system resilience and minimize downtime.

●Extensive experience in AutoSys, leveraging its capabilities to streamline and automate job scheduling and execution, enhancing operational efficiency.

●Utilized AutoSys to orchestrate complex workflows, ensuring timely and reliable execution of critical tasks, and optimizing resource utilization.

●Skilled in utilizing ServiceNow to streamline IT service management processes, improving service delivery and enhancing user satisfaction.

●Leveraged ServiceNow to facilitate incident, problem, and change management, enabling proactive resolution of issues and ensuring seamless IT operations.

●Proficient in using JIRA for agile project management, facilitating collaboration, tracking tasks, and ensuring project milestones are met effectively.

●Utilized JIRA to track and manage software development processes, improving team productivity, and enhancing project visibility and transparency.

●Working on AI/ML model integration into DevOps workflows, covering deployment, monitoring, version control, and cost management.

●Proficient in using C++ for developing robust and efficient automation tools and scripts to streamline deployment processes, configuration management, and infrastructure provisioning.

●Skilled in utilizing C++ to optimize performance-critical components of DevOps tools and systems, enhancing scalability, reliability, and resource utilization in cloud and on-premises environments.

●Experienced in integrating C++ libraries and frameworks into DevOps workflows for tasks such as monitoring, logging, and system administration, contributing to the seamless orchestration and management of complex infrastructure setups.

●Efficient management of data pipelines to ensure clean data input for ML models, facilitating collaboration with data scientists.

●Implementation of AI-driven automation for proactive issue resolution, cost optimization, and improved system performance, alongside model explainability techniques.

●Expertise in multi-cloud environments, enabling seamless AI/ML deployment across various cloud platforms and adaptability within hybrid cloud configurations.

●Planned for worst-case scenarios and tested disaster-recovery strategies to keep AWS systems running through failures.

●Used scripting languages like Python and Bash to automate tasks and build custom solutions to save time and reduce errors.

●Automated routine tasks and system maintenance using PowerShell, Azure CLI, and Python scripts, reducing manual intervention.

●Kept detailed documentation of AWS system configurations and troubleshooting procedures, making it easy for the team to understand and learn.

●Proficient in using AWS Lambda and AWS Serverless Application Model (SAM) for serverless computing, which reduces the need for managing servers.

●Experienced in handling AWS database services like Amazon RDS, DynamoDB, and Redshift for storing and accessing data.

●Implemented robust security measures using HashiCorp Vault to safeguard sensitive data and credentials, enhancing overall system resilience and compliance.

●Streamlined deployment workflows by integrating HashiCorp Vault for secret management, optimizing secure access and authentication processes across multiple environments.

●Expertise in Splunk for real-time monitoring and log analysis, enabling proactive troubleshooting and incident response, optimizing system performance, and ensuring high availability in complex DevOps environments.

●Designed and implemented custom Splunk dashboards and alerts to streamline operational insights, driving efficient decision-making, and contributing to cost reduction through resource optimization.

●Deployed and managed continuous delivery pipelines using ArgoCD, automating application deployments and ensuring consistency and reliability across multiple environments.

●Orchestrated seamless automation of secrets management with HashiCorp Vault, significantly reducing manual intervention and bolstering system scalability.

●Optimized ArgoCD configurations to seamlessly integrate with Kubernetes clusters, enabling efficient resource synchronization and enhancing application management and scalability.

●Built systems that automatically scale out under increased load and scale back in when demand drops, using AWS Auto Scaling and Amazon CloudWatch.

●Used automated checks to verify that AWS resources are configured correctly and comply with security and configuration policies.

●Followed best practices for version control and keeping infrastructure and application configurations organized and consistent.

●Worked closely with development, quality assurance, and operations teams to keep cross-functional collaboration running smoothly.

●Worked with multiple cloud platforms, including AWS, and integrated them when needed.

Environment: Amazon Web Services (AWS), Docker, AWS Elastic Kubernetes Service (EKS), Amazon Elastic Container Service (ECS), AWS CloudWatch, AWS CloudTrail, AWS Identity and Access Management (IAM), AWS Lambda, AI/ML, AWS Serverless Application Model (SAM), Jupyter, Amazon RDS, Amazon DynamoDB, Amazon Redshift, AWS Auto Scaling, ERM, Infrastructure as Code (IaC) tools like AWS CloudFormation and Terraform, Git, Python and Bash scripting, ArgoCD, Spark, PySpark, AutoSys, ServiceNow, Jira, SageMaker, HashiCorp Vault, Hadoop.
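As a sketch of the auto-scaling work described in this role, the following illustrative Python function mirrors target-tracking logic (capacity scales in proportion to how far a metric sits from its target). The names, bounds, and formula here are an illustration of the technique, not AWS internals:

```python
import math

def desired_capacity(current_capacity: int, metric_value: float,
                     target_value: float, min_cap: int = 1,
                     max_cap: int = 100) -> int:
    """Desired instance count under target tracking, clamped to bounds.

    Scales capacity proportionally so the average metric (e.g. CPU %)
    returns to its target value.
    """
    raw = math.ceil(current_capacity * metric_value / target_value)
    return max(min_cap, min(max_cap, raw))

# 4 instances averaging 90% CPU against a 50% target -> scale out to 8.
```

The clamp to `min_cap`/`max_cap` is what keeps a noisy metric from driving runaway scale-out or scaling a service to zero.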

Air Products, Allentown, PA Jan ’20 – Dec ’20

Site Reliability Engineer

Responsibilities:

●Orchestrated deployment and management of containers using Docker and Kubernetes on Azure Kubernetes Service (AKS).

●Leveraged Terraform to define and manage Azure infrastructure as code (IaC) for scalable and reproducible deployments.

●Collaborated with development teams to implement Git-based version control and integrate Git workflows with Azure DevOps.

●Implemented Azure Logic Apps and Azure Functions for serverless computing and event-driven automation.

●Integrated Azure Active Directory (Azure AD) for secure identity and access management (IAM) across Azure services.

●Utilized Azure Key Vault for centralized management of secrets and keys, enhancing security and compliance.

●Implemented Ansible automation on Red Hat Enterprise Linux (RHEL) to streamline system configurations, ensuring consistency and reducing manual intervention.

●Enhanced RHEL security posture by implementing and maintaining security policies, including SELinux configurations and regular system patching, to meet compliance standards.

●Utilized performance tuning techniques on RHEL servers, optimizing resource utilization, and ensuring robust performance for critical applications and services.

●Deployed and configured Azure Site Recovery (ASR) for disaster recovery planning and failover of critical workloads. Designed and deployed robust, scalable Big Data solutions using Hadoop ecosystem tools like HDFS, MapReduce, Hive, and Spark and streamlined data processing workflows, resulting in improved efficiency and reduced time-to-insights.

●Worked on seamless integration of Apache Spark into the DevOps pipeline, enhancing data processing efficiency and reducing latency and Implemented optimization strategies for Spark jobs, resulting in significant performance improvements and resource utilization.

●Designed and implemented automated data pipelines using Apache Spark, streamlining data workflows and ensuring timely and reliable data processing. Leveraged Spark's capabilities to achieve scalability, fault tolerance, and parallel processing, contributing to overall system reliability and efficiency.

●Implemented and optimized CI/CD pipelines on OpenShift, streamlining the development and deployment lifecycle of containerized applications.

●Implemented and maintained end-to-end CI/CD pipelines for ML models using tools like Docker, Kubernetes, and Jenkins, enabling automated model training, testing, and deployment, reducing release cycles by 40%.

●Utilized Infrastructure as Code (IaC) principles with Terraform and Ansible to provision and scale ML infrastructure on Azure, optimizing resource allocation and reducing operational costs by 30%.

●Integrated ML monitoring and observability into existing DevOps practices, leveraging tools like Prometheus and Grafana to proactively detect model performance issues, resulting in a 20% reduction in downtime.

●Collaborated with data scientists to containerize ML workflows, ensuring reproducibility and consistency across environments, leading to improved model deployment success rates and faster time-to-market for ML applications.

●Utilized OpenShift's Kubernetes orchestration to manage scalable and resilient containerized workloads, ensuring high availability and efficient resource utilization.

●Used the Operator Framework on OpenShift to automate complex application management tasks, enhancing operational efficiency and maintaining consistency in deployments.

●Experienced in operating microservices on a global scale, addressing challenges related to distributed systems, geographical diversity, and ensuring seamless deployment across different regions.

●Worked on optimizing performance and reliability of microservices in diverse environments, leveraging techniques such as load balancing and efficient resource utilization.

●Worked on operational realities of microservices ecosystems, involving continuous monitoring, troubleshooting, and fine-tuning for optimal performance in a production setting.

●Skilled in writing, debugging, and maintaining Java code for developing custom automation scripts and tools to streamline DevOps processes.

●Extensive experience integrating and configuring Java-based DevOps tools like Jenkins, Maven, and Gradle to enhance CI/CD pipelines and infrastructure automation.

●Monitored and maintained Azure resources using Azure Monitor, Application Insights, and Azure Log Analytics.

●Managed Azure Virtual Machines (VMs) through Azure Automation and Azure Update Management for patching and updates.

●Implemented Prometheus and Grafana for real-time monitoring, alerting, and visualization of infrastructure and application metrics.

●Created custom Prometheus exporters and Grafana dashboards to track key performance indicators (KPIs) and ensure system stability.

●Set up alerting rules in Prometheus and integrated them with PagerDuty for immediate incident response.

●Reduced downtime by 30% through proactive monitoring and alerting in Grafana, ensuring faster issue resolution.

●Implemented Azure Data Factory and Azure Databricks for data integration, ETL, and analytics.

●Designed and configured Azure Virtual Networks, ExpressRoute, and VPN connections to ensure secure and efficient network architecture.

●Utilized Azure Load Balancer and Application Gateway for load distribution and traffic management.

●Used Terraform, Ansible, and Confluence to automate and document infrastructure configurations, procedures, and best practices.

●Collaborated with cross-functional teams, conducted code reviews, and provided technical expertise to optimize development and deployment workflows.

●Demonstrated strong problem-solving skills in diagnosing and resolving Azure infrastructure and application issues.

●Managed Azure costs and optimized resource allocation to ensure cost-effectiveness and efficiency.

Environment: Docker, Kubernetes, Azure Kubernetes Service (AKS), Terraform, Git, Azure DevOps, PowerShell, Azure CLI, Java, Azure Logic Apps, Azure Active Directory (Azure AD), Azure Key Vault, Azure Site Recovery (ASR), Azure Blueprints, Azure Monitor, Azure Log Analytics, Azure Automation, Azure Data Factory, Azure Databricks, Azure Virtual Networks, ExpressRoute, Azure Load Balancer.
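The Prometheus/PagerDuty alerting described in this role typically requires a condition to hold for a sustained window before an alert fires (Prometheus's `for:` clause). A minimal, illustrative Python sketch of that behavior (not actual Prometheus internals; names are assumptions):

```python
def alert_fires(samples: list[float], threshold: float,
                for_samples: int) -> bool:
    """True only if the last `for_samples` readings ALL exceed `threshold`.

    Mimics a `for:`-style rule: a single spike does not page anyone;
    the condition must persist across the whole window.
    """
    if len(samples) < for_samples:
        return False
    return all(v > threshold for v in samples[-for_samples:])
```

This debouncing is the main lever for cutting alert noise: the longer the window, the fewer false pages, at the cost of slower detection.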

GameStop, Dallas, TX Oct ’16 – Dec ’19

Site Reliability/DevOps Engineer

Responsibilities:

●Proficient in efficiently deploying and managing various Google Cloud Platform (GCP) resources, including virtual machines, scalable storage solutions, networking components, and database services.

●Experienced in the orchestration and maintenance of Kubernetes clusters using Google Kubernetes Engine (GKE) for streamlined containerized application management, scaling, and deployment.

●Proficiently employ Infrastructure as Code (IaC) principles through tools like Terraform, ensuring the consistent and automated provisioning of GCP resources.

●Design, implement, and maintain robust CI/CD pipelines, harnessing the power of Google Cloud Build and other GCP services to automate application deployment workflows.

●Hardened GCP environments by defining IAM policies governing who can do what, enforcing encryption to protect data, and configuring firewall rules to block unwanted access.

●Proficient in using Java to write Infrastructure as Code (IaC) scripts, such as using libraries like AWS SDK for Java, to automate cloud infrastructure provisioning and management.

●Experienced in leveraging Java for developing custom monitoring and alerting solutions to proactively identify and address issues in DevOps environments, ensuring high availability and reliability.

●Implemented automated monitoring and alerting solutions with Dynatrace to proactively identify performance issues, reducing incident response times and ensuring high application uptime.

●Collaborated with cross-functional teams to optimize application performance, utilizing Dynatrace insights to fine-tune code, infrastructure, and configuration, resulting in improved application speed and efficiency.

●Led Dynatrace migration and upgrade projects, seamlessly transitioning from on-premises to cloud-based solutions, enhancing scalability and reducing infrastructure costs while maintaining high-level performance visibility.

●Employ cost-saving strategies, including resource rightsizing and meticulous budget management, to maximize resource utilization while minimizing expenses.

●Develop comprehensive disaster recovery plans and execute strategies to ensure high availability and data resilience within GCP environments.

●Maintained meticulous documentation detailing GCP infrastructure configurations, operational procedures, and troubleshooting guidelines, facilitating effective knowledge sharing.

●Experienced in leveraging Google Cloud Functions and Google Cloud Run for serverless computing, reducing operational overhead, and optimizing resource utilization.

●Skilled in the setup and management of Google Cloud Storage solutions, providing scalable and cost-effective storage for diverse needs.

●Experienced with Istio and other service mesh technologies on GCP for enhanced application communication, observability, and security.

●Implemented Prometheus to enhance infrastructure monitoring, enabling custom metrics collection and advanced alerting capabilities.

●Implemented advanced deployment strategies like blue-green deployments and canary releases to minimize downtime and deployment risks.

●Utilized GitOps principles and tools like Argo CD for efficient management and synchronization of infrastructure and application configurations stored in Git repositories.

●Collaborated seamlessly with cross-functional teams and maintained comprehensive documentation to facilitate knowledge sharing and smooth onboarding.

●Exhibited proficiency in incident response and resolution, minimizing service disruptions and downtime during critical incidents.

Environment: Google Cloud Platform (GCP), Google Kubernetes Engine (GKE), Terraform, Google Cloud Build, Google Cloud Monitoring, Google Cloud Logging, Java, Bash Scripting, Google Cloud Functions, Google Cloud Run, Google Cloud Storage, Dynatrace, Istio, Argo CD
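Canary releases like those mentioned in this role shift traffic to the new version in increments, with a bake period between steps. A minimal, illustrative Python sketch of such a step schedule (the step sizes and names are assumptions for illustration):

```python
from typing import Iterator

def canary_steps(step_pct: int = 10, final_pct: int = 100) -> Iterator[int]:
    """Yield the cumulative traffic percentages for each canary step.

    Each yielded value is the share of traffic on the new version;
    a rollout pauses at each step to watch error rates before advancing.
    """
    pct = 0
    while pct < final_pct:
        pct = min(final_pct, pct + step_pct)
        yield pct

# e.g. canary_steps(25) walks traffic through 25%, 50%, 75%, 100%.
```

In practice each step gates on SLO-style health checks; if error rates rise during a bake period, traffic is rolled back to the previous step instead of advancing.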

MedPlus, Hyderabad, India Jul ’13 – Aug ’16

Build and Release Engineer

Responsibilities:

●Managed source code repositories, build, and release processes.

●Supported daily development, testing, and production deployments.

●Collaborated with an agile team for CI/CD in an open-source environment using Ansible and Jenkins.

●Configured server settings for Apache, MySQL, and WebLogic with Ansible.

●Authored Perl and Bash scripts for build and integration.

●Established software build and release best practices.

●Transitioned project dependencies from ANT to Maven.

●Wrote, debugged, and troubleshot PL/SQL code.

●Administered Linux systems hosting build and release applications.

●Orchestrated cloud-based product builds using PowerShell, TFS, and Python.

●Set up and configured Jenkins for automation.

●Participated in release meetings and managed release calendars.

●Managed source code repositories, branches, and code-freeze processes.

●Worked with Puppet for configuration management, creating custom modules for various components.

Environment: CI/CD, Ansible, Jenkins, Apache, MySQL, WebLogic, Perl, Bash, ANT, Maven, PL/SQL, Linux, PowerShell, TFS, Python, Puppet.

EDUCATION

●Bachelor's in Computer Science from KKR & KSR Institute of Technology & Sciences
