Oswaldo Villa
Lead DevOps / Site Reliability Engineer / DevOps Engineer
********@*****.***
https://www.linkedin.com/in/villatux/
San Jose CA, 95129
EDUCATION: Bachelor’s degree in Computer Systems Engineering. Certifications: see badges.
SUMMARY: Experienced Site Reliability Engineer with over 12 years of hands-on experience in designing, implementing, and managing large-scale cloud infrastructure. Strong background in Linux systems administration and deep knowledge of CI/CD pipelines, automation, and DevOps practices. Proven track record of optimizing deployment processes and ensuring high availability, reliability, and performance in complex, dynamic environments.
EXPERIENCE
Site Reliability Engineer at TikTok
Location: (Mountain View CA) www.tiktok.com August 2022 - Jan 2025 Technologies: Linux, Docker, Golang, Bash, SQL, Redis, Hive, HDFS, Load Balancers, Gitlab/Github, GCP, OCI, Kibana, MySQL, Grafana, Kubernetes, Nginx, ByteCycle, Trivy, Gitleaks, Istio, ElasticSearch, Linters, HashiCorp Vault, OPA, WatchDog, Argos, Postman, Kerberos, Python
■ CI/CD
● Designed, built, and maintained CI/CD pipelines using ByteCycle (a solution similar to GitLab CI/Bitbucket) to automate application deployments and streamline SRE and developer work.
● Implemented security best practices in pipelines, including image scanning with Trivy, vulnerability scanning, static code analysis, and automated secret detection with Gitleaks and custom scripts.
● Optimized deployment strategies for zero-downtime releases using canary releases, feature flags, and blue-green deployments.
● Managed the end-to-end software delivery lifecycle with automated build, test, and deployment processes across multiple environments, using Docker, Go, and ByteCycle, with Bash scripting for testing.
● Created rollback mechanisms in pipelines to ensure quick recovery from failed deployments.
■ INFRASTRUCTURE
● Deployed and managed Kubernetes clusters in multiple regions on cloud platforms such as OCI and GCP and in TikTok data centers.
● Configured auto-scaling policies to dynamically adjust the cluster size based on workload demand, improving resource utilization.
● Designed and implemented disaster recovery strategies with multi-region failover policies for high availability.
● Automated cluster management tasks, including node scaling, configuration updates, and version upgrades.
● Designed and implemented a deployment and tracking solution to speed up service deployments in new data centers, reducing deployment time for the services I own from 1 month to 5 days (Golang, Bash, MySQL, Grafana, ByteCycle, ByteCode).
● Reduced SRE production workload by configuring new parameters for automatic migration/deletion of bad or stuck pods, so the cluster replaces them with new ones.
■ SCRIPTING & DEVELOPMENT
● Developed Bash automation scripts to reduce manual intervention in mass CVE vulnerability checks and fixes.
● Built a Golang application that scans all repositories (270+) for hardcoded credentials and security vulnerabilities before those services are deployed to new data centers, stores the results in MySQL, and tracks progress in Grafana, reducing deployment time from 1 month to 5 days.
● Developed internal tools in Python to integrate existing legacy scripts.
● Created a chatbot in Golang that integrates with TikTok Pager to report the top High/Medium severity alerts from the last 24 hours in real time.
● Developed a custom API log masking tool in Golang with a web UI to mask user data, supporting security compliance with the US-TikTok agreements and preventing data from being shared with employees outside the United States.
● Created and standardized a Docker container as our primary development environment, with all the SDKs, languages, and libraries we support.
■ ALERTING & MONITORING
● Configured and managed TSDB and SQL data sources in Grafana for metric collection and proactive issue notification.
● Created centralized Grafana dashboards displaying key metrics across services, clusters, and 3 regions.
● Developed SQL-based reports on alarm efficiency and response times, improving on-call performance tracking.
● Configured WatchDog and Argos alerts for anomaly detection across 180+ services, reducing mean time to detect (MTTD).
● Managed telemetry with Istio (ByteDance version in Argos) for distributed tracing of the service mesh to diagnose performance bottlenecks in microservices.
● Built Kibana dashboards to visualize logs from Elasticsearch, enabling real-time monitoring of application and infrastructure performance.
■ SECOPS
● Implemented policy-as-code using OPA (Open Policy Agent) and Conftest to enforce security compliance for infrastructure provisioning in some pipelines.
● Configured and implemented a self-hosted HashiCorp Vault for dynamic secrets management, reducing the risk of credential leaks in 3 internal tools.
● Enforced container security best practices with security scanning (SAST/DAST) tools such as Trivy and linters such as golangci-lint and ShellCheck.
● Conducted periodic CVE vulnerability assessments and patched container images accordingly.
■ PROJECT MANAGEMENT & WORKFLOW AUTOMATION
● Designed and maintained project management workflows in JIRA, improving task tracking and collaboration.
● Documented multiple runbooks for operations and tasks on the Wiki.
● Conducted post-mortem analysis on major incidents and proposed long-term fixes to prevent recurrence.
● Enforced daily Scrum meetings and DevOps best practices to improve visibility across teams.
■ PRODUCTION SUPPORT
● Provided 12/7 global on-call support across three regions (Asia, Europe, and America) with overlapping coverage, ensuring high availability and reliability of User, Location, and Search services.
● Partnered with law enforcement and government agencies to securely analyze metadata, aiding in risk assessment and threat mitigation while ensuring adherence to data privacy policies.
■ KEY ACHIEVEMENTS
● Engineered a scalable deployment and tracking system to streamline data center expansions, reducing deployment time and enhancing real-time tracking of infrastructure rollout.
● Designed and implemented a centralized Grafana dashboard providing a real-time overview of all services, reducing incident response time and enabling proactive issue detection.
DevOps Lead at Flex LLC
Location: (Milpitas CA) www.flex.com Jun/2021 - Aug/2022 Technologies: Linux, Docker, Bash, PHP, Python, Nginx, Ansible, BIND9, NFS, Gitlab CI, ServiceNow, Teams, Azure-cloud, GCP, Terraform, Splunk, Salt-Stack, Kubernetes, Docker-Swarm, Load Balancers, GSLB Big IP, Github Actions, Selenium, PyLint, ShellCheck, Trivy, Django, Prometheus, Grafana, HashiCorp Vault, SwaggerUI, Node Exporter, Nikto, Postman, Kerberos
■ CI/CD
● Designed and implemented CI/CD pipelines using GitLab CI for Net SecOps applications, including automated build, test, and deployment stages.
● Converted all legacy applications to CI/CD workflows, reducing manual intervention and improving deployment efficiency.
● Integrated automated testing (unit, integration, and end-to-end) into pipelines using Selenium, Trivy, and linters such as PyLint and ShellCheck to ensure code quality and reliability.
● Migrated projects from manual deployment processes to automated pipelines, reducing deployment errors by more than 80%.
● Designed and automated Blue-Green deployments with GitLab CI/CD, enabling safe release rollouts, minimizing service disruptions, and providing instant rollback capabilities.
■ INFRASTRUCTURE
● Managed Docker Swarm clusters and Kubernetes for container orchestration, ensuring high availability and scalability for 7+ applications.
● Automated server provisioning and configuration using Ansible scripts, standardizing software versions across RedHat servers.
● Administered multi-cloud infrastructure (Azure, GCP) using Terraform for deploying and scaling applications.
● Created Terraform modules for on-demand environment creation: applications are deployed to Azure Cloud for testing and validation, and the environment is removed once the developer confirms in a manual step of the GitLab CI/CD pipeline.
● Upgraded and patched servers and containers based on CVE vulnerabilities and internal security audit reports.
● Implemented and configured monitoring for Docker Swarm clusters using Prometheus and Grafana to handle traffic spikes and optimize resource usage by scaling accordingly.
■ SCRIPTING & DEVELOPMENT
● Developed Bash and Python scripts for automating vulnerability scans, syntax checks, and health checks across multiple languages (Python, Bash, PHP).
● Developed a HashiCorp Vault token auto-renewal tool in Bash for seamless integration into CI/CD pipelines.
● Built Ansible playbooks for automating server configurations, application deployments, and infrastructure updates.
● Collaborated on the development of a DNS management tool with a Python/Django UI and REST API to replace manual script-based processes, improving usability and efficiency in managing all record types on Flex LLC domains.
■ ALERTING & MONITORING
● Created and customized Grafana dashboards to monitor application health metrics of nodes (CPU, RAM, disk usage, traffic, etc.).
● Configured alerts using Prometheus Alertmanager for production and QA environments to proactively identify and resolve issues.
● Integrated SwaggerUI with API gateway, enabling real-time API health monitoring and automated documentation updates, reducing manual effort by 50%.
● Developed a reporting dashboard using SQL and Grafana for on-call alarms, tracking top alarms.
● Integrated Prometheus and Node Exporter for real-time monitoring and alerting of containerized applications.
■ SECOPS
● Planned and Implemented self-hosted HashiCorp Vault for secure storage and management of secrets, database credentials, and API tokens.
● Configured isolation on nodes/containers by creating and managing user-defined bridge networks and subnets to isolate application containers, ensuring controlled inter-container communication and restricting external access.
● Conducted vulnerability scans (SAST/DAST) using Nikto and applied security patches for servers and containers based on CVE advisories.
● Enforced zero-trust policies by integrating HashiCorp Vault authentication into CI/CD pipelines and application workflows.
● Automated security scans for container images using Trivy to ensure compliance with security standards.
■ PROJECT MANAGEMENT & WORKFLOW AUTOMATION
● Led the transformation to a DevOps culture by implementing Agile + SAFe frameworks, leveraging SCRUM methodologies in Jira to enhance collaboration and delivery efficiency. Introduced and integrated multiple DevOps tools to streamline workflows, ensuring alignment across teams towards a unified objective.
● Documented and demoed MVP workflows to showcase the benefits of DevOps practices, including faster deployments and improved collaboration.
● Organized cross-functional meetings with management and development teams to align on DevOps adoption and workflow changes.
● Automated repetitive tasks using Bash, Python and Ansible scripts, reducing manual effort and operational workload.
● Created ready-to-use runbooks on the Wiki and Ansible playbooks for incident response and deployment processes, ensuring consistency and knowledge sharing.
■ PRODUCTION SUPPORT
● Provided 8/5 production support for critical applications using ServiceNow, resolving incidents and ensuring high availability.
● Conducted root cause analysis (RCA) for production issues and implemented preventive measures to avoid recurrence.
● Managed on-call rotations and ensured timely resolution of alerts and incidents using ServiceNow.
● Automated health checks for applications and infrastructure using Prometheus + Grafana, to proactively identify and resolve issues.
● Collaborated with QA teams to identify and resolve issues before they reached production by integrating Selenium and Postman tests into pipelines.
■ NETWORKING
● Administered DNS and GSLB configurations using BIND9 and F5 BIG-IP for internal tools, ensuring high availability and failover capabilities.
● Configured F5 BIG-IP load balancers for traffic distribution and caching across multiple regions.
● Created and managed subnets for Dockerized applications and their nodes on Azure Cloud, and applied user-defined Docker networks to enforce network isolation and security.
● Repointed domains and subdomains using BIND9 and F5 BIG-IP to align with infrastructure changes and application requirements.
■ KEY ACHIEVEMENTS
● Reduced deployment failures by 80% through automated CI/CD pipelines for all existing applications and rollback strategies using GitLab CI.
● Improved team productivity (project work) by automating repetitive tasks and implementing DevOps best practices using Ansible + Bash + CI/CD pipelines.
● Significantly improved team collaboration by implementing the Agile + SAFe frameworks.
● Enhanced application security by integrating HashiCorp Vault and enforcing zero-trust policies.
● Designed and implemented a centralized monitoring system using Grafana and Prometheus, allowing the team to proactively address infrastructure needs.
DevOps Engineer at NXP Semiconductors
Location: Guadalajara Jalisco Mexico www.nxp.com Sept/2019 - Jun/2021 Technologies: Linux, KDE, Ansible, Docker/Swarm, Bash, PHP, NFS, Zabbix, Splunk, VMWare ESXi, vCenter, Jira, Azure-Cloud, Python, Teams, Gitlab, Gitlab CI, ServiceNow, Nginx, Terraform, Salt-Stack, Kubernetes, Load Balancers, GSLB Big IP, Github Actions, Apache AB, ETX, Leostream, MySQL, Nikto, Kerberos
■ CI/CD
● Designed and implemented CI/CD pipelines using GitLab CI and Azure Cloud for packaging, testing, and deploying web applications and RPM system patches.
● Created multi-environment pipelines (dev, QA, production) with manual triggers in GitLab for high-risk deployments, reducing production incidents.
● Integrated automated testing using Apache AB + custom scripts to validate application functionality and performance.
● Migrated 3 projects from GitHub Actions to self-hosted GitLab CI, standardizing the CI/CD process across teams.
● Automated RPM package creation within pipelines, ensuring consistent builds and reducing manual errors.
■ INFRASTRUCTURE
● Managed VMWare ESXi servers and vCenter for a cloud environment hosting 9,000+ virtual machines.
● Installed and configured Leostream and ETX to provide a cloud UI for users to request temporary VMs for semiconductor design.
● Administered Docker Swarm and Kubernetes clusters for container orchestration, ensuring high availability and scalability.
● Automated infrastructure provisioning and configuration using Ansible and Terraform, reducing setup time by 70%.
● Upgraded and patched VMWare ESXi servers/nodes and containers based on CVE vulnerabilities and internal security audits.
■ SCRIPTING & DEVELOPMENT
● Developed scripts to automate vulnerability scans (SAST/DAST) and syntax checks in pipelines using Trivy and linters such as golangci-lint, PyLint, and ShellCheck, across multiple languages (Python, Bash, PHP).
● Developed a HashiCorp Vault token auto-renewal tool for seamless integration into CI/CD pipelines.
● Built Ansible playbooks for automating server configurations, application deployments, and infrastructure updates.
● Developed a desk reservation web application using PHP, HTML, CSS, and JavaScript + MySQL to streamline R&D department operations.
■ ALERTING & MONITORING
● Configured Zabbix dashboards to monitor services like ETX, EoD, and Leostream, ensuring real-time visibility into system health.
● Created custom alerts in Splunk to detect and resolve issues proactively, such as VM availability dropping below 95%.
● Developed Grafana dashboards to visualize key metrics (CPU, RAM, disk usage, traffic) and improve incident response times.
● Integrated Prometheus and Node Exporter for real-time monitoring and alerting of containerized applications.
■ SECOPS
● Implemented HashiCorp Vault for secure storage and management of secrets, database credentials, and API tokens.
● Conducted vulnerability scans (SAST/DAST) using Nikto and applied security patches for servers and containers based on CVE advisories.
● Enforced zero-trust policies by integrating HashiCorp Vault authentication into CI/CD pipelines and application workflows.
● Automated security scans for container images using Trivy to ensure compliance with security standards.
● Configured network isolation for Dockerized applications using cloud subnets and custom user-defined Docker networks to prevent unauthorized access.
■ PROJECT MANAGEMENT & WORKFLOW AUTOMATION
● Collaborated on implementation of Agile + SAFe frameworks and SCRUM methodologies using Jira, improving team efficiency and collaboration.
● Documented and demoed MVP workflows to showcase the benefits of DevOps practices, including faster deployments and improved reliability.
● Organized cross-functional meetings with management and development teams to align on DevOps adoption and workflow changes.
● Automated repetitive tasks using Bash scripts, reducing manual effort and improving team productivity.
● Created runbooks and Ansible playbooks for incident response and deployment processes, ensuring consistency and knowledge sharing through Wiki.
■ PRODUCTION SUPPORT
● Provided 12/7 production support for critical applications, resolving incidents and ensuring high availability using ServiceNow.
● Conducted root cause analysis (RCA) for production issues and implemented preventive measures to avoid recurrence.
● Managed on-call rotations and ensured timely resolution of alerts and incidents using ServiceNow.
● Automated health checks for applications and infrastructure using Prometheus + Grafana, to proactively identify and resolve issues.
● Collaborated with QA teams to identify and resolve issues before they reached production using Selenium and Postman.
■ NETWORKING
● Administered DNS and GSLB configurations using F5 BIG-IP for internal tools, ensuring high availability and failover capabilities.
● Configured F5 BIG-IP load balancers for traffic distribution and caching across multiple regions.
● Created and managed subnets for Dockerized applications and their nodes on hybrid infrastructure (on-premises and Azure Cloud), and applied user-defined Docker networks to enforce network isolation and security.
● Troubleshot VPN and subnet connectivity issues during infrastructure migrations and expansions using Wireshark and tcpdump.
■ KEY ACHIEVEMENTS
● Migrated 9,000+ virtual machines from Leostream to the ETX VDI solution by creating a robust Ansible playbook, with no downtime for users (designers).
● Tuned the Linux KDE environment for better performance with semiconductor applications such as Cadence, Synopsys, and other EDA tools.
● Developed a multipurpose RPM package creation tool offered as a service to multiple applications.
● Created multiple temporary cloud environments for testing before releases went to production.
Automation Lead at Morgan Stanley
Location: Guadalajara Jalisco Mexico www.morganstanley.com Apr/2016 - Sept/2019 Technologies: Linux, Bash, DB2, SQL, Nagios, Apache Tomcat, NFS, Jira, LDAP, Python, Gitlab, Git, MongoDB, Sybase, SQLite, ServiceNow, Nmap, Kerberos
■ SOFTWARE RELEASE MANAGEMENT
● Automated deployment process for DB2 and Sybase databases using GitLab + Custom Scripts, reducing manual errors and deployment time by 80%.
● Coordinated cross-functional teams to ensure on-time delivery of software deployments.
● Implemented automated testing for database configurations and middleware using Bash and Python scripts.
● Configured QA environment with manual triggers, ensuring proper testing of every deployment to PROD environment.
● Integrated security check scripts to validate configurations and ensure compliance with organizational standards.
■ INFRASTRUCTURE
● Developed Bash and Python scripts to automate DB2 Standalone installations, security patching, and middleware version checks, reducing manual effort by 60%.
● Created a KornShell (KSH) script to automate DB2 HADR cluster setup with customizable configurations, significantly improving efficiency.
● Configured MongoDB Replica Sets and Sharding for fault tolerance and distributed workloads, ensuring data availability and resilience under high traffic conditions.
● Designed and automated backup and disaster recovery strategies using MongoDB native tools (mongodump, mongorestore, Ops Manager) and cloud solutions to ensure data availability.
■ SCRIPTING & DEVELOPMENT
● As Automation Lead, designed and developed multiple Bash and Python automation scripts/tools to streamline repetitive tasks across seven teams (including Storage, IT Governance, Analytics, Network, DBA, and Infrastructure). Achieved a workload reduction of 30% to 70%, significantly improving operational efficiency and freeing up resources for high-value initiatives.
● Built automation tools for verifying middleware and Linux library versions, ensuring consistency across environments.
● Developed PHP-based tools for automating daily tasks, such as LDAP configuration updates for the Employee Profile Administration app.
■ ALERTING & MONITORING
● Configured Nagios modules for monitoring DB2, Sybase and MongoDB databases, ensuring real-time visibility into system health and performance.
● Created custom alerts for critical issues, such as database connectivity failures and high CPU usage, reducing incident response time.
● Developed health check scripts to monitor database and middleware status, proactively identifying and resolving issues.
■ SECOPS
● Executed security patching for Linux servers and DB2/Sybase middleware based on CVE advisories and internal audits.
● Tested and enforced security configurations, including encryption, certificates, and Kerberos authentication.
● Automated vulnerability scans for middleware and Linux libraries using Bash and Python scripts.
● Implemented LDAP integration for the Employee Profile Administration app, ensuring secure authentication and access control.
■ PROJECT MANAGEMENT & WORKFLOW AUTOMATION
● Led the creation of an Automation Team to streamline processes across 7 teams, reducing manual effort and improving efficiency.
● Developed reusable automation scripts and playbooks to standardize workflows and reduce duplication of effort.
● Integrated security and syntax checks into automation pipelines, ensuring code quality and compliance.
● Used Jira to track and manage automation projects, ensuring timely delivery and alignment with business goals.
■ PRODUCTION SUPPORT
● Provided 8/7 production support for critical applications, including DB2, Sybase, and the Employee Profile Administration app.
● Conducted root cause analysis (RCA) for production incidents, implementing preventive measures to avoid recurrence.
● Managed on-call rotations and ensured timely resolution of alerts and incidents using a warm hand-over strategy.
● Automated frequent health checks for middleware, reducing downtime and improving reliability.
■ NETWORKING
● Configured and administered NFS for QA projects, ensuring secure and efficient file sharing across servers.
● Troubleshot network connectivity issues for DB2, Sybase, and MongoDB databases, ensuring seamless communication between services.
● Implemented LDAP integration for the Employee Profile Administration app, enabling secure user authentication and access control.
● Collaborated with network teams to optimize subnet configurations and ensure high availability for critical applications.
■ KEY ACHIEVEMENTS
● Reduced DB2 installation time from 45 minutes to 7 minutes by developing an automated KornShell (KSH) script, and HADR cluster installation time from 1 hour 30 minutes to 17 minutes.
● Improved productivity across 7 teams by 30% to 70% by automating repetitive tasks and creating reusable automation tools.
● Enhanced system reliability by implementing automation of HADR for DB2 database installation, ensuring high availability and disaster recovery.
● Streamlined workflows across 7 teams by leading the creation of an Automation Team and developing standardized automation tools.
Infrastructure Linux/AIX Specialist at IBM
Location: Guadalajara Jalisco Mexico www.ibm.com Jan/2015 - Apr/2016 Technologies: Linux, AIX, vCenter, DB2, WebSphere (WAS), Apache Tomcat, Bash, Ksh, Docker, NFS, Nagios, db2cli validate, ShellCheck, IHS, MQ, VMWare ESXi, Python, Kerberos, LVM, SSL/TLS certificates, LDAP, Ansible, and other internal tools for ticketing and chat
■ SOFTWARE RELEASE MANAGEMENT
● Automated deployment pipelines for WebSphere Application Server (WAS) and Apache Tomcat applications using Bash and KornShell (KSH) scripts, reducing manual errors and cutting deployment time by more than 50%.
● Implemented automated testing for middleware configurations using Bash scripts, ensuring consistency across environments.
● Configured multi-environment deployment tools (QA, PROD) with manual triggers for high-risk deployments, ensuring safe rollouts.
● Integrated security checks into deployment scripts to validate configurations and ensure compliance with organizational standards, using linters and validators for scripts and configuration files such as ShellCheck and db2cli validate.
■ INFRASTRUCTURE
● Managed 300+ servers running AIX and RedHat Linux, including performance tuning, patching, and upgrades.
● Installed and configured WebSphere Application Server (WAS), IBM HTTP Server (IHS), and IBM MQ for enterprise applications.
● Automated server provisioning and configuration using Bash and KornShell (KSH) scripts, reducing setup time.
● Administered and configured NFS (Network File System) for file sharing across servers, ensuring secure and efficient access.
● Created and managed virtual machines using VMWare ESXi and vCenter for development and testing environments.
● Built server nodes from bare metal to production, covering the whole process: setting up hardware in the data center, installing the Linux operating system, configuring LVM partitions and NFS shares, upgrading packages, and installing the required middleware.
■ SCRIPTING & DEVELOPMENT
● Developed Bash and KornShell (KSH) scripts to automate middleware deployments (MQ, WebSphere, IHS, and DB2) and configuration checks, reducing manual effort by 60%.
● Created tools to automate security checks and configuration verification for multiple middleware products at once.
● Modified and optimized legacy Python scripts to meet current business needs, improving efficiency and reliability for new software versions and custom configurations.
● Built automation tools for deploying and configuring WebSphere and Apache Tomcat applications, ensuring consistency across environments.
■ ALERTING & MONITORING
● Configured Nagios modules for monitoring WebSphere, IBM MQ, and Apache Tomcat applications, ensuring real-time visibility into system health and performance.
● Created custom alerts for critical issues, such as application downtime and high CPU and storage usage, reducing incident response time.
● Developed health check scripts to monitor middleware and application status on demand, proactively identifying and resolving issues.
■ SECOPS
● Executed security patching for AIX and RedHat Linux servers based on CVE advisories and internal audits.
● Tested and enforced security configurations for WebSphere, IBM MQ, IHS and Apache Tomcat, including SSL/TLS certificate management and access controls.
● Automated vulnerability scans for middleware and operating systems using Bash and Python scripts.
● Managed LDAP configurations of new servers for secure user authentication and access control.
■ PROJECT MANAGEMENT & WORKFLOW
● Documented usage of new DevOps tools to achieve Agile + DevOps adoption.
● Developed reusable automation scripts and Ansible playbooks to standardize workflows and reduce duplication of effort.
● Integrated security and syntax checks into automation pipelines, ensuring code quality and compliance using Jenkins.
● Used IBM Engineering Lifecycle Management (ELM, an internal tool) to track and manage automation projects, ensuring timely delivery and alignment with business goals.
■ PRODUCTION SUPPORT
● Provided 10/7 production support for critical applications, including WebSphere, IBM MQ, and Apache Tomcat, as well as everything else installed on the 300+ servers I administered.
● Conducted root cause analysis (RCA) for production incidents, implementing preventive measures to avoid recurrence.
● Managed on-call rotations and ensured timely resolution of alerts and incidents.
● Conducted cross-functional analysis to identify blockers to infrastructure improvement related to networking and unexplained slowness in HTTPS responses.
■ NETWORKING
● Configured and administered NFS for file sharing across servers, ensuring secure and efficient access.
● Troubleshot network connectivity issues for WebSphere, IBM MQ, and Apache Tomcat applications, ensuring seamless communication between services.
● Implemented load balancing for IBM HTTP Server (IHS) to distribute traffic and ensure high availability.
● Collaborated with network teams to optimize subnet configurations and ensure high availability for critical applications.
■ KEY ACHIEVEMENTS
● Reduced deployment time by 50% by automating WebSphere and Apache Tomcat deployments using Bash and KornShell (KSH) scripts.
● Improved team productivity by automating repetitive tasks and creating reusable automation tools.
● Streamlined workflows across multiple teams by developing the first standardized automation scripts and playbooks.
● Handled 300+ Linux servers smoothly by implementing automation on