Customer-focused Cloud Support and Site Reliability Engineer with 13+ years of experience providing advanced technical support for distributed data systems and cloud platforms. Proven expertise in diagnosing complex data pipeline issues, driving incident management, and ensuring reliability across AWS and GCP ecosystems. Adept at managing high-severity escalations, conducting root cause analysis, and collaborating with cross-functional teams to deliver rapid, effective solutions - ensuring exceptional customer satisfaction and operational continuity.
Cloud & Data Platforms: AWS (EMR, Glue, Athena, DynamoDB), GCP, Hadoop, Spark, Hive, Cloudera CDP
Support & Incident Response: Escalation Management, RCA, On-call Operations, Service Restoration
Monitoring & Observability: CloudWatch, Stackdriver, Prometheus, CloudTrail, Ganglia
Automation & Scripting: Python, Bash, Shell
Infrastructure Management: Terraform, CloudFormation, IAM, VPC, Security Groups, Load Balancing
Customer-Facing Expertise: Case Management, Troubleshooting Guidance, Technical Escalation, Documentation
ALIGNED VALUE ADD
Proven success in customer-facing technical support for distributed and cloud-based data systems.
Skilled in triaging critical issues in production and ensuring SLA-bound responses.
Strong communicator capable of explaining complex data workflows in simple terms.
Adept at RCA, documentation, and preventive automation to reduce incident recurrence.
CAREER HISTORY
Sr. Cloud Operations & SRE - Lowe's Dallas, TX Nov 2021 - Present
• Delivered end-to-end operational support for large-scale data environments on AWS and GCP, ensuring optimal performance, stability, and uptime.
• Managed and optimized Cloudera CDP 7.1.x clusters, including upgrades, patching, and validation during maintenance cycles.
• Acted as a key point of contact for critical incidents, coordinating rapid response and recovery while maintaining transparent communication with engineering and business stakeholders.
• Troubleshot and tuned EMR and Glue ETL workflows for latency, resource utilization, and schema inconsistencies.
• Collaborated with engineering teams to identify systemic issues, implement long-term fixes, and document remediation processes.
• Authored and maintained SOPs, runbooks, and RCA reports for consistent operational response and future learning.
• Provided ongoing on-call coverage for production-critical issues across time zones, maintaining exceptional SLA compliance.
Big Data Cloud Support Engineer II - Amazon Web Services (AWS) Dallas, TX Aug 2019 - Nov 2021
• Supported enterprise customers across EMR, Glue, Athena, Lake Formation, and DynamoDB ecosystems.
• Investigated logs, job failures, and data inconsistencies to restore service functionality and improve performance.
• Led technical deep dives for customer incidents, guiding clients through log-based diagnostics, Spark tuning, and resource optimization.
• Partnered with AWS service teams to resolve product-level issues and deliver post-incident summaries to customers.
• Authored troubleshooting documentation and knowledge base articles to accelerate future case resolution.
• Recognized for reducing incident recurrence through proactive monitoring, RCA, and preventive automation.
• Mentored junior support engineers on issue triage, escalation protocols, and best practices in distributed systems troubleshooting.
Sr. Hadoop Administrator - VISA (via Sriven Systems) Austin, TX Nov 2018 - Aug 2019
• Supported Hadoop environments for business-critical applications, ensuring consistent uptime and data reliability.
• Implemented security and compliance controls with Kerberos and Ranger to protect data assets.
• Diagnosed cluster-level performance issues, performing RCA to address HDFS, YARN, and Spark bottlenecks.
• Partnered with platform teams to optimize data pipeline onboarding and ensure compliance with best practices.
Senior Hadoop Administrator - State Farm (via Sriven Systems) Bloomington, IL Oct 2012 - Oct 2018
• Managed multi-node Hadoop clusters, ensuring performance, scalability, and availability.
• Monitored and tuned Hive, Spark, and HDFS configurations to enhance data workflow efficiency.
• Authored diagnostic playbooks and operational guidelines that reduced mean time to recovery (MTTR).
• Delivered seamless upgrades and patching with zero unplanned downtime.
Middleware Administrator - Bank of America Dallas, TX Feb 2012 - Oct 2012
• Managed enterprise middleware infrastructure including WebSphere and WebLogic environments.
• Resolved application-level and infrastructure-level incidents through collaboration with cross-functional teams.
• Performed JVM and thread tuning, SSL configuration, and deployment automation.
EDUCATION
Master's in Computer Science — Virginia, USA 2011
B.Tech in Computer Science — JNTU Hyderabad 2009