
Data Engineer

Location:
Clovis, CA
Posted:
February 14, 2025


Resume:

SAI KALYAN VANAMA

Contact: +* - 619-***-****

Email: ***************@*****.***

Location: Fremont, CA, USA

LinkedIn: https://www.linkedin.com/in/saikalyanvanama/

Recent M.S. in Computer Science graduate and hackathon runner-up with over 4 years of experience driving business insights and efficiency by streamlining ETL processes, leveraging cloud platforms, and applying data analytics.

TECHNICAL SKILLS:

Programming Languages & Libraries: Python, PySpark, Pandas, scikit-learn, Matplotlib, Seaborn, NLTK, SQL, Selenium, BeautifulSoup

Cloud Platforms: Azure Databricks, Azure Synapse Analytics, Azure Data Factory, Azure Blob Storage, AWS S3, AWS Glue, AWS RDS, AWS Redshift, AWS Lambda, Google Cloud Storage, Google Cloud Run, Cloud Function, DataFlow, Pub/Sub, Google Cloud SQL, Dataproc, BigQuery

Data Management & Warehousing: SQL Server, MySQL, Postgres, MongoDB, Snowflake, Data Extraction, Transformation, and Loading (ETL), Data Modeling, Data Warehousing

DevOps & CI/CD: Azure DevOps, Git, Jira, Terraform, Docker

Data Processing & Orchestration: Apache Airflow, Apache Spark, Apache Kafka

Data Analysis & Visualization: Data Pre-processing, Machine Learning Algorithms, Data Visualization, Power BI, Tableau, SPSS

PROFESSIONAL EXPERIENCE:

The KeelWorks Foundation | Data Engineer | 07/2024 - Present

● Developed and automated scalable data pipelines on GCP, integrating Apache Airflow, Dataflow, and BigQuery to process and extract business-relevant insights from scraped data. Designed and implemented an end-to-end ETL pipeline using Apache Spark with Scala/PySpark, optimizing structured and semi-structured data ingestion from Surge/Data Lake into RDBMS for real-time analytics.

● Led cloud migration and workflow automation, modernizing an on-prem Hadoop system to GCP, reducing operational overhead, and leveraging Cloud Dataflow, Cloud Run, and Composer for efficient data transformation and orchestration. Built fault-tolerant ETL workflows with Airflow, optimizing Spark performance using Accumulators and Broadcast variables, significantly improving data reliability and processing efficiency.

● Applied hands-on expertise in GCP big data services, including BigQuery, Dataproc, and Google Cloud Storage, to enhance cloud-based data infrastructure. Engineered complex JSON data handling, optimized batch and real-time data migration using Google Cloud Dataflow, and implemented Spark tuning strategies for high-performance distributed computing.
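The Airflow-style orchestration described above schedules dependent pipeline tasks as a directed acyclic graph. None of the original pipeline code is available; as a minimal stand-alone sketch (task names are hypothetical), the dependency ordering a DAG scheduler computes can be shown with the standard library:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline tasks and their upstream dependencies,
# mirroring how an Airflow DAG wires extract -> transform -> load.
dag = {
    "extract_scraped_data": set(),
    "clean_records": {"extract_scraped_data"},
    "load_to_bigquery": {"clean_records"},
    "publish_insights": {"load_to_bigquery"},
}

def run_order(dag):
    """Return a valid execution order for the task graph."""
    return list(TopologicalSorter(dag).static_order())
```

For this chain the only valid order is extract, clean, load, publish; Airflow applies the same topological constraint when dispatching task instances.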

Mindtree | Data Engineer | 10/2019 - 08/2022

● Developed and optimized data pipelines on Azure, integrating Apache Airflow DAGs and Azure Synapse Analytics to process 10,000+ records daily, enhancing data ingestion workflows and improving reporting performance with advanced data modeling techniques.

● Led large-scale ETL migrations using Azure Data Factory and Databricks, successfully migrating 500+ GB of data from on-prem SQL Server to Azure SQL Storage, while leveraging PySpark and SQL to optimize data transformation, reducing processing time and infrastructure costs.

● Designed and implemented high-performance data processing solutions, developing scalable pipelines using Apache Spark and Azure Synapse to handle 30TB+ daily data, achieving 60% processing time reduction and 40% cost savings through tuning and cloud optimization.

● Built an automated data integration and analytics framework, consolidating data from 20+ source systems, improving data freshness by 70%, and enabling real-time analytics processing of 500K+ events per second using Azure Event Hubs and Stream Analytics with 99.95% accuracy.

● Enhanced data governance, CI/CD, and testing automation, implementing Azure DevOps pipelines, reducing deployment time by 80%, automating 90% of data quality checks, ensuring GDPR compliance with PII detection and masking, and achieving 95% test coverage for production stability.

● Engineered performance improvements using Redis caching and Spark tuning, increasing query efficiency by 70%, reducing latency for frequently accessed datasets, and streamlining validation processes to reduce data discrepancies by 75% and enhance system responsiveness.
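The Redis caching improvement above follows the cache-aside pattern: serve hot query results from memory and fall back to the database on a miss. As an illustrative sketch only (a plain dict stands in for Redis, and the query function is hypothetical):

```python
import time

class TTLCache:
    """Minimal cache-aside sketch: a dict stands in for Redis.
    Entries expire after ttl seconds, so frequently accessed
    results are served from memory instead of re-running the query."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self._store = {}  # key -> (expiry_timestamp, value)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit and hit[0] > now:
            return hit[1]                       # cache hit
        value = compute()                       # cache miss: run the query
        self._store[key] = (now + self.ttl, value)
        return value

calls = []
def expensive_query():
    calls.append(1)          # stands in for a slow Spark/SQL query
    return 42

cache = TTLCache(ttl=60.0)
first = cache.get_or_compute("daily_totals", expensive_query)
second = cache.get_or_compute("daily_totals", expensive_query)
# the second lookup is served from the cache; the query ran once
```

A production Redis deployment adds the same TTL semantics via `SETEX`, plus shared access across workers, which a per-process dict cannot provide.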

Bookslelo | Data Engineer | 04/2018 - 11/2018

● Developed ETL pipelines using Python and SQL, processing 1TB+ of daily data with 99.9% accuracy while reducing processing time by 40%.

● Implemented data quality checks and monitoring systems, identifying and resolving 95% of data anomalies before production impact.

● Created an automated reporting system using Power BI, reducing manual report generation time by 80% and improving data visualization accuracy by 90%.

● Optimized database queries and implemented indexing strategies, improving query performance by 65% across critical workflows.

● Developed Python scripts for data cleaning and transformation, reducing data preparation time by 70% while maintaining 99% accuracy.

● Implemented version control for data pipelines using Git, improving code quality by 85% through systematic review processes.
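The indexing optimization described above can be demonstrated with the standard library's sqlite3: an index turns a full-table scan into a direct seek. Table and column names here are hypothetical, not from the actual workload:

```python
import sqlite3

# Illustrative sketch of how an index changes the query plan.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [(f"cust{i % 100}", float(i)) for i in range(1000)])

# Without an index, filtering by customer scans every row.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'cust7'"
).fetchone()

# With an index, SQLite seeks directly to the matching rows.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'cust7'"
).fetchone()
```

The plan detail changes from a SCAN of the table to a SEARCH using `idx_orders_customer`; the same scan-versus-seek distinction drives index tuning in SQL Server and Postgres.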

EDUCATION

1. University of Texas at Arlington, Arlington, Texas, USA | M.S. in Computer Science, Specialization in Big Data Management and Intelligent Systems | GPA: 3.8/4 | May 2024

2. Visvesvaraya Technological University, India | Bachelor of Engineering in Information Science and Engineering | CGPA: 7/10 | June 2019

CERTIFICATIONS

● DP-203: Data Engineering on Microsoft Azure


