Bhavya V
Senior Data Engineer
*****************@*****.*** 205-***-**** LINKEDIN
PROFESSIONAL SUMMARY:
10+ years of experience designing, building, and optimizing scalable data pipelines and distributed systems across diverse cloud platforms, including AWS, Azure, and GCP.
Hands-on expertise in leveraging Apache Spark, PySpark, and the Hadoop ecosystem for batch and real-time data processing in large-scale environments.
Proven track record of integrating structured and unstructured data using Apache Kafka, Kafka Connect, and Google Cloud Pub/Sub to enable real-time analytics and event-driven architectures.
Strong background in ETL and data integration tools like AWS Glue, Azure Data Factory, Airflow, Talend, and DBT to support end-to-end data workflows across hybrid environments.
Experienced in building secure and cost-effective data lakes and warehouses using Snowflake, Amazon Redshift, Azure Synapse, and BigQuery.
Skilled in developing data models using the Kimball methodology and Star and Snowflake schemas, ensuring high performance and data integrity for BI and reporting use cases.
Adept at building and managing ML pipelines using MLflow, Kubeflow, and Azure ML, integrating models built with Scikit-learn, PyTorch, and TensorFlow.
Deep understanding of data governance, identity, and access controls using AWS IAM, Azure AD, Key Vault, Apache Ranger, and OAuth 2.0.
Strong command of SQL, PL/SQL, and NoSQL databases like MongoDB, Cassandra, and HBase for advanced querying, optimization, and data manipulation.
Proficient in orchestrating complex workflows using Apache Airflow, AWS Step Functions, and Azure Logic Apps to efficiently automate and monitor data pipelines.
Implemented secure and reusable APIs using RESTful services, XML, and XSLT to enable data access for both internal applications and external clients.
Hands-on experience with Delta Lake and Databricks to unify batch and streaming workloads and enable reliable, ACID-compliant data lakehouses.
Delivered multiple BI solutions using Power BI, Tableau, Grafana, and Excel, enabling actionable insights and performance dashboards across business units.
Built and deployed scalable applications and scripts using Python, Scala, Java, and Shell scripting to support diverse data engineering workflows.
Experienced in CI/CD integration with Azure DevOps and infrastructure provisioning using AWS CloudFormation for scalable, production-grade environments.
Strong cross-functional collaborator who excels at translating complex technical solutions into clear, actionable insights for diverse stakeholder groups.
Recognized for a proactive approach to problem-solving, adaptability in fast-paced environments, and commitment to continuous improvement and mentoring junior engineers.
TECHNICAL SKILLS:
Programming Languages & Scripting: Python, Java, Scala, SQL, PL/SQL, JavaScript, Shell Scripting
Big Data & Distributed Computing: Apache Spark, PySpark, Apache Hive, Hadoop, HDFS, MapReduce, Presto, Apache Sqoop, YARN, Databricks, Delta Lake
Streaming & Messaging: Apache Kafka, Kafka Streams, Kafka Connect, Google Cloud Pub/Sub
Machine Learning & AI: PyTorch, TensorFlow, Scikit-learn, MLflow, Azure Machine Learning (Azure ML), Kubeflow
Cloud Platforms & Services: AWS (Glue, Redshift, Lambda, Step Functions, EMR, IAM, CloudWatch, EC2, S3, RDS, API Gateway, CloudFormation), Azure (Data Factory, Databricks, Data Lake Storage, Synapse, Active Directory, Key Vault, DevOps, Logic Apps, Monitoring), Google Cloud Platform (Dataflow, Pub/Sub, BigQuery, AI Platform)
Orchestration & Workflow Automation: Apache Airflow, AWS Step Functions, Azure Logic Apps
Version Control & CI/CD: Git, GitHub, Jenkins, Azure DevOps
Infrastructure as Code & Automation: Terraform, AWS CloudFormation, Ansible, Docker, Kubernetes
APIs & Web Technologies: RESTful APIs, XML, XSLT
Data Modeling & Warehousing Approaches: Kimball-style data marts, Star Schema, Snowflake Schema
Security & Governance: AWS IAM, Azure Active Directory (AAD), AWS Glue Data Catalog, Apache Ranger, OAuth 2.0
ETL & Data Integration: AWS Glue, Airflow, Azure Data Factory, Informatica, Talend, DBT, Apache Sqoop, SSIS, Google Cloud Dataflow, Matillion
Databases & Data Warehousing: Snowflake, Redshift, PostgreSQL, MySQL, Oracle PL/SQL, SQL Server, MongoDB, Cassandra, HBase, HiveQL
Data Visualization & BI Tools: Power BI, Tableau, Grafana, Google Analytics, Excel (pivot tables, VLOOKUP)
Monitoring & Logging: AWS CloudWatch, New Relic, Splunk
EXPERIENCE:
T-Mobile, Atlanta, GA Jan 2024 – Present
Senior Data Engineer
Responsibilities:
Delivered hybrid batch and streaming pipelines to support dynamic BI reporting and analytics use cases across the enterprise.
Developed scalable ETL pipelines using AWS Glue, integrating diverse structured/unstructured data sources for centralized analysis.
Deployed containerized data apps using Docker and Kubernetes, improving application scalability and cross-environment compatibility.
Streamlined data governance using AWS Glue Data Catalog, making data assets easier to locate, access, and manage.
Designed schema-less NoSQL models with MongoDB to support complex, semi-structured data with reduced query latency.
Built and launched real-time data streaming solutions using Kafka Streams and Connect, ensuring fast and reliable processing of complex, high-volume datasets.
Improved large-scale data pipelines on Databricks using Apache Spark and PySpark, significantly boosting throughput and reducing processing delays (see the PySpark sketch below).
Automated AWS infrastructure provisioning and configuration with CloudFormation and Ansible.
Strengthened security posture by designing fine-grained access controls through AWS IAM, aligning with compliance and best practices.
Enhanced ML data readiness by applying synthetic data generation and augmentation techniques, increasing model generalizability.
Delivered end-to-end ETL solutions for seamless data movement, enrichment, and analytics-ready output.
Provisioned high-performance EC2 clusters for compute-intensive operations, improving runtime efficiency.
Parsed and transformed XML data via XSLT, ensuring compatibility with legacy systems and third-party applications.
Administered Unity Catalog in Databricks to centralize policy enforcement, metadata tracking, and access control.
Implemented a structured data governance program focusing on policy standardization, lineage tracking, and security compliance.
Created event-driven architectures using AWS Lambda and Step Functions to automate workflows with minimal latency (see the Lambda sketch below).
Conducted peer code reviews and mentored junior engineers, cultivating a high-performance engineering culture.
Built scalable HDFS-based storage systems to support distributed data computation and efficient storage of massive datasets.
Migrated high-volume relational data using Apache Sqoop and Spark SQL, supporting seamless integration with big data platforms.
Managed ML development lifecycle with MLflow and PyTorch on AWS EMR, ensuring scalable training, deployment, and version control (see the MLflow sketch below).
Oversaw model governance processes from data preprocessing through production deployment, enabling continuous improvement.
Monitored distributed systems using AWS CloudWatch to ensure uptime and proactively resolve performance bottlenecks.
Built robust DAGs with Apache Airflow to schedule and orchestrate batch and streaming jobs, increasing workflow reliability (see the Airflow sketch below).
Developed REST APIs to enable secure and efficient communication across multiple systems and applications.
Designed AWS Redshift data models using the Kimball approach, supporting fast analytical queries and dimensional modeling.
Established Jenkins-based CI/CD pipelines to streamline testing and release automation with reduced risk.
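Illustrative PySpark sketch for the Databricks pipeline work above; the paths, column names, and aggregation are hypothetical placeholders, not production code.
```python
# Minimal PySpark batch-transformation sketch; all paths and column
# names below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events_daily_rollup").getOrCreate()

# Read raw events (placeholder S3 path).
events = spark.read.parquet("s3://example-bucket/raw/events/")

# Aggregate events per customer per day.
daily_rollup = (
    events
    .filter(F.col("event_type").isNotNull())
    .groupBy("customer_id", F.to_date("event_ts").alias("event_date"))
    .agg(F.count("*").alias("event_count"))
)

# Write partitioned output for downstream BI queries (placeholder path).
daily_rollup.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/curated/events_daily/"
)
```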
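Hedged sketch of the event-driven pattern above: a hypothetical Lambda handler reacting to an S3 put event and starting a Step Functions execution. The state machine ARN and payload shape are placeholders.
```python
# Hypothetical Lambda handler: on each S3 object-created record, start a
# Step Functions execution with the object's location as input.
import json

import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN; the real state machine name would differ.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:etl"


def handler(event, context):
    for record in event.get("Records", []):
        payload = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
        }
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps(payload),
        )
    return {"status": "started"}
```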
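Minimal MLflow tracking sketch for the ML lifecycle work above; the experiment name, parameters, and metric values are illustrative placeholders, and the training step itself is omitted.
```python
# Track one training run with MLflow; names and values are illustrative.
import mlflow

mlflow.set_experiment("churn-model")  # hypothetical experiment name

with mlflow.start_run():
    # Log hyperparameters up front...
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 10)
    # ...train the model here (omitted)...
    # ...then record evaluation metrics for comparison across runs.
    mlflow.log_metric("val_loss", 0.42)
```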
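Minimal Airflow sketch of the DAG pattern above, assuming Airflow 2.4+; the extract/load callables are hypothetical stand-ins for real pipeline logic.
```python
# A two-task daily DAG with retries; callables are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull the day's records from a source system.
    print("extracting...")


def load():
    # Placeholder: write transformed records to the warehouse.
    print("loading...")


with DAG(
    dag_id="daily_batch_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
```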
Environment: AWS Glue, Lambda, IAM, EC2, CloudFormation, CloudWatch, Redshift, Apache Kafka, Spark, PySpark, Databricks, HDFS, Sqoop, Spark SQL, MLflow, PyTorch, EMR, MongoDB, Docker, Kubernetes, Ansible, XSLT, Apache Airflow, REST APIs, Jenkins, Git, SQL.
Baptist Health, Miami, FL Aug 2021 – Dec 2023
Data Engineer
Responsibilities:
Processed big data using Hadoop frameworks like Hive and MapReduce, enabling scalable analytics.
Delivered real-time event processing systems with Kafka, supporting critical, low-latency data operations.
Wrote advanced SQL for complex data transformations using joins, CTEs, UDFs, and window functions (see the Spark SQL sketch below).
Engineered Snowflake warehouse schemas using star modeling for high-performance querying and enterprise analytics.
Built real-time dashboards using JavaScript to visualize operational insights and key metrics.
Tuned Spark workloads in Azure Databricks for large-scale data processing, reducing runtime and increasing scalability.
Created robust Python solutions for data manipulation and machine learning, leveraging TensorFlow, Azure ML, and Kubeflow.
Implemented RBAC using Azure Active Directory and encrypted sensitive data via Azure Key Vault to enforce security protocols.
Diagnosed and resolved pipeline issues in real time using ServiceNow, maintaining uptime and data accuracy.
Automated ETL workflows in Azure Data Factory to efficiently handle diverse data sources and transformations.
Used Spark-SQL for exploratory analysis and advanced querying on distributed data sources.
Led implementation of data governance protocols, ensuring data integrity and compliance.
Designed NoSQL models in Cassandra tailored for fast ingestion and analytical performance.
Built scalable Python scripts to automate workflows and streamline integrations across systems.
Enforced data access controls via Apache Ranger, managing security across hybrid data platforms.
Used DBT for modular, version-controlled data transformations in the analytics pipeline.
Leveraged Pandas and NumPy for preprocessing, statistical analysis, and ML feature engineering (see the Pandas sketch below).
Maintained and transformed data across structured and unstructured formats for optimal access and storage.
Built dynamic Power BI dashboards to present actionable insights to business stakeholders.
Used Terraform to automate cloud infrastructure provisioning following IaC principles.
Managed source control and branching strategies through GitHub, supporting collaboration and traceability.
Deployed CI/CD pipelines in Azure DevOps to streamline testing, versioning, and deployment activities.
Set up Grafana dashboards for real-time metrics and system alerts, enhancing visibility across data pipelines.
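Illustrative window-function query in the style described above, run through Spark SQL and assuming an active SparkSession named spark (as in Databricks notebooks); the table and columns are hypothetical.
```python
# Deduplicate to the latest visit per patient with ROW_NUMBER(); the
# `visits` table is a hypothetical placeholder.
query = """
WITH ranked AS (
    SELECT
        patient_id,
        visit_ts,
        ROW_NUMBER() OVER (
            PARTITION BY patient_id
            ORDER BY visit_ts DESC
        ) AS rn
    FROM visits
)
SELECT patient_id, visit_ts
FROM ranked
WHERE rn = 1  -- latest visit per patient
"""
latest_visits = spark.sql(query)
latest_visits.show()
```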
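Small Pandas/NumPy feature-engineering sketch in the spirit of the preprocessing described above; the input file and column names are invented for illustration.
```python
# Hypothetical preprocessing / feature-engineering steps.
import numpy as np
import pandas as pd

df = pd.read_csv("claims.csv")  # placeholder input file

# Impute missing claim amounts with the median, then reduce skew.
df["claim_amount"] = df["claim_amount"].fillna(df["claim_amount"].median())
df["log_claim_amount"] = np.log1p(df["claim_amount"])

# One-hot encode a low-cardinality categorical feature.
df = pd.get_dummies(df, columns=["claim_type"], prefix="type")
```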
Environment: Hadoop, Hive, MapReduce, Kafka, SQL, Snowflake, JavaScript, Apache Spark, Azure Databricks, Python, TensorFlow, Azure ML, Kubeflow, Azure AD, Azure Key Vault, ServiceNow, ADF, Spark-SQL, Cassandra, Pandas, NumPy, Power BI, Terraform, GitHub, Azure DevOps, Grafana.
Blue Shield of CA, Long Beach, CA Jan 2019 – Jul 2021
Data Engineer
Responsibilities:
Tuned BigQuery queries to accelerate performance and minimize costs in the cloud data warehouse.
Delivered Tableau dashboards to enable executive-level reporting and interactive analytics.
Designed PostgreSQL databases with performance-optimized queries and advanced indexing.
Automated ETL pipelines using Python to increase reliability and reduce manual interventions.
Developed secure RESTful APIs to integrate various systems and enable seamless data communication.
Built robust ETL processes using Informatica to ensure efficient data ingestion, cleansing, and transformation.
Created machine learning models using TensorFlow and Scikit-learn to derive predictive insights (see the Scikit-learn sketch below).
Modeled Snowflake data warehouses using dimensional schemas for fast analytical performance.
Developed distributed Spark pipelines to enable fast computation on massive datasets.
Built real-time data pipelines with Google Dataflow, powering stream analytics at scale.
Integrated multi-format datasets to support structured reporting and dashboarding needs.
Ensured platform reliability and availability using New Relic for system monitoring and diagnostics.
Managed cloud-based ETL workflows using Matillion, improving scalability and operational efficiency.
Conducted web traffic and user behavior analytics via Google Analytics, informing strategic decisions.
Utilized Pandas and NumPy to support complex transformations and statistical computations.
Built scalable, real-time processing systems using Google Pub/Sub for streaming data (see the Pub/Sub sketch below).
Automated infrastructure provisioning through Terraform, ensuring repeatable and scalable deployments.
Secured application APIs using OAuth 2.0, ensuring safe and compliant access to sensitive data.
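Hedged Scikit-learn sketch of the modeling workflow described above, using a synthetic dataset rather than any real project data.
```python
# Train and evaluate a simple classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out split.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```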
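Minimal Google Cloud Pub/Sub streaming-pull sketch of the pattern described above; the project and subscription IDs are placeholders.
```python
# Streaming-pull consumer; project and subscription IDs are placeholders.
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("example-project", "events-sub")


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print("received:", message.data)
    message.ack()  # acknowledge so the message is not redelivered


streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    # Block for up to 60 seconds of streaming, then shut down cleanly.
    streaming_pull.result(timeout=60)
except TimeoutError:
    streaming_pull.cancel()
    streaming_pull.result()
```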
Environment: BigQuery, Tableau, PostgreSQL, Python, RESTful APIs, Informatica, TensorFlow, Scikit-learn, Snowflake, Apache Spark, Google Dataflow, New Relic, Matillion, Pandas, NumPy, Google Pub/Sub, Terraform.
Charter Communications, Plano, TX Aug 2017 – Dec 2018
Data Engineer
Responsibilities:
Managed AWS EC2 instances to optimize compute resource usage for large-scale jobs.
Implemented MapReduce for distributed computation, accelerating batch processing times.
Developed high-performance data engineering components using Scala and Java.
Set up data governance controls around enterprise data quality, security, and compliance.
Tuned cluster performance and job execution using YARN for optimal resource management.
Automated system checks and monitoring tasks with shell scripts to reduce manual errors.
Built serverless workflows using AWS Lambda to support real-time data automation.
Designed scalable data storage leveraging HDFS, AWS S3, and RDS for high availability (see the boto3 sketch below).
Developed Talend ETL pipelines for seamless integration and movement of diverse data formats.
Collaborated via Git for version control and streamlined CI/CD processes across teams.
Automated structured data ingestion into Hadoop using Apache Sqoop, improving transfer efficiency.
Created and maintained Hive data models, writing complex HiveQL queries for business reporting.
Wrote and optimized Oracle PL/SQL procedures for data processing and transaction workflows.
Built scalable data models for structured and semi-structured formats, improving data access.
Worked in Agile settings, contributing to sprint planning and cross-functional data solutions.
Troubleshot pipeline failures using Splunk logs and alerts, reducing downtime.
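Illustrative boto3 sketch of the S3-backed storage pattern described above; the bucket, keys, and local file are placeholders.
```python
# Land an extract in S3 and list what has arrived; names are placeholders.
import boto3

s3 = boto3.client("s3")

# Upload a day's extract for downstream Hadoop/Hive jobs.
s3.upload_file("daily_extract.csv", "example-bucket", "raw/2018/daily_extract.csv")

# List objects under the raw prefix to confirm the landing.
response = s3.list_objects_v2(Bucket="example-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```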
Environment: AWS EC2, MapReduce, Scala, Java, YARN, Shell Scripting, AWS Lambda, HDFS, AWS S3, AWS RDS, Talend, Git, Apache Sqoop, Hive, HiveQL, Oracle PL/SQL, Agile, Splunk.
Wipro, Nanakramguda, Hyderabad, India Oct 2014 – Jul 2017
Junior Data Engineer
Responsibilities:
Designed Excel reports using advanced features like PivotTables, VLOOKUP, and macros.
Configured and maintained HBase clusters for low-latency data storage and retrieval.
Created and deployed Spark jobs to process large-scale datasets efficiently.
Implemented Apache Kafka pipelines for streaming ingestion and event handling (see the consumer sketch below).
Wrote Python scripts for data preparation, automation, and analytics across projects.
Developed ML models using Scikit-learn to support predictive analysis initiatives.
Managed unstructured data using HDFS with built-in fault tolerance and scalability.
Built ETL processes using SSIS to automate routine data workflows across systems.
Used Git for version control and collaborative development.
Leveraged NumPy for efficient data handling and numerical computation in analytics.
Implemented data security best practices, including encryption and access controls.
Structured and indexed Hive tables to optimize storage and enable fast queries.
Tracked and resolved issues via Bugzilla, contributing to stable data pipeline operations.
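Minimal kafka-python consumer sketch of the streaming-ingestion pattern described above; the topic, broker address, and consumer group are placeholders.
```python
# Consume JSON events from a placeholder topic and hand them downstream.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",  # placeholder topic name
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
    group_id="ingest-workers",
)

for message in consumer:
    record = message.value
    # Placeholder: validate and route the record to the downstream sink.
    print(record)
```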
Environment: Excel, PivotTables, VLOOKUP, Macros, HBase, Apache Spark, Apache Kafka, Python, Scikit-learn, HDFS, SSIS, Git, NumPy, Hive, Bugzilla.