
Big Data Engineer

Location:
Milwaukee, WI, 53224
Salary:
90k
Posted:
July 30, 2025

Resume:

NAME: VINAY KUMAR BATHULA

**************@*****.*** | Cell phone: 341-***-****

Senior Big Data Engineer

BACKGROUND SUMMARY:

10+ years of experience in designing, developing, and deploying data pipelines and big data analytics solutions across cloud platforms and distributed computing environments.

Proficient in the Hadoop ecosystem including HDFS, MapReduce, YARN, Hive, Pig, Sqoop, Flume, Impala, and Oozie for efficient big data management and processing.

Expert in developing scalable and high-performance data pipelines using Apache Spark (Scala/Python), Spark Streaming, Kafka, and Apache Flink.

Deep experience with AWS services (EMR, EC2, S3, Lambda, Redshift, RDS, CloudWatch) and Azure services (ADF, Data Lake Gen2, Synapse, Databricks).

Strong background in implementing real-time data streaming pipelines using Kafka, Spark Streaming, and NoSQL databases such as HBase, Cassandra, and MongoDB.

Developed and maintained ETL frameworks using PySpark, Python 3.11, and Talend for structured and semi-structured data processing.

Built interactive dashboards and KPIs using Power BI 2022 and Tableau 2024.1 to visualize performance metrics and business insights.

Migrated on-premises Hadoop clusters and Hive/MapReduce workloads to AWS EMR and Azure Synapse for cost optimization and performance improvements.

Experienced in integrating and automating pipelines using CI/CD tools such as Jenkins, Git, Airflow, and Azure DevOps.

Knowledgeable in various file formats and data serialization techniques including Avro, ORC, Parquet, and JSON.

Skilled in implementing data governance and security using Kerberos, Sentry, Ranger, and SSL/TLS for enterprise-grade compliance.

Extensive knowledge of data warehousing concepts and dimensional modeling for BI/reporting environments.

Experienced in handling petabyte-scale data ingestion, transformation, and loading into HDFS, S3, and Azure Blob Storage.

Developed data pipelines using Spark, Scala, and Apache Kafka to ingest data from CSL sources and store it in protected HDFS folders.

Migrated databases to the Azure SQL cloud platform and performed performance tuning.

Wrote MapReduce code to process and parse data from various sources, storing the parsed data in HBase and Hive using HBase-Hive integration.

Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions and data warehouse tools for reporting and data analysis.

Extracted data into HDFS and exported it back out using Sqoop import and export.

Created Hive, Pig, SQL and HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.

Hands-on experience installing, configuring, and using Hadoop ecosystem components including MapReduce, HDFS, HBase, Hive, Sqoop, Pig, Zookeeper, and Flume.

Experienced with Scala and Spark, improving the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, pair RDDs, and Spark on YARN.

TECHNICAL SKILLS:

•Big Data & Processing: Hadoop (HDFS, MapReduce, YARN), Apache Spark (1.6 – 3.5), Spark SQL, Spark Streaming, Apache Flink, Pig, Hive (1.2 – 3.1), Sqoop, Flume, Impala, Oozie, Zookeeper, Tez

•Cloud Platforms: AWS (EMR, EC2, S3, Lambda, Redshift, CloudWatch, CloudFormation, Lake Formation), Azure (Data Factory, Databricks, Synapse Analytics, Blob Storage, Data Lake Gen2, Cosmos DB, Azure DevOps)

•Programming Languages: Python (2.7 – 3.11), Scala (2.12), Java, Shell Scripting, SQL, C, C++, JavaScript

•Data Streaming & Messaging: Kafka (2.4 – 3.6), Azure Event Hub, Apache NiFi

•NoSQL & RDBMS: HBase, Cassandra, MongoDB, Oracle, MySQL, PostgreSQL, SQL Server, Delta Lake

•Visualization & Reporting: Tableau (2024.1), Power BI (2015 – 2024), Kibana

•CI/CD & DevOps Tools: Git, Jenkins, Azure DevOps, Docker, Airflow, Control-M

•Serialization & Formats: JSON, Avro, Parquet, ORC, CSV

•Security & Governance: Kerberos, Sentry, Ranger, SSL/TLS

•Other Tools: Hue, Ambari, Jira, Confluence, Great Expectations, Kusto Explorer, GitHub

WORK EXPERIENCE:

Client: Mayo Clinic, Rochester, MN March 2023 - Present

Role: Principal Big Data Engineer

Responsibilities:

Designed and implemented data ingestion and transformation pipelines on AWS EMR using Spark 3.5 and Python 3.11 for real-time analytics.

Built distributed data processing systems to ingest over 20 TB of data per day from Kafka topics and batch sources into S3 and Redshift.

Leveraged AWS Glue and Lambda functions to orchestrate and monitor ETL pipelines while optimizing for cost and performance.

Created Spark-based batch jobs to cleanse, transform, and aggregate raw JSON/CSV data into curated S3 layers.
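
For illustration, a minimal PySpark sketch of this kind of batch curation job; the bucket names, schema, and column names below are hypothetical, not the actual pipeline:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("curate-raw-events").getOrCreate()

# Read raw JSON landed by upstream ingestion (hypothetical layout).
raw = spark.read.json("s3://example-raw-zone/events/2025/07/30/")

curated = (
    raw.filter(F.col("event_id").isNotNull())                # basic cleansing
       .dropDuplicates(["event_id"])
       .withColumn("event_ts", F.to_timestamp("event_ts"))   # normalize types
       .groupBy("event_date", "event_type")
       .agg(F.count("*").alias("event_count"))               # aggregate layer
)

# Write the curated layer back to S3 as partitioned Parquet.
(curated.write.mode("overwrite")
        .partitionBy("event_date")
        .parquet("s3://example-curated-zone/event_counts/"))
```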

Developed data lake governance policies using AWS Lake Formation and integrated with Ranger for fine-grained access control.

Performed log analysis using Elasticsearch and Kibana to track and debug streaming job failures.

Designed highly available and scalable solutions for real-time anomaly detection using Spark Streaming and Kafka.
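
A hedged sketch of the streaming piece using Spark Structured Streaming, with a static threshold standing in for the actual anomaly-detection logic; brokers, topic, schema, and paths are placeholders, and the job also needs the spark-sql-kafka connector on the classpath:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("anomaly-stream").getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("metric", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
         .option("subscribe", "telemetry")                   # placeholder topic
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
         .select("e.*")
)

# Flag readings above a fixed threshold; a production job would use rolling stats.
anomalies = events.filter(F.col("metric") > 100.0)

query = (
    anomalies.writeStream.format("parquet")
             .option("path", "s3://example-curated-zone/anomalies/")
             .option("checkpointLocation", "s3://example-curated-zone/chk/anomalies/")
             .outputMode("append")
             .start()
)
query.awaitTermination()
```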

Built Hive external tables with partitions for efficient querying on large datasets stored in S3.

Tuned Spark jobs using dynamic allocation, broadcast joins, and caching strategies for optimal performance.
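
Two of these techniques, broadcasting a small dimension table and caching a reused intermediate result, shown in a short PySpark sketch with hypothetical paths and columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder.appName("tuning-demo")
         .config("spark.dynamicAllocation.enabled", "true")  # dynamic allocation
         .getOrCreate())

facts = spark.read.parquet("s3://example-curated-zone/events/")      # large table
dims  = spark.read.parquet("s3://example-curated-zone/device_dim/")  # small table

# Broadcast the small side so the join avoids shuffling the large table.
joined = facts.join(F.broadcast(dims), on="device_id", how="left")

# Cache when the same intermediate result feeds several downstream actions.
joined.cache()
joined.groupBy("event_date").count().show()
joined.groupBy("event_type").count().show()
joined.unpersist()
```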

Automated cluster provisioning and deployment scripts using AWS CloudFormation and Python-based CLI tools.

Led migration efforts from on-prem Hadoop to AWS EMR, reducing operational overhead by 30%.

Configured CloudWatch metrics and custom dashboards to monitor job performance and cluster health.

Defined data quality checks and alerts in Airflow using the Great Expectations framework.
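
A hedged Airflow sketch of such a daily quality gate; to stay self-contained it substitutes a plain Python row-count check (via pandas/pyarrow) for the Great Expectations suite, and the DAG id, path, and schedule are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def check_row_count(path: str = "/data/curated/events.parquet") -> None:
    """Fail the task (and trigger Airflow alerting) if the dataset is empty."""
    import pandas as pd  # assumed available on the worker, with pyarrow installed

    df = pd.read_parquet(path)
    if len(df) == 0:
        raise ValueError(f"Data quality check failed: {path} is empty")


with DAG(
    dag_id="daily_quality_checks",   # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="row_count_check", python_callable=check_row_count)
```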

Created Tableau dashboards to visualize streaming pipeline metrics and alerting trends.

Mentored a team of 5 junior data engineers and conducted weekly knowledge-sharing sessions on AWS best practices.

Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.

Developed a strategy for Full load and incremental load using Sqoop.

Documented requirements, including existing code to be implemented using Spark, Hive, HDFS, HBase, and Elasticsearch.

Developed a Python script to load CSV files into S3; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
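
A minimal boto3 sketch of that CSV loader; the bucket name, prefix, and local directory are placeholders:

```python
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "example-landing-bucket"   # placeholder bucket


def upload_csvs(local_dir: str, prefix: str) -> None:
    """Upload every .csv file under local_dir to s3://BUCKET/prefix/."""
    for name in os.listdir(local_dir):
        if name.endswith(".csv"):
            key = f"{prefix}/{name}"
            s3.upload_file(os.path.join(local_dir, name), BUCKET, key)
            print(f"uploaded s3://{BUCKET}/{key}")


if __name__ == "__main__":
    upload_csvs("/data/exports", "raw/csv/2025-07-30")
```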

Good knowledge of cluster coordination services through Zookeeper and Kafka.

Experienced in writing real-time processing jobs using Spark Streaming with Kafka.

Assisted with cluster maintenance and monitoring, adding and removing cluster nodes; installed and configured Hadoop, MapReduce, and HDFS; and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.

Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.

Created data partitions on large data sets in S3 and DDL on partitioned data.

Converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.

Environment: AWS (EMR, S3, EC2, Lambda, Redshift, CloudFormation, CloudWatch), Spark 3.5, Python 3.11, Hive 3.1, Kafka 3.6, Ranger, Tableau 2024.1, Airflow, ElasticSearch, Kibana, Git, Jenkins

Client: Barclays Investment Bank, Whippany, NJ April 2021 - February 2023

Role: Lead Big Data Engineer

Responsibilities:

Developed ingestion processes using Kafka and Azure Event Hub to bring streaming data into Data Lake Gen2.

Designed ETL pipelines to clean and prepare structured/unstructured data from REST APIs and IoT feeds.

Implemented Spark jobs using PySpark 3.9 and Spark SQL to transform raw data and store curated datasets in Parquet format.
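
An illustrative PySpark/Spark SQL sketch of this curation step, assuming a hypothetical ADLS Gen2 account, container layout, and columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("curate-adls").getOrCreate()

# Raw zone in ADLS Gen2 (placeholder account and container).
raw = spark.read.json("abfss://raw@exampleaccount.dfs.core.windows.net/events/")
raw.createOrReplaceTempView("raw_events")

curated = spark.sql("""
    SELECT device_id,
           CAST(event_ts AS TIMESTAMP) AS event_ts,
           event_type
    FROM raw_events
    WHERE event_type IS NOT NULL
""")

# Land the curated dataset as Parquet in the curated container.
curated.write.mode("overwrite").parquet(
    "abfss://curated@exampleaccount.dfs.core.windows.net/events/"
)
```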

Enabled role-based access control across Azure services using Azure AD and service principals.

Deployed and maintained CI/CD pipelines for deploying PySpark scripts and ADF artifacts via Azure DevOps.

Scheduled and orchestrated jobs using Apache Airflow for data integration between Blob Storage, Synapse, and SQL pools.

Built real-time fraud detection engine using Spark Structured Streaming and Azure Cosmos DB.

Tuned Spark configurations for memory management and executor optimization, resulting in a 40% speedup.

Defined reusable data quality rules and validation logic in Data Factory for ingestion-level checks.

Conducted POC on Kusto Query Language (KQL) and Azure Data Explorer for telemetry analytics.

Collaborated with product teams to understand analytics needs and translate them into robust, scalable solutions.

Worked with business analysts to define KPIs and developed Power BI 2022 dashboards for reporting.

Performed metadata management using Purview and integrated with Data Factory lineage view.

Provided L3 support and handled on-call issues related to Spark job failures and ADF latency.

Initiated cloud cost optimization by introducing spot instances and auto-scaling policies in Azure Databricks.

Implemented various Azure resources using the Azure Portal and PowerShell with ARM deployment models.

Designed and implemented a Spark Streaming ingestion framework for various data sources such as REST APIs and Kafka, using the Spark Streaming Scala API and Kafka.

Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.

Collaborated with data teams to showcase project KPIs using Azure Data Lake, Scope, and Azure Data Explorer (Kusto).

Used Azure Synapse SQL pools to load fact and dimension tables.

Environment: Azure (Data Factory, Synapse, Blob Storage, Databricks, Data Lake Gen2, Azure DevOps), Kafka, PySpark 3.9, Delta Lake, Power BI 2022, Kusto Explorer, Cosmos DB, Git, Airflow, REST API, KQL

Client: Macy's, Duluth, GA September 2019 - March 2021

Role: Senior Big Data Engineer

Responsibilities:

Created end-to-end data pipeline frameworks using Spark 2.4 with Scala and Hive 2.3 for processing clickstream and transaction data.

Ingested multi-terabyte datasets from Oracle and MySQL using Sqoop and transformed using MapReduce.

Integrated HBase and Hive for storage and queried using HiveQL to perform behavioral analysis.

Developed batch processing jobs to clean, join, enrich, and load data into HDFS and Hive tables.

Used Oozie to coordinate and schedule Hadoop jobs across different time zones.

Implemented Kafka consumers using Scala for high-throughput ingestion of application logs.
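
The consumers here were written in Scala; for illustration, an equivalent hedged sketch in Python with kafka-python, using placeholder broker, topic, and group id:

```python
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "app-logs",                              # placeholder topic
    bootstrap_servers=["broker1:9092"],      # placeholder broker
    group_id="log-ingestion",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # A real consumer would batch and land these records in HDFS/Hive;
    # here we just print a couple of fields to show the flow.
    print(record.get("level"), record.get("message"))
```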

Tuned and optimized MapReduce jobs to improve performance by 35% over legacy implementation.

Built dashboards using Power BI to monitor performance of Hive queries and cluster utilization.

Leveraged Avro and Parquet for storage format optimization and schema evolution.

Conducted unit testing and performance benchmarking for various ingestion strategies.

Handled 24/7 production support for Spark and MapReduce jobs with on-call rotation.

Collaborated with QA and data analysts for UAT and sign-off before production release.

Designed reusable Hive UDFs for custom transformations and text parsing.

Developed monitoring scripts in Python and Shell to validate job completion.

Developed data pipelines on Amazon AWS to extract data from weblogs and store it in HDFS; worked extensively with Sqoop to import metadata from Oracle.

Analyzed data by performing Hive queries (HiveQL) and running Pig scripts (Pig Latin) for data ingestion.

Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.

Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.

Supported Amazon AWS and RDS to host static/media files and the database in the Amazon cloud.

Implemented automated workflows for all jobs using Oozie and shell scripts.

Used Spark SQL functions to move data from stage Hive tables to fact and dimension tables.
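
A hedged Spark SQL sketch of a stage-to-fact load of this kind; the database, table, and column names are assumptions:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("load-fact")
         .enableHiveSupport()
         .getOrCreate())

# Populate the fact table from the stage table, joining a date dimension
# to resolve the surrogate key (all names are hypothetical).
spark.sql("""
    INSERT INTO TABLE dw.fact_sales
    SELECT s.order_id,
           d.date_key,
           s.store_id,
           s.amount
    FROM   stage.sales s
    JOIN   dw.dim_date d
      ON   s.order_date = d.calendar_date
""")
```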

Created microservices using AWS Lambda and API Gateway with REST APIs.
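
For illustration, a minimal Python Lambda handler in the API Gateway proxy-integration style; the lookup logic and payload are placeholders:

```python
import json


def lambda_handler(event, context):
    # Path parameters arrive in the proxy event (assumed REST proxy setup).
    item_id = (event.get("pathParameters") or {}).get("id", "unknown")

    body = {"id": item_id, "status": "ok"}  # placeholder payload
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }
```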

Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.

Implemented data ingestion and cluster handling for real-time processing using Kafka.

Environment: AWS (EMR, S3, EC2), Hive 2.3, HDFS, Sqoop, Spark 2.4, Scala 2.12, MapReduce, Oozie, Power BI, Kafka 2.4, Avro, Parquet, Git, Shell

Client: Comcast, Philadelphia, PA January 2017 - August 2019

Role: Mid-Level Big Data Engineer

Responsibilities:

Developed scalable batch ETL jobs using Spark 2.1 and Hive 2.1 to process raw data into consumable datasets.

Built and maintained ingestion pipelines using Flume and Sqoop for log and metadata ingestion.

Created Hive external tables with custom SerDes to handle semi-structured JSON data.
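
A hedged sketch of that kind of DDL issued through Spark's Hive support; the table, columns, and location are illustrative, and the built-in hcatalog JsonSerDe stands in for the custom SerDes mentioned above:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("ddl-demo")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS logs.app_events (
        event_id   STRING,
        event_type STRING,
        payload    STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 's3://example-data-lake/raw/app_events/'
""")
```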

Used Pig and Hive for preprocessing and aggregating log events for user segmentation.

Assisted in implementing S3-based data lake and used AWS Glue for data cataloging.

Collaborated with DevOps team to containerize Spark jobs using Docker and deploy via Jenkins.

Designed partitioning strategies to enhance Hive query performance over large datasets.

Troubleshot long-running jobs by analyzing logs using YARN Resource Manager and Hue.

Implemented data deduplication logic using Spark Window functions.
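
A short PySpark sketch of the window-function dedup pattern (keep the latest record per key); column names and paths are assumptions:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("dedup").getOrCreate()

events = spark.read.parquet("s3://example-data-lake/raw/events/")  # placeholder path

# Rank records within each business key by recency, then keep only the latest.
w = Window.partitionBy("event_id").orderBy(F.col("updated_at").desc())

deduped = (
    events.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn")
)

deduped.write.mode("overwrite").parquet("s3://example-data-lake/clean/events/")
```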

Performed daily health checks of Hadoop cluster and managed job failures.

Automated metadata refresh and Hive schema reconciliation scripts in Python 3.6.

Created and maintained version-controlled codebase in Git with proper branching.

Participated in weekly sprint meetings to plan and review deliverables.

Engaged in code reviews and peer testing for collaborative development.

Conducted systems design, feasibility, and cost studies and recommended cost-effective cloud solutions such as Amazon Web Services (AWS).

Created session beans and controller servlets for handling HTTP requests from Talend.

Utilized Waterfall methodology for team and project management.

Used Git for version control with Data Engineer team and Data Scientists colleagues.

Environment: AWS (EMR, S3), Spark 2.1, Hive 2.1, Pig, Sqoop, Flume, Docker, Python 3.6, Jenkins, Git, Hue, YARN, Shell

Client: RedPine Solutions, Hyderabad, India June 2013 - September 2016

Role: Junior Data Engineer

Responsibilities:

Supported the data engineering team in writing HiveQL scripts and optimizing queries.

Maintained and ingested small data batches into HDFS using Flume and Sqoop.

Conducted data validation checks for consistency and completeness in HDFS.

Wrote simple Pig Latin scripts to perform transformation on structured log files.

Documented Hive tables, partitions, and lineage for reporting.

Generated summary statistics using Spark RDDs and basic transformations in Scala.
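
The original transformations were in Scala; an equivalent hedged PySpark RDD sketch, with an assumed input path and record layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-stats").getOrCreate()
sc = spark.sparkContext

# Assume tab-separated log lines whose third field is a numeric latency value.
lines = sc.textFile("hdfs:///data/logs/2016-01-01/")
latencies = lines.map(lambda line: float(line.split("\t")[2]))

stats = latencies.stats()  # count, mean, stdev, max, min in a single pass
print(stats.count(), stats.mean(), stats.stdev(), stats.max(), stats.min())
```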

Created shell scripts for automating daily data loads from UNIX servers to HDFS.

Assisted in creating internal dashboards using Tableau for operational KPIs.

Implemented access controls on Hive tables using Ranger policies.

Gained familiarity with Kerberos authentication and job scheduling via Oozie.

Built unit tests for data quality checks using Python and Hive queries.

Participated in Hadoop cluster upgrades and testing under supervision.

Contributed to team knowledge base on Spark job tuning techniques.

Performed backup and archival of Hive metadata and data lineage documentation.

Supported business users in ad-hoc reporting using Hive queries.

Environment: Hadoop 2.7, HDFS, Hive 1.2, Sqoop, Pig, Spark 1.6, Scala, Tableau, Python 2.7, Shell, Ranger, Oozie


