Data Engineer

Arun Kumar

+1-512-***-****

****.*********@*****.***

PROFESSIONAL SUMMARY:

Data Engineer with 6+ years of professional experience in building, optimizing, and maintaining scalable data pipelines, ETL/ELT workflows, and distributed data platforms.

Expertise in Big Data technologies including Hadoop (HDFS, MapReduce, YARN), Spark (Core, SQL, Streaming, MLlib), Kafka, Hive, Sqoop, Flume, Oozie, and Impala.

Proficient in Python, PySpark, Scala, and SQL for data engineering, analytics, and large-scale processing.

Strong background in data ingestion, transformation, and warehousing using Teradata, Oracle, SQL Server, MySQL, PostgreSQL, and NoSQL (HBase, Cassandra, MongoDB).

Extensive experience developing real-time streaming solutions with Kafka, Spark Streaming, and Flink for low-latency analytics.

Skilled in designing data lakehouse architectures with Databricks, Delta Lake, and Snowflake to unify batch and streaming workloads.

Hands-on expertise in cloud platforms including AWS (EMR, Glue, S3, Redshift, Athena, Lambda, DynamoDB, RDS, SNS, SQS), Azure HDInsight, and GCP services.

Experienced with data modeling and warehouse design (Star Schema, Snowflake Schema, Fact/Dimension design).

Proficient in working with multiple file formats (Parquet, ORC, Avro, JSON, CSV, XML) for analytics and data processing.

Strong knowledge of orchestration and scheduling tools including Airflow and Oozie for workflow automation and monitoring (a minimal Airflow DAG sketch follows this summary).

Implemented CI/CD pipelines for data workflows using Jenkins, Git, Docker, Kubernetes, and Terraform/CloudFormation for infrastructure as code (IaC).

Expertise in data quality, governance, lineage, and security for enterprise data platforms with compliance considerations.

Experienced in building production-ready Spark applications for batch processing, streaming, and ML feature engineering.

Collaborated with cross-functional teams in Agile/Scrum environments to deliver business-focused, cost-optimized, and scalable data solutions.

Strong understanding of software development life cycle (SDLC) and DevOps practices for automated testing, deployment, and monitoring.
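
As a small companion to the orchestration experience listed above, the following is a minimal Airflow DAG sketch that chains an ingestion step and a Spark transformation step. It is illustrative only: the DAG id, schedule, S3 paths, and spark-submit command are hypothetical placeholders and not taken from any pipeline described in this resume.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_incremental_load",        # placeholder DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",          # run daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:

    # Placeholder ingestion step (bucket names are hypothetical).
    ingest = BashOperator(
        task_id="ingest_raw_files",
        bash_command="aws s3 sync s3://example-landing/ s3://example-raw/",
    )

    # Placeholder transformation step submitted to a Spark cluster.
    transform = BashOperator(
        task_id="run_spark_transform",
        bash_command="spark-submit --deploy-mode cluster transform_job.py",
    )

    # The transformation runs only after ingestion succeeds.
    ingest >> transform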

TECHNICAL SKILLS:

Big Data & Distributed Systems: Hadoop (HDFS, MapReduce, YARN), Spark (Core, SQL, Streaming, MLlib), Hive, Pig, Sqoop, Flume, Oozie, Airflow, Kafka (Connect, Streams), Zookeeper, StreamSets, Apache Flink.

Cloud & Lakehouse Platforms: AWS (EC2, S3, EMR, Glue, Redshift, Athena, Lambda, DynamoDB, RDS, SNS, SQS, CloudWatch, CodeBuild), Azure (Databricks, Data Factory, Data Lake, Blob Storage, Synapse, Cosmos DB, Azure DevOps), Snowflake, Databricks Delta Lake, Google BigQuery.

Programming & Scripting: Python, PySpark, Scala, Java, SQL, HiveQL, Shell Scripting, Pig Latin.

Data Warehousing & Modeling: Star Schema, Snowflake Schema, Fact/Dimension design, Data Vault Modeling.

Databases: Oracle, MySQL, PostgreSQL, MS SQL Server, Teradata, DB2.

NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB, Cosmos DB.

ETL & BI Tools: Informatica, Power BI, Tableau, AWS Glue, Talend.

DevOps, CI/CD & Infrastructure: Git, GitHub, Bitbucket, SVN, Jenkins, Docker, Kubernetes, Terraform, CloudFormation.

Data Governance & Observability: Apache Atlas, AWS Lake Formation, Collibra, Prometheus, Grafana, CloudWatch, Datadog.

Operating Systems: Linux (Ubuntu), Unix, Windows (7/10/11), macOS.

PROFESSIONAL EXPERIENCE:

Client: Morgan Stanley – New York, NY | Jan 2024 – Present
Role: AWS Data Engineer

Responsibilities:

Migrated terabytes of data from on-prem data warehouses to AWS cloud storage (S3, Redshift) using incremental and batch ingestion strategies.

Designed and developed data pipelines with Airflow to schedule and orchestrate PySpark jobs for incremental data loads, integrating with Flume for weblog ingestion.

Built and optimized Spark jobs on Databricks (PySpark/Scala) for data cleansing, validation, enrichment, and transformation to support analytics use cases.

Designed and developed ETL processes in AWS Glue to migrate and transform data from external sources (S3, MySQL, Parquet, JSON) into Redshift clusters.

Developed Spark Streaming applications to process real-time data streams from Kafka, Snowflake, and other sources, persisting results into DynamoDB and S3 (see the sketch at the end of this section).

Created PySpark Glue jobs to implement transformation logic, normalization, and aggregations, optimizing for Redshift-based reporting.

Implemented batch and streaming data pipelines leveraging Databricks Delta Lake for ACID-compliant, scalable data lakehouse architecture.

Fine-tuned Spark applications/jobs by optimizing partitioning, caching, and broadcast variables to improve performance and processing time.

Built Hive external tables and scripted HiveQL queries for ETL, partitioning, and data analysis.

Worked with multiple file formats (CSV, JSON, XML, Avro, ORC, Parquet) for ingestion, transformation, and analytics.

Automated end-to-end ETL pipelines to support both full and incremental loads of data across multiple datasets.

Leveraged AWS services including EMR, S3, Glue, Glue Data Catalog, Redshift, Athena, Lambda, DynamoDB, and QuickSight for large-scale data applications.

Developed Spark JDBC APIs to integrate and exchange data between Snowflake and AWS S3 for reporting and downstream consumption.

Generated dashboards and reports in AWS QuickSight and Power BI to deliver insights on customer behavior and usage patterns.

Integrated CI/CD pipelines using Jenkins and Git for Spark/Glue workflows, deploying jobs into Dockerized containers managed via Kubernetes.

Performed Git repository design, branching strategies, and access control, ensuring proper version control and collaboration.

Designed custom adapters for data ingestion from Snowflake, SQL Server, and MongoDB into HDFS for analytics.

Utilized Terraform/CloudFormation for provisioning AWS infrastructure, ensuring reproducibility and automation of data environments.

Built data quality checks, monitoring, and alerting within pipelines using CloudWatch, Prometheus, and Airflow SLA monitoring.

Partnered with business stakeholders and BI teams to deliver data products and dashboards aligned with business KPIs.

Environment: Python, PySpark, Scala, SQL, Redshift, Oracle, Hive, Snowflake, MongoDB, DynamoDB, Kafka, MySQL, PostgreSQL, Hadoop, Databricks, AWS (S3, EMR, Glue, Lambda, Athena, QuickSight, Data Pipeline, DynamoDB, RDS, SQS, SNS, CloudWatch), Informatica, Power BI, Docker, Kubernetes, Airflow, Git, Jenkins, Terraform, CloudFormation.
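
As a companion to the Kafka streaming work noted above, the following is a minimal PySpark Structured Streaming sketch that reads a Kafka topic and lands Parquet files on S3. It is illustrative only: the broker address, topic, event schema, and bucket paths are hypothetical, and it assumes the Spark Kafka connector (spark-sql-kafka) is available on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-s3-sketch").getOrCreate()

# Assumed payload schema for the incoming events (hypothetical fields).
event_schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the raw Kafka stream; broker and topic are placeholders.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "transactions")
       .option("startingOffsets", "latest")
       .load())

# Parse the JSON payload into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Persist micro-batches as Parquet on S3; bucket paths are placeholders.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/curated/transactions/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/transactions/")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()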

Client: Elevance Health – Indianapolis, IN | Sep 2022 – Dec 2023
Role: Data Engineer

Responsibilities:

Installed, configured, and worked with Hadoop ecosystem components including HDFS, MapReduce, Hive, Sqoop, Pig, HBase, Flume, Zookeeper, and Spark for large-scale data processing.

Designed and developed data warehouse and BI architectures, implementing ETL pipelines to move data from diverse structured/unstructured sources into HDFS and Azure Data Lake for downstream analytics.

Built and optimized Azure Data Factory (ADF) pipelines to ingest data from on-premises and cloud relational/non-relational systems into Azure Data Lake Gen2, Azure SQL DB, and Synapse DW.

Created and provisioned Azure Databricks clusters for batch and streaming data processing, leveraging PySpark and Spark SQL for transformations, cleansing, and enrichment (see the sketch at the end of this section).

Implemented data ingestion frameworks with ADF, Databricks, and Kafka to process high-volume, real-time streaming data into Azure Lakehouse (Delta Lake).

Developed and scheduled complex ADF pipelines and dataflows for orchestration of multi-step transformations, validations, and aggregations.

Implemented custom ETL solutions using PySpark, Shell scripting, and Informatica for batch and near real-time ingestion from XML, JSON, and RDBMS into the data lake.

Automated deployments of ADF and Databricks pipelines through CI/CD pipelines in Jenkins, GitHub, and Docker, integrating with Kubernetes for containerized execution.

Built and maintained infrastructure as code (IaC) with Terraform and ARM templates for provisioning Azure resources (ADF, Databricks, Storage, Synapse).

Integrated AWS Data Pipeline to configure data movement from S3 to Redshift, enabling hybrid cloud data flows.

Implemented a real-time tax computation engine using Oracle, StreamSets, Kafka, and Spark Structured Streaming.

Designed and implemented data quality frameworks and validation scripts to ensure consistency, accuracy, and integrity across ETL pipelines.

Performed end-to-end data validations, unit testing, and production support for ETL workflows loading into the enterprise data warehouse.

Collaborated with BI teams to deliver dashboards and visualizations in Power BI and AWS QuickSight for business stakeholders.

Utilized scheduling/orchestration tools (Tidal, Airflow) to automate ETL workflows and monitoring.

Conducted code reviews, design reviews, and test sign-off, ensuring best practices in coding, optimization, and maintainability.

Environment: Python, PySpark, Scala, SQL, Hive, Oracle, MongoDB, Snowflake, Power BI, Tableau, Azure Data Factory (ADF), Azure Databricks, Azure Data Lake Gen2, Azure Synapse, AWS (S3, Redshift, EMR, Lambda, Glue, Data Pipeline), Kafka, StreamSets, Hadoop, Informatica, Docker, Kubernetes, Git, Jenkins, Terraform, Airflow, GCP BigQuery.
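
As a companion to the Databricks transformation work noted above, the following is a minimal PySpark batch sketch that reads raw JSON from ADLS Gen2, applies basic cleansing, and writes a partitioned Delta table. It is illustrative only: the storage account, container, and column names are hypothetical, and it assumes Delta Lake is available in the cluster runtime.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, trim

spark = SparkSession.builder.appName("adls-to-delta-sketch").getOrCreate()

# Placeholder ADLS Gen2 paths (storage account and containers are hypothetical).
raw_path = "abfss://raw@examplestorage.dfs.core.windows.net/claims/"
curated_path = "abfss://curated@examplestorage.dfs.core.windows.net/claims_delta/"

claims = spark.read.json(raw_path)

# Basic cleansing: de-duplicate, trim identifiers, type the date, drop bad amounts.
cleansed = (claims
            .dropDuplicates(["claim_id"])
            .withColumn("member_id", trim(col("member_id")))
            .withColumn("service_date", to_date(col("service_date"), "yyyy-MM-dd"))
            .filter(col("claim_amount") >= 0))

# Write a partitioned Delta table for downstream batch and streaming consumers.
(cleansed.write
 .format("delta")
 .mode("overwrite")
 .partitionBy("service_date")
 .save(curated_path))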

Client: Bank of America – Charlotte, NC | Jul 2021 – Aug 2022
Role: Big Data Developer

Responsibilities:

Designed and developed MapReduce, Hive, and Pig scripts for large-scale data cleansing, transformation, and pre-processing.

Migrated legacy data from SQL Server and Teradata into Amazon S3, enabling cloud-based data lake architecture.

Exported transformed datasets into Snowflake by staging tables and automating loads from Amazon S3.

Created consumption views and optimized queries to reduce execution time for complex analytics workloads.

Developed Spark applications in Scala, PySpark, and Spark SQL/Streaming for data validation, cleansing, transformation, enrichment, and aggregation.

Implemented data profiling and quality metrics across structured, semi-structured, and unstructured sources using Spark and Python.

Built Kafka-based real-time streaming pipelines integrated with Spark Streaming and NiFi to ingest server logs, sensor data, and Kinesis streams into the data lake.

Developed NiFi data pipelines for ingesting real-time and batch data in multiple formats (JSON, Avro, Parquet, CSV) into Hadoop/HDFS.

Implemented workflows with Apache Oozie for batch pipeline orchestration and scheduling.

Optimized Spark jobs by leveraging RDD transformations, caching, partitioning, and broadcast variables to improve performance and scalability (see the sketch at the end of this section).

Executed Hadoop/Spark workloads on AWS EMR, storing intermediate and final results in S3 buckets.

Designed and implemented PL/SQL procedures, triggers, and materialized views for performance optimization of downstream queries.

Created data ingestion and transformation pipelines for high-volume tuning events from ElasticSearch, Kafka, and Amazon Kinesis into Enterprise Data Lake.

Performed cluster analysis and performance monitoring of Hadoop/Spark jobs, including HDFS file format optimization (Avro, ORC, Parquet, JSON, Sequence Files).

Integrated Databricks and Delta Lake for unified batch and streaming pipelines with ACID compliance and schema evolution.

Automated ETL workflow deployments via Airflow and Jenkins CI/CD, with Git for version control and Docker for containerized execution.

Implemented infrastructure as code (IaC) with Terraform to provision and manage AWS resources for big data workloads.

Set up data quality checks, monitoring, and alerting using CloudWatch, Prometheus, and Airflow SLA metrics.

Partnered with data science and BI teams to provide clean, curated datasets for analytics and reporting.

Environment: HDFS, MapReduce, Spark (Core, SQL, Streaming, MLlib), Scala, PySpark, Hive, Pig, Kafka, NiFi, Snowflake, PL/SQL, AWS (S3, EMR, Kinesis, CloudWatch), Oozie, Databricks, Delta Lake, Cassandra, ElasticSearch, Python, Git, Jenkins, Docker, Terraform, Airflow.
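
As a companion to the Spark tuning work noted above, the following is a minimal PySpark sketch of common tuning patterns: explicit repartitioning on the join key, caching a reused DataFrame, and broadcasting a small dimension table. It is illustrative only: the table names, join key, and partition count are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("spark-tuning-sketch").getOrCreate()

# Placeholder inputs: a large fact table and a small dimension table on S3.
transactions = spark.read.parquet("s3a://example-bucket/raw/transactions/")
branches = spark.read.parquet("s3a://example-bucket/raw/branches/")

# Repartition the large input on the join key to reduce shuffle skew, and cache it
# because it feeds several downstream aggregations.
transactions = transactions.repartition(200, "branch_id").cache()

# Broadcast the small dimension so the join avoids a full shuffle of both sides.
enriched = transactions.join(broadcast(branches), "branch_id", "left")

daily_totals = (enriched
                .groupBy("branch_id", "txn_date")
                .sum("amount"))

daily_totals.write.mode("overwrite").parquet("s3a://example-bucket/curated/daily_totals/")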

Client: Amazon – Seattle, Washington | May 2019 – June 2021
Role: Big Data Engineer

Responsibilities:

Developed Spark Streaming models to process transactional data from multiple sources, creating micro-batches for real-time fraud detection and error record handling.

Performed data transformations, mapping, cleansing, monitoring, debugging, and performance tuning across Hadoop clusters.

Converted complex Hive/SQL queries into Spark RDD and DataFrame transformations using Python and Scala to optimize ETL pipelines (see the sketch at the end of this section).

Designed and implemented DDL/DML scripts in SQL and HQL for table creation, data analysis, and reporting in RDBMS and Hive.

Used Sqoop for importing/exporting data between HDFS and relational databases (MySQL, SQL Server, Oracle).

Created Hive tables, performed data loading, and developed Hive UDFs for custom transformations.

Exported processed data to MySQL and Redshift for reporting and visualization purposes.

Loaded and transformed flat-file datasets (CSV/JSON) using Informatica, staging the data for downstream analytics.

Researched and recommended Hadoop migration technologies aligned with enterprise architecture and scalability requirements.

Built scalable distributed data solutions using Hadoop ecosystem components.

Processed and structured large log datasets and staging data in HDFS, followed by ingestion into AWS S3 and Redshift for analytics.

Developed Hive queries and Impala queries for analyzing structured and semi-structured datasets; created custom filters on HBase using APIs.

Handled ETL pipelines from multiple sources into HDFS using Sqoop and Hive, followed by processing with Spark and storage in Redshift.

Validated datasets using Spark components and implemented data quality checks for processed data in AWS Redshift.

Worked as an ETL and Tableau developer, designing ETL mappings in Informatica and creating advanced visualizations, complex calculations, and dashboards in Tableau Desktop.

Integrated BI reporting and analytics using Tableau with processed datasets from Redshift and Hive to deliver business insights.

Adopted data lake practices early, storing raw and curated data in S3 buckets for analytics and reporting.

Environment: Spark (Core, SQL, Streaming), Hive, HDFS, Sqoop, HBase, Python, Scala, MySQL, Impala, AWS (S3, EC2, Redshift), Tableau, Informatica.
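
As a companion to the Hive-to-Spark conversion work noted above, the following is a minimal PySpark sketch that rewrites a HiveQL aggregation as equivalent DataFrame transformations. It is illustrative only: the database, table, and column names are hypothetical, and it assumes the SparkSession is created with Hive support enabled.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct, sum as sum_

spark = (SparkSession.builder
         .appName("hive-to-spark-sketch")
         .enableHiveSupport()
         .getOrCreate())

# HiveQL version being replaced (hypothetical table and columns):
#   SELECT customer_id, SUM(order_total) AS total_spend, COUNT(DISTINCT order_id) AS orders
#   FROM sales.orders
#   WHERE order_date >= '2020-01-01'
#   GROUP BY customer_id;
orders = spark.table("sales.orders")

summary = (orders
           .filter(col("order_date") >= "2020-01-01")
           .groupBy("customer_id")
           .agg(sum_("order_total").alias("total_spend"),
                countDistinct("order_id").alias("orders")))

# Persist the result back to the metastore for reporting tools.
summary.write.mode("overwrite").saveAsTable("analytics.customer_spend")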

EDUCATION

Indiana Wesleyan University, Merrillville, IN

Master of Science, Information Technology Project Management


