Big Data Processing

Location:
Dallas, TX
Salary:
$79,900
Posted:
January 31, 2025

Resume:

Saujan Baniya

*************@*****.***

470-***-****

PROFESSIONAL SUMMARY

6+ years of experience in developing, deploying, and managing big data applications.

Expertise in designing data-intensive applications using the Hadoop ecosystem, cloud platforms, and data engineering solutions.

Proficient in building data lakes and data warehouses, orchestrating data movement using Azure Data Factory, and transforming data pipelines.

Strong experience in AWS services, including S3, Redshift, Glue, Lambda, and EMR, for large-scale data processing and analytics.

Skilled in designing and deploying ETL pipelines using PySpark and AWS Glue for efficient data transformation and loading.

Played a key role in migrating Teradata objects to Snowflake, with in-depth knowledge of Snowflake Multi-Cluster Warehouses.

Hands-on expertise with Spark (RDD transformations, SQL, and streaming) and Kafka for real-time data processing.

Experience in migrating SQL databases to Azure Data Lake, Azure SQL Database, and Azure Blob Storage.

Proficient in developing SSIS packages to extract, transform, and load (ETL) data from heterogeneous sources into data warehouses.

Skilled in cloud provisioning tools like Terraform and CloudFormation for automation and infrastructure management.

Experienced in developing data governance frameworks, monitoring data quality, and troubleshooting data-related issues.

Strong understanding of data warehousing concepts with implementation experience using Redshift and Matillion.

Expertise in handling structured, semi-structured, and unstructured datasets and optimizing ETL processes for performance.

Knowledgeable in NoSQL databases such as MongoDB, Cassandra, and HBase, with experience in their integration with Spark RDDs.

Hands-on experience in setting up workflows using Airflow and Oozie for scheduling and managing Hadoop jobs (a minimal Airflow example is sketched below).
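
Illustrative Airflow sketch (not taken from any actual project in this resume): a minimal DAG showing the schedule-and-dependency pattern described above. The DAG id, task names, and shell commands are hypothetical placeholders, and the imports assume Airflow 2.x.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily workflow: ingest raw files into HDFS, then run a Hive transform.
with DAG(
    dag_id="daily_hadoop_batch",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw_files",
        bash_command="hdfs dfs -put /staging/* /data/raw/",   # placeholder command
    )
    transform = BashOperator(
        task_id="run_hive_transform",
        bash_command="hive -f /jobs/transform.hql",           # placeholder script path
    )
    ingest >> transform   # the transform step runs only after ingestion succeeds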

Technical Skills

Big Data Ecosystem: Hadoop MapReduce, Impala, HDFS, Hive, Pig, HBase, Flume, Storm, Sqoop, Oozie, Airflow, Kafka, Spark

Programming Languages: Python, Scala, SAS, Java, SQL, HiveQL, PL/SQL, UNIX Shell Scripting

Machine Learning: Linear Regression, Logistic Regression, Decision Tree, Random Forest, SVM, XGBoost, GBM, CatBoost, Naïve Bayes, PCA, LDA, K-Means, KNN

Deep Learning: PyTorch, TensorFlow, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), LSTMs, GRUs

Databases: Snowflake, MySQL, Teradata, Oracle, MS SQL, PostgreSQL, DB2, Cassandra, MongoDB, DynamoDB, CosmosDB

DevOps Tools: Jenkins, Docker, Maven, Kubernetes

Cloud Platforms: AWS (Amazon Web Services), Azure Cloud, Snowflake

Version Control: Git, GitHub, Bitbucket

ETL/BI Tools: Informatica, SSIS, SSRS, SSAS, Tableau, Power BI

Operating Systems: Mac OS, Windows 7/8/10, Unix, Linux, Ubuntu

SDLC Methodologies & Tools: Agile, Scrum, Jira, Confluence

PROFESSIONAL WORK EXPERIENCE

Unum Group - Chattanooga, TN Aug 2022 – Present

Role: Data Engineer

Collaborated with business stakeholders to gather data requirements and design big data applications to support business goals.

Built data solutions to help teams make data-driven decisions for customer acquisition and service improvements.

Extracted, transformed, and loaded data into Azure storage services using tools like Azure Data Factory, T-SQL, Spark SQL, and U-SQL.

Deployed and managed data pipelines using Kubernetes, Docker containers, and virtual machines.

Developed Kafka pipelines to process and transform data from multiple sources using Scala.

Migrated data using AWS Data Migration Services, Schema Conversion Tool, and Matillion ETL.

Created data pipelines for both SQL and NoSQL sources using Matillion, Google Dataflow, and Python.

Migrated data objects from Teradata to Snowflake for better data accessibility and performance.

Automated table creation on S3 files using AWS Lambda and Glue with Python and PySpark (see the illustrative sketch below).

Processed data from multiple sources to build algorithms, including content-based search, clustering, and personalization for user insights.

Designed and managed ETL/ELT pipelines for data ingestion and transformation using GCP and coordinated team tasks.

Set up monitoring solutions using Ansible, Terraform, Docker, and Jenkins to track pipeline performance.

Developed Spark applications with PySpark and Spark SQL to analyze, transform, and extract insights from large datasets.

Prepared data using Alteryx and SQL for Tableau and published data sources for reporting purposes.

Scheduled workflows with Oozie and Hive on AWS EC2 to automate data processing tasks.

Enhanced ETL job performance on EMR clusters using optimization techniques.

Imported and processed data from various sources into Spark RDD for analysis and transformation.

Environment: HDFS, Alteryx 11, EMR, AWS Glue, Spark, PySpark, ADF, Kafka, AWS, Pig, SBT, SSIS, Maven, Python, Spark SQL, Snowflake.
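
Illustrative sketch of the S3-to-Glue table automation pattern referenced in this role: an AWS Lambda handler (Python, boto3) that starts a Glue crawler when new objects land in S3 so the Data Catalog table definitions stay current. The crawler name, bucket layout, and event wiring are hypothetical.

import boto3

glue = boto3.client("glue")

CRAWLER_NAME = "raw_events_crawler"  # hypothetical crawler pointed at the S3 prefix


def lambda_handler(event, context):
    # Log which objects triggered the run (S3 put notifications).
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        print(f"New object landed: {key}")

    # Start the crawler; ignore the error if a run is already in progress.
    try:
        glue.start_crawler(Name=CRAWLER_NAME)
    except glue.exceptions.CrawlerRunningException:
        print("Crawler already running; skipping.")
    return {"status": "ok"}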

LifePoint Health - Brentwood, TN Aug 2021 – Jul 2022

Role: Data Engineer

Used Spark for both streaming and batch data processing with Scala.

Performed data cleansing and transformation using Hive in Spark.

Built data pipelines with Scala and Kafka to process and structure incoming data.

Designed distributed data solutions in Amazon EMR 5.6.1 for efficient processing.

Migrated Hive and MapReduce jobs from on-premises MapR to AWS cloud using EMR and Qubole.

Conducted performance testing with Apache JMeter and visualized results on Grafana dashboards.

Created advanced data and analytics solutions by developing algorithms to support predictive analytics and business intelligence.

Built real-time data feeds and microservices using AWS Kinesis, Lambda, Kafka, and Spark Streaming (see the illustrative sketch below).

Applied existing tools and algorithms to improve the accuracy of target assessments.

Performed ETL testing using an automated SSIS testing tool for unit and integration testing.

Retrieved and queried data from S3 using AWS Glue catalog and SQL operations.

Developed and deployed Spark and Scala solutions on Hadoop clusters running in GCP.

Set up and managed clusters in AWS using Docker, Ansible, and Terraform.

Conducted testing in Snowflake to determine optimal usage of cloud resources.

Wrote efficient Spark code with Python and Spark SQL for data loading and transformations.

Created user-defined functions (UDFs) in Scala and PySpark to meet specific requirements.

Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) for SQL activities.

Used Oozie to automate workflows for ingesting data from multiple sources into Hadoop.

Designed Spark jobs with Scala for efficient data processing and used Spark SQL for querying.

Environment: HDFS, Spark, Scala, PySpark, ADF, Kafka, AWS, Pig, SBT, Maven.
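
A minimal PySpark Structured Streaming sketch of the Kafka-to-Spark feed pattern referenced in this role. The broker address, topic, schema, and S3 paths are hypothetical, and the job assumes the spark-sql-kafka connector package is supplied at submit time.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("claims_stream_demo").getOrCreate()

# Hypothetical payload schema for the incoming JSON messages.
schema = StructType([
    StructField("claim_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

# Read the Kafka topic as a stream and parse the JSON value column.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "claim-events")                # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Write parsed records to Parquet with checkpointing for fault tolerance.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://demo-bucket/claims/")                     # hypothetical sink
    .option("checkpointLocation", "s3a://demo-bucket/checkpoints/claims/")
    .start()
)
query.awaitTermination()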

Kroger - Cincinnati, OH Jan 2019 – Jun 2021

Role: Data Engineer

Analyzed Hadoop clusters using tools such as Flume, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Spark, and Kafka.

Wrote Spark code with Scala and Spark SQL/Streaming to enhance data testing and processing speeds.

Utilized the Spark API on Cloudera Hadoop YARN to perform analytics on Hive data.

Implemented solutions for ingesting and processing data using Hadoop, MapReduce, MongoDB, Hive, Oozie, Flume, Sqoop, and Talend.

Built a job server (REST API, Spring Boot, Oracle DB) and a job shell for submission, profiling, and monitoring in HDFS.

Improved algorithm performance in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

Deployed applications to AWS and managed load balancing across EC2 instances.

Imported data from various sources and transformed it with Hive, MapReduce, and Sqoop, storing it in HDFS.

Designed analytical components using Scala, Spark, Apache Mesos, and Spark Streaming.

Installed Hadoop, MapReduce, and HDFS; developed MapReduce jobs with Pig and Hive for data cleaning.

Worked on Big Data integration and analytics projects involving Hadoop, Solr, Spark, Kafka, Storm, and web methods.

Built a custom data ingestion framework using Python and REST APIs.

Created Kafka producers and consumers alongside Spark and Hadoop MapReduce jobs (see the illustrative sketch below).

Managed interdependent Hadoop jobs and automated workflows using Oozie for Java MapReduce, Hive, Pig, and Sqoop.

Imported data from HDFS and HBase into Spark RDD for processing.

Configured and maintained multi-node development and test Kafka clusters.

Environment: Hadoop, Python, HDFS, Spark, MapReduce, Pig, Hive, Sqoop, Kafka, HBase, Oozie, Flume, Scala, Java, Cassandra, Zookeeper, MongoDB, AWS EC2, EMR, S3.
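
An illustrative Python sketch of the Kafka producer/consumer pattern mentioned in this role, using the kafka-python library. The broker address, topic name, and message payload are hypothetical.

import json

from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"        # hypothetical broker
TOPIC = "store-transactions"     # hypothetical topic

# Producer: serialize dicts as JSON and publish them to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"store_id": 42, "amount": 19.99})
producer.flush()

# Consumer: read from the beginning of the topic and deserialize each record.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,    # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)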

Education

Master’s in Business Analytics (MBA): The University of Findlay, Findlay, OH, 2024


