BIG DATA DEVELOPER

Location:
Dallas, TX
Posted:
May 25, 2022

Phone: 214-***-****

Email: adq6fx@r.postjobfree.com

Professional Summary

Big Data Engineer with 8+ years of experience in Big Data technologies. Expertise in Hadoop and Spark on cloud services, automation tools, and the software design process. Outstanding communicator, dedicated to keeping IT skills up to date.

Solid knowledge of the installation, configuration, management, and deployment of Big Data solutions and the underlying infrastructure of Hadoop clusters and cloud services.

Hands-on experience with the MapReduce/HDFS framework, including Hive, Spark, Pig, Oozie, and Sqoop.

In-depth understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce concepts.

Experience with YARN architecture and its components, such as NodeManager, ApplicationMaster, and ResourceManager.

Set up standards and processes for Hadoop-based application design and implementation.

Experience in analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Python; extended Hive and Pig core functionality with custom UDFs.

Experience in designing, developing, and implementing connectivity products that allow efficient exchange of data between the core database engine and the Hadoop ecosystem.

Experience in importing and exporting data between HDFS and relational database systems using Sqoop.

Responsible for building scalable distributed data solutions using Hadoop.

Experienced in writing SQL, stored procedures/functions, triggers, and packages on RDBMS platforms.

Strong written, oral, interpersonal, and presentation communication skills.

Ability to perform at a high level, meet deadlines, and adapt to ever-changing priorities.

Technical Skills

IDE:

Jupyter Notebooks (formerly IPython Notebooks), Eclipse, IntelliJ, PyCharm

PROJECT METHODS:

Agile, Kanban, Scrum, DevOps, Continuous Integration, Test-Driven Development

HADOOP DISTRIBUTIONS:

Hadoop, Cloudera Hadoop, Hortonworks Hadoop

CLOUD PLATFORMS:

Amazon AWS - EC2, SQS, S3, MapR, Elastic Cloud

CLOUD SERVICES:

Databricks

CLOUD DATABASE & TOOLS:

Redshift, DynamoDB, Cassandra, Apache HBase, SQL

PROGRAMMING LANGUAGES:

Spark, Spark Streaming, Java, Python, Scala, PySpark, PyTorch

SCRIPTING:

Hive, MapReduce, SQL, Spark SQL, Shell Scripting

CONTINUOUS INTEGRATION (CI-CD):

Jenkins

VERSIONING:

Git, GitHub

PROGRAMMING METHODOLOGIES:

Object-Oriented Programming, Functional Programming

FILE FORMAT AND COMPRESSION:

CSV, JSON, Avro, Parquet, ORC

FILE SYSTEMS:

HDFS

ETL TOOLS:

Apache NiFi, Flume, Kafka, Talend, Pentaho, Sqoop

DATA VISUALIZATION TOOLS:

Tableau, Kibana

SEARCH TOOLS:

Elasticsearch

SECURITY:

Kerberos, Ranger

AWS:

AWS Lambda, AWS S3, AWS RDS, AWS EMR, AWS Redshift, AWS Kinesis, AWS ELK, AWS CloudFormation, AWS IAM

Data Query:

Spark SQL, DataFrames

Professional Project Experience

Jun 2021 - Current

BIG DATA ANALYST/ DEVELOPER

United Health Group – Dallas, TX

Developed a solution to manage payments between the Health Resources and Services Administration (HRSA) and providers for COVID-related procedures on uninsured patients.

Wrangled data for COVID-related procedures on uninsured patients using Spark SQL queries and Scala code.

Optimized long-running queries with broadcast joins.
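
A minimal PySpark sketch of the broadcast-join optimization; the table paths and the provider_id join key are illustrative placeholders, not the project's actual schema.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

    claims = spark.read.parquet("/data/claims")        # large fact table (hypothetical path)
    providers = spark.read.parquet("/data/providers")  # small dimension table (hypothetical path)

    # Broadcasting the small table ships it to every executor and avoids
    # shuffling the large table, which is what speeds up long-running joins.
    joined = claims.join(broadcast(providers), on="provider_id", how="left")
    joined.show(5)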

Partitioned data into monthly directories for efficient querying.
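
A short sketch of the monthly directory layout described above, assuming the payments data carries year and month columns; paths and column names are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("monthly-partitioning-sketch").getOrCreate()
    payments = spark.read.parquet("/data/payments")  # hypothetical source path

    # partitionBy lays files out as .../year=2021/month=06/..., so queries
    # bounded by month only scan the matching directories.
    (payments.write
        .mode("overwrite")
        .partitionBy("year", "month")
        .parquet("/data/payments_by_month"))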

Used DataFrames instead of Hive tables when more efficient.

Categorized and split each payment amount between testing, treatment, and vaccine procedures.

Implemented Spark jobs in Scala, utilizing the DataFrame and Spark SQL APIs for faster data processing.

Joined data between 835 tables and FED file to find splits.

Performed data cleaning and accuracy evaluation using Spark SQL queries after wrangling data.

Pinpointed malformed data by inspecting and querying Hive tables.

Debugged payment issues when amounts did not match between Optum Bank and the FED file.

Updated process architecture to handle recycled audit numbers.

Managed refunds going from providers back to HRSA.

Used Python code to upload the refund report to a Hive table.
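
A hedged sketch of loading a refund report into a partitioned Hive table with PySpark; the file path, database and table names, and partition column are illustrative only.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("refund-report-load")
             .enableHiveSupport()
             .getOrCreate())

    refunds = (spark.read
               .option("header", "true")
               .option("inferSchema", "true")
               .csv("/landing/refund_report.csv"))  # hypothetical landing path

    # Write the report into a Hive table, partitioned so each report month
    # lands in its own partition.
    (refunds.write
        .mode("overwrite")
        .partitionBy("report_month")
        .saveAsTable("finance.refund_report"))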

Queried the Hive table and 835 tables to find the splits between testing, treatment, and vaccine.

Generated Refund checks report matching the attributes and format of the FED file.

Placed the refund report in a Hive table partition.

Processed Refund Checks file and FED file in a single process for HRSA.

Updated process architecture to run on a weekly basis instead of monthly.

Automated, configured, and deployed instances on AWS, Azure, and Google Cloud Platform (GCP) environments.

Developed Data Pipeline with Kafka and Spark.
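
A minimal Structured Streaming sketch of a Kafka-to-storage pipeline of the kind described above, assuming the Spark Kafka connector package is on the classpath; the broker address, topic, and output paths are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-spark-pipeline").getOrCreate()

    # Read events from a Kafka topic as a streaming DataFrame.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "payment-events")
              .load())

    # Kafka delivers key/value as binary; cast to strings before writing.
    parsed = events.select(col("key").cast("string"), col("value").cast("string"))

    query = (parsed.writeStream
             .format("parquet")
             .option("path", "/data/streaming/payments")
             .option("checkpointLocation", "/checkpoints/payments")
             .start())
    query.awaitTermination()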

Jan 2019 – Jun 2021

DATA DEVELOPER

Asics - Oakdale, PA

Ingested data into AWS S3 data lake from various devices using AWS Kinesis.

Used Amazon EMR to process Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with Amazon Simple Storage Service (S3) and AWS Redshift.

Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3.

Created AWS Lambda functions using boto3 module in Python.
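
A hypothetical Lambda handler in the boto3 style described above: triggered by an S3 put event, it reads the new object and records an audit row in DynamoDB. The bucket layout, table name, and payload shape are assumptions for illustration.

    import json
    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        # S3 event notifications carry the bucket and key of the new object.
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        payload = json.loads(body)  # assumes the object is a JSON array

        table = boto3.resource("dynamodb").Table("ingest_audit")  # hypothetical table
        table.put_item(Item={"object_key": key, "record_count": len(payload)})
        return {"statusCode": 200}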

Migrated SQL database to Azure Data Lake.

Used AWS CloudFormation templates to create custom infrastructure for our pipelines.

Decoded raw data from JSON and streamed it using the Kafka producer API.
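
A brief sketch of the JSON decode-and-publish step using the kafka-python client; the broker address, topic, and input file are placeholders.

    import json
    from kafka import KafkaProducer  # kafka-python client, assumed available

    producer = KafkaProducer(
        bootstrap_servers="broker:9092",
        value_serializer=lambda record: json.dumps(record).encode("utf-8"),
    )

    # Decode each raw JSON line and publish it to the topic.
    with open("raw_events.json") as fh:
        for line in fh:
            record = json.loads(line)
            producer.send("device-events", record)

    producer.flush()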

Integrated Kafka with Spark Streaming for real-time data processing using DStreams.

Implemented AWS IAM user roles and policies to authenticate and control access.

Specified nodes and performed data analysis queries on Amazon Redshift clusters using AWS Athena.
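
A sketch of submitting an analysis query through the Athena API with boto3; the database, query, and results bucket are illustrative, not the original project's.

    import boto3

    athena = boto3.client("athena")

    response = athena.start_query_execution(
        QueryString="SELECT order_date, SUM(amount) FROM sales GROUP BY order_date",
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://query-results-bucket/athena/"},
    )
    # The query runs asynchronously; poll get_query_execution with this ID for status.
    print(response["QueryExecutionId"])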

Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift.

Created POCs on Microsoft Azure using Azure Blob Storage and Azure Databricks.

Created UDFs in Spark using Scala programs.

Assisted in designing, building, and maintaining a database to analyze the lifecycle of checking transactions.

Designed and developed ETL jobs to extract data from AWS S3 and load it into a data mart in Amazon Redshift.

Implemented and maintained EMR and Redshift pipeline for data warehousing.

Used Spark engine and Spark SQL for data analysis and provided intermediate results to data scientists for transactional analytics and predictive analytics.

Used Azure HDInsight to process big data across Hadoop clusters of virtual servers on Azure Data Lake.

Created POCs and developed Airflow DAGs using Python.
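
A minimal Airflow 2.x DAG sketch of the kind described above; the DAG id, schedule, and extract/load callables are illustrative placeholders.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull data from the source system")

    def load():
        print("load data into the warehouse")

    with DAG(
        dag_id="example_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task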

Jan 2018 - Jan 2019

DATA ENGINEER

Yellow Corporation – Overland Park, KS

Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift.

Wrote scripts to automate workflow execution and the loading and parsing of data into different database systems.

Deployed Hadoop to create data pipelines involving HDFS.

Managed multiple Hadoop clusters to support data warehousing.

Created and maintained Hadoop HDFS, Spark, HBase pipelines.

Supported the data science team with data wrangling and data modeling tasks.

Built continuous Spark streaming ETL pipeline with Kafka, Spark, Scala, HDFS and MongoDB.

Scheduled and executed workflows in Oozie to run Hive and Pig jobs.

Worked on importing unstructured data into HDFS using Spark Streaming and Kafka.

Worked on installing clusters, commissioning and decommissioning DataNodes, configuring slots, and ensuring NameNode availability with regard to capacity planning and management.

Implemented Spark using Spark SQL for optimized analysis and processing of data.

Configured Spark Streaming to process real time data via Kafka and store it to HDFS.

Loaded data from multiple servers to AWS S3 bucket and configured bucket permissions.

Used Scala for concurrency support, which is key to parallelizing the processing of large datasets.

Used Apache Kafka to combine live streaming with batch processing to generate reports.

Implemented partitioning, dynamic partitions, and buckets in Hive for performance optimization and organized data into logical groupings.
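
A hedged Spark SQL sketch of the Hive partitioning and bucketing pattern; the table, columns, bucket count, and staging source are assumptions for illustration.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Allow dynamic partition inserts.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS shipments_part (
            shipment_id STRING,
            weight_kg DOUBLE
        )
        PARTITIONED BY (ship_date STRING)
        CLUSTERED BY (shipment_id) INTO 16 BUCKETS
        STORED AS ORC
    """)

    # Dynamic-partition insert: the partition column comes last in the SELECT.
    spark.sql("""
        INSERT OVERWRITE TABLE shipments_part PARTITION (ship_date)
        SELECT shipment_id, weight_kg, ship_date FROM shipments_staging
    """)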

Maintained and analyzed Hadoop log files for error and access statistics used in fine-tuning.

Optimized data storage in Kafka Brokers within the Kafka cluster by partitioning Kafka Topics.
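
A small sketch of creating a topic with multiple partitions using the kafka-python admin client; the topic name, partition count, and replication factor are placeholders.

    from kafka.admin import KafkaAdminClient, NewTopic  # kafka-python, assumed available

    admin = KafkaAdminClient(bootstrap_servers="broker:9092")

    # More partitions let brokers and consumers share the load for a topic.
    admin.create_topics([
        NewTopic(name="shipment-events", num_partitions=12, replication_factor=3)
    ])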

Responsible for analysis, design, coding, unit and regression testing, and implementation of new requirements.

Produced Visio Diagrams for Process and Execution Flow, Design Documentation and Unit Test cases.

Built a generic tool to import data from flat files and various other database engines using Podium.

Created Sqoop jobs and commands to pull data from databases and manipulated the data using Hive queries.

Developed data ingestion code using Sqoop and performed the ETL and processing phases using Apache Hive and Spark (PySpark and Spark SQL scripting).

Performed orchestration using Apache Oozie workflows and coordinators to schedule and run packages containing Shell, Sqoop, Spark, Hive, and Email actions.

Built a generic tool for data validation and automated reporting.

Performed recurring deployments through Jenkins and maintained code in a Git repository.

Jan 2016 - Jan 2018

BIG DATA DEVELOPER

The Nielson Company - New York, NY

Ingested data from multiple sources into HDFS Data Lake.

Performed analysis, design, coding, unit and regression testing, and implementation of new requirements.

Created Visio Diagrams for Process and Execution Flow, Design Documentation and Unit Test cases.

Built a generic tool to import data from flat files and various other database engines.

Created Sqoop jobs and commands to pull data from databases and manipulated the data using Hive queries.

Programmed data ingestion code using Sqoop and performed the ETL and processing phases using Apache Hive and Spark (PySpark and Spark SQL scripting).

Performed orchestration using Apache Oozie jobs and coordinators to schedule and run workflows and packages containing Shell, Sqoop, Spark, Hive, and Email actions.

Built a generic tool for data validation and automated reporting.

Managed onsite and offshore resources by assigning tasks using Agile methodology and gathered updates through Scrum calls and sprint meetings.

Developed Spark scripts using Scala shell commands per requirements and used PySpark for proofs of concept.

Installed and configured Hadoop MapReduce, HDFS, and developed multiple MapReduce jobs in Python for data cleaning and preprocessing.
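
A hypothetical Hadoop Streaming mapper of the data-cleaning kind described above (run with hadoop-streaming.jar alongside a reducer); the delimiter and field positions are illustrative.

    #!/usr/bin/env python
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        # Drop malformed or empty records before they reach the reducer.
        if len(fields) < 3 or not fields[2].strip():
            continue
        customer_id = fields[0].strip()
        amount = fields[2].strip()
        print(f"{customer_id}\t{amount}")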

Loaded customer profile data, customer spending data, and credit data from legacy warehouses onto HDFS using Sqoop.

Built data pipeline using Pig and Hadoop commands to store onto HDFS.

Used Oozie to orchestrate the MapReduce jobs that extract data in a timely manner.

Migrated the ETL code base from Hive and Pig to PySpark.

Developed and deployed spark-submit commands with suitable executor, core, and driver settings on the cluster.
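
The same resource sizing expressed as SparkSession configuration rather than spark-submit flags; the instance, core, and memory numbers are placeholders, not the original tuning.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("tuned-job")
             .config("spark.executor.instances", "10")   # --num-executors
             .config("spark.executor.cores", "4")        # --executor-cores
             .config("spark.executor.memory", "8g")      # --executor-memory
             .config("spark.driver.memory", "4g")        # --driver-memory
             .getOrCreate())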

Applied transformations and filters to traffic data using Pig.

May 2013 - Jan 2016

HADOOP DEVELOPER

United Airlines - Chicago, IL

Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.

Deployed the application jar files into AWS instances.

Built data pipeline using MapReduce scripts and Hadoop commands to store onto HDFS.

Migrated applications from MapReduce to Spark to improve performance.

Benchmarked Cassandra against HBase for fast ingestion.

Processed terabytes of information in real time using Spark Streaming.

Loaded data from legacy warehouses onto HDFS using Sqoop.

Connected various data centers and transferred data between them using Sqoop and various ETL tools.

Configured Oozie workflow engine scheduler to run multiple Hive and Sqoop jobs.

Used Oozie to orchestrate the MapReduce jobs that extract data on schedule.

Utilized Hive and Python scripts for ETL processing.

Performed aggregation functions with SQL.

Produced graphs and plots using the Python library Matplotlib (pyplot) to visualize data.
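
A simple pyplot sketch of the kind of visualization described above; the data points are made-up placeholders.

    import matplotlib.pyplot as plt

    days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
    delayed_flights = [32, 41, 28, 55, 47]

    plt.figure(figsize=(6, 3))
    plt.bar(days, delayed_flights)
    plt.title("Delayed flights per day (sample data)")
    plt.xlabel("Day")
    plt.ylabel("Delayed flights")
    plt.tight_layout()
    plt.savefig("delayed_flights.png")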

Education

Bachelor's Degree: Computer Engineering - University of Texas at Dallas

Certifications

Microsoft Certified Database Administrator: MS SQL Server Database Administration


