Hadoop developer

Location:
Thousand Oaks, CA
Salary:
$83/hour
Posted:
January 06, 2021

Resume:

Ali Mir

Data Engineer

Contact

Phone

805-***-****

E-mail

adi7j9@r.postjobfree.com

Strengths

CORE TECHNOLOGIES:

Amazon AWS

Spark

Python

Scala

SQL

Big Data Engineer with 5 years of experience designing, implementing, and maintaining big data solutions to meet client needs using industry-recognized software.

Skills

APACHE - Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache MAVEN, Apache Oozie, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala, HDFS, Hortonworks, MapR, MapReduce

SCRIPTING - HiveQL, MapReduce, XML, FTP, Python, UNIX, Shell scripting, LINUX

OPERATING SYSTEMS - Unix/Linux, Windows 10, Ubuntu, macOS

FILE FORMATS - Parquet, Avro, JSON, ORC, Text, CSV

DISTRIBUTIONS - Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic (ELK)

DATA PROCESSING (COMPUTE) ENGINES - Apache Spark, Spark Streaming, Apache Flink, Apache Storm

DATA VISUALIZATION TOOLS - Pentaho, QlikView, Tableau, PowerBI, Matplotlib

DATABASE - Microsoft SQL Server (2005, 2008 R2, 2012), Database & Data Structures, Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB

SOFTWARE - Microsoft Project, Primavera P6, VMWare, Microsoft Word, Excel, Outlook, PowerPoint; Technical Documentation Skills

Work History

09/2019 - Current

Data Engineer

MGM Studio, Thousand Oaks, CA

Developed Spark jobs using the Spark SQL and DataFrames APIs to process structured data in Spark clusters.

Implemented Spark using Python and utilized Data Frames and Spark SQL API for faster processing of data.

Ingested image responses via a Kafka producer written in Python.

Defined a schema for a custom HBase table

Analyzed the performance of digital strategies to yield business recommendations.

Worked on AWS Kinesis for processing real-time data

Handled schema changes in the data stream using Spark.

Connected databases to Tableau.

Developed multiple SQL queries to join tables and create dashboards.

Started and configured master and slave nodes for Spark

Reviewed business requirement documents for completeness and analyzed actual and forecast budgeting requirements.

Created Kafka topics on the brokers for consumers to read from.

Created a pipeline to gather data using PySpark, Kafka, and Hive (sketched below).
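A minimal sketch of the kind of PySpark/Kafka/Hive pipeline described above, assuming a JSON payload: it reads a Kafka topic with Structured Streaming, parses records with the DataFrames API, and lands Parquet files that a Hive table can sit on top of. The broker address, topic name, schema fields, and paths are hypothetical placeholders, not details from the original work.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Broker, topic, schema, and paths below are hypothetical.
spark = (SparkSession.builder
         .appName("kafka-to-hive-pipeline")
         .enableHiveSupport()
         .getOrCreate())

schema = StructType([
    StructField("image_id", StringType()),
    StructField("status", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the raw Kafka stream and parse the JSON payload with the DataFrames API.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "image-responses")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

# Land each micro-batch as Parquet that an external Hive table can query.
query = (events.writeStream
         .format("parquet")
         .option("path", "/warehouse/image_events")
         .option("checkpointLocation", "/checkpoints/image_events")
         .trigger(processingTime="1 minute")
         .start())
query.awaitTermination()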

01/2018 - 09/2019

Data Engineer

Epic Systems, Verona, WI

Involved in creating Hive tables, loading them with data, and writing Hive queries that invoked and ran MapReduce jobs in the backend.

Optimized data storage in Hive using partitioning and bucketing mechanisms on both managed and external tables (see the sketch at the end of this section).

Consumed data from Kafka queue using Spark.

Worked on importing and exporting data between HDFS and RDBMS using Sqoop.

Collected, aggregated, and moved data from servers to HDFS using Apache Spark and Spark Streaming.

Administered the Hadoop cluster (HDP) and reviewed log files of all daemons.

Used Impala where possible to achieve faster results than Hive during data analysis.

Used Spark API over Hadoop YARN to perform analytics on data in Hive.

Used Spark SQL and DataFrames API to load structured and semi-structured data into Spark Clusters.

Migrated complex MapReduce programs into Apache Spark RDD operations.

Migrated ETL jobs to Pig scripts for transformations, joins, and aggregations before loading data into HDFS.

Used Apache Storm with Redis, Kafka, and Netty to build a real-time data analytics system and trigger real-time actions from statistical data.

Set up Apache Storm on AWS for an ETL pipeline with Pig Latin, using HBase and AS for storage.

Implemented data ingestion and cluster handling in real-time processing using Kafka.

Implemented workflows using Apache Airflow framework to automate tasks.

Performed both major and minor upgrades to the existing Cloudera Hadoop cluster.
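As a rough illustration of the Hive partitioning and dynamic-partition loading referenced above, the sketch below runs HiveQL through spark.sql with Hive support: an external staging table over delimited HDFS files, a managed ORC table, and a dynamic-partition insert. Database, table, and column names and paths are hypothetical; bucketing (CLUSTERED BY ... INTO N BUCKETS) would typically be added to the managed table's DDL and populated through Hive itself.

from pyspark.sql import SparkSession

# Illustrative only: database, table, and column names and HDFS paths are hypothetical.
spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS staging")
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

# External table over delimited files already landed in HDFS (e.g., via Sqoop).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.visits_raw (
        visit_id STRING,
        patient_id STRING,
        charge DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/raw/visits'
""")
spark.sql("MSCK REPAIR TABLE staging.visits_raw")  # register partitions already on disk

# Managed ORC table, partitioned by load date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.visits (
        visit_id STRING,
        patient_id STRING,
        charge DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
""")

# Dynamic-partition insert from the external staging table into the managed table.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT INTO analytics.visits PARTITION (load_date)
    SELECT visit_id, patient_id, charge, load_date
    FROM staging.visits_raw
""")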

10/2016 - 01/2018

Big Data Engineer

Merck, Inc., New York, NY

Integrated Kafka with Spark Streaming for real-time data processing.

Built a prototype for real-time analysis using Spark Streaming and Kafka, considering fields such as increase in positive cases, increase in test cases, and recovery rate.

Used shell scripts to migrate data between Hive, HDFS and MySQL.

Fully configured an HDFS cluster for faster and more efficient processing and analysis of data.

Utilized Zookeeper and Spark interface for monitoring the proper execution of the Spark Streaming job.

Created Hive tables and loaded retail transactional data from Teradata using Sqoop.

Loaded home mortgage data from the existing DWH tables (SQL Server) to HDFS using Sqoop.

Worked on POCs and the implementation and integration of Cloudera and Hortonworks for multiple clients.

Created Hive external tables and designed data models in Hive.

Tuned Spark Streaming performance, e.g., setting the right batch interval, choosing the correct level of parallelism, selecting appropriate serialization, and tuning memory.

Performed data ingestion using Flume with Kafka as the source and HDFS as the sink.

For one use case, used Spark Streaming with Kafka, HDFS, and MongoDB to build a continuous ETL pipeline for real-time analytics on the data.

Performed import and export of dataset transfer between traditional databases and HDFS using Sqoop.

Worked on disaster management for the Hadoop cluster.

Designed and presented a POC on introducing Impala in project architecture.

Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS (sketched below).
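A minimal sketch of the Kafka-to-HDFS Spark Streaming job described in the last bullet, using the Spark 1.x/2.x Python DStream API: it pulls records from a Kafka topic and persists each micro-batch to HDFS as text files. The broker, topic, batch interval, and output path are hypothetical, and the kafka DStream package (e.g., spark-streaming-kafka-0-8, removed in Spark 3) would need to be supplied at submit time.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Spark 1.x/2.x Python API

# Hypothetical broker, topic, and output path, for illustration only.
sc = SparkContext(appName="kafka-to-hdfs-streaming")
ssc = StreamingContext(sc, batchDuration=30)  # 30-second batch interval

# Direct stream from Kafka: (ssc, list of topics, kafka parameters).
stream = KafkaUtils.createDirectStream(
    ssc,
    ["case-counts"],
    {"metadata.broker.list": "broker1:9092"},
)

# Each record arrives as a (key, value) pair; keep the value payload and
# persist every micro-batch to HDFS as text files.
values = stream.map(lambda kv: kv[1])
values.saveAsTextFiles("/data/streaming/case_counts/batch")

ssc.start()
ssc.awaitTermination()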

04/2015 - 10/2016

Data Engineer

UPS, Atlanta, GA

Connected to REST APIs and ingested data using ingestion tools such as Kafka and Flume.

Worked on importing the received data into Hive using Spark Streaming and Kafka.

Utilized a variety of Hive queries to access the desired data for further analysis.

Implemented partitioning, dynamic partitioning, and bucketing in Hive, which improved performance and gave the data a proper, logical organization.

Installed and configured Hive and wrote Hive UDFs.

Designed and developed ETL jobs to extract data from a Salesforce replica and load it into a data mart in Amazon Redshift.

Configured Kafka producers with API endpoints using JDBC Autonomous REST Connectors (see the sketch at the end of this section).

Performed upgrades, patches, and bug fixes in a Hadoop cluster environment.

Worked on Spark SQL to check the data; wrote Spark applications for data validation, cleansing, transformation, and custom aggregation.

Loaded ingested data into Hive managed and external tables.

Learned many technologies on the job as required by each project. Good communication, interpersonal, and analytical skills; a highly motivated team player with the ability to work independently.

Created Hive queries to summarize and aggregate business data, comparing Hadoop data with historical metrics.
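To illustrate the REST-to-Kafka ingestion mentioned above, here is a hedged sketch of a Python producer that polls a REST endpoint and publishes each record to a Kafka topic for downstream Spark validation and loading. The endpoint URL, topic, and broker are placeholders, and the requests and kafka-python client libraries are assumptions; the resume does not name the specific clients used.

import json
import time

import requests                   # assumed HTTP client library
from kafka import KafkaProducer   # assumed client library (kafka-python)

# Hypothetical endpoint, topic, and broker address, for illustration only.
API_URL = "https://api.example.com/v1/shipments"
TOPIC = "shipment-events"

producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Poll the REST endpoint and publish each record to the Kafka topic,
    # where a Spark Streaming job can pick it up for validation and loading.
    response = requests.get(API_URL, timeout=10)
    response.raise_for_status()
    for record in response.json():
        producer.send(TOPIC, value=record)
    producer.flush()
    time.sleep(60)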

Education

Master of Science: Biomedical Engineering

The University of Texas At Dallas - Richardson, TX

Certifications

IBM: Big Data 101

IBM: Spark Fundamentals

IBM: Hadoop 101

AWS: Machine Learning Specialty Certification
