Hadoop Developer Data

Location:
Queens, NY
Posted:
June 21, 2021

Rowan Ahmed

adm9d2@r.postjobfree.com

929-***-****

GC/EAD

New York, NY

PROFESSIONAL SUMMARY

Over four and a half years of IT experience as a developer, designer, and quality tester, with cross-platform integration experience across the Hadoop ecosystem.

Configured Spark Streaming to receive real-time data from Kafka and store the streamed data to HDFS using Scala and Python.

Experienced in importing and exporting data through stream processing with Flume, Kafka, and Python.

Wrote Hive UDFs as required and executed complex HQL queries to extract data from Hive tables.

Used partitioning and bucketing in Hive and designed both managed and external tables for performance optimization.
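
For illustration, a minimal sketch of the two table styles, issued here through PySpark's SQL interface; the table, column, and path names are hypothetical, and the bucketed table uses Spark's datasource DDL rather than the equivalent HiveQL CLUSTERED BY ... STORED AS form:

```python
# Minimal sketch of an external, partitioned Hive table and a bucketed managed
# table, created through Spark SQL. Table, column, and path names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-table-design")
         .enableHiveSupport()
         .getOrCreate())

# External table: Hive manages only the metadata; the data stays at the given
# HDFS location and survives a DROP TABLE.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (
        order_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/sales_raw'
""")

# Managed table, partitioned by date and bucketed by customer_id so that joins
# and sampling on that key touch fewer files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_curated (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE,
        order_date  STRING
    )
    USING PARQUET
    PARTITIONED BY (order_date)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
""")
```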

Converted Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.

Experienced in manipulating data on the cluster with Python for data loading and extraction. Worked with Python libraries such as Matplotlib, NumPy, SciPy, and Pandas for data analysis.
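
A small, self-contained example of the kind of analysis these libraries support; the CSV file and column names are hypothetical:

```python
# Exploratory analysis with Pandas, NumPy, and Matplotlib on a hypothetical
# transactions file; the path and column names are placeholders.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("transactions.csv", parse_dates=["txn_date"])

# Basic cleaning: drop exact duplicates and fill missing amounts with the median.
df = df.drop_duplicates()
df["amount"] = df["amount"].fillna(df["amount"].median())

# Aggregate daily totals and summarize them.
daily = df.groupby(df["txn_date"].dt.date)["amount"].sum()
print(daily.describe())
print("log-scale mean:", np.log1p(daily).mean())

# Quick visual check of the daily trend.
daily.plot(title="Daily transaction volume")
plt.tight_layout()
plt.savefig("daily_volume.png")
```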

Automated recurring reports using SQL and Python and visualized them on a BI platform.

Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data.

Good understanding and knowledge of NoSQL databases such as MongoDB, HBase, and Cassandra.

Experienced in workflow scheduling and locking tools/services like Oozie and Zookeeper.

Practiced ETL methods in enterprise-wide solutions, data warehousing, reporting and data analysis.

Experienced in working with AWS, using EMR and EC2 for compute and S3 for storage, alongside Spark, Oozie, Zookeeper, Kafka, and Flume.

Developed Impala scripts for extraction, transformation, and loading of data into the data warehouse.

Good knowledge of using Apache NiFi to automate data movement between Hadoop systems.

Imported and exported data with Sqoop between HDFS and relational databases including Oracle, MySQL and c

Good knowledge of UNIX shell scripting for automating deployments and other routine tasks.

Experienced in using IDEs like Eclipse, NetBeans, IntelliJ.

Used JIRA and Rally for bug tracking and GitHub and SVN for various code reviews and unit testing.

Experienced in working in all phases of the SDLC under both Agile and Waterfall methodologies.

Good understanding of Agile Scrum methodology, Test-Driven Development, and CI/CD.

PROFESSIONAL EXPERIENCE

HADOOP DEVELOPER

American Express, New York January 2021 – Present

Responsibilities

Developed architecture documents, process documentation, server diagrams, and requisition documents.

Developed streaming pipelines and consumed real-time events from Kafka using the Kafka Streams API and Kafka clients.

Configured Spark Streaming to consume incoming messages from Kafka topics and store the streamed data into HDFS.
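
A minimal sketch of such a job using Spark Structured Streaming; the broker, topic, and HDFS paths are placeholders rather than project values, and the Kafka source assumes the spark-sql-kafka connector is available on the classpath:

```python
# Structured Streaming job that reads a Kafka topic and lands the raw events on
# HDFS as Parquet. Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (spark.readStream
          .format("kafka")                                   # needs spark-sql-kafka connector
          .option("kafka.bootstrap.servers", "broker-1:9092")
          .option("subscribe", "transactions")
          .option("startingOffsets", "latest")
          .load()
          .select(col("key").cast("string"),
                  col("value").cast("string"),
                  col("timestamp")))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/transactions")
         .option("checkpointLocation", "hdfs:///checkpoints/transactions")
         .outputMode("append")
         .start())

query.awaitTermination()
```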

Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.

Worked with Senior Engineer on configuring Kafka for streaming data.

Developed Spark jobs in Scala as per requirements.

Performed processing on large sets of structured, unstructured, and semi-structured data.

Created applications using Kafka that monitor consumer lag within Apache Kafka clusters.

Handled importing of data from various data sources using Sqoop, performed transformations using Spark and loaded data into DynamoDB.

Analyzed data by running Hive queries and used visualization tools to generate insights and analyze customer behavior.

Worked with the Spark ecosystem, using Scala and Spark queries on data formats such as text files and Parquet.

Used Hive UDFs to implement business logic in Hadoop.

Implemented business logic by writing and applying various UDFs in Python.
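
As an illustration, a hypothetical PySpark UDF of this kind; the masking rule and column names stand in for the actual business logic:

```python
# Illustrative PySpark UDF: the masking rule and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

@udf(returnType=StringType())
def mask_account(account_id):
    # Keep only the last four characters visible.
    if account_id is None:
        return None
    return "*" * max(len(account_id) - 4, 0) + account_id[-4:]

df = spark.createDataFrame(
    [("1234567890", 120.0), ("9876543210", 42.5)],
    ["account_id", "amount"],
)

df.withColumn("account_masked", mask_account("account_id")).show()
```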

Responsible for migrating from Hadoop to the Spark framework, using in-memory distributed computing for real-time fraud detection.

Used Spark to cache data in memory.

Implemented batch processing of data sources using Apache Spark.

Responsible for data cleaning, feature scaling, and feature engineering using NumPy and Pandas in Python.
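
A brief sketch of these cleaning and feature-scaling steps on a hypothetical transactions dataset; the file and column names are assumptions:

```python
# Cleaning, feature engineering, and scaling with Pandas and NumPy on a
# hypothetical transactions dataset.
import numpy as np
import pandas as pd

df = pd.read_csv("transactions.csv")

# Cleaning: drop rows with no amount and clip obvious outliers at the 99th percentile.
df = df.dropna(subset=["amount"])
upper = df["amount"].quantile(0.99)
df["amount"] = df["amount"].clip(upper=upper)

# Feature engineering: log-transform the skewed amount and derive an hour-of-day feature.
df["log_amount"] = np.log1p(df["amount"])
df["txn_hour"] = pd.to_datetime(df["txn_time"]).dt.hour

# Feature scaling: simple z-score standardization.
for feature in ["log_amount", "txn_hour"]:
    df[feature] = (df[feature] - df[feature].mean()) / df[feature].std()
```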

Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing it with HiveQL.

Involved in creating Hive tables, loading data, and writing Hive queries that run internally as MapReduce jobs.

Developed predictive analytics using Apache Spark Scala APIs.

Provided cluster coordination services through ZooKeeper.

Used Apache Kafka for collecting, aggregating, and moving large amounts of data from application servers.

Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.

As part of a POC, set up Amazon Web Services (AWS) to evaluate whether Hadoop was a feasible solution.

Used Docker as part of CI/CD to build and deploy applications using ECS in AWS.

Installed Oozie workflow engine to run multiple Hive and Pig jobs.

Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.

Environment: Hadoop, MapReduce, HDFS, Hive, Java 6, Oozie, Linux, XML, Eclipse, Oracle 10g, PL/SQL, YARN, Spark, Pig, Sqoop, DB2, UNIX, Catalog.

HADOOP DEVELOPER

CBS, New York City, New York November 2019 - December 2020

Responsibilities

Worked extensively with Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming.

Involved in importing data from various sources into HDFS using Sqoop, applying transformations with Hive and Apache Spark, and then loading the data into Hive tables or AWS S3 buckets.

Involved in moving data from various DB2 tables to AWS S3 buckets using Sqoop process.

Configured Splunk alerts to capture log files during execution and stored them to an S3 bucket location while the cluster was running.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
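
For example, a sketch of rewriting a Hive-style aggregate as DataFrame transformations in PySpark, assuming a Hive table named sales with hypothetical columns:

```python
# Rewriting a Hive/SQL aggregate as DataFrame transformations; the table and
# column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark")
         .enableHiveSupport()
         .getOrCreate())

# Original HiveQL-style query...
sql_result = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM sales
    WHERE order_date >= '2020-01-01'
    GROUP BY customer_id
""")

# ...and the equivalent DataFrame transformation chain.
df_result = (spark.table("sales")
             .filter(F.col("order_date") >= "2020-01-01")
             .groupBy("customer_id")
             .agg(F.sum("amount").alias("total_spend")))
```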

Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to DynamoDB using Scala.

Experienced in creating EMR clusters and deploying code to the cluster from S3 buckets.

Experienced in using NoMachine and PuTTY to SSH into the EMR cluster and run spark-submit.

Developed Apache Spark applications using Scala and Python, and implemented a Spark data processing module to handle data from various RDBMS and streaming sources.

Experienced in developing and scheduling various Spark Streaming and batch jobs using PySpark and Scala.

Developed Spark code applying various transformations and actions for faster data processing.

Achieved high-throughput, scalable, fault-tolerant stream processing of live data streams using Apache Spark Streaming.

Used Spark stream processing with Scala to bring data into memory, created RDDs and DataFrames, and applied transformations and actions.

Involved in using various Python libraries with Spark to create DataFrames and store them in Hive.
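
A minimal sketch of this pattern, building a Spark DataFrame from a Pandas result and persisting it as a Hive table; the data and table name are hypothetical:

```python
# Build a Spark DataFrame from a Pandas DataFrame and save it as a Hive table.
# The data and table name are hypothetical.
import pandas as pd
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pandas-to-hive")
         .enableHiveSupport()
         .getOrCreate())

pdf = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "total_spend": [120.0, 75.5, 310.0],
})

sdf = spark.createDataFrame(pdf)
sdf.write.mode("overwrite").saveAsTable("customer_spend")
```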

Created Sqoop jobs and Hive queries for ingesting data from relational databases to analyze historical data.

Experienced in working with Elastic MapReduce (EMR) and setting up environments on AWS EC2 instances.

Knowledge of handling Hive queries using Spark SQL, which integrates with the Spark environment.

Executed Hadoop/Spark jobs on AWS EMR using programs stored in S3 buckets.

Knowledge of creating user-defined functions (UDFs) in Hive.

Worked with different file formats such as c, Avro, and Parquet for Hive querying and processing based on business logic.

Involved in pulling data from AWS S3 buckets into the data lake, built Hive tables on top of it, and created DataFrames in Spark for further analysis.

Worked on SequenceFiles, RC files, map-side joins, bucketing, and partitioning for Hive performance and storage improvements.

Implemented Hive UDFs for business logic and was responsible for performing extensive data validation using Hive.

Involved in loading structured and semi-structured data into Spark clusters using Spark SQL and the DataFrames API.

Involved in developing code that generated various DataFrames based on business requirements and created temporary tables in Hive.

Experienced in writing build scripts using SBT and performing continuous integration with tools like Bamboo.

Used JIRA for creating user stories and created branches in Bitbucket repositories based on each story.

Involved in story-driven agile development methodology and actively participated in daily scrum meetings.

Used Bitbucket as the code repository and integrated it with Bamboo for continuous integration.

Involved in Test-Driven Development, writing unit and integration test cases for the code.

Environment: Hadoop, Cloudera Hadoop, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Java, JSON, Spark, HDFS, YARN, Oozie Scheduler, Zookeeper, Mahout, Linux, UNIX, ETL, MySQL.

HADOOP DEVELOPER (Internship)

GOLDMAN SACHS, New York, NY February 2017 - September 2019

Responsibilities

Developed streaming pipelines and consumed real-time events from Kafka using the Kafka Streams API and Kafka clients.

Configured Spark Streaming to consume incoming messages from Kafka topics and store the streamed data into HDFS.

Responsible for building scalable distributed data solutions on Cloudera distributed Hadoop.

Involved in using Spark Streaming and Spark jobs for ongoing customer transactions and Spark SQL to handle structured data in Hive.

Involved in migrating tables from RDBMS into Hive tables using Sqoop and later generating visualizations using Tableau.

Wrote Spark code to calculate aggregate statistics such as mean, covariance, and standard deviation.
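
Shown here in PySpark rather than Scala for brevity, a short sketch of computing such aggregates; the columns and values are hypothetical:

```python
# Aggregate statistics (mean, standard deviation, covariance) over a small
# hypothetical DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aggregate-stats").getOrCreate()

df = spark.createDataFrame(
    [(100.0, 2.0), (150.0, 3.0), (90.0, 1.5), (210.0, 4.0)],
    ["amount", "quantity"],
)

stats = df.agg(
    F.mean("amount").alias("mean_amount"),
    F.stddev("amount").alias("stddev_amount"),
    F.covar_samp("amount", "quantity").alias("covar_amount_quantity"),
)
stats.show()
```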

Responsible for writing Hive queries for analyzing data in the Hive warehouse using Hive Query Language (HQL).

Wrote UDFs in Scala and used them for sampling large data sets.

Used a variety of data formats while loading data into HDFS.

Worked in AWS environment for development and deployment of custom Hadoop applications.

Worked on NiFi workflow development for data ingestion from multiple sources. Involved in architecture and design discussions with the technical team and interfaced with other teams to create efficient and consistent solutions.

Involved in creating shell scripts to simplify the execution of other scripts (Pig, Hive, Sqoop, Impala, and MapReduce) and to move data into and out of HDFS.

Created files and tuned SQL queries in Hive using Hue.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.

Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.

Expert in implementing Spark using Scala and Spark SQL for faster testing and processing of data; responsible for managing data from different sources.

Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data.

Worked on bringing data into HBase using the HBase shell and the HBase client API.

Used Kafka to rebuild a customer activity tracking pipeline as a set of real-time publish-subscribe feeds.

Developed workflows in Oozie to automate jobs.

Provided design recommendations and thought leadership to sponsors/stakeholders that improved review process and resolved technical problems.

Developed complete end-to-end big data processing in the Hadoop ecosystem.

EDUCATION

LaGuardia Community College (CITY UNIVERSITY OF NEW YORK)

ASSOCIATE'S IN ACCOUNTING January 2014 – January 2017

New York, USA


