
Dongyan Dai (Ph.D)

Big Data/ Cloud/ Hadoop Developer

Email: adu32o@r.postjobfree.com Phone: 929-***-****

PROFESSIONAL PROFILE

10+ years of experience working with Big Data, Hadoop, and cloud technologies.

Integrate Kafka with Spark for fast data processing.

Extensive Knowledge in the development, analysis, and design of ETL methodologies in all the phases of data warehousing life cycle.

Excellent understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce.

Expertise in Python and Scala, including writing user-defined functions (UDFs) for Hive in Python.

Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.

Skilled in preparing test cases, documenting, and performing unit and integration testing.

Work experience with cloud infrastructure such as Amazon Web Services.

Experience developing Shell Scripts, Oozie Scripts and Python Scripts.

Expert in writing complex SQL queries for databases such as MySQL and MS SQL Server.

Hands-on experience on major components in Hadoop eco-systems like Spark, HDFS, HIVE, HBase, Zookeeper, Sqoop, Oozie, Flume, Kafka.

Experience importing and exporting data between Hadoop and RDBMSs using Sqoop and SFTP.

Experience using Kafka as a messaging system to implement real-time streaming solutions with Spark Streaming.

Experience ingesting data from different sources utilizing Apache Flume and Kafka.

Work with NoSQL databases such as Cassandra and HBase.

Configure AWS IAM roles and Security Groups for public and private subnets in a VPC.

Design and build virtual data centers in the AWS cloud to support enterprise data workloads.

Well versed in Hive queries, with an in-depth understanding of joins.

Create end-to-end Hive queries to parse raw data.

Create custom aggregate functions and UDFs in Spark SQL.

Design Spark Streaming and Kafka producer interfaces.

Use Apache Flume and Kafka for collecting, aggregating, and moving data from different sources.

Extract data from HDFS, DynamoDB, Cassandra, HBase, Hive, MS SQL, and others.

Extend Hive core functionality using custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregate Functions (UDAFs).

Good knowledge of the Spark framework for both batch and real-time data processing.

Deep knowledge of incremental imports and of the partitioning and bucketing concepts in Hive and Spark SQL needed for optimization; a short PySpark sketch follows below.
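
A minimal PySpark sketch of the partitioning and bucketing mentioned above, under assumed names: the "events" dataset, the event_date partition column, and the user_id bucketing column are illustrative placeholders, not taken from a specific project.

# Partitioned, bucketed write to a Hive-compatible table (illustrative names).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-bucket-sketch")
         .enableHiveSupport()
         .getOrCreate())

df = spark.read.parquet("/data/raw/events")   # hypothetical source path

(df.write
   .mode("overwrite")
   .partitionBy("event_date")   # partition pruning for date filters
   .bucketBy(16, "user_id")     # co-locate rows by user_id for joins
   .sortBy("user_id")
   .saveAsTable("analytics.events_bucketed"))

# Queries filtering on the partition column read only the matching partitions.
spark.sql("SELECT COUNT(*) FROM analytics.events_bucketed "
          "WHERE event_date = '2023-01-01'").show()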

TECHNICAL SKILLS

Programming Languages: Python, Scala, Shell Scripting (Unix).

Databases: MS SQL Server, Oracle, DB2, MySQL, PostgreSQL, Cassandra, MongoDB.

Scripting: HiveQL, SQL, MapReduce, Python, PySpark, Shell.

Distributions: Cloudera, MapR, Databricks, AWS, MS Azure, GCP.

Big Data Primary Skills: Hive, Kafka, Hadoop, HDFS, Spark, Cloudera, Azure Databricks, HBase, Cassandra, MongoDB, Zookeeper, Sqoop, Tableau, Kibana, MS Power BI, QuickSight, Hive Bucketing and Partitioning, Spark Performance Tuning and Optimization, Spark Streaming, HiveQL.

Apache Components: Cassandra, Hadoop, YARN, HBase, HCatalog, Hive, Kafka, NiFi, Airflow, Oozie, Spark, Zookeeper, Cloudera Impala, HDFS, MapReduce.

Data Processing: Apache Spark, Spark Streaming, Storm, Pig, MapReduce.

Operating Systems: UNIX/Linux, Windows.

Cloud Services: AWS S3, EMR, Lambda Functions, Step Functions, Glue, Athena, Redshift Spectrum, QuickSight, DynamoDB, Redshift, CloudFormation, MS Azure Data Factory (ADF), Azure Databricks, Azure Data Lake, Azure SQL, Azure HDInsight, GCP, Cloudera, Anaconda Cloud, Elastic, Google Dataproc, Dataprep, Dataflow, BigQuery.

Testing Tools: PyTest.

PROFESSIONAL EXPERIENCE

Big Data Engineer Client: Dataminr Inc., New York since January 2021

(Dataminr’s real-time AI platform detects the earliest signals of high-impact events and emerging risks from within publicly available data)

Designed a cost-effective archival platform for storing big data using Hadoop and its related technologies.

Used Spark with Kafka for real-time streaming of data.

Imported unstructured data into HDFS using Spark Streaming and Kafka (a streaming sketch appears at the end of this section).

Wrote scripts using Spark SQL to check, validate, cleanse, transform, and aggregate data.

Migrated MapReduce jobs to Spark using Spark SQL and Data Frames API to load structured data into Spark clusters.

Used Spark SQL to create and populate the HBase/MongoDB warehouse.

Implemented Kafka messaging consumer.

Utilized Avro tools to build Avro schemas and create external Hive tables using PySpark.

Ingested data into Snowflake as the data warehouse for batch analytics.

Utilized Spark and Spark SQL for faster testing and processing.

Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.

Managed and reviewed Hadoop log files.

Created a NoSQL MongoDB database to store the processed data from PySpark into different collections.

Applied Google Dataproc to streamline data processing between clusters and Google Cloud Storage.

Applied Zookeeper for cluster coordination services.

Downloaded data into HDFS using Sqoop and Hive.

Defined data security standards and procedures in Hadoop using Apache Ranger and Kerberos.

Executed Hadoop/Spark jobs on Google Dataflow with data stored in Cloud Storage buckets and used BigQuery for analytics.

Performed upgrades, patches, and bug fixes in HDP in a cluster environment.

Used Jenkins for CI/CD and Git for version control.
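
A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS ingestion referenced above; the broker address, topic name, and HDFS paths are placeholders, and the spark-sql-kafka connector package is assumed to be available on the cluster.

# Stream Kafka messages into HDFS as Parquet (placeholder broker/topic/paths).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "events")                      # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers raw bytes; cast the value column to string before storing.
messages = raw.select(col("value").cast("string").alias("payload"))

query = (messages.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streaming/events")            # placeholder path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .start())

query.awaitTermination()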

Big Data Engineer Client: Nuralogix, Toronto, Canada (Remote) Feb 2018 – Jan 2021

Analyzed large amounts of data sets to determine optimal way to aggregate and report on them.

Designed and developed web app BI for performance analytics.

Designed Python-based notebooks for automated weekly, monthly, and quarterly ETL reporting.

Designed the backend database and AWS cloud infrastructure for maintaining company proprietary data.

Built PySpark applications on an AWS EMR cluster and Glue jobs to transform data for business use cases and write it back to S3 buckets, where it could be queried using AWS Athena.

Installed and configured Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.

Designed, deployed, and tested an AWS Lambda workflow in Python with S3 triggers.

Installed, configured, and monitored Apache Airflow cluster.

Used Sqoop to import data from Oracle to Hadoop.

Worked on POCs for ETL with S3, EMR (Spark), and Snowflake.

Used the Oozie workflow engine to manage interdependent Hadoop jobs and automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.

Produced scripts for doing transformations using Scala.

Developed and implemented Hive custom UDFs involving date functions.

Developed Java Web Crawler to scrape market data for Internal products.

Wrote Shell scripts to orchestrate execution of other scripts and move the data files within and outside of HDFS.

Wrote simple SQL scripts on the final database to prepare data for visualization with Tableau.

Developed DAG-based data pipelines for onboarding and change management of datasets.

Migrated various Hive UDFs and queries into Spark SQL for faster requests.

Created a benchmark between Hive and HBase for fast ingestion.

Configured Spark Streaming to receive real-time data from the Apache Kafka and store the streamed data to HDFS using Scala.

Gained hands-on experience creating RDDs with Spark and Spark Streaming.

Scheduled jobs using Control-M.

Used Kafka for publish-subscribe messaging as a distributed commit log.

Created Airflow scheduling scripts in Python to automate data pipelines and data transfers (a minimal DAG sketch appears at the end of this section).

Orchestrated Airflow workflows in a hybrid environment spanning on-premises servers and the cloud.

Wrote Shell FTP scripts for migrating data to Amazon S3.

Configured AWS Lambda to trigger parallel scheduled (cron) jobs for scraping and transforming data.

Used Cloudera Manager for installation and management of a multi-node Hadoop cluster.

Programmed Flume and HiveQL scripts to extract, transform, and load the data into the database.

Implemented fully managed Kafka streaming on AWS to send data streams from the company APIs to a Spark cluster in Databricks on AWS.
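
A minimal Airflow DAG sketch in the spirit of the scheduling scripts mentioned above; the DAG id, schedule, and task callables are hypothetical placeholders (Airflow 2.x import paths), not production code.

# Weekly extract-and-transfer pipeline expressed as an Airflow DAG (illustrative).
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_data(**context):
    # Placeholder: pull data from a source API or database.
    print("extracting data")

def transfer_to_s3(**context):
    # Placeholder: push the extracted files to S3 (e.g. with boto3).
    print("transferring data")

default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="weekly_reporting_pipeline",   # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",
    default_args=default_args,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_data)
    transfer = PythonOperator(task_id="transfer", python_callable=transfer_to_s3)
    extract >> transfer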

Big Data Cloud Developer Client: Dataiku, New York Aug 2016 – Feb 2018

(Dataiku DSS is the collaborative data science software platform for teams of data scientists, data analysts, and engineers to explore, prototype, build, and deliver their own data products more efficiently)

Assembled large, complex data sets that met functional and non-functional business requirements.

Created CloudFormation Template to set up and configure the EMR cluster through Jenkins.

Set up a data pipeline using a Docker image with the AWS CLI and Maven for deployment on AWS EMR.

Created the basic infrastructure of the data pipeline using AWS CloudFormation.

Created Hive tables, loaded with data, and wrote Hive queries.

Developed and maintained standards for administration and operations, including scheduling, running, monitoring, logging, error management, failure recovery, and output validation.

Wrote Hive Queries to analyze data in Hive warehouse using Hive Query Language.

Wrote shell scripts for automating the process of data loading.

Completed Hive partitioning and joins on Hive tables utilizing Hive SerDes.

Wrote Bash script to be used during cluster launch to set up HDFS and copied directories to EMR.

Moved relational database data with Spark, transforming it and loading it into Hive dynamic-partition tables through staging tables (see the sketch at the end of this section).

Developed Hive queries to count user subscription data.

Created YAML files for GitLab Runner for scheduled daily, weekly, and monthly reports.

Developed Python/PySpark code on a Databricks cluster to run ML models from the Data Science team and optimized scripts for better performance.

Ingested thousands of different files into

Applied continuous monitoring and management of Elastic MapReduce (EMR) cluster through AWS console.

Changed Hive table schemas using Avro schemas to avoid errors when the Data Engineering team changed them on their side; older versions had used hardcoded queries.

Added EMR cluster steps in JSON format to execute cluster-preparation tasks during launch.

Wrote Hive queries that generated daily, weekly, and monthly reports.

Introduced the use of AWS Glacier to store historical data and reduce the cost of AWS S3.

Used AWS S3 to populate Hive tables, store finished reports, and provide business users access to another S3 bucket.

Added steps to the EMR cluster that deliver reports via email as per client’s requests.

Participated in the migration from GitLab to Jenkins and GitHub as continuous integration/continuous delivery tools.
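
A minimal PySpark sketch of the staging-table to Hive dynamic-partition load described above; the database, table, and column names are assumed for illustration only.

# Load a staging table into a Hive dynamic-partition table with Spark SQL.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-dynamic-partition-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Enable Hive-style dynamic partitioning for the insert below.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

# The partition column (load_date) must be the last column in the SELECT list.
spark.sql("""
    INSERT INTO TABLE warehouse.subscriptions PARTITION (load_date)
    SELECT user_id, plan, amount, load_date
    FROM staging.subscriptions_raw
""")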

Big Data Developer Client: Charles Schwab Corp., San Francisco, CA Mar 2014 – Aug 2016

(The Charles Schwab Corp. is a savings and loan holding company, which engages in the provision of wealth management, securities brokerage, banking, asset management, custody, and financial advisory services)

Developed, tested, and updated end-to-end ETL pipelines using AWS instances to process terabytes of data and store it in HDFS and Amazon S3.

Created, updated, and fixed Spark applications using Scala as the main language and Spark ecosystem tools such as DataFrames with Spark SQL.

Created and managed data pipelines using Python as the main language and Airflow as the workflow scheduler.

Created queries in Hive HQL, Spark SQL, Presto, and Microsoft SQL Server using Qubole as the data warehouse platform.

Implemented CI/CD for Spark and Scala applications using Spinnaker and Jenkins as deployment tools and Git/GitHub to host and track code.

Worked closely with the data science team to provide data for further analysis.

Used Spark APIs to perform the necessary transformations and actions on real-time data and visualized the results in Tableau and Datadog for performance monitoring.

Used Spark SQL and Hive Query Language (HQL) to derive customer insights for critical decision making by business users (a query sketch appears at the end of this section).

Performed streaming data ingestion into the distributed Spark environment using Kafka.
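
A minimal PySpark sketch of the kind of Spark SQL/HQL insight query referenced above, shown in both DataFrame and SQL form; the table and column names (trades, account_id, notional) are hypothetical.

# Per-account trading rollup, written two equivalent ways (illustrative names).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("customer-insights-sketch")
         .enableHiveSupport()
         .getOrCreate())

trades = spark.table("warehouse.trades")   # hypothetical Hive table

# DataFrame form of the aggregate.
summary_df = (trades
              .groupBy("account_id")
              .agg(F.count("*").alias("trade_count"),
                   F.sum("notional").alias("total_notional")))

# Equivalent Spark SQL form.
trades.createOrReplaceTempView("trades_v")
summary_sql = spark.sql("""
    SELECT account_id, COUNT(*) AS trade_count, SUM(notional) AS total_notional
    FROM trades_v
    GROUP BY account_id
""")

summary_df.write.mode("overwrite").saveAsTable("analytics.account_summary")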

Data Engineer Client: SurgeTrader, Naples, Florida Feb 2012 – Mar 2014

Created Kafka topics for brokers to listen on so that data transfer could function in a distributed system.

Created a consumer to listen to the topic and brokers on a designated port and set up direct streaming.

Created a DataFrame in Apache Spark by passing a schema as a parameter for the ingested data.

Developed Spark UDFs using Scala for better performance.

Created Lambda functions on AWS using Python.

Collected data using REST API.

Decoded raw data and loaded it into JSON before sending the batched streaming files over the Kafka producer (see the producer sketch at the end of this section).

Parsed the received JSON files and loaded them for further transformation.

Built data structures to access information from nested JSON files.

Utilized Python to produce Airflow DAGs.

Used Spark SQL Context to parse out required data, select the columns with target information, and assign column names.

Configured the packages and JARs for Spark jobs to load data into HBase.

Loaded data into HBase under the default namespace and assigned row keys and data types.

Applied Python scripts to parse out information from unstructured data.

Configured master and worker nodes for the Spark session.

Utilized Spark transformations and actions to interact with DataFrames to display and process the data.

Configured HBase and YARN settings to build connections between HBase and the task manager and to assign appropriate tasks to the HBase nodes.

Worked in a standalone virtual machine as the node and ran the pipeline to implement a distributed system.

Utilized Spark jobs, Spark SQL, and Data Frames API to load structured data into Spark clusters.

Created a Kafka broker that used the schema to fetch structured data in structured streaming.

Interacted with data residing in HDFS using Spark to process the data.
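
A minimal sketch of the collect-and-publish step described above (REST API to Kafka producer), using the kafka-python client; the endpoint URL, topic name, and broker address are placeholders, not actual project values.

# Pull records from a REST API and publish them to Kafka as JSON.
import json
import requests
from kafka import KafkaProducer   # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="broker:9092",                       # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

response = requests.get("https://api.example.com/v1/quotes", timeout=30)  # placeholder URL
response.raise_for_status()

for record in response.json():
    producer.send("quotes", value=record)                  # placeholder topic

producer.flush()   # block until buffered messages are delivered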

EDUCATION

M.S., Financial 2011

UCLA Anderson School of Management, Los Angeles, CA

Ph.D., Mathematics 2008

University of California, Santa Barbara, CA

M.S., Mathematics 2002

University of California, Santa Barbara, CA

B.S., Physics 2000

Jilin University, Changchun, China


