
Senior Big Data Engineer

Location:
Cincinnati, OH
Posted:
July 20, 2021


Resume:

Summary

*+ years of Hadoop/Big Data development and architecture experience.

Experience with Apache NiFi in the Hadoop ecosystem, including integrating Apache NiFi with Apache Kafka.

Strong knowledge of upgrading MapR, CDH, and HDP clusters.

Full-stack software engineer experienced in Hadoop and other big data platforms.

Proven experience demonstrating the technical and operational feasibility of Hadoop solutions.

Designed and built big data architectures for unique projects, ensuring development and delivery of the highest quality, on time and on budget.

Significant contribution to the development of big data roadmaps.

Created and maintained environment configuration documentation for all pre-production environments.

Able to drive architectural improvement and standardization of the environments.

Expertise in Storm for adding reliable real-time data processing capabilities to enterprise Hadoop.

Extended Hive and Pig core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregating Functions (UDAFs).

Good knowledge of the Spark framework for both batch and real-time data processing.

Provided clear and effective testing procedures and systems to ensure an efficient process.

Clearly documented big data systems, procedures, governance, and policies.

Participated in the design, development, and system migration of a high-performance, metadata-driven data pipeline with Kafka and Hive/Presto on Qubole, providing data export capability through API and UI.

Experienced in deploying Hadoop clusters using Puppet.

Experienced in collecting metrics for Hadoop clusters using Ambari & Cloudera Manager.

Hands-on experience writing Pig Latin scripts, working with the Grunt shell, and scheduling jobs with Oozie.

Good knowledge of cluster coordination services through ZooKeeper and Kafka.

Extensive experience importing and exporting data using Flume and Kafka.

Experienced in Cloudera and Hortonworks Hadoop distributions.

Experienced in Java, Scala and Python programming languages.

Experienced as a MySQL database specialist.

Skills

Cloud Technologies

Amazon AWS (EMR, EC2, SQS, S3, Redshift), Azure Cloud, Google Cloud

Programming

Python, Scala, Java, C++, Bash scripting

Database Technologies

Amazon Redshift, Amazon RDS

Data Modeling

Toad, Podium, Talend, Informatica

Administration

Ambari, Cloudera Manager, Nagios

Analytics & Visualization

Tableau, Power BI, Kibana, QlikView, Pentaho

Data Warehouse

Teradata, Hive, Amazon Redshift, BigQuery, Azure Data Warehouse

ETL/Data Pipelines

DataStage, Apache Sqoop, Apache Camel, Flume, Apache Kafka, Apatar, Atom, Talend, Pentaho

Project

Agile processes, problem solving, mentoring, requirements gathering

Communication

Very strong technical written and oral communication skills

Compute Engines

Apache Spark, Spark Streaming, Flink

File Systems/Formats

CSV, Parquet, Avro, ORC, JSON

Data Visualization

Pentaho, QlikView, Tableau, Informatica, Power BI

Data Query Engines

Impala, Tez, Spark SQL

Search Tools

Apache Lucene, Elasticsearch, Kibana, Apache Solr, Cloudera Search

Cluster

YARN, Puppet

Frameworks

Hive, Pig, Spark, Spark Streaming, Storm

Experience

SENIOR BIG DATA ENGINEER

Procter & Gamble – Cincinnati, OH

04/2019 – Present

Responsibilities:

Worked with Spark to create structured data from a pool of unstructured data received.

Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.

Implemented Spark using Scala and Spark SQL for faster testing and processing of data.

Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
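
A minimal sketch of this kind of conversion (table and column names here are illustrative, not from the project): the HiveQL aggregation in the comment is re-expressed first as a DataFrame transformation and then at the RDD level.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    object HiveToSparkSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-to-spark")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // Original HiveQL (illustrative):
        //   SELECT customer_id, SUM(amount) AS total_amount FROM sales GROUP BY customer_id
        // Equivalent DataFrame transformation:
        val totals = spark.table("sales")
          .groupBy($"customer_id")
          .agg(sum($"amount").alias("total_amount"))

        // The same aggregation at the RDD level, when lower-level control is needed
        // (assumes customer_id is a string and amount a double):
        val totalsRdd = spark.table("sales")
          .select($"customer_id", $"amount")
          .rdd
          .map(r => (r.getString(0), r.getDouble(1)))
          .reduceByKey(_ + _)

        totals.show()
        totalsRdd.take(10).foreach(println)
      }
    }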

Documented requirements, including the available code to be implemented using Spark, Hive, HDFS, and Elasticsearch.

Maintained the ELK stack (Elasticsearch, Logstash, Kibana) and wrote Spark scripts using the Scala shell.

Implemented Spark using Scala and utilized DataFrames and Spark SQL API for faster processing of data.

Developed Spark Streaming applications to consume data from Kafka topics and insert the processed streams to HBase.
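
As an illustration of the pattern (broker address, topic, HBase table, and column family names are placeholders, and keys are assumed non-null), a Structured Streaming job that consumes a Kafka topic and writes each micro-batch to HBase through the standard HBase client; a DStream-based job would follow the same shape with foreachRDD.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.sql.{DataFrame, Row, SparkSession}

    object KafkaToHBaseStream {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-to-hbase").getOrCreate()

        // Read the Kafka topic as a streaming DataFrame (broker and topic are placeholders).
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

        // Write each micro-batch into an HBase table via the HBase client.
        val query = events.writeStream
          .foreachBatch { (batch: DataFrame, batchId: Long) =>
            batch.rdd.foreachPartition { rows: Iterator[Row] =>
              val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
              val table = conn.getTable(TableName.valueOf("events"))
              rows.foreach { row =>
                val put = new Put(Bytes.toBytes(row.getAs[String]("key")))
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"),
                  Bytes.toBytes(row.getAs[String]("value")))
                table.put(put)
              }
              table.close()
              conn.close()
            }
          }
          .start()

        query.awaitTermination()
      }
    }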

Provided continuous streams of data with a high level of abstraction using Spark Streaming (DStreams) and Spark Structured Streaming.

Moved transformed data to the Spark cluster, from which it goes live to the application via Kafka.

Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
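
A minimal producer sketch (broker address, topic, and payloads are illustrative); in the real pipeline the records came from the external sources rather than a hard-coded list.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object ExternalSourceProducer {
      def main(args: Array[String]): Unit = {
        // Broker address and topic name are placeholders.
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        try {
          // Records would normally come from an external source (REST API, files, etc.).
          val records = Seq("id-1" -> """{"event":"signup"}""", "id-2" -> """{"event":"purchase"}""")
          records.foreach { case (k, v) =>
            producer.send(new ProducerRecord[String, String]("events", k, v))
          }
          producer.flush()
        } finally {
          producer.close()
        }
      }
    }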

Handled schema changes in data stream using Kafka.

Developed new Flume agents to extract data from Kafka.

Configured a Kafka source in Structured Streaming to read structured data by schema.

Analyzed and tuned Cassandra data model for multiple internal projects and worked with analysts to model Cassandra tables from business rules and enhance/optimize existing tables.

Designed and deployed new ELK clusters.

Created log monitors and generated visual representations of logs using ELK stack.

Implemented upgrade, backup, and restore of CI/CD tools.

Played a key role in installing and configuring various Big Data ecosystem tools such as Elasticsearch, Logstash, Kibana, Kafka, and Cassandra.

Reviewed functional and non-functional requirements on the Hortonworks Hadoop project and collaborated with stakeholders and various cross-functional teams.

Customized Kibana for dashboards and reporting to provide visualization of log data and streaming data.

Developed Spark applications for the entire batch processing by using Scala.

Developed Spark scripts by using Scala shell commands as per the requirement.

Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark using Scala.

Defined the Spark/Python (PySpark) ETL framework and best practices for development.

Installed and configured Tableau Desktop to connect to the Hortonworks Hive framework (database), which contained the bandwidth data from the locomotive, through the Hortonworks JDBC connector for further analytics of the data.

Managed versioning with Git and set up Jenkins CI to manage CI/CD practices.

Built Jenkins jobs for CI/CD infrastructure from GitHub repository.

Developed Spark programs using Python to run in the EMR clusters.

Created User Defined Functions (UDFs) using Scala to automate some of the business logic in the applications.
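
A small example of the approach (the business rule shown is hypothetical): a Scala UDF registered both for DataFrame use and for SQL queries.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    object UdfExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("udf-example").getOrCreate()
        import spark.implicits._

        // Hypothetical business rule: bucket an order amount into a tier label.
        val tierUdf = udf { amount: Double =>
          if (amount >= 1000.0) "large" else if (amount >= 100.0) "medium" else "small"
        }
        spark.udf.register("order_tier", tierUdf) // also usable from Spark SQL queries

        val orders = Seq((1, 42.0), (2, 250.0), (3, 5000.0)).toDF("order_id", "amount")
        orders.withColumn("tier", tierUdf($"amount")).show()
      }
    }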

Used Python for developing Lambda functions in AWS.

Designed AWS Glue pipelines to ingest, process, and store data interacting with different services in AWS.

HADOOP DEVELOPER

Aetna – Hartford, CT

06/2017 – 04/2019

Responsibilities:

Installed Kafka to gather data from disparate sources and store it for consumption.

Utilized PyCharm for the majority of project testing and coding.

Consumed REST APIs and wrote source code for use in the Kafka program.

Worked on various real-time and batch processing applications using Spark/Scala, Kafka and Cassandra.

Built Spark applications to perform data enrichments and transformations using Spark DataFrames with Cassandra lookups.

Used the DataStax Spark Cassandra Connector to extract and load data to/from Cassandra.
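
A sketch of how the connector is typically used (keyspace, table, host, and paths are placeholders): read a Cassandra table as a DataFrame, join it against incoming data, and write the result back.

    import org.apache.spark.sql.SparkSession

    object CassandraLookupJob {
      def main(args: Array[String]): Unit = {
        // spark.cassandra.connection.host would point at the actual cluster.
        val spark = SparkSession.builder()
          .appName("cassandra-enrichment")
          .config("spark.cassandra.connection.host", "cassandra-host")
          .getOrCreate()

        // Read a Cassandra table via the DataStax Spark Cassandra Connector.
        val customers = spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "crm", "table" -> "customers"))
          .load()

        // Enrich incoming events with a lookup join and write the result back to Cassandra.
        val events = spark.read.parquet("hdfs:///data/events")
        val enriched = events.join(customers, Seq("customer_id"), "left")

        enriched.write
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "crm", "table" -> "enriched_events"))
          .mode("append")
          .save()
      }
    }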

Worked in a team to develop an ETL pipeline that involved extracting Parquet-serialized files from S3 and persisting them in HDFS.
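
The core of that step, sketched with placeholder bucket and target paths (credentials assumed to come from the cluster configuration):

    import org.apache.spark.sql.SparkSession

    object S3ToHdfsParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("s3-to-hdfs").getOrCreate()

        // Bucket, prefix, and target path are placeholders.
        val df = spark.read.parquet("s3a://example-bucket/raw/parquet/")

        df.write
          .mode("overwrite")
          .parquet("hdfs:///data/landing/parquet/")
      }
    }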

Developed Spark application that uses Kafka Consumer and Broker libraries to connect to Apache Kafka and consume data from the topics and ingest them into Cassandra.

Developed applications involving Big Data technologies such as Hadoop, Spark, MapReduce, YARN, Hive, Pig, Kafka, Oozie, Sqoop, and Hue.

Worked on Apache Airflow, Apache Oozie, and Azkaban.

Designed and implemented data ingestion framework to load data into the data lake for analytical purposes.

Developed data pipelines using Hive, Pig, and MapReduce.

Wrote MapReduce jobs.

Administered clusters in the Hadoop ecosystem.

Installed and configured Hadoop clusters for major Hadoop distributions.

Designed the reporting application that uses Spark SQL to fetch and generate reports on Hive.

Analyzed data using Hive and wrote User Defined Functions (UDFs).

Used AWS services like EC2 and S3 for small data sets processing and storage.

Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.

HADOOP DATA ARCHITECT/ENGINEER

City of Long Beach – Long Beach, CA

05/2015 – 06/2017

Responsibilities:

Analyzed Hadoop cluster using big data analytic tools, including Kafka, Pig, Hive, Spark, and MapReduce.

Configured Spark Streaming to receive real-time data from Kafka and stored it to HDFS using Scala.

Implemented Spark using Scala and Spark SQL for faster analysis and processing of data.

Built continuous Spark streaming ETL pipeline with Spark, Kafka, Scala, HDFS and MongoDB.

Imported/exported data into HDFS and Hive using Sqoop and Kafka.

Created Hive tables, loaded data, and wrote Hive queries.

Designed and developed ETL workflows using Python and Scala for processing data in HDFS and MongoDB.

Wrote complex Hive queries, Spark SQL queries and UDFs.

Wrote shell scripts to execute scripts (Pig, Hive, and MapReduce) and move the data files to/from HDFS.

Worked on importing unstructured data into the HDFS using Spark Streaming and Kafka.

Converted Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.

Worked with Amazon Web Services (AWS) and involved in ETL, Data Integration, and Migration.

Handled 20 TB of data volume with 120-node cluster in Production environment.

Loaded data from different servers to an AWS S3 bucket and set appropriate bucket permissions.

Applied Apache Kafka to combine live streaming with batch processing to generate reports.

Imported data into HDFS and Hive using Sqoop and Kafka. Created Kafka topics and distributed to different consumer applications.

Worked on Spark SQL and DataFrames for faster execution of Hive queries using Spark and AWS EMR.

Implemented partitioning, dynamic partitions, and buckets in Hive to increase performance and organize data logically.
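
A compact sketch of the idea (table and column names are illustrative): dynamic partitioning lets the partition value come from the data itself; bucketing would be added with a CLUSTERED BY clause, though bucketed inserts were normally run through Hive itself since Spark does not populate Hive-compatible buckets.

    import org.apache.spark.sql.SparkSession

    object HivePartitioningSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-partitioning")
          .enableHiveSupport()
          .getOrCreate()

        // Dynamic partitions let the partition value (state) come from the data itself.
        spark.sql("SET hive.exec.dynamic.partition=true")
        spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

        // Partitioned Hive table; names are illustrative.
        spark.sql(
          """CREATE TABLE IF NOT EXISTS claims_by_state (
            |  claim_id STRING,
            |  customer_id STRING,
            |  amount DOUBLE)
            |PARTITIONED BY (state STRING)
            |STORED AS ORC""".stripMargin)

        // Dynamic-partition insert: one partition per distinct state in the staging table.
        spark.sql(
          """INSERT OVERWRITE TABLE claims_by_state PARTITION (state)
            |SELECT claim_id, customer_id, amount, state
            |FROM staging_claims""".stripMargin)
      }
    }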

Scheduled and executed workflows in Oozie to run Hive and Pig jobs.

Worked with Spark Context, Spark SQL, DataFrames, and Pair RDDs.

Used Hive and Spark SQL Connection to generate Tableau BI reports.

Created Partitions and Buckets based on State to further process using Bucket-based Hive joins.

Created Hive Generic UDFs to process business logic that varied based on policy.

Developed various data connections from data sources to SSIS and Tableau Server for report and dashboard development.

Worked with clients to better understand their reporting and dashboarding needs and presented solutions using structured Waterfall and Agile project methodologies.

Developed metrics, attributes, filters, reports, and dashboards and created advanced chart types, visualizations, and complex calculations to manipulate data.

BIG DATA DEVELOPER

Intuitive Research & Technology – Huntsville, AL

01/2014 – 05/2015

Responsibilities:

Migrated data from RDBMS for streaming or static data into the Hadoop cluster using Hive, Pig, Flume, and Sqoop.

Implemented HDFS access controls, directory and file permissions, and user authorization to facilitate stable, secure access for multiple users in a large multi-tenant Hadoop cluster.

Developed applications using Hadoop ecosystem technologies such as Spark, Kafka, HDFS, Hive, Oozie, and Sqoop.

Worked with Big Data Hadoop ecosystem technologies such as HDFS, MapReduce, YARN, Apache Hive, Apache Spark, HBase, Scala, and Python for distributed processing of data.

Automated all jobs for pulling data from FTP server to load data into Hive tables using Oozie workflows.

Scheduled Oozie workflow engine to run multiple HiveQL, Sqoop, and Pig jobs.

Designed HBase row keys and data models for inserting into HBase tables using lookup-table and staging-table concepts.

Created Spark frameworks that utilized a large number of Spark and Hadoop applications running in series to create one cohesive E2E Big Data pipeline.

Used Spark SQL to load Parquet data, created Datasets defined by case classes, and handled structured data using Spark SQL, which was stored into Hive tables for downstream consumption.
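
A minimal version of that flow (schema, paths, and table names are illustrative): Parquet files loaded as a typed Dataset via a case class, filtered, and saved to a Hive table.

    import org.apache.spark.sql.SparkSession

    // Case class describing the Parquet schema (fields are illustrative).
    case class Trade(tradeId: String, symbol: String, quantity: Long, price: Double)

    object ParquetToHive {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("parquet-to-hive")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._

        // Load Parquet files into a strongly typed Dataset defined by the case class.
        val trades = spark.read.parquet("hdfs:///data/raw/trades").as[Trade]

        // Typed transformation, then persist to a Hive table for downstream consumption.
        val largeTrades = trades.filter(t => t.quantity * t.price > 100000.0)
        largeTrades.write.mode("overwrite").saveAsTable("analytics.large_trades")
      }
    }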

Implemented several highly distributed, scalable, large-scale applications using Cloudera Hadoop.

Used Cloudera Manager to collect metrics.

Developed Shell Scripts, Oozie Scripts, and Python Scripts.

HADOOP DATA ENGINEER

Charles Schwab – Littleton, CO

10/2012 – 12/2013

Responsibilities:

Performed administration, customization, and ETL data transformations for BI and data analysis of financial transactions, investments, and risk.

Worked with a team to gather and analyze the client requirements.

Analyzed large data sets distributed across clusters of commodity hardware.

Connected to Hadoop cluster and Cassandra ring and executed sample programs on servers.

Applied Hadoop and Cassandra as part of a next-generation platform implementation.

Developed several advanced YARN programs to process received data files.

Responsible for building scalable distributed data solutions using Hadoop.

Handled importing of data from various data sources, performed transformations using Hive and YARN, loaded data into HDFS, and extracted data from Oracle into HDFS using Sqoop.

Bulk loaded data into Cassandra using the sstableloader tool.

Loaded the OLTP models and performed ETL to load dimension data for a star schema.

Developed a built-in request builder in Scala to facilitate running scenarios using JSON configuration files.

Analyzed data by performing Hive queries and running Pig scripts to study customer behavior.

Involved in HDFS maintenance and loading of structured and unstructured data.

Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.

Data was formatted using Hive queries and stored on HDFS.

Created complex schema and tables for analysis using Hive.

Worked on creating MapReduce programs to parse the data for claim report generation and running the JARs in Hadoop. Coordinated with the Java team in creating MapReduce programs.

Implemented the project by using Spring Web MVC module.

Responsible for managing and reviewing Hadoop log files. Designed and developed data management system using MySQL.

Performed cluster maintenance as well as creation and removal of nodes using tools such as Cloudera Manager Enterprise.

Followed Agile methodology, interacted directly with the client and received feedback on the features, suggested/implemented optimal solutions, and tailored application to customer needs.

MySQL DATABASE SPECIALIST

Progressive Insurance – Mayfield, OH

09/2009 – 09/2012

Responsibilities:

Developed database architectural strategies at the modeling, design, and implementation stages.

Translated a logical database design or data model into an actual physical database implementation.

Analyzed and profiled data for quality and reconciled data issues using SQL.

Applied performance tuning to resolve issues with a large, high-volume, multi-server MySQL installation for clients' job-applicant site.

Designed and configured MySQL server cluster and managed each node on the Cluster.

Defined procedures to simplify database upgrades.

Standardized MySQL installs on all servers with custom configurations.

Modified database schemas.

Performed security audits of MySQL internal tables and user access, and revoked access for unauthorized users.

Set up replication for disaster and point-in-time recovery.

Maintained existing database systems.

Education

Associate in Software Engineering - Cincinnati State Technical & Community College


