Big Data Engineer

Location:
Atlanta, GA, 30339
Posted:
July 26, 2022

Professional Summary

*+ years of hands-on work with Big Data and Hadoop technologies in large-scale industrial and business environments.

Developed Spark code for Spark-SQL/Streaming in Scala and PySpark.

Experience integrating Kafka and Spark, using Avro to serialize and deserialize data for Kafka producers and consumers.
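
For illustration, a minimal sketch of the producer side of this integration in Scala; the broker address, topic, and Avro schema are hypothetical, and a production setup would normally pull the schema from a schema registry rather than define it inline.

import java.io.ByteArrayOutputStream
import java.util.Properties

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object AvroKafkaProducerSketch {
  // Illustrative Avro schema; a real one would come from an .avsc file or a registry.
  val schema: Schema = new Schema.Parser().parse(
    """{"type":"record","name":"UserEvent","fields":[
      |  {"name":"userId","type":"string"},
      |  {"name":"amount","type":"double"}]}""".stripMargin)

  // Avro-serialize a single record to a byte array for the Kafka message value.
  def serialize(record: GenericRecord): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val writer = new GenericDatumWriter[GenericRecord](schema)
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    writer.write(record, encoder)
    encoder.flush()
    out.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")

    val producer = new KafkaProducer[String, Array[Byte]](props)
    val event = new GenericData.Record(schema)
    event.put("userId", "u-123")
    event.put("amount", 42.5)
    producer.send(new ProducerRecord("user-events", "u-123", serialize(event))) // hypothetical topic
    producer.close()
  }
}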

Skilled with cluster management tools such as Ambari, Hue, and Cloudera Manager.

Implemented CI/CD tools such as Jenkins and used them to improve and automate processes.

Configured the GitHub plugin to integrate GitHub with Jenkins.

Utilized Elasticsearch and Logstash (ELK) for performance tuning.

Experienced converting Hive/SQL queries into Spark transformations using Spark RDDs and DataFrames.

Used Spark SQL to perform data processing on data residing in Hive.
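
A minimal sketch of this pattern, assuming Hive support is enabled in the SparkSession; the database, table, and column names are hypothetical.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object HiveToSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-spark-sketch")
      .enableHiveSupport() // read tables registered in the Hive metastore
      .getOrCreate()

    // A HiveQL aggregation run as-is through Spark SQL ...
    val bySql = spark.sql(
      "SELECT customer_id, SUM(amount) AS total FROM sales.orders GROUP BY customer_id")
    bySql.show(5) // sanity check of the SQL form

    // ... and the same logic expressed with the DataFrame API.
    val byApi = spark.table("sales.orders")
      .groupBy(col("customer_id"))
      .agg(sum(col("amount")).alias("total"))

    byApi.write.mode("overwrite").saveAsTable("sales.order_totals") // hypothetical target table
    spark.stop()
  }
}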

Used Spark Streaming to receive real-time data from Kafka, both on premises and in the AWS cloud.

Used Spark Structured Streaming, which extends the core Spark API, for high-performance, scalable, fault-tolerant processing of real-time data streams on premises and in AWS.

Hands-on with Apache Spark, a fast, general-purpose engine for large-scale data processing, integrated with the functional programming language Scala.

Experienced writing Hive/HiveQL scripts to extract, transform, and load data into databases.

Experienced writing Hive UDFs and completing incremental imports into Hive tables.

Experienced with Kafka for data ingestion and extraction, moving data into HDFS.

Experienced with Kafka cluster management for real-time processing.

Skilled in building highly available, scalable, and fault-tolerant systems on Amazon Web Services (AWS).

Experience working with multiple terabytes of data stored in AWS using Elastic MapReduce (EMR).

Hands on with Cassandra, HBase, NoSQL, and DynamoDB databases.

Prepared reporting visualizations using Tableau and Power BI.

Technical Skills

Big Data Platforms: Hadoop, Cloudera Hadoop, Hortonworks.

Hadoop Ecosystem (Apache) Tools: Cassandra, Flume, Kafka, Spark, Hadoop, Hadoop YARN, HBase, Hive, Oozie, Spark Streaming.

Hadoop Ecosystem Components: Zookeeper, Sqoop, Kibana, Tableau, AWS, RDDs, DataFrames, Datasets, Cloudera Impala, HDFS, Hortonworks.

Scripting: Python, Scala, SQL, Spark, HiveQL.

Data Storage and Files: HDFS, Data Lake, Data Warehouse, Redshift, Parquet, Avro, JSON.

Databases: Cassandra, HBase, MongoDB, SQL, MySQL, RDBMS, NoSQL, DB2, DynamoDB.

Schedulers: Airflow, Oozie

Cloud Platforms and Tools: AWS, S3, EC2, EMR, Lambda, Amazon Redshift, Microsoft Azure, Adobe Cloud, OpenStack, Google Cloud, MapR Cloud, Elastic Cloud.

Data Reporting and Visualization: Tableau, PowerBI, Kibana.

Web Technologies and APIs: XML, REST API, Spark API, JSON.

Pipeline Tool: Apache NiFi

Professional Experience

06.2020 – Present

Big Data Engineer

Stanley Black & Decker, Inc., New Britain, CT

Stanley Black & Decker, Inc., formerly known as The Stanley Works, is a Fortune 500 American manufacturer of industrial tools and household hardware and provider of security products.

Installed and configured Apache Hadoop, Hue, Spark, Hive on a Cloudera framework.

Employed Hadoop ecosystem to create data pipelines involving HDFS.

Managed multiple Hadoop clusters to support data warehousing.

Created and maintained Hadoop HDFS/Spark/HBase pipelines.

Implemented HDFS data lake on Cloudera framework.

Helped build ML models and led their deployment in a cloud environment.

Led development of an AWS pipeline for ML models, utilizing S3 and EMR to store and process data, respectively.

Worked with Spark to create structured data from pools of incoming unstructured data.

Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.

Implemented Spark using Scala and Spark SQL for faster testing and processing of data.

Used Hive together with Spark SQL to feed Tableau reports.

Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
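
For illustration, an aggregation of that kind (roughly SELECT customer_id, SUM(amount) ... GROUP BY customer_id) re-expressed over an RDD; the input path and column positions are hypothetical.

import org.apache.spark.sql.SparkSession

object SqlToRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-to-rdd-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical CSV export of the source table: customer_id,order_id,amount
    val totals = sc.textFile("/data/raw/orders.csv")
      .map(_.split(","))
      .filter(_.length == 3)                    // drop malformed lines
      .map(cols => (cols(0), cols(2).toDouble)) // (customer_id, amount)
      .reduceByKey(_ + _)                       // GROUP BY customer_id, SUM(amount)

    totals.saveAsTextFile("/data/out/order_totals") // hypothetical output path
    spark.stop()
  }
}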

Documented requirements, including the available code to be implemented using Spark, Hive, HDFS, and Elasticsearch.

Maintained the ELK stack (Elasticsearch, Logstash, Kibana) and wrote Spark scripts using the Scala shell.

Implemented Spark using Scala and utilized DataFrames and Spark SQL API for faster processing of data.

Developed Spark Streaming applications to consume data from Kafka topics and insert the processed streams to HBase.
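
A condensed sketch of one such job, written here with Structured Streaming's foreachBatch rather than the original DStream API; the topic, HBase table, and column family are hypothetical, and connection handling is simplified.

import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaToHBaseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-hbase-sketch").getOrCreate()

    // Assumes keyed messages; a real job would handle null keys and bad payloads.
    val readings = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "sensor-readings")              // hypothetical topic
      .load()
      .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

    val query = readings.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.rdd.foreachPartition { rows =>
          // One HBase connection per partition per micro-batch.
          val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
          val table = conn.getTable(TableName.valueOf("readings")) // hypothetical table
          rows.foreach { row =>
            val put = new Put(Bytes.toBytes(row.getAs[String]("id")))
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
              Bytes.toBytes(row.getAs[String]("payload")))
            table.put(put)
          }
          table.close()
          conn.close()
        }
      }
      .option("checkpointLocation", "/data/checkpoints/readings") // hypothetical path
      .start()

    query.awaitTermination()
  }
}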

Worked with data scientists to configure preliminary AWS SageMaker settings for models.

Worked with unstructured data and parsed out the information using Python built-in functions.

Worked with Data Lakes and Big Data ecosystems (Hadoop, Spark, Hortonworks, Cloudera).

Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.

Handled schema changes in data stream using Kafka.

Developed new Flume agents to extract data from Kafka.

Connected to the Kafka broker from Structured Streaming to obtain structured data by applying a schema.
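
A minimal sketch of reading a Kafka topic in Structured Streaming and applying an explicit schema with from_json; the topic, schema fields, and checkpoint path are illustrative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object KafkaSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-schema-sketch").getOrCreate()

    // Illustrative schema for the JSON payloads on the topic.
    val eventSchema = StructType(Seq(
      StructField("device_id", StringType),
      StructField("metric", StringType),
      StructField("value", DoubleType)))

    val parsed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "device-metrics")               // hypothetical topic
      .load()
      .select(from_json(col("value").cast("string"), eventSchema).alias("e"))
      .select("e.*") // one column per field in the schema

    val query = parsed.writeStream
      .format("console") // sink kept trivial for the sketch
      .option("checkpointLocation", "/tmp/checkpoints/device-metrics")
      .start()
    query.awaitTermination()
  }
}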

Analyzed and tuned Cassandra data model for multiple internal projects and worked with analysts to model Cassandra tables from business rules and enhance/optimize existing tables.

Designed and deployed new ELK clusters.

Created log monitors and generated visual representations of logs using ELK stack.

Implemented upgrade, backup, and restore procedures for CI/CD tools.

11.2018 – 06.2020

Hadoop Developer

Archer-Daniels-Midland Company (ADM), Chicago, IL

The Archer-Daniels-Midland Company, commonly known as ADM, is an American multinational food processing and commodities trading corporation.

Applied the Kafka Streams library.

Configured Kafka producer with API endpoints using JDBC Autonomous REST Connectors.

Configured a multi-node, multi-broker cluster of 10 nodes and 30 brokers for consuming high-volume, high-velocity data.

Wrote complex queries in Apache Hive on the Hortonworks Sandbox.

Utilized HiveQL to query data to discover trends from week to week.

Configured and deployed production-ready multi-node Hadoop services (Hive, Sqoop, Flume, and Oozie) on the Hadoop cluster with the latest patches.

Created Hive queries to summarize and aggregate business data, comparing Hadoop data with historical metrics.

Loaded ingested data into Hive managed and external tables.

Wrote custom user-defined functions (UDFs) for complex Hive (HQL) queries.
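
A minimal sketch of one such UDF in Scala; the class name and normalization rule are hypothetical.

import org.apache.hadoop.hive.ql.exec.UDF

// Hypothetical UDF that normalizes free-text product codes before they are grouped in HQL.
class NormalizeCode extends UDF {
  def evaluate(raw: String): String =
    if (raw == null) null
    else raw.trim.toUpperCase.replaceAll("[^A-Z0-9]", "")
}

Packaged into a JAR, a function like this would be registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION before being called from HQL.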

Performed upgrades, patches, and bug fixes in Hadoop in a Cluster environment.

Wrote shell scripts to automate workflows to pull data from various databases into Hadoop framework for users to access data through Hive-based views.

Wrote Hive queries for analyzing data in Hive warehouse using Hive Query Language.

Built the Hive views on top of the source data tables.

Used Cloudera Manager for installation and management of single-node and multi-node Hadoop Cluster.

Wrote shell scripts for automating the process of data loading.

Added and removed cluster nodes and performed cluster monitoring and troubleshooting.

Managed and reviewed data backups and managed and reviewed Hadoop log files.

Developed Kafka producer and consumer programs using Scala.
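
A compact sketch of the producer and consumer sides using the plain Kafka clients API; the broker address, topic, group id, and payload are hypothetical.

import java.time.Duration
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaClientsSketch {
  private val brokers = "localhost:9092" // assumed broker address
  private val topic = "orders"           // hypothetical topic

  def produce(): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord(topic, "order-1", """{"id":"order-1","qty":3}"""))
    producer.close()
  }

  def consume(): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("group.id", "orders-reader")     // hypothetical consumer group
    props.put("auto.offset.reset", "earliest") // read from the beginning for a new group
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList(topic))
    val records = consumer.poll(Duration.ofSeconds(5))
    records.forEach(r => println(s"${r.key} -> ${r.value}"))
    consumer.close()
  }

  def main(args: Array[String]): Unit = { produce(); consume() }
}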

Created UDFs in Scala.

Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.

Created Spark programs using Scala for better performance.

Implemented parser, query planner, query optimizer, and native query execution using replicated logs combined with indexes supporting full relational KQL queries, including joins.

Developed distributed query agents for performing distributed queries against shards.

Wrote producer/consumer scripts in Python to process JSON responses.

Wrote Queries, Stored Procedures, Functions and Triggers using SQL.

Supported development, testing, and operations teams during new system deployments.

Developed pipelines using Apache Nifi and moved data from various sources to HDFS.

Used HBase as a NoSQL database for data storage.

Evaluated and proposed new tools and technologies to meet the needs of the organization.

Applied understanding of Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.

Created metadata and tested databases.

Applied hands-on object-oriented (OO) programming, integration, and testing.

Collected business specifications, user requirements, design confirmations, and developed and documented the entire software development life cycle and QA.

Developed scripts for collecting high-frequency log data from various sources and integrated it into HDFS using Flume, and staged data in HDFS for further analysis.

05.2016 – 11.2018

Big Data Engineer

Atria Senior Living, Louisville, KY

Atria Senior Living is a leading operator of independent living, assisted living, supportive living, and memory care communities in more than 430 locations in 45 states and seven Canadian provinces.

Developed and maintained continuous integration systems in a Cloud computing environment (Azure).

Created UNIX shell scripts to automate the build process and perform regular jobs like file transfers.

Developed Shell Scripts, Oozie Scripts, and Python Scripts.

Wrote Hive queries and optimized the Hive queries with Hive QL.

Integrated ETL into the Hadoop file system (HDFS) and wrote Hive UDFs.

Imported real-time logs to HDFS using Flume.

Managed Hadoop clusters and checked the status of clusters using Ambari.

Developed scripts to automate the workflow processes and generate reports.

Transferred data between the Hadoop ecosystem and structured data storage in an RDBMS such as MySQL using Sqoop.

Wrote PySpark code creating DataFrames from tables in the data service layer and writing them to the Hive data warehouse.

Wrote incremental imports into Hive tables.

Optimized Hive Queries.

Streamed prepared data to DynamoDB using Spark to make it accessible for visualization and report generation by the BI team.
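
A condensed sketch of that write path using the AWS SDK from foreachPartition; the table and attribute names are hypothetical, and the original job may equally have used the EMR DynamoDB connector.

import scala.collection.JavaConverters._

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, PutItemRequest}
import org.apache.spark.sql.SparkSession

object SparkToDynamoDbSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-to-dynamodb-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical prepared table produced earlier in the pipeline.
    val prepared = spark.table("reporting.daily_summary")

    prepared.rdd.foreachPartition { rows =>
      val client = AmazonDynamoDBClientBuilder.defaultClient() // credentials from the instance profile
      rows.foreach { row =>
        val item = Map(
          "summary_id" -> new AttributeValue().withS(row.getAs[String]("summary_id")),
          "total"      -> new AttributeValue().withN(row.getAs[Double]("total").toString)
        ).asJava
        client.putItem(new PutItemRequest().withTableName("daily_summary").withItem(item))
      }
    }
    spark.stop()
  }
}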

Loaded data into HBase and Hive tables for consumption purposes.

Provided connections from Business Intelligence tools such as Tableau and Power BI to the tables in the data warehouse.

Analyzed and tuned Cassandra data model for multiple internal projects/customers.

Downloaded data through Hive on the HDFS platform.

Configured Hadoop components (HDFS, Zookeeper) to coordinate the servers in clusters.

Wrote shell scripts to automate workflows to pull data from various databases into Hadoop.

08.2014 – 05.2016

Data Analytics Developer

Sun Trust Banks, Atlanta, GA

SunTrust Banks, Inc. was an American bank holding company with SunTrust Bank as its largest subsidiary. The company merged with BB&T in 2019 to form Truist Financial.

Built Data Warehousing/Storage solutions along with data processing pipelines to enable pulling data from various sources and file formats into Hadoop HDFS in preparation for cleansing and analysis.

Designed an archival platform that provided a cost-effective platform for storing Big Data using Hadoop and its related technologies.

Created Hive tables, loaded data and wrote Hive Queries.

Used the Hive JDBC to verify the data stored in the Hadoop cluster.

Worked with Impala to compare its processing time against Apache Hive for batch applications and adopted Impala for the project. Used Impala extensively to read, write, and query Hadoop data in HDFS.

Archived data to Hadoop cluster and performed search, query and data retrieval from the cluster.

Analyzed data originating from various Xerox devices and stored it in the Hadoop warehouse. Used Pig as an ETL tool for transformations, joins, and pre-aggregations before storing the data in HDFS.

Optimized Hive queries using map-side join, parallel execution, and cost-based optimization.

Implemented Partitioning, Dynamic Partitions and Buckets in Hive for optimized data retrieval.
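
A sketch of that table layout, expressed here through Spark's Hive support for brevity (the same HiveQL runs in the Hive CLI); the table names, columns, and bucket count are illustrative.

import org.apache.spark.sql.SparkSession

object HivePartitionBucketSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-partition-bucket-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Allow dynamic partition values to come from the data itself.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Hypothetical fact table partitioned by load date.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS warehouse.transactions (
        txn_id STRING, customer_id STRING, amount DOUBLE)
      PARTITIONED BY (load_date STRING)
      STORED AS PARQUET""")

    // Dynamic partition insert: one partition per distinct load_date in the staging data.
    spark.sql("""
      INSERT OVERWRITE TABLE warehouse.transactions PARTITION (load_date)
      SELECT txn_id, customer_id, amount, load_date
      FROM staging.transactions_raw""")

    // Bucketed dimension table; bucketed loads would typically run in Hive itself
    // with hive.enforce.bucketing enabled.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS warehouse.customers_bucketed (
        customer_id STRING, segment STRING)
      CLUSTERED BY (customer_id) INTO 32 BUCKETS
      STORED AS ORC""")

    spark.stop()
  }
}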

Connected various data centers and transferred data between them using Sqoop and various ETL tools.

Extracted the data from RDBMS (Oracle, MySQL) to HDFS using Sqoop.

Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig and Hive.

Used the Spring Framework to develop business objects and integrate all the components in the system. Hands-on integrating Spring with HDFS.

Education

Bachelor's (Computer Systems Engineering) – Instituto Tecnológico de Toluca


