
Hadoop Data Engineer

Location: Tappahannock, VA, 22560
Posted: September 01, 2022



Guillermo Resendiz - Big Data Developer

Columbus, OH 43215

adsd3o@r.postjobfree.com

+1-404-***-****

· Close to 7+ years of professional experience in Big Data, Hadoop, and Cloud.

· Experienced with the Hadoop ecosystem (Hadoop, Cloudera, Impala, Hortonworks) and cloud computing environments such as Amazon Web Services (AWS) and Microsoft Azure.

· Hands-on experience in the Hadoop Framework and its ecosystem, including but not limited to HDFS architecture, MapReduce programming, Hive, Pig, Sqoop, HBase, MongoDB, Cassandra, Oozie, Spark RDDs, Spark DataFrames, Spark Datasets, and Spark MLlib.

· Involved in building a multi-tenant Hadoop cluster with disaster management.

· Hands-on experience installing and configuring Cloudera and Hortonworks distributions.

· Excellent knowledge of Hadoop architecture and ecosystem components such as HDFS, node configuration, YARN, MapReduce, Sentry, Spark, Falcon, HBase, Hive, Pig, and Ranger.

· Developed scripts and automated end-to-end data management and synchronization across all clusters.

· Extended Hive and Pig core functionality by writing custom UDFs.

· Used Apache Flume to collect logs and error messages across the cluster.

· Troubleshot and tuned code written in SQL, Java, Python, Scala, Pig, and Hive, as well as Spark RDDs, DataFrames, and MapReduce jobs.

· Designed elegant solutions driven by clearly defined problem statements.

· Accustomed to working with large complex data sets, real-time/near real-time analytics, and distributed big data platforms.

· Proficient in major vendor Hadoop distributions such as Cloudera, Hortonworks, and MapR.

· Deep knowledge of incremental imports and of the partitioning and bucketing concepts in Hive and Spark SQL needed for optimization (a minimal sketch follows this summary).

· Experience collecting log data from various sources and integrating it into HDFS using Flume; staging data in HDFS for further analysis.

· Experience collecting real-time log data from different sources like webserver logs and social media data from Facebook and Twitter using Flume, and storing in HDFS for further analysis.

· Experience deploying large multi-node Hadoop and Spark clusters.

· Experience developing Oozie workflows for scheduling and orchestrating the ETL process.

· Solid understanding of statistical analysis, predictive analysis, machine learning, data mining, quantitative analytics, multivariate testing and optimization algorithms.

· Proficient in mapping business requirements, use cases, scenarios, business analysis, and workflow analysis. Acted as liaison between business units, technology, and IT support teams.

Willing to relocate: Anywhere
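The following is a minimal PySpark sketch of the Hive/Spark SQL partitioning and bucketing approach referenced in the summary above; the database, table, columns, and path (analytics.events, event_date, user_id) are hypothetical placeholders, not details from any listed project.

    from pyspark.sql import SparkSession

    # Hive support registers the table in the metastore so Hive/Impala can query it.
    spark = (SparkSession.builder
             .appName("partitioning-bucketing-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Read the latest incremental extract (path and schema are illustrative).
    df = spark.read.parquet("/data/staging/events/")

    # Partition by a low-cardinality column (date) and bucket by a high-cardinality
    # join key (user_id) so queries and joins can prune data and avoid shuffles.
    (df.write
       .mode("append")
       .partitionBy("event_date")
       .bucketBy(32, "user_id")
       .sortBy("user_id")
       .saveAsTable("analytics.events"))

    # Downstream queries scan only the partitions they need.
    spark.sql("SELECT COUNT(*) FROM analytics.events WHERE event_date = '2022-01-01'").show()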

Work Experience

Big Data Developer

Cricket Wireless - City

August 2020 to Present

Cricket Wireless is a wireless service provider, owned by AT&T. It provides wireless services to millions of subscribers in the United States.

· Developed a Spark Streaming application to pull data from cloud servers into Hive tables (see the first sketch after this list).

· Used Spark SQL to process large volumes of structured data.

· Worked with the Spark SQL context to create DataFrames and filter input data for model execution.

· Designed and developed data pipelines in an Azure environment using ADLS Gen2, Blob Storage, ADF, Azure Databricks, Azure SQL, and Azure Synapse for analytics, with MS Power BI for reporting.

· Converted existing Hive scripts to Spark applications using RDDs to transform data and load it into HDFS.

· Developed Hive queries used daily to retrieve datasets.

· Ingested data from JDBC databases into Hive.

· Developed Spark jobs using Spark SQL, Python, and the DataFrames API to process structured data on Spark clusters.

· Used Spark for transformation and preparation of DataFrames.

· Removed and filtered unnecessary data from raw data using Spark.

· Configured StreamSets to store the converted data in Hive using JDBC drivers.

· Wrote Spark code to remove certain fields from the Hive table.

· Joined multiple tables to find the correct information for certain addresses.

· Wrote code to standardize strings and IP addresses across datasets.

· Used Hadoop as a data lake to store large amounts of data.

· Developed Oozie workflows and ran Oozie jobs in parallel.

· Used Oozie to update tables automatically over Hive.

· Wrote unit tests for all code using frameworks such as pytest.

· Wrote code according to the schema change of the source table based on the data warehouse dimensional modeling.

· Created a geospatial index, performed a geospatial search, and populated a result column with Y/N based on the distance found (see the second sketch after this list).

· Used Scala and Spark SQL for faster testing and processing of data.

· Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and AWS Redshift.

· Developed test cases for features deployed for Spark code and UDFs.

· Optimized JDBC connections with bulk upload for Hive and Spark imports.

· Handled defects from internal testing tools to increase code coverage over Spark.

· Hands-on with EC2, CloudWatch, CloudFormation, and managing security groups on AWS.
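First sketch: a simplified, assumption-laden example of the kind of Spark Structured Streaming job described above, picking up files that land in cloud storage and appending them to a Hive-backed table; the bucket paths, schema fields, and table name are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = (SparkSession.builder
             .appName("cloud-to-hive-stream-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # An explicit schema avoids re-inferring it on every micro-batch (fields are illustrative).
    schema = StructType([
        StructField("account_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_time", TimestampType()),
    ])

    # Treat JSON files landing in cloud storage as a stream of new data.
    raw = spark.readStream.schema(schema).json("s3a://example-bucket/incoming/")

    # Light cleanup before the data reaches the Hive table.
    clean = raw.filter("account_id IS NOT NULL")

    # Append each micro-batch to a Hive-backed table.
    query = (clean.writeStream
             .outputMode("append")
             .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
             .foreachBatch(lambda batch_df, batch_id:
                           batch_df.write.mode("append").saveAsTable("staging.events"))
             .start())

    query.awaitTermination()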
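Second sketch: a hedged illustration of the geospatial Y/N flag mentioned above, using a plain haversine UDF rather than any specific geospatial index library; the sample rows, column names, and the 5 km threshold are illustrative assumptions.

    import math

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("geo-flag-sketch").getOrCreate()

    # Sample address rows with the coordinates of the nearest candidate site (made-up values).
    addresses = spark.createDataFrame(
        [("100 Main St", 39.96, -83.00, 39.99, -82.98)],
        ["address", "addr_lat", "addr_lon", "site_lat", "site_lon"])

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance in kilometres between two lat/lon points.
        r = 6371.0
        dlat = math.radians(lat2 - lat1)
        dlon = math.radians(lon2 - lon1)
        a = (math.sin(dlat / 2) ** 2
             + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dlon / 2) ** 2)
        return 2 * r * math.asin(math.sqrt(a))

    haversine_udf = F.udf(haversine_km, DoubleType())

    # Compute the distance and populate a Y/N column using an assumed 5 km threshold.
    flagged = (addresses
               .withColumn("distance_km",
                           haversine_udf("addr_lat", "addr_lon", "site_lat", "site_lon"))
               .withColumn("within_range",
                           F.when(F.col("distance_km") <= 5.0, "Y").otherwise("N")))

    flagged.show()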

AWS Hadoop Data Engineer

American Electric Power - Columbus, OH

June 2018 to August 2020

American Electric Power is a major investor-owned electric utility in the United States, delivering electricity to more than five million customers in 11 states.

· Hands-on with AWS data migration between database platforms, from local SQL Server instances to Amazon RDS and EMR Hive.

· Optimized Hive analytics SQL queries, created tables and views, and wrote custom queries and Hive-based exception processes.

· Implemented AWS fully managed Kafka streaming to send data streams from company APIs to a Spark cluster in Databricks on AWS.

· Developed consumer intelligence reports based on market research, data analytics, and social media.

· Joined, manipulated, and drew actionable insights from large data sources using Python and SQL.

· Worked on AWS to create and manage EC2 instances and Hadoop clusters.

· Implemented a Cloudera Hadoop distribution cluster using AWS EC2.

· Deployed the big data Hadoop application using Talend on AWS Cloud.

· Utilized AWS Redshift to store Terabytes of data on the Cloud.

· Used Spark SQL and DataFrames API to load structured and semi-structured data into Spark Clusters.

· Wrote shell scripts to ship log files to the Hadoop cluster through automated processes.

· Used Spark-SQL and Hive Query Language (HQL) for obtaining client insights.

· Ingested large data streams from company REST APIs into the EMR cluster through AWS Kinesis.

· Streamed data from AWS fully managed Kafka brokers using Spark Streaming and processed the data with explode transformations (see the sketch after this list).

· Finalized the data pipeline using DynamoDB as a NoSQL storage option.
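An illustrative sketch, under stated assumptions, of consuming the managed Kafka stream with Spark Structured Streaming and flattening nested payloads with explode, as in the bullets above; broker addresses, topic name, and message schema are hypothetical.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    spark = SparkSession.builder.appName("kafka-explode-sketch").getOrCreate()

    # Schema of the JSON payload produced by the company APIs (fields are illustrative).
    schema = StructType([
        StructField("meter_id", StringType()),
        StructField("readings", ArrayType(StructType([
            StructField("ts", StringType()),
            StructField("kwh", StringType()),
        ]))),
    ])

    # Subscribe to the managed Kafka topic (brokers and topic are placeholders).
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
              .option("subscribe", "api-events")
              .load())

    # Parse the JSON value and explode the nested readings array into one row per reading.
    parsed = (stream
              .select(F.from_json(F.col("value").cast("string"), schema).alias("msg"))
              .select("msg.meter_id", F.explode("msg.readings").alias("r"))
              .select("meter_id", F.col("r.ts").alias("ts"), F.col("r.kwh").alias("kwh")))

    # Persist the flattened stream (sink path and checkpoint location are placeholders).
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "s3a://example-bucket/meter-readings/")
             .option("checkpointLocation", "s3a://example-bucket/checkpoints/meter-readings/")
             .start())

    query.awaitTermination()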

Data Engineer

General Motors Company - Detroit, MI

January 2017 to June 2018

General Motors Company is an American multinational automotive manufacturing corporation headquartered in Detroit, Michigan.

· Managed, configured, tuned, and ran continuous deployment of 100+ Hadoop nodes on Red Hat Enterprise Linux 5.

· Configured 2 medium-scale AMI instances for the NameNodes via the AWS console.

· Designed batch processing jobs using Apache Spark to increase speed compared to MapReduce jobs.

· Developed Python scripts to automate the workflow processes and generate reports.

· Developed a task execution framework on EC2 instances using SQL and DynamoDB.

· Developed custom aggregate functions using Spark SQL and performed interactive querying (see the sketch after this list).

· Built a prototype for real-time analysis using Spark streaming and Kafka.

· Used Spark SQL and the DataFrame API extensively to build Spark applications.

· Used the Spark engine and Spark SQL for data analysis and provided results to data scientists for further analysis.

· Performed streaming data ingestion to the Spark distribution environment, using Kafka.

· Used Spark to process data on top of YARN and performed analytics on data in Hive.

· Programmed Spark code in Python to run a sorting application on data stored in AWS.

· Deployed the application JAR files into AWS instances.

· Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.

· Developed and maintained continuous integration systems in a Cloud computing environment (Google Cloud Platform (GCP)).

· Implemented advanced feature engineering procedures for the data science team using the in-memory computing capabilities of Apache Spark, written in Scala.

· Migrated complex MapReduce scripts to Apache Spark code.

· Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation.

· Imported data from disparate sources into Spark RDD for processing.

· Utilized Spark DataFrame and Spark SQL API extensively for processing.

· Stored non-relational data on HBase.

· Created alter, insert, and delete queries involving lists, sets, and maps in DataStax Cassandra.

· Worked on Impala to compare its processing time with that of Apache Hive for batch applications, in order to implement the former in the project.

· Utilized Impala to read, write, and query Hadoop data in HDFS.
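A minimal sketch of the custom aggregation and interactive Spark SQL querying mentioned above; the sample data, columns, and weighted-defect-rate calculation are assumptions made purely for illustration.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("custom-agg-sketch").getOrCreate()

    # Sample quality data per plant (all values are made up).
    df = spark.createDataFrame(
        [("plant_a", 10.0, 2), ("plant_a", 30.0, 1), ("plant_b", 20.0, 4)],
        ["plant", "weight", "defects"])
    df.createOrReplaceTempView("quality")

    # Custom aggregate built from the DataFrame API: defects weighted by part weight.
    agg = (df.groupBy("plant")
             .agg((F.sum(F.col("defects") * F.col("weight")) / F.sum("weight"))
                  .alias("weighted_defect_rate")))
    agg.show()

    # The same aggregation expressed in Spark SQL for interactive, ad-hoc querying.
    spark.sql("""
        SELECT plant,
               SUM(defects * weight) / SUM(weight) AS weighted_defect_rate
        FROM quality
        GROUP BY plant
    """).show()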

Hadoop Data Engineer

iTechArt - New York, NY

November 2015 to December 2016

iTechArt is a top-tier, one-stop custom software development company with a talent pool of 3500+ experienced engineers. We help VC-backed startups and fast-growing tech companies build successful, scalable products that users love.

· Assisted in building out a managed cloud infrastructure with improved systems and analytics capability.

· Researched various available technologies, industry trends, and cutting-edge applications.

· Designed and set up POCs to test various tools, technologies, and configurations, along with custom applications.

· Used Oozie to automate/schedule business workflows which invoked Sqoop and Pig jobs.

· Used Zookeeper and Oozie for coordinating the cluster and scheduling workflows.

· Used the Hive JDBC to verify the data stored in the Hadoop cluster.

· Loaded and transformed large sets of structured, semi-structured, and unstructured data.

· Used different file formats such as text files, SequenceFiles, and Avro.

· Loaded data from various data sources into HDFS using Kafka.

· Tuned and operated Spark and its related technologies such as Spark SQL.

· Used shell scripts to dump the data from MySQL to HDFS.

· Used the NoSQL database MongoDB for implementation and integration.

· Worked on streaming the analyzed data to Hive tables using Storm for making it available for visualization and report generation by the BI team.

· Used instance image files to create new instances with Hadoop installed and running.

· Designed a cost-effective archival platform for storing big data using Hadoop and its related technologies.

· Implemented GCP Dataproc to create a cloud cluster and run Hadoop jobs.

· Connected various data centers and transferred data between them using Sqoop and various ETL tools.

· Extracted data from RDBMS (Oracle, MySQL) to HDFS using Sqoop (see the sketch after this list).

· Worked with the client to reduce churn rate by reading and translating data from social media websites.

· Integrated Kafka with Spark Streaming for real-time data processing.

· Imported data from disparate sources into Spark RDD for processing.

· Built a prototype for real-time analysis using Spark streaming and Kafka.

· Collected the business requirements from the subject matter experts like data scientists and business partners.

· Configured Oozie workflow engine scheduler to run multiple Hive, Sqoop, and Pig jobs.

· Consumed data from the Kafka queue using Storm.
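A hedged alternative sketch of the RDBMS-to-HDFS extraction described above, shown here with PySpark's JDBC reader rather than Sqoop itself; the host, credentials, table, partition bounds, and output path are hypothetical placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("rdbms-to-hdfs-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Read a MySQL table over JDBC in parallel; connection details and bounds are placeholders.
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:mysql://mysql-host:3306/sales")
              .option("dbtable", "orders")
              .option("user", "etl_user")
              .option("password", "example-password")
              .option("partitionColumn", "order_id")
              .option("numPartitions", 8)
              .option("lowerBound", 1)
              .option("upperBound", 10000000)
              .load())

    # Land the extract in HDFS as Parquet, partitioned by day, for downstream Hive queries
    # (assumes the source table has an order_date column).
    (orders.write
       .mode("overwrite")
       .partitionBy("order_date")
       .parquet("hdfs:///data/raw/sales/orders/"))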

Education

Certification in Mathematics

Benemerita Universidad Autonoma de Puebla

Skills

PROGRAMMING/SCRIPTING and Frameworks

• Java

• Python

• Scala

• R

• PIG/Pig Latin

• Hive

• HiveQL

• MapReduce

• UNIX

• Shell scripting

• Yarn

• Spark

• Spark Streaming

• Storm

• Kafka

DEVELOPMENT ENVIRONMENTS

• Eclipse

• IntelliJ

• NetBeans

• PyCharm

• Visual Studio

AMAZON CLOUD

• Amazon AWS (EMR, EC2, EC3, SQL, S3, DynamoDB, Cassandra, Redshift, CloudFormation)

DATABASES

• NoSQL

• MongoDB

• Cassandra

• Dynamo

• Mongo

• SQL

• MySQL

• Oracle

HADOOP DISTRIBUTIONS

• Cloudera

• Hortonworks

• MapR

• Elastic

QUERY/SEARCH

• SQL

• HiveQL

• Impala

• Apache SOLR

• Kibana

• Elasticsearch

Visualization Tools

• Tableau

• Qlikview

File Formats

• Parquet

• Avro

• Orc

• JSON

Data Pipeline Tools

• Apache Airflow

• Nifi

Admin Tools

• Oozie

• Cloudera Manager

• Zookeeper

• Active Directory

SOFTWARE DEVELOPMENT

• Agile

• Continuous Integration

• Test-Driven Development

• Unit Testing

• Functional Testing

• Git

• Jenkins

• Jira


