
Hadoop Data Engineer

Location: Minneapolis, MN

Posted: August 25, 2022

Ismael Illan - Big Data Engineer

Minneapolis, MN 55402

adrtmi@r.postjobfree.com

+1-507-***-****

· 6 years of experience in Hadoop, Big Data, and cloud platforms.

· Hands-on with the Spark framework for both batch and real-time streaming data processing.

· Hands-on experience processing data using Spark Streaming API and Spark SQL.

· Skilled in AWS, Redshift, DynamoDB and various cloud tools.

· Streamed millions of messages per day through Kafka and Spark Streaming.

· Move and transform Big Data into insightful information using Sqoop.

· Build Big Data pipelines to optimize utilization of data and configure end-to-end systems.

· Use Kafka for data ingestion into and extraction from the Hortonworks HDFS system.

· Use Spark SQL to perform preprocessing using transformations and actions on data residing in HDFS.

· Create Spark Streaming jobs that divide streaming data into micro-batches as input to the Spark engine for processing (see the sketch after this list).

· Set up and configure Kafka brokers to meet the organization's Big Data needs.

· Write Spark DataFrames to NoSQL databases like Cassandra.

· Build high-quality Big Data transfer and transformation pipelines using Kafka, Spark, Spark Streaming, and Hadoop.

· Design and develop new Spark-based systems and tools that enable clients to optimize and track their data.

· Work with highly available, scalable, and fault-tolerant Big Data systems using Amazon Web Services (AWS).

· Provide end-to-end data solutions and support using Hadoop Big Data systems and tools on AWS cloud services as well as on-premises nodes.

· Well versed in Big Data ecosystem using Hadoop, Spark, Kafka with column-oriented big data systems such as Cassandra and HBase.

· Implement Spark on EMR to process Big Data across the data lake in AWS.

· Work with various file formats (delimited text files, clickstream log files, Apache log files, Avro, JSON, CSV, XML).

· Use Kafka and HiveQL scripts to extract, transform, and load the data into multiple databases.

· Perform cluster and system performance tuning on Big Data systems.
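The Spark Streaming bullet above references the following minimal PySpark sketch: read JSON events from Kafka with Structured Streaming and append each micro-batch to a Cassandra table. It assumes the Kafka source and DataStax Spark Cassandra Connector packages are on the Spark classpath; the broker address, topic, schema, keyspace, and table names are illustrative placeholders, not details from this resume.

```python
# Minimal sketch: Kafka -> Spark Structured Streaming -> Cassandra (via foreachBatch).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = (SparkSession.builder
         .appName("kafka-to-cassandra")                                 # hypothetical app name
         .config("spark.cassandra.connection.host", "cassandra-host")   # placeholder host
         .getOrCreate())

# Hypothetical event schema; the real schema depends on the producer.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
          .option("subscribe", "events")                        # hypothetical topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Each micro-batch is appended to a (hypothetical) keyspace.table.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="analytics", table="events")
     .mode("append")
     .save())

query = (events.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/tmp/checkpoints/events")  # placeholder path
         .start())
query.awaitTermination()
```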

Willing to relocate: Anywhere

Work Experience

Cloud Engineer

Ameriprise Financial, Inc. - Minneapolis, MN

August 2020 to Present

Ameriprise Financial, Inc. is a diversified financial services company and bank holding company that provides financial planning products and services, including wealth management, asset management, insurance, annuities, and estate planning.

· Worked as part of the Big Data Engineering team to design and develop data pipelines in an Azure environment using ADLS Gen2, Blob Storage, ADF, Azure Databricks, Azure SQL, and Azure Synapse for analytics, and MS Power BI for reporting.

· Worked on AWS to create and manage EC2 instances and Hadoop clusters.

· Wrote shell scripts to automatically load log files into the Hadoop cluster.

· Implemented fully managed Kafka (Amazon MSK) streaming to send data streams from the company APIs to a Spark cluster in Databricks on AWS.

· Streamed data from the managed Kafka brokers using Spark Streaming and processed it with explode transformations (see the sketch at the end of this section).

· Developed AWS CloudFormation templates to provision the custom infrastructure for our pipeline.

· Performed streaming data ingestion process through PySpark.

· Finalized the data pipeline using DynamoDB as a NoSQL storage option.

· Hands-on with AWS data migration from on-premises SQL Server databases to Amazon RDS and EMR Hive.

· Optimized Hive analytics SQL queries, created tables/views, and wrote custom queries and Hive-based exception processes.

· Implemented a Cloudera Hadoop distribution cluster on AWS EC2.

· Deployed the Big Data Hadoop application using Talend on AWS Cloud.

· Utilized AWS Redshift to store terabytes of data in the cloud.

· Used Spark SQL and the DataFrames API to load structured and semi-structured data into Spark.

· Used Spark-SQL and Hive Query Language (HQL) for obtaining client insights.

· Ingested large data streams from company REST APIs into an EMR cluster through AWS Kinesis.

· Developed consumer intelligence reports based on market research, data analytics, and social media.

· Implemented Spark using Scala and SparkSQL for faster testing and processing of data.

· Automated AWS components such as EC2 instances, security groups, ELB, RDS, Lambda, and IAM through AWS CloudFormation templates.

· Joined, manipulated, and drew actionable insights from large data sources using Python and SQL.
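As referenced in the explode bullet above, here is a minimal sketch of the managed-Kafka-to-Spark step: parse JSON messages from a Kafka topic and flatten a nested array with explode. The Kafka source package is assumed to be on the classpath, and the broker address, topic name, and message schema are hypothetical.

```python
# Minimal sketch: consume a Kafka stream, parse JSON values, flatten a nested array.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import ArrayType, DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("msk-explode-demo").getOrCreate()

# Hypothetical message layout: one account with a list of transactions.
schema = StructType([
    StructField("account_id", StringType()),
    StructField("transactions", ArrayType(StructType([
        StructField("txn_id", StringType()),
        StructField("amount", DoubleType()),
    ]))),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "msk-broker:9092")  # placeholder broker
       .option("subscribe", "api-events")                      # hypothetical topic
       .load())

flat = (raw.select(from_json(col("value").cast("string"), schema).alias("m"))
        .select("m.account_id", explode("m.transactions").alias("txn"))
        .select("account_id", "txn.txn_id", "txn.amount"))

# Console sink keeps the example self-contained; a real job would write to a table.
flat.writeStream.format("console").outputMode("append").start().awaitTermination()
```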

Big Data Engineer

TD Ameritrade - Omaha, NE

January 2019 to August 2020

TD Ameritrade is a broker that offers an electronic trading platform for the trade of financial assets including common stocks, preferred stocks, futures contracts, exchange-traded funds, forex, options, cryptocurrency, mutual funds, fixed income investments, margin lending, and cash management services.

· Created a PySpark streaming job to receive real-time data from Kafka.

· Defined Spark data schema and set up development environment inside the cluster.

· Processed data with natural language toolkit to count important words and generated word clouds.

· Created a pipeline to gather data using PySpark, Kafka, and HBase.

· Used Spark Streaming to receive real-time data using Kafka.

· Worked with unstructured data and parsed out the information using Python built-in functions.

· Configured a Python API producer file to ingest data from the Slack API using Kafka for real-time processing with Spark (see the producer sketch after this list).

· Developed Spark programs using Python to run in the EMR clusters.

· Ingested image responses with a Kafka producer written in Python.

· Started and configured master and slave nodes for Spark.

· Designed Spark Python jobs to consume information from S3 Buckets using Boto3.

· Set up cloud compute engines in managed and unmanaged modes and handled SSH key management.

· Worked on virtual machines to run pipelines on a distributed system.

· Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.

· Wrote simple SQL scripts on the final database to prepare data for visualization with Tableau.

· Used Spark Streaming as a Kafka consumer to process consumer data.

· Wrote Spark SQL to create and read Cassandra tables.

· Wrote streaming data into Cassandra tables with Spark Structured Streaming.

· Wrote Bash script to gather cluster information for Spark submits.

· Developed Spark UDFs using Scala for better performance.

· Managed Hive connection with tables, databases, and external tables.

· Installed Hadoop using Terminal and set the configurations.

· Formatted the response from Spark jobs to data frames using a schema containing News Type, Article Type, Word Count, and News Snippet to parse JSONs.

· Interacted with data residing in HDFS using PySpark to process data.

· Configured Linux on multiple Hadoop environments setting up Dev, Test, and Prod clusters within the same configuration.

· Set up HDFS monitoring of job status and DataNode health according to the specs.
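The Slack API bullet above points to this sketch: a minimal Python Kafka producer that polls a REST endpoint and publishes each JSON record to a topic for downstream Spark consumers. The choice of the kafka-python and requests libraries, plus the broker address, endpoint, token, channel, and topic name, are assumptions for illustration only.

```python
# Minimal sketch: poll a REST API and publish each JSON record to a Kafka topic.
import json
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],                      # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # JSON-encode messages
)

API_URL = "https://slack.com/api/conversations.history"        # illustrative endpoint
HEADERS = {"Authorization": "Bearer <token>"}                   # placeholder token

while True:
    resp = requests.get(API_URL, headers=HEADERS, params={"channel": "C123", "limit": 100})
    resp.raise_for_status()
    for message in resp.json().get("messages", []):
        # One Kafka record per API message; downstream Spark jobs consume this topic.
        producer.send("slack-messages", value=message)
    producer.flush()
    time.sleep(30)  # simple polling interval; production code would track cursors
```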

Hadoop Engineer

Ameren Corporation - Saint Louis, MO

September 2017 to December 2018

Ameren Corporation is an American power company.

· Configured Kafka Producer with API endpoints using JDBC Autonomous REST Connectors.

· Configured a multi-node cluster of 10 Nodes and 30 brokers for consuming high volume, high-velocity data.

· Implemented a parser, query planner, query optimizer, and native query execution using replicated logs combined with indexes, supporting full relational Kibana Query Language (KQL) queries, including joins.

· Developed distributed query agents for performing distributed queries against shards.

· Wrote Producer/Consumer scripts to process JSON response in Python.

· Developed JDBC/ODBC connectors between Hive and Spark for the transfer of the newly populated data frame.

· Developed scripts for collecting high-frequency log data from various sources and integrating it into HDFS using Flume; staging data in HDFS for further analysis.

· Wrote complex queries against the API data in Apache Hive on the Hortonworks Sandbox.

· Utilized HiveQL to query the data to discover trends from week to week.

· Configured and deployed production-ready multi-node Hadoop services (Hive, Sqoop, Flume, Oozie) on the Hadoop cluster with the latest patches.

· Created Hive queries to summarize and aggregate business data by comparing Hadoop data with historical metrics (see the sketch at the end of this section).

· Worked on various real-time and batch processing applications using Spark/Scala, Kafka and Cassandra.

· Loaded ingested data into Hive managed and external tables.

· Wrote custom user-defined functions (UDFs) for complex Hive queries (HQL).

· Built Hive views on top of the source data tables and built a secured provisioning layer.

· Used Cloudera Manager for installation and management of single-node and multi-node Hadoop cluster.

· Performed upgrades, patches, and bug fixes in Hadoop in a cluster environment.

· Wrote shell scripts for automating the process of data loading.

· Evaluated and proposed new tools and technologies to meet the needs of the organization.
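As a sketch of the week-over-week Hive aggregation referenced above, the snippet below runs a HiveQL-style query from PySpark; the tiny in-memory view stands in for a real Hive table, and the view, column, and metric names are placeholders. With enableHiveSupport(), the same spark.sql() call would run against Hive managed or external tables.

```python
# Minimal sketch: week-over-week aggregation with a HiveQL-style query in PySpark.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("weekly-trends")
         .enableHiveSupport()   # lets spark.sql() reach Hive managed/external tables
         .getOrCreate())

# Tiny in-memory stand-in for a Hive table so the sketch is self-contained.
spark.createDataFrame(
    [("2018-11-05", 120.0), ("2018-11-06", 80.0), ("2018-11-13", 95.5)],
    ["event_date", "amount"],
).createOrReplaceTempView("transactions")

weekly = spark.sql("""
    SELECT weekofyear(to_date(event_date)) AS week,
           count(*)                        AS events,
           sum(amount)                     AS total_amount
    FROM   transactions
    GROUP  BY weekofyear(to_date(event_date))
    ORDER  BY week
""")
weekly.show()
```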

Hadoop Administrator

Mars, Incorporated - Tacoma, WA

April 2016 to September 2017

Mars, Incorporated is an American multinational manufacturer of confectionery, pet food, and other food products and a provider of animal care services.

· Completed regular backups and cleared logs from HDFS space.

· Used Impala where possible to achieve faster results than Hive during data analysis.

· Implemented workflows using Apache Oozie framework to automate tasks.

· Implemented data ingestion and cluster handling in real time processing using Kafka.

· Designed and presented a POC on introducing Impala in project architecture.

· Implemented YARN Resource pools to share resources of cluster for YARN jobs submitted by users.

· Scripted the requirements using BigSQL and provided time statistics of running jobs.

· Performed code reviews of simple to complex Map/Reduce Jobs using Hive and Pig.

· Implemented cluster monitoring using the IBM InfoSphere BigInsights tooling.

· Imported data from various data sources and parsed into structured data.

· Analyzed data by performing Hive queries and ran Pig scripts to study customer behavior.

· Imported and exported data using Sqoop between HDFS and RDBMS.

· Optimized data storage in Hive using partitioning and bucketing mechanisms on both managed and external tables (see the layout sketch after this list).

· Performed import and export of dataset transfer between traditional databases and HDFS using Sqoop.

· Wrote shell scripts for time-bound command execution.

· Commissioned and decommissioned DataNodes and was involved in NameNode maintenance.

· Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades.

· Edited and configured HDFS and tracker parameters.
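The partitioning and bucketing bullet above references the following sketch, which materializes a table partitioned by ingest date and bucketed by a customer key using Spark's native partitionBy/bucketBy on a datasource table (a close analogue to, not an exact reproduction of, Hive-native bucketing). The database, table, column names, and bucket count are illustrative assumptions.

```python
# Minimal sketch: partitioned and bucketed table layout written from PySpark.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partitioned-bucketed-layout")
         .enableHiveSupport()        # register the table in the Hive metastore
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS warehouse")  # hypothetical database

# Hypothetical staging data; in practice this would come from Sqoop imports or HDFS files.
staging = spark.createDataFrame(
    [("o1", "c42", 19.99, "2017-06-01"), ("o2", "c17", 5.25, "2017-06-02")],
    ["order_id", "customer_id", "amount", "ingest_date"],
)

(staging.write
 .partitionBy("ingest_date")            # one directory per ingest date
 .bucketBy(32, "customer_id")           # 32 buckets to speed up joins and sampling
 .sortBy("customer_id")
 .format("orc")
 .mode("overwrite")
 .saveAsTable("warehouse.orders"))      # hypothetical database.table
```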

Education

BACHELOR in INFORMATICS ENGINEERING

INSTITUTO TECNOLOGICO JOSE MARIO MOLINA PASQUEL Y HENRIQUEZ

Skills

• SCRIPTING Python, Unix Shell Scripting

• HADOOP DISTRIBUTIONS Cloudera, Hortonworks

• PROGRAMMING Java

• Python

• Scala

• SOFTWARE DEVELOPMENT Agile

• Continuous Integration

• Test-Driven Development

• Unit Testing

• Functional Testing

• Gradle

• Git

• GitHub

• SVN

• Jenkins

• Jira

• DEVELOPMENT ENVIRONMENTS Eclipse

• IntelliJ

• PyCharm

• Visual Studio

• Atom

• AMAZON CLOUD Amazon AWS (EMR, EC2, S3, SQL, DynamoDB, Cassandra, Redshift, CloudFormation)

• Cassandra

• HBase

• Mongo

• SQL: SQL

• MySQL

• PostgreSQL

• QUERY/SEARCH SQL

• HiveQL

• Apache SOLR

• Kibana

• Elasticsearch

• BIG DATA COMPUTE Apache Spark

• Spark Streaming

• SparkSQL

• MISC: Hive

• Yarn

• Spark

• Spark Streaming

• Kafka

• Flink

• Kibana

• Tableau

• PowerBI

• Grafana


