Phone: 305-***-****
Email: ***************************@*****.***
Professional Summary
Big Data Engineer with 7+ years of experience in Big Data technologies and 10+ years of overall IT experience. Expertise in Hadoop/Spark using cloud services, automation tools, and software design processes.
Good knowledge of installation, configuration, management, and deployment of Big Data solutions and the underlying infrastructure of Hadoop clusters and cloud services.
Hands-on experience with MapReduce programming and the HDFS framework, including Hive, Spark, Pig, Oozie, and Sqoop.
In-depth understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce concepts.
Hands-on experience with YARN architecture and its components, such as the ResourceManager, NodeManager, and ApplicationMaster, and with setting up standards and processes for Hadoop-based application design and implementation.
Well versed in analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Python; extend Hive and Pig core functionality with custom UDFs.
Expertise in designing, developing, and implementing connectivity products that allow efficient exchange of data between a core database engine and the Hadoop ecosystem.
Skilled in importing and exporting data between HDFS and relational database systems using Sqoop.
Experienced in coding SQL, stored procedures/functions, triggers, and packages on RDBMS platforms.
Hands-on with AWS, GCP, Azure, and Hortonworks/Cloudera environments for handling large semi-structured and unstructured datasets.
Outstanding written, oral, interpersonal, and presentation communication skills; able to perform at a high level, meet deadlines, and adapt to ever-changing priorities.
Technical Skills
IDE:
Jupyter Notebook (formerly IPython Notebook), Eclipse, IntelliJ, PyCharm
PROJECT METHODS:
Agile, Kanban, Scrum, DevOps, Continuous Integration, Test-Driven Development
HADOOP DISTRIBUTIONS:
Hadoop, Cloudera Hadoop, Hortonworks Hadoop
CLOUD PLATFORMS:
Amazon AWS (EC2, S3), Microsoft Azure, Google Cloud Platform, Elastic Cloud
CLOUD SERVICES:
Databricks
CLOUD DATABASE & TOOLS:
Redshift, DynamoDB, Cassandra, Apache HBase, SQL, BigQuery
PROGRAMMING LANGUAGES & FRAMEWORKS:
Java, Python, Scala, Spark, Spark Streaming, PySpark, PyTorch
SCRIPTING:
Hive, MapReduce, SQL, Spark SQL, Shell Scripting
CONTINUOUS INTEGRATION (CI-CD):
Jenkins
VERSIONING:
Git, GitHub
PROGRAMMING METHODOLOGIES:
Object-Oriented Programming, Functional Programming
FILE FORMAT AND COMPRESSION:
CSV, JSON, Avro, Parquet, ORC
FILE SYSTEMS:
HDFS
ETL TOOLS:
Apache Nifi, Flume, Kafka, Talend, Pentaho, Sqoop
DATA VISUALIZATION TOOLS:
Tableau, Kibana
SEARCH TOOLS:
Elasticsearch
SECURITY:
Kerberos, Ranger
AWS:
AWS Lambda, AWS S3, AWS RDS, AWS EMR, AWS Redshift, AWS Kinesis, AWS ELK, AWS CloudFormation, AWS IAM
Data Query:
Spark SQL, DataFrames
Professional Experience
July 2021 - Current
Big Data Engineer/ Developer
Telcel America – Miami, FL
Ingested data into AWS S3 data lake from various devices using AWS Kinesis.
Implemented Amazon EMR to process Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with Amazon Simple Storage Service (S3) and AWS Redshift.
Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3.
Created AWS Lambda functions in Python using the boto3 module (a brief sketch follows this role's bullets).
Migrated SQL database to Azure Data Lake.
Used AWS CloudFormation templates to create a custom infrastructure for our pipelines.
Decoded raw data from JSON and streamed it using the Kafka producer API.
Integrated Kafka with Spark Streaming for real-time data processing using DStreams (sketched after this role's bullets).
Implemented AWS IAM user roles and policies to authenticate and control access.
Specified nodes and performed data analysis queries on Amazon Redshift clusters and via AWS Athena.
Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift.
Created POCs on Microsoft Azure using Azure Blob Storage and Azure Databricks.
Automated, configured, and deployed instances on AWS, Azure, and Google Cloud Platform (GCP) environments.
Developed Data Pipeline with Kafka and Spark.
Migrated an Oracle-based ETL pipeline to one based on a Hadoop cluster, using the Big Data tech stack built on top of Hadoop, especially Hive.
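A minimal sketch of the Lambda pattern above, assuming a hypothetical S3-triggered function that records new-object metadata in a DynamoDB table; the table and field names are illustrative, not from the actual project:

    import boto3

    # Clients created outside the handler so Lambda can reuse them across invocations.
    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("device_events")  # hypothetical table name

    def handler(event, context):
        """Triggered by S3 ObjectCreated events; records each new object in DynamoDB."""
        records = event.get("Records", [])
        for record in records:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            head = s3.head_object(Bucket=bucket, Key=key)
            table.put_item(Item={
                "object_key": key,
                "bucket": bucket,
                "size_bytes": head["ContentLength"],
            })
        return {"processed": len(records)}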
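The decode-and-stream flow (Kafka producer plus Spark Streaming DStreams) could look roughly like this, assuming the kafka-python client, the Spark 2.x DStream Kafka integration, a local broker, and a hypothetical topic name; the project's actual code may have differed:

    import json
    from kafka import KafkaProducer
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # Spark 2.x DStream API

    BROKERS = "localhost:9092"   # assumed broker address
    TOPIC = "raw-device-events"  # hypothetical topic name

    # Producer side: decode raw JSON strings and publish them to Kafka.
    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def publish(raw_json):
        producer.send(TOPIC, value=json.loads(raw_json))

    # Consumer side: a DStream that parses each message and counts events per device.
    sc = SparkContext(appName="kafka-dstream-sketch")
    ssc = StreamingContext(sc, batchDuration=10)
    stream = KafkaUtils.createDirectStream(ssc, [TOPIC], {"metadata.broker.list": BROKERS})
    events = stream.map(lambda kv: json.loads(kv[1]))
    counts = events.map(lambda e: (e.get("device_id"), 1)).reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()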
Jan 2019 – Jun 2021
DATA ENGINEER
Simple Movil, Inc – Remote
Processed multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and loaded the results into AWS Redshift.
Wrote scripts to automate workflow execution and the loading and parsing of data into different database systems.
Deployed Hadoop to create data pipelines involving HDFS.
Managed multiple Hadoop clusters to support data warehousing.
Created and maintained Hadoop HDFS, Spark, HBase pipelines.
Supported data science team with data wrangling, data modeling tasks.
Built a continuous Spark Streaming ETL pipeline with Kafka, Spark, Scala, HDFS, and MongoDB.
Scheduled and executed workflows in Oozie to run Hive and Pig jobs.
Worked on importing unstructured data into HDFS using Spark Streaming and Kafka.
Automated cluster creation and management using Google Dataproc and Dataflow as processing tools to create some POCs.
Pulled data from Google BigQuery to S3 buckets using Spark in a batch analytics pipeline.
Developed complex SQL queries and contributed code to these projects; tasks ranged from joins, aggregate functions, and window functions to scheduled jobs performing complex ETL and ad hoc scripts for processing local data (a representative query pattern is sketched after this role's bullets).
Implemented Spark jobs with Spark SQL for optimized analysis and processing of data.
Configured Spark Streaming to process real time data via Kafka and store it to HDFS.
Loaded data from multiple servers to AWS S3 bucket and configured bucket permissions.
Used Scala for concurrency support, which is key to parallelizing the processing of large datasets.
Used Apache Kafka to transform live streaming with batch processing to generate reports.
Developed Oozie and Apache Airflow workflows for scheduling and orchestrating ETL processes.
Implemented partitioning, dynamic partitions, and buckets in Hive for performance optimization and logical grouping of data (see the partitioning sketch after this role's bullets).
Maintained and performed Hadoop log file analysis for errors and access statistics used for fine-tuning.
Optimized data storage in Kafka Brokers within the Kafka cluster by partitioning Kafka Topics.
Responsible for analysis, design, coding, unit and regression testing, and implementation of new requirements.
Produced Visio Diagrams for Process and Execution Flow, Design Documentation and Unit Test cases.
Built a generic tool to import data from flat files and various other database engines using Podium.
Created Sqoop jobs and commands to pull data from databases and manipulated the data using Hive queries.
Developed data ingestion code using Sqoop and performed the ETL and processing phases using Apache Hive and Spark (PySpark and Spark SQL scripting).
Performed orchestration using Apache Oozie workflows and coordinators to schedule and run workflows and packages containing Shell, Sqoop, Spark, Hive, and Email actions.
Built a generic tool for data validation and automated reporting.
Performed recursive deployments through Jenkins and maintained code in a Git repository.
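As a representative example of the query work described above, here is a rough PySpark sketch combining a join, an aggregate, and a window function; the table and column names are illustrative only:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.appName("etl-query-sketch").getOrCreate()

    # Hypothetical inputs: orders joined to customers, then summarized and ranked.
    orders = spark.table("analytics.orders")
    customers = spark.table("analytics.customers")

    enriched = orders.join(customers, on="customer_id", how="left")

    # Aggregate: order count and spend per customer.
    summary = enriched.groupBy("customer_id").agg(
        F.count("*").alias("order_count"),
        F.sum("amount").alias("total_amount"),
    )

    # Window: keep each customer's most recent order.
    w = Window.partitionBy("customer_id").orderBy(F.col("order_date").desc())
    latest = (enriched.withColumn("rn", F.row_number().over(w))
                      .filter(F.col("rn") == 1)
                      .drop("rn"))

    summary.write.mode("overwrite").saveAsTable("analytics.customer_order_summary")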
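A sketch of the Hive partitioning and bucketing approach referenced above, expressed through Spark SQL with Hive support; the table names, columns, and bucket count are illustrative assumptions:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Allow dynamic partitions so each row lands in the partition named by its date.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS events_by_day (
            event_id STRING,
            device_id STRING,
            payload STRING
        )
        PARTITIONED BY (event_date STRING)
        STORED AS ORC
    """)

    spark.sql("""
        INSERT OVERWRITE TABLE events_by_day PARTITION (event_date)
        SELECT event_id, device_id, payload, to_date(event_ts) AS event_date
        FROM raw_events
    """)

    # Bucketing expressed on the Spark side via the DataFrame writer.
    (spark.table("raw_events")
          .write.mode("overwrite")
          .bucketBy(32, "device_id")
          .sortBy("device_id")
          .saveAsTable("events_bucketed"))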
Nov 2016 - Dec 2018
DATA ENGINEER
Nielsen Holdings - New York, NY
Nielsen Holdings plc is an American information, data, and market measurement firm. Nielsen operates in over 100 countries and employs approximately 44,000 people worldwide.
Led the GCP data team ingesting data from disparate sources such as data warehouses, RDBMS, and JSON files into the GCP Data Lake; set up and configured Dataproc, Dataflow, and BigQuery to process batch data and perform analytics that fed machine learning models in production.
Designed a POC prototyping an internal messaging system with GCP Pub/Sub, moving semi-structured data through Dataflow into GCP Cloud Storage and Bigtable as targets (a publish sketch follows this role's bullets).
Ingested data from multiple sources into HDFS Data Lake.
Analyzed, designed, coded, performed unit and regression testing, and implemented new requirements.
Created Visio Diagrams for Process and Execution Flow, Design Documentation and Unit Test cases.
Built generic tool to import data from Flat files and from various other database engines.
Created Sqoop Jobs and Commands to pull data from database and manipulate the data using Hive Queries.
Programmed data ingestion code using Sqoop and performed the ETL and processing phases using Apache Hive and Spark (PySpark and Spark SQL scripting).
Performed orchestration using Apache Oozie jobs and coordinators to schedule and run workflows and packages containing Shell, Sqoop, Spark, Hive, and Email actions.
Orchestrated workflows in Apache Airflow to run ETL pipelines using tools in AWS (an illustrative DAG follows this role's bullets).
Managed onsite and offshore resources by assigning tasks using Agile methodology and retrieved updates through Scrum calls and Sprint meetings.
Developed Spark scripts by using Scala shell commands as per the requirement and used PySpark for proof of concept.
Installed and configured Hadoop MapReduce, HDFS, and developed multiple MapReduce jobs in Python for data cleaning and preprocessing.
Loaded customer profile data, customer spending data, and credit data from legacy warehouses onto HDFS using Sqoop.
Built data pipeline using Pig and Hadoop commands to store onto HDFS.
Used Oozie to orchestrate the MapReduce jobs that extracted data on a scheduled basis.
Migrated the ETL code base from Hive and Pig to PySpark (an approximate sketch of one migrated step follows this role's bullets).
Developed and deployed spark-submit commands with suitable executor, core, and driver settings for the cluster.
Applied transformations and filtered traffic using Pig.
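A minimal publish sketch for the Pub/Sub POC mentioned above, using the google-cloud-pubsub client; the project and topic names are placeholders:

    import json
    from google.cloud import pubsub_v1

    PROJECT_ID = "example-gcp-project"  # placeholder project
    TOPIC_ID = "ingest-events"          # placeholder topic

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

    def publish_event(event):
        """Publish one semi-structured record; downstream subscribers (e.g. Dataflow)
        write it on to Cloud Storage and Bigtable."""
        data = json.dumps(event).encode("utf-8")
        future = publisher.publish(topic_path, data)
        future.result(timeout=30)  # wait for the broker's acknowledgement

    publish_event({"device_id": "abc-123", "status": "active"})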
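An illustrative Airflow DAG in the spirit of the orchestration work above, assuming Airflow 2.x; the task commands, paths, and bucket name are placeholders rather than the project's real jobs:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
    }

    with DAG(
        dag_id="daily_etl_sketch",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        default_args=default_args,
        catchup=False,
    ) as dag:
        # Extract raw files from S3, transform with Spark, load into Redshift.
        extract = BashOperator(
            task_id="extract_from_s3",
            bash_command="aws s3 cp s3://example-bucket/raw/ /tmp/raw/ --recursive",
        )
        transform = BashOperator(
            task_id="transform_with_spark",
            bash_command="spark-submit /opt/jobs/transform.py /tmp/raw/ /tmp/clean/",
        )
        load = BashOperator(
            task_id="load_to_redshift",
            bash_command="python /opt/jobs/load_to_redshift.py /tmp/clean/",
        )
        extract >> transform >> load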
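An approximate PySpark version of a typical step migrated from Hive/Pig (read, filter bad rows, aggregate, write Parquet); paths and column names are illustrative only:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("hive-to-pyspark-sketch").getOrCreate()

    raw = spark.read.option("header", True).csv("hdfs:///data/raw/transactions/")

    # Drop rows with missing or non-positive amounts (the old Pig FILTER step).
    clean = raw.filter(F.col("amount").isNotNull() &
                       (F.col("amount").cast("double") > 0))

    # Daily totals per branch (the old Hive GROUP BY step).
    daily_totals = (clean.groupBy("transaction_date", "branch_id")
                         .agg(F.sum(F.col("amount").cast("double")).alias("total_amount"),
                              F.count("*").alias("txn_count")))

    daily_totals.write.mode("overwrite").parquet("hdfs:///data/curated/daily_totals/")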
Oct 2014 - Sep 2016
DATABASE DEVELOPER-PL/SQL
SmartByte Solutions, Palatine, IL
Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
Led projects for database upgrades and updates.
Handled ETL and data analysis involving billions of records.
Captured requirements with the management team to design reports for decision-making.
Performed predictive user analysis for advertising campaigns.
Proposed solutions by analyzing messy data.
Apr 2013 - Aug 2014
SOFTWARE ENGINEER
Fulton Bank, Lancaster, PA
Fulton Bank, a part of Fulton Financial Corporation, is a full-service commercial bank headquartered in Lancaster, PA, operating 250 branches and specialty offices and almost 300 ATMs. It offers a suite of products, including lending, mortgage, investment management and trust services, and business solutions, to communities in Pennsylvania, Maryland, Delaware, Virginia, and New Jersey, along with the technology to bank anytime, anywhere.
Worked on risk management applications and data processing applications for the banking industry.
Developed and maintained code that integrated software components into a fully functional software system.
Developed software verification plans, test procedures and test environments.
Executed test procedures and documented test results to ensure software system requirements were met.
Worked with business operations stakeholders to identify business metrics, then produced applications to optimize business operations and help the bank meet defined key performance indicator (KPI) metrics.
Troubleshot problems and provided customer support on application issues, including troubleshooting complex application issues/problems.
Wrote SQL server scripts.
Interfaced with vendors about technical support issues.
Interfaced directly with customer technical personnel to support and service installed systems.
Analyzed requirements and developed Software Requirements Specifications for new and re-engineered systems.
Used UNIX shell scripts to alter the build method and to perform routine jobs such as file transfers between different hosts.
Documented technical specifications, data flows, data models, and class models.
Documented requirements gathered from stakeholders.
Configured and installed Red Hat and CentOS Linux servers on virtual machines and bare-metal installations.
Performed kernel and database configuration optimization such as I/O resource usage on disks.
Created and modified users and groups with root permissions.
Administered local and remote servers using SSH on a daily basis.
Education
Bachelor's Degree: Computer Systems Engineering, 2013
La Universidad Tecnologica de Mexico