
Big Data Engineer

Location:
San Antonio, TX
Posted:
March 20, 2023


Resume:

Correa Ochoa Sergio Jeffrey

Big Data, Hadoop, Cloud Engineer

(914) 930 – 6280

advf4b@r.postjobfree.com

Resume Summary

IT professional with 7+ years of experience in Big Data development and data wrangling. Passionate about Big Data technologies and delivering effective solutions through creative problem solving, with a track record of building large-scale systems.

Experienced Data Engineer who mentors team members and serves as the team's liaison with stakeholders, business units, and data scientists/analysts, ensuring smooth collaboration across teams.

Accustomed to working in production environments on migrations, installations, and development.

Work with Data Lakes and Big Data ecosystems like HDFS, Hadoop, Spark, PySpark, Hortonworks, Cloudera, Hive, Sqoop, HBase, Flume, Airflow, and Zookeeper.

Experience driving application enhancements and change initiatives.

Add value to Agile/Scrum processes such as Sprint Planning, Backlog, Sprint Retrospective, and requirements gathering, and provide planning and documentation for projects.

Write SQL queries, stored procedures, triggers, views, cursors, and packages.

Create Spark Core ETL processes and automate them using a workflow scheduler.

Use Apache Hadoop to work with Big Data and analyze large data sets efficiently.

Experienced with databases and data warehouses like Snowflake.

Write SQL, PL/SQL for creating tables, views, indexes, stored procedures, and functions.

Track record of results in an Agile methodology using data-driven analytics.

Experience importing and exporting terabytes of data between HDFS and Relational Database Systems using Sqoop, MySQL, Snowflake.

Load and transform large sets of structured, semi-structured, and unstructured data, working with data on Amazon Redshift, Apache Cassandra, and HDFS in a Hadoop Data Lake.

Experience handling XML files as well as Avro and Parquet SerDes.

Performance tuning at the source, target, and DataStage job levels using indexes, hints, and partitioning in DB2 and Oracle.

Skilled with BI tools like Tableau and Power BI, data interpretation, modeling, data analysis, and reporting with the ability to assist in directing planning based on insights.

Technical Skills

Programming & Scripting - Spark, Python, Java, Scala, HiveQL, Kafka, SQL, Shell scripting

Databases - Cassandra, HBase, Amazon Redshift, DynamoDB, MongoDB, MS Access, MySQL, Oracle, PL/SQL, SQL, RDBMS

Database Skills - Database partitioning, database optimization, building communication channels between structured and unstructured databases; Snowflake, Snowpipe.

Batch & Stream Processing - Apache Hadoop, Spark Streaming

Data Stores - Data Lake, Data Warehouse, SQL Database, RDBMS, NoSQL Database, Amazon Redshift, Apache Cassandra, MongoDB, SQL, MySQL, Postgres, and more

Search Tools - Elasticsearch, Kibana, Apache SOLR

Data Pipelines/ETL – Flume, Airflow, Snowflake, Apache Spark, Nifi, Apache Kafka

Platforms & Cloud - Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Azure Data Factory, GCP

Professional Experience

Sr. Big Data Engineer - PepsiCo, Harrison, New York, July 2022 – Present

(PepsiCo, founded in 1965 and headquartered in Purchase, New York, is a food and beverage manufacturer best known for its name-brand soft drink, Pepsi)

Analyzed large data sets to determine the optimal way to aggregate and report on them.

Designed and developed a BI web application for performance analytics.

Designed Python-based notebooks for automated weekly, monthly, and quarterly reporting ETL.

Designed the backend database and AWS cloud infrastructure for maintaining company proprietary data.

Installed and configured Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.

Developed and implemented data pipelines using AWS services such as Kinesis, S3, EMR, Athena, Redshift to process petabyte-scale data in real time.
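
For illustration only, a minimal Python (boto3) sketch of how records could be pushed into a Kinesis stream at the front of such a pipeline; the stream name, region, and event fields are hypothetical and not taken from the project above.

    import json
    import boto3

    # Hypothetical stream and region; downstream, records would land in S3
    # and be queried through Athena/Redshift as described above.
    kinesis = boto3.client("kinesis", region_name="us-east-1")

    def publish_event(event: dict) -> None:
        """Push one JSON event onto the stream, keyed by a hypothetical device id."""
        kinesis.put_record(
            StreamName="sales-events",
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=str(event.get("device_id", "unknown")),
        )

    publish_event({"device_id": 42, "amount": 3.75})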

Installed, configured, and monitored Apache Airflow cluster.

Used Sqoop to import data from Oracle to Hadoop.

Used the Oozie workflow engine to manage interdependent Hadoop jobs and automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.

Produced scripts for doing transformations using Scala.

Developed Spark code for Spark SQL/Streaming in Scala and PySpark, creating DataFrames, RDDs, and Datasets from different sources and file formats such as Avro, Parquet, JSON, ORC, and CSV.
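
As a rough sketch of that kind of multi-format ingestion, shown here in PySpark; the HDFS paths are hypothetical placeholders, and the Avro reader assumes the spark-avro package is available.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multi_format_ingest").getOrCreate()

    # Read the same logical dataset from several file formats (hypothetical paths)
    orders_parquet = spark.read.parquet("hdfs:///data/orders/parquet/")
    orders_json = spark.read.json("hdfs:///data/orders/json/")
    orders_orc = spark.read.orc("hdfs:///data/orders/orc/")
    orders_csv = spark.read.option("header", "true").csv("hdfs:///data/orders/csv/")
    orders_avro = spark.read.format("avro").load("hdfs:///data/orders/avro/")  # needs spark-avro

    # Drop to the RDD API when lower-level transformations are required
    orders_rdd = orders_parquet.rdd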

Developed and implemented Hive custom UDFs involving date functions.

Configured AWS S3 to receive and store data from the resulting PySpark scripts in AWS Glue jobs as ETL services.

Developed a Java web crawler to scrape market data for internal products.

Wrote Shell scripts to orchestrate execution of other scripts and move the data files within and outside of HDFS.

Wrote simple SQL scripts on the final database to prepare data for visualization with Tableau.

Developed DAG data pipelines for onboarding and change management of datasets.

Migrated various Hive UDFs and queries into Spark SQL for faster requests.

Created a benchmark between Hive and HBase for fast ingestion.

Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to HDFS using Scala.
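
The original implementation was in Scala; below is an equivalent PySpark Structured Streaming sketch with a hypothetical broker, topic, and HDFS paths, assuming the spark-sql-kafka connector is on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka_to_hdfs").getOrCreate()

    # Subscribe to a Kafka topic (hypothetical broker and topic names)
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "events")
              .load())

    # Kafka keys/values arrive as bytes; cast to strings before persisting
    events = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    # Continuously append the stream to HDFS as Parquet, with checkpointing
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/streams/events/")
             .option("checkpointLocation", "hdfs:///checkpoints/events/")
             .start())
    query.awaitTermination()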

Hands-on experience in Spark and Spark Streaming, creating RDDs.

Scheduled jobs using Control-M.

Used Kafka for publish-subscribe messaging as a distributed commit log.

Created Airflow Scheduling scripts in Python to automate data pipeline and data transfer.
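
A minimal sketch of such an Airflow scheduling script (Airflow 2.x style); the DAG id, schedule, and script paths are hypothetical.

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {"retries": 1, "retry_delay": timedelta(minutes=5)}

    # Hypothetical daily pipeline: extract a file, then transfer/load it downstream
    with DAG(
        dag_id="daily_data_transfer",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        default_args=default_args,
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="python /opt/jobs/extract.py")
        load = BashOperator(task_id="load", bash_command="python /opt/jobs/load.py")

        extract >> load  # extract must finish before load starts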

Orchestrated Airflow/workflow in hybrid cloud environment from local on-premises server to the cloud.

Wrote Shell FTP scripts for migrating data to Amazon S3.

Configured AWS Lambda for triggering parallel Cron jobs scheduler for scraping and transforming data.

Used Cloudera Manager for installation and management of a multi-node Hadoop cluster.

Programmed Flume and HiveQL scripts to extract, transform, and load the data into the database.

Implemented AWS fully managed Kafka streaming (Amazon MSK) to send data streams from company APIs to a Spark cluster in Databricks on AWS.

Big Data Engineer - Citi Bank, Century City, California May 2021 – June 2022

(It is a banking service provider. The bank offers savings, debit, insurance, exchange-rate services, and more.)

This was a strong hands-on Data Engineer role within the Big Data program, working on existing applications and frameworks and developing new models and frameworks to meet business use cases as the data strategy developed further.

Developed PySpark application as ETL job to read data from various file system sources, apply transformations, and write to NoSQL database (Cassandra).
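
As an illustrative sketch of that ETL pattern (file source, transformation, Cassandra sink) using the spark-cassandra-connector; the host, paths, keyspace, table, and column names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("files_to_cassandra")
             .config("spark.cassandra.connection.host", "cassandra-host")
             .getOrCreate())

    # Read raw files from the landing zone (hypothetical path and schema)
    raw = spark.read.option("header", "true").csv("hdfs:///landing/transactions/")

    # Example transformation: normalize a column name and drop incomplete rows
    cleaned = (raw.withColumnRenamed("Txn_Id", "txn_id")
                  .filter(F.col("amount").isNotNull()))

    # Append the result to a Cassandra table
    (cleaned.write
            .format("org.apache.spark.sql.cassandra")
            .options(table="transactions", keyspace="analytics")
            .mode("append")
            .save())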

Installed and configured Kafka cluster and monitored the cluster.

Architected a lightweight Kafka broker and integrated Kafka with Spark for real-time data processing.

Wrote Unit tests for all code using PyTest for Python.

Used Python Boto3 for developing Lambda functions in AWS.
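
For illustration, a minimal Lambda handler using Boto3; the S3 event trigger and the processed/ prefix are hypothetical, not details from the engagement above.

    import json
    import boto3

    s3 = boto3.client("s3")  # create the client once, outside the handler

    def lambda_handler(event, context):
        """Hypothetical handler: copy each incoming S3 object to a processed/ prefix."""
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            s3.copy_object(
                Bucket=bucket,
                Key=f"processed/{key}",
                CopySource={"Bucket": bucket, "Key": key},
            )
        return {"statusCode": 200, "body": json.dumps("done")}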

Used the Pandas library and Spark in Python for data cleansing, validation, processing, and analysis.

Created Hive external tables and designed data models in Apache Hive.

Developed multiple Spark Streaming and batch Spark jobs using Python on AWS.

Implemented advanced feature-engineering procedures for the data science team using the in-memory computing capabilities of Apache Spark in Python.

Implemented Rack Awareness in the Production Environment.

Wrote user-defined functions (UDFs) to apply custom business logic to datasets using PySpark and Databricks.
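
A small PySpark sketch of the UDF pattern; the business rule, column names, and thresholds are made up for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf_example").getOrCreate()

    # Hypothetical business rule: bucket customers by lifetime spend
    def spend_tier(amount):
        if amount is None:
            return "unknown"
        return "gold" if amount >= 10000 else "standard"

    spend_tier_udf = F.udf(spend_tier, StringType())

    customers = spark.createDataFrame(
        [("c1", 15000.0), ("c2", 320.0)], ["customer_id", "lifetime_spend"]
    )
    customers.withColumn("tier", spend_tier_udf("lifetime_spend")).show()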

Collected data from REST APIs by building HTTPS connections to client servers, sending GET requests, and collecting responses through a Kafka producer.
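
A hedged sketch of that collection flow using the requests library and the kafka-python client; the endpoint URL, topic name, and broker address are placeholders.

    import json
    import requests
    from kafka import KafkaProducer  # kafka-python client

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Send a GET request over HTTPS and publish each record to a Kafka topic
    response = requests.get("https://api.example.com/v1/quotes", timeout=10)
    response.raise_for_status()

    for quote in response.json():
        producer.send("market-quotes", value=quote)

    producer.flush()  # ensure everything is delivered before exiting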

Imported data from web services into HDFS and transformed data using Spark.

Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.

Used Spark SQL for creating and populating HBase warehouse.

Automated AWS components like EC2 instances, security groups, ELB, RDS, EMR, Glue jobs, crawlers, Lambda, and IAM through AWS CloudFormation templates.

Worked with Spark Context, Spark SQL, DataFrames, and Pair RDDs.

Ingested data through AWS Kinesis Data Stream and Firehose from various sources to S3.

Extracted data from different databases and scheduled Oozie workflows to execute the tasks daily.

Worked with Amazon Web Services (AWS) and involved in ETL, Data Integration, and Migration.

Documented the requirements, including the available code to be implemented using Spark, Amazon DynamoDB, Redshift, and Elasticsearch.

Imported and exported data into HDFS and Hive using Sqoop.

Worked on AWS Kinesis for processing huge amounts of real-time data.

Big Data Developer - LG Electronics, Englewood Cliffs, NJ December 2019 – April 2021

(LG Electronics is a company that provides electronic solutions. It operates through four business divisions: Home Appliance & Air Solutions (H&A), Home Entertainment (HE), Vehicle Component Solutions (VS), and Business Solutions (BS))

Used a REST API to extract real-time financial data on Bitcoin and alt-coin prices every minute.

Set up and implemented Kafka brokers to write data to topics and utilize its fault tolerance mechanism.

Applied Microsoft Azure Cloud Services (PaaS & IaaS), Storage, Active Directory, Application Insights, Internet of Things (IoT), Azure Search, Key Vault, Azure Databricks, Visual Studio Online (VSO) and SQL Azure.

Created and managed topics inside Kafka.

Loaded transformed data into several AWS database and data warehouse services using Spark and JDBC connectors.

Configured a full Kafka cluster with a multi-broker setup for high availability.

Used Spark Streaming to consume from Kafka topics and transform the data processed.

Helped apply business analytic techniques, including clustering and segmentation of existing clients as well as predictive modeling.

Helped to perform statistical analysis using Scikit-learn machine learning library for supervised and unsupervised learning.

Configured the replication factor on topic partitions.

Created a benchmark between Hive and HBase for fast ingestion.

Processed Terabytes of information in real time using Spark Streaming.

Used Spark SQL and the DataFrames API to load structured and semi-structured data into Spark clusters.

Supported Hadoop Cluster infrastructure by performing on-going monitoring, preventative maintenance, and upgrades to the infrastructure.

Used Oozie for coordinating the cluster and programming workflows.

Re-worked data from tables into HDFS and HBase tables.

Wrote Spark UDFs to support the Data Science team.

Wrote Spark SQL queries and optimized the Spark queries in Spark SQL.

Database Developer - Client: Rutland Appliances, VT, Jan 2018 – Nov 2019

Responsible for designing and developing enterprise data solutions on Cloud Platforms integrating Azure services and 3rd party data technologies.

Worked closely with a multidisciplinary agile team to build high-quality data pipelines driving analytic solutions. These solutions generated insights from connected data, enabling the client to advance the data-driven decision-making capabilities of the enterprise.

Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.

Led projects on database upgrades and updates.

Database Developer - Client: Frederick’s Appliance Center, Redmond, WA, Feb 2016 – Dec 2017

Handled ETL and data analysis with billions of records.

Captured requirements with the management team to design reports that support decision-making.

Performed predictive user analysis for advertising campaigns.

Built and customized integration systems using technologies such as SaaS, APIs, and web services. Designed and developed web applications with the C# ASP.NET framework, SQL management, and UI.

Wrote shell scripts for time-bound command execution.

Worked with application teams to install operating system updates, patches, and version upgrades.

Utilized conceptual knowledge of data and analytics, such as dimensional modeling, ETL, reporting tools, data governance, data warehousing, and structured and unstructured data, to solve complex data problems.

Education

Bachelor of Actuarial Sciences

Universidad Autonoma del Estado de Mexico (UAEM)

Courses

Administration of Databases and Preparation of Reports in R: duration of 20 hours taught by V&M Servicios de Consultoría S.C

Data Analytic People STANFORD: Duration of 20 hours by STANFORD UNIVERSITY CALIFORNIA

Python programming: duration of 20 hours by the Autonomous University of the State of Mexico.

Microsoft Power BI desktop: duration of 25 hours by UDEMY

SQL introduction: duration of 20 hours by DATACAMP

Diploma in financial education CONDUSEF: Duration of 180 hours given by CONDUSEF


