
Sr. Big Data Engineer

Location:
Houston, TX
Salary:
$84 on C2C
Posted:
March 22, 2023


Resume:

Eugenio Cano Serrano

Sr. Data Cloud Engineer

Phone: 281-***-**** | Email: advcd0@r.postjobfree.com

Profile

•A strategic professional with 10+ years of experience in Big Data and Hadoop, with programming skills in Python, SQL, and MATLAB

•Well versed in on-premises platforms as well as big data clouds such as AWS and Azure.

•Developed robust ETL pipelines in Databricks and ingested the results into an AWS Redshift data warehouse.

•Experience with Spark Structured Streaming for processing real-time data streams.

•Experience with multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and Redshift.

•Implemented Azure Fully Managed Kafka streaming to send data streams from company APIs to Spark clusters in AWS/Azure Databricks and into Azure Data Lake Storage Gen2.

•Gained experience understanding reporting requirements, including code to be executed with Spark, Hive, HDFS, and Elasticsearch.

•Experience working with both SQL and NoSQL databases.

•Add value to Agile/Scrum processes such as sprint planning, the backlog, sprint retrospectives, and requirements gathering, and provide planning and documentation for projects.

•Worked with Hadoop distributions such as Hortonworks and Cloudera and with Hadoop ecosystem components (HDFS, MapReduce, Hive, HBase, Sqoop, Flume, ZooKeeper, Spark RDDs, DataFrames, Datasets, and Spark SQL)

•Created Spark Core ETL processes and automated them using a workflow scheduler.

•Experience handling XML files as well as Avro and Parquet SerDes

•Performance tuning at the source, target, and DataStage job levels using indexes, hints, and partitioning in DB2 and Oracle

•Used Apache Hadoop to work with big data and analyze large data sets efficiently

•Hands-on experience developing PL/SQL Procedures and Functions and SQL tuning of large databases by creating tables, views, indexes, stored procedures, and functions.

•Proficient in writing SQL queries, stored Procedures, Triggers, Cursors, and Packages

•Track record of results in an Agile methodology using data-driven analytics.

•Skilled at bucketing, partitioning, multithreaded computing, and streaming (Python, PySpark); see the sketch at the end of this section.

•Developed and deployed complex ETL workflows using Apache Airflow and Jenkins to automate data processing and improve efficiency.
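
A minimal PySpark sketch of the bucketing and partitioning skill noted above; the table, column, and S3 path names (events, event_date, user_id, s3://example-bucket/...) are illustrative placeholders rather than details from any specific engagement.

from pyspark.sql import SparkSession

# Hive support is assumed so the bucketed table can be registered in a metastore.
spark = (SparkSession.builder
         .appName("partitioning-demo")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical raw input location.
df = spark.read.parquet("s3://example-bucket/raw/events/")

# Partition by date for pruning, bucket and sort on the join key; bucketBy
# requires saveAsTable, and the target database is assumed to exist.
(df.write
 .mode("overwrite")
 .partitionBy("event_date")
 .bucketBy(32, "user_id")
 .sortBy("user_id")
 .saveAsTable("analytics.events_bucketed"))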

Technical Skills

Programming Languages: Python, Java, Scala, SQL, Shell Scripting

IDEs: Spyder, Jupyter, PyCharm, RStudio, Eclipse.

Visualization Tools: Pentaho, QlikView, Tableau, PowerBI, Matplotlib, Plotly, Dash

Databases: MS Access, MySQL, Oracle (PL/SQL), and other RDBMS

NoSQL Databases: Cassandra, HBase, DynamoDB, MongoDB

Data warehouses: Hive, Redshift, Snowflake.

Data Processing: MapReduce, Spark, Databricks.

Machine Learning Frameworks: TensorFlow, Torch, Keras, Caffe

Python Libraries: NumPy, Pandas, SciPy, Matplotlib, scikit-learn, NLTK, StatsModels, Seaborn, Selenium

Deep Learning: Keras, TensorFlow, PyBrain

Analysis Methods: Unsupervised learning (K-means clustering, hierarchical clustering, centroid clustering, principal component analysis, Gaussian mixture models, singular value decomposition); supervised learning (Naive Bayes, time series analysis, survival analysis, linear regression, logistic regression, ElasticNet regression, multivariate regression)

Hadoop Distributions: Cloudera, Hortonworks

AWS Services: Elasticsearch (ELK), S3, Lambda functions, EMR, Glue jobs, Redshift, Athena, Kinesis

Azure Services: ADLS, Azure Blob Storage, ADF, Synapse, Databricks, Azure Fully Managed Kafka

Automation: Oozie, Airflow.

CICD and Version control: GitHub, Git, Jenkins, SVN.

PROFESSIONAL EXPERIENCE

Big Data Engineer, NRG Energy, TX, Jun 2021 – Present

At NRG Energy, worked as part of the Automation team, reporting to the manager. Developed a process to ensure that new code meets security standards and audit compliance requirements.

•Implemented the technical delivery of project-based ML work

•Created EC2 instances and managed Auto Scaling, snapshot backups, and templates

•Developed ETLs to pull data from various sources and transform it for reporting applications using PL/SQL

•Involved in the complete Big Data flow of the application starting from data ingestion from upstream to HDFS

•Used Impala where possible to achieve faster results than Hive during data analysis.

•Worked on Hadoop streaming jobs to process terabytes of XML format data.

•Real-time/stream processing with Apache Storm and Apache Spark

•Architected and engineered AWS cloud services, including planning, design, and DevOps support such as IAM user, group, role, and policy management

•Proposed a working POC and constructed the roadmap for the prediction pipeline

•Optimized data storage in Kafka Brokers within the Kafka cluster by partitioning Kafka Topics

•Created modules for Apache Airflow to call different AWS services, including EMR, S3, Athena, Glue Crawlers, Lambda functions, and Glue jobs (see the sketch at the end of this section).

•Designed jobs using DB2 UDB, ODBC, .NET, Join, Merge, Lookup, Remove Duplicates, Copy, Filter, Funnel, Dataset, Lookup File Set, Change Data Capture, Modify, Row Merger, Aggregator, Peek, and Row Generator stages

•Involved in loading data from the UNIX file system to HDFS

•Performed CloudFormation scripting, security configuration, and resource automation

•Involved in running Hadoop jobs to process millions of records, with data updated daily/weekly

•Developed Spark Applications by using Scala and Python, and Implemented Apache Spark data processing project to handle data from various RDBMS and streaming sources.

•Developed Spark programs using PySpark.

•Created and maintained the data warehouse in AWS Redshift.

•Fetched live stream data from DB2 into HBase tables using Spark Streaming and Apache Kafka

•Worked with various compression techniques (LZO, Snappy, etc.) to save storage and optimize data transfer over the network

•Wrote shell scripts for automating the process of data loading

•Used ETL to transfer data from the target database into Pentaho and on to the MicroStrategy reporting tool

•Conducted POC for Hadoop and Spark as part of NextGen platform implementation. Implemented a recommendation engine using Scala

•Transferred Streaming data from different data sources into HDFS and HBase using Apache Flume

•Developed the build script for the project using Maven Framework.

•Mapped ingested data to a Databricks Delta Lake table schema and loaded it into tables in a Snowflake data lake.
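
Relating to the Airflow modules bullet above, a minimal sketch of a DAG that triggers a Glue job and a Lambda function through boto3 inside PythonOperator tasks; the DAG id, region, job name, and function name are hypothetical, and Airflow's AWS provider operators could be used instead.

from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def start_glue_job(**_):
    # Start a run of a (hypothetical) Glue job.
    glue = boto3.client("glue", region_name="us-east-1")
    glue.start_job_run(JobName="nightly-transform")


def invoke_lambda(**_):
    # Fire-and-forget invocation of a (hypothetical) Lambda function.
    lam = boto3.client("lambda", region_name="us-east-1")
    lam.invoke(FunctionName="publish-report", InvocationType="Event")


with DAG(
    dag_id="aws_services_pipeline",
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_glue = PythonOperator(task_id="run_glue_job", python_callable=start_glue_job)
    notify = PythonOperator(task_id="invoke_lambda", python_callable=invoke_lambda)
    run_glue >> notify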

Big Data/Hadoop Developer, Nielsen, NY, Oct 2019 – May 2021

Worked at the Nielsen Company helping design market samples for countries in Central America.

•Developed an ingestion pipeline for streaming audience and customer data using APIs and Kafka.

•Performed streaming data ingestion to Spark as a consumer from Kafka

•Used Spark Structured Streaming to process data as a Kafka consumer (see the sketch at the end of this section).

•Implemented Spark using Python and utilized Data Frames and Spark SQL API for faster processing of data

•Executed long-running jobs to preprocess product and warehouse data in Snowflake (staging area and Snowpipes), cleansing and preparing the data before consumption.

•Used Spark Streaming to divide streaming data into batches as input to Spark engine for batch processing

•The data required extensive cleaning and preparation for ML modeling, as some observations were censored without any clear notification

•Created Hive queries to process large sets of structured and semi-structured data.

•Defined the Spark/Python (PySpark) ETL framework and best practices for development.

•Used Spark SQL with Hive context to run queries

•Moved relational database data using Sqoop into Hive dynamic partition tables

•Extracted metadata from Hive tables and supported Data Analysts running Hive queries.

•Stored unprocessed data to HDFS data lake

•Contrived solutions to data mapping for structured and unstructured data by applying schema inference.

•Worked with the analytics team to gain customer insights using Scikit-learn

•Set up Airflow on the server for pipeline automation.

•Repartitioned datasets after loading Gzip files into Data Frames and improved the processing time.

•Configured CI/CD pipeline through Jenkins and GitHub
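
A minimal sketch of the Structured Streaming Kafka consumer referenced above; the broker address, topic, schema fields, and HDFS paths are illustrative, and the spark-sql-kafka connector package is assumed to be on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# Hypothetical schema for the JSON payload carried in the Kafka message value.
schema = StructType([
    StructField("household_id", StringType()),
    StructField("channel", StringType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "audience-events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Append each micro-batch to the data lake, with a checkpoint for recovery.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/lake/audience/")
         .option("checkpointLocation", "hdfs:///checkpoints/audience/")
         .outputMode("append")
         .start())
query.awaitTermination()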

Data Engineer, Chili's and Buffalo Wild Wings, Ohio, Jan 2017 – Sep 2019

Buffalo Wild Wings develops, franchises, and operates the Buffalo Wild Wings and Chili's restaurant chains.

•Hands-on with AWS data migration between database platforms, from local SQL Servers to Amazon RDS and EMR Hive

•Developed a task execution framework on EC2 instances using Lambda and Apache Airflow.

•Optimized Hive analytics SQL queries and created tables/views and wrote custom queries and Hive-based exception processes.

•Designed and developed streaming POC data pipelines in an AWS environment using S3 buckets, ingesting from AWS Kinesis streams fed by disparate sources.

•Created PySpark scripts to fetch data from disparate sources such as Snowflake, ingesting data with AWS Glue jobs and Crawlers over JDBC connectors (see the sketch at the end of this section).

•Created AWS Glue jobs from scratch to transform large volumes of data on a schedule.

•Managed the ETL process of the pipeline using tools such as Alteryx and Informatica

•Used Airflow with Bash and Python operators to automate the pipeline process.

•Joined, manipulated, and drew actionable insights from large data sources using Python and SQL

•Applied transformations using AWS Lambda functions on triggering events in an event-driven architecture

•Used AWS RDS Aurora to store historical relational data and AWS Athena for data profiling and infrequent queries

•Utilized AWS Redshift to store large volumes of data in the cloud.

•Used Spark SQL and Data Frames API to load structured and semi-structured data into Spark Clusters

•Worked on AWS to create and manage EC2 instances and Hadoop clusters

•Used Spark-SQL and Hive Query Language (HQL) for obtaining client insights.

•Finalized the data pipeline using Redshift, an AWS data storage option.
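
Relating to the Snowflake ingestion bullet above, a minimal sketch of the JDBC read and S3 write that a Glue job of that kind would wrap; the account URL, credentials, table, and bucket names are placeholders, and the Snowflake JDBC driver jar is assumed to be available to the job.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-ingest").getOrCreate()

# Read a (hypothetical) Snowflake table over JDBC; credentials would normally
# come from a secrets manager rather than literals in the script.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:snowflake://example.snowflakecomputing.com/")
          .option("dbtable", "ANALYTICS.PUBLIC.ORDERS")
          .option("user", "etl_user")
          .option("password", "***")
          .option("driver", "net.snowflake.client.jdbc.SnowflakeDriver")
          .load())

# Land the data in S3 as partitioned Parquet for downstream crawling/cataloging.
(orders.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("s3://example-bucket/curated/orders/"))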

Data Engineer, E-Trade, Arlington, VA, Dec 2014 – Dec 2016

The goal was to provide the company with analysis, insights, and suggestions for the future. Data was scraped from online sources using SQL queries, among other tools.

•Streamed data from Azure Fully Managed Kafka brokers using Spark Streaming and processed the data with explode transformations (see the sketch at the end of this section).

•Designed Python-based notebooks for automated weekly, monthly, and quarterly reporting ETL using Alteryx

•Analyzed large amounts of data sets to determine the optimal way to aggregate and report on them using Azure Databricks

•Designed the backend database and Azure cloud infrastructure for maintaining company proprietary data

•Designed ETL pipelines using Azure Blob, Data Factory, Synapse, Databricks, CosmosDB, and SQL Server

•Installed and configured Hive, Pig, Sqoop, and Oozie on the Hadoop cluster

•Monitored Apache Airflow cluster and used Sqoop to import data from Oracle to Hadoop

•Produced scripts for doing transformations using Scala

•Developed and implemented Hive custom UDFs involving date functions to query large amounts of data

•Developed Java Web Crawler to scrape market data for Internal products.

•Wrote Shell scripts to orchestrate the execution of other scripts and move the data files within and outside of HDFS

•Migrated various Hive UDFs and queries into Spark SQL for faster requests

•Created a benchmark between Hive and HBase for fast ingestion

•Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to HDFS using Scala

•Hands-on experience in Spark and Spark Streaming creating RDDs

•Wrote simple SQL scripts on the final database to prepare data for Tableau

•Created Airflow scheduling scripts in Python to automate data workflows.

•Orchestrated Airflow/workflow in a hybrid cloud environment from a local on-premises server to the cloud

•Wrote Shell FTP scripts for migrating data to Azure Blob Storage

•Implemented Azure Fully Managed Kafka streaming to send data streams from company APIs to Spark clusters in AWS/Azure Databricks and into Azure Data Lake Storage Gen2

•Developed DAG data pipelines for onboarding and change management of datasets

•Used Kafka publish-subscribe messaging as a distributed commit log
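
A minimal sketch of the explode transformation mentioned above, applied to a nested JSON payload of the kind a Kafka message might carry; the field names (ticker, quotes) and the inline sample record are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import (ArrayType, DoubleType, StringType, StructField,
                               StructType)

spark = SparkSession.builder.appName("explode-demo").getOrCreate()

# Hypothetical message schema: one ticker with an array of quote structs.
schema = StructType([
    StructField("ticker", StringType()),
    StructField("quotes", ArrayType(StructType([
        StructField("ts", StringType()),
        StructField("price", DoubleType()),
    ]))),
])

# Inline stand-in for the Kafka message value column.
raw = spark.createDataFrame(
    [('{"ticker": "ABC", "quotes": [{"ts": "09:30", "price": 41.2}, '
      '{"ts": "09:31", "price": 41.5}]}',)],
    ["value"],
)

# Parse the JSON, then explode the array so each quote becomes its own row.
flat = (raw.select(from_json(col("value"), schema).alias("m"))
        .select("m.ticker", explode("m.quotes").alias("q"))
        .select("ticker", col("q.ts").alias("ts"), col("q.price").alias("price")))
flat.show()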

Data Engineer, Alignment Healthcare, Orange, CA, Jan 2013 – Dec 2014

Alignment Healthcare provides Medicare care through its technology platform, delivering customized healthcare for seniors and the chronically ill and frail. The company provides clients with an end-to-end continuous care program, including clinical coordination, risk management, and technology facilitation.

•Wrote shell scripts for automating the process of data loading

•Worked with the MapReduce programming paradigm.

•Performed streaming data ingestion to the Spark distribution environment, using Kafka.

•Wrote producer/consumer scripts in Python to process JSON responses (see the sketch at the end of this section).

•Developed distributed query agents for performing distributed queries against shards

•Experience with hands-on data analysis and performing under pressure.

•Implemented Spark using Scala and utilized Data Frames and Spark SQL API for faster processing of data.

•Worked closely with stakeholders and data scientists/data analysts to gather requirements and create an engineering project plan.

•Performed ETL into the Hadoop file system (HDFS) and wrote Hive UDFs.

•Used Hive for queries and performed incremental imports with Spark jobs

•Developed JDBC/ODBC connectors between Hive and Spark to transfer newly populated data frames

•Wrote Hive queries for analyzing data in the Hive warehouse using Hive Query Language

•Managed jobs using a fair scheduler to allocate processing resources.

•Used the Spark engine and Spark SQL for data analysis and provided the results to data scientists for further analysis.

•Implemented parser, query planner, query optimizer, and native query execution using replicated logs combined with indexes, supporting full relational KQL queries, including joins

•Worked with AWS and involved in ETL, Data Integration, and Migration.

•Created materialized views, partitions, tables, views, and indexes.

•Wrote custom user-defined functions (UDFs) for complex Hive queries (HQL)
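
A minimal sketch of the JSON producer/consumer scripts referenced above, using the kafka-python client; the broker address and topic name are placeholders.

import json

from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"   # placeholder broker address
TOPIC = "member-events"     # placeholder topic

# Producer: serialize a dict to JSON bytes and send it to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"member_id": "A123", "event": "claim_submitted"})
producer.flush()

# Consumer: deserialize each message value back into a dict (loops until stopped).
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)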

EDUCATION

•Bachelor’s Degree in physics, Faculty of Sciences, Universidad Nacional Autónoma de México

•Research in Optical Tomography, Institute of Physics, Universidad Nacional Autónoma de México

CERTIFICATIONS

•Machine Learning A-Z: Hands-On Python and R In Data Science, Udemy

•Intro to SQL for Data Science Course, DataCamp

•Cambridge Advanced English Certification, Cambridge University

•Data Analysis with Pandas and Python, Udemy

•Master SQL for Data Science, Udemy

•Microsoft Excel: Excel from Beginner to Advanced, Udemy.


