Big Data Engineer

Location:
San Diego, CA
Posted:
March 23, 2023

Jorge Zepeda

Big Data Engineer / Hadoop / Cloud Developer

Phone: 858-***-****

Email: adtn01@r.postjobfree.com

Profile Summary

Big Data Engineer with 6+ years of expertise using Hadoop ecosystem tools such as Spark, Kafka, HDFS, Pig, Hive, Sqoop, Storm, YARN, and Oozie; skilled in Python and Scala.

Adept at writing user-defined functions (UDFs) for Hive and Pig using Python and performing ETL (data extraction, transformation, and load) with Hive, Pig, and HBase.

Hands-on experience with:

o Spark architecture, including Spark Core, Spark SQL, and Spark Streaming.

o Hadoop Distributed File System (Cloudera, MapR, S3), the Hadoop framework, and parallel processing implementations.

Skilled in writing UNIX shell scripts, with experience in AWS services including IAM, Data Pipeline, EMR, S3, EC2, AWS CLI, and SNS.

Expertise in creating scripts and macros using Microsoft Visual Studio to automate tasks and in writing UDFs for Hive and Pig.

Use Flume to load log data from multiple sources directly into HDFS.

Design both time-driven and data-driven automated workflows using Oozie.

Develop applications using RDBMS, Hive, Linux/Unix shell scripting and Linux internals.

Experienced in importing and exporting data between Hadoop and RDBMS using Sqoop and SFTP.

Proficient in cleaning and analyzing data using HiveQL and Pig Latin.

Possess database experience with Apache Cassandra, Apache HBase, MongoDB, and SQL.

Experienced in extracting data and generating analyses using business intelligence tools such as Tableau.

Technical Skills List

Big Data Platforms: Hadoop, Cloudera Hadoop, Hortonworks.

Hadoop Ecosystem (Apache) Tools: Kafka, Spark, Cassandra, Flume, Hadoop, Hadoop YARN, HBase, Hive, Oozie, Pig, Spark Streaming.

Hadoop Ecosystem Components: Sqoop, Kibana, Tableau, AWS, HDFS, Hortonworks, Apache Airflow.

Scripting: Python, Scala, SQL, Spark, HiveQL.

Data Storage and Files: HDFS, Data Lake, Data Warehouse, Redshift, Parquet, Avro, JSON, Snappy, Gzip.

Databases: Apache Cassandra, Apache HBase, MongoDB, SQL, MySQL, RDBMS, NoSQL, DB2, DynamoDB.

Cloud Platforms and Tools: S3, AWS, EC2, EMR, Lambda services, Microsoft Azure, Adobe Cloud, Amazon Redshift, Rackspace Cloud, Intel Nervana Cloud, OpenStack, Google Cloud, IBM Bluemix Cloud, MapR Cloud, Elastic Cloud, Anaconda Cloud.

Data Reporting and Visualization: Tableau, PowerBI, Kibana, Pentaho, QlikView.

Analytics Framework Tools: Databricks.

Web Technologies and APIs: XML, Blueprint XML, Ajax, REST API, Spark API, JSON.

Professional Project Experience

Cloud Data Engineer Arkus Nexus San Diego, CA

July 2021 – Current

Created custom Spark Streaming jobs to process data streaming events.

Created modules for Apache Airflow to call different services in the cloud including EMR, S3, and Redshift.

Developed Spark jobs for data processing and Spark-SQL/Streaming for distributed processing of data.

Created an AWS Lambda function to extract data delivered by Kinesis Firehose and post it to an AWS S3 bucket on a scheduled basis (every 4 hours) using an AWS CloudWatch event.
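
As a hedged illustration of this pattern (the bucket names, prefix, and schedule wiring below are assumptions for the sketch, not the actual project code), a scheduled Lambda handler could copy Firehose-delivered objects from a staging prefix to a target bucket:

    # Illustrative scheduled Lambda handler; bucket names and prefix are placeholders.
    import boto3

    s3 = boto3.client("s3")

    STAGING_BUCKET = "firehose-staging-bucket"   # assumed bucket populated by Kinesis Firehose
    TARGET_BUCKET = "curated-data-bucket"        # assumed destination bucket

    def lambda_handler(event, context):
        """Invoked every 4 hours by a CloudWatch Events schedule rule."""
        paginator = s3.get_paginator("list_objects_v2")
        copied = 0
        for page in paginator.paginate(Bucket=STAGING_BUCKET, Prefix="events/"):
            for obj in page.get("Contents", []):
                # Copy each delivered object into the curated bucket.
                s3.copy_object(
                    Bucket=TARGET_BUCKET,
                    Key=obj["Key"],
                    CopySource={"Bucket": STAGING_BUCKET, "Key": obj["Key"]},
                )
                copied += 1
        return {"copied_objects": copied}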

Developed Spark code in Python to run on EMR clusters and load data into the Snowflake data warehouse.

Designed extensive automated test suites utilizing Selenium in Python.

Wrote Unit tests for all code using PyTest for Python.
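
A minimal PyTest sketch of this kind of unit test; normalize_record is a stand-in helper written for the example, not the project's real code:

    # Minimal PyTest example; the function under test is hypothetical.
    import pytest

    def normalize_record(raw: dict) -> dict:
        # Hypothetical transformation under test: trim whitespace, require an "id" field.
        return {"id": raw["id"].strip(), "name": raw.get("name", "").strip()}

    def test_normalize_record_strips_whitespace():
        assert normalize_record({"id": " 42 ", "name": " Ada "}) == {"id": "42", "name": "Ada"}

    def test_normalize_record_requires_id():
        with pytest.raises(KeyError):
            normalize_record({"name": "Ada"})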

Used Python Boto3 to develop Lambda functions in AWS.

Updated, maintained, and validated large data sets for derivatives using SQL and PySpark.

Used the Pandas library and Spark in Python for data cleansing, validation, processing, and analysis.
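
An illustrative Pandas cleansing helper of the kind used in such a step; the column names (trade_date, notional) are assumed for the example:

    # Sketch of a Pandas cleansing/validation step with assumed column names.
    import pandas as pd

    def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
        # Drop exact duplicates, coerce types, and discard rows that fail validation.
        df = df.drop_duplicates()
        df["trade_date"] = pd.to_datetime(df["trade_date"], errors="coerce")
        df["notional"] = pd.to_numeric(df["notional"], errors="coerce")
        return df.dropna(subset=["trade_date", "notional"])

    raw = pd.DataFrame({"trade_date": ["2021-08-01", "bad-date"], "notional": ["1000.5", "n/a"]})
    print(clean_frame(raw))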

Created and maintained the data warehouse in AWS Redshift.

Implemented different optimization techniques to improve the performance of the data warehouse in AWS Redshift.

Implemented Spark on EMR to process Big Data across the data lake in the AWS system.

Implemented serverless architecture using AWS Lambda with Amazon S3 and Amazon DynamoDB.

Set up workflows in Apache Airflow to run ETL pipelines using tools in AWS.
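
A hedged sketch of such an Airflow workflow, assuming Airflow 2.x; the DAG id, task callables, and schedule are illustrative rather than the actual pipeline:

    # Illustrative Airflow DAG chaining an extract-to-S3 step and a Redshift load step.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_to_s3(**context):
        # Pull source data and land it in S3 (details omitted in this sketch).
        pass

    def load_to_redshift(**context):
        # COPY the landed files into Redshift (details omitted in this sketch).
        pass

    with DAG(
        dag_id="daily_etl_pipeline",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
        load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)
        extract >> load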

Integrated streams with Spark Streaming for high-speed processing.

Ingested data into the data lake using AWS S3.

Configured the spark-submit command to allocate resources to all jobs across the cluster.

Collected log information using custom-engineered input adapters and Kafka.

Created custom producer to ingest the data into Kafka topics for consumption by custom Kafka consumers.
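
A minimal sketch of such a producer using the kafka-python client; the broker address and topic name are assumptions:

    # Illustrative Kafka producer publishing JSON events to an ingestion topic.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],                      # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def publish_event(event: dict) -> None:
        # Send the event to the topic consumed by the downstream custom consumers.
        producer.send("ingest-events", value=event)
        producer.flush()

    publish_event({"event_id": "42", "action": "click"})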

Performed maintenance, monitoring, deployments, and upgrades across applications that support all Spark jobs.

Partitioned and bucketed log file data to separate information in a common location and combine it based on business needs.

Deployed the application JAR files to AWS EC2 instances.

Developed a task execution framework on EC2 instances using Lambda and Airflow.

Implemented KSQL queries to process the data ingested in the Kafka topic.

Authored AWS Athena queries against files in S3 for data profiling.
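
For illustration, such a profiling query can be submitted through the Athena API with Boto3; the database, table, and output location below are placeholders:

    # Illustrative Athena query submission via Boto3 (names and paths are assumed).
    import boto3

    athena = boto3.client("athena")

    response = athena.start_query_execution(
        QueryString="SELECT col, COUNT(*) AS cnt FROM raw_events GROUP BY col LIMIT 100",
        QueryExecutionContext={"Database": "analytics_db"},
        ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/profiling/"},
    )
    # The returned id can be polled for status and results.
    print(response["QueryExecutionId"])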

Provided support for clusters and topics in Kafka Manager and performed Kafka/Hadoop upgrades in the environment.

Big Data Engineer RedShelf, Inc. Chicago, IL

May 2020 – June 2021

As one of the nation's leading edtech companies, RedShelf has helped thousands of colleges, businesses, and publishers transition effortlessly from traditional print to more affordable, efficient digital textbooks and learning content.

Offering a vast catalog of titles on its award-winning eReader, plus an end-to-end Content Delivery System to streamline every step of the distribution process, RedShelf is a one-stop shop for digital evolution.

Developed a PySpark application as an ETL job to read data from various file system sources, apply transformations, and write to a NoSQL database (Cassandra).
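
A minimal sketch of that kind of PySpark ETL job; the source path, keyspace, and table are assumed, and the Cassandra write relies on the spark-cassandra-connector package being available:

    # Illustrative PySpark ETL: read JSON files, clean them, write to Cassandra.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("file-to-cassandra-etl").getOrCreate()

    raw = spark.read.json("s3a://source-bucket/events/")          # assumed source path
    cleaned = (
        raw.filter(F.col("event_id").isNotNull())
           .withColumn("event_ts", F.to_timestamp("event_ts"))
    )

    (cleaned.write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="analytics", table="events")            # assumed keyspace/table
        .mode("append")
        .save())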

Installed and configured Kafka cluster and monitored the cluster.

Architected a lightweight Kafka broker and integrated Kafka with Spark for real-time data processing.

Wrote Unit tests for all code using PyTest for Python.

Used Python Boto3 to develop Lambda functions in AWS.

Used the Pandas library and Spark in Python for data cleansing, validation, processing, and analysis.

Created Hive external tables and designed data models in Apache Hive.
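
An illustrative external-table definition, issued here through a Hive-enabled SparkSession; the table name, columns, and HDFS location are placeholders:

    # Sketch of a Hive external table DDL executed via Spark SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-ddl").enableHiveSupport().getOrCreate()

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales_events (
            event_id   STRING,
            user_id    STRING,
            amount     DOUBLE,
            event_ts   TIMESTAMP
        )
        PARTITIONED BY (event_date STRING)
        STORED AS PARQUET
        LOCATION 'hdfs:///data/warehouse/sales_events'
    """)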

Developed multiple Spark Streaming and batch Spark jobs using Python on AWS.

Implemented advanced feature engineering procedures for the data science team using in-memory computing with Apache Spark in Python.

Implemented Rack Awareness in the Production Environment.

Collected data via REST APIs: built HTTPS connections to client servers, sent GET requests, and forwarded the responses to a Kafka producer.

Imported data from web services into HDFS and transformed data using Spark.

Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.

Used Spark SQL to create and populate the HBase warehouse.

Worked with SparkContext, Spark SQL, DataFrames, and pair RDDs.

Ingested data through AWS Kinesis Data Stream and Firehose from various sources to S3.
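
A hedged sketch of pushing records into a Kinesis Data Stream with Boto3; the stream name and payload shape are assumptions, with Firehose then delivering the batched records to S3:

    # Illustrative Kinesis producer using Boto3 (stream name is a placeholder).
    import json
    import boto3

    kinesis = boto3.client("kinesis")

    def send_record(record: dict) -> None:
        # Each record is keyed by user_id so related events land on the same shard.
        kinesis.put_record(
            StreamName="ingest-stream",                 # assumed stream name
            Data=json.dumps(record).encode("utf-8"),
            PartitionKey=str(record.get("user_id", "default")),
        )

    send_record({"user_id": "u-1", "page": "/checkout"})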

Extracted data from different databases and scheduled Oozie workflows to execute the tasks daily.

Worked with Amazon Web Services (AWS) and involved in ETL, Data Integration, and Migration.

Documented the requirements, including the available code to be implemented using Spark, Amazon DynamoDB, Redshift, and Elasticsearch.

Imported and exported data into HDFS and Hive using Sqoop.

Worked on AWS Kinesis for processing huge amounts of real-time data.

Hadoop Data Engineer CoCoS.ai Remote

October 2019 – April 2020

Collaborative AI on private data

CoCoS.ai (Confidential Computing System for AI) lets you run AI/ML workloads on combined datasets from multiple organizations while guaranteeing the privacy and security of the data and the algorithms. Data is always encrypted, protected by hardware secure enclaves (Trusted Execution Environments), attested via secure remote attestation protocols, and invisible to cloud processors or any other 3rd party to which computation is offloaded.

Designed a cost-effective archival platform for storing Big Data using Hadoop and its related technologies.

Developed a task execution framework on EC2 instances using SQL and DynamoDB.

Built a Full-Service Catalog System with a full workflow using CloudWatch, Kibana, Kinesis, Elasticsearch, and Logstash.

Built a prototype for real-time analysis using Spark streaming and Kafka in Hadoop system.

Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop, Spark, and Hive for ETL pipelines and Spark Streaming, acting directly on the Hadoop Distributed File System (HDFS).

Extracted data from RDBMS (Oracle, MySQL) to the Hadoop Distributed File System (HDFS) using Sqoop.

Used NoSQL database MongoDB in implementation and integration.

Configured Oozie workflow engine scheduler to run multiple Sqoop, Hive, and Pig jobs in the Hadoop system.

Consumed data from the Kafka queue using Storm and deployed application JAR files to AWS instances.

Collected business requirements from subject matter experts and data scientists.

Transferred data from AWS S3 using the Informatica tool.

Used AWS Redshift for cloud data storage.

Used different file formats such as Text files, Sequence Files, and Avro for data processing in Hadoop system.

Loaded data from various data sources into the Hadoop Distributed File System (HDFS) using Kafka.

Integrated Kafka with Spark Streaming for real-time data processing in Hadoop.
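
One way to wire Kafka into Spark, shown here with Structured Streaming as an illustration; the broker, topic, and output paths are placeholders, not the project's actual configuration:

    # Illustrative Spark Structured Streaming job reading from Kafka and writing to HDFS.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

    stream = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker
        .option("subscribe", "sensor-events")                # assumed topic
        .load())

    # Kafka values arrive as bytes; cast to string before downstream processing.
    values = stream.select(col("value").cast("string"))

    query = (values.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/streams/sensor-events")
        .option("checkpointLocation", "hdfs:///checkpoints/sensor-events")
        .start())
    query.awaitTermination()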

Used Cloudera Hadoop (CDH) distribution with Elasticsearch.

Used machine images to create instances with Hadoop installed and running.

Streamed analyzed data to Hive tables using Sqoop, making it available for data visualization.

Tuned and operated Spark and related technologies such as Spark SQL and Spark Streaming.

Used the Hive JDBC to verify the data stored in the Hadoop cluster.

Connected various data centers and transferred data using Sqoop and ETL tools in Hadoop system.

Imported data from disparate sources into Spark RDDs for data processing in Hadoop.

Used shell scripts to dump the data from MySQL to Hadoop Distributed File System (HDFS).

Data Engineer L'Oréal S.A. Remote

August 2017 – September 2019

A French personal care company headquartered in Clichy, Hauts-de-Seine, with a registered office in Paris. It is the world's largest cosmetics company, concentrating on hair color, skin care, sun protection, make-up, perfume, and hair care.

Designed, configured, and installed Hadoop clusters.

Performed Hadoop system administration using Hortonworks/Ambari and Linux system administration (RHEL 7, CentOS).

Set up Apache Ranger for cluster ACLs and audits to meet compliance specifications.

Performed HDFS balancing and fine-tuning for MapReduce applications.

Produced data migration plan for other data sources into the Hadoop system.

Applied open-source configuration management and deployment using Puppet and Python.

Configured Kerberos on the cluster for user authentication.

Configured the YARN Capacity Scheduler and Fair Scheduler based on organizational needs.

Performed cluster capacity and growth planning and recommended nodes configuration.

Worked with highly unstructured and structured data.

Optimized and integrated Hive, Sqoop, and Flume into existing ETL processes, accelerating the extraction, transformation, and loading of massive structured and unstructured data.

Used Hive to simulate data warehouse for performing client-based transit system analytics.

Monitored production cluster by setting up alerts and notifications using metrics thresholds.

Tuned MapReduce counters for faster and optimal data processing.

Helped design back-up and disaster recovery methodologies involving Hadoop clusters and related databases.

Performed upgrades, patches, and fixes using either the rolling or express method.

Software and Database Developer NONAP

2016 – 2017

Extensively used Django technologies, including forms, templates, and the ORM for database communication, along with interceptors, validators, and actions.
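
A minimal Django sketch of the model/form/ORM pattern referred to above; the Driver model and its fields are purely illustrative, and the code is assumed to live inside a configured Django app:

    # app/models.py -- illustrative model definition (hypothetical fields).
    from django.db import models

    class Driver(models.Model):
        name = models.CharField(max_length=100)
        license_number = models.CharField(max_length=30, unique=True)

    # app/forms.py -- ModelForm with a validator hook.
    from django import forms

    class DriverForm(forms.ModelForm):
        class Meta:
            model = Driver
            fields = ["name", "license_number"]

        def clean_license_number(self):
            # Normalize the value before it is saved through the ORM.
            return self.cleaned_data["license_number"].strip().upper()

    # ORM usage example: Driver.objects.filter(name__icontains="ana")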

Developed driver drowsiness detection software using computer vision.

Designed hardware for the system.

Educational Credentials

Bachelor's – Mechatronics Engineering – 2019

ITESM, MX


