
HÉCTOR MANUEL ARELLANO GONZALEZ

Email: adv3yp@r.postjobfree.com; Phone: 347-***-****

BIG DATA ENGINEER

Over 9 years of Software/IT and Big Data experience.

Develops Big Data projects using AWS, Hive, Spark, and many other open-source Apache tools and technologies.

Skilled in applying Big Data analytics with hands-on experience in Data Extraction, Transformation, Loading, Data Analysis, and Data Visualization using the Cloudera Platform (HDFS, Hive, Sqoop, Flume, HBase, Airflow).

Experienced working with different AWS cloud ecosystem components such as S3, EMR, Glue, Lambda, SNS, SQS, Step Functions, CloudWatch, Kinesis, and RDS.

Experience importing and exporting data between HDFS and Relational Database Management systems using Sqoop.

Experienced in loading data to Hive partitions and creating buckets in Hive (a minimal sketch follows this summary).

In-depth Knowledge of AWS Cloud Services like Compute, Network, Storage, and Identity and Access management.

Skilled in working in cloud environments and setting up services and integrations between systems.

Background with traditional databases such as Oracle, SQL Server, ETL tools/processes, and data warehousing architectures.

Hands-on with Snowflake, Databricks, and Apache Airflow for orchestrating ETL pipelines and scheduling workflows.

Extensive knowledge of NoSQL databases such as DynamoDB and Cassandra.

Experienced in importing data from various sources, performing transformations using Hive, loading data into HDFS, and extracting data from relational databases such as Oracle and MySQL into HDFS and Hive using Sqoop.

Hands-on experience fetching live streaming data from RDBMS sources into HBase tables using Spark Streaming and Apache Kafka.

Works with cloud environments on Amazon Web Services, including EC2 and S3.

Hands-on experience with the Amazon EMR framework, transferring data to EC2 servers.

Expert at extracting data from structured, semi-structured, and unstructured data sets for storage in HDFS.

Skilled in programming with Python.
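
Illustrative sketch (assumed table, column, and path names, not from any specific project) of the Hive partitioning and bucketing work referenced in the summary above, using PySpark:

from pyspark.sql import SparkSession

# Minimal sketch: write a DataFrame into a partitioned, bucketed Hive table.
# All table, column, and path names below are illustrative placeholders.
spark = (
    SparkSession.builder
    .appName("hive-partition-bucket-sketch")
    .enableHiveSupport()  # assumes a configured Hive metastore
    .getOrCreate()
)

# Example input: records extracted from an RDBMS (e.g. via Sqoop) and landed in HDFS as Parquet.
orders = spark.read.parquet("/data/landing/orders")  # hypothetical HDFS path

(
    orders.write
    .mode("overwrite")
    .partitionBy("order_date")             # one Hive partition per date
    .bucketBy(8, "customer_id")            # 8 buckets keyed by customer_id
    .sortBy("customer_id")
    .saveAsTable("sales.orders_bucketed")  # hypothetical Hive database.table
)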

Technical Skills

Scripting

Python, Shell scripting, PHP, MySQL, Scala

Big Data Platforms

Hadoop, Cloudera Impala, Talend, Informatica, AWS, Microsoft Azure, Adobe Cloud, Elastic Cloud, Anaconda Cloud

Big Data Tools

Apache Sqoop, Spark, Hive, HDFS, Zookeeper, Oozie, Airflow

Database Technologies

SQL Server 2008 R2/2012/2014, MySQL, SQL Server Reporting Services (SSRS), SQL Server Integration Services (SSIS), SQL Server Analysis Services (SSAS), SQL Server Management Studio; RDBMS and NoSQL: HBase, Cassandra

Data Processing

ETL Processes, EDP, Real-time processing, Batch processing, Streaming Processes, Cloud Security, and Cloud Filtering.

SharePoint Technologies

Workflows, Event Receivers, Web Parts, Site Definitions, Site Templates, Timer Jobs, SharePoint Hosted Apps, Provider Hosted Apps, Search, Business Connectivity Services (BCS), User Profiles, Master Pages, Page Layouts, Managed Metadata, SharePoint Designer, InfoPath, ShareGate, OAuth, Templates, Taxonomy, Metalogix, Nintex, Forms, Visual Studio, MS Office, SharePoint Search, SharePoint User Profiles.

BI/Reporting Visualization

Business Analysis, Data Analysis, Use of Dashboards and Visualization Tools, Power BI, Tableau

Professional Experience

Data Engineer

Constellation Brands, New York, NY Sep 2021 – Present

Created end-to-end ETL pipelines using components such as AWS services, Spark, and Kafka.

Built real-time streaming data pipelines with Kafka and Spark Streaming.

Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.

Attended meetings with managers to determine the company’s Big Data needs based on future production plans.

Developed an ETL pipeline to process log data using Spark, pandas, and sequence files, with output to Amazon S3.

Implemented Spark Streaming for real-time data processing with Kafka and handled large volumes of data with Spark, such as inventory data used to plan production (see the sketch after this list).

Used SQL to perform transformations and actions on data residing in Redshift as the data warehouse, keeping the materials data for the different products up to date.

Participated in various phases of data processing (collecting, aggregating, and moving from various sources) using Apache Spark.

Managed structured data via Spark SQL, then stored it in AWS S3 as Parquet files for downstream consumption.

Defined the Spark/Python (PySpark) ETL framework and best practices for development in Python.

Used Spark Streaming to check and track the ETA of transports.

Versioned code with Git and set up Jenkins CI to manage CI/CD practices.

Interacted with data residing in AWS S3 using Spark to process the data.
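
Illustrative sketch of the kind of Kafka-to-S3 streaming pipeline described in the bullets above, assuming Spark Structured Streaming; the broker, topic, schema, and bucket names are placeholders rather than actual project values:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and
# persist the parsed records to S3 as Parquet. All names are placeholders.
spark = SparkSession.builder.appName("kafka-to-s3-sketch").getOrCreate()

schema = StructType([
    StructField("sku", StringType()),
    StructField("quantity", IntegerType()),
    StructField("warehouse", StringType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "inventory-events")           # hypothetical topic
    .load()
)

events = (
    raw.select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://example-bucket/inventory/")                       # hypothetical bucket
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/inventory/")
    .start()
)
query.awaitTermination()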

Data Engineer

Heineken USA, White Plains, New York Mar 2018 – Sep 2021

Created DataFrames in Apache Spark by passing a schema as a parameter to the ingested data using case classes, sourcing the files from SAP.

Participated in the implementation of Cloudera and AWS, landing the information on those platforms.

Involved in the implementation of analytics solutions through Agile/Scrum processes for development and quality assurance.

Interacted with data residing in Google Cloud Storage using Spark/PySpark to process the data.

Automated, configured, and deployed instances on Google Cloud Platform (GCP) environments.

Populated DataFrames inside Spark jobs, using Spark SQL and the DataFrames API to load structured data into Spark clusters.

Forwarded requests to a source REST-based API from a Python script via a Kafka producer.

Developed PySpark application to read data from various file system sources, apply transformations, and write to SQL database.

Gained knowledge of Spark and similar frameworks to work with production plans and inventory.

Attended meetings with managers to determine the company’s Big Data needs.

Loaded disparate data sets and conducted pre-processing services using Google Cloud Storage.

Collaborated with the software research and development teams and built cloud platforms for the development of company applications.

Trained staff on data resource management.

Collected data using a REST API: built an HTTPS connection with the client-server, sent GET requests, and collected the responses in a Kafka producer (see the sketch after this list).

Used Spark transformations on data imported from web services into Google Cloud Storage.

Executed Spark jobs on Google Cloud Dataproc, data stored in Google Cloud Storage Buckets, and ingested data through Google Cloud Pub/Sub and Dataflow from various sources to Google Cloud Storage.

Worked with Spark Context, Spark SQL, DataFrames, and Pair RDDs.

Extracted data from different databases and scheduled PySpark workflows with Cloud Composer to execute the tasks daily.

Worked with Google Cloud Platform (GCP) and was involved in ETL, Data Integration, and Migration.

Worked on Google Cloud Pub/Sub for processing huge amounts of real-time data.
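
Illustrative sketch of the REST-to-Kafka collection step described above, assuming the kafka-python client; the endpoint, broker, and topic names are hypothetical:

import json

import requests
from kafka import KafkaProducer  # assumes the kafka-python client is installed

# Minimal sketch: call a REST endpoint over HTTPS and forward the JSON records
# to a Kafka topic. Endpoint, broker, and topic names are placeholders.
producer = KafkaProducer(
    bootstrap_servers=["broker:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

response = requests.get("https://api.example.com/v1/orders", timeout=30)  # hypothetical endpoint
response.raise_for_status()

for record in response.json():  # assumes the API returns a JSON array of records
    producer.send("orders-raw", record)  # hypothetical topic

producer.flush()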

Senior Data Engineer

Leaddocket, Salt Lake City, UT (Remote) May 2015 – Feb 2018

Worked directly with a large data team that created the foundation for this Enterprise Analytics initiative on a Spark data lake on GCP.

Programmed Spark workflows to run multiple Cloud SQL queries on GCP.

Developed Python scripts for data capture and delta-record processing between newly arrived data and existing data in the GCP cluster using Cloud Storage.

Implemented workflows with Apache Airflow on GCP to automate tasks (see the sketch after this list).

Imported data from disparate sources into Spark RDDs on GCP for processing, using Cloud Storage.

Analyzed large sets of structured, semi-structured, and unstructured data by running Spark and Python scripts with pandas on GCP Dataproc.

Partitioned and bucketed log file data in Parquet format using GCP Dataproc.

Collected relational data using custom input adapters and the Cloud Storage Transfer Service on GCP.

Consumed data from a publisher queue using Cloud Pub/Sub on GCP.

Wrote Python configuration scripts to export log files to the GCP Dataproc cluster through automated processes as part of the cloud migration.

Accessed the GCP Dataproc cluster and reviewed log files of all daemons.

Migrated MapReduce jobs to Spark to boost processing performance on GCP Dataproc.

Developed Dataflow pipelines on GCP to extract, transform, and load data from various sources into BigQuery.

Utilized GCP Data Studio to create interactive dashboards and reports for stakeholders.

Implemented GCP Cloud Functions to process real-time data streams from Cloud Pub/Sub and store the results in Cloud Storage or BigQuery.

Worked with GCP Cloud Composer to deploy and manage Airflow workflows in a production environment.

Managed and monitored GCP resources using Stackdriver Logging and Monitoring.
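
Illustrative Airflow DAG sketch of the kind of scheduled workflow referenced in the list above, deployable on Cloud Composer; the DAG id, callable, cluster, region, and bucket names are assumptions for illustration only:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_daily_delta(**context):
    # Placeholder for delta-capture logic: compare newly arrived files in
    # Cloud Storage against data already loaded into the cluster.
    print("extracting delta for", context["ds"])


with DAG(
    dag_id="daily_delta_pipeline",      # hypothetical DAG name
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_delta",
        python_callable=extract_daily_delta,
    )
    run_spark = BashOperator(
        task_id="run_spark_job",
        # Dataproc job submission via gcloud; cluster, region, and bucket are placeholders.
        bash_command=(
            "gcloud dataproc jobs submit pyspark "
            "gs://example-bucket/jobs/transform.py "
            "--cluster=example-cluster --region=us-central1"
        ),
    )
    extract >> run_spark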

Junior Data Engineer

River Valley Construction, Vancouver, Canada Jan 2014 – May 2015

Built large-scale and complex data processing pipelines.

Built Apache Spark data structures and programmed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.

Used Spark SQL to load JSON data, created a schema RDD, loaded it into Hive tables, and handled structured data using Spark SQL.

Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.

Analyzed the Hadoop cluster using big data analytic tools, including MapReduce, Hive, HDFS, Spark, Kafka, and Apache NiFi.

Installed Oozie workflow engine to run multiple Spark Jobs.

Set up a Kafka broker and used a schema to consume structured data with Structured Streaming.

Interacted with data residing in HDFS using Spark to process the data.

Configured Spark Streaming to receive real-time data to store in HDFS.

Ingested production line data from IoT sources through internal RESTful APIs.

Created Python scripts to download data from the APIs and perform pre-cleaning steps.

Built Spark applications to perform data enrichments and transformations using Spark Data Frames with Cassandra lookups.

Performed ETL operations on IoT data using PySpark.

Wrote user-defined functions (UDFs) to apply custom business logic to datasets using PySpark (see the sketch after this list).

Configured AWS S3 to receive and store data from the resulting PySpark job.

Wrote DAGs for Airflow to allow scheduling and automatic execution of the pipeline.

Performed cluster-level and code-level Spark optimizations.

Created AWS Redshift Spectrum external tables to query data in S3 directly for data analysts.

Configured Zookeeper to coordinate the servers in clusters to maintain data consistency and monitor services.

Converted Hive/SQL queries into Spark transformations using Spark RDDs in Python.

Installed and configured Tableau Desktop to connect, through the Hortonworks JDBC connector, to the Hortonworks Hive database that contained the bandwidth data for further analytics.

Wrote simple SQL scripts on the final database to prepare data for visualization with Tableau.
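
Illustrative PySpark sketch of the UDF-based enrichment mentioned in the list above; the column names, threshold, and storage paths are assumptions rather than actual project details:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Minimal sketch: apply custom business logic to IoT readings with a PySpark UDF.
# Column names, the threshold, and the paths are placeholders.
spark = SparkSession.builder.appName("udf-enrichment-sketch").getOrCreate()

def classify_reading(value):
    # Hypothetical rule for a sensor reading.
    if value is None:
        return "missing"
    return "alert" if value > 90.0 else "normal"

classify_udf = udf(classify_reading, StringType())

readings = spark.read.json("s3a://example-bucket/iot/readings/")  # hypothetical source
enriched = readings.withColumn("status", classify_udf(col("temperature")))
enriched.write.mode("overwrite").parquet("s3a://example-bucket/iot/enriched/")  # hypothetical sink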

EDUCATION

Industrial and Systems Engineer

TECNOLÓGICO DE MONTERREY ITESM TOLUCA, MÉXICO

Industrial and Systems Engineer in the European Union

DANISH STUDIES AND SCIENCE SECRETARIAT DENMARK, Validation of university studies in Denmark

Diploma in Marketing

LAUREA UNIVERSITY LOHJA, FINLAND


