
Senior Big Data Engineer

Location:
Irving, TX, 75061
Posted:
September 29, 2023

Resume:

Nearly ** years of experience applying accurate and efficient technical solutions in the Big Data space.

23+ years of overall experience in software, database, and IT.

SKILLS

Apache Hadoop, Apache Hive, Apache Kafka, Apache Spark, Spark Streaming, Apache Tez, Apache Zookeeper, HDFS, HiveQL

GCP; AWS – EC2, EMR, RDS Aurora, Redshift, CloudWatch, CloudFormation, Lambda, S3 & Glacier, DynamoDB

Spark (Scala), PyCharm, Python, MySQL

Unix/Linux, Windows

Microsoft SQL Server (2000, 2008 R2, 2012), IBM Netezza, Oracle

LEONARDO GASKIN

PHONE: 469-***-**** GMAIL: adz095@r.postjobfree.com

Work History

Sr. Data Engineer

SiriusXM / Pandora Irving, TX December 2021 – PRESENT

•Developed and maintained three main pipelines using GCP Composer, Scala-based jobs, and Python-based jobs for efficient data processing and workflow management (a minimal DAG sketch follows this list).

•Managed and maintained a Django-based web tool for handling monthly financial reports, utilizing Python and PostgreSQL (psql) for seamless data management.

•Served as a pull request reviewer, ensuring code quality and adherence to best practices.

•Conducted code rebase as needed to incorporate new features, improvements, and bug fixes into the codebase.

•Managed Git updates from both the local terminal and IntelliJ IDEA, ensuring smooth version control and collaboration among team members.

•Utilized PyCharm and IntelliJ IDEA as primary Integrated Development Environments (IDEs) for efficient coding and development.

•Handled label/partner renewals and termination updates in the PSQL database, ensuring accurate and up-to-date information.

•Participated in on-call shifts to ensure proper functionality and timely resolution of on-demand jobs.

•Actively involved in the GCP Dataproc transition from version 1.5 to 2.0, contributing to the successful migration and ensuring uninterrupted data processing.

•Played a key role in the Airflow upgrade from version 1 to version 2, ensuring a smooth transition and improved workflow management.

•Contributed to the creation of a new plan on demand for a newly added bundle, enabling efficient data processing and analysis.

•Assisted in the host transition for the on-prem tool "MIDAS," ensuring a seamless migration and uninterrupted functionality.

•Automated report validations, streamlining the validation process and reducing manual effort.

•Developed a new weekly report based on the daily usage of the Pandora/SiriusXM app, providing valuable insights about user behavior and trends.

•Performed monthly cleaning of spin fraud data, ensuring data accuracy and maintaining a reliable dataset.

•Prepared data for annual audits as required, ensuring compliance and accuracy of the data.

•Conducted BigQuery analysis on specific data sets, extracting meaningful insights and supporting data-driven decision-making.

•Created and maintained buckets in cloud storage systems, ensuring efficient data storage and retrieval.

•Managed cluster creation, decommissioning, and maintenance, optimizing performance and resource utilization.

•Generated and delivered quarterly reports, providing stakeholders with valuable information and analysis.

•Participated in weekly meetings with Licensing and Accounting teams to understand new requirements and align data processing.

•Attended stand-up meetings to provide progress updates and coordinate tasks within the team, and participated in bi-weekly sprint planning meetings on Tuesdays to ensure proper task allocation and project planning for upcoming sprints.
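
A note on the Composer/Airflow work above: the sketch below is a minimal, hypothetical Airflow 2 DAG (the DAG id, schedule, task names, and commands are illustrative placeholders, not the production pipeline) showing the general shape of a daily pipeline that validates a report and then launches a Spark job.

# Hypothetical Airflow 2 DAG sketch for a daily Composer pipeline.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def validate_report(**context):
    # Placeholder validation step; the real job would query BigQuery/PostgreSQL.
    print(f"Validating report for {context['ds']}")

with DAG(
    dag_id="royalty_reporting_pipeline",  # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=10)},
) as dag:
    validate = PythonOperator(task_id="validate_report", python_callable=validate_report)
    # In practice this step would submit a Scala Spark job to Dataproc
    # (e.g., via a Dataproc operator); a bash echo keeps the sketch dependency-free.
    run_spark_job = BashOperator(
        task_id="run_spark_job",
        bash_command="echo 'gcloud dataproc jobs submit spark ...'",
    )
    validate >> run_spark_job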

Big Data Engineer

Verizon Dallas, TX January 2019 – November 2021

•Developed a data pipeline with Kafka and Spark (a minimal streaming sketch follows this list).

•Contributed to designing the Data Pipeline with Lambda Architecture.

•Performed advanced procedures like text analytics and processing using Spark's in-memory computing capabilities with Scala.

•Involved in installing, configuring, supporting, and managing Hadoop clusters.

•Worked on AWS Data Pipeline to configure data loads from S3 to Redshift.

•Used Spark for interactive queries and processing of streaming data.

•Worked extensively with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables and optimized Hive queries.

•Developed Spark applications in Scala and Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.

•Used Spark to improve the performance and optimization of existing algorithms in Hadoop.

•Worked with SparkContext, Spark SQL, DataFrames, and Spark on YARN.

•Used Spark Streaming APIs to perform transformations and actions on the fly.

•Configured a data model to get data from Kafka in near real-time and persist it to Cassandra.

•Developed Kafka consumer API in Python for consuming data from Kafka topics.

•Wrote live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.

•Migrated an existing on-premises application to AWS.

•Used AWS services like EC2 and S3 for small data sets processing and storage.

•Maintained the Hadoop cluster on AWS EMR.

•Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
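
To illustrate the Kafka-to-Spark pattern in the bullets above, here is a minimal PySpark Structured Streaming sketch. The broker address, topic, schema, and sink paths are hypothetical placeholders, the job assumes the spark-sql-kafka package is available, and it writes to Parquet purely for illustration (the production pipeline persisted to Cassandra and other sinks).

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-lake").getOrCreate()

# Hypothetical schema for the JSON payloads on the topic.
event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read the stream from Kafka (placeholder broker and topic).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "device-events")
    .option("startingOffsets", "latest")
    .load()
)

# Parse the JSON value column into typed columns.
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Write the parsed events to a lake path in append mode.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://example-bucket/events/")
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()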

Big Data Engineer

Aflac Columbus, GA September 2017 – January 2019

•Communicated deliverables status to stakeholders and facilitated periodic review meetings.

•Developed Spark streaming application to pull data from the cloud to Hive and HBase.

•Built Real-Time Streaming Data Pipelines with Kafka, Spark Streaming, and Hive.

•Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker (a minimal producer sketch follows this list).

•Handled schema changes in the data stream using Kafka.

•Responsible for Kafka operation and monitoring, and handling of messages funneled through Kafka topics.

•Coordinated Kafka operation and monitoring with DevOps personnel and balanced the impact of Kafka producer and consumer message (topic) consumption.

•Designed and developed ETL workflows using Python and Scala for processing data in HDFS.

•Collected, aggregated, and shuffled data from servers to HDFS using Apache Spark & Spark Streaming.

•Worked on moving claims information between HDFS and RDBMS.

•Created Hive external tables, loaded data into them, and queried the data using HQL.

•Streamed the prepared data to HBase using Spark.

•Tuned Spark Streaming performance, e.g., setting the right batch interval, the correct number of executors, and the appropriate level of parallelism and memory.

•Used HBase connector for Spark.

•Performed gradual cleansing and modeling of datasets.

•Utilized Avro tools to build the Avro schema and create external Hive tables using PySpark.

•Created and managed external tables to store ORC and Parquet files using HQL.

•Developed Apache Airflow DAGs to automate the pipeline.

•Created a NoSQL HBase database to store the processed data from Apache Spark.
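
As a companion to the Kafka producer bullet above, this is a minimal sketch using the kafka-python library; the broker address, topic name, and record fields are hypothetical, and the actual producers pulled from the external sources described above.

import json
from kafka import KafkaProducer

# Placeholder broker; values are serialized as JSON, keys as UTF-8 strings.
producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    key_serializer=lambda k: k.encode("utf-8"),
)

def publish_records(records):
    # Send each record to a hypothetical claims topic, keyed by claim id.
    for record in records:
        producer.send("claims-events", key=record["claim_id"], value=record)
    producer.flush()

if __name__ == "__main__":
    publish_records([{"claim_id": "C-1001", "status": "RECEIVED"}])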

Data Engineer

FINRA Rockville, MD May 2015 – September 2017

•Created and executed Hadoop Ecosystem installation and document configuration scripts on the Google Cloud Platform.

•Transformed batch data from several tables containing hundreds of thousands of records from SQL Server, MySQL, PostgreSQL, and CSV file datasets into data frames using PySpark.

•Developed a PySpark program that writes data frames to HDFS as Avro files (a minimal sketch follows this list).

•Utilized Spark's parallel processing capabilities to ingest data.

•Created and executed HQL scripts that create external tables in a raw layer database in Hive.

•Developed a script that copies Avro-formatted data from HDFS to external tables in the raw layer.

•Created PySpark code that uses Spark SQL to generate data frames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.

•Owned the PySpark code that creates data frames from tables in the data service layer and writes them to a Hive data warehouse.

•Installed Airflow and created a database in PostgreSQL to store metadata from Airflow.

•Configured documents that allow Airflow to communicate to its PostgreSQL database.

•Developed Airflow DAGs in Python by importing the Airflow libraries.

•Utilized Airflow to automatically schedule, trigger, and execute the data ingestion pipeline.
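
The batch ingestion described above can be sketched as a short PySpark job: read one source table over JDBC into a DataFrame and land it in HDFS as Avro for the raw layer. Connection details, table names, and paths are placeholders, and the job assumes a JDBC driver and the spark-avro package are on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-ingest-raw").getOrCreate()

# Load a source table into a DataFrame over JDBC (placeholder connection details).
trades_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=markets")
    .option("dbtable", "dbo.trades")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Write the DataFrame to HDFS as Avro for the raw layer (placeholder path).
(
    trades_df.write.format("avro")
    .mode("overwrite")
    .save("hdfs:///data/raw/trades")
)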

Hadoop Distributed Data Engineer

Citizens Financial Group Providence, RI December 2013 – May 2015

•Administered and optimized data pipelines and ETL processing used to assess stocks for the financial strategy of this investment division.

•Migrated the needed data from Oracle and MySQL into HDFS using Sqoop and imported various formats of flat files into HDFS.

•Uploaded and processed more than 30 terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and Flume.

•Designed and created Hive external tables using a shared meta-store instead of Derby with partitioning, dynamic partitioning, and buckets.

•Used Pig as an ETL tool to perform transformations, joins, and some pre-aggregations before storing the data in HDFS.

•Analyzed the data by performing Hive queries and running Pig scripts to validate sales data.

•Worked on Solr to develop a search engine over unstructured data in HDFS.

•Used Solr indexing to enable searching on non-primary-key columns in Cassandra keyspaces.

•Developed custom processors in Java using Maven to add functionality in Apache NiFi for some additional tasks.

•Wrote Python code using the HappyBase library to connect to HBase (a minimal sketch follows this list).

•Used Spark SQL to process large amounts of structured data and implemented Spark RDD transformations and actions to migrate MapReduce algorithms.

•Used Tableau for data visualization and generating reports.

•Created SSIS packages to extract data from OLTP systems and load it into OLAP systems, scheduled jobs to call the packages and stored procedures, and created alerts for successful or unsuccessful completion of scheduled jobs.

•Developed Python scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation and queries, and wrote data back into the RDBMS through Sqoop.

•Developed Spark code using Scala and Spark-SQL for faster testing and data processing.

•Converted PL/SQL code into Scala and PL/SQL queries into HQL queries.

•Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.

•Environment: Cloudera Distribution CDH 5.5.1, Oracle 12c, HDFS, MapReduce, NiFi, Hive, HBase, Pig, Oozie, Sqoop, Flume, Hue, Tableau, Scala, Spark, Zookeeper, Apache Ignite, SQL, PL/SQL, UNIX shell scripts, Java, Python, AWS S3, Maven, JUnit, MRUnit.
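
The HappyBase bullet above refers to connecting Python to HBase through its Thrift gateway; below is a minimal sketch. Host, table name, row key, and column family are hypothetical.

import happybase

# Connect through the HBase Thrift gateway (placeholder host).
connection = happybase.Connection("hbase-thrift-host", port=9090)
table = connection.table("stock_quotes")

# Write one row keyed by symbol and date (hypothetical column family "quote").
table.put(
    b"AAPL#2015-06-01",
    {b"quote:open": b"130.28", b"quote:close": b"130.54"},
)

# Read the row back and print one cell.
row = table.row(b"AAPL#2015-06-01")
print(row.get(b"quote:close"))
connection.close()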

Hadoop Developer

Whirlpool Benton Harbor, MI November 2011 – November 2013

•Used Sqoop to efficiently transfer data between relational databases and HDFS and used Flume to stream log data from servers.

•Implemented partitioning and bucketing in Hive for better organization of the data (a minimal sketch follows this list).

•Worked with different file formats and compression techniques to meet standards.

•Loaded data from UNIX file systems to HDFS.

•Used UNIX shell scripts to automate the build process and to perform regular jobs like file transfers between different hosts.

•Assigned to production support, which involved monitoring server and error logs, anticipating and preventing potential problems, and escalating issues when necessary.

•Documented technical specs, data flows, information models, and class models using Confluence.

•Documented requirements gathered from stakeholders.

•Successfully loaded files to HDFS from Teradata and loaded from HDFS to Hive.

•Used ZooKeeper and Oozie for coordinating the cluster and scheduling workflows.

•Involved in researching various available technologies, industry trends, and cutting-edge applications. Performed data ingestion using Flume with Kafka as the source and HDFS as the sink.

•Performed storage capacity management, performance tuning, and benchmarking of clusters.
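
A minimal PySpark/HiveQL sketch of the partitioning referenced above: create a partitioned external Hive table and load it with dynamic partitioning. Database, table, and path names are hypothetical, and bucketing (CLUSTERED BY ... INTO n BUCKETS in the Hive DDL) is left out of the sketch for brevity.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-table-layout")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow dynamic partition inserts.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Hypothetical database and partitioned external table over a warehouse location.
spark.sql("CREATE DATABASE IF NOT EXISTS sales")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales.orders (
        order_id BIGINT,
        customer_id BIGINT,
        amount DECIMAL(12,2)
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
    LOCATION '/data/warehouse/sales/orders'
""")

# Load from a hypothetical staging table, letting Hive resolve the partitions.
spark.sql("""
    INSERT OVERWRITE TABLE sales.orders PARTITION (order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM sales.orders_staging
""")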

MySQL Database Administrator

Chicago Transit Authority Chicago, IL March 2009 – November 2011

•Designed and configured MySQL server cluster and managed each node on the Cluster.

•Responsible for MySQL installations, upgrades, performance tuning, etc.

•Collected and analyzed business requirements to derive conceptual and logical data models.

•Developed database architectural strategies at the modeling, design, and implementation stages.

•Translated a logical database design or data model into an actual physical database implementation.

•Mentored and worked with developers and analysts to review scripts and better querying.

•Performed security audit of MySQL internal tables and user access. Revoked access for unauthorized users.

•Set up replication for disaster and point-in-time recovery; replication was used to segregate various types of queries and simplify backup procedures (a minimal health-check sketch follows this list).

•Defined procedures to simplify future upgrades. Standardized all MySQL installs on all servers with custom configurations.

•Applied performance tuning to improve issues with a large, high-volume, multi-server MySQL installation for job applicant sites of clients.

•Modified database schema as needed.

•Analyzed and profiled data for quality and reconciled data issues using SQL.

•Performed regular database maintenance.

•Created and implemented database standards and procedures.

•Prepared documentation and specifications.
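
To illustrate the replication monitoring implied above, here is a minimal sketch using mysql-connector-python that checks replica status on each node; hosts and credentials are placeholders, and the checks actually used in this role were not necessarily implemented this way.

import mysql.connector

REPLICAS = ["replica1.example.internal", "replica2.example.internal"]  # placeholders

def check_replica(host):
    # Connect with a read-only monitoring account (placeholder credentials).
    conn = mysql.connector.connect(host=host, user="monitor", password="********")
    try:
        cur = conn.cursor(dictionary=True)
        cur.execute("SHOW SLAVE STATUS")  # SHOW REPLICA STATUS on newer MySQL versions
        status = cur.fetchone()
        if status is None:
            return f"{host}: not configured as a replica"
        running = (
            status["Slave_IO_Running"] == "Yes"
            and status["Slave_SQL_Running"] == "Yes"
        )
        return f"{host}: threads_running={running}, seconds_behind={status['Seconds_Behind_Master']}"
    finally:
        conn.close()

if __name__ == "__main__":
    for replica in REPLICAS:
        print(check_replica(replica))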

Database Administrator

Sogeti Redmond, WA May 2001 – March 2009

•Created and maintained databases in Microsoft SQL Server.

•Designed and established SQL applications.

•Created tables and views in the SQL database.

•Supported schema changes and maintained the database to perform in optimal conditions.

•Created and managed tables, views, user permissions, and access control (a minimal DDL sketch follows this list).

•Sent requests to a source REST-based API from a Scala script via a Kafka producer.

•Built Hive views on top of the source data tables and built a secured provisioning layer.

•Created and managed dynamic web parts.

•Customized library attributes and imported and exported existing data and data connections.

•Provided a workflow and initiated the workflow processes.

•Worked on SharePoint Designer and InfoPath Designer and developed workflows and forms.
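
A minimal pyodbc sketch of the SQL Server table and view work described above; the server, database, object names, and driver string are hypothetical and would differ for the SQL Server versions in use at the time.

import pyodbc

# Placeholder connection string; the driver name depends on the environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlhost;DATABASE=Reporting;UID=etl_user;PWD=********"
)
conn.autocommit = True
cur = conn.cursor()

# Create a hypothetical orders table if it does not already exist.
cur.execute("""
    IF OBJECT_ID('dbo.Orders', 'U') IS NULL
    CREATE TABLE dbo.Orders (
        OrderId    INT IDENTITY(1,1) PRIMARY KEY,
        CustomerId INT NOT NULL,
        Amount     DECIMAL(12,2) NOT NULL,
        OrderDate  DATETIME NOT NULL DEFAULT GETDATE()
    )
""")

# Recreate a reporting view over the table.
cur.execute("IF OBJECT_ID('dbo.vw_DailyOrderTotals', 'V') IS NOT NULL DROP VIEW dbo.vw_DailyOrderTotals")
cur.execute("""
    CREATE VIEW dbo.vw_DailyOrderTotals AS
    SELECT CAST(OrderDate AS DATE) AS OrderDay, SUM(Amount) AS TotalAmount
    FROM dbo.Orders
    GROUP BY CAST(OrderDate AS DATE)
""")
conn.close()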

Education

Ingeniero Industrial (Industrial Engineering) – Gerencia de Proyectos (Project Management), Universidad Alonso de Ojeda


