Sr. Big Data Engineer

Location:
Lawrence, MA, 01843
Posted:
April 27, 2023

Iván Tabaré

Email: adwnbt@r.postjobfree.com Phone: 351-***-****

Profile Summary

A seasoned professional with nearly 8 years of experience developing Big Data solutions both on-premises and in the cloud

Skilled in SQL and in working with Relational Database Management Systems, including PostgreSQL and MSSQL Server

Recognized for outstanding work with the Bioenetix Award in 2021

Skilled at working in the AWS ecosystem, with a strong grasp of its architecture and experience delivering data-transfer projects

Ability to troubleshoot and tune code in SQL, Java, Python, Scala, and HiveQL.

Able to design elegant solutions from problem statements.

Collaborating with data scientists, data analysts, and other stakeholders to understand their data needs and ensure that the big data infrastructure supports their requirements.

Expert at working with HDFS and importing data from RDBMSs into HDFS using Sqoop

Certified in Data Science, Machine Learning, and Natural Language Processing at universities such as MIT and Stanford

Strong understanding of Spark architecture, with hands-on experience developing Spark applications in Python.

Hands-on experience with AWS and the integration between multiple services in the cloud.

Proficient in working on AWS tools (Redshift, Kinesis, S3, EC2, EMR, DynamoDB, Elasticsearch, Athena, Firehose, Lambda)

Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory

Created classes that model real-life objects and wrote loops to perform actions on the collected data

In-depth knowledge of real-time ETL/Spark analytics using Spark SQL with visualization
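
As an illustration of this kind of real-time Spark SQL pipeline, a minimal PySpark Structured Streaming sketch follows; the Kafka source, schema, and console sink are assumptions for illustration, not details of any specific project below.

    # Minimal sketch: stream JSON events from Kafka, aggregate with Spark SQL.
    # Broker address, topic, schema, and sink are hypothetical; the Kafka
    # source also requires the spark-sql-kafka package on the classpath.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("realtime-etl-sketch").getOrCreate()

    schema = StructType([
        StructField("ts", TimestampType()),
        StructField("metric", StringType()),
        StructField("value", DoubleType()),
    ])

    events = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load())

    parsed = (events
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*"))
    parsed.createOrReplaceTempView("events")

    # Windowed aggregation expressed directly in Spark SQL.
    agg = spark.sql("""
        SELECT window(ts, '1 minute') AS w, metric, AVG(value) AS avg_value
        FROM events
        GROUP BY window(ts, '1 minute'), metric
    """)

    # Console sink for the sketch; a real job would feed a dashboard or Parquet.
    agg.writeStream.outputMode("complete").format("console").start().awaitTermination()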

Extensive experience in performance tuning, including SQL and Spark query tuning

Possess strong analytical skills for troubleshooting and problem-solving

CERTIFICATIONS

Monterrey BTC Embassy Tokenomist

MIT Data Science and Machine Learning: Making Data-Driven Decisions

Stanford Certificate of Achievement in Natural Language Processing with Deep Learning

Imperial College Business School Professional Certificate in Machine Learning and AI

WORK EXPERIENCE

Senior Data Engineer (AWS)

Schneider Electric, Andover, MA, Jan’22-Present

Creating pipelines in AWS for end-to-end ETL processes

Implementing data governance and security protocols using AWS Identity and Access Management (IAM) to ensure that sensitive data is protected.

Collaborating with data scientists and analysts to develop machine learning models.

Containerizing Confluent Kafka applications and configuring subnets for communication between containers.

Creating and maintaining documentation for the company's big data infrastructure and systems, including design diagrams, system configurations, and best practices

Optimizing data processing systems for performance and scalability using Amazon EMR and Amazon EC2.

Installing and using the AWS Command Line Interface (CLI) to upload files to and download files from S3 buckets
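
The same S3 interactions can also be scripted with boto3; a minimal sketch (bucket and key names are hypothetical):

    # boto3 equivalent of the CLI upload/download flow; names are illustrative.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file("reports/daily.csv", "example-bucket", "raw/daily.csv")
    s3.download_file("example-bucket", "raw/daily.csv", "reports/daily_copy.csv")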

Designing and implementing data pipelines to extract, transform, and load data from various sources into an Amazon S3 data lake or Amazon Redshift data warehouse.

Building real-time streaming systems using Amazon Kinesis to process data from sensors as it is generated.

Developing and maintaining data models using Amazon Athena to represent the relationships between different data elements.

Communicating with multiple stakeholders and project managers.

Creating Apache Airflow jobs to orchestrate pipeline executions.
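
A minimal sketch of what such an Airflow DAG can look like; the DAG ID, schedule, and commands are assumptions, not the production definitions:

    # Minimal Airflow 2.x DAG: three dependent tasks run daily.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="etl_pipeline_sketch",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        transform = BashOperator(task_id="transform", bash_command="echo transform")
        load = BashOperator(task_id="load", bash_command="echo load")
        extract >> transform >> load  # linear dependency chain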

Transforming data stored on legacy mainframe systems into modern storage formats, tied into MySQL on AWS RDS.

Developing a data pipeline in which data is streamed with Kinesis, reformatted with Firehose and Lambda, stored in S3, preprocessed with Spark on EMR, and finally loaded into Redshift for further analysis, all automated with Airflow.
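
The Lambda step in such a pipeline typically follows the standard Kinesis Data Firehose transformation contract; a hedged sketch, with the payload field names as assumptions:

    # Firehose transformation Lambda: decode each record, reshape the JSON,
    # and return it base64-encoded with a per-record status.
    import base64
    import json

    def handler(event, context):
        output = []
        for record in event["records"]:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical reformatting: keep only the fields Spark needs.
            slim = {"ts": payload.get("timestamp"), "value": payload.get("value")}
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(json.dumps(slim).encode()).decode(),
            })
        return {"records": output}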

Designing and implementing data pipelines to collect, store, and process large amounts of data from different types of machines.

Big Data Engineer

Coinbase (Remote), Jul’20-Jan’22

Implemented real-time crypto market analysis pipeline for Coinbase using AWS technologies.

Set up data ingestion process using Amazon Kinesis Streams from multiple exchanges, including Coinbase.

Processed data using Amazon Kinesis Analytics to generate insights into market trends and anomalies.

Stored processed data in Amazon S3 as Parquet files, partitioned and organized by date and cryptocurrency.
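
A minimal PySpark sketch of that partitioned Parquet layout (bucket, columns, and values are illustrative):

    # Write processed records to S3 as Parquet, partitioned by date and
    # cryptocurrency; this produces paths like .../trade_date=.../currency=.../
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()
    df = spark.createDataFrame(
        [("2021-06-01", "BTC", 36000.0)],
        ["trade_date", "currency", "price"],
    )
    (df.write.mode("append")
        .partitionBy("trade_date", "currency")
        .parquet("s3://example-bucket/market-data/"))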

Used Amazon Athena to run ad-hoc SQL queries on data stored in Amazon S3.
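
Ad-hoc Athena queries can also be driven from Python with boto3; a sketch, with the database, table, and output location as assumptions:

    # Start an Athena query; results land at the S3 output location.
    import boto3

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString="SELECT currency, MAX(price) AS peak FROM trades GROUP BY currency",
        QueryExecutionContext={"Database": "market_data"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
    print(resp["QueryExecutionId"])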

Visualized insights using Amazon QuickSight for business users to understand and act on the data.

Loaded processed data into Amazon Redshift to support more complex analysis and reporting.

Enabled data scientists and analysts to run complex queries and build predictive models that inform business decisions.

Implemented customer behavior analysis pipeline for Coinbase using Databricks technologies.

Collected customer behavior data from multiple sources and stored it in Delta Lake.

Processed customer behavior data using Databricks Spark to create a single source of truth for analysis.

Used Databricks Notebooks to run complex data analytics algorithms and build machine learning models to understand customer behavior.

Visualized results using Databricks Dashboards for business users to understand and act on the data.

Loaded processed data into a Delta Lake table for more complex analysis and reporting.
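
A minimal sketch of loading into Delta (table and column names are hypothetical; assumes a Delta-enabled Spark session, as on Databricks):

    # Append processed rows to a managed Delta table and read them back.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # Delta support assumed
    processed = spark.createDataFrame(
        [("u123", "login", "2021-09-01")],
        ["user_id", "event", "event_date"],
    )
    processed.write.format("delta").mode("append").saveAsTable("analytics.customer_behavior")
    behavior = spark.table("analytics.customer_behavior")  # ACID snapshot read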

Enabled data scientists and analysts to run complex queries and build predictive models using Delta Lake and Databricks Spark.

Provided a collaborative environment for data scientists, engineers, and business users to work together on big data projects using Databricks Notebooks, Dashboards, and Models.

Big Data Engineer

Power Dental USA, McHenry, IL, Jan’19-Jul’20

Designed and implemented a big data engineering pipeline for managing patient dental records using Microsoft Azure technologies including Azure Data Lake Storage, Azure Databricks, and Azure Machine Learning.

Collected and stored dental records of patients from multiple dental clinics in Azure Data Lake Storage.

Cleaned and pre-processed the collected data using Azure Databricks, removing duplicates and transforming the data into a format suitable for analysis.

Transformed the cleaned data using Azure Databricks to create features for analysis, extracting demographic information, treatment history, and insurance information from the records.

Analyzed the transformed data using Azure Machine Learning to identify patterns and trends in patient dental health.

Visualized the results of the analysis using Power BI to help dental professionals make informed decisions about patient treatment plans.

Designed and implemented a big data engineering pipeline for managing dental supply inventory using Databricks and Azure Data Lake Storage.

Collected and stored data on dental supplies purchased and used by multiple dental clinics in Azure Data Lake Storage.

Cleaned and pre-processed the collected data using Databricks, handling outliers and missing values and transforming the data into a format suitable for analysis.
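
A minimal PySpark sketch of that cleaning step (source path, column names, and thresholds are assumptions):

    # Deduplicate, fill missing values, and filter obvious outliers.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    raw = spark.read.json("abfss://supplies@example.dfs.core.windows.net/raw/")
    clean = (raw
        .dropDuplicates(["order_id"])                   # remove duplicate orders
        .na.fill({"quantity": 0})                       # default missing quantities
        .filter(F.col("unit_cost").between(0, 10000)))  # drop outlier costs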

Transformed the cleaned data using Databricks to create features for analysis, extracting information on the type of supplies, quantity, and cost from the records.

Analyzed the transformed data using Databricks to identify patterns and trends in dental supply usage.

Built a predictive model using Databricks to forecast future dental supply needs based on historical data and current trends.

Visualized the results of the analysis and predictive modeling using Databricks notebooks to help dental professionals make informed decisions about their supply management.

Data Engineer / Data Analyst

CEMEX Inc., Houston, Texas, Jul’17-Jan’19

Designed and implemented a data engineering pipeline on Google Cloud Platform (GCP) to collect real-time data from construction sites using sensors, cameras, and other IoT devices.

Leveraged Google Cloud Pub/Sub and Dataflow for real-time data loading and transformation into Google Cloud Storage and BigQuery.

Conducted real-time analysis using Google Cloud Dataproc and BigQuery on data stored in Google Cloud Storage and integrated results with existing data sources such as ERP systems.

Developed dashboards and reports using Google Data Studio to make results of the analysis available to stakeholders in real-time.

Built a cost-effective archival platform using Google Cloud Storage and BigQuery for data storage.

Connected various data centers and transferred data using Cloud Data Transfer and Cloud Pub/Sub.
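
A minimal Cloud Pub/Sub publish sketch (project, topic, and payload are hypothetical):

    # Publish a message to a Pub/Sub topic and wait for the server ack.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("example-project", "site-telemetry")
    future = publisher.publish(topic_path, data=b'{"site": "A1", "load_kg": 1250}')
    print(future.result())  # message ID once acknowledged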

Utilized Google Cloud SQL and Bigtable for task execution and data storage.

Developed a full-service catalog system with a full workflow using Google Cloud Monitoring, Logging, and Pub/Sub.

Verified the data stored in the Google Cloud Storage and BigQuery using SQL and JDBC drivers.

Designed and implemented a data engineering pipeline on GCP to collect data from various sources such as ERP systems, GPS trackers, and manual inputs.

Implemented data loading and transformation into Google Cloud Storage and BigQuery using Google Cloud Dataflow and Spark for ETL and pipeline.

Conducted data analysis using Google BigQuery and Dataproc on data stored in Google Cloud Storage to identify inefficiencies and bottlenecks in the supply chain.
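
A sketch of the kind of ad-hoc BigQuery analysis involved (project, dataset, and columns are assumptions):

    # Run a SQL aggregation in BigQuery and iterate over the result rows.
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        SELECT route_id, AVG(transit_hours) AS avg_transit
        FROM `example-project.supply_chain.shipments`
        GROUP BY route_id
        ORDER BY avg_transit DESC
        LIMIT 10
    """
    for row in client.query(sql).result():
        print(row.route_id, row.avg_transit)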

Utilized machine learning algorithms such as TensorFlow and AutoML to predict demand and optimize the supply chain.

Integrated results of the analysis with existing systems such as transportation management systems to automate processes.

Built a cost-effective archival platform using Google Cloud Storage and BigQuery for data storage.

Utilized shell scripts to dump data from RDBMS to Google Cloud Storage.

Used different file formats such as CSV, JSON, and Avro for data processing in Google Cloud Storage and BigQuery.

Integrated Google Cloud Pub/Sub with Spark Streaming for real-time data processing in Google Cloud Storage and used Google Cloud Dataproc for distributed computing.

Developed dashboards and reports to make results of the analysis available to stakeholders using Google Data Studio.

Utilized Google Compute Engine to create instances with Hadoop installed and running.

Transferred data using the Google Cloud Storage Transfer Service and used BigQuery for cloud data storage.

Data Engineer / Data Analyst

Sirius XM Holdings Inc., NY, Jul’16-Jul’17

Ingested data using Flume with Kafka as the source and HDFS as the sink.

Performed storage capacity management, performance tuning, and benchmarking of clusters.

Collected and aggregated large amounts of log data using Apache Flume and staging data in HDFS for further analysis.

Installed and configured Hive and wrote Hive UDFs.

Converted ETL processes to run on the Hadoop Distributed File System (HDFS) and wrote Hive UDFs.

Transferred data between a Hadoop ecosystem and structured data storage in MySQL RDBMS using Sqoop.

Loaded data from the UNIX file system to HDFS.

Worked on cluster coordination services using ZooKeeper.

Loaded log data directly into HDFS using Flume.

Exported data from DB2 to HDFS using Sqoop.

Assisted in exporting analyzed data to relational databases using Sqoop.

Developed workflow utilizing Oozie for running MapReduce jobs and Hive.

Researched available technologies, industry trends, and cutting-edge applications.

Data Engineer

Greyhound Lines Inc., Dallas, Texas, Dec’14-Jun’16

Worked with both unstructured and structured data.

Optimized and integrated Hive, Sqoop, and Flume into existing ETL processes, accelerating the extraction, transformation, and loading of massive structured and unstructured data.

Used Hive to emulate a data warehouse for client-based transit system analytics.

Monitored production cluster by setting up alerts and notifications using metrics thresholds.

Designed, configured, and installed Hadoop clusters.

Performed Hadoop system administration using Hortonworks/Ambari and Linux system administration (RHEL 7, CentOS).

Set up Apache Ranger for cluster ACLs and audits to meet compliance specifications.

Performed HDFS balancing and fine-tuning for MapReduce applications.

Produced data migration plan for other data sources into the Hadoop system.

Applied open-source configuration management and deployment using Puppet and Python.

Configured YARN Capacity and Fair scheduler based on organizational needs.

Performed cluster capacity and growth planning and recommended node configuration.

Tuned MapReduce counters for faster and optimal data processing.

Helped design backup and disaster recovery methodologies involving Hadoop clusters and related databases.

Performed upgrades, patches, and fixes using either the rolling or express method.

Developed Flume-like ingestion tools and HiveQL scripts to extract, transform, and load data into the database.

Created scheduling scripts in Python to automate data pipeline and data transfer.
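
A minimal sketch of such a scheduling script (the command and interval are illustrative; production versions would add logging and retries):

    # Run a transfer job on a fixed interval using only the standard library.
    import subprocess
    import time

    INTERVAL_SECONDS = 3600  # hourly

    while True:
        started = time.time()
        subprocess.run(["python", "transfer_data.py"], check=False)  # hypothetical job
        time.sleep(max(0, INTERVAL_SECONDS - (time.time() - started)))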

Wrote shell scripts for migrating data to a data storage solution.

Technical Skills

IDE: Eclipse, IntelliJ, PyCharm, DBeaver, Workbench

PROJECT METHODS: Agile, Scrum, DevOps, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Design Thinking

HADOOP DISTRIBUTIONS: Apache Hadoop, Cloudera Hadoop, Hortonworks Hadoop

CLOUD PLATFORMS: Amazon AWS, Azure, Google Cloud Platform (GCP)

CLOUD SERVICES: Databricks, Snowflake

DATABASES & DATA WAREHOUSES: Cassandra, Apache HBase, MySQL, MongoDB, SQL Server, Redshift, Synapse, BigQuery, Snowflake, Hive

FILE SYSTEMS: HDFS, S3, Azure Blob, GCS

ETL TOOLS: Sqoop, AWS Glue, Azure Data Factory, Apache Airflow

DATA VISUALIZATION TOOLS: Tableau, Power BI

DATA QUERY: Spark SQL, DataFrames

PROGRAMMING LANGUAGES & FRAMEWORKS: Java, Python, Scala, R, Spark, Spark Streaming, PySpark

SCRIPTING: Hive, MapReduce, SQL, Spark SQL, Shell Scripting

CONTINUOUS INTEGRATION (CI-CD): Jenkins, Git, Bitbucket

Education

Engineer in Physics, ITESM, MTY

Concentration in Advanced Data Analytics, ITESM, MTY

Semester of Big Data and Advanced Data Analytics, ITESM, MTY

Courses

Learn C++ Course

Learn Git & GitHub Course

Handling Missing Data Course

Learn the Command Line Course

Learn Web Scraping with Beautiful Soup Course

Python Object-Oriented Programming for Java Developers

Learn Java Course


