
Big Data Engineer

Location:
Atlanta, GA
Posted:
July 24, 2023

Resume:

Mario Andres De Regules Soto

Big Data & Cloud Engineer

Email: adyhf9@r.postjobfree.com Phone: 914-***-****

SUMMARY

A results-oriented professional with 7+ years of experience in Big Data development and data administration, proposing effective solutions through an analytical approach, with a track record of building large-scale systems using Big Data technologies.

●Worked extensively on Azure Data Factory, including data transformations, Integration Runtimes, Azure Key Vault, triggers, and migrating Data Factory pipelines to higher environments using ARM templates.

●Experienced with Spark, improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs; worked extensively with PySpark and Scala.

●Worked on Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations to load the results into Azure Synapse.

●Transformed data using AWS Glue DynamicFrames with PySpark; cataloged the transformed data using crawlers and scheduled the job and crawler using the Glue workflow feature (a representative Glue job is sketched at the end of this summary).

●Used Docker to containerize different infrastructure components so they are easy to manage across environments such as development, testing, and production.

●Excel at using Apache Hadoop to work with and analyze large data sets.

●Strong knowledge of Hive's analytical functions; extend Hive functionality by writing custom UDFs.

●Work with Data Lakes and Big Data ecosystems (Hadoop, Spark, Hortonworks, Cloudera).

●Achieved a 95% accuracy rate in identifying sentiment polarity in real-time data, enabling timely response to customer feedback and brand monitoring.

●Track record of results in an Agile methodology using data-driven analytics.

●Load and transform large sets of structured, semi-structured, and unstructured data working with data on Amazon Redshift, Apache Cassandra, and HDFS in Hadoop Data Lake.

●Experience handling XML files as well as Avro and Parquet SerDes.

●Performance tuning at the source, target, and DataStage job levels using indexes, hints, and partitioning in DB2 and Oracle.

●Skilled with BI tools like Tableau and PowerBI, data interpretation, modeling, data analysis, and reporting with the ability to assist in directing planning based on insights.

●Skilled in HDFS, Spark, Hive, Sqoop, HBase, and Zookeeper.

●Add value to Agile/Scrum processes such as Sprint Planning, Backlog, Sprint Retrospective, and Requirements Gathering and provide planning and documentation for projects.

●Skilled at writing SQL queries, stored procedures, triggers, cursors, and packages.

●Apply in-depth understanding/knowledge of Hadoop architectures and various components such as HDFS, MapReduce, Spark, and Hive.

●Create Spark Core ETL processes and automate them using a workflow scheduler.

●Gained experience applying streaming techniques to live data from big data sources using Spark and Scala; possess cloud platform experience with Azure, AWS, and GCP.

●Compile data from internal and external sources, leveraging new data collection processes such as geolocation information.

●Communicator with the ability to perform at a high level, meet deadlines, and adapt to ever-changing priorities.
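
Below is a minimal sketch of the AWS Glue pattern described in this summary (PySpark transforms over DynamicFrames, cataloging via crawlers, output picked up by a downstream workflow). The database, table, and bucket names are hypothetical placeholders, not actual project resources.

    # Hypothetical AWS Glue job: read from the Data Catalog, remap columns,
    # and write Parquet back to S3 for a crawler/workflow to register.
    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source table previously cataloged by a Glue crawler (placeholder names).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="raw_events"
    )

    # Simple column remap/typing as the transform step.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("event_id", "string", "event_id", "string"),
            ("event_ts", "string", "event_ts", "timestamp"),
            ("amount", "double", "amount", "double"),
        ],
    )

    # Write curated Parquet to S3; a second crawler registers this path.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-curated-bucket/events/"},
        format="parquet",
    )
    job.commit()

In a Glue workflow, this job and the crawler that catalogs its output would be chained as sequential workflow nodes.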

TECHNICAL SKILLS

●Languages: Python, SQL, Visual Basic, R, Command Line, Markdown

●Python Packages: NumPy, TensorFlow, Pandas, Scikit-Learn, SciPy, Matplotlib, Seaborn

●Programming & Scripting: Spark, Python, Java, Scala, Hive, Kafka, SQL

●Databases/NoSQL: Cassandra, HBase, Redshift, DynamoDB, MongoDB, MS Access, MySQL, Oracle, PL/SQL, SQL, RDBMS

●Cloud: Amazon Web Services (AWS), Azure, Google Cloud Platform (GCP)

●Amazon: Glue, Kinesis, EMR, MSK, S3, CloudWatch, Lambda.

●Azure: Databricks, Synapse, ADLS Gen2, Azure Functions.

●GCP: Cloud Storage, Cloud Dataflow, DataProc, Data Studio and BigQuery.

●Apache: Spark, Hive, HBase, Flink, Hadoop, HDFS, Zookeeper.

●CI/CD: Jenkins, GitHub, GitLab, Docker.

●Cluster Security: Kerberos, Ranger, IAM

●Visualization: Tableau, Power BI, QuickSight

●Scheduler Tool: Airflow, CRON jobs

PROFESSIONAL WORK EXPERIENCE

Sr. Data Engineer

IBM, Armonk, NY, Jun’21 to Present

This project involves developing a data engineering solution for customer churn prediction.

The goal is to identify customers who are likely to churn and take proactive measures to retain them. The project utilizes historical customer data, including demographic information, transaction history, customer interactions, and service usage patterns.

●Developed a cloud-based solution for gathering and analyzing historical data from sister companies and IoT data from manufacturing tools, using advanced data modeling techniques and Databricks as the data analysis platform to provide insights and easy-to-understand statistics and graphs for management decision-making and auditing.

●Designed Python-based notebooks for automated weekly, monthly, and quarterly reporting ETL.

●Automated data pipeline workflows using Apache Airflow and wrote scripts in Python.

●Applied advanced data modeling techniques to provide insights and visualizations of the data in an easy-to-understand format for management decision-making and auditing purposes.

●Presented data to managers and key stakeholders using data visualization tools such as Power BI.

●Conducted English-Spanish translation of the digitalization methodology for the company executives visiting from different countries.

●Worked in a Git development environment and wrote unit tests for the code using PyTest.

●Experienced in dimensional fact modeling (star schema, snowflake schema), transactional modeling, and SCDs (slowly changing dimensions).

●Developed and implemented Spark custom UDFs.

●Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to HDFS using Scala (see the streaming sketch at the end of this role).

●Employed Spark SQL and Python libraries like Pandas for efficient data cleansing, validation, processing, and analysis.

●Constructed and maintained robust data ingestion pipelines utilizing Azure Event Hubs, Azure Stream Analytics, and Azure Data Factory.

●Deployed Azure Analysis Services to empower business users with self-service data exploration and visualization tools.

●Leveraged Azure Databricks for distributed machine learning operations on extensive datasets.

●Utilized Azure Log Analytics for proactive monitoring and troubleshooting of data pipeline issues.

●Performed the migration of large data sets to Databricks (Spark); created and administered clusters, loaded data, and configured data pipelines, loading data from ADLS Gen2 into Databricks using ADF pipelines.
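
Below is a minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS flow mentioned in this role; the original implementation used Scala, and the broker address, topic, and paths here are placeholder assumptions.

    # Hypothetical structured-streaming job: Kafka topic -> Parquet files on HDFS.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka_to_hdfs").getOrCreate()

    # Placeholder broker and topic names.
    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "customer_events")
        .option("startingOffsets", "latest")
        .load()
    )

    # Kafka delivers key/value as binary; cast the payload to string for downstream parsing.
    events = stream.select(col("value").cast("string").alias("payload"))

    # Continuous append to HDFS with checkpointing for fault tolerance.
    query = (
        events.writeStream.format("parquet")
        .option("path", "hdfs:///data/raw/customer_events")
        .option("checkpointLocation", "hdfs:///checkpoints/customer_events")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()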

Cloud Engineer

InterSystems, Cambridge, MA, Jan’20 to May’21

InterSystems Corporation is a privately held vendor of software systems and technology for high-performance database management, rapid application development, integration, and healthcare information systems.

●Implemented data ingestion using Apache Kafka and AWS Kinesis to stream data from various sources into AWS S3.

●Utilized AWS Glue for data transformation and ETL processes to prepare data for analysis and visualization.

●Implemented AWS fully managed Kafka (Amazon MSK) streaming to send data streams from the REST API to a Spark cluster in AWS Glue.

●Consumed data from Kafka topics using Spark Streaming and transformed the processed data.

●Used AWS Glue to automate data processing and migration from on-premises systems to the cloud.

●Designed and optimized Spark SQL queries for data transformation. Extracted real-time financial data of Bitcoin and alt-coin prices every minute using a REST API.

●Involved in loading data from the UNIX file system to AWS S3.

●Set up and implemented Kafka brokers to write data to topics and utilize its fault tolerance mechanism.

●Created and managed topics inside Kafka.

●Configured a full Kafka cluster with a multi-broker setup for high availability.

●Involved in the complete Big Data flow of the application starting from data ingestion to the data warehouse.

●Used Spark where possible to achieve faster results.

●Performed architecture and data engineering for AWS cloud services, including AWS Cloud service planning, design, and DevOps support such as IAM user, group, role, and policy management.

●Proposed a serverless architecture to process data in AWS on an event-based architecture.

●Created modules for Apache Airflow to call different cloud services, including EMR, S3, Athena, Glue crawlers, Lambda functions, and Glue jobs (see the orchestration sketch at the end of this role).

●Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker-related tasks such as publishing data to S3, training an ML model, and deploying it for prediction.

●Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

●Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, and S3.
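
Below is a minimal Apache Airflow sketch of the orchestration pattern mentioned in this role, triggering a Glue job and then its crawler via boto3. The DAG id, job name, and crawler name are hypothetical, and a production pipeline would typically also poll run status or use the Amazon provider operators.

    # Hypothetical Airflow DAG: trigger a Glue job, then its crawler, via boto3.
    from datetime import datetime
    import boto3
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def start_glue_job():
        glue = boto3.client("glue")
        glue.start_job_run(JobName="curate_events_job")  # placeholder job name

    def start_glue_crawler():
        glue = boto3.client("glue")
        glue.start_crawler(Name="curated_events_crawler")  # placeholder crawler name

    with DAG(
        dag_id="glue_curation_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        run_job = PythonOperator(task_id="run_glue_job", python_callable=start_glue_job)
        run_crawler = PythonOperator(task_id="run_glue_crawler", python_callable=start_glue_crawler)
        run_job >> run_crawler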

Big Data Engineer

FourKites Inc., Chicago, IL, Jul’18 to Dec’19

Collaborated with a professional data engineering team on this project that aims to leverage Google Cloud Platform (GCP) to develop an efficient and scalable data engineering solution for managing and analyzing supply chain data. FourKites is a logistics visibility platform that provides real-time tracking and analytics for shipments.

●Extracted data from different databases and scheduled Apache Beam workflows to execute the task daily on Google Cloud Platform (GCP).

●Involved in ETL, Data Integration, and Migration using GCP services such as Cloud Storage, Cloud Dataflow, and BigQuery.

●Documented the requirements, including the existing code to be implemented using Google Cloud Dataproc and Apache Spark.

●Managed version control setup for the platform using Google Cloud Source Repositories.

●Performed analysis of user profiles and current application entitlements using Google Cloud Bigtable and Google Cloud Datastore.

●Built a recommendation system using Google Cloud AI Platform and Google Cloud Functions to auto-provision applications and platform access to new employees/contractors.

●Selected and built dashboards for internal usage using Google Cloud Data Studio.

●Developed multiple Spark Streaming and batch Spark jobs using Python on Google Cloud Dataproc.

●Implemented advanced feature engineering procedures for the data science team using in-memory computing capabilities such as Apache Spark on Google Cloud Dataproc in GCP.

●Wrote unit tests for all code using PyTest for Python.

●Used Google Cloud Pub/Sub for collecting real-time data and sending it to Google Cloud Dataproc for processing.

●Used Google Cloud Storage and Google Cloud Dataproc to store and process large amounts of data.

●Created data models in GCP BigQuery for efficient data analysis (a representative Dataproc-to-BigQuery load is sketched at the end of this role).
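
Below is a minimal PySpark sketch of the Dataproc-to-BigQuery load referenced in this role, assuming the spark-bigquery connector is available on the cluster; the bucket, dataset, table, and column names are placeholders.

    # Hypothetical Dataproc batch job: read shipment CSVs from Cloud Storage,
    # aggregate, and write the result to BigQuery via the spark-bigquery connector.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("shipments_to_bigquery").getOrCreate()

    shipments = (
        spark.read.option("header", True)
        .option("inferSchema", True)
        .csv("gs://example-raw-bucket/shipments/")  # placeholder path
    )

    # Daily on-time percentage per carrier (illustrative metric).
    daily = (
        shipments.groupBy("carrier", F.to_date("delivered_at").alias("delivery_date"))
        .agg(F.avg(F.col("on_time").cast("double")).alias("on_time_rate"))
    )

    (
        daily.write.format("bigquery")
        .option("table", "example_project.logistics.daily_on_time")  # placeholder table
        .option("temporaryGcsBucket", "example-staging-bucket")
        .mode("append")
        .save()
    )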

Big Data Engineer

iTechArt Group, New York, NY, Oct’16 to Jun’18

iTechArt is a top-tier, one-stop custom software development company with a talent pool of 3500+ experienced engineers. The company helps VC-backed startups and fast-growing tech companies build successful, scalable products that users love.

●Implemented Agile Methodology for building an internal application.

●Optimized Spark jobs for performance by leveraging techniques like data partitioning, caching, and tuning of cluster resources.

●Utilized databases and language ontologies.

●Performed cluster capacity and growth planning and recommended node configuration.

●Worked with highly unstructured and structured data.

●Optimized and integrated Hive, SQOOP, and Flume into existing ETL processes, accelerating the extraction, transformation, and loading of massive structured and unstructured data.

●Reduced query response time by 70% by optimizing data partitioning and using Hive bucketing techniques, facilitating faster data-driven decision-making (see the partitioning and bucketing sketch at the end of this role).

●Conducted data analysis and produced reports that provided insights into business operations and performance.

●Developed and maintained data models using NoSQL databases.

●Captured requirements with the management team to design reports that support decision-making.

●Built and customized integration systems using technologies such as SaaS, APIs, and web services.

●Implemented DevOps practices such as continuous integration and delivery using Git, Jenkins, and Terraform.
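
Below is a minimal PySpark sketch of the partitioning, caching, and bucketing optimizations mentioned in this role; the table names, join key, and bucket count are placeholder assumptions.

    # Hypothetical tuning pattern: cache a reused dataset, repartition on the join key,
    # and persist a bucketed/partitioned table so repeated queries avoid full shuffles.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("query_tuning").enableHiveSupport().getOrCreate()

    orders = spark.table("raw.orders")          # placeholder source table
    customers = spark.table("raw.customers")    # placeholder source table

    # Cache the smaller, frequently reused dimension.
    customers.cache()

    # Repartition the fact table on the join key to reduce shuffle skew.
    enriched = (
        orders.repartition(200, "customer_id")
        .join(customers, "customer_id")
    )

    # Persist as a table partitioned by date and bucketed on the join key.
    (
        enriched.write.mode("overwrite")
        .partitionBy("order_date")
        .bucketBy(32, "customer_id")
        .sortBy("customer_id")
        .saveAsTable("analytics.orders_enriched")  # placeholder target table
    )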

Jr. Data Engineer

The Bridgespan Group, Boston, Massachusetts, Oct’15 to Sep’16

The Bridgespan Group is a U.S. nonprofit organization in Boston, Massachusetts that provides management consulting to nonprofits and philanthropists. In addition to consulting, Bridgespan makes case studies.

●Used different file formats such as Text files, Sequence Files, and Avro for data processing in the Hadoop system.

●Loaded data from various data sources into the Hadoop Distributed File System (HDFS) using Kafka.

●Integrated Kafka with Spark Streaming for real-time data processing in Hadoop.

●Streamed analyzed data to Hive Tables using Sqoop, making it available for data visualization.

●Tuned and operated Spark and its related technologies, such as Spark SQL and Spark Streaming.

●Used the Hive JDBC to verify the data stored in the Hadoop cluster.

●Used Cloudera Hadoop (CDH) distribution with Elasticsearch.

●Designed and implemented data preprocessing and transformation steps using MapReduce, ensuring data quality and compatibility for downstream analysis (see the Hadoop Streaming sketch at the end of this role).

●Used Hive to analyze data ingested into HBase via Hive-HBase integration and computed various metrics for reporting on the dashboard.
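
Below is a minimal Hadoop Streaming sketch of the MapReduce preprocessing step mentioned in this role; the field layout, paths, and streaming command are placeholder assumptions.

    # streaming_job.py - hypothetical Hadoop Streaming step, run as mapper or reducer.
    # Example submission (placeholder paths):
    #   hadoop jar hadoop-streaming.jar -input /data/raw -output /data/clean \
    #     -mapper "python streaming_job.py map" -reducer "python streaming_job.py reduce" \
    #     -file streaming_job.py
    import sys

    def run_mapper(stream):
        # Drop malformed rows and emit category<TAB>amount.
        for line in stream:
            parts = line.rstrip("\n").split(",")
            if len(parts) < 3:
                continue
            _, category, amount = parts[0], parts[1], parts[2]
            try:
                print(f"{category}\t{float(amount)}")
            except ValueError:
                continue

    def run_reducer(stream):
        # Sum amounts per category; streaming delivers lines sorted by key.
        current_key, total = None, 0.0
        for line in stream:
            key, value = line.rstrip("\n").split("\t")
            if current_key is not None and key != current_key:
                print(f"{current_key}\t{total}")
                total = 0.0
            current_key = key
            total += float(value)
        if current_key is not None:
            print(f"{current_key}\t{total}")

    if __name__ == "__main__":
        run_reducer(sys.stdin) if sys.argv[1] == "reduce" else run_mapper(sys.stdin)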

PREVIOUS EXPERIENCE

Web Application Developer

ESQUEMA Consulting, Tijuana, Baja California, Mexico, Sep’14- Oct’15

EDUCATION

Bachelor of Science: Biomedical Engineering from Tecnologico De Monterrey Campus Monterrey - Monterrey, Nuevo Leon, Mexico

Post-Baccalaureate program: Specialization in Bionic Technology from P4H Bionics - Mexico D.F


