
SR. BIG DATA ENGINEER

Location: Tappahannock, VA
Posted: September 05, 2023


Areas of Exposure

Database SQL Server, MySQL, Firebird, HANA, and Cloud Storage

Hadoop, HDFS, Airflow, Kafka

Data Modeling, Data Engineering

Tableau

Hive, Apache Spark

Python, Scala, Spark

Extract, Transform, Load (ETL)

MS Office tools (Access, Excel, Project), Problem Solving

Analysis & Data Visualization Skills

Soft Skills

Work Experience

Profile Summary

•Black Belt Project Certified from the Six Sigma Institute, with 11+ years of extensive experience in all phases of the Big Data Engineering and project management life cycle.

•Distinguished by generating more than 15 million per year in savings for the company through implemented projects and more than 90 million from new clients.

•Excellent academic credentials in Mechatronic Engineering from Instituto Tecnológico de la Laguna, along with working proficiency in analytics and OLAP technologies and experience in agile development methodologies.

•Passionate about Big Data technologies and the delivery of effective solutions through creative problem solving with a track record of building large-scale systems using Big Data technologies.

•Demonstrated excellence in:

•Business intelligence tools like Tableau

•Strong skills in databases: SQL Server, MySQL, Firebird, HANA, and cloud storage.

•Hadoop Services

•Data Integration & Mining

•Programming in Scala, PySpark, Python, SAP API, C#, .NET, JavaScript

•Project techniques: Six Sigma, Kaizen, Scrum, Kanban

•Extensive experience with SQL and NoSQL solutions, including Cassandra, Hive, MongoDB, HBase, Snowflake, MySQL, Oracle, Postgres, and Microsoft SQL Server.

•Proficiency in data ingestion, processing, and manipulation on both on-premises and cloud platforms (AWS and Azure) across sectors such as healthcare, information technology, and finance.

•Understanding of the Hadoop architecture and its ecosystem, including HDFS, YARN, MapReduce, Sqoop, Avro, Spark, Hive, HBase, Flume, Kafka, and ZooKeeper

FITCH RATINGS, POWDER SPRINGS, GA

Sr. Big Data Engineer (Mar 2023 - Present)

Spearheaded the development and implementation of robust end-to-end AWS data pipelines designed to process and transform diverse data formats while ensuring seamless integration with various systems. Leveraging expertise in AWS services and big data technologies, orchestrated the ingestion, transformation, and storage of data using scalable and efficient solutions.

•Played a role in architecting and constructing end-to-end data pipelines on the AWS platform and leveraged services such as AWS Glue, Amazon S3, and other relevant AWS offerings to facilitate data movement and transformation

•Managed data in various formats, including structured, semi-structured, and unstructured data, and employed appropriate techniques and tools, such as Apache Spark, to ensure accurate parsing and transformation of these different data formats

•Employed Jupyter Notebooks and PySpark to design and develop Glue jobs, enabling the transformation of large files (such as CSV, nested XML, XLS, TXT) into normalized tables stored in the Parquet format (see the Glue job sketch after this job's bullet list)

•Implemented efficient data parsing, transformation, and schema inference techniques.

•Designed Lambda functions for small-file transformations: utilized Lambda functions to handle small files (CSV, nested XML, XLS, TXT) and performed the necessary transformations to convert them into normalized tables stored in the Parquet format (see the Lambda sketch after this job's bullet list)

•Developed robust and scalable Lambda functions to handle data processing requirements efficiently

•Implemented EventBridge rules to trigger the data pipeline based on specified events or schedules. Configured rules to automate the initiation of data processing workflows, ensuring timely and reliable execution

•Leveraged Step Functions to orchestrate the execution and coordination of Lambda functions and Glue jobs. Developed complex workflows, incorporating conditional branching and error handling, to ensure seamless and reliable data processing and transformation

•Created reusable and parameterized functions capable of ingesting various types of files without the need for manual intervention

•Implemented automation techniques to identify file types, apply appropriate transformations, and store the resulting data in the desired format

•Utilized the Glue Data Catalog, Crawlers, and classifiers: leveraged the AWS Glue Data Catalog to manage and organize metadata for the ingested data. Configured and utilized Glue Crawlers to automatically discover, classify, and populate metadata from various data sources. Employed classifiers to identify file types and infer schemas accurately

•Integrated DynamoDB to enable logging for Glue jobs and Lambda functions. Captured and stored essential information, including job execution details, errors, and metrics, for monitoring, troubleshooting, and auditing purposes

•Applied data transformation techniques to cleanse, validate, enrich, and harmonize data, ensuring its quality and consistency

•Implemented custom transformations and used AWS Glue for automatic schema inference and metadata management

•Designed and implemented data storage strategies using Amazon S3; by optimizing partitioning and bucketing techniques, enhanced query performance and reduced costs

•Integrated data catalogs with AWS Glue, enabling efficient data discovery, organization, and management; configured and maintained the catalog, ensuring accurate metadata representation and easy data accessibility

•Fine-tuned data processing workflows and optimized query performance by employing techniques such as partitioning, indexing, and query optimization

•Monitoring and troubleshooting: established robust monitoring systems to proactively detect and address issues within the data pipelines; utilized AWS CloudWatch and other monitoring tools to monitor pipeline performance, data quality, and system health

•Collaborated with cross-functional teams, including data scientists, analysts, and business stakeholders, to understand their requirements and deliver efficient data solutions

•Documented pipelines, workflows, and architectural decisions to ensure knowledge sharing and maintainable solutions
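
Illustrative sketch of the Glue job pattern described above (reading a large CSV, normalizing column names, and writing partitioned Parquet). This is a minimal, hypothetical example, not the production code: the job parameters, paths, and the ingestion_date partition column are placeholders.

```python
# Hypothetical Glue-style PySpark job: large CSV in, normalized Parquet out.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "target_path"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the raw CSV with header and schema inference.
raw_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(args["source_path"]))

# Normalize column names so downstream tables stay consistent.
normalized_df = raw_df.toDF(*[c.strip().lower().replace(" ", "_") for c in raw_df.columns])

# Write Parquet partitioned by an assumed ingestion-date column (placeholder).
(normalized_df.write
 .mode("overwrite")
 .partitionBy("ingestion_date")
 .parquet(args["target_path"]))
```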
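
A minimal sketch of the small-file Lambda pattern with DynamoDB logging, assuming a direct S3 object-created notification payload (an EventBridge-delivered event carries the same bucket/object fields under event["detail"]). The bucket, DynamoDB table, and output prefix are placeholders, and pandas/pyarrow are assumed to be available as a Lambda layer.

```python
# Hypothetical Lambda handler: convert a small CSV to Parquet, log the run to DynamoDB.
import io
import json
import urllib.parse
import boto3
import pandas as pd

s3 = boto3.client("s3")
log_table = boto3.resource("dynamodb").Table("file_pipeline_log")  # placeholder table name

def handler(event, context):
    # Pull the bucket/key out of the S3 event notification payload.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Read the small CSV straight from S3 and normalize column names.
    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(obj["Body"])
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # Write the normalized table back to S3 as Parquet (placeholder output prefix).
    out_key = "normalized/" + key.rsplit(".", 1)[0] + ".parquet"
    buf = io.BytesIO()
    df.to_parquet(buf, index=False)
    s3.put_object(Bucket=bucket, Key=out_key, Body=buf.getvalue())

    # Log the execution for monitoring, troubleshooting, and auditing.
    log_table.put_item(Item={
        "request_id": context.aws_request_id,
        "source_key": key,
        "rows": len(df),
        "status": "SUCCESS",
    })
    return {"statusCode": 200, "body": json.dumps({"output": out_key})}
```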

SOLUGLOB IKON S.A./ PISA BIOPHARM, LOS ANGELES, CA

Sr. Big Data Engineer (Mar 2015 – Feb 2023)

Built a variety of Big Data platforms to explicitly address the needs of health system pharmacies operating in the highly regulated markets of the US, Latin America, and the Caribbean. The company leverages world-class pharmaceutical and medical device manufacturing experience and a corporate culture that enables employees to make a positive difference in people's health and well-being.

•Responsible for all the information in databases, mainly for the health sector

•Designed and Implemented Stage 1:

•Designed a central repository using SQL Server.

•Created SQL Server stored procedures to gather information from Firebird into the central repository.

•Implemented SQL Server jobs to keep all information updated.

•Designed and implemented all on-premises services (Stage 2):

•Implemented Hadoop and HDFS as the framework for storing and ingesting data.

•Created a Spark application to push data from Firebird into Parquet files in HDFS (see the JDBC import sketch after this job's bullet list).

•Created a Spark application to push data from Parquet format into Hive.

•Used Airflow DAGs to schedule the imports (see the DAG sketch after this job's bullet list).

•Created a Kafka consumer to store streaming data to HDFS.

•Designed and implemented the migration from on-premises to the cloud (Stage 3):

•Implemented a Spark application to push Parquet files into AWS S3 buckets.

•Developed a Snowflake warehouse for the Central Repository

•Created Glue jobs using Jupyter Notebooks and PySpark to process data into Snowflake.

•Used Lambda functions to trigger Glue jobs when data files were stored in S3 buckets.

•Created a Spark application using AWS EMR to process the data and store it in Hive and S3 buckets.

•Used Lambda functions to trigger job submissions to the EMR cluster to load data into Hive.

•Created tables using Glue from S3 and used Athena to query the data.

•Used SNS to send messages to track the jobs.

•Implemented Spark jobs on GCP Dataproc.

•Consumed data from the Google Analytics platform into the data warehouse in GCP BigQuery.

•Created complex SQL queries in GCP BigQuery (see the query sketch after this job's bullet list).

•Worked in a multi-cloud environment with AWS and GCP.

•Designed all of the company's business intelligence dashboards, supported decision-making based on them, and provided solutions to achieve the company's objectives.

•Responsible for all logistics in the company, strategic planning of resources, and inventory replenishment.

•Implemented projects generating more than 15 million per year in savings for the company and more than 90 million from new clients.

•Designed all solutions for new clients and projects, led many improvement projects, and developed many processes now under my management (software development, planning, warehouse, call center, and home delivery).

•Implemented new technologies for the health sector, including automated pharmacies, software, and interfaces between equipment.

•Designed and implemented the company's new lines of business, such as software development using the Scrum methodology.

•Automated the collection of information from all company processes into databases (MySQL, SQL Server, Firebird, HANA, etc.) for analysis.
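
A minimal sketch of the Stage 2 Firebird-to-HDFS import, assuming the Jaybird (Firebird JDBC) driver is on the Spark classpath; the connection URL, credentials, table name, and HDFS path are placeholders.

```python
# Hypothetical Spark job: read a Firebird table over JDBC and land it as Parquet in HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("firebird_to_hdfs").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:firebirdsql://fb-host:3050//data/pharmacy.fdb")  # placeholder
      .option("driver", "org.firebirdsql.jdbc.FBDriver")
      .option("dbtable", "SALES")          # placeholder table
      .option("user", "sysdba")
      .option("password", "********")
      .load())

# Land the table as Parquet in a raw zone in HDFS (placeholder path).
df.write.mode("overwrite").parquet("hdfs:///warehouse/raw/sales")
```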
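
A minimal Airflow DAG sketch for scheduling the nightly Spark imports referenced above; the DAG id, schedule, and spark-submit commands are hypothetical.

```python
# Hypothetical Airflow 2.x DAG: run the Firebird import, then the Hive load, nightly.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="firebird_daily_import",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",  # nightly at 02:00
    catchup=False,
) as dag:

    import_sales = BashOperator(
        task_id="import_sales",
        bash_command="spark-submit /opt/jobs/firebird_to_hdfs.py",  # placeholder path
    )

    load_hive = BashOperator(
        task_id="load_hive",
        bash_command="spark-submit /opt/jobs/parquet_to_hive.py",  # placeholder path
    )

    import_sales >> load_hive
```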
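
A minimal sketch of querying the Google Analytics export in BigQuery from Python; the project, dataset, table, and column names are placeholders for illustration only.

```python
# Hypothetical BigQuery query over a Google Analytics export table.
from google.cloud import bigquery

client = bigquery.Client(project="analytics-dwh")  # placeholder project

query = """
    SELECT
      traffic_source,
      COUNT(DISTINCT user_pseudo_id) AS users
    FROM `analytics-dwh.ga_export.daily_events`
    WHERE event_date BETWEEN @start_date AND @end_date
    GROUP BY traffic_source
    ORDER BY users DESC
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("start_date", "DATE", "2022-01-01"),
        bigquery.ScalarQueryParameter("end_date", "DATE", "2022-01-31"),
    ]
)

for row in client.query(query, job_config=job_config).result():
    print(row.traffic_source, row.users)
```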

AFLAC - COLUMBUS, GA

Big Data Engineer (May 2014 – Feb 2015)

A Fortune 500 company providing financial protection to more than 50 million people worldwide. When a policyholder or insured gets sick or hurt, Aflac pays cash benefits promptly for eligible claims, directly to the insured (unless assigned otherwise).

Developed, maintained, and integrated application software; handled related project management activities and production support; worked closely with internal and external clients, business analysts, and team members to understand business requirements; developed and integrated application software, including unit testing and implementation efforts; and continued to maintain and support software implementations.

•Communicated deliverables status to stakeholders and facilitated periodic review meetings.

•Developed Spark streaming application to pull data from the cloud to Hive and HBase.

•Collected, aggregated, and shuffled data from servers to HDFS using Apache Spark & Spark Streaming.

•Worked on importing claims information between HDFS and RDBMS.

•Created Hive external tables, loaded data into them, and queried the data using HQL.

•Worked on streaming the prepared information to HBase using Spark.

•Performed performance calibration for Spark Streaming, e.g., setting the right batch interval time, the correct number of executors, and appropriate parallelism and memory settings.

•Used SparkSQL for creating and populating the HBase warehouse.

•Performed gradual cleansing and modeling of datasets.

•Utilized Avro tools to build the Avro schema and create an external Hive table using PySpark.

•Created managed and external tables to store ORC and Parquet files using HQL.

•Developed Apache Airflow scripts to automate the pipeline.

•Created a NoSQL HBase database to store the processed data from Apache Spark.

•Designed Spark streaming to receive real-time information from Kafka and store the stream information to HDFS (see the streaming sketch after this list).

•Migrated ETL jobs to Python scripts to perform transformations, joins, and aggregations before loading into HDFS.
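
A minimal sketch of the Kafka-to-HDFS stream described above, written with the Structured Streaming API for brevity (the original work could equally be expressed with DStreams); brokers, topic, and paths are placeholders, and the spark-sql-kafka package is assumed to be on the classpath.

```python
# Hypothetical streaming job: read claim events from Kafka, append them to HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("claims_stream_to_hdfs").getOrCreate()

stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder
             .option("subscribe", "claims-events")                            # placeholder
             .option("startingOffsets", "latest")
             .load()
             .select(col("key").cast("string"),
                     col("value").cast("string"),
                     col("timestamp")))

query = (stream_df.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/claims/stream")                # placeholder
         .option("checkpointLocation", "hdfs:///checkpoints/claims")  # placeholder
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```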

THERMOFISHER SCIENTIFIC INC., INDIANAPOLIS, IN

Data Engineer (Mar 2012 – Apr 2014)

As part of the team at Thermo Fisher Scientific, I did important work, such as helping customers find cures for cancer, protect the environment, and make sure our food is safe. My work there had a real-world impact, and I was supported in achieving my career goals as well.

As a Data Engineer, I played a key role in the Enterprise Data Management (EDS) data delivery organization, providing data solutions for critical business processes, IT systems, and IT solutions through project implementations, enhancements, and documentation.

•Configured Kafka producers with API endpoints using JDBC Autonomous REST Connectors.

•Architected pipeline ingestion from Kafka, integrating Kafka with Spark Streaming for high-speed data processing and enrichment.

•Wrote producer/consumer scripts in Python to process JSON responses (see the consumer sketch after this job's bullet list).

•Configured and deployed a production-ready multi-node Hadoop cluster with services such as Hive, Sqoop, Flume, and Oozie.

•Implemented advanced procedures of feature engineering for the data science team using in-memory computing capabilities like Apache Spark written in Scala.

•Experienced with batch processing of data sources using Apache Spark.

•Implemented different components on the cloud for the Kafka application messaging for data processing and analytics.

•Developed Spark applications using Spark Core, Spark SQL, and the Spark Streaming API.

•Performance tuning of Spark jobs for setting batch interval time, level of parallelism, and memory tuning.

•Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.

•Analyzed the medical claims data which consists of information about the patient, medical provider, and medical facilities.

•Built databases on medical claims data for both outpatient and inpatient services.

•Ingested RDBMS data into the Hadoop ecosystem by writing Sqoop jobs.

•Used Apache Hive to query and analyze the data.

•Created HBase tables from the internal tables in Apache Hive.
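
A minimal sketch of a Python consumer script that processes JSON responses from Kafka, using the kafka-python client as an assumed dependency; the topic, brokers, and field names are placeholders.

```python
# Hypothetical consumer: deserialize JSON messages and keep the fields downstream jobs need.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "api-responses",                      # placeholder topic
    bootstrap_servers=["broker1:9092"],   # placeholder brokers
    group_id="edm-json-processors",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    payload = message.value
    # Assumed field names, for illustration only.
    record = {
        "id": payload.get("id"),
        "status": payload.get("status"),
        "updated_at": payload.get("updated_at"),
    }
    print(record)  # in practice, the record would be written to HDFS or a staging store
```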

CATERPILLAR - TORREON, COAH

Quality Engineer (Jan 2011 – Feb 2012)

•Assisted in the development of QA software for our testing facility

•Oversaw the production of all software products created by our design and programming team

•Monitored and reviewed quarterly data based on suppliers’ reports

•Reviewed and processed all purchase orders following company policy

•Provided quarterly inspection data and quality assurance reports to senior management

•Participated in all QA meetings and provided software engineers and developers with necessary analysis tools

Education

Mechatronic Engineer from Instituto Tecnológico de la Laguna

Certifications

•Certified Advanced Tableau Desktop (Tableau)

•Six Sigma Black Belt Certification


