Sr. Big Data Engineer

Location: Atlanta, GA, 30309
Posted: July 24, 2023

Guillermo Hernández González

Senior Big Data Engineer

Email: adyhek@r.postjobfree.com; Phone: 941-***-****

PROFILE

•Dynamic and motivated IT professional with 9+ years of experience in the field of Big Data.

•Expertise in designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, cloud data engineering, data modeling, data warehouse/data mart, data visualization, reporting, and data quality solutions.

•Collaborated with DevOps teams to identify business requirements and implemented CI/CD, leading to a 30% reduction in time-to-market for new data-driven products.

•Experience with Big Data Technologies such as Amazon Web Services (AWS), Microsoft Azure, GCP, Databricks, Kafka, Spark, Hive, Sqoop and Hadoop

•Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.

•Improved model accuracy by 20% by incorporating advanced data transformations and feature engineering techniques.

•In-depth knowledge of Hadoop architecture and its components, including HDFS, YARN, and MapReduce.

•Understanding of Delta Lake architecture, including Delta Lake tables, transactions, schema enforcement, and time travel capabilities in Databricks and Snowflake.

•Hands-on experience working with data warehouses and designing performant data models.

•Expertise with NoSQL databases such as HBase and Cassandra to provide low-latency access.

•Well versed in Spark performance tuning at the source, target, and data stage job levels using indexes, hints, and partitioning.

•Worked on data governance and data quality.

•Utilized PL/SQL and SQL to create queries and develop Python-based designs and programs.

•Developed custom data processing algorithms using Python and Spark to handle large volumes of data.

•Experienced in designing and developing data pipelines using GCP data services such as BigQuery, Cloud Dataflow, Cloud Dataproc, and Cloud Composer.

•Experience with Streaming Processing using PySpark and Kafka.

•Prepared test cases, documented them, and performed unit and integration testing.

•Automatically detected and ingested new files into a data lake using Databricks Auto Loader (see the Auto Loader sketch after this list).

•Experienced in creating Azure Data Factory pipelines, datasets, and linked services

•Proven experience in developing and deploying data engineering solutions on Databricks.

•Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie.

•Demonstrated knowledge of automation testing and the SDLC using Waterfall and Agile models.

•Created tailored reports that extracted data for reporting tools such as Tableau, Power BI, and Amazon QuickSight.

•Skilled in setting up Airflow production environments with high availability

•Utilized containerization technologies such as Docker and Kubernetes to build and deploy applications

•Developed scripts and automation tools to improve development and deployment processes
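
Illustrative Auto Loader sketch for the ingestion bullet above (a minimal sketch, not a project artifact; it assumes a Databricks notebook where spark is predefined, and the paths and schema below are hypothetical placeholders):

```python
# Minimal Databricks Auto Loader sketch (assumes a Databricks runtime where the
# "cloudFiles" source is available and `spark` is predefined in the notebook).
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

landing_path = "gs://example-bucket/landing/events/"       # hypothetical source path
bronze_path = "gs://example-bucket/delta/bronze/events/"   # hypothetical Delta target
checkpoint_path = "gs://example-bucket/_checkpoints/events/"

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("payload", StringType()),
    StructField("event_ts", TimestampType()),
])

# Auto Loader incrementally detects new files as they land in the source path.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(event_schema)
    .load(landing_path)
)

# Append the detected records to a Delta table, tracking progress via the checkpoint.
(
    stream.writeStream.format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .start(bronze_path)
)
```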

TECHNICAL SKILLS

Programming Languages: Python, Scala, Java, C, C++, SQL

Databases/Data warehouses: MS SQL Server, Oracle, DB2, MySQL, PostgreSQL, Snowflake, Hive, Redshift, BigQuery

NoSQL Databases: HBase, Cassandra, MongoDB.

Cloud Platforms: AWS, MS Azure, GCP

Big Data Primary Skills: MapReduce, Sqoop, Hive, Kafka, Spark, Cloudera, Databricks, Zookeeper

CICD: GitHub, Docker, Jenkins, Terraform and Kubernetes.

Orchestration Tools: Airflow, Oozie, Cloud Composer, AWS MWAA

Data Analytics: Tableau, MS PowerBI, Amazon QuickSight.

Operating Systems: UNIX/Linux, Windows

Cloud Services:

•AWS S3, EMR, Lambda Functions, Step Functions, Redshift Spectrum, Redshift, RDS, Quicksight, DynamoDB, CloudFormation, CloudWatch, SNS, SES, SQS.

•Azure Data Factory, Azure Databricks, Azure Data Lake Gen2, Azure SQL, Azure HDInsight.

•GCP Dataproc, BigQuery, Dataflow, Cloud Composer.

Testing Tools: PyTest, Selenium, ScalaTest.

PROFESSIONAL EXPERIENCE

Sr Data Engineer

Home Depot Inc. Dec 2020 - Present

(Atlanta, GA)

•Teamed with 5 developers to build APIs that enabled the analytics team to increase reporting speed by 20-30% in 2 weeks.

•Automated ETL processes across billions of rows of data, saving 45 hours of manual work per month.

•Read CSV and JSON files from Google Cloud Storage in GCP to obtain the information required by clients and partners, using lambda functions in an event-driven architecture.

•Deposited clean Parquet files into Google Cloud Storage in GCP to provide the information to partners and clients.

•Worked on designing, loading, and querying data in BigQuery, Cloud Storage, Cloud SQL, and other GCP services.

•Troubleshot and resolved data-related issues on GCP.

•Built end-to-end ETL pipelines using Python and GCP services.

•Processed data using Spark jobs with PySpark in Databricks to clean the information and then stored or integrated it into the enterprise data warehouse in BigQuery.

•Successfully implemented Docker containers to containerize applications, resulting in a 50% reduction in deployment time and improved scalability and portability across different environments.

•Created multiple Apache Airflow jobs in Python in Cloud Composer to orchestrate pipelines and synchronize them with the payroll calendars (see the DAG sketch after this list).

•Developed complex SQL queries and stored procedures to extract, transform, and load data from source systems into the data warehouse.

•Built data processing models using PySpark and Spark SQL in Databricks to generate insights for specific purposes.

•Processed and stored Parquet files in the data lake on GCS in GCP for easy access and analysis.

•Worked under an Agile/Scrum methodology, using Jira to track tickets and project progress for the data pipeline.

•Created ETL pipelines to load data from multiple data sources into the Databricks Delta Lake in a multi-layer architecture.
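
Illustrative Cloud Composer DAG sketch for the orchestration bullet above (a minimal sketch, assuming Airflow 2.x; the DAG id, schedule, and task callables are hypothetical placeholders, not the actual production pipeline):

```python
# Hypothetical Airflow 2.x DAG for Cloud Composer sketching an
# extract -> transform -> load orchestration pattern.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_sources(**context):
    """Pull CSV/JSON extracts from Google Cloud Storage (placeholder logic)."""
    print("extracting source files")


def transform_in_databricks(**context):
    """Trigger the PySpark cleaning job in Databricks (placeholder logic)."""
    print("submitting transformation job")


def load_to_bigquery(**context):
    """Load curated Parquet into the BigQuery warehouse (placeholder logic)."""
    print("loading warehouse tables")


with DAG(
    dag_id="payroll_pipeline",          # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",      # placeholder for the payroll calendar
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_sources", python_callable=extract_sources)
    transform = PythonOperator(task_id="transform", python_callable=transform_in_databricks)
    load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)

    extract >> transform >> load
```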

Lead Data Engineer

Berkshire Hathaway Inc. Jul 2018 – Dec 2020

(Omaha, NE)

•Oversaw a team of 5 data engineers and collaborated with company management, recommending changes based on data history and tests.

•Designed and implemented data archiving and backup solutions using AWS Glacier and AWS Backup, ensuring data durability and availability.

•Identified and documented test cases, test scenarios, and test data

•Designed and implemented data models using star schema and snowflake schema for efficient querying and reporting.

•Developed Cloud-based Big Data Architecture using AWS.

•Developed and maintained data pipeline, ingesting data across 12 disparate sources using Redshift, S3, and Python

•Created HBase tables, loading with data and writing HBase queries to process data.

•Established customer rapport through a recommended loyalty program that drove subscriptions up by 16%

•Communicated with business departments to understand needs and requests in order to build data pipelines for analyzing technical issues

•Implemented data governance policies and procedures to ensure data quality, consistency, and compliance.

•Conducted performance tuning and optimization of ETL processes to improve overall system performance.

•Designed, developed, and tested Spark SQL clients with PySpark.

•Established data collection via REST APIs: built an HTTPS client-server connection, sent GET requests, and published the responses through a Kafka producer (see the sketch after this list).

•Used Spark to parse out the needed data via the Spark SQL context, selecting features with target information and assigning names.

•Decoded raw data from JSON and streamed it using the Kafka producer API.

•Integrated Kafka with Spark Streaming for real-time data processing using structured streaming.

•Conducted exploratory data analysis and built a management dashboard for weekly reporting.

•Utilized transformations and actions in Spark to interact with data frames to show and process data.

•Hands-on with Spark Core, Spark SQL, and the DataFrame/Dataset/RDD APIs.

•Split JSON files into DataFrames to be processed in parallel for better performance and fault tolerance.
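
Illustrative sketch of the REST-to-Kafka-to-Spark flow described above (a minimal sketch, assuming the kafka-python client and the Spark Kafka connector are available; the endpoint, topic, and broker addresses are hypothetical placeholders):

```python
# Illustrative only: a REST GET -> Kafka producer, plus a Spark Structured Streaming
# consumer of the same topic. Endpoint, topic, and brokers are placeholders.
import json

import requests
from kafka import KafkaProducer  # kafka-python client
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

BROKERS = "broker1:9092,broker2:9092,broker3:9092"   # hypothetical 3-broker cluster
TOPIC = "api-events"                                  # hypothetical topic


def publish_api_responses() -> None:
    """Pull records over HTTPS and publish them to Kafka as JSON."""
    producer = KafkaProducer(
        bootstrap_servers=BROKERS.split(","),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    response = requests.get("https://api.example.com/v1/events", timeout=30)
    for record in response.json():
        producer.send(TOPIC, value=record)
    producer.flush()


def read_stream_with_spark():
    """Consume the topic with Structured Streaming (requires the spark-sql-kafka package)."""
    spark = SparkSession.builder.appName("kafka-stream").getOrCreate()
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("payload", StringType()),
    ])
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", BROKERS)
        .option("subscribe", TOPIC)
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )
```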

Sr Data Engineer

CVS Health Corporation Jan 2018 – Jul 2018

(Woonsocket, RI)

•Oversaw the migration from Oracle to Redshift, saving $250,000 with a performance increase of 14%

•Performed ETL on streaming as well as batch data using PySpark and loaded the data into Amazon S3 buckets.

•Used AWS Kinesis and Kinesis Firehose to process real-time streaming data and store it in S3 buckets.

•Integrated Glue with Kinesis for serverless processing of streaming data

•Used the AWS SDK for Python (Boto3) to write Kinesis producers (see the producer sketch after this list).

•Created test environments on various Amazon EC2 instances.

•Used the AWS console to manage and monitor EMR and different applications running on it.

•Built real-time data processing systems using Apache Kafka and Apache Flink, and integrated them with AWS Kinesis for streaming data analytics.

•Created AWS CloudFormation templates used with Terraform and existing plugins.

•Developed AWS CloudFormation templates to create the custom infrastructure of the pipeline.

•Developed multiple PySpark Streaming and batch Spark jobs on EMR

•Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3.

•Decoded raw data and loaded it into JSON before sending the batched streaming file over the Kafka Producer.

•Specified nodes and performed data analysis queries on Amazon Redshift Clusters on AWS.

•Hands-on with AWS Kinesis for processing very large amounts of real-time data.

•Populated database tables via AWS Kinesis Firehose and AWS Redshift.

•Utilized a cluster of three brokers to handle replication needs and allow for fault tolerance.

•Used EC2, ECS, ECR, AWS CodePipeline, AWS CodeCommit, AWS CodeBuild, and AWS CodeDeploy to help teams automate the process of developing, testing, and deploying applications.
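
Illustrative Boto3 Kinesis producer sketch for the bullet above (a minimal sketch, assuming AWS credentials are already configured; the stream name, region, and payload shape are hypothetical placeholders):

```python
# Minimal Boto3 Kinesis producer sketch; stream name, region, and record
# fields are placeholders, not values from the roles above.
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")


def publish_event(event: dict, stream_name: str = "orders-stream") -> None:
    """Send one JSON-encoded record to Kinesis, partitioned by a record key."""
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event.get("order_id", "default")),
    )


if __name__ == "__main__":
    publish_event({"order_id": 123, "status": "shipped"})
```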

Data Engineer

State Farm Insurance Sep 2014 – Jan 2018

(Bloomington, IL)

•Worked with Spark in Azure Databricks and configured it with external libraries to develop multiple data pipelines.

•Created data pipelines using Azure Data Factory to get data from SQL Server and load it into Azure Data Lake.

•Processed the data using Databricks based on the requirements and loaded the processed data into Azure Synapse for business intelligence.

•Processed data flows and streaming data using Spark in HDInsight, then stored the results in the data lake.

•Developed data integration solutions using Talend and integrated them with AWS services for seamless data transfer and transformation.

•Wrote SQL queries in Azure Synapse to analyze data in the enterprise data warehouse and created stored procedures.

•Migrated on-premises architecture to Azure services.

•Optimized data ingestion in Kafka Brokers within the Kafka cluster by partitioning Kafka Topics.

•Consumed data from a REST-based API via a Python script and forwarded it to a Kafka producer.

•Performed data scrubbing and processing with Azure Data Factory and the Databricks Jobs API (see the PySpark sketch after this list).

•Provided connections from business intelligence tools such as Databricks and Power BI to the tables in the data warehouse.

•Created stored procedures in Azure Synapse to process data in the Enterprise Datawarehouse and orchestrated them using Azure Data Factory.

•Improved and fixed bugs in existing Azure HDInsight pipelines and flows.

•Designed and implemented data validation and quality control processes, utilizing Scala and Spark's built-in functionality.

•Modeled schema for Azure Synapse data warehouse.

•Created and maintained clusters for Kafka, Hadoop, and Spark in HDInsight.
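
Illustrative PySpark scrubbing sketch for the bullets above (a minimal sketch, assuming an Azure Databricks notebook where spark is predefined; the ADLS Gen2 paths and column names are hypothetical placeholders):

```python
# Minimal PySpark clean-and-load sketch for an Azure Databricks notebook;
# container names, paths, and columns are illustrative placeholders.
from pyspark.sql import functions as F

raw_path = "abfss://raw@examplelake.dfs.core.windows.net/policies/"         # hypothetical
curated_path = "abfss://curated@examplelake.dfs.core.windows.net/policies/" # hypothetical

raw = spark.read.option("header", True).csv(raw_path)

# Basic scrubbing: drop duplicates, trim strings, and standardize a date column.
curated = (
    raw.dropDuplicates(["policy_id"])
    .withColumn("policy_holder", F.trim(F.col("policy_holder")))
    .withColumn("effective_date", F.to_date(F.col("effective_date"), "yyyy-MM-dd"))
)

# Persist the curated layer for downstream Synapse / Power BI consumption.
curated.write.mode("overwrite").parquet(curated_path)
```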

DBA (Remote)

Centene Corporation Sep 2013 – Sep 2014

(St. Louis, MO)

•Set up a multi-node Hadoop cluster in pseudo-distributed mode.

•Installed, configured, monitored, and administered Linux servers.

•Monitored CPU, memory, hardware, and software including raid, physical disk, multipath, filesystems, and networks using the Nagios monitoring tool.

•Automated daily tasks using bash scripts while documenting the changes in the environment and each server, analyzing the error logs, user logs, and /var/log messages.

•Created and modified users and groups with root permissions.

•Administered local and remote servers using SSH daily.

•Ensured the NameNode and Secondary NameNode of the Hadoop cluster were healthy and available.

•Created and maintained Python scripts for automating build and deployment processes.

•Created users, managed user permissions, maintained user and file system quotas, and installed and configured DNS.

•Adhered to industry standards by securing systems, directory and file permissions, groups, and supporting user account management along with the creation of users.

•Performed kernel and database configuration optimization such as I/O resource usage on disks.

•Analyzed and monitored log files to troubleshoot issues.

EDUCATION

Computer Engineering (Ingeniería en Computación)

Universidad Autónoma de Querétaro (UAQ)


