Senior Data Engineer

Location:
Atlanta, GA
Posted:
March 09, 2023


RENAN CALDERON

advf2i@r.postjobfree.com 417-***-****

Profile Summary

A seasoned professional with over 8 years of experience developing custom big data solutions, both on-premises and in the cloud.

Skilled in SQL and in working with relational database management systems, including PostgreSQL and MS SQL Server

Skilled at working in the AWS ecosystem, with a strong understanding of its architecture and hands-on experience delivering data-transfer projects

Able to troubleshoot and tune code in SQL, Java, Python, and Scala, as well as Hive, RDD, DataFrame, and MapReduce workloads. Able to design elegant solutions from problem statements

Build and configure virtual environments in the cloud to support enterprise data platforms

Collaborate with data scientists, data analysts, and other stakeholders to understand their data needs and ensure the big data infrastructure supports their requirements

Expert at working with HDFS and importing RDBMS data into HDFS using the Sqoop tool

Used Spark extensively, deepening Python programming skills and knowledge of Spark's architecture

Hands-on experience integrating Hive and HBase, with a clear understanding of how the two differ

Developed, implemented, and administered data solutions in the cloud

Created classes that model real-world objects and wrote loops to perform actions on the collected data

Proficient in working on AWS tools (Redshift, Kinesis, S3, EC2, EMR, DynamoDB, Elasticsearch, Athena, Firehose, Lambda)

Gained experience in Cloudera and Hortonworks Hadoop distributions

Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory

In-depth knowledge of real-time ETL/Spark analytics using Spark SQL with visualization

Strong technical skills, including proficiency in programming languages such as Scala or Python, and experience with big data technologies such as Hadoop and Spark.

Strong analytical skills for troubleshooting and problem-solving.

Communicate with multiple stakeholders and project managers.

Extensive experience in performance tuning, for instance, SQL and Spark query tuning

Technical Skills

IDE: Eclipse, IntelliJ, PyCharm, DBeaver, Workbench

PROJECT METHODS: Agile, Scrum, DevOps, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Design Thinking

HADOOP DISTRIBUTIONS: Hadoop, Cloudera Hadoop, Hortonworks Hadoop

CLOUD PLATFORMS: Amazon AWS, Azure, Google Cloud Platform (GCP)

AWS SERVICES: AWS RDS, AWS EMR, AWS Redshift, AWS S3, AWS Lambda, AWS Kinesis, AWS Elasticsearch (ELK), AWS CloudFormation, AWS IAM

CLOUD SERVICES: Databricks, Snowflake

DATABASES & DATA WAREHOUSES: Cassandra, Apache HBase, MySQL, MongoDB, SQL Server, Redshift, Synapse, BigQuery, Snowflake, Hive

FILE SYSTEMS: HDFS, S3, Azure Blob, GCS

ETL TOOLS: Sqoop, AWS Glue, Azure Data Factory, Apache Airflow

DATA VISUALIZATION TOOLS: Tableau, Power BI

DATA QUERY: Spark SQL, DataFrames

PROGRAMMING LANGUAGES & FRAMEWORKS: Java, Python, Scala, R, PySpark, Spark, Spark Streaming

SCRIPTING: Hive, MapReduce, SQL, Spark SQL, Shell Scripting

CONTINUOUS INTEGRATION (CI/CD): Jenkins, Git, Bitbucket

WORK EXPERIENCE

Senior Data Engineer (AWS)

Silicon Valley Bank, Remote

Aug’19-Present

Created different pipelines in AWS for end-to-end ETL processes.

Designing and implementing data pipelines to extract, transform, and load data from various sources into an Amazon S3 data lake or Amazon Redshift data warehouse.

Building real-time streaming systems using Amazon Kinesis to process data as it is generated.

Developing and maintaining data models using Amazon Athena or AWS Glue to represent the relationships between different data elements.

Processed text data with the Natural Language Toolkit (NLTK) to count important words and generated word clouds
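
Illustrative sketch only (not code from the project): one way this word-count step can look, assuming NLTK and the wordcloud package are installed; the input file path is a placeholder.

import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

nltk.download("punkt")
nltk.download("stopwords")

with open("transcripts.txt") as f:          # hypothetical input file
    text = f.read().lower()

tokens = [t for t in word_tokenize(text) if t.isalpha()]
stop = set(stopwords.words("english"))
counts = Counter(t for t in tokens if t not in stop)   # frequencies of the important words

# Render the frequencies as a word cloud image
WordCloud(width=800, height=400).generate_from_frequencies(counts).to_file("wordcloud.png")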

Implementing data governance and security protocols using AWS Identity and Access Management (IAM) and Amazon Macie to ensure that sensitive data is protected.

Collaborating with data scientists and analysts to develop machine learning models using Amazon SageMaker for tasks such as fraud detection, risk assessment, and customer segmentation. Developed and implemented recovery plans and procedures.

Created Apache Airflow jobs to orchestrate pipeline executions.
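
A minimal Airflow 2.x DAG sketch of this kind of orchestration; the DAG id, schedule, and task callables are illustrative assumptions, not the original jobs.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # hypothetical extract step
    pass

def load():
    # hypothetical load step
    pass

with DAG(
    dag_id="etl_pipeline",              # illustrative DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task           # run extract before load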

Migrated data stored on legacy mainframe systems to more modern storage, tying into MySQL on AWS RDS.

Developed a data pipeline in which files stored on an AWS EC2 instance were decompressed, sent to an AWS S3 bucket, and transformed to ASCII. Rules were applied to the data based on DDL from the client's Oracle database administrator, with data conversions and other transformations handled by custom Python scripts.
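
A simplified sketch of the decompress-and-upload step, assuming gzip-compressed source files and boto3 credentials already configured on the instance; bucket and path names are placeholders.

import gzip
import shutil
import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are configured on the EC2 instance

def decompress_and_upload(src_path: str, bucket: str, key: str) -> None:
    # Decompress the gzip file locally, then push the result to the S3 landing bucket.
    out_path = src_path.removesuffix(".gz")
    with gzip.open(src_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    s3.upload_file(out_path, bucket, key)

decompress_and_upload("/data/incoming/extract_001.dat.gz",   # placeholder paths
                      "my-landing-bucket", "raw/extract_001.dat")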

Set up cloud compute engines in managed and unmanaged modes and handled SSH key management

Containerized a Confluent Kafka application and configured a subnet for communication between containers.

Creating and maintaining documentation for the bank's big data infrastructure and systems, including design diagrams, system configurations, and best practices

Applied embedded JSON schemas to validate incoming data in Kafka and stored the validated records in another topic
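
One way such a validate-and-forward step can look, sketched with the kafka-python and jsonschema packages; the client library, topic names, and schema shown are assumptions for illustration.

import json
from kafka import KafkaConsumer, KafkaProducer
from jsonschema import validate, ValidationError

SCHEMA = {"type": "object",
          "properties": {"id": {"type": "string"}, "amount": {"type": "number"}},
          "required": ["id", "amount"]}          # illustrative schema

consumer = KafkaConsumer("raw-events", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

for message in consumer:
    try:
        validate(instance=message.value, schema=SCHEMA)     # enforce the embedded schema
        producer.send("validated-events", message.value)    # forward valid records to the next topic
    except ValidationError:
        producer.send("dead-letter-events", message.value)  # route invalid records aside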

Optimizing data processing systems for performance and scalability using Amazon EMR and Amazon EC2.

Installed AWS command line interface (CLI) to interact with S3 bucket to download and upload files

Developed AWS CloudFormation templates to create custom pipeline infrastructure.

Attended daily meetings to review the results of transformed tables and to see if they met the requirements of the client

Ran Python scripts to initiate a custom data pipeline used to download files, transform the data within them, and upload the results to a MySQL database server

Setting up and maintaining data storage systems, such as Amazon S3 and Amazon Redshift, to ensure data is properly stored and easily accessible for analysis. Designing and implementing data pipelines to collect, store, and process large amounts of data from various sources, such as bank transactions and customer information.

Creating and optimizing data processing workflows using AWS services such as Amazon EMR and Amazon Kinesis to process and analyze large amounts of data in a timely and efficient manner.

Big Data Engineer

O'Reilly Auto Parts, Springfield, MO

Oct’18-Jul’19

Worked one on one with clients to resolve issues regarding Spark performance.

Interacted with data residing in HDFS using Spark to process the data.

Split JSON files into DataFrames to be processed in parallel for better performance and fault tolerance using Apache Spark.

Created DataFrames in Apache Spark by passing an explicit schema as a parameter to the ingested data, using Azure Databricks.
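
A small PySpark sketch of reading JSON into a DataFrame with an explicit schema, as described above; the field names and paths are illustrative placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("json-ingest").getOrCreate()

schema = StructType([                      # explicit schema passed at read time
    StructField("order_id", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("created_at", TimestampType(), True),
])

df = (spark.read
      .schema(schema)                      # avoids costly schema inference over many files
      .json("/mnt/raw/orders/*.json"))     # placeholder path

df.repartition(8).write.mode("overwrite").parquet("/mnt/curated/orders")  # parallel downstream processing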

Participated in the migration from Cloudera Hadoop to Azure Cloud.

Established a connection between HBase and Spark for quick transfer of transactions.

Monitored background operations in Hortonworks Ambari.

Configured Zookeeper to coordinate the servers to maintain data consistency and monitor services.

Involved in analyzing system failures, identifying root causes, and recommending a course of action.

Created standardized documents for company usage.

Created multi-node Hadoop and Spark clusters on EMR instances to process terabytes of data.

Implemented analytics solutions through Agile/Scrum process for development and quality assurance using Azure DevOps.

Automated, configured, and deployed instances on Azure Cloud

Created Azure Data Factory pipelines to orchestrate jobs and executions.

Developed Spark code to process data as part of end-to-end data pipelines using Azure infrastructure and Spark in Databricks.

Worked with blob storage, ADF, Databricks, key vault, and Synapse to create different ETL processes for multiple business requirements.

Data Engineer

RadioShack, Fort Worth, TX

May’17-Oct’18

Developed a PySpark application to read data from various file system sources, apply transformations, and write the results to a NoSQL database
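
A condensed sketch of that read-transform-write pattern, assuming the MongoDB Spark connector (10.x) is on the classpath; the source path, column names, and connection URI are placeholders, and another NoSQL sink would use its own connector.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("file-to-nosql")
         .config("spark.mongodb.write.connection.uri",           # connector 10.x style config (assumed)
                 "mongodb://localhost:27017/sales.orders")
         .getOrCreate())

raw = spark.read.option("header", True).csv("s3a://raw-bucket/orders/*.csv")  # placeholder source

cleaned = (raw
           .withColumn("amount", F.col("amount").cast("double"))   # enforce types
           .filter(F.col("amount") > 0)                            # drop invalid rows
           .withColumn("ingested_at", F.current_timestamp()))      # add audit column

cleaned.write.format("mongodb").mode("append").save()   # write to the NoSQL target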

Installed, configured, and monitored a Kafka cluster.

Architected a lightweight Kafka broker and integrated Kafka with Spark for real-time data processing.

Created Hive external tables and designed data models in Apache Hive.

Developed multiple Spark Streaming and batch Spark jobs using Scala and Python on AWS.

Implemented advanced procedures of feature engineering for the data science team using in-memory computing capabilities like Apache Spark written in Scala.

Implemented Rack Awareness in the Production Environment.

Collected data using REST API, built HTTPS connection with client-server, sent GET request, and collected response in Kafka Producer.

Imported data from web services into HDFS and transformed data using Spark.

Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.

Used Spark SQL for creating and populating the HBase warehouse.

Worked with Spark Context, Spark SQL, DataFrames, and Pair RDDs.

Ingested data through AWS Kinesis Data Stream and Firehose from various sources to S3.
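
A stripped-down producer-side sketch with boto3 for the Kinesis ingestion; the stream name, region, partition key, and record shape are illustrative, and Firehose delivery to S3 is configured separately in AWS.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")   # assumes configured AWS credentials

def publish(event: dict) -> None:
    # Push one event onto the data stream.
    kinesis.put_record(
        StreamName="clickstream-events",        # placeholder stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event.get("user_id", "unknown")),
    )

publish({"user_id": 42, "action": "page_view"})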

Extracted data from different databases and scheduled Oozie workflows to execute the task daily.

Worked with Amazon Web Services (AWS) and was involved in ETL, Data Integration, and Migration.

Documented the requirements, including the existing code to be implemented using Spark, Amazon DynamoDB, Redshift, and Elasticsearch.

Worked on AWS Kinesis for processing huge amounts of real-time data.

Jr. Data Engineer / Data Analyst

Fundación Rafael Dondé, Remote

Jul’15-May’17

Designed a cost-effective archival platform for storing Big Data using Hadoop and its related technologies

Developed a task execution framework on EC2 instances using SQL and DynamoDB

Built a Full-Service Catalog System with a full workflow using CloudWatch, Kibana, Kinesis, Elasticsearch, and Logstash

Connected various data centers and transferred data using Sqoop and ETL tools in the Hadoop system.

Imported data from disparate sources into Spark RDDs for data processing in Hadoop

Used shell scripts to dump the data from MySQL to Hadoop Distributed File System (HDFS)

Built a prototype for real-time analysis using Spark Streaming and Kafka in the Hadoop system.

Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop, Spark, and Hive for ETL, pipelines, and Spark Streaming, acting directly on the Hadoop Distributed File System (HDFS).

Extracted data from RDBMSs (Oracle, MySQL) to the Hadoop Distributed File System (HDFS) using Sqoop

Used the NoSQL database MongoDB for implementation and integration work.

Configured Oozie workflow engine scheduler to run multiple Sqoop, Hive, and Pig jobs in the Hadoop system.

Consumed the data from the Kafka queue using Storm and deployed the application jar files into AWS instances.

Collected business requirements from subject matter experts and data scientists.

Transferred data using the Informatica tool from AWS S3 and used AWS Redshift for cloud data storage.

Used different file formats such as Text files, Sequence Files, and Avro for data processing in the Hadoop system.

Loaded data from various data sources into the Hadoop Distributed File System (HDFS) using Kafka.

Integrated Kafka with Spark Streaming for real-time data processing in Hadoop, and used the Cloudera Hadoop (CDH) distribution with Elasticsearch.

Used machine images to create instances with Hadoop installed and running.

Streamed analyzed data to Hive tables using Sqoop, making it available for data visualization

Used the Hive JDBC driver to verify the data stored in the Hadoop cluster

EDUCATION DETAILS

Bachelor's degree in Actuarial Science, Facultad de Matemáticas, Universidad Autónoma de Yucatán


