RENAN CALDERON
***************************@*****.*** 417-***-****
Profile Summary
A seasoned professional offering over 8 years of experience in the development of custom Big Data Solutions both on-premises and in the cloud.
Skilled in SQL and in developing on relational database management systems (RDBMS), including PostgreSQL and Microsoft SQL Server
Skilled at working in the AWS ecosystem, understanding its architecture, and delivering data transfer projects
Ability to troubleshoot and tune code written in SQL, Java, Python, and Scala, as well as Hive, Spark RDD, DataFrame, and MapReduce jobs. Able to design elegant solutions from problem statements
Build and configure virtual environments in the cloud to support Enterprise Data
Collaborating with data scientists, data analysts, and other stakeholders to understand their data needs and ensure that the bank's big data infrastructure supports their requirements
Expert at creating HDFS storage and importing RDBMS data into HDFS using Sqoop
Used Spark to strengthen Python programming skills and understanding of Spark architecture
Hands-on experience with Hive and HBase integration, with an understanding of how the two differ
Developed, implemented and administered data solutions on cloud
Created classes that simulate real-life objects and wrote loops to perform actions on the collected data
Proficient in working on AWS tools (Redshift, Kinesis, S3, EC2, EMR, DynamoDB, Elasticsearch, Athena, Firehose, Lambda)
Gained experience in Cloudera and Hortonworks Hadoop distributions
Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory
In-depth knowledge of real-time ETL/Spark analytics using Spark SQL with visualization
Strong technical skills, including proficiency in programming languages such as Scala or Python, and experience with big data technologies such as Hadoop and Spark.
Strong analytical skills for troubleshooting and problem-solving.
Communicate with multiple stakeholders and project managers.
Extensive experience in performance tuning, for instance, SQL and Spark query tuning
Technical Skills
IDE: Eclipse, IntelliJ, PyCharm, DBeaver, Workbench
PROJECT METHODS: Agile, Scrum, DevOps, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Design Thinking
HADOOP DISTRIBUTIONS: Hadoop, Cloudera Hadoop, Hortonworks Hadoop
CLOUD PLATFORMS: Amazon AWS, Azure, Google Cloud Platform (GCP)
AWS SERVICES: AWS RDS, AWS EMR, AWS Redshift, AWS S3, AWS Lambda, AWS Kinesis, AWS ELK, AWS CloudFormation, AWS IAM
CLOUD SERVICES: Databricks, Snowflake
DATABASES & DATA WAREHOUSES: Cassandra, Apache HBase, MySQL, MongoDB, SQL Server, Redshift, Synapse, BigQuery, Snowflake, Hive
FILE SYSTEMS: HDFS, S3, Azure Blob, GCS
ETL TOOLS: Sqoop, AWS Glue, Azure Data Factory, Apache Airflow
DATA VISUALIZATION TOOLS: Tableau, Power BI
DATA QUERY: Spark SQL, DataFrames
PROGRAMMING LANGUAGES & FRAMEWORKS: Spark, Spark Streaming, Java, Python, Scala, PySpark, R
SCRIPTING: Hive, MapReduce, SQL, Spark SQL, Shell Scripting
CONTINUOUS INTEGRATION (CI/CD): Jenkins, Git, Bitbucket
WORK EXPERIENCE
Senior Data Engineer (AWS)
Silicon Valley Bank, Remote
Aug’19-Present
Created different pipelines in AWS for end-to-end ETL processes.
Designing and implementing data pipelines to extract, transform, and load data from various sources into an Amazon S3 data lake or Amazon Redshift data warehouse.
Building real-time streaming systems using Amazon Kinesis to process data as it is generated.
Developing and maintaining data models using Amazon Athena or AWS Glue to represent the relationships between different data elements.
Processed data with a natural language toolkit to count important words and generated word clouds
Implementing data governance and security protocols using AWS Identity and Access Management (IAM) and Amazon Macie to ensure that sensitive data is protected.
Collaborating with data scientists and analysts to develop machine learning models using Amazon SageMaker for tasks such as fraud detection, risk assessment, and customer segmentation.
Developing and implementing recovery plans and procedures.
Created Apache Airflow jobs to orchestrate pipeline executions.
Migrated data stored on legacy mainframe computers to modern storage formats, integrated with MySQL on AWS RDS.
Developed a data pipeline in which files were stored on an AWS EC2 instance, decompressed, sent to an AWS S3 bucket, and converted to ASCII. Rules were applied to the data based on DDL from the client's Oracle database administrator. Data conversion and transformation were performed by a custom Python script, among other data manipulation requirements.
Set up cloud compute engines in managed and unmanaged modes and handled SSH key management
Containerized a Confluent Kafka application and configured a subnet for communication between containers.
Creating and maintaining documentation for the bank's big data infrastructure and systems, including design diagrams, system configurations, and best practices
Applied embedded JSON schemas to incoming data in Kafka and stored the results in another topic
Optimizing data processing systems for performance and scalability using Amazon EMR and Amazon EC2.
Installed the AWS Command Line Interface (CLI) to download files from and upload files to S3 buckets
Developed AWS CloudFormation templates to create custom infrastructure for the pipeline.
Attended daily meetings to review the results of transformed tables and to see if they met the requirements of the client
Ran Python scripts to initiate a custom data pipeline that downloaded files, transformed the data within them, and uploaded the results to a MySQL database server
Setting up and maintaining data storage systems, such as Amazon S3 and Amazon Redshift, to ensure data is properly stored and easily accessible for analysis.
Designing and implementing data pipelines to collect, store, and process large amounts of data from various sources, such as bank transactions and customer information.
Creating and optimizing data processing workflows using AWS services such as Amazon EMR and Amazon Kinesis to process and analyze large amounts of data in a timely and efficient manner.
Big Data Engineer
O'Reilly Auto Parts, Springfield, MO
Oct’18-Jul’19
Worked one on one with clients to resolve issues regarding Spark performance.
Interacted with data residing in HDFS using Spark to process the data.
Split JSON files into DataFrames so they could be processed in parallel with Apache Spark, improving performance and fault tolerance.
Created data frames in Apache Spark by passing schema as a parameter to the ingested data using Azure Databricks.
Participated in the migration from Cloudera Hadoop to Azure Cloud.
Established a connection between HBase and Spark for quick transfer of transactions.
Monitored background operations in Hortonworks Ambari.
Configured Zookeeper to coordinate the servers to maintain data consistency and monitor services.
Involved in analyzing system failures, identifying root causes, and recommending a course of action.
Created standardized documents for company usage.
Created multi-node Hadoop and Spark clusters on EMR instances to process terabytes of data.
Implemented analytics solutions through Agile/Scrum process for development and quality assurance using Azure DevOps.
Automated, configured, and deployed instances on Azure Cloud
Created Azure Data Factory pipelines to orchestrate jobs and executions.
Developed Spark code to process data as part of end-to-end data pipelines using Azure infrastructure and Spark in Databricks.
Worked with Azure Blob Storage, Azure Data Factory (ADF), Databricks, Key Vault, and Synapse to create different ETL processes for multiple business requirements.
Data Engineer
RadioShack, Fort Worth, TX
May’17-Oct’18
Developed PySpark application to read data from various file system sources, apply transformations, and write to NoSQL database
Installed and configured Kafka cluster and monitored the cluster.
Architected a lightweight Kafka broker and integrated Kafka with Spark for real-time data processing.
Created Hive external tables and designed data models in Apache Hive.
Developed multiple Spark Streaming and batch Spark jobs using Scala and Python on AWS.
Implemented advanced feature engineering procedures for the data science team using the in-memory computing capabilities of Apache Spark, written in Scala.
Implemented Rack Awareness in the Production Environment.
Collected data through a REST API: built an HTTPS connection to the server, sent GET requests, and published the responses with a Kafka producer.
Imported data from web services into HDFS and transformed data using Spark.
Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.
Used Spark SQL for creating and populating the HBase warehouse.
Worked with SparkContext, Spark SQL, DataFrames, and Pair RDDs.
Ingested data through AWS Kinesis Data Stream and Firehose from various sources to S3.
Extracted data from different databases and scheduled Oozie workflows to execute the task daily.
Worked with Amazon Web Services (AWS) and was involved in ETL, Data Integration, and Migration.
Documented the requirements, including the existing code to be implemented using Spark, Amazon DynamoDB, Redshift, and Elasticsearch.
Worked on AWS Kinesis for processing huge amounts of real-time data.
Jr. Data Engineer / Data Analyst
Fundación Rafael Dondé, Remote
Jul’15-May’17
Designed a cost-effective archival platform for storing Big Data using Hadoop and its related technologies
Developed a task execution framework on EC2 instances using SQL and DynamoDB
Built a Full-Service Catalog System with a full workflow using CloudWatch, Kibana, Kinesis, Elasticsearch, and Logstash
Connected various data centers and transferred data using Sqoop and ETL tools in the Hadoop system.
Imported data from disparate sources into Spark RDDs for data processing in Hadoop
Used shell scripts to dump the data from MySQL to Hadoop Distributed File System (HDFS)
Built a prototype for real-time analysis using Spark Streaming and Kafka in the Hadoop system.
Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop, Spark, and Hive for ETL, pipelines, and Spark Streaming, acting directly on Hadoop Distributed File System (HDFS).
Extracted data from RDBMS (Oracle, MySQL) to Hadoop Distributed File System (HDFS) using Sqoop
Used NoSQL database MongoDB in implementation and integration.
Configured Oozie workflow engine scheduler to run multiple Sqoop, Hive, and Pig jobs in the Hadoop system.
Consumed the data from the Kafka queue using Storm and deployed the application jar files into AWS instances.
Collected business requirements from subject matter experts and data scientists.
Transferred data using the Informatica tool from AWS S3 and used AWS Redshift for cloud data storage.
Used different file formats such as Text files, Sequence Files, and Avro for data processing in the Hadoop system.
Loaded data from various data sources into Hadoop Distributed File System (HDFS) using Kafka.
Integrated Kafka with Spark Streaming for real-time data processing in Hadoop and used the Cloudera Hadoop (CDH) distribution with Elasticsearch.
Used image files to create instances with Hadoop installed and running.
Streamed analyzed data to Hive tables using Sqoop, making it available for data visualization
Used Hive JDBC to verify the data stored in the Hadoop cluster
EDUCATION DETAILS
Bachelor's degree in Actuarial Science, Facultad de Matemáticas, Universidad Autónoma de Yucatán