
Big Data Science

Location: Tappahannock, VA
Posted: June 07, 2023


Juan Pablo Osorio Flores

Email: adw2t8@r.postjobfree.com Phone: 305-***-****

Profile Summary

●An achievement-driven professional with close to 9 years of experience developing custom Big Data solutions; passionate about technology and computation.

●In-depth knowledge of real-time ETL/Spark analytics pipelines using Spark SQL, with visualization tools such as Tableau, Power BI, ELK, Splunk, and Elasticsearch.

●Strong technical skills, including proficiency in programming languages such as Scala and Python, SQL, and shell scripting, and experience with big data technologies such as the Hadoop ecosystem, Cloudera, Snowflake, and Spark/Spark Streaming.

●Able to troubleshoot and tune code written in SQL, Python, and Scala, and to design elegant solutions from problem statements.

●Skilled in SQL and in working with relational database management systems, including PostgreSQL and MS SQL Server.

●Skilled at working in the AWS ecosystem, designing architectures, and delivering projects involving the movement and transformation of data.

●Proficient with AWS tools (Redshift, Kinesis, S3, EC2, EMR, DynamoDB, Elasticsearch, Athena, Kinesis Firehose, Lambda, Glue, Glue Crawlers, Glue Data Catalog).

●Experienced with multiple Hadoop distributions including Cloudera (Cloudera manager) and Hortonworks (Ambari).

●Built and configured virtual environments for running multiple Big Data systems

●Worked with stakeholders to understand their data needs and ensure that the company’s big data infrastructure supports their requirements.

●Expert at working with HDFS and importing data from RDBMS into HDFS using the Sqoop tool.

●Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

●Extensive experience in performance tuning, including SQL and Spark query tuning.

●Strong analytical skills for troubleshooting and problem-solving

●Received awards such as Excelencia Estudiantil and Embajador ITESM

Certifications

●Implementing and Administering Cisco Networking Technologies

●Google Cloud Platform Fundamentals: Core Infrastructure

●LASPAU Program Assistant

●SolidWorks Certified Associate

Technical Skills

●Big Data Platforms: Hadoop, Cloudera Hadoop, Snowflake, Databricks

●Hadoop Ecosystem (Apache) Tools: Kafka, Spark, Cassandra, Flume, Hadoop, Hadoop YARN, HBase, Hive, Airflow, Spark Streaming, Sqoop, Oozie.

●Related Tools and Platforms: Sqoop, Kibana, Tableau, Power BI, AWS, Apache Airflow, GCP.

●Scripting: Python, Scala, SQL.

●Data Storage and Files: HDFS, Data Lake, Data Warehouse, Redshift, Parquet, Avro, JSON, Snappy, Gzip, ORC, BSON.

●Databases: Apache Cassandra, Apache HBase, MongoDB, PostgreSQL, MySQL, RDBMS, DB2, DynamoDB, AWS DocumentDB, Snowflake, MS SQL Server, Oracle.

●Cloud Platforms and Tools: AWS (S3, EC2, EMR, Redshift, Lambda), Microsoft Azure, OpenStack, Google Cloud Storage.

●File Systems: HDFS, S3, Azure Blob, GCS

●ETL Tools: Sqoop, AWS Glue, Azure Data Factory, Apache Airflow

●Data Visualization Tools: Tableau, Power BI

●Data Query: Spark SQL, PySpark, Pandas.

●Programming Languages: Java, Python, Scala

●Scripting: Hive, SQL, Spark SQL, Shell Scripting

●Continuous Integration (CI/CD): Jenkins, Git, Bitbucket, AWS CodePipeline, CodeCommit

WORK EXPERIENCE

Senior Data Engineer (AWS)

Revelo, Miami, FL, Aug 21-Present

●Building real-time streaming systems using Amazon Kinesis to process data as it is generated (a Kinesis producer sketch follows this list).

●Created different pipelines in AWS for end-to-end ETL processes.

●Designing and implementing data pipelines to extract, transform, and load data from various sources into an Amazon S3 data lake or Amazon Redshift data warehouse.

●Optimizing data processing systems for performance and scalability using Amazon EMR and Amazon EC2.

●Installed AWS command line interface (CLI) to interact with S3 bucket to download and upload files.

●Processed data with a natural language toolkit to count important words and generated word clouds.

●Implementing data governance and security protocols using AWS Identity and Access Management (IAM) and Amazon Macie to ensure that sensitive data is protected.

●Collaborating with data scientists and analysts to develop data pipelines for tasks such as fraud detection, risk assessment, and customer segmentation; developing and implementing recovery plans and procedures.

●Created Apache Airflow jobs to orchestrate pipeline executions (an Airflow DAG sketch follows this list).

●Migrated data stored on legacy mainframe systems to modern storage, tying into MySQL on AWS RDS.

●Setting up and maintaining data storage systems, such as Amazon S3 and Amazon Redshift, AWS Glue jobs and AWS Step Functions to ensure data is properly stored and easily accessible for analysis.

●Developed a data pipeline in which files were stored on an AWS EC2 instance, decompressed, sent to an AWS S3 bucket, and transformed to ASCII; rules were applied to the data based on DDL from the client's Oracle database administrator.

●Set up cloud compute instances in managed and unmanaged modes and handled SSH key management.

●Containerized a Confluent Kafka application and configured subnets for communication between containers.

●Creating and maintaining documentation for the company’s big data infrastructure and systems, including design diagrams, system configurations, and best practices

●Developed AWS CloudFormation templates and Terraform configurations to create custom pipeline infrastructure.

●Ran Python scripts to initiate a custom data pipeline that downloaded files, transformed the data within them, and uploaded it to a MySQL database server and other RDBMS.

●Creating and optimizing data processing workflows using AWS services such as Amazon EMR and Amazon Kinesis Firehose and Kinesis Streams to process and analyze large amounts of data in a timely and efficient manner.
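
Illustrative sketch for the Kinesis streaming bullet above: a minimal boto3 producer in Python. The stream name, region, and event fields are assumptions for illustration, not details from this role.

import json
import boto3

# Kinesis client; the region is an assumed example
kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish_event(event):
    """Send one event to a (hypothetical) stream; PartitionKey controls shard routing."""
    kinesis.put_record(
        StreamName="events-stream",                     # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event.get("user_id", "na")),
    )

publish_event({"user_id": 42, "action": "page_view"})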
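
Illustrative sketch for the Airflow orchestration bullet above: a small DAG with two Python tasks. The DAG id, task names, and daily schedule are hypothetical.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting")     # placeholder: pull data from the source system

def load():
    print("loading")        # placeholder: write data to the target store

with DAG(
    dag_id="example_etl",                # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task            # extract runs before load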

Big Data Engineer

Progressive Corporation (Remote), Oct 19 – Aug 21

●Worked with a Big Data development team that created a solution allowing users to join two or more tables from a trusted bucket, save the result as a new table in an enhanced-trusted bucket, query it using Dremio or RapidSQL, and save the data lineage in Collibra.

●Implemented Spark on AWS EMR using PySpark/Scala and utilized the DataFrame and Spark SQL APIs for faster data processing (a PySpark sketch follows this list).

●Registered datasets in AWS Glue through the REST API.

●Used AWS API Gateway to trigger Lambda functions.

●Queried data residing in AWS S3 buckets with Athena.

●Utilized AWS Step Functions and Lambda functions (with S3 triggers) to run a data pipeline (a Lambda sketch follows this list).

●Used DynamoDB to store metadata and logs.

●Monitored and managed services with AWS CloudWatch.

●Performed transformations using Apache SparkSQL.

●Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation in Scala and PySpark.

●Developed Spark code using Python/Scala and Spark-SQL for faster testing and data processing.

●Tuned Spark to increase job performance.

●Configured the ODBC driver and Presto driver with RapidSQL.

●Used Dremio as a query engine for faster joins and complex queries over AWS S3 buckets using Dremio data reflections.

●Wrote new Lambda functions and upgraded existing ones from Python 2 to Python 3.

●Conducted testing in PyCharm with pytest for functions upgraded from Python 2 to Python 3.

●Loaded data into the company’s Snowflake-based data warehouse.

●Worked on POCs for ETL with S3, EMR (Spark), and Snowflake.

●Worked as part of the Big Data Engineering team on pipeline creation activities in the AWS environment.

●Used Airflow to schedule and monitor jobs.

●Used Collibra to save the data lineage of each pipeline.

●Used Talend to transfer datasets to an S3 bucket as a POC.
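
Illustrative sketch for the EMR Spark bullet above: the DataFrame / Spark SQL pattern in PySpark. Bucket paths, table names, and columns are invented placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims-etl").getOrCreate()

# Read a Parquet dataset from a (placeholder) trusted bucket
df = spark.read.parquet("s3://trusted-bucket/claims/")

# Basic validation/cleansing: drop rows missing the key column, derive a year column
clean = df.dropna(subset=["claim_id"]).withColumn("claim_year", F.year(F.col("claim_date")))

# Register a temp view and aggregate with Spark SQL
clean.createOrReplaceTempView("claims")
summary = spark.sql("SELECT claim_year, COUNT(*) AS n_claims FROM claims GROUP BY claim_year")

# Write the result to a (placeholder) enhanced-trusted bucket
summary.write.mode("overwrite").parquet("s3://enhanced-trusted-bucket/claims_summary/")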
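
Illustrative sketch for the Step Functions / Lambda bullet above: an S3-triggered Lambda handler that starts a state machine execution. The state machine ARN and payload shape are placeholders.

import json
import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    # S3 put events carry the bucket and object key in the Records list
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sfn.start_execution(
            stateMachineArn="arn:aws:states:REGION:ACCOUNT:stateMachine:example",  # placeholder ARN
            input=json.dumps({"bucket": bucket, "key": key}),
        )
    return {"status": "ok"}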

Data Engineer

Michael Page, Boston, MA, Dec 16-Oct 19

●Developed PySpark application to read data from various file system sources, apply transformations, and write to NoSQL database in GCP environment.

●Implemented Rack Awareness in the Production Environment using GCP resources.

●Collected data using REST APIs: built HTTPS connections with the client server, sent GET requests, and collected responses in Pub/Sub (a Pub/Sub sketch follows this list).

●Imported data from web services into Cloud Storage and transformed data using Cloud Dataproc.

●Executed Hadoop/Spark jobs on GCP Dataproc using programs, and data stored in Cloud Storage Buckets.

●Automated, configured, and deployed instances on Google Cloud Platform (GCP).

●Used BigQuery for creating and populating the Cloud Bigtable warehouse.

●Worked with Spark Context, Spark SQL, DataFrames, and Pair RDDs on GCP Dataproc.

●Ingested data through GCP Pub/Sub and Dataflow from various sources to Cloud Storage.

●Architected a lightweight Kafka broker and integrated Kafka with Spark for real-time data processing on GCP (a streaming sketch follows this list).

●Created Hive external tables and designed data models in Apache Hive on GCP Dataproc.

●Developed multiple Spark Streaming and batch Spark jobs using Scala and Python on GCP Dataproc.

●Implemented advanced feature-engineering procedures for the data science team using in-memory computing with Apache Spark, written in Scala, on GCP.

●Extracted data from different databases and scheduled workflows using Cloud Composer to execute tasks daily.

●Worked with GCP services such as Cloud Storage, Pub/Sub, Dataflow, Dataproc, BigQuery, and was involved in ETL, Data Integration, and Migration.

●Documented requirements, including the available code to be implemented using Spark, Cloud Firestore, Bigtable, and Stackdriver Logging in GCP.

●Worked on GCP Dataflow for processing huge amounts of real-time data.
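
Illustrative sketch for the REST-to-Pub/Sub bullet above. The endpoint URL, GCP project id, and topic name are placeholders.

import json
import requests
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "ingest-topic")  # hypothetical project/topic

def fetch_and_publish(url):
    response = requests.get(url, timeout=30)    # plain GET over HTTPS
    response.raise_for_status()
    payload = json.dumps(response.json()).encode("utf-8")
    publisher.publish(topic_path, data=payload).result()   # block until the message is acked

fetch_and_publish("https://api.example.com/records")        # placeholder endpoint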
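
Illustrative sketch for the Kafka-with-Spark bullet above, using Spark Structured Streaming. The broker address, topic, and output paths are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Subscribe to a (placeholder) Kafka topic
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka values arrive as bytes; cast to string before downstream parsing
parsed = stream.select(F.col("value").cast("string").alias("raw_event"))

# Continuously append to Parquet in a (placeholder) GCS location
query = (
    parsed.writeStream.format("parquet")
    .option("path", "gs://example-bucket/stream-output/")
    .option("checkpointLocation", "gs://example-bucket/checkpoints/")
    .start()
)
query.awaitTermination()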

Data Engineer / Data Analyst

Unified Group, Boston, MA, Jul 14-Nov 16

●Designed a cost-effective archival platform for storing Big Data using Azure Data Lake Storage and its related technologies.

●Developed a task execution framework on Azure Virtual Machines using Azure SQL Database and Azure Cosmos DB

●Collected business requirements from subject matter experts and data scientists and implemented them in Azure Synapse Analytics.

●Transferred data using Azure Data Factory from various sources such as Azure Blob Storage, Azure SQL Database, and Azure Cosmos DB for cloud data storage.

●Used different file formats such as CSV, JSON, Parquet, and Avro for data processing in Azure HDInsight.

●Loaded data from various data sources into Azure Data Lake Storage Gen2 using Azure Event Hubs (an Event Hubs sketch follows this list).

●Integrated Azure Event Hubs with Azure Stream Analytics for real-time data processing in Azure HDInsight and used Elasticsearch on Azure for indexing and searching.

●Built a Full-Service Catalog System with a full workflow using Azure Logic Apps, Azure Event Grid, Azure Functions, Azure Search, and Azure Monitor.

●Connected various data centers and transferred data using Azure Data Factory and Azure Databricks in the Azure HDInsight system.

●Imported data from disparate sources into Azure Databricks for data processing.

●Used shell scripts to dump the data from Azure SQL Database to Azure Data Lake Storage Gen2.

●Built a prototype for real-time analysis using Azure Stream Analytics and Azure Event Hubs in the Azure HDInsight system.

●Loaded and transformed large sets of structured, semi-structured, and unstructured data using Azure HDInsight, Azure Databricks, and Azure Data Lake Storage Gen2 for ETL, pipeline, and Spark streaming, acting directly on the data stored in Azure Data Lake Storage Gen2.

●Extracted data from RDBMS (Oracle, MySQL) to Azure Blob Storage using Azure Data Factory.

●Used Azure Cosmos DB in implementation and integration for NoSQL databases.

●Configured Azure Data Factory workflow engine scheduler to run multiple jobs in the Azure HDInsight system.

●Consumed the data from the Azure Event Hubs using Azure Stream Analytics and deployed the application jar files into Azure Virtual Machines.

●Used Azure Marketplace to create virtual machines containing Azure HDInsight and running Hadoop, Spark, and Hive.

●Streamed analyzed data to Azure Synapse Analytics using Azure Data Factory, making it available for data visualization.

●Used the Azure Synapse Analytics SQL serverless endpoint to verify the data stored in the Azure Data Lake Storage Gen2.
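
Illustrative sketch for the Event Hubs bullet above: publishing records with the azure-eventhub SDK. The connection string and hub name are placeholders.

import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://...",        # placeholder connection string
    eventhub_name="ingest-hub",          # hypothetical hub name
)

def send_records(records):
    # Batch the records so they are sent in a single request
    batch = producer.create_batch()
    for record in records:
        batch.add(EventData(json.dumps(record)))
    producer.send_batch(batch)

send_records([{"id": 1, "value": 3.14}])
producer.close()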

EDUCATION DETAILS

●Tecnológico de Monterrey, Licenciatura (Bachelor's degree) in Electronics, Robotics, and Mechatronics Engineering

●Universidad de Huelva, Licenciatura (Bachelor's degree) in Electronics, Robotics, and Mechatronics Engineering


