Juan Roman Granillo
Phone: 350-***-**** Email: ***************@*****.***
Profile Summary
Strategic Big Data Development and ETL Professional with over 10 years of experience designing and implementing large-scale data systems, complemented by more than 12 years in IT. Delivers effective solutions through a strong analytical approach, with expertise in cloud platforms, data architecture, and system optimization.
Cloud Expertise:
Amazon Web Services (AWS): Expertise in AWS services like EMR, Lambda, Redshift, Athena, and CloudWatch, with hands-on experience in data warehousing and querying large datasets.
Google Cloud Platform (GCP): Skilled in managing data lakes, preparing data with Google Cloud Dataprep, provisioning resources with Terraform, and orchestrating workflows via Cloud Composer. Proficient in scalable messaging with Google Cloud Pub/Sub and compliance monitoring with Cloud Audit Logs.
Microsoft Azure: Strong abilities in designing scalable data architectures using Azure Data Lake, Synapse Analytics, and Azure SQL, and in managing pipelines with Azure Data Factory.
DevOps and CI/CD:
Successfully integrated CI/CD pipelines using Jenkins, CodePipeline, Azure DevOps, Kubernetes, Docker, and GitHub.
Extensive experience with AWS CloudFormation and Terraform on Google Cloud for resource management and provisioning.
Data Engineering and Processing:
Proficiency in processing large datasets across cloud and on-premises platforms, optimizing Spark performance in Databricks, AWS Glue, and EMR environments.
Expertise in Hadoop ecosystem components like Spark, HDFS, Hive, HBase, Zookeeper, Sqoop, Oozie, and Kafka for robust data processing.
Advanced skills in data security, implementing AWS Identity and Access Management (IAM) for access controls and ensuring compliance.
Analytical and Data Visualization Skills:
Proven track record in performing exploratory data analysis (EDA), statistical analysis, and hypothesis testing to interpret trends and provide data-driven recommendations.
Skilled in data visualization tools such as Tableau, Power BI, and Matplotlib to effectively present insights and trends to stakeholders.
Migration and Performance Optimization:
Experience in migrating on-premises data to the cloud and optimizing data pipelines. Deep understanding of upgrading Hadoop distributions such as MapR, CDH (Cloudera), and HDP (Hortonworks) for seamless transitions.
Collaborative and Results-Oriented:
Strong ability to work with cross-functional teams and business stakeholders to translate project needs into scalable technical solutions. Demonstrates a deep understanding of data governance principles, ensuring data integrity, privacy, and security.
Technical Skills
IDE: Jupyter Notebooks, Eclipse, IntelliJ, PyCharm, VS Code
Project Methods: Agile, Kanban, Scrum, DevOps, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Design Thinking, Lean Six Sigma
Hadoop Distributions: Apache Hadoop, Cloudera CDH, Hortonworks HDP
Cloud Platforms: Amazon AWS, Google Cloud (GCP), Microsoft Azure
Database & Tools: Redshift, DynamoDB, Azure Synapse, Bigtable, AWS RDS, SQL Server, PostgreSQL, Oracle, MongoDB, Cassandra, HBase
Programming Languages: Python, Scala, PySpark, SQL
Scripting: Hive, MapReduce, SQL, Spark SQL, Shell Scripting
CI/CD: Jenkins, AWS CodePipeline, Docker, Kubernetes, Terraform, CloudFormation
Versioning: Git, GitHub
Orchestration Tools: Apache Airflow, Step Functions, Oozie
Programming Methodologies: Object-Oriented Programming, Functional Programming
File Format & Compression: CSV, JSON, Avro, Parquet, ORC
File Systems: HDFS, S3, Google Storage, Azure Data Lake
ETL Tools: Spark, NiFi, AWS Glue, AWS EMR, Databricks, Azure Data Factory, Google Dataflow
Data Visualization Tools: Tableau, Kibana, Power BI, AWS QuickSight
Search Tools: Apache Lucene, Elasticsearch
Security: Kerberos, Ranger, IAM
Professional Experience
Nov 2023 – Present
Lead Cloud/Data Engineer Virgin Airlines, Burlingame, CA
As a Lead Cloud/Data Engineer at Virgin Airlines, I designed and deployed scalable Azure infrastructure, leveraging services like Virtual Machines, Kubernetes, and SQL Databases through automated CI/CD pipelines. I led the migration of on-premises data to AWS, improving data processing with real-time analytics using Kinesis and Glue. Additionally, I implemented serverless architectures for event-driven data pipelines, optimized ETL workflows, and utilized Snowflake for efficient data warehousing and insights. This work enhanced data accessibility, reduced processing time, and improved overall performance.
Designed and deployed Azure resources like Virtual Machines, Web Apps, Function Apps, SQL Databases, Azure Kubernetes, and Containers using ARM Templates, BICEP, and Terraform via Azure DevOps CI/CD pipelines.
Integrated Azure services such as Cosmos DB, App Insights, Blob Storage, API Management, and Functions to build a microservices-based web application.
Developed capacity and architecture plans to host migrated IaaS VMs and PaaS applications in Azure Cloud.
Developed and implemented data ingestion pipelines using Python to collect and process large volumes of structured and unstructured data from various sources (e.g., databases, APIs, and streaming services).
Led the migration of on-premises data to AWS, improving scalability and reducing data processing times through real-time analysis using Amazon Kinesis Data Analytics.
Utilized AWS Glue to establish a unified metadata repository with S3, enhancing data collection and access for large datasets.
Created Apache Airflow modules to interact with AWS services like EMR, S3, Athena, Glue, and Lambda.
Wrote optimized Spark jobs in Python (PySpark) for distributed data processing.
Implemented AWS Managed Kafka for streaming data from REST APIs to Spark clusters in Glue.
Designed and optimized ETL processes using AWS Glue, transforming data for analysis and reporting.
Architected event-driven, serverless solutions on AWS, utilizing Kafka and Kinesis for real-time data ingestion into DynamoDB.
Leveraged Snowflake for cloud-based data warehousing, optimizing query performance, and enhancing data-driven insights through SQL.
Employed DBT for data testing, quality assurance, and debugging complex queries.
Automated CI/CD pipelines with Jenkins, integrated version control with Git, and improved data quality and reporting using visualization tools.
Administered and optimized PostgreSQL databases to ensure high availability and performance, implementing indexing strategies, partitioning, and query optimization techniques to enhance data retrieval efficiency and support scalable applications across Virgin's digital platforms.
Configured a high-availability Kafka cluster, managing end-to-end data flow from ingestion to the data warehouse using AWS Glue and Spark for faster processing.
Apr 2022 – Oct 2023
Sr. Big Data Engineer Lucid Motors, Newark, CA
As a Senior Big Data Engineer at Lucid Motors, I deployed and configured AWS EC2 instances and S3 storage, orchestrated data pipelines using AWS Step Functions, and ensured secure, scalable data processing with services like EMR, Redshift, and DynamoDB. I implemented real-time analytics via Kinesis and optimized workflows using Apache Airflow, while ensuring data quality and governance across large datasets using Spark, Hive, and Python.
Deployed and configured Amazon EC2 instances and S3 storage using Amazon Web Services (Linux/Ubuntu/RHEL) tailored to specific application requirements.
Gained expertise in managing essential AWS services, such as IAM, VPC, Route 53, EC2, ELB, CodeDeploy, RDS, ASG, and CloudWatch, ensuring a robust and high-performance cloud infrastructure.
Provided expert guidance on complex AWS architectural queries, demonstrating a comprehensive understanding of cloud strategies and best practices.
Implemented stringent security measures within the AWS Cloud environment, protecting critical assets and ensuring compliance with security standards. Worked extensively with AWS S3 buckets, facilitating file transfers between HDFS and S3.
Utilized AWS Glue for data cleaning and preprocessing, and performed real-time analytics with Amazon Kinesis Data Analytics, leveraging services like EMR, Redshift, DynamoDB, and Lambda for scalable and efficient data processing.
Assessed on-premises data infrastructure for migration to AWS, orchestrated data pipelines with AWS Step Functions, and used Amazon Kinesis for event-driven messaging.
Orchestrated and scheduled data workflows using Apache Airflow, an open-source workflow automation platform.
Designed and automated ETL workflows within Lucid to facilitate seamless data integration from various sources, ensuring accurate data flow into Lucid's data visualization and analytics platforms for improved decision-making.
Applied statistical techniques such as regression, hypothesis testing, and clustering to uncover significant data patterns, supporting informed decision-making.
Created visually impactful charts, graphs, and dashboards with tools like Tableau and matplotlib to present analytical insights to stakeholders.
Employed data cleaning and preprocessing techniques to manage missing values, outliers, and inconsistencies, ensuring high-quality data for analysis.
Maintained data governance by adhering to regulations and principles, ensuring data privacy, integrity, and security throughout the analysis lifecycle.
Used Hive scripts in Spark for data cleaning and transformation.
Imported and transformed data from various sources with Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS via Sqoop.
Utilized Hive to analyze partitioned and bucketed data, calculating various metrics for reporting purposes.
Developed Pig Latin scripts to extract data from web server logs for loading into HDFS.
Monitored and troubleshot Hadoop jobs using YARN Resource Manager and EMR job logs.
Converted Cassandra, Hive, and SQL queries into Spark transformations using Spark RDDs in Scala.
Implemented and optimized ETL processes to load structured and semi-structured data into Snowflake, leveraging its capabilities for data scaling and query performance to support analytics and reporting needs across the organization.
Streamed log data from web servers to HDFS using Kafka and Spark Streaming.
Developed Spark jobs in Scala for faster data processing in a test environment and used Spark SQL for querying.
Created Python Flask-based login and dashboard features backed by a Neo4j graph database, executing Cypher queries for data analysis.
Designed efficient Spark code using Python and Spark SQL to enable forward engineering and code generation by developers.
Worked with large structured, semi-structured, and unstructured datasets.
Developed big data workflows to ingest data from multiple sources into Hadoop using Oozie, involving various jobs like Hive, Sqoop, and Python scripts.
Demonstrated expertise in data quality management, ensuring data accuracy, consistency, and reliability.
Implemented data quality frameworks and standards to maintain high data integrity.
Conducted data profiling, cleansing, and validation processes to improve data quality.
Collaborated with cross-functional teams to establish data quality best practices and governance policies.
Jan 2020 – Mar 2022
Cloud/Data Engineer Goldman Sachs, New York City, NY
As a Cloud/Data Engineer at Goldman Sachs, I led the design and implementation of scalable cloud-based data pipelines, optimizing data flow across AWS and GCP platforms. I leveraged services like AWS Redshift, Google Cloud Pub/Sub, and Terraform to streamline data processing and storage, ensuring secure, efficient, and real-time access to critical business insights. Additionally, I collaborated with cross-functional teams to enhance data governance, security, and performance optimization across the infrastructure.
Utilized Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager to build and manage infrastructure.
Configured GCP firewall rules to control traffic flow to and from VM instances and leveraged Google Cloud CDN to optimize content delivery from cache locations, improving user experience and reducing latency.
Developed Terraform templates to provision and manage infrastructure on GCP, streamlining resource deployment.
Administered Hadoop clusters on GCP using services like Dataproc and other related GCP tools for big data processing.
Processed real-time data streams using Google Cloud Pub/Sub and Dataflow, building scalable data solutions.
Created and maintained data pipelines leveraging Google Cloud Dataflow for efficient data processing and transformation.
Orchestrated data pipeline workflows using Apache Airflow on GCP for scheduled and automated tasks.
Developed custom NiFi processors to enhance data pipeline performance and processing.
Managed data storage and warehousing solutions using Google Cloud Storage and BigQuery for optimal data management.
Set up and administered compute instances on GCP for various processing and application tasks.
Containerized applications with Google Kubernetes Engine (GKE) for scalable deployment and management.
Optimized data processing workflows using tools like Google Cloud Dataprep and Dataflow, ensuring high efficiency.
Implemented data governance and security protocols using Google Cloud IAM, ensuring data privacy and compliance.
Ran Python scripts for custom data processing on GCP instances to automate and streamline tasks.
Designed and implemented ETL processes to extract data from multiple sources, transform it into a usable format, and load it into target data warehouses or data lakes. Utilized tools such as Apache NiFi, Talend, and custom Python scripts to automate and streamline ETL workflows.
Collaborated with data scientists to build robust data pipelines, ensuring effective data analysis on GCP.
Conducted exploratory data analysis (EDA) to detect patterns, anomalies, and trends within datasets, guiding further data insights.
Performed data validation and quality assurance to ensure accuracy, consistency, and integrity throughout the analysis process.
Developed and optimized distributed data processing applications using Apache Spark to handle large-scale data transformations and aggregations. Leveraged Spark SQL and DataFrames to enhance data processing efficiency and enable real-time analytics.
Worked closely with stakeholders to define and align key performance indicators (KPIs) with business objectives and strategies.
Built real-time streaming systems using Amazon Kinesis to process data as it is generated, enhancing data flow and analysis.
Set up and maintained data storage systems like Amazon S3 and Amazon Redshift, ensuring secure and accessible data storage for analysis.
Optimized data processing systems using Amazon EMR and Amazon EC2 to achieve performance and scalability improvements.
Processed data with Natural Language Toolkit (NLTK) to extract key insights and generate visual word clouds.
Ensured data governance and security using AWS Identity and Access Management (IAM) and Amazon Macie, protecting sensitive information.
Created and optimized data processing workflows with Amazon EMR and Amazon Kinesis, enabling timely and efficient analysis of large datasets.
Provided timely ad-hoc analysis and support to cross-functional teams, facilitating data-driven decision-making.
Collaborated with data scientists and analysts to build data pipelines for tasks like fraud detection, risk assessment, and customer segmentation.
Developed and implemented recovery plans and procedures to ensure system resilience and data protection.
Sep 2017 – Dec 2019
Cloud Data Engineer Molina Healthcare, Long Beach, CA
As a Cloud Data Engineer at Molina Healthcare, I ingested terabytes of diverse raw data into Spark RDDs for computation, enhancing output generation while facilitating seamless data transfer from Azure Blob Storage. I optimized data processing by leveraging Hive Context for efficient querying of Azure HDInsight tables and extensively modeled Hive partitions to boost performance. My development of robust Python and Scala code within Azure Databricks ensured adherence to application requirements, while my work in automating ETL processes and mentoring junior engineers fostered a culture of best practices and innovation.
Ingested terabytes of diverse raw data into Spark RDDs for computation and output generation, importing data from Azure Blob Storage into Spark RDDs.
Leveraged Hive Context, which extends SQL Context functionality, to write queries using the HiveQL parser for accessing data from Azure HDInsight Hive tables.
Extensively modeled Hive partitions for data separation and enhanced processing speed, adhering to Hive best practices for optimization in Azure HDInsight.
Developed complex, maintainable Python and Scala code within Azure Databricks, ensuring it meets application requirements for data processing and analytics using built-in libraries.
Successfully transferred files to Azure HDInsight Hive and Azure Blob Storage from Oracle and SQL Server utilizing Azure Data Factory.
Migrated legacy MapReduce jobs to PySpark jobs on Azure HDInsight.
Crafted UNIX scripts for automating and scheduling ETL processes, including job invocation, error handling, reporting, and file transfer operations using Azure Blob Storage.
Utilized UNIX Shell scripts for job control and file management within Azure Linux Virtual Machines.
Gained experience in offshore and onshore models for development and support projects on Azure.
Mentored junior data engineers, providing technical guidance on ETL data pipeline best practices.
Developed and maintained ETL pipelines using Apache Spark and Python for large-scale data processing and analysis.
Assessed the existing on-premises data infrastructure, including data sources, databases, and ETL workflows.
Established a CI/CD pipeline using Jenkins to automate ETL code deployment and infrastructure changes.
Managed version control for ETL code and configurations using tools like Git.
Created automated Python scripts for converting data from various sources and generating ETL pipelines.
Optimized ETL and batch processing jobs for performance, scalability, and reliability using Spark and YARN.
Jan 2015 – Aug 2017
Hadoop Developer/Administrator Liberty Mutual Insurance Group, Boston, MA
As a Hadoop Developer/Administrator at Liberty Mutual Insurance Group, I successfully migrated on-premises data to AWS, enhancing accessibility and scalability while significantly reducing data processing times through real-time analytics with Amazon Kinesis Data Analytics. I optimized data workflows by implementing AWS Lambda for automated PySpark ETL jobs and managed the installation and maintenance of Apache Hadoop clusters, ensuring peak performance and high availability with Zookeeper. My efforts in integrating Neo4j with relational databases streamlined data handling, while automating data ingestion processes improved overall operational efficiency.
Successfully transitioned on-premises data to AWS, improving accessibility and scalability and significantly reducing data processing times by implementing real-time analytics with Amazon Kinesis Data Analytics.
Assessed the on-premises data architecture to identify migration opportunities to AWS, orchestrating data pipelines with AWS Step Functions and utilizing Amazon Kinesis for event-driven messaging.
Optimized data flow for analytics and configured AWS Lambda functions to automate PySpark ETL jobs, ensuring scheduled and error-managed data processing.
Installed, configured, and maintained Apache Hadoop clusters for application development, alongside Hadoop tools like Hive, HBase, Zookeeper, and Sqoop.
Employed Neo4j to integrate graph databases with relational databases, enhancing efficiency in data storage, handling, and querying.
Installed and upgraded Cloudera CDH and Hortonworks HDP versions in lower and proof-of-concept environments.
Configured Sqoop for importing and exporting data into HDFS and Hive from relational databases.
Closely monitored and analyzed MapReduce job executions at the task level, optimizing Hadoop cluster components for peak performance.
Deployed a high-availability infrastructure with automatic failover for NameNodes using Zookeeper services, ensuring the continuous reliability of the Hadoop cluster.
Managed the installation and version upgrades of Hadoop and related software components, ensuring compatibility and optimized performance for big data processing tasks.
Integrated CDH clusters with Active Directory and enabled Kerberos for authentication.
Handled commissioning and decommissioning of DataNodes, conducted NameNode recovery and capacity planning, and installed the Oozie workflow engine to execute multiple Hive jobs.
Used Hive to create tables, load data, and develop Hive UDFs while collaborating with the Linux server administration team to manage server hardware and operating systems.
Automated data workflows with shell scripts to extract, transform, and load data from various databases into Hadoop, streamlining data ingestion processes.
Collaborated closely with data analysts to develop innovative solutions for their analysis tasks and managed Hadoop and Hive log file reviews.
Worked with application teams to install operating system updates and Hadoop version upgrades as needed.
Sep 2013 – Dec 2014
Hadoop Administrator Adidas, Portland, OR
Oversaw the installation, configuration, and maintenance of Hadoop clusters, ensuring optimal performance and scalability to support Adidas’ data processing needs.
Implemented and managed data ingestion processes using tools like Apache Flume and Sqoop to ensure efficient data flow from various sources into the Hadoop ecosystem.
Utilized monitoring tools such as Apache Ambari and Cloudera Manager to track cluster health, performance metrics, and resource utilization, implementing tuning adjustments to optimize performance.
Enforced data security measures and access controls within the Hadoop ecosystem, managing user permissions and implementing Kerberos authentication to protect sensitive information.
Developed and executed backup and recovery strategies for Hadoop data, ensuring data integrity and availability in case of failures or disasters.
Worked closely with data engineers, data scientists, and other stakeholders to understand data requirements, providing support for Hadoop-related queries and troubleshooting issues as they arose.
Maintained comprehensive documentation of Hadoop cluster configurations, operational procedures, and best practices, ensuring knowledge transfer and compliance with internal policies.
Implemented and managed job scheduling using tools like Apache Oozie and Apache Airflow to automate data processing workflows, ensuring timely and efficient execution of data jobs.
Monitored and allocated cluster resources effectively using YARN (Yet Another Resource Negotiator) to ensure optimal resource utilization and performance across various data processing tasks.
Diagnosed and resolved complex issues within the Hadoop environment, including data processing failures, performance bottlenecks, and system errors, ensuring minimal downtime and data loss.
Jan 2012 – Aug 2013
Data Analyst The Home Depot, Cobb County, GA
Deconstructed and analyzed functional requirements of business products to align with organizational project and product goals.
Established the data and reporting infrastructure from scratch using Tableau and SQL to deliver real-time insights.
Employed advanced SQL queries for analysis and research to address specific requirements and solutions.
Conducted data analysis and modeling to uncover patterns and trends within large datasets.
Developed and implemented data-driven strategies to enhance business operations and decision-making processes.
Collaborated with cross-functional teams to identify and prioritize data-related projects and initiatives.
Created and maintained data visualizations and dashboards to effectively communicate insights to stakeholders.
Conducted A/B testing and experimentation to optimize business outcomes.
Performed data quality checks and resolved data-related issues as they arose.
Stayed current on industry trends and emerging technologies in big data analytics.
Mentored and trained junior analysts on best practices and techniques in data analysis and modeling.
Participated in the development and implementation of data governance policies and procedures to ensure data security and compliance.
Developed operational reporting in Tableau to identify areas of improvement for contractors, generating incremental annual revenue and a positive return on investment.
Applied models and data to forecast repair costs for vehicles in the market and presented findings to stakeholders.
Education
Bachelor of Science
Texas State University, San Marcos, Texas
Associate Degree in Science
South Plains College, Levelland, Texas