
Data Engineer Lead

Location:
Los Angeles, CA, 90006
Posted:
March 13, 2025


Resume:

Leonardo Perez

Lead Data Engineer / Backend Engineer

Los Angeles, CA ********.*.*********@*****.*** +1-469-***-****

SUMMARY

Dynamic Senior Data Engineer with 8+ years of experience specializing in backend development, data engineering, and AI integrations.

Skilled in designing scalable, high-performance data pipelines, cloud-based architectures, and advanced AI-powered systems for industries including healthcare, finance, and IT.

Expert in modern ETL/ELT processes, data transformation, and orchestration tools such as Apache Airflow, Kafka, Flink, Apache Spark, and Databricks.

Proficient in Python, SQL, and cloud platforms such as AWS and Azure, with hands-on expertise in leveraging tools like Snowflake, Apache Flink, PySpark, and Databricks for large-scale data processing.

Experienced in developing RESTful and GraphQL APIs, optimizing backend systems for real-time data processing, and implementing AI/ML models to enhance decision-making.

Strong background in containerization and orchestration using Docker and Kubernetes, ensuring scalable and reliable deployments.

Skilled in Generative AI and Large Language Models (LLMs), including building AI-powered chatbots for natural language data queries.

Proven ability to collaborate cross-functionally to deliver innovative, client-focused solutions.

EXPERIENCE

LEAD DATA ENGINEER

Jade Global – United States (Contract, Remote) 02/2023 - 12/2024

Designed and implemented a scalable, AI-powered clinical decision support system for enterprise-scale healthcare clients using modern cloud technologies and a microservices architecture, employing dbt for building, testing, and documenting data transformations while ensuring robust data security protocols to protect sensitive information.

Designed and optimized large-scale ETL data pipelines using PySpark, Apache Spark, dbt, Amazon EMR, and Databricks with Delta Lake, enhancing data transformation speed by 40% and integrating with Amazon S3, Redshift, and Glue for seamless data storage and analytics.

Leveraged Databricks and dbt for optimizing and orchestrating data workflows, ensuring efficient processing of large-scale datasets and integration with cloud-native technologies such as AWS S3, Redshift, and Glue.

Utilized Databricks notebooks and dbt models for real-time data processing, collaboration, and advanced analytics, streamlining the development and testing of data pipelines.

Developed and optimized real-time data pipelines using Apache Airflow, integrating with Apache Kafka and AWS Kinesis for seamless stream processing, improving data latency by 30% and enabling real-time analytics on large-scale datasets.

Optimized data pipeline performance and reduced API response times by 30% using Redis for caching and session management, integrated with Amazon RDS (PostgreSQL), DynamoDB, S3, and AWS Lambda for seamless scalability in a cloud-based architecture.

Designed and optimized interactive data visualizations and dashboards in Tableau, integrating large-scale data pipelines with Snowflake to deliver real-time business insights, leveraging SQL and Python for seamless data transformation and reporting.

Leveraged NumPy and Pandas for efficient data manipulation and transformation, building scalable data pipelines for processing large datasets, while utilizing SciPy for advanced statistical analysis and optimization.

Developed an AI-powered chatbot leveraging Generative AI models and LLMs, such as OpenAI, to provide users with data through natural language queries, eliminating the need for SQL-based inputs.

Utilized Docker for containerization, Kubernetes for orchestration, and Jenkins for CI/CD pipelines, ensuring scalable, reliable deployments with automated testing and seamless integration across development and production environments.

Implemented MLOps pipelines to automate the deployment, monitoring, and scaling of machine learning models, ensuring seamless integration with data processing workflows and continuous model improvement.

Leveraged comprehensive testing strategies, including unit, integration, and end-to-end tests, using tools like Pytest, Jest, and Selenium, to ensure the reliability, performance, and data security of data pipelines and backend systems.

Developed and generated synthetic test data for data pipelines to ensure robust validation, testing, and optimization of data workflows across various systems.

Architected and deployed backend microservices using NestJS and FastAPI, ensuring modularity, scalability, and high performance.

Designed and implemented RESTful and GraphQL APIs for seamless communication between microservices.

Extensively applied Agile methodology in cross-functional teams, leading sprint planning, daily stand-ups, and code reviews to ensure timely delivery of high-quality software in a fast-paced, iterative development environment.

SENIOR DATA ENGINEER

ServiceNow – United States (Contract, Remote) 06/2020 - 12/2022

Designed and developed scalable data pipelines using Kafka, Spark, and Flink to support the processing and transformation of large datasets for the IT Service Management (ITSM) automation platform, ensuring efficient data ingestion and real-time analytics.

Optimized ETL workflows by leveraging Databricks with Delta Lake to handle large-scale data transformation, improving pipeline performance by 35% and ensuring data consistency across cloud storage solutions.

Engineered and deployed Databricks notebooks to streamline data analysis and reporting, integrating with Azure Blob Storage and Azure Data Lake to improve data accessibility and enable more efficient reporting for real-time ITSM insights.

Engineered ETL/ELT workflows to streamline data extraction, transformation, and loading from various data sources into a central data warehouse, improving data accessibility and decision-making.

Implemented advanced data models for structuring and optimizing data storage in the data warehouse, enhancing query performance and analytical capabilities.

Leveraged Azure services (Azure Data Lake, Azure Functions, and Azure Blob Storage) to build cloud-native solutions, enabling efficient storage, processing, and orchestration of large volumes of ITSM-related data.

Developed real-time data processing solutions using Kafka Streams and Flink to support the ingestion of live IT service management data, ensuring timely data delivery for real-time reporting and analytics.

Optimized data pipelines by integrating Datadog for performance monitoring, enabling proactive identification of bottlenecks and reducing pipeline latency by 25%.

Collaborated with cross-functional teams to develop and integrate data-driven features into the ITSM platform, providing stakeholders with insights into service performance, incident management, and user activity.

Designed and implemented data solutions using Hadoop for distributed storage and processing of massive datasets, ensuring high availability and fault tolerance.

Enhanced data quality and reliability by implementing automated data validation checks and logging mechanisms in the ETL pipeline, ensuring accurate and consistent data processing.

Applied T-SQL to develop complex queries and stored procedures for extracting and analyzing service-related data from relational databases, driving key reporting and business intelligence efforts.

Led the integration of cloud-based data storage solutions using Azure Blob Storage to ensure scalability and cost-effectiveness in handling large-scale data from IT service operations.

SENIOR DATA ENGINEER

Teradata – United States (Full-Time, Hybrid) 10/2018 - 05/2020

Engineered AI-driven data solutions for IoT-based manufacturing systems, utilizing Python, Apache Kafka, and Spark to process and analyze large streams of real-time data from IoT devices, improving operational efficiency and predictive maintenance.

Leveraged Databricks with Apache Spark for large-scale data processing and transformation, improving the efficiency of real-time and batch data pipelines for IoT manufacturing systems and enabling faster predictive analytics.

Developed and optimized data pipelines for collecting and processing sensor data, manufacturing logs, and transactional data, enabling predictive analytics and real-time decision-making across the manufacturing floor.

Built and deployed machine learning models using TensorFlow and scikit-learn to predict equipment failure, optimize production schedules, and detect anomalies, contributing to a 15% reduction in downtime and improved equipment life cycle management.

Designed and implemented data lakes and data warehouses to consolidate structured and unstructured data from various IoT devices and production systems, enabling seamless access and integration of historical and real-time data.

Implemented batch and stream processing workflows using Apache Spark and Kafka Streams, supporting both real-time analytics and historical data processing for large-scale IoT data.

Developed robust ETL processes to ensure seamless ingestion, transformation, and loading of data from IoT devices and other enterprise systems into a centralized data warehouse for efficient querying and reporting.

Automated and monitored data pipelines, leveraging Apache Kafka, Apache Flink, and AWS S3 for scalable and reliable data ingestion and processing.

Collaborated with data scientists and AI engineers to ensure proper integration of machine learning models into the data pipeline, providing actionable insights from real-time and historical IoT data.

Optimized data storage strategies using Hadoop, reducing data storage costs by 25% and improving query performance by implementing partitioning and indexing strategies.

Developed and maintained SQL-based queries and stored procedures to support data analytics and reporting requirements, ensuring high performance and low-latency access to critical manufacturing data.

Integrated cloud platforms (AWS, GCP) to scale data processing infrastructure and ensure seamless data flow across systems, increasing system reliability and reducing infrastructure overhead by 20%.

DATA ENGINEER

Global TIES – United States 08/2016 - 07/2018

Designed and maintained scalable ETL/ELT pipelines for large-scale data ingestion, transformation, and storage across AWS and GCP platforms.

Implemented data processing workflows using Google Pub/Sub, Cloud Dataflow, and Python, loading data into BigQuery for analytical purposes.

Automated data pipeline processes by developing Google Cloud Functions for efficient data loading from Google Cloud Storage (GCS) into BigQuery.

Applied Apache Airflow and Apache Spark to orchestrate and automate ETL processes, ensuring high availability and fault tolerance.

Optimized data workflows, ensuring data integrity, accuracy, and validation using SQL and Python.

Led efforts in Continuous Delivery using Docker and GitHub Actions to streamline deployment pipelines and accelerate time-to-production.

Collaborated closely with cross-functional teams on a scientific research project, providing data engineering support to enable advanced analytics and model development, including predictive modeling using Logistic Regression and other ML techniques.

Worked extensively with GCP services, including BigQuery, Cloud Functions, Cloud Dataflow, and GCS, to ensure seamless and efficient data processing and storage.

Contributed to data-driven insights and research by ensuring accurate, timely, and scalable data solutions.

EDUCATION

University of California San Diego Master of Science in Computer Science 05/2016 - 08/2018

University of California San Diego Bachelor of Science in Computer Science 09/2012 - 04/2016

SOFT SKILLS

Leadership & Mentorship

Communication & Collaboration

Problem-Solving

Adaptability & Continuous Learning

Critical Thinking

Time Management & Organization

Attention to Detail

Client-Focused

TECHNICAL SKILLS

Programming Languages: Python, SQL, Scala, TypeScript, JavaScript, Shell

Data Pipeline Orchestration: Apache Airflow, Apache Kafka, Apache Flink, Apache Spark, PySpark, AWS Kinesis, Google Cloud Pub/Sub

ETL/ELT Frameworks: Apache NiFi, AWS Glue, Google Dataflow, Talend, Informatica, dbt, Databricks

Data Storage & Warehousing: Snowflake, Amazon Redshift, Google BigQuery, AWS S3, HDFS, Delta Lake, AWS DynamoDB, PostgreSQL, MySQL, Teradata, MongoDB, Cassandra

Real-time Data Processing: Apache Kafka, Apache Flink, Apache Spark Streaming, AWS Lambda, Google Cloud Functions

Data Lake & Data Warehouses: AWS S3, Azure Data Lake, Hadoop, Databricks, Delta Lake

Data Modeling & Transformation: SQL, Python (NumPy, Pandas), T-SQL, dbt (Data Build Tool)

AWS Services: EC2, S3, Lambda, RDS (PostgreSQL), Redshift, Glue, Kinesis, EMR, CloudFormation, SQS, CloudWatch

Azure Services: Azure Blob Storage, Azure Data Lake, Azure Functions, Azure Databricks, Azure Synapse

Google Cloud Services: BigQuery, Cloud Pub/Sub, Google Cloud Functions, Cloud Dataflow, GCS, Google Cloud Composer

CI/CD & DevOps Tools: Jenkins, GitLab, GitHub Actions, Docker, Kubernetes, Terraform

Frameworks: Flask, FastAPI, NestJS, TensorFlow, PyTorch, scikit-learn, Spark MLlib, Keras

APIs & Microservices: RESTful APIs, GraphQL, gRPC, WebSockets, FastAPI, Flask, Django

AI Integration: Generative AI, LLMs, Natural Language Processing (NLP)

Statistical Analysis & Data Science: Pandas, NumPy, SciPy, Jupyter, scikit-learn

Data Visualization: Tableau, Power BI, Matplotlib, Seaborn, Plotly, D3.js, Plotly Dash

Version Control: Git, GitHub, GitLab

Testing Frameworks: Pytest, Jest, Selenium, Mocha

Monitoring & Logging: Datadog, Prometheus, Grafana
