Post Job Free

Data Engineer Machine Learning

Location:
Edgemont Park, TX
Salary:
$130,000
Posted:
October 08, 2025

Contact this candidate

Resume:

David Kidwell

Senior Data Engineer & Analyst

PERSONAL INFO

Phone No: +1-915-***-**** Email: *********.*****@*****.*** Location: Mesquite, TX 75149

ABOUT ME

* ***** ** ********* ** designing and optimizing scalable data pipelines and machine learning workflows. Skilled in transforming raw data into actionable insights, implementing machine learning models, and optimizing data systems for efficiency and accuracy. Proficient in building end-to-end data architectures, real-time streaming systems, and automating data processes using tools like Apache Kafka, PySpark, and Docker. Adept at delivering solutions that support business intelligence, improve decision-making, and drive operational excellence.

WORK EXPERIENCE

Senior Data Specialist Apr 2024 - Present

Canoe Intelligence New York, NY

● Engineered an automated real-time data pipeline for portfolio rebalancing using Python, PySpark, and AWS Glue, enabling seamless data integration and transformation and reducing operational costs by 50% through optimized workflows.

● Developed data ingestion and processing workflows to measure investor risk profiles, leveraging ETL pipelines and AWS Lambda. Used Pandas and NumPy to transform and clean data, increasing processing speed and driving a 30% boost in customer conversion through accurate, standardized risk analysis.
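The transform-and-clean step above can be illustrated with a minimal pure-Python sketch of z-score standardization, a common normalization applied before risk scoring. The function name and missing-value handling here are illustrative assumptions; the production workflow used Pandas and NumPy.

```python
import math

def standardize(values):
    """Z-score a list of numbers, dropping missing entries (None).

    Hypothetical stand-in for a Pandas/NumPy cleaning step: after
    dropping nulls, each value is rescaled to (v - mean) / std so
    downstream risk analysis sees comparable, unit-free inputs.
    """
    clean = [v for v in values if v is not None]
    mean = sum(clean) / len(clean)
    var = sum((v - mean) ** 2 for v in clean) / len(clean)
    std = math.sqrt(var)
    return [(v - mean) / std for v in clean] if std else [0.0] * len(clean)
```

A standardized column always sums to zero, which makes a quick sanity check easy to automate.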

● Built a data integration solution for arbitrage analysis across public and private markets using PySpark and AWS Redshift, enabling seamless data flow, automated calculations, and improving annual returns by 2%, while ensuring data reliability and security.

● Designed and implemented automated data pipelines to extract and organize documents from General Partner (GP) portals and email inboxes, leveraging ETL processes using Python, Apache Spark, and AWS Lambda for seamless integration of data collection and tracking.

● Developed a robust data ingestion system using Apache Kafka and AWS S3 for high-volume, real-time document collection, ensuring scalable and fault-tolerant processing of large-scale alternative investment data.

● Optimized data storage solutions by implementing efficient data partitioning and compression techniques on Amazon Redshift to reduce storage costs while improving query performance and retrieval times for large document datasets.

● Built and deployed automated data validation frameworks using PySpark and DBT to ensure data integrity and accuracy in the processing and transformation of financial documents, reducing manual intervention and increasing processing speed by 40%.
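As a rough illustration of what a rule-based validation layer can look like, here is a hedged pure-Python sketch. The field names (`nav`, `fund_name`, etc.) are hypothetical; the framework described above was built on PySpark and DBT, which express similar checks declaratively at scale.

```python
from dataclasses import dataclass, field

# Hypothetical required columns for a financial-document record.
REQUIRED_FIELDS = {"document_id", "fund_name", "nav", "as_of_date"}

@dataclass
class ValidationResult:
    valid: bool
    errors: list = field(default_factory=list)

def validate_record(record: dict) -> ValidationResult:
    """Apply simple integrity checks to one record.

    Two illustrative rules: all required fields present, and the
    net asset value (nav) is a non-negative number.
    """
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    nav = record.get("nav")
    if nav is not None and (not isinstance(nav, (int, float)) or nav < 0):
        errors.append("nav must be a non-negative number")
    return ValidationResult(valid=not errors, errors=errors)
```

Collecting all errors per record, rather than failing on the first, is what lets a framework report data-quality issues in bulk instead of requiring manual triage.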

● Orchestrated GitOps CI/CD workflows with GitHub Actions, AWS CodePipeline, and Docker for ML model and API deployment, enabling zero-downtime releases.

● Enhanced an outdated investment decision-making simulator, written in a combination of C++ and Python, by adding a caching mechanism that delivered over a 10x speedup.
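The caching idea behind that kind of speedup can be sketched in a few lines of Python: memoize a deterministic, expensive computation so repeated calls with the same inputs hit the cache. The simulation function below is a hypothetical stand-in; the actual simulator spanned C++ and Python.

```python
from functools import lru_cache

# Call counter to demonstrate that repeated calls do not re-run the work.
CALLS = {"count": 0}

@lru_cache(maxsize=None)
def simulate_return(scenario: int, horizon: int) -> float:
    """Stand-in for an expensive, deterministic simulation step.

    lru_cache keys on the arguments, so identical (scenario, horizon)
    pairs are computed once and then served from memory.
    """
    CALLS["count"] += 1
    return sum(1.0 / (scenario + k + 1) for k in range(horizon))
```

Memoization only pays off when inputs repeat and the function is pure; for a simulator replayed over the same scenarios, that is exactly the workload profile where 10x-class gains are plausible.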

● Maintained end-to-end MLOps pipelines using MLflow for experiment tracking, model versioning, and deployment, improving reproducibility and collaboration across the ML team.

● Integrated generated data into backend APIs in close collaboration with backend engineers, refined data formats for easier retrieval, and used AWS Lambda and Step Functions to trigger downstream API calls whenever the data stored in AWS Redshift was updated.

● Optimized PostgreSQL database queries and indexing strategies to improve the performance of complex data retrieval operations, reducing query execution time by 40%.

● Gained extensive experience with AWS services and container tooling such as EKS, Kubernetes, Docker, RDS, Glue, Lambda, Step Functions, and S3 to build scalable and resilient data processing pipelines.

AI/ML Data Engineer Aug 2021 - Mar 2024

MasterClass San Francisco, CA

● Launched an AI-driven recommendation system using TensorFlow, PyTorch, and the OpenAI API, increasing user engagement with suggested lectures by 70% through behavior-based insights and LLM-powered collaborative filtering.

● Developed a high-performance FastAPI service for retrieving top-K similar vectors with batch querying capabilities, optimizing RAG models for an advanced Q/A system.
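The core operation behind such a vector-retrieval service is selecting the K corpus vectors most similar to a query. A minimal pure-Python sketch follows; a production service would sit on an optimized vector index rather than a brute-force scan, and all names here are illustrative.

```python
import math
import heapq

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query, corpus, k=3):
    """Return indices of the k corpus vectors most similar to query.

    heapq.nlargest keeps only k candidates in memory, which matters
    when scoring a large corpus.
    """
    scored = ((cosine(query, vec), i) for i, vec in enumerate(corpus))
    return [i for _, i in heapq.nlargest(k, scored)]
```

Batch querying, as the bullet above mentions, amortizes per-request overhead by scoring many queries against the corpus in one pass.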

● Established robust data quality checks and observability using Great Expectations and Datadog, reducing pipeline failures by 40% and increasing data trust across teams.

● Architected and deployed scalable ETL pipelines using Apache Airflow, Azure Data Factory and Spark, processing over 10TB of user interaction data monthly to enable personalized content delivery and analytics.

● Leveraged Azure Synapse Analytics for data lake integration, significantly improving data access times and enabling the team to perform complex analytics on large datasets.

● Implemented real-time data streaming solutions with Azure Event Hubs and Apache Kafka, enabling instantaneous updates to user profiles and improving personalized recommendations across the platform.

● Delivered containerization of data services using Docker and Kubernetes, improving scalability and ease of deployment for machine learning models and data processing workflows, managed via Azure Kubernetes Service (AKS).

● Integrated data from multiple sources including SQL databases, APIs and cloud storage into a unified analytics pipeline using Azure Data Lake and Azure Data Factory, streamlining data processing and reducing data silos.

ETL Data Engineer Sep 2020 - Aug 2021

Data.ai (Formerly App Annie) New York, NY

● Rolled out and optimized core data processing logic using Java, R, and MyBatis (XML-based object-to-SQL mapping), enabling seamless integration of over 500TB of market data.

● Migrated legacy big data infrastructure from Hadoop (MapReduce, HDFS) to the Snowflake data warehouse.

● Developed analysis services in Databricks to model mobile app performance, market trends and user behavior using ML algorithms such as ARIMA for time-series forecasting and XGBoost for regression.

● Utilized PySpark and MLflow for distributed feature engineering and model lifecycle management, with Snowflake as the central data platform powering strategic insights for app owners.

● Utilized Snowflake automatic clustering and micro-partitioning for faster data retrieval and query optimization, improving query performance by up to 50% and reducing data processing costs by 20%.

● Established monitoring and debugging solutions using AWS CloudWatch, Datadog and Kibana for health checks, log aggregation and anomaly detection in ETL workflows, improving pipeline stability and enhancing troubleshooting efficiency by 40%.

● Collaborated in an Agile Scrum environment, participating in sprint planning and backlog grooming, delivering data infrastructure enhancements and improvements in feature velocity.

● Developed a backend microservice architecture using Flask (Python) to handle data ingestion, processing, and API calls for app performance metrics, reducing API response times by 40% and improving system scalability.

● Integrated Power BI reports and visualizations into the backend system, automating the generation of analytics and business intelligence reports and contributing to 20% faster decision-making for key stakeholders.

● Ensured data consistency and reduced latency by 30% by working closely with the backend team to inject processed market data into microservices.

Data Engineer Oct 2015 - Aug 2020

Palo Alto Networks New York, NY

● Worked on conversational AI agents that used NLP to handle 50k+ user requests per day, performing tasks and answering FAQs based on business needs in the cybersecurity industry.

● Built a data pipeline for a SaaS platform to handle 1M+ daily events utilizing Apache Kafka and Apache Spark for frequent user profile updates.
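High-volume event pipelines like the one above typically rely on micro-batching: buffering events and flushing them to a sink in fixed-size groups. Here is a hedged in-memory sketch of that pattern; the production pipeline used Apache Kafka and Spark, and the plain callable sink below is an illustrative stand-in.

```python
class MicroBatcher:
    """Buffer events and flush them to a sink in fixed-size batches.

    Batching amortizes per-write overhead (network round trips,
    commits) across many events, which is what makes 1M+ daily
    events tractable.
    """

    def __init__(self, sink, batch_size=100):
        self.sink = sink          # callable receiving a list of events
        self.batch_size = batch_size
        self.buffer = []

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send any buffered events to the sink and clear the buffer."""
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()
```

A final `flush()` on shutdown drains the partial last batch, the same role a consumer commit plays in a real Kafka pipeline.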

● Built a robust CI/CD pipeline leveraging AWS EKS, GitHub Actions, AWS CodeBuild, and AWS CodePipeline to automate deployment and integration processes.

EDUCATION

M.S. in Computer Science

Oct 2013 - Sep 2015

Binghamton University

Binghamton, NY

B.S. in Data Science

Aug 2009 - Jul 2013

Hartwick College

Oneonta, NY

SKILLS

Languages: Python, Scala, Java, C/C++, Bash

Data Storage: MySQL, PostgreSQL, Oracle, BigQuery, ElasticSearch, AWS Redshift, Snowflake, AWS S3, AWS DynamoDB, Redis, Hadoop

Tools / Libraries: MLflow, SciPy, NumPy, Pandas, Matplotlib, Apache Spark, Apache Airflow, Power BI, Grafana, Tableau, OpenCV, Databricks, Git, RabbitMQ, DBT

Cloud/DevOps: AWS (Glue, EC2, RDS, Lambda, Kinesis, SageMaker, EKS, Redshift), GCP, Azure, Docker, Terraform, CI/CD, Linux

Others: ETL, Web Scraping, LLMs (OpenAI, Llama), Econometrics, Agile Development, NLP, Vector Search, RAG


