JASON RAYEN
*****.*@**********.*** +1-703-***-**** LinkedIn GitHub Tableau
OVERVIEW
As a results-driven Data Engineer and Data Scientist with over 5 years of experience, I specialize in building and optimizing end-to-end data solutions that turn raw data into actionable business intelligence. My expertise spans the entire data lifecycle: I design robust, scalable pipelines with Apache Airflow, Databricks Workflows, and AWS Glue to ingest high-volume data from APIs and relational sources, and I manage and tune modern data platforms such as Snowflake, where I optimized DBT workflows through incremental models and strategic query rewrites to cut execution times and reduce warehousing costs. I have a strong foundation in both SQL (SQL Server, MySQL) and NoSQL (MongoDB, HBase) systems, which I apply to data modeling, migration, and cleansing to deliver accurate, business-ready data for advanced analytics, machine learning models, and critical reporting needs.
SKILLS
Programming Languages: Python, R, Java, Scala, SQL
Big Data Ecosystem: Hadoop, Hive, HDFS, HBase, Apache Airflow, Apache Kafka, Apache Spark, Apache Flink, Databricks
Cloud Technologies: AWS (EMR, EC2, S3, Athena, Glue, Elasticsearch, Lambda, DynamoDB, Redshift, QuickSight, Kinesis), Azure (Data Lake, Databricks, Data Storage, Data Factory, Azure App Service, Azure SQL Database, Azure Blob Storage)
Visualizations: Tableau, Power BI, Excel
ETL Tools: SSIS, SSRS, Fivetran, SAS, Informatica, PySpark, Tableau, Talend
Packages & Data Processing: Pandas, Matplotlib, Seaborn, PySpark, Data Pipelines, Data Build Tool (DBT)
Version Control & Database: GitHub, Git, SQL Server, PostgreSQL, Cassandra, DynamoDB, MySQL, Snowflake
Data Management: Data Modeling, Data Warehousing, Data Governance, Metadata Management, Data Quality, Master Data Management (MDM), Data Cataloging
WORK EXPERIENCE
Cardinality AI, Maryland January 2025 – Present
Data Engineer II & Product Integration Lead
Spearheaded the end-to-end data architecture and integration strategy for the State of Hawaii Child Welfare Modernization Program, a multi-year initiative to replace legacy systems with a modern SaaS platform (Cardy product). Acted as the technical lead bridging data engineering, integration development, and cross-functional project teams across Hawaii, Georgia, and Indiana.
Designed and implemented the core Data Architecture Model and State Complaints Architecture using Draw.io, optimizing star-schema designs for the data warehouse. This optimization enhanced query performance and system scalability, resulting in a 20% increase in database transaction speed and providing a single source of truth for all child welfare reporting.
Engineered complex SQL queries and automated stored procedures to validate source-to-target data mappings from legacy systems. This rigorous ETL testing framework ensured data integrity across the migration of over 50M records, achieving a 98% accuracy rate in migrated data and resolving critical historical data quality and accuracy issues.
Built and deployed serverless, automated file ingestion workflows on AWS to ensure reliable data flow. Utilized AWS services (S3, Lambda) to receive, encrypt, and store daily data extracts, reducing manual processing handoffs by 30% and providing a secure, auditable foundation for downstream processing.
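For illustration, a minimal sketch of this kind of S3-triggered ingestion handler; the bucket name, key prefix, and encryption choice are assumptions, not the production configuration:

# Hypothetical AWS Lambda handler: copy a newly arrived extract into an
# encrypted "raw" prefix so downstream jobs read from a single location.
import urllib.parse
import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "cwm-raw-data"  # illustrative bucket name

def lambda_handler(event, context):
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        src_key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Re-encrypt at rest with SSE-KMS while copying into the raw zone.
        s3.copy_object(
            Bucket=RAW_BUCKET,
            Key=f"daily-extracts/{src_key}",
            CopySource={"Bucket": src_bucket, "Key": src_key},
            ServerSideEncryption="aws:kms",
        )
    return {"status": "ok", "files": len(event["Records"])}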
Served as the lead Integration Engineer on the Boomi Integration Platform, designing and developing both REST and SOAP APIs to facilitate bi-directional, secure data transfer between state systems (e.g., Georgia State DHS) and the Cardinality AI application. Implemented OAuth 2.0 secured authentication flows to ensure real-time synchronization between platforms.
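A simplified sketch of an OAuth 2.0 client-credentials exchange of the kind used to secure such transfers; the token URL, API endpoint, and payload are placeholders rather than the actual Boomi setup:

# Hypothetical OAuth 2.0 client-credentials flow with the requests library.
import requests

TOKEN_URL = "https://auth.example.gov/oauth2/token"  # placeholder endpoint

def get_access_token(client_id: str, client_secret: str) -> str:
    resp = requests.post(
        TOKEN_URL,
        data={"grant_type": "client_credentials"},
        auth=(client_id, client_secret),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def push_case_update(token: str, payload: dict) -> None:
    # Call a protected REST endpoint with the bearer token.
    resp = requests.post(
        "https://api.example.gov/v1/case-updates",  # placeholder endpoint
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()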
Provided technical product leadership for multiple state programs (HI and GA). Successfully performed solutioning for project-specific scenarios by overcoming challenges related to understanding diverse legacy systems, ensuring compliance with state-specific regulations, and maintaining unwavering data quality across all deliverables. Played a major role in the go-live of two Cardinality products: Communicare – DSP Invoicing and Foster Parent Invoicing.
AARP Services Inc., Washington DC. June 2024 – December 2024
Data Scientist
Engineered and maintained scalable ETL pipelines on Databricks using PySpark to process and integrate terabyte-scale, multi-source data (including policy, claims, customer, and external market data), improving data availability for analytics by 40% and reducing pipeline processing time.
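A simplified sketch of such a multi-source PySpark join-and-write step; the paths, table names, and columns are illustrative, not the production schema:

# Hypothetical Databricks/PySpark step: combine policy and claims extracts
# and write a partitioned Delta table for downstream analytics.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("policy_claims_etl").getOrCreate()

policies = spark.read.parquet("s3://analytics-landing/policies/")  # illustrative paths
claims = spark.read.parquet("s3://analytics-landing/claims/")

enriched = (
    claims.join(policies, on="policy_id", how="left")
          .withColumn("claim_month", F.date_trunc("month", F.col("claim_date")))
)

(enriched.write
         .format("delta")
         .mode("overwrite")
         .partitionBy("claim_month")
         .save("s3://analytics-curated/policy_claims/"))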
Automated repetitive data validation and reporting tasks using PySpark scripts, reducing manual workload by 30% and accelerating the monthly financial closing process for the actuarial team.
Conducted comprehensive analysis on policyholder behavior, claims patterns, and market trends using Python and SQL, delivering actionable insights that informed the development of new insurance products and risk models.
Applied hyperparameter tuning and AutoML frameworks to optimize predictive models for customer lifetime value (CLV) and propensity-to-churn, increasing model accuracy by 15% and improving the precision of retention budgeting.
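As an illustration of the tuning approach, a minimal grid-search sketch; the synthetic data, gradient-boosting model, and parameter grid are assumptions, not the production churn model:

# Hypothetical propensity-to-churn tuning with scikit-learn's GridSearchCV.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for engineered customer features and churn labels.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8], random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",   # rank candidates by discrimination on held-out folds
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))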
Established MLOps practices to automate the model retraining and deployment lifecycle, ensuring model robustness and reducing the time-to-production for new algorithms from 3 weeks to 5 days.
Collaborated with cross-functional teams (Actuarial, Underwriting, and Marketing) to identify data gaps and implement scalable data solutions, streamlining the data collection and analysis process across departments.
Designed targeted marketing campaigns leveraging data-driven customer segmentation, boosting campaign conversion rates by 20% and improving policyholder retention.
Designed and productionized an end-to-end Databricks pipeline (“Email Modeling”) that automates data engineering, feature engineering/selection, EDA, reporting, and hyperparameter tuning; trains 200–250 models per run and promotes the best-performing (lowest RMSE) model.
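The promotion step reduces to selecting the lowest-RMSE candidate from a sweep; a toy sketch, where the ridge-regression candidates and synthetic data stand in for the actual model sweep:

# Hypothetical "promote the lowest-RMSE candidate" step from a model sweep.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=15, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {}
for alpha in np.logspace(-3, 2, 25):          # stand-in for the 200+ model sweep
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    candidates[alpha] = (rmse, model)

best_alpha, (best_rmse, best_model) = min(candidates.items(), key=lambda kv: kv[1][0])
print(f"promoting alpha={best_alpha:.4g} with RMSE={best_rmse:.2f}")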
George Mason University, Fairfax VA. August 2023 – August 2024
Data Scientist – Research Assistant
Built and maintained scalable, real-time data pipelines in Azure Databricks to process multi-modal datasets, including structured EHRs and unstructured clinical notes, improving data processing efficiency for research by 30%.
Accelerated data ingestion speeds by 20% in Azure Data Factory by tuning Data Lake Storage configurations, ensuring timely availability of patient and operational data.
Performed extensive data wrangling and time series analysis on patient vitals and treatment data to identify trends and predictors of health outcomes.
Developed Python-based scripts to automate data cleaning, feature engineering, and transformation tasks, improving pipeline reliability and reducing manual data preparation time by 15 hours per week.
Applied survival analysis techniques, including Cox Proportional Hazards models, to analyze patient survival data and identify significant risk factors, contributing to key research findings.
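A minimal example of the Cox Proportional Hazards approach using the lifelines library; its bundled Rossi dataset stands in for the patient survival data, which is not shown here:

# Hypothetical Cox Proportional Hazards fit with lifelines.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()                      # duration = "week", event flag = "arrest"
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")
cph.print_summary()                    # hazard ratios highlight significant risk factors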
Wrote and optimized complex queries in SQL Server and MySQL to support advanced analytics, reporting, and robust data validation protocols.
Designed interactive Tableau dashboards to visualize key healthcare metrics, patient cohort trends, and model outputs, empowering clinical teams with actionable insights.
Led a data quality initiative that automated the validation of incoming clinical trial data. Implemented rule-based checks that reduced data errors by 95% and decreased the time spent by researchers on data cleaning from 10 hours to 30 minutes per dataset, significantly accelerating the research timeline.
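A small sketch of rule-based checks of this kind with pandas; the column names and rules are made up for illustration:

# Hypothetical rule-based validation of an incoming clinical-trial extract.
import pandas as pd

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Return a table of rule violations; an empty result means the file passes."""
    rules = {
        "missing_patient_id": df["patient_id"].isna(),
        "age_out_of_range": ~df["age"].between(0, 120),
        "visit_before_enrollment": df["visit_date"] < df["enrollment_date"],
    }
    violations = [
        pd.DataFrame({"row": df.index[mask], "rule": name})
        for name, mask in rules.items() if mask.any()
    ]
    if violations:
        return pd.concat(violations, ignore_index=True)
    return pd.DataFrame(columns=["row", "rule"])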
Virtusa Consulting Services Pvt Ltd., India. September 2020 – January 2023
Data Engineer
Orchestrated the entire data pipeline lifecycle using Apache Airflow, authoring and maintaining 15+ DAGs to automate daily and weekly incremental data refresh cycles, ensuring data timeliness and integrity.
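A condensed sketch of a daily incremental-refresh DAG of this kind; the DAG id, schedule, and task callables are illustrative placeholders:

# Hypothetical Airflow DAG: extract yesterday's delta, load it, then validate.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**ctx):   # placeholders for the real extract/load/validate logic
    pass

def load(**ctx):
    pass

def validate(**ctx):
    pass

with DAG(
    dag_id="daily_incremental_refresh",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract_delta", python_callable=extract)
    t_load = PythonOperator(task_id="load_delta", python_callable=load)
    t_validate = PythonOperator(task_id="validate_counts", python_callable=validate)

    t_extract >> t_load >> t_validate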
Managed and scheduled over 50 production job workflows using BMC Control-M, establishing dependencies, handling job failures with automated alerts, and maintaining a 99.8% on-time job success rate for critical path processes.
Performed complex data integrations and transformations using SAS Enterprise Guide and SAS programs to extract and validate large, sensitive healthcare datasets from legacy source systems, ensuring compliance with HIPAA data standards before ingestion into the modern data platform.
Optimized complex SQL queries for OLAP (Online Analytical Processing) within Redshift and Snowflake by restructuring joins, implementing sort and distribution keys, and leveraging materialized views, reducing average report generation time from 12 minutes to under 90 seconds.
Designed and implemented a star-schema data model for the enterprise data warehouse, consolidating 12+ source tables into a single fact table with conformed dimensions, drastically simplifying the data structure for end-users and reducing the number of redundant tables by 25%.
Automated the generation and distribution of 20+ daily/weekly BI reports using Python and SQL, eliminating 15+ manual hours weekly and accelerating stakeholder access to insights.
Zensar Technologies, India. July 2019 – August 2020
Data Engineer
Engineered a robust, scalable data warehousing solution by integrating Amazon RDS with Snowflake, resulting in a 35% increase in query performance and a reduction in enterprise report generation time.
Automated cloud infrastructure provisioning using Terraform (IaC) and established CI/CD pipelines with GitHub Actions, achieving 70% reduction in deployment times and eliminating configuration drift.
Developed scalable ETL workflows utilizing Apache Spark, Pandas, and NumPy, leading to a 20% improvement in data reliability and a 43% reduction in data transformation latency.
Integrated and managed Spark, Hive, and Sqoop within AWS EMR frameworks, optimizing data pipelines to increase data throughput for petabyte-scale data migrations.
Streamlined ingestion from diverse structured and unstructured sources using custom scripts (Beautiful Soup) and Apache Flink connectors, enhancing end-to-end pipeline stability and reducing system errors by 20%.
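An illustrative scraping step of the kind used for unstructured HTML sources; the URL argument and tag selectors are placeholders:

# Hypothetical ingestion of an unstructured HTML source with Beautiful Soup.
import requests
from bs4 import BeautifulSoup

def scrape_table_rows(url: str) -> list[dict]:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table tr")[1:]:          # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:
            rows.append({"name": cells[0], "value": cells[1]})
    return rows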
Designed interactive operational and strategic dashboards in Power BI, sourcing data from AWS S3, RDS, and Snowflake, enabling real-time KPI monitoring and improving stakeholder decision-making speed by 45%.
EDUCATION
Master's in Data Analytics Engineering, George Mason University, Fairfax, VA, USA. December 2024
Bachelor's in Computer Science Engineering, Anna University, Chennai, India. July 2020
PROJECTS
ClimateGPT – Large Language Model for Climate Questions. November 2024
Technologies: Python, LLAMA, REST APIs, Docker, AWS Lambda, R
Integrated live environmental data APIs with the LLAMA model using dynamic function calling and prompt engineering to provide real-time climate responses.
Implemented real-time REST APIs to fetch live climate indicators (temperature, CO2 levels, disaster alerts) and mapped API responses into model inputs.
Deployed end-to-end MLOps pipelines using Docker and AWS Lambda with 99.9% uptime, and optimized API latency by 27% for 150+ international users.
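A simplified sketch of the dynamic function-calling pattern: a named tool is dispatched, and the mapped API response is folded back into the model context. The tool registry and dispatch format are assumptions; the Open-Meteo endpoint is used only as a public, key-free example:

# Hypothetical tool-dispatch step: route a model's function call to a live
# climate API and map the response into the next prompt.
import requests

def get_current_temperature(latitude: float, longitude: float) -> dict:
    resp = requests.get(
        "https://api.open-meteo.com/v1/forecast",
        params={"latitude": latitude, "longitude": longitude, "current_weather": True},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["current_weather"]

TOOLS = {"get_current_temperature": get_current_temperature}

def dispatch(tool_call: dict) -> str:
    result = TOOLS[tool_call["name"]](**tool_call["arguments"])
    return f"Observed data: {result}"   # appended to the model context

print(dispatch({"name": "get_current_temperature",
                "arguments": {"latitude": 38.83, "longitude": -77.31}}))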
Analytics and Predictive Modeling of Soccer Players. November 2023
Technologies: PySpark, Databricks, ANN, MLlib, MongoDB, Tableau
Achieved 96% prediction accuracy by developing an ANN model on Databricks using PySpark and optimizing it with Grid Search and Cross Validator.
Created an interactive Tableau dashboard and NoSQL backend to support real-time filtering and visualization for efficient player recruitment.
Organized data storage using a NoSQL database (MongoDB), applying indexing and aggregation pipelines for rapid retrieval, reducing backend query latency by 18%.
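An illustrative pymongo index and aggregation of this kind; the connection string, collection, and field names are placeholders:

# Hypothetical MongoDB index + aggregation pipeline for player lookups.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")      # placeholder connection
players = client["scouting"]["players"]

# Compound index so position/age filters avoid collection scans.
players.create_index([("position", ASCENDING), ("age", ASCENDING)])

pipeline = [
    {"$match": {"position": "ST", "age": {"$lte": 25}}},
    {"$group": {"_id": "$club", "avg_rating": {"$avg": "$overall_rating"}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 10},
]
top_clubs = list(players.aggregate(pipeline))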
Statistical Analysis, Visualizations, and Predictive Analytics of Diabetes using R. July 2023
Technologies: R, ggplot2, dplyr, Random Forest, Shiny
Performed statistical analysis to identify 5 key health predictors and developed a Random Forest model with 94% accuracy.
Illustrated and compared performance by developing more than 5 models and identified risk factors for model degradation.
Built and deployed an R Shiny web app to predict diabetes risk in real time, earning academic recognition and use in educational demos.
Received appreciation from Dr. Isuru Dassanayake and conducted YouTube tutorials showcasing methodology, application usability, and results to academic audiences.
CERTIFICATIONS
Virtusa Neural Hack Best Solver - One among the top 100 coders across the country
AWS Certified Developer (DVA-C02)
AWS Certified Machine Learning Specialty (MLS-C01)
SAS Certified Advanced Programmer (A00-232)