Diwas Rai
PROFESSIONAL SUMMARY
Senior Data Engineer with 7+ years of experience architecting and delivering end-to-end data solutions across AWS, Azure, and GCP. Proven expertise across the complete data lifecycle, including ingestion, transformation, storage, modeling, and analytics, with hands-on experience processing 10TB+ of data daily. Specialized in building high-performance batch and real-time data pipelines, implementing modern lakehouse architectures, and ensuring data reliability, governance, and security. Adept at optimizing large-scale data systems for performance and cost while collaborating with cross-functional teams to deliver scalable, business-focused data platforms and actionable insights.
TECHNICAL SKILLS
Cloud Platforms: AWS (S3, Glue, Redshift, Kinesis), Azure (Data Factory, Synapse, Databricks, Event Hubs, Key Vault, Purview), GCP (BigQuery, Dataflow)
Programming Languages: Python, SQL, Scala, Bash/Shell, T-SQL, PL/SQL, HiveQL
Databases: SQL Server, PostgreSQL, MySQL, Oracle, MongoDB, DynamoDB, Cosmos DB, Redis
Data Engineering: Apache Spark, PySpark, Databricks, Delta Lake, Kafka, Hadoop, Airflow, dbt
Data Warehousing: Snowflake, Redshift, BigQuery, Azure Synapse, SQL Server, PostgreSQL, MongoDB
ETL/ELT Pipelines: Data Ingestion, Data Transformation, Batch Processing, Real-Time Processing, CDC, Data Integration
Streaming Technologies: Kafka, Kinesis, Event Hubs, Spark Streaming, Flink
Data Modeling: Dimensional Modeling, Star Schema, Snowflake Schema, Data Vault, OLAP, OLTP
Architecture: Data Lakehouse, Medallion Architecture, Data Mesh, Lambda Architecture, Event-Driven Architecture
Data Governance & Security: IAM, RBAC, Data Lineage, Metadata Management, Data Catalog, Azure Purview, Encryption (KMS, Key Vault), HIPAA, SOC2, GDPR
DevOps & CI/CD: Terraform, Docker, Kubernetes, Git, Azure DevOps, Jenkins, GitHub Actions
BI & Visualization: Power BI, Tableau, Looker, QuickSight, Dashboarding, Reporting
Machine Learning / AI: PySpark ML, MLflow, TensorFlow, Feature Engineering, RAG, LLM Integration (OpenAI, Hugging Face)
Monitoring & Observability: CloudWatch, Azure Monitor, Datadog, Prometheus, Logging, Alerting
File Formats: Parquet, Avro, ORC, JSON, CSV, XML, Delta Lake
CERTIFICATIONS
Microsoft Certified: Azure Data Engineer Associate (DP-203)
Databricks Certified Data Engineer – Professional
Snowflake SnowPro Advanced: Data Engineer
PROFESSIONAL EXPERIENCE
Senior Data Engineer Jan 2023 – Present
State Farm Insurance – Bloomington, IL
Designed and implemented end-to-end ETL/ELT pipelines using Azure Data Factory (ADF) to ingest and transform data from multiple sources into a centralized data lakehouse, improving data availability and giving analytics teams easier access to reliable datasets.
Built and optimized data warehouses in Snowflake and Azure Synapse, applying dimensional modeling (star and snowflake schemas) to structure data effectively, which led to an almost 30% improvement in query performance and faster reporting.
Developed scalable data processing workflows using Databricks and Apache Spark (PySpark) for both batch and real-time use cases, reducing processing time by almost 40% and ensuring timely data delivery to downstream systems.
Implemented real-time data pipelines using Azure Event Hubs, Kinesis, and Spark Streaming, enabling near-real-time processing for operational reporting and business insights.
Built data integration pipelines across Azure and AWS services using ADF, AWS Glue, and Lambda, ensuring smooth data flow between systems and improving overall data accessibility.
Applied Delta Lake and the Medallion architecture (Bronze, Silver, and Gold layers) to organize data across stages, improving data quality and consistency and making datasets more reusable for analytics and machine learning.
Established data governance and security practices using Azure Purview, IAM, RBAC, and encryption (Key Vault, KMS), ensuring sensitive data was properly managed and compliant with HIPAA and SOC2 standards.
Developed and maintained CI/CD pipelines using Azure DevOps, Git, and Jenkins, automating deployments and reducing release cycle time by 35%, which improved overall workflow reliability.
Delivered analytics-ready datasets and dashboards using Power BI and Looker, helping business teams track KPIs and make faster, data-driven decisions.
Monitored pipeline performance using CloudWatch and Azure Monitor, setting up logging and alerting to quickly identify issues and maintain stable, reliable data systems.
Technologies Used: Azure Data Factory (ADF), Azure Databricks, Apache Spark (PySpark), Snowflake, Azure Synapse Analytics, Azure Stream Analytics, Event Hubs, AWS Lambda, AWS Glue, GCP BigQuery, Azure Key Vault, Azure Purview, Azure DevOps, Azure Machine Learning, Power BI, Looker; plus relational database design, data visualization, KPI development, statistical analysis, and A/B testing for data-driven decision making
Data Engineer May 2020 – Dec 2022
TIAA – New York, NY
Worked closely with the team to build and maintain ETL pipelines using GCP (Dataflow, Cloud Storage, BigQuery), owning ingestion from source systems and transformation logic, which reduced manual data handling efforts by almost 40% over time.
Developed data transformation workflows in PySpark on Databricks, cleaning inconsistent datasets, handling joins across multiple sources, and producing structured outputs analysts could use directly, which reduced processing time for larger jobs.
Contributed to building data warehouse tables in BigQuery and Snowflake using star and snowflake schemas, designing tables that supported reporting without frequent rework.
Assisted in organizing data into raw, processed, and curated layers (a lakehouse-style setup) using Cloud Storage and BigQuery, making it easier for the team to track data quality issues and reuse datasets across use cases.
Helped set up and manage Dataflow jobs and scheduling workflows, ensuring pipelines ran on time, handling failures, and reprocessing data when needed.
Built batch data pipelines using Dataflow and BigQuery, owning the transformation and validation logic applied before data was released to downstream users.
Supported streaming pipelines built on Pub/Sub and Dataflow, processing incoming data in near real time and debugging issues when data was not flowing as expected.
Performed regular data validation and troubleshooting, comparing outputs across stages and fixing issues related to schema mismatches, null values, and inconsistent transformations.
Followed GCP IAM-based access control practices, ensuring the right teams had access to datasets while keeping sensitive data restricted.
Supported the team's CI/CD workflows using Git and Jenkins, handling version control, small pipeline updates, and deployments.
Worked directly with analysts and reporting teams to deliver clean datasets and dashboards using Power BI and Looker, helping them understand data structures and resolve data issues.
Monitored pipelines using logging and alerting tools, identifying failures, reviewing logs, and resolving issues to keep workflows stable.
Technologies Used: GCP (BigQuery, Dataflow, Cloud Storage, Pub/Sub, IAM), Apache Spark (PySpark), Databricks, Snowflake, SQL, Python, Git, Jenkins, Data Modeling, ETL/ELT, Data Transformation, Orchestration, Batch & Streaming, Parquet, JSON
ETL Developer Aug 2019 – Apr 2020
Pfizer – New York, NY
Built and maintained ETL workflows using Informatica PowerCenter, extracting data from multiple source systems and applying transformation logic before loading it into the data warehouse.
Developed and maintained SSIS packages to move data between transactional systems and reporting databases, ensuring scheduled jobs ran without failures.
Improved performance of existing ETL jobs by reviewing mapping logic, joins, and transformations, reducing processing time by almost 30% for the heavier workflows.
Wrote and optimized SQL queries (SQL Server and Oracle) for data extraction and transformation, often troubleshooting issues related to slow queries or incorrect data outputs.
Added data validation checks within ETL workflows, comparing source and target data to catch inconsistencies early and improve overall data quality.
Regularly monitored ETL jobs using Informatica Workflow Monitor and SQL tools, identifying failures, checking logs, and fixing issues related to data loads or transformation errors.
Built and maintained SSRS reports for business users, helping automate reporting and reducing manual effort by ~40%.
Assisted in migrating some legacy workflows from IBM DataStage to Informatica, helping standardize ETL processes and improve maintainability.
Worked closely with the DBA team to learn and apply indexing, query tuning, and partitioning techniques, making ETL jobs run more efficiently.
Technologies Used: Informatica PowerCenter, SQL Server Integration Services (SSIS), SQL Server, Oracle Database, PL/SQL, IBM DataStage, SQL Profiler, SQL Server Reporting Services (SSRS), Microsoft Visio, Confluence, SSL, TLS
EDUCATION
Master of Science in Technology Management, GPA: 4.00
Southeast Missouri State University, Cape Girardeau, Missouri