Diwas Rai
PROFESSIONAL SUMMARY
Senior Data Engineer with 7+ years of experience architecting and delivering end-to-end data solutions across AWS, Azure, and GCP. Proven expertise across the complete data lifecycle, including ingestion, transformation, storage, modeling, and analytics, with hands-on experience processing 10TB+ of data daily. Specialized in building high-performance batch and real-time data pipelines, implementing modern lakehouse architectures, and ensuring data reliability, governance, and security. Adept at optimizing large-scale data systems for performance and cost while collaborating with cross-functional teams to deliver scalable, business-focused data platforms and actionable insights.
TECHNICAL SKILLS
Cloud Platforms: AWS (S3, Glue, Redshift, Kinesis), Azure (Data Factory, Synapse, Databricks, Event Hubs, Key Vault, Purview), GCP (BigQuery, Dataflow)
Programming Languages: Python, SQL, Scala, Bash/Shell, T-SQL, PL/SQL, HiveQL
Databases: SQL Server, PostgreSQL, MySQL, Oracle, MongoDB, DynamoDB, Cosmos DB, Redis
Data Engineering: Apache Spark, PySpark, Databricks, Delta Lake, Kafka, Hadoop, Airflow, dbt
Data Warehousing: Snowflake, Redshift, BigQuery, Azure Synapse, SQL Server, PostgreSQL, MongoDB
ETL/ELT Pipelines: Data Ingestion, Data Transformation, Batch Processing, Real-Time Processing, CDC, Data Integration
Streaming Technologies: Kafka, Kinesis, Event Hubs, Spark Streaming, Flink
Data Modeling: Dimensional Modeling, Star Schema, Snowflake Schema, Data Vault, OLAP, OLTP
Architecture: Data Lakehouse, Medallion Architecture, Data Mesh, Lambda Architecture, Event-Driven Architecture
Data Governance & Security: IAM, RBAC, Data Lineage, Metadata Management, Data Catalog, Azure Purview, Encryption (KMS, Key Vault), HIPAA, SOC2, GDPR
DevOps & CI/CD: Terraform, Docker, Kubernetes, Git, Azure DevOps, Jenkins, GitHub Actions
BI & Visualization: Power BI, Tableau, Looker, QuickSight, Dashboarding, Reporting
Machine Learning / AI: PySpark ML, MLflow, TensorFlow, Feature Engineering, RAG, LLM Integration (OpenAI, Hugging Face)
Monitoring & Observability: CloudWatch, Azure Monitor, Datadog, Prometheus, Logging, Alerting
File Formats: Parquet, Avro, ORC, JSON, CSV, XML, Delta Lake
CERTIFICATIONS
Microsoft Certified: Azure Data Engineer Associate (DP-203)
Databricks Certified Data Engineer – Professional
Snowflake SnowPro Advanced: Data Engineer
PROFESSIONAL EXPERIENCE
Senior Data Engineer Jan 2023 – Present
State Farm Insurance – Bloomington, IL
Designed and implemented end-to-end ETL/ELT pipelines using Azure Data Factory (ADF) to ingest and transform data from multiple sources into a centralized data lakehouse, improving data availability and giving analytics teams easier access to reliable datasets.
Built and optimized data warehouses in Snowflake and Azure Synapse, applying dimensional modeling (star and snowflake schemas) to structure data effectively, which led to an almost 30% improvement in query performance and faster reporting.
Developed scalable data processing workflows using Databricks and Apache Spark (PySpark) for both batch and real-time use cases, reducing processing time by almost 40% and ensuring timely data delivery to downstream systems.
Implemented real-time data pipelines using Azure Event Hubs, Kinesis, and Spark Streaming, enabling near-real-time processing for operational reporting and business insights.
Built data integration pipelines across Azure and AWS services using ADF, AWS Glue, and Lambda, ensuring smooth data flow between systems and improving overall data accessibility.
Applied Delta Lake and the Medallion architecture (Bronze, Silver, and Gold layers) to organize data across stages, improving data quality and consistency and making datasets more reusable for analytics and machine learning.
Established data governance and security practices using Azure Purview, IAM, RBAC, and encryption (Key Vault, KMS), ensuring sensitive data was properly managed and compliant with HIPAA and SOC2 standards.
Developed and maintained CI/CD pipelines using Azure DevOps, Git, and Jenkins, automating deployments and reducing release cycle time by 35%, which improved overall workflow reliability.
Delivered analytics-ready datasets and dashboards using Power BI and Looker, helping business teams track KPIs and make faster, data-driven decisions.
Monitored pipeline performance using CloudWatch and Azure Monitor, setting up logging and alerting to quickly identify issues and maintain stable, reliable data systems.
Technologies Used: Azure Data Factory (ADF), Azure Databricks, Apache Spark (PySpark), Snowflake, Azure Synapse Analytics, Azure Stream Analytics, Event Hubs, AWS Lambda, AWS Glue, GCP BigQuery, Azure Key Vault, Azure Purview, Azure DevOps, Azure Machine Learning, Power BI, Looker; plus relational database design, data visualization, KPI development, statistical analysis, and A/B testing for data-driven decision making
Data Engineer May 2020 – Dec 2022
TIAA – New York, NY
Worked closely with the team to build and maintain ETL pipelines using GCP (Dataflow, Cloud Storage, BigQuery), owning ingestion from source systems and transformation logic, which reduced manual data handling efforts by almost 40% over time.
Developed data transformation workflows in PySpark on Databricks, cleaning inconsistent datasets, handling joins across multiple sources, and producing structured outputs analysts could use directly, which reduced processing time for larger jobs.
Contributed to building data warehouse tables in BigQuery and Snowflake using star and snowflake schemas, designing tables that supported reporting without frequent rework.
Assisted in organizing data into raw, processed, and curated layers (a lakehouse-style setup) using Cloud Storage and BigQuery, making it easier for the team to track data quality issues and reuse datasets across use cases.
Helped set up and manage Dataflow jobs and scheduling workflows, ensuring pipelines ran on time, handling failures, and reprocessing data when needed.
Built batch data pipelines using Dataflow and BigQuery, owning the transformation and validation logic applied before data was released to downstream users.
Supported streaming pipelines built on Pub/Sub and Dataflow, processing incoming data in near real time and debugging issues when data was not flowing as expected.
Performed regular data validation and troubleshooting, comparing outputs across stages and fixing issues related to schema mismatches, null values, and inconsistent transformations.
Followed GCP IAM-based access control practices, ensuring the right teams had access to datasets while keeping sensitive data restricted.
Supported the team's CI/CD workflows using Git and Jenkins, handling version control, small pipeline updates, and deployments.
Worked directly with analysts and reporting teams to deliver clean datasets and dashboards using Power BI and Looker, helping them understand data structures and resolve data issues.
Monitored pipelines using logging and alerting tools, identifying failures, reviewing logs, and resolving issues to keep workflows stable.
Technologies Used: GCP (BigQuery, Dataflow, Cloud Storage, Pub/Sub, IAM), Apache Spark (PySpark), Databricks, Snowflake, SQL, Python, Git, Jenkins, Data Modeling, ETL/ELT, Data Transformation, Orchestration, Batch & Streaming, Parquet, JSON
ETL Developer Aug 2019 – Apr 2020
Pfizer – New York, NY
Built and maintained ETL workflows using Informatica PowerCenter, extracting data from multiple source systems and applying transformation logic before loading it into the data warehouse.
Developed and maintained SSIS packages to move data between transactional systems and reporting databases, ensuring scheduled jobs ran without failures.
Improved performance of existing ETL jobs by reviewing mapping logic, joins, and transformations, reducing processing time by almost 30% for the heavier workflows.
Wrote and optimized SQL queries (SQL Server and Oracle) for data extraction and transformation, often troubleshooting issues related to slow queries or incorrect data outputs.
Added data validation checks within ETL workflows, comparing source and target data to catch inconsistencies early and improve overall data quality.
Regularly monitored ETL jobs using Informatica Workflow Monitor and SQL tools, identifying failures, checking logs, and fixing issues related to data loads or transformation errors.
Built and maintained SSRS reports for business users, helping automate reporting and reducing manual effort by ~40%.
Assisted in migrating some legacy workflows from IBM DataStage to Informatica, helping standardize ETL processes and improve maintainability.
Worked closely with the DBA team to learn and apply indexing, query tuning, and partitioning techniques, making ETL jobs run more efficiently.
Technologies Used: Informatica PowerCenter, SQL Server Integration Services (SSIS), SQL Server, Oracle Database, PL/SQL, IBM DataStage, SQL Profiler, SQL Server Reporting Services (SSRS), Microsoft Visio, Confluence, SSL, TLS
EDUCATION
Master of Science in Technology Management, GPA: 4.00
Southeast Missouri State University, Cape Girardeau, Missouri