Imran Shah
Lead Data Engineer | ETL Data Engineer | Cloud Data Engineer
***************@*****.*** | 778-***-**** | Vancouver, BC | github.com/codevector809
SUMMARY
Senior Data Engineer with 8+ years of experience designing scalable ETL pipelines, distributed data platforms, and analytics solutions across AWS and Azure. Expert in Python, SQL, Apache Spark, Apache Kafka, and Apache Airflow for batch and real-time processing of financial, healthcare, and enterprise data. Strong experience building cloud data architectures, data lakes, and lakehouses with Snowflake, Amazon Redshift, and Azure Synapse. Proven track record of leading end-to-end data engineering initiatives, optimizing pipeline performance, reducing latency, and enabling self-service analytics for product, data, and BI teams.
SKILLS
Languages
Python, SQL, Scala, Java, Bash
Big Data Technologies
Apache Spark, Hadoop, Hive, HBase, Presto, Apache Flink
ETL / Data Pipeline Tools
Apache Airflow, dbt, Apache NiFi, Talend, Informatica
Data Warehousing
Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse
DevOps & Infrastructure
Docker, Kubernetes, Terraform, Jenkins, Git, CI/CD
Data Architecture
Data Lakes, Data Lakehouse, Star Schema, Snowflake Schema, Data Modeling
Data Governance & Compliance
GDPR, HIPAA, Apache Atlas, Collibra, AWS Lake Formation
Streaming & Real-Time Processing
Apache Kafka, Spark Streaming, Kafka Streams, AWS Kinesis, Pub/Sub
Cloud Platforms
AWS (S3, Redshift, Glue, Lambda, EMR), GCP (BigQuery, Dataflow, Pub/Sub), Azure (Data Factory, Synapse, ADLS)
Databases
PostgreSQL, MySQL, MongoDB, Cassandra, DynamoDB
Monitoring & Observability
Prometheus, Grafana, ELK Stack, Datadog
PROFESSIONAL EXPERIENCE
Lead Data Engineer
EAGIS INC.
2024 – Present
•Architected and deployed a clinical data warehouse on Oracle and SQL Server, consolidating 20+ disparate data sources and improving reporting coverage for operations and clinical teams.
•Designed and optimized batch and real-time data pipelines using Apache Spark, PySpark, and Apache Kafka, improving ingestion SLAs and reducing latency for clinical event processing.
•Developed high-performance PL/SQL stored procedures and optimized SQL queries to improve reporting and analytics performance for data and clinical teams.
•Built end-to-end real-time data processing pipelines by integrating Kafka with Spark-based processing on a shared distributed computing platform.
•Led data modeling initiatives with business and analytics stakeholders, creating star-schema models that improved data quality, traceability, and HIPAA-aligned compliance.
Senior Data Engineer
CL CONSULTING
2021 – 2024
•Designed cloud-native data architecture on Azure (Data Factory, Synapse, ADLS Gen2), supporting analytics for millions of workforce records and increasing platform scalability by 3x.
•Built scalable Spark-based ETL pipelines on Azure Databricks and Synapse to process multi-terabyte datasets and improve data freshness for executive dashboards and HR analytics.
•Implemented ingestion frameworks to land raw and curated data into ADLS and Synapse, enabling centralized Data Lake and Data Lakehouse architectures for analytics and machine learning.
•Developed curated semantic layers and reusable data models for Power BI and Tableau, improving report performance and enabling self-service analytics for business and product teams.
Associate Data Engineer
AUTOX
2019 – 2021
Remote
•Maintained and improved ETL and ELT pipelines processing daily financial transaction data on AWS, increasing throughput and stability for regulatory and risk reporting.
•Migrated a legacy on-premises data warehouse to AWS using S3, Redshift, Glue, Lambda, and EMR, modernizing the data platform and enabling scalable analytics with lower operational overhead.
•Designed and orchestrated ETL pipelines using Apache Airflow and PySpark on EMR to load curated datasets into Snowflake and Redshift, improving data freshness and reducing manual work.
•Built dimensional data models using star and snowflake schemas in Snowflake to support finance and risk analytics, improving query performance and usability for analysts and data scientists.
•Implemented monitoring and alerting for critical data workflows, reducing incident resolution time and improving reliability for batch and near-real-time data pipelines.
Software Engineer
TECH VERSE
2018 – 2019
•Developed and maintained scalable backend services using Python and PL/SQL to support transactional and reporting use cases for enterprise applications.
•Built and integrated RESTful APIs to connect multiple enterprise systems and third-party applications, enabling real-time and batch data exchange across platforms.
•Collaborated with cross-functional teams using Agile/Scrum methodologies to deliver high-quality software and data solutions, improving time-to-market for key features.
•Troubleshot, debugged, and optimized existing applications and database queries to improve performance, scalability, and reliability under increasing data volumes.
PROJECTS
Healthcare Streaming Data Platform - CHORD
Architected a hybrid Data Lakehouse platform combining Spark-based batch processing and Kafka-powered streaming pipelines to ingest clinical events into Oracle/SQL Server warehouses and analytics layers, enabling more timely clinical and operational insights.
Financial Data Warehouse Migration to AWS - AUTOX
Led core data streams in migrating a legacy on-premises data warehouse to AWS S3, Redshift, and Snowflake using Spark, Glue, and Apache Airflow, enabling scalable data warehousing on distributed systems and reducing infrastructure management effort.
EDUCATION
University of Sargodha
Master of Science