MOUNA VANGA
Sr. Data Engineer
************@*****.***
PROFESSIONAL SUMMARY:
Accomplished Senior Data Engineer with 8 years of experience architecting, developing, and optimizing scalable data solutions across Azure and Databricks to support enterprise-wide analytics and operational needs.
Expert in designing end-to-end cloud-native data pipelines using Azure Data Factory, Synapse Analytics, Databricks, Microsoft Fabric, and Azure Functions, processing structured and unstructured data from sources such as IBM DB2, Cosmos DB, REST APIs, and EHR systems.
Proficient in implementing real-time streaming pipelines with Azure Event Hubs, Apache Kafka, and Azure Stream Analytics to support low-latency analytics and monitoring use cases.
Extensive experience with Azure Data Lake Storage Gen2, delivering cost-effective and governed storage solutions using RBAC, Key Vault, and lifecycle policies for secure enterprise data access.
Skilled in developing ELT workflows using dbt, SQL, and T-SQL across Synapse, Snowflake, and Cosmos DB, incorporating modular logic, reusable macros, and automated data quality checks.
Advanced programming skills in Python, PySpark, Scala, and Apache Beam, building robust data transformation pipelines and machine learning-ready datasets.
Designed and maintained Delta Lake solutions with SCD Type 2 logic, upserts, ACID-compliant transactional pipelines, and Unity Catalog-governed data assets to support critical business operations.
Experienced in orchestrating workflows using ADF triggers, Azure Logic Apps, Airflow, and DevOps CI/CD pipelines via Terraform, GitHub Actions, and Azure DevOps.
Implemented monitoring and observability using Azure Monitor, Log Analytics, and App Insights, enabling rapid detection and resolution of data pipeline issues.
Enforced compliance with GDPR, HIPAA, and SOC 2 using encryption, data masking, Purview metadata management, Unity Catalog access controls, and audit frameworks.
Collaborated with data scientists to build feature stores and data marts and to deploy ML models via Azure ML, MLflow, and SageMaker, supporting real-time scoring and predictive analytics.
Hands-on experience building Power Apps integrated with Dataverse, SQL, and Power BI, delivering actionable KPIs to business stakeholders.
Skilled in dimensional modeling (Star/Snowflake schemas), performance tuning, and cost optimization across cloud warehouse platforms.
Strong knowledge of Kusto Query Language (KQL) and Azure Data Explorer (ADX) for telemetry analysis, anomaly detection, and device data analytics.
Integrated OpenAI, LLMs, and vector/graph databases for advanced use cases in fraud detection, patient risk analysis, and document search.
Recognized leader in cloud modernization initiatives, mentoring junior engineers and driving alignment across architecture, DevOps, and data teams.
Committed to excellence through pipeline observability, governance, and performance optimization across hybrid and multi-cloud data platforms.
TECHNICAL SKILLS:
Azure Cloud & Data Services: Azure Data Factory (ADF), Microsoft Fabric, Azure Databricks, Synapse Analytics, Azure Data Lake Storage Gen2 (ADLS), Blob Storage, Event Hubs, Azure Functions, Key Vault, Azure Monitor, Azure Machine Learning, OpenAI GPT Models, Azure Purview
Databases & Warehousing: Azure SQL Database, Azure Cosmos DB, Snowflake, SQL Server, MySQL, Oracle, IBM DB2, AWS Redshift, BigQuery, Graph Databases, Vector Databases
Big Data & Distributed Frameworks: Apache Spark, PySpark, Apache Beam, Delta Lake, Hadoop (HDFS, Hive), Apache Kafka, Scala
Data Migration: IBM DB2, SQL Server, Data Reconciliation
JSON Transformations: Nested JSON Generation in PySpark, Azure Synapse
Data Lineage & Quality: Unity Catalog Governance, Data Quality Checks, Metadata Management, Data Validation Frameworks
DevOps & CI/CD: Azure DevOps, GitHub Actions, Jenkins, Terraform, Docker, Kubernetes (AKS), Bamboo, Ansible
Data Governance & Security: Azure Purview, Databricks Unity Catalog (RBAC & Fine-Grained Access Control), Apache Ranger, Azure Active Directory (RBAC), AWS IAM, Encryption at Rest/In Transit, Key Vault Integration, GCP Data Catalog
Compliance Standards: SOC-2, GDPR, HIPAA, PCI-DSS
Scripting & Programming: Python, SQL, Spark SQL, T-SQL, Shell Scripting, REST APIs, PowerShell, Java, C++, Perl
Analytics & Visualization: Power BI, Tableau, Power Apps, Looker Studio, D3.js, Clinical & Financial Dashboards, GIS/Spatial Data Analysis
Tools & Methodologies: Agile, Scrum, JIRA, Confluence, Git, GitHub, dbt, Airflow, Azure Logic Apps, Azure Monitor, Log Analytics, App Insights
EDUCATION:
Master’s in Computer Science Engineering - Kennesaw State University - Marietta, GA
Bachelor’s in Computer Science Engineering - Jawaharlal Nehru Technological University - Hyderabad, India
WORK EXPERIENCE:
Cedars-Sinai, Los Angeles, CA March 2024 – Present
Sr. Data Engineer
Responsibilities:
Designed and maintained Apache Beam pipelines deployed on Cloud Dataflow for batch and near real-time ETL processing, enabling ingestion from Azure Event Hubs, EHR systems, PostgreSQL, and REST APIs into BigQuery to support scalable clinical analytics.
Engineered and optimized complex Azure Databricks ETL pipelines using PySpark, Scala, and Python, improving runtime performance, reducing cluster costs, and resolving production defects through structured log-based debugging and root cause analysis.
Implemented Delta Lake with ACID transactions and scalable SCD Type 2 merge strategies to manage evolving patient, provider, and audit datasets while ensuring data consistency and historical traceability (see the sketch at the end of this role).
Leveraged Databricks Unity Catalog to enforce centralized governance, fine-grained access control, lineage tracking, and secure data sharing across clinical analytics workspaces.
Developed custom Python-based monitoring, health checks, and alerting frameworks to proactively detect job failures, latency spikes, and schema drift across Databricks and ADF pipelines, improving operational resilience.
Modeled and transformed healthcare data using HL7 v2 messages and FHIR bundles, enabling longitudinal patient analysis, interoperability workflows, and compliance with healthcare data exchange standards.
Extracted SMART on FHIR resources via REST APIs and Python scripts, transforming structured and semi-structured clinical data into analytics-ready datasets for reporting and patient dashboards.
Built Azure Data Factory pipelines to ingest, validate, and transform sensitive healthcare datasets into Synapse Analytics, Cosmos DB, Snowflake, and Microsoft Fabric, incorporating schema enforcement, metadata tracking, and automated lineage capture.
Orchestrated complex hybrid batch and streaming workflows using Apache Airflow and ADF triggers, while deploying containerized data services and APIs on Azure Kubernetes Service (AKS) for scalable processing.
Designed and managed secure Azure Data Lake Storage Gen2 architecture with hierarchical namespaces, RBAC enforcement, lifecycle management policies, and encryption for structured and unstructured healthcare data.
Provisioned and maintained Azure Virtual Machines for compute-intensive workloads, integrating with ADF pipelines, automation runbooks, and monitoring frameworks for controlled scheduling and scalability.
Optimized complex SQL, Spark SQL, and Python transformations using partitioning, indexing strategies, caching, and cost-aware compute sizing to reduce cluster runtime and overall cloud spend.
Standardized ELT logic using dbt models and reusable Python macros, embedding automated data validation checks and promoting modular, version-controlled transformation architecture.
Enabled real-time telemetry monitoring and alerting using Azure Event Hubs, Logic Apps, and Azure Data Explorer (KQL), supporting proactive system health checks for medical devices and AKS workloads.
Developed anomaly detection pipelines using Azure Functions and Python to identify irregular patterns in protected health information, enhancing downstream clinical analytics reliability.
Enforced enterprise-grade data security controls using RBAC, Key Vault, Private Link, encryption at rest/in transit, and audit logging, aligning data platforms with HIPAA, GDPR, and SOC 2 compliance standards.
Designed dimensional models using Star and Snowflake schemas to support BI and analytics consumption through Power BI, Looker, and Microsoft Fabric’s unified data experience.
Built low-code Power Apps integrated with Dataverse, Cosmos DB, and Azure SQL, enabling healthcare stakeholders to access real-time KPIs and operational dashboards.
Collaborated closely with DevOps, clinical analysts, architects, and data scientists to align pipeline delivery with patient-centric outcomes and regulatory requirements.
Environment: Azure (Data Factory, Databricks, ADLS Gen2, Synapse Analytics, Cosmos DB, Event Hubs, Azure Functions, Logic Apps, Azure Data Explorer, AKS, Virtual Machines, Key Vault, RBAC, Microsoft Fabric, Power Apps, Dataverse, Unity Catalog), Apache Beam, Cloud Dataflow, Apache Airflow, Delta Lake, Python, PySpark, Scala, SQL, KQL, dbt, REST APIs, FHIR/HL7, Power BI, Looker, HIPAA, GDPR, SOC 2.
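Illustrative sketch (not project code): a minimal PySpark/Delta example of the SCD Type 2 merge pattern referenced above. The database, table, and column names (analytics.dim_patient, staging.patient_updates, patient_id, attr_hash, effective_date, is_current, end_date) are hypothetical, and the staging table is assumed to already carry the dimension's column layout plus a precomputed attribute hash.

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical Delta dimension and incoming batch (names are illustrative only)
dim = DeltaTable.forName(spark, "analytics.dim_patient")
updates = spark.table("staging.patient_updates")  # assumed: dim columns + attr_hash + effective_date

# Source rows that are brand new or whose tracked attributes changed
# (attr_hash is assumed to be a hash of the SCD-tracked columns)
current = spark.table("analytics.dim_patient").where("is_current = true")
changed_or_new = updates.join(
    current.select("patient_id", "attr_hash"),
    on=["patient_id", "attr_hash"],
    how="left_anti",
)

# Step 1: expire the currently open version of any changed patient
(dim.alias("t")
    .merge(changed_or_new.alias("s"),
           "t.patient_id = s.patient_id AND t.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false", "end_date": "s.effective_date"})
    .execute())

# Step 2: append the new versions as open-ended current rows
(changed_or_new
    .withColumn("is_current", F.lit(True))
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").saveAsTable("analytics.dim_patient"))
```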
Texas Capital, Dallas, TX July 2022 – February 2024
Data Engineer
Responsibilities:
Developed and optimized Azure Databricks ETL pipelines using PySpark, Scala, and Python, supporting large-scale batch and streaming ingestion of financial datasets across Azure and hybrid cloud environments.
Implemented Delta Lake with ACID transactions, time travel, and scalable SCD Type 2 merge strategies to manage regulatory-grade transaction records and historical data traceability.
Leveraged Databricks Unity Catalog to enforce centralized governance, fine-grained access control, data lineage, and secure data sharing across analytics and ML workloads.
Diagnosed and resolved pipeline failures and performance bottlenecks through structured root cause analysis of PySpark jobs, cluster logs, and workflow telemetry, improving SLA adherence and runtime efficiency.
Built Azure Data Factory (ADF) pipelines to extract data from IBM DB2 and other RDBMS systems, implementing CDC patterns, parameterized templates, and incremental strategies into ADLS Gen2, Synapse, and Cosmos DB.
Orchestrated complex hybrid ETL workflows using Apache Airflow and ADF triggers, enabling dependency management, cross-platform scheduling, and operational observability.
Tuned complex SQL, T-SQL, and Spark SQL queries across Synapse, Snowflake, and Cosmos DB using partitioning, indexing, statistics management, and caching techniques to significantly enhance Power BI performance.
Designed and implemented dimensional models (Star and Snowflake schemas) supporting regulatory reporting, risk analytics, and OLAP workloads while aligning with enterprise governance standards.
Built real-time ingestion pipelines using Apache Kafka and Azure Event Hubs to power fraud detection, transaction monitoring, and streaming analytics use cases (see the sketch at the end of this role).
Standardized transformation logic using dbt with reusable macros, automated testing, and modular ELT architecture for maintainable staging and reporting layers.
Established comprehensive monitoring and observability using Azure Monitor, Log Analytics, and custom logging frameworks with SLA-based alerting for ADF and Databricks workflows.
Integrated vector databases and embedding pipelines to enable AI-driven use cases such as personalized recommendations, fraud detection, and semantic search through similarity matching.
Collaborated with ML teams to preprocess structured and unstructured datasets, fine-tune LLMs, and integrate OpenAI APIs (GPT, Vision models) into Azure-based data and analytics platforms.
Enforced enterprise-grade data security and governance using Azure AD RBAC, Managed Identities, Key Vault integration, encryption at rest/in transit, and masking policies aligned with SOC 2 and GDPR standards.
Deployed Azure Purview for metadata management, lineage tracking, and data cataloging, enhancing audit readiness and compliance visibility.
Automated infrastructure provisioning and CI/CD deployments using Azure DevOps, GitHub Actions, and Terraform to consistently deploy ADF pipelines, Databricks jobs, Synapse objects, and Cosmos DB configurations.
Authored technical documentation, conducted peer code reviews, and delivered stakeholder demos to improve solution transparency and cross-functional alignment.
Environment: Azure (Databricks, Unity Catalog, ADF, ADLS Gen2, Synapse, Cosmos DB, Event Hubs, Azure Monitor, Azure AD, Key Vault, Purview), Apache Spark (PySpark), Apache Airflow, Kafka, Delta Lake, Python, Scala, SQL/T-SQL, dbt, OpenAI APIs, Vector Databases, Azure DevOps, GitHub Actions, Terraform, Power BI, SOC 2, GDPR.
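Illustrative sketch (not project code): a minimal Structured Streaming example of the real-time transaction ingestion referenced above, consuming an Event Hubs Kafka-compatible endpoint and writing windowed aggregates to a Delta table. The namespace, topic, schema, alert threshold, checkpoint path, and table names are hypothetical, and the SASL authentication options Event Hubs requires are omitted for brevity.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Hypothetical transaction payload schema
txn_schema = StructType([
    StructField("txn_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Event Hubs exposes a Kafka-compatible endpoint, so Spark's built-in Kafka source can read it
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
       .option("subscribe", "transactions")
       .option("startingOffsets", "latest")
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", txn_schema).alias("t"))
          .select("t.*"))

# Flag accounts with unusually high totals per 5-minute window (illustrative rule only)
flagged = (parsed
           .withWatermark("event_time", "10 minutes")
           .groupBy(F.window("event_time", "5 minutes"), "account_id")
           .agg(F.sum("amount").alias("total_amount"))
           .where("total_amount > 50000"))

(flagged.writeStream
 .format("delta")
 .outputMode("append")
 .option("checkpointLocation", "/mnt/checkpoints/fraud_flags")
 .toTable("fraud.flagged_windows"))
```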
Voltuswave Technologies, India March 2020 – December 2021
Data Engineer
Responsibilities:
Architected and developed batch and streaming ETL pipelines using ADF and Databricks to ingest and process data from on-prem SQL sources, Cosmos DB, REST APIs, and external SaaS platforms.
Built modular ELT workflows with dbt, SQL, and T-SQL, enabling version-controlled transformations and multi-environment (dev/test/prod) deployment.
Leveraged Azure Data Lake Storage Gen2 as a unified repository with RBAC, lifecycle policies, and secure access for raw and curated datasets.
Designed dimensional models (Star and Snowflake schemas) across Synapse, Snowflake, and Cosmos DB to support scalable analytics and fast query performance.
Optimized SQL, T-SQL, and Spark SQL queries with partitioning, indexing, and caching, significantly improving Power BI and dashboard responsiveness.
Utilized Apache Spark (PySpark, Scala, Python) with Delta Lake for ACID-compliant pipelines and implemented SCD Type 2 logic to handle historical data tracking.
Engineered nested JSON outputs in ADF, Python, and PySpark to support external API integrations (see the sketch at the end of this role).
Integrated Azure Machine Learning, MLflow, TensorFlow, and Scikit-learn for anomaly detection and recommendation system models.
Developed PySpark UDFs for parsing hierarchical JSON within Synapse, improving ETL throughput and maintainability.
Processed high-frequency streaming data using Kusto Query Language (KQL) and Azure Data Explorer for telemetry analytics.
Ensured data compliance by implementing Key Vault encryption, RBAC, tokenization, and secure credential handling.
Managed data governance and lineage with Azure Purview, supporting metadata classification and GDPR compliance.
Exposed curated datasets via Azure Functions-based serverless APIs for secure data consumption across apps and workspaces.
Automated orchestration, monitoring, and alerting using Azure Logic Apps, DevOps Pipelines, and Log Analytics.
Validated pipelines using Python-based unit tests and dbt test suites, ensuring schema integrity and data quality.
Built internal automation tools using Power Apps and collaborated in Agile Scrum teams to meet iterative delivery goals.
Environment: Azure (ADF, Databricks, ADLS Gen2, Synapse, Cosmos DB, ADX, Functions, Logic Apps, Azure ML, Purview, Key Vault, RBAC, Log Analytics), Apache Spark (PySpark), Delta Lake, Python, Scala, SQL/T-SQL, dbt, KQL, MLflow, TensorFlow, Scikit-learn, Power BI, Power Apps, Azure DevOps, CI/CD, Agile, GDPR.
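Illustrative sketch (not project code): a small PySpark example of the nested JSON generation referenced above, rolling flat line items up into one JSON document per order; the sample data and field names are made up for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical flat order/line-item dataset
items = spark.createDataFrame(
    [("o1", "c1", "sku-1", 2, 9.99),
     ("o1", "c1", "sku-2", 1, 4.50),
     ("o2", "c2", "sku-3", 5, 1.25)],
    ["order_id", "customer_id", "sku", "qty", "price"],
)

# Roll line items up into an array of structs per order, then serialize to JSON
nested = (items
          .groupBy("order_id", "customer_id")
          .agg(F.collect_list(F.struct("sku", "qty", "price")).alias("line_items"))
          .select(F.to_json(F.struct("order_id", "customer_id", "line_items")).alias("payload")))

nested.show(truncate=False)  # one nested JSON document per order, ready for an API call
```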
Sciens Software Technologies, India January 2018 – February 2020
Jr. Data Engineer
Responsibilities:
Designed and implemented data pipelines using Talend and shell scripting, ensuring smooth data ingestion, transformation, and validation for distributed processing systems.
Optimized and tuned SQL and PL/SQL queries, improving performance when handling large-scale Hadoop and HDFS datasets.
Automated ETL workflows for batch processing and monitored data pipelines to ensure efficient and error-free operations.
Developed and deployed serverless and cloud-based data integration solutions using AWS Lambda, EC2, and S3 for cost-effective and scalable data processing.
Employed Apache Beam and MapReduce to process large datasets, enabling real-time and batch data transformation across multiple systems (see the sketch at the end of this role).
Enhanced data governance by implementing robust data security and compliance measures to safeguard sensitive information.
Conducted system troubleshooting to identify and resolve issues, minimizing downtime and ensuring operational efficiency.
Collaborated with stakeholders to gather requirements, validate results, and deliver high-quality data solutions aligned with business goals.
Leveraged Git for version control and source code management, ensuring seamless collaboration within an Agile Kanban development environment.
Ensured data quality by implementing integrity, consistency, and completeness checks during ingestion and transformation.
Streamlined data ingestion processes to handle structured and unstructured data, enabling distributed processing across cloud and on-premise systems.
Played a key role in distributed data processing, ensuring seamless integration across diverse tools and platforms.
Environment: Talend, Shell Scripting, SQL, PL/SQL, Hadoop, HDFS, AWS Lambda, AWS EC2, AWS S3, Apache Beam, MapReduce, Git, Agile, Apache Kafka, Jenkins, Apache Airflow.
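Illustrative sketch (not project code): a minimal Apache Beam (Python SDK) batch pipeline in the spirit of the processing referenced above; the input file, field layout, and output prefix are hypothetical, and the pipeline runs on the local DirectRunner by default.

```python
import apache_beam as beam

def parse_line(line):
    # Hypothetical layout: "user_id,event_type,value"
    user_id, event_type, value = line.split(",")
    return (event_type, float(value))

with beam.Pipeline() as p:
    (p
     | "ReadEvents" >> beam.io.ReadFromText("events.csv")
     | "Parse" >> beam.Map(parse_line)
     | "SumPerType" >> beam.CombinePerKey(sum)        # total value per event type
     | "Format" >> beam.MapTuple(lambda k, v: f"{k},{v}")
     | "Write" >> beam.io.WriteToText("event_totals"))
```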