Sushmita Das
Indore | +91-934******* | ***************@*****.*** | LinkedIn: linkedin.com/in/sushmita-das-43574a226/
Professional Summary
Results-oriented Data Engineer with hands-on experience designing and implementing scalable data pipelines and distributed data systems using Apache Spark, Databricks, and cloud platforms such as AWS and Azure. Proficient in real-time and batch data ingestion, ETL workflows, Delta Lake optimization, orchestration with Databricks Jobs, and performance tuning. Passionate about transforming raw data into reliable, high-quality datasets that support data-driven decision-making.
Technical Skills
Programming & Querying: Python (Pandas, NumPy), SQL, PySpark
Big Data & Processing: Apache Spark (RDD, SQL, DataFrame API), Databricks (Jobs, DLT/Workflows, Autoloader, Unity Catalog), Kafka
Data Architecture: Delta Lake, Medallion Architecture, Structured Streaming, ETL
Cloud Platforms:
- AWS: S3, Lambda
- Azure: Azure Databricks, ADF
Databases: SQL Server, MySQL, Databricks SQL
Tools & DevOps: Git, GitHub Actions, Power BI, Tableau, VS Code, draw.io
Soft Skills: Collaboration, Problem-Solving, Quick Learning, Time Management
Languages: English, Hindi, Bengali
Professional Experience
Junior Technical Consultant (Data Engineer)
Digivate Labs | Oct 2023 – Present (1 yr 10 mos)
- Developed scalable batch and real-time data pipelines using Databricks, Autoloader, and Apache Spark (see the ingestion sketch after this list).
- Implemented Unity Catalog for robust data governance and secure access controls.
- Built and deployed data jobs orchestrated via Databricks Jobs, enhancing pipeline automation.
- Partnered with analysts and data scientists to ensure data quality and lineage across projects.
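A minimal sketch of the Autoloader-based ingestion pattern behind these pipelines, assuming a JSON landing zone; the paths, checkpoint locations, and bronze.orders table name are illustrative placeholders, not client configuration:

```python
# Hedged sketch: incremental file ingestion with Databricks Auto Loader.
# All paths and table names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Auto Loader discovers and ingests only files that are new since the last run.
raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders/schema")
    .load("/mnt/landing/orders/")
)

# Write the stream to a Delta table; the checkpoint tracks ingestion progress.
(
    raw.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/orders/bronze")
    .trigger(availableNow=True)  # process all available files, then stop (batch-style)
    .toTable("bronze.orders")
)
```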
Project: E-commerce Data Platform Migration
- Automated the migration of 35TB+ of historical and real-time data from Vertica through S3 to Databricks, ensuring data integrity and minimal downtime.
- Re-engineered and optimized 150+ ETL pipelines and batch workflows using Databricks notebooks and Delta Lake.
- Refactored 1,100+ SQL scripts to align with updated naming standards (lowercase table names, camelCase column names).
- Implemented robust data governance and access control with Unity Catalog.
- Automated data quality validation and monitoring to ensure accuracy and consistency post-migration.
- Achieved a 40% reduction in infrastructure costs and improved query performance by 3x.
- Collaborated with cross-functional teams to align migration with business requirements and SLAs.
- Utilized best practices in schema evolution, data transformation, and cost optimization.
- Built metadata-driven batch processing for scalable table migration (see the sketch after this list).
Key Skills: Databricks, AWS, ETL, Data Migration, Delta Lake, Python, SQL, Data Quality, Data Governance, Big Data, Real-time Data Processing, CI/CD, Stakeholder Management
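A hedged sketch of the metadata-driven migration loop, assuming a hypothetical control table with one row per table to migrate; the control table, its columns, and all paths are illustrative:

```python
# Hedged sketch: metadata-driven batch migration of many tables into Delta Lake.
# migration.control_table and its columns are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Control table: one row per table (source path/format, target name, load mode).
tables = spark.table("migration.control_table").collect()

for t in tables:
    # Read the exported snapshot from S3 (e.g. Parquet dumped from Vertica).
    df = spark.read.format(t["source_format"]).load(t["source_path"])

    # Write into Delta Lake, tolerating additive schema evolution.
    (
        df.write.format("delta")
        .mode(t["load_mode"])            # "overwrite" for history, "append" for increments
        .option("mergeSchema", "true")   # accept new columns without failing the load
        .saveAsTable(t["target_table"])
    )
```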
Project: Data Pipeline Modernization
- Refactored and optimized 10+ legacy Python scripts into scalable PySpark jobs, significantly improving data processing performance for large-scale IoT datasets.
- Migrated complex ETL pipelines to Databricks, leveraging advanced Spark features (partitioning, caching, optimized joins) to enable distributed, real-time analytics.
- Integrated geospatial data processing (GeoPandas, DBSCAN) and implemented robust data enrichment, aggregation, and alerting mechanisms (Slack integration); see the clustering sketch below.
- Enhanced data quality and reliability by implementing comprehensive error handling, logging, and monitoring.
- Collaborated cross-functionally to ensure seamless migration, validation, and deployment of new data workflows, contributing to improved operational efficiency and analytics capabilities.
- Refactored inefficient Spark jobs, reducing runtime by over 40% via caching, partitioning, and code modularization (see the tuning sketch below).
- Applied best practices in Delta Lake storage layout and metadata handling.
Key Skills: PySpark, Databricks, Distributed Data Processing, ETL, Kafka, Delta Lake, Geospatial Analytics, Python, Real-Time Data Pipelines, Data Engineering, Automation, Monitoring & Alerting
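An illustrative sketch of the tuning patterns above (broadcast join, caching a reused result, partitioned Delta output); the iot.events and iot.devices tables and their columns are hypothetical:

```python
# Hedged sketch: common Spark tuning moves on a large IoT fact table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.table("iot.events")    # large fact table (hypothetical)
devices = spark.table("iot.devices")  # small dimension table (hypothetical)

# Broadcast the small side to avoid a full shuffle join, and cache the
# joined result because several downstream aggregations reuse it.
enriched = events.join(F.broadcast(devices), "device_id").cache()

daily = (
    enriched
    .groupBy("device_id", F.to_date("event_ts").alias("day"))
    .agg(F.avg("reading").alias("avg_reading"))
)

# Partition the Delta output by day so downstream queries prune files.
(
    daily.write.format("delta")
    .mode("overwrite")
    .partitionBy("day")
    .saveAsTable("iot.daily_readings")
)
enriched.unpersist()
```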
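And a hedged sketch of the geospatial clustering step, using GeoPandas with scikit-learn's DBSCAN; the sample coordinates and the eps threshold are illustrative values, not project parameters:

```python
# Hedged sketch: density-based clustering of device coordinates.
import geopandas as gpd
import pandas as pd
from sklearn.cluster import DBSCAN

# Small illustrative sample of enriched IoT points.
pdf = pd.DataFrame({
    "device_id": ["a", "b", "c"],
    "lon": [75.85, 75.86, 76.40],
    "lat": [22.71, 22.72, 23.10],
})
gdf = gpd.GeoDataFrame(pdf, geometry=gpd.points_from_xy(pdf.lon, pdf.lat), crs="EPSG:4326")

# DBSCAN on raw degrees; eps of ~0.01 degrees is roughly 1 km (illustrative).
labels = DBSCAN(eps=0.01, min_samples=2).fit_predict(gdf[["lon", "lat"]])
gdf["cluster"] = labels  # -1 marks noise/outliers that could trigger alerts
print(gdf[["device_id", "cluster"]])
```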
Project: Real-Time Streaming Analytics Platform
- Designed and implemented a real-time streaming data pipeline for an e-commerce client using Databricks Delta Live Tables (DLT) and PySpark, processing real-time data from Azure Data Lake Storage.
- Developed and deployed a scalable Medallion architecture (Bronze, Silver, Gold layers) to optimize ingestion, transformation, and analytics workflows, implemented separately in both Python (PySpark) and SQL.
- Built automated data quality monitoring using expectation-based validations, quarantining invalid records and achieving 99.9% data accuracy for downstream reporting.
- Implemented SCD Type 1 & Type 2 transformations for both dimension and fact tables, enabling historical tracking of changes and enhancing data warehouse integrity (see the DLT sketch below).
- Integrated a real-time analytics dashboard that refreshes automatically as new data arrives in ADLS, providing instant insights into total revenue, customer retention, discount impact, and product performance KPIs, accelerating business decision-making and market responsiveness.
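A minimal Delta Live Tables sketch of the expectation-based validation and SCD Type 2 handling described above; it assumes execution inside a DLT pipeline, and all dataset names, columns, and rules are illustrative:

```python
# Hedged sketch: DLT expectations plus SCD Type 2 via apply_changes.
# Runs only inside a Delta Live Tables pipeline; names are hypothetical.
import dlt
from pyspark.sql import functions as F

@dlt.table(name="silver_orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "amount > 0")
def silver_orders():
    # Promote clean records from Bronze; rows failing expectations are dropped
    # (a parallel quarantine table can capture the inverse predicates).
    return dlt.read_stream("bronze_orders").withColumn("ingested_at", F.current_timestamp())

# SCD Type 2: keep the full history of customer dimension changes.
dlt.create_streaming_table("dim_customers")
dlt.apply_changes(
    target="dim_customers",
    source="silver_customers",
    keys=["customer_id"],
    sequence_by=F.col("updated_at"),
    stored_as_scd_type=2,
)
```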
Associate Trainee Software Engineer
Techcoopers Software Solutions Pvt. Ltd. | Sep 2022 – Jan 2023 (5 mos)
Trainee Data Analyst
Zivaya Wellness Pvt. Ltd. | Mar 2023 – Apr 2023 (1 mo)
Certifications
- Databricks Certified Data Engineer Associate
- Databricks Certified Data Engineer Professional
- Databricks Accredited Platform Architect – AWS, Azure
Education
MSc, Data Science and Analytics | DAVV University | 11/2020 – 08/2022
BSc | DAVV University | 05/2017 – 06/2020
Diploma in Computer Applications | Pioneer Institute of Technology and Management | 03/2018 – 06/2019
12th | Neeraj Bal Mandir Higher Secondary School | 03/2016 – 04/2017
10th | Daily Mirror Public Higher Secondary School | 03/2014 – 04/2015