Rohit L
Email:***********@*****.***
Mobile: +1-470-***-****
SUMMARY
• Over 5 years of experience as a Data Scientist/Engineer spanning backend development and applied data science projects in enterprise environments
• Experienced in building interactive dashboards and visual analytics using Streamlit, Dash, and Power BI, with deployment on Posit Connect (RStudio Workbench)
• Proficient in designing and integrating AWS data sources (S3, Redshift, Glue, Athena) for downstream analytics and reporting workflows.
• Familiar with graph databases (Neo4j), Dataiku, and JMP for scientific data applications.
• Strong version control and collaborative development experience using Git and GitHub.
• Adept at developing data exports, reports, and visualizations that bridge technical workflows with business stakeholders.
• Highly collaborative team player, capable of working independently and proposing innovative visualization solutions for scientific challenges.
• Currently working on cloud-native data pipelines and microservices for a large-scale retail client using Google Cloud Platform (GCP), Java Spring Boot, and Databricks.
• Hands-on experience building batch and streaming data pipelines using PySpark, SQL, and Apache Airflow on GCP and AWS.
• Solid expertise in writing modular, reusable Python code for ETL/ELT workflows and data validation logic.
• Created data ingestion frameworks to pull structured and semi-structured data from APIs, flat files, and cloud buckets into Snowflake and BigQuery.
• Designed and deployed DBT-based transformation pipelines, making data consumable for analytics and reporting teams.
• Strong expertise in Kafka, Databricks, and Snowflake to build real-time, scalable, and fault-tolerant data platforms for enterprise systems.
• Delivered platform engineering solutions for data observability, job orchestration, lineage, and cost optimization using Databricks Lakehouse architecture.
• Implemented streaming and batch data integration pipelines on Kafka-Databricks-Snowflake stack for business-critical applications.
• Worked extensively on data wrangling, aggregation, and feature engineering for both analytics and machine learning models.
• Developed and maintained REST APIs using Spring Boot to expose processed data to frontend systems and third-party vendors.
• Built scheduled jobs and orchestrated DAGs in Apache Airflow to manage extract-transform-load (ETL) dependencies (a minimal DAG sketch follows this summary).
• Strong understanding of schema evolution, partitioning strategies, and performance tuning in Snowflake and BigQuery.
• Experience with designing slowly changing dimensions (SCDs) and fact tables in data warehouses.
• Used Databricks notebooks to collaborate with data analysts and data scientists for exploratory data analysis (EDA).
• Built real-time ingestion pipelines using Kafka and GCP Pub/Sub for operational and customer-facing systems.
• Regularly debug and monitor production jobs using Stackdriver, Cloud Logging, and custom alerting mechanisms.
• Created data validation rules and custom checks using Great Expectations and Pytest-based frameworks.
• Used Terraform and GCP Deployment Manager for infrastructure provisioning and managing resource templates.
• Participated in code reviews, performance profiling, and production support rotations for data pipelines.
• Collaborated with business teams to translate business requirements into technical designs and data contracts.
• Integrated logging, data lineage, and metadata tracking across pipelines to improve observability and data governance.
• Developed Power BI and Tableau dashboards to visualize the output of data transformations and key performance indicators (KPIs).
• Mentored junior engineers on Python, data modeling, Git workflows, and cloud-native data tools.
• Comfortable working in Agile teams, managing sprints, JIRA tickets, and collaborating via Confluence.
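A minimal sketch of the kind of Airflow DAG orchestration described above, assuming Airflow 2.x; the DAG id, task names, and schedule are illustrative placeholders rather than production code.

```python
# Minimal Airflow DAG sketch: extract -> transform -> load with retries and a daily schedule.
# The DAG id and task logic are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Pull raw data from an API or cloud bucket (placeholder logic).
    print("extracting raw data")


def transform(**context):
    # Apply validation and reshaping before load (placeholder logic).
    print("transforming data")


def load(**context):
    # Write curated output to the warehouse (placeholder logic).
    print("loading to warehouse")


default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_sales_etl",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Declare ETL dependencies so downstream tasks wait on upstream success.
    t_extract >> t_transform >> t_load
```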
TECHNICAL SKILLS
Languages: Python, R, SQL, JavaScript, Java, C, C++
ML/AI: TensorFlow, Keras, Scikit-learn, Prophet, PySpark, NLTK, Airflow, Pandas, OpenCV
Databases: MySQL, Microsoft SQL Server, PostgreSQL, MongoDB, Snowflake, Redshift, Neo4j (Graph DB)
SQL Environments: Snowflake, Microsoft SQL Server, Netezza
Data Visualization: Streamlit, Dash, Shiny, Power BI, Tableau, Spotfire
Reporting Tools: Tableau, Power BI, Wavefront
Data Modeling & Warehousing: Dimensional modeling, SCDs, schema evolution, Snowflake RBAC
Statistical Tools: SAS, JMP, Minitab, Scikit-learn
Data Pipeline Tools: Dataiku, Databricks, Apache Airflow
Streaming/Batch Processing: Structured Streaming, CDC, Kafka Connect, Debezium
CI/CD: Git, GitHub, GitHub Actions, Concourse, Jenkins, Docker, Azure DevOps, CI/CD for notebooks
Cloud Platforms: Google Cloud Platform, AWS, Azure, Pivotal Cloud Foundry
Cloud Services: Azure Databricks, AWS Glue, BigQuery, Cloud Composer, Dataflow, Dataproc
Cloud ML Platforms: AWS SageMaker, Azure ML, GCP Vertex AI
Frameworks: Flask, Django, Falcon, Bottle
Tools: Apache Spark, Kafka, Docker, Terraform, Airflow
Operating Systems: Linux, Unix, Windows, macOS
PROFESSIONAL EXPERIENCE
US Bank, Minneapolis, MN Jan 2024 - Present
Data Scientist/Engineer
Responsibilities:
• Designed and implemented scalable ETL pipelines in AWS using Glue and PySpark to process credit card transactions, customer demographics, and fraud detection signals.
• Migrated legacy ETL workflows from on-prem Teradata to AWS Glue and EMR, reducing processing time by over 40%.
• Partnered with credit risk analysts to translate business logic into SQL transformations using Snowflake and Amazon Redshift, including stored procedures.
• Built real-time ingestion frameworks using Kafka and Spark Streaming to identify and monitor fraudulent patterns.
• Developed custom dashboards using Streamlit and Power BI to track fraud signals and transaction insights.
• Published Python-based applications to Posit Connect for internal stakeholders to interactively explore analytics reports.
• Integrated AWS services such as Athena, Redshift, and S3 to deliver visual data stories and exports.
• Implemented Git-based version control with clear branching strategies for collaborative work across engineering teams.
• Designed reusable data visualization modules using Dash and Plotly for real-time streaming insights.
• Developed modular, version-controlled AWS Glue jobs in Python for ingesting third-party credit scores and bureau reports into S3 and Redshift.
• Orchestrated complex workflows using Apache Airflow DAGs, including pipeline health monitoring and SLA breach alerting.
• Used Terraform to provision AWS resources across dev, QA, and prod environments for infrastructure consistency.
• Applied data partitioning, bucketing, and Z-order optimization in AWS Glue to enhance Spark job performance.
• Developed batch and streaming pipelines using Databricks notebooks, Delta Lake, and Kafka to serve both operational and analytics needs (see the streaming sketch at the end of this section).
• Integrated Kafka Connect with Databricks to enable near real-time processing of financial transactions and fraud signals.
• Designed robust ingestion workflows from Kafka to Snowflake via Databricks Auto Loader with checkpointing and schema evolution support.
• Tuned Apache Spark on Databricks using dynamic resource allocation, caching strategies, and cluster auto-scaling to reduce job runtimes by 30%.
• Created parameterized, reusable notebook templates for onboarding new datasets across dev/test/prod environments using MLflow and Unity Catalog.
• Defined modularized data models in Snowflake, handling SCD Type 2 dimensions and CDC-based fact table updates with schema versioning.
• Built streaming ingestion with Kafka Avro schema registry and Databricks Structured Streaming to ensure data quality and consistency.
• Created Delta Lake tables on EMR for slowly changing dimensions (SCDs) and historical datasets, optimizing query costs.
• Established data contracts and SLAs between data producers and consumers for customer behavior datasets.
• Collaborated with Snowflake admins to configure RBAC (Role-Based Access Control) and optimize warehouse sizing.
• Developed unit and integration tests using Pytest to ensure data reliability prior to production deployment.
• Participated in compliance audits reviewing data retention, encryption, and PII/PCI masking policies.
• Integrated with internal APIs and external vendors to securely ingest credit utilization reports into data lake zones.
• Helped define CDC (Change Data Capture) strategy using Debezium and Kafka for tracking transactional system updates.
• Authored detailed documentation, runbooks, and SOPs to support production operations and reduce response time.
• Leveraged AWS Athena and Redshift Spectrum to serve curated data to ML and analytics platforms.
• Created QuickSight dashboards to track pipeline metrics, usage trends, and compute costs across environments.
• Engaged in sprint grooming and architecture reviews as part of a distributed agile data engineering team.
• Mentored junior engineers on AWS Glue, data modeling best practices, and cloud-native patterns for resilient pipelines.
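A minimal PySpark Structured Streaming sketch of the Kafka-to-Delta ingestion pattern referenced above, as it might run on Databricks; the broker address, topic name, message schema, and storage paths are assumed placeholders, not the production pipeline.

```python
# Minimal PySpark Structured Streaming sketch: Kafka -> Delta Lake with checkpointing.
# Broker, topic, schema, and paths are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("txn_stream_sketch").getOrCreate()

# Assumed message schema for a card transaction event.
txn_schema = StructType([
    StructField("txn_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read raw events from Kafka as an unbounded stream.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "card_transactions")           # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Parse the JSON payload into typed columns.
parsed = (
    raw.select(from_json(col("value").cast("string"), txn_schema).alias("t"))
    .select("t.*")
)

# Append to a bronze Delta table; the checkpoint location gives exactly-once recovery.
query = (
    parsed.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/card_txn")  # placeholder path
    .start("/mnt/delta/bronze/card_transactions")               # placeholder path
)
```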
Kaiser Permanente, Atlanta, GA Feb 2023 - Dec 2023
Data Engineer/Scientist
Responsibilities:
• Designed and maintained robust ETL pipelines using PySpark, SQL, and Informatica to process clinical, claims, and patient engagement data.
• Integrated diverse data sources including Epic (EHR), Oracle, and external APIs into an enterprise data warehouse on Teradata.
• Collaborated with data analysis and reporting teams to identify data needs and deliver clean, reliable datasets.
• Developed patient-centric data models and created fact and dimension tables to support analytics on readmission rates and treatment adherence.
• Migrated legacy SQL Server jobs to Spark-based workflows to improve scalability and reduce processing time.
• Created data ingestion pipelines using Apache NiFi and Kafka for near-real-time processing of appointment and lab result feeds.
• Built metadata-driven ETL frameworks for flexible onboarding of new data sources without code changes.
• Built clinician-focused dashboards in Power BI and Streamlit, visualizing trends in patient readmissions and appointment adherence.
• Created exportable data reports and self-service analytics tools deployed via Shiny and Dash.
• Connected AWS Glue, Athena, and Redshift pipelines directly to visualization layers for dynamic insights.
• Provided Python/SQL visual summaries of clinical KPIs, ensuring data validation across patient engagement metrics.
• Maintained version control with GitHub and conducted peer reviews of dashboard-related scripts and data connectors.
• Implemented data validation and profiling using Great Expectations and Informatica DQ to ensure trust in clinical KPIs (a minimal sketch appears at the end of this section).
• Built scalable event-driven architectures using Kafka topics to transport patient event data to Databricks Lakehouse and Snowflake warehouse.
• Migrated batch ETL processes to a hybrid Lambda architecture on Kafka + Databricks + Snowflake, improving data availability SLAs.
• Integrated Databricks with Snowflake using JDBC and Snowpipe for curated data delivery and analytics-ready modeling.
• Configured multi-cluster compute profiles on Databricks for medical image and clinical data parallel processing.
• Designed structured and semi-structured ingestion flows from EMR systems into Bronze-Silver-Gold Lakehouse layers using Delta Live Tables (DLT).
• Implemented GDPR- and HIPAA-compliant data transformations and masking policies in Snowflake with role-based access control (RBAC).
• Participated in HIPAA-compliant data governance initiatives, ensuring PII/PHI masking and audit trails across sensitive datasets.
• Automated data quality alerts and monitoring dashboards to proactively detect anomalies or pipeline failures.
• Deployed solutions on AWS, leveraging S3, Lambda, and Glue for scalable data processing and cost optimization.
• Supported data lake creation for population health analytics by ingesting unstructured files and imaging metadata.
• Tuned SQL and Spark transformations to handle large volumes of claims data with improved performance.
• Worked in agile sprints with data scientists and EMR specialists to deliver data products supporting clinical models.
• Delivered curated datasets to Tableau dashboards tracking COVID-19 testing, admissions, and vaccine rollouts.
• Documented data flows, lineage, and transformation logic in Confluence and shared repos for transparency.
• Designed CDC-based workflows to track updates to patient records and ensure incremental loads.
• Provided ad hoc data extracts and quick-turnaround solutions for executive and operational leadership.
• Participated in regular code reviews and knowledge-sharing sessions to enhance team practices and mentorship.
• Ensured SOX and compliance audit readiness by tracking data movement and access logs across environments.
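A minimal sketch of the kind of Great Expectations checks used for clinical KPI validation; it assumes the classic pandas-dataset API (Great Expectations < 1.0) and hypothetical file, column, and category names.

```python
# Minimal data validation sketch using the classic Great Expectations pandas API (< 1.0).
# The file path, columns, and allowed values are hypothetical placeholders.
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("/data/curated/patient_engagement.parquet")  # placeholder path
gdf = ge.from_pandas(df)

# Core expectations: primary-key completeness, value ranges, and allowed categories.
gdf.expect_column_values_to_not_be_null("patient_id")
gdf.expect_column_values_to_be_between("age", min_value=0, max_value=120)
gdf.expect_column_values_to_be_in_set(
    "visit_type", ["inpatient", "outpatient", "telehealth"]
)

# Fail the pipeline run if any expectation is not met.
results = gdf.validate()
if not results.success:
    raise ValueError("Data quality checks failed; review expectation results.")
```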
CONTUS Tech, India May 2019 - Feb 2022
Data Engineer/Scientist
Responsibilities:
• Built scalable ETL pipelines using Apache NiFi, PySpark, and SQL to ingest and transform large volumes of e-commerce transaction and customer interaction data.
• Designed and maintained Hive-based data warehouse schemas with proper partitioning and bucketing strategies to optimize query performance.
• Implemented real-time ingestion using Kafka and Kafka Streams, connecting producers to downstream consumers including Hive, HDFS, and dashboard services.
• Assisted in setting up CDC (Change Data Capture) mechanisms using Sqoop and Kafka Connect for incremental updates from MySQL to HDFS.
• Developed data quality validation frameworks in Python using custom checks, ensuring schema integrity and null handling.
• Built automated data reconciliation scripts to compare ingestion vs. source using PySpark and Hive queries.
• Collaborated with the data science team to prepare training datasets and conduct feature engineering for churn prediction models using Python, Pandas, and Scikit-learn.
• Supported deployment and monitoring of machine learning inference pipelines built using Flask APIs hosted on AWS EC2.
• Contributed to dashboard development in Tableau for internal operations metrics and customer analytics.
• Created custom Python utilities for log parsing, job status monitoring, and data profiling.
• Authored reusable shell scripts for automation of daily ingestion jobs and backups.
• Participated in data lineage tracking initiatives using Apache Atlas to enhance metadata and governance.
• Followed Agile practices, contributed to sprint planning, and regularly demoed work in team reviews.
Projects:
1. Customer Behavior Analytics Platform
a. Built a data pipeline to track user interactions from a retail app in real time using Kafka, NiFi, and Hive.
b. Enabled product recommendation engine with curated datasets delivered to the data science team.
2. Sales Forecasting with PySpark & ML Models
a. Developed data pipelines in PySpark to aggregate daily sales and inventory data (a minimal sketch follows these projects).
b. Partnered with the data science team to feed the data into a Prophet-based forecasting model, improving forecasting accuracy by 25%.
3. Data Lake Implementation
a. Helped implement a raw-to-curated HDFS data lake, designed ingestion strategies and built a tagging system using Atlas for data discovery.
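A minimal PySpark sketch of the daily sales aggregation that fed the Prophet forecasting model above; the Hive table, column names, and output path are illustrative assumptions.

```python
# Minimal PySpark sketch: aggregate daily sales per product and shape the output
# into the ds/y columns Prophet expects. Table, columns, and path are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_sales_agg_sketch").getOrCreate()

sales = spark.table("ecommerce.transactions")  # placeholder Hive table

# Roll raw transactions up to one row per product per day.
daily = (
    sales.withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date", "product_id")
    .agg(
        F.sum("quantity").alias("units_sold"),
        F.sum("amount").alias("revenue"),
    )
    .orderBy("order_date")
)

# Prophet expects a date column named ds and a target column named y.
forecast_input = daily.select(
    F.col("order_date").alias("ds"),
    F.col("revenue").alias("y"),
    "product_id",
)

# Persist the curated dataset for the forecasting job to consume.
forecast_input.write.mode("overwrite").parquet("/data/curated/daily_sales")  # placeholder path
```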
Education: Concordia University, St. Paul (Information Science), 2023 - 2024