Rohit L
Email:***********@*****.***
Mobile: +1-470-***-****
SUMMARY
• Over 5 years of experience as a Data Scientist/Engineer spanning backend development and applied data science projects in enterprise environments
• Experienced in building interactive dashboards and visual analytics using Streamlit, Dash, and Power BI, with deployment on Posit Connect (RStudio Workbench)
• Proficient in designing and integrating AWS data sources (S3, Redshift, Glue, Athena) for downstream analytics and reporting workflows.
• Familiar with graph databases (Neo4j), Dataiku, and JMP for scientific data applications.
• Strong version control and collaborative development experience using Git and GitHub.
• Adept at developing data exports, reports, and visualizations that bridge technical workflows with business stakeholders.
• Highly collaborative team player, capable of working independently and proposing innovative visualization solutions for scientific challenges.
• Currently working on cloud-native data pipelines and microservices for a large-scale retail client using Google Cloud Platform (GCP), Java Spring Boot, and Databricks.
• Hands-on experience building batch and streaming data pipelines using PySpark, SQL, and Apache Airflow on GCP and AWS.
• Solid expertise in writing modular, reusable Python code for ETL/ELT workflows and data validation logic.
• Created data ingestion frameworks to pull structured and semi-structured data from APIs, flat files, and cloud buckets into Snowflake and BigQuery.
• Designed and deployed DBT-based transformation pipelines, making data consumable for analytics and reporting teams.
• Strong expertise in Kafka, Databricks, and Snowflake to build real-time, scalable, and fault-tolerant data platforms for enterprise systems.
• Delivered platform engineering solutions for data observability, job orchestration, lineage, and cost optimization using Databricks Lakehouse architecture.
• Implemented streaming and batch data integration pipelines on Kafka-Databricks-Snowflake stack for business-critical applications.
• Worked extensively on data wrangling, aggregation, and feature engineering for both analytics and machine learning models.
• Developed and maintained REST APIs using Spring Boot to expose processed data to frontend systems and third-party vendors.
• Built scheduled jobs and orchestrated DAGs in Apache Airflow to manage extract-transform-load (ETL) dependencies (a minimal DAG sketch follows this summary).
• Strong understanding of schema evolution, partitioning strategies, and performance tuning in Snowflake and BigQuery.
• Experience with designing slowly changing dimensions (SCDs) and fact tables in data warehouses.
• Used Databricks notebooks to collaborate with data analysts and data scientists for exploratory data analysis (EDA).
• Built real-time ingestion pipelines using Kafka and GCP Pub/Sub for operational and customer-facing systems.
• Regularly debug and monitor production jobs using Stackdriver, Cloud Logging, and custom alerting mechanisms.
• Created data validation rules and custom checks using Great Expectations and Pytest-based frameworks.
• Used Terraform and GCP Deployment Manager for infrastructure provisioning and managing resource templates.
• Participated in code reviews, performance profiling, and production support rotations for data pipelines.
• Collaborated with business teams to translate business requirements into technical designs and data contracts.
• Integrated logging, data lineage, and metadata tracking across pipelines to improve observability and data governance.
• Developed Power BI and Tableau dashboards to visualize the output of data transformations and key performance indicators (KPIs).
• Mentored junior engineers on Python, data modeling, Git workflows, and cloud-native data tools.
• Comfortable working in Agile teams, managing sprints, JIRA tickets, and collaborating via Confluence.
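A minimal sketch of the kind of Airflow DAG orchestration described above, assuming Airflow 2.x; the DAG id, task names, and schedule are illustrative placeholders rather than production code.

```python
# Minimal Airflow DAG sketch: extract -> transform -> load with retries and a daily schedule.
# The DAG id and task logic are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Pull raw data from an API or cloud bucket (placeholder logic).
    print("extracting raw data")


def transform(**context):
    # Apply validation and reshaping before load (placeholder logic).
    print("transforming data")


def load(**context):
    # Write curated output to the warehouse (placeholder logic).
    print("loading to warehouse")


default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_sales_etl",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Declare ETL dependencies so downstream tasks wait on upstream success.
    t_extract >> t_transform >> t_load
```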
TECHNICAL SKILLS
Languages: Python, R, SQL, JavaScript, Java, C, C++
ML/AI: TensorFlow, Keras, Scikit-learn, Prophet, PySpark, NLTK, Airflow, Pandas, OpenCV
Databases: MySQL, Microsoft SQL Server, PostgreSQL, MongoDB, Snowflake, Redshift, Neo4j (Graph DB)
SQL Environments: Snowflake, Microsoft SQL Server, Netezza
Data Visualization: Streamlit, Dash, Shiny, Power BI, Tableau, Spotfire
Reporting Tools: Tableau, Power BI, Wavefront
Data Modeling & Warehousing: Dimensional modeling, SCDs, schema evolution, Snowflake RBAC
Statistical Tools: SAS, JMP, Minitab, Scikit-learn
Data Pipeline Tools: Dataiku, Databricks, Apache Airflow
Streaming/Batch Processing: Structured Streaming, CDC, Kafka Connect, Debezium
CI/CD: Git, GitHub, GitHub Actions, Concourse, Jenkins, Docker, Azure DevOps, CI/CD for notebooks
Cloud Platforms: Google Cloud Platform, AWS, Azure, Pivotal Cloud Foundry
Cloud Services: Azure Databricks, AWS Glue, BigQuery, Cloud Composer, Dataflow, Dataproc
Cloud ML Platforms: AWS SageMaker, Azure ML, GCP Vertex AI
Frameworks: Flask, Django, Falcon, Bottle
Tools: Apache Spark, Kafka, Docker, Terraform, Airflow
Operating Systems: Linux, Unix, Windows, macOS
PROFESSIONAL EXPERIENCE
US Bank, Minneapolis, MN Jan 2024 - Present
Data Scientist/Engineer
Responsibilities:
• Designed and implemented scalable ETL pipelines in AWS using Glue and PySpark to process credit card transactions, customer demographics, and fraud detection signals.
• Migrated legacy ETL workflows from on-prem Teradata to AWS Glue and EMR, reducing processing time by over 40%.
• Partnered with credit risk analysts to translate business logic into SQL transformations using Snowflake and Amazon Redshift, including stored procedures.
• Built real-time ingestion frameworks using Kafka and Spark Streaming to identify and monitor fraudulent patterns.
• Developed custom dashboards using Streamlit and Power BI to track fraud signals and transaction insights.
• Published Python-based applications to Posit Connect for internal stakeholders to interactively explore analytics reports.
• Integrated AWS services such as Athena, Redshift, and S3 to deliver visual data stories and exports.
• Implemented Git-based version control with clear branching strategies for collaborative work across engineering teams.
• Designed reusable data visualization modules using Dash and Plotly for real-time streaming insights.
• Developed modular, version-controlled AWS Glue jobs in Python for ingesting third-party credit scores and bureau reports into S3 and Redshift.
• Orchestrated complex workflows using Apache Airflow DAGs, including pipeline health monitoring and SLA breach alerting.
• Used Terraform to provision AWS resources across dev, QA, and prod environments for infrastructure consistency.
• Applied data partitioning, bucketing, and Z-order optimization in AWS Glue to enhance Spark job performance.
• Developed batch and streaming pipelines using Databricks notebooks, Delta Lake, and Kafka to serve both operational and analytics needs (see the streaming sketch at the end of this section).
• Integrated Kafka Connect with Databricks to enable near real-time processing of financial transactions and fraud signals.
• Designed robust ingestion workflows from Kafka to Snowflake via Databricks Auto Loader with checkpointing and schema evolution support.
• Tuned Apache Spark on Databricks using dynamic resource allocation, caching strategies, and cluster auto-scaling to reduce job runtimes by 30%.
• Created parameterized, reusable notebook templates for onboarding new datasets across dev/test/prod environments using MLflow and Unity Catalog.
• Defined modularized data models in Snowflake, handling SCD Type 2 dimensions and CDC-based fact table updates with schema versioning.
• Built streaming ingestion with Kafka Avro schema registry and Databricks Structured Streaming to ensure data quality and consistency.
• Created Delta Lake tables on EMR for slowly changing dimensions (SCDs) and historical datasets, optimizing query costs.
• Established data contracts and SLAs between data producers and consumers for customer behavior datasets.
• Collaborated with Snowflake admins to configure RBAC (Role-Based Access Control) and optimize warehouse sizing.
• Developed unit and integration tests using Pytest to ensure data reliability prior to production deployment.
• Participated in compliance audits reviewing data retention, encryption, and PII/PCI masking policies.
• Integrated with internal APIs and external vendors to securely ingest credit utilization reports into data lake zones.
• Helped define CDC (Change Data Capture) strategy using Debezium and Kafka for tracking transactional system updates.
• Authored detailed documentation, runbooks, and SOPs to support production operations and reduce response time.
• Leveraged AWS Athena and Redshift Spectrum to serve curated data to ML and analytics platforms.
• Created QuickSight dashboards to track pipeline metrics, usage trends, and compute costs across environments.
• Engaged in sprint grooming and architecture reviews as part of a distributed agile data engineering team.
• Mentored junior engineers on AWS Glue, data modeling best practices, and cloud-native patterns for resilient pipelines.
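A minimal PySpark Structured Streaming sketch of the Kafka-to-Delta ingestion pattern referenced above, as it might run on Databricks; the broker address, topic name, message schema, and storage paths are assumed placeholders, not the production pipeline.

```python
# Minimal PySpark Structured Streaming sketch: Kafka -> Delta Lake with checkpointing.
# Broker, topic, schema, and paths are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("txn_stream_sketch").getOrCreate()

# Assumed message schema for a card transaction event.
txn_schema = StructType([
    StructField("txn_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read raw events from Kafka as an unbounded stream.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "card_transactions")           # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Parse the JSON payload into typed columns.
parsed = (
    raw.select(from_json(col("value").cast("string"), txn_schema).alias("t"))
    .select("t.*")
)

# Append to a bronze Delta table; the checkpoint location gives exactly-once recovery.
query = (
    parsed.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/card_txn")  # placeholder path
    .start("/mnt/delta/bronze/card_transactions")               # placeholder path
)
```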
Kaiser Permanente, Atlanta, GA Feb 2023 - Dec 2023
Data Engineer/Scientist
Responsibilities:
• Designed and maintained robust ETL pipelines using PySpark, SQL, and Informatica to process clinical, claims, and patient engagement data.
• Integrated diverse data sources including Epic (EHR), Oracle, and external APIs into an enterprise data warehouse on Teradata.
• Collaborated with data analysis and reporting teams to identify data needs and deliver clean, reliable datasets.
• Developed patient-centric data models and created fact and dimension tables to support analytics on readmission rates and treatment adherence.
• Migrated legacy SQL Server jobs to Spark-based workflows to improve scalability and reduce processing time.
• Created data ingestion pipelines using Apache NiFi and Kafka for near-real-time processing of appointment and lab result feeds.
• Built metadata-driven ETL frameworks for flexible onboarding of new data sources without code changes.
• Built clinician-focused dashboards in Power BI and Streamlit, visualizing trends in patient readmissions and appointment adherence.
• Created exportable data reports and self-service analytics tools deployed via Shiny and Dash.
• Connected AWS Glue, Athena, and Redshift pipelines directly to visualization layers for dynamic insights.
• Provided Python/SQL visual summaries of clinical KPIs, ensuring data validation across patient engagement metrics.
• Maintained version control with GitHub and conducted peer reviews of dashboard-related scripts and data connectors.
• Implemented data validation and profiling using Great Expectations and Informatica DQ to ensure trust in clinical KPIs (a minimal sketch appears at the end of this section).
• Built scalable event-driven architectures using Kafka topics to transport patient event data to Databricks Lakehouse and Snowflake warehouse.
• Migrated batch ETL processes to a hybrid Lambda architecture on Kafka + Databricks + Snowflake, improving data availability SLAs.
• Integrated Databricks with Snowflake using JDBC and Snowpipe for curated data delivery and analytics-ready modeling.
• Configured multi-cluster compute profiles on Databricks for medical image and clinical data parallel processing.
• Designed structured and semi-structured ingestion flows from EMR systems into Bronze-Silver-Gold Lakehouse layers using Delta Live Tables (DLT).
• Implemented GDPR- and HIPAA-compliant data transformations and masking policies in Snowflake with role-based access control (RBAC).
• Participated in HIPAA-compliant data governance initiatives, ensuring PII/PHI masking and audit trails across sensitive datasets.
• Automated data quality alerts and monitoring dashboards to proactively detect anomalies or pipeline failures.
• Deployed solutions on AWS, leveraging S3, Lambda, and Glue for scalable data processing and cost optimization.
• Supported data lake creation for population health analytics by ingesting unstructured files and imaging metadata.
• Tuned SQL and Spark transformations to handle large volumes of claims data with improved performance.
• Worked in agile sprints with data scientists and EMR specialists to deliver data products supporting clinical models.
• Delivered curated datasets to Tableau dashboards tracking COVID-19 testing, admissions, and vaccine rollouts.
• Documented data flows, lineage, and transformation logic in Confluence and shared repos for transparency.
• Designed CDC-based workflows to track updates to patient records and ensure incremental loads.
• Provided ad hoc data extracts and quick-turnaround solutions for executive and operational leadership.
• Participated in regular code reviews and knowledge-sharing sessions to enhance team practices and mentorship.
• Ensured SOX and compliance audit readiness by tracking data movement and access logs across environments.
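A minimal sketch of the kind of Great Expectations checks used for clinical KPI validation; it assumes the classic pandas-dataset API (Great Expectations < 1.0) and hypothetical file, column, and category names.

```python
# Minimal data validation sketch using the classic Great Expectations pandas API (< 1.0).
# The file path, columns, and allowed values are hypothetical placeholders.
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("/data/curated/patient_engagement.parquet")  # placeholder path
gdf = ge.from_pandas(df)

# Core expectations: primary-key completeness, value ranges, and allowed categories.
gdf.expect_column_values_to_not_be_null("patient_id")
gdf.expect_column_values_to_be_between("age", min_value=0, max_value=120)
gdf.expect_column_values_to_be_in_set(
    "visit_type", ["inpatient", "outpatient", "telehealth"]
)

# Fail the pipeline run if any expectation is not met.
results = gdf.validate()
if not results.success:
    raise ValueError("Data quality checks failed; review expectation results.")
```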
CONTUS Tech, India May 2019 - Feb 2022
Data Engineer/Scientist
Responsibilities:
• Built scalable ETL pipelines using Apache NiFi, PySpark, and SQL to ingest and transform large volumes of e-commerce transaction and customer interaction data.
• Designed and maintained Hive-based data warehouse schemas with proper partitioning and bucketing strategies to optimize query performance.
• Implemented real-time ingestion using Kafka and Kafka Streams, connecting producers to downstream consumers including Hive, HDFS, and dashboard services.
• Assisted in setting up CDC (Change Data Capture) mechanisms using Sqoop and Kafka Connect for incremental updates from MySQL to HDFS.
• Developed data quality validation frameworks in Python using custom checks, ensuring schema integrity and null handling.
• Built automated data reconciliation scripts to compare ingestion vs. source using PySpark and Hive queries.
• Collaborated with the data science team to prepare training datasets and conduct feature engineering for churn prediction models using Python, Pandas, and Scikit-learn.
• Supported deployment and monitoring of machine learning inference pipelines built using Flask APIs hosted on AWS EC2.
• Contributed to dashboard development in Tableau for internal operations metrics and customer analytics.
• Created custom Python utilities for log parsing, job status monitoring, and data profiling.
• Authored reusable shell scripts for automation of daily ingestion jobs and backups.
• Participated in data lineage tracking initiatives using Apache Atlas to enhance metadata and governance.
• Followed Agile practices, contributed to sprint planning, and regularly demoed work in team reviews.
Projects:
1. Customer Behavior Analytics Platform
a. Built a data pipeline to track user interactions from a retail app in real time using Kafka, NiFi, and Hive.
b. Enabled product recommendation engine with curated datasets delivered to the data science team.
2. Sales Forecasting with PySpark & ML Models
a. Developed data pipelines in PySpark to aggregate daily sales and inventory data (a minimal sketch follows these projects).
b. Partnered with the data science team to feed the data into a Prophet-based forecasting model, improving forecasting accuracy by 25%.
3. Data Lake Implementation
a. Helped implement a raw-to-curated HDFS data lake, designed ingestion strategies and built a tagging system using Atlas for data discovery.
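A minimal PySpark sketch of the daily sales aggregation that fed the Prophet forecasting model above; the Hive table, column names, and output path are illustrative assumptions.

```python
# Minimal PySpark sketch: aggregate daily sales per product and shape the output
# into the ds/y columns Prophet expects. Table, columns, and path are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_sales_agg_sketch").getOrCreate()

sales = spark.table("ecommerce.transactions")  # placeholder Hive table

# Roll raw transactions up to one row per product per day.
daily = (
    sales.withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date", "product_id")
    .agg(
        F.sum("quantity").alias("units_sold"),
        F.sum("amount").alias("revenue"),
    )
    .orderBy("order_date")
)

# Prophet expects a date column named ds and a target column named y.
forecast_input = daily.select(
    F.col("order_date").alias("ds"),
    F.col("revenue").alias("y"),
    "product_id",
)

# Persist the curated dataset for the forecasting job to consume.
forecast_input.write.mode("overwrite").parquet("/data/curated/daily_sales")  # placeholder path
```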
Education: Concordia University, St. Paul (Information Science), 2023 - 2024