Bandhavi P
Data Scientist / Data Engineer
********.***@*****.*** +1-513-***-**** LinkedIn
Summary of Experience:
●10+ years of experience in data analysis, design, development, and management of enterprise applications, with a focus on data visualization and business intelligence solutions.
●Strong experience in Python, Scala, SQL, PL/SQL, and RESTful web services.
●Generated dashboards in Tableau and Power BI from data sources such as HANA, Snowflake, Salesforce, Oracle, MS SQL Server, Excel, and MS Access.
●Hands-on experience in creating custom visuals, Power BI DAX calculations, and paginated reports for complex, interactive reporting solutions.
●Experience working with GIS tools (e.g., Mapbox, QGIS) to enhance geospatial data visualization in Power BI and other reporting tools.
●Familiarity with Seismic content management for automating and enhancing reporting workflows.
●Replaced existing MapReduce jobs and Hive scripts with Spark SQL & Spark data transformations for efficient data processing.
●Hands-on experience in creating tables, views, and stored procedures in Snowflake.
●Proficient in Machine Learning algorithms and predictive modeling, including Regression Models, Decision Trees, Random Forests, Sentiment Analysis, Naïve Bayes Classifiers, SVM, and Ensemble Models.
●Experience in developing Spark applications using Spark RDD, Spark SQL, and DataFrame APIs.
●Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka.
●Experience in moving data into and out of HDFS and Relational Database Systems (RDBMS) using Apache Sqoop.
●Worked with R, SPSS, and Python to develop neural network and cluster analysis models.
●Hands-on Experience in Data Acquisition and Validation, and Data Governance.
●Strong experience using HDFS, MapReduce, Hive, Spark, Sqoop, Oozie, and HBase.
●Established and executed a Data Quality Governance Framework, including an end-to-end process and data quality framework for assessing decisions that ensure data is suitable for its intended purpose.
●Good understanding of data modelling (dimensional and relational) concepts such as star-schema and snowflake-schema modelling and fact and dimension tables.
●Experience in manipulating/analysing large datasets and finding patterns and insights within structured and unstructured data.
●Hands-on experience with Python and libraries such as NumPy, Pandas, Matplotlib, Seaborn, scikit-learn, and SciPy.
●Proficient in data visualization tools such as Tableau, Python (Matplotlib), and R Shiny to create visually powerful, actionable, interactive reports and dashboards.
●Experience in developing custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HQL (HiveQL), and used UDFs from the Piggybank UDF repository.
●Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
●Experience setting up the AWS data platform: AWS CloudFormation, development endpoints, AWS Glue, EMR, Jupyter/SageMaker notebooks, Redshift, S3, and EC2 instances.
●Experience with Snowflake Multi-Cluster Warehouses
●Database design, modeling, migration, and development experience using stored procedures, triggers, cursors, constraints, and functions in MySQL, MS SQL Server, DB2, and Oracle.
●Strong understanding of Java Virtual Machines and multi-threading processes.
●Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase.
●Experience with Software development tools such as JIRA, Play, GIT.
●Used Informatica PowerCenter for ETL (extraction, transformation, and loading) of data from heterogeneous source systems into target databases.
●Strong experience with ETL and/or orchestration tools (e.g., Talend, Oozie, Airflow).
●Developed Kafka producers and consumers for streaming millions of events per second (see the producer sketch below).
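Illustrative sketch (not project code): a minimal Python Kafka producer consistent with the streaming experience above, assuming the kafka-python client; broker addresses, topic name, and payload fields are placeholders.

# Minimal Kafka producer sketch (kafka-python client); brokers, topic, and payload are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092"],   # hypothetical brokers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",      # wait for full acknowledgement for durability
    linger_ms=5,     # small batching window to raise throughput
)

event = {"event_id": 123, "type": "click", "ts": "2024-01-01T00:00:00Z"}
producer.send("events-topic", value=event)   # asynchronous send
producer.flush()                             # block until buffered records are delivered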
IT Skills:
•Programming Languages: SQL, PL/SQL, Python, UNIX, Pyspark, Pig, HiveQL, Scala, Shell Scripting
•Big Data Tools: Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala, HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper
•Machine Learning: Linear Regression, Logistic Regression, Naive Bayes, Decision Trees, Random Forest, Support Vector Machines (SVM), K-Means Clustering, K-Nearest Neighbors (KNN), Gradient Boosting Trees, AdaBoost, PCA, LDA, Natural Language Processing
•Python Libraries: NumPy, Matplotlib, NLTK, Statsmodels, Scikit-learn/sklearn, SOAP, SciPy
•Cloud Management: MS Azure; Amazon Web Services (AWS): EC2, EMR, S3, Redshift, Lambda, Athena
•Data Visualization: Tableau, Python (Matplotlib, Seaborn), R (ggplot2), Power BI, QlikView
•Databases: Oracle 12c/11g/10g, MySQL, MS SQL Server, DB2, Snowflake
•NoSQL Databases: MongoDB, HBase, Cassandra
•Operating System: Windows, Unix, Sun Solaris
PROFESSIONAL EXPERIENCE:
Client: AMEX, New York City, NY Dec 2022 – Present
Role: Senior Data Engineer
Responsibilities:
●Spearheaded the development of data quality checks, anomaly detection logic, and data profiling scripts using Python for AMEX’s fraud and credit risk data platforms.
●Built production-grade data pipelines integrating Snowflake, PostgreSQL, MongoDB, and Azure Data Lake to ingest sensitive transactional datasets.
●Developed PySpark jobs in Azure Databricks for complex data transformations, reconciliation, and staging of credit, fraud, and underwriting data.
●Ingested and transformed Guidewire Cloud CDA files into Snowflake and downstream risk analytics pipelines, enabling audit-ready ingestion.
●Built automated validation modules using Pytest for data consistency and integrity across development and production layers (illustrated in the sketch at the end of this section).
●Developed stored procedures and advanced SQL scripts for metadata validation, referential integrity monitoring, and duplicate detection.
●Created automated Python reconciliation tools to validate data between raw ingestions and curated Snowflake tables.
●Built streaming validation on Kafka messages, applying time-series analytics to detect schema drift, anomalies, and outliers.
●Led onboarding of teams onto AMEX’s Data Quality-as-a-Service platform, defining data contracts and reusable quality workflows.
●Developed Streamlit and Power BI dashboards to visualize DQ KPIs such as freshness, null % metrics, duplicates, and error rates.
●Supported explainability in ML pipelines with anomaly detection and integrity checks using Isolation Forests and clustering techniques.
●Used Git and Jira for version control, integrated CI/CD for validation pipelines and implemented test automation on merge events.
●Worked with Azure Monitor and log analytics to track job health and auto-resolve ingestion issues.
●Ensured encryption, masking, and role-based access control policies for sensitive credit risk datasets per FCRA and ECOA.
●Participated in Agile sprint planning and retrospectives, contributing to feature grooming and technical design discussions.
Environment: Python, Pandas, NumPy, Pytest, SQL, Snowflake, Azure Data Lake, Databricks, Kafka, Power BI, Streamlit, PostgreSQL, MongoDB, Git, Jira, CI/CD, Guidewire CDA, Artifactory.
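Illustrative sketch (not client code): minimal Pytest-style checks in the spirit of the validation and reconciliation work above, assuming the raw and curated extracts are already available as pandas DataFrames; file, table, and column names are hypothetical.

# Pytest sketch: consistency checks between a raw extract and a curated table.
# DataFrames are loaded from CSV here for illustration; in practice they would
# come from Snowflake/PostgreSQL queries. All names are hypothetical.
import pandas as pd
import pytest

@pytest.fixture(scope="module")
def raw():
    return pd.read_csv("raw_transactions.csv")

@pytest.fixture(scope="module")
def curated():
    return pd.read_csv("curated_transactions.csv")

def test_no_null_keys(curated):
    assert curated["transaction_id"].notna().all(), "null transaction_id in curated layer"

def test_no_duplicate_keys(curated):
    assert not curated["transaction_id"].duplicated().any(), "duplicate transaction_id values"

def test_row_count_reconciles(raw, curated):
    # The curated layer may drop rejects, but should never exceed the raw row count.
    assert len(curated) <= len(raw)

def test_amount_totals_match(raw, curated):
    # Aggregate reconciliation within a small tolerance for rounding.
    assert abs(raw["amount"].sum() - curated["amount"].sum()) < 0.01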
Client: State of Ohio, Columbus, OH Mar 2021 – Nov 2022
Role: Sr. Data Engineer
Responsibilities:
●Built the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources using SQL and big data technologies like Hadoop Hive, Azure Data Lake storage.
●Developed robust ELT pipelines in Azure Data Factory and Azure Databricks using PySpark and Spark SQL for Medicaid and healthcare data integration.
●Designed ingestion pipelines for structured/unstructured sources (XML, REST APIs, RDBMS) and orchestrated with NiFi and ADF.
●Wrote complex transformation logic using PySpark in Databricks for patient-level and claim-level metrics across Ohio Medicaid (see the sketch at the end of this section).
●Processed real-time and near-real-time data streams using Kafka and Spark Structured Streaming to enable timely Medicaid analytics.
●Designed HBase schemas and developed Kafka-HBase ingestion connectors for high-throughput streaming.
●Built partitioned Hive tables for batch ingestion; automated ingestion using Oozie and Sqoop for RDBMS sources.
●Developed OLAP/Tabular SSAS models and SSRS paginated reports for HHS stakeholders and Medicaid compliance monitoring.
●Built and maintained Power BI dashboards that integrated Snowflake and SQL Server data, using DAX for KPIs such as hospital utilization, enrollment trends, and provider performance.
●Created parameterized, reusable Data Factory pipelines to support shared ingestion logic across development environments.
●Wrote Hive, Impala, and Pig scripts to preprocess historical Medicaid datasets for downstream warehousing.
●Migrated SQL Server tables to Azure Data Lake and Snowflake; used Power BI for scalable dashboarding.
●Tuned Spark jobs, implemented broadcast joins, and used caching to enhance Azure Databricks job performance.
●Worked with Cloudera Manager to manage Hadoop cluster health and workflow scheduling (Hive, Sqoop, Pig).
●Ensured HIPAA and CMS compliance via encryption-at-rest, data masking, and audit trails in Azure and Snowflake.
Environment: Azure Data Factory, Azure Databricks, PySpark, Hive, HDFS, Snowflake, Kafka, NiFi, HBase, Sqoop, Pig, Impala, SSRS, Power BI, SSAS, Python, SQL, XML, REST APIs, Cloudera.
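Illustrative sketch (not project code): a PySpark aggregation of the kind used for the claim-level and patient-level metrics above; the schema, paths, and metric names are assumptions, not the actual Medicaid model.

# PySpark sketch: roll claims up to patient-level metrics. Paths, columns,
# and filters are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims_metrics").getOrCreate()

claims = spark.read.parquet("/mnt/datalake/curated/claims")   # hypothetical path

patient_metrics = (
    claims
    .filter(F.col("claim_status") == "PAID")
    .groupBy("patient_id")
    .agg(
        F.countDistinct("claim_id").alias("paid_claim_count"),
        F.sum("allowed_amount").alias("total_allowed_amount"),
        F.max("service_date").alias("last_service_date"),
    )
)

# Persist as a curated table for downstream reporting.
patient_metrics.write.mode("overwrite").parquet("/mnt/datalake/metrics/patient_claims")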
Client: Tenet Healthcare Corporation, Dallas, TX Jun 2019 – Feb 2021
Role: Data Engineer
Responsibilities:
●Developed, optimized, and maintained MS SQL Server databases, ensuring high availability, query tuning, and indexing for healthcare data processing.
●Designed and built scalable data pipelines in Azure Databricks using PySpark and Spark SQL for healthcare claims ingestion, transformation, and curation into Snowflake and Hive data warehouses.
●Developed and orchestrated ETL workflows in Databricks to ingest data from Smartsheet, QuickBase, Google Sheets, and RDBMS sources into Snowflake.
●Created reusable UDFs and Spark scripts within Databricks notebooks to support healthcare data aggregation, cleansing, enrichment, and batch processing.
●Developed real-time data ingestion pipelines using Kafka and Databricks Structured Streaming, enabling near-real-time data visibility across patient care, billing, and operations (see the streaming sketch at the end of this section).
●Integrated with Hive and HBase via HiveContext in Databricks for hybrid querying of historical and real-time patient records.
●Built Delta Lake architectures for ACID-compliant transactional data lakes supporting schema enforcement and time travel on medical and claims data.
●Optimized Spark job performance through efficient partitioning, broadcast joins, and cluster autoscaling in Databricks to minimize compute costs.
●Created parameterized and reusable notebook pipelines using Databricks REST APIs and widgets, enabling dynamic multi-environment deployments.
●Automated job execution and monitoring using Databricks Jobs, GitLab CI/CD, and custom alerting via Azure Monitor and Delta-based audit logs.
●Visualized curated data in Tableau dashboards for executive-level insights on cost, patient demographics, and treatment outcomes.
●Led cost-performance benchmarking across Spark clusters and orchestrated cluster policy optimization for compliance with financial constraints.
●Collaborated with governance teams to enforce HIPAA-compliant data masking, access controls, and encryption-at-rest policies in Databricks and Snowflake.
●Contributed to internal documentation, code templates, and training resources to promote adoption of Databricks development standards.
●Supported legacy ingestion processes by integrating Scala-based Kafka producers, and implemented log aggregation pipelines using Spark and Kafka.
●Built dashboards using Snowflake DW and created Power BI paginated reports using SSRS, driving stakeholder engagement and compliance reporting.
Environment: Databricks, PySpark, Spark SQL, Delta Lake, Kafka, Tableau, Snowflake, Hive, HBase, Azure Monitor, GitLab, Smartsheet, QuickBase, Google Sheets, Python, Scala, SQL, SSRS, Power BI, UDFs, HIPAA.
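Illustrative sketch (not project code): a Structured Streaming flow from Kafka into a Delta Lake table with checkpointing, in the spirit of the real-time ingestion and Delta Lake bullets above; brokers, topic, schema, and paths are placeholders, and the Delta format is assumed to be available as it is on Databricks.

# Structured Streaming sketch: Kafka topic -> Delta Lake table with checkpointing.
# Brokers, topic, schema, and paths are placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("claims_stream").getOrCreate()

schema = StructType([
    StructField("claim_id", StringType()),
    StructField("member_id", StringType()),
    StructField("billed_amount", DoubleType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
    .option("subscribe", "claims-events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers bytes; parse the JSON payload into typed columns.
parsed = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e")).select("e.*")

query = (
    parsed.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/delta/_checkpoints/claims")
    .outputMode("append")
    .start("/mnt/delta/claims_events")
)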
Client: Daiwa Derivatives, Jersey City, NJ Jan 2017 – May 2019
Role: Hadoop Developer
Responsibilities:
●Developed testing scripts in Python and prepared test procedures, analyzed test results data and suggested improvements of the system and software.
●Responsible for data extraction and ingestion from different data sources into the Hadoop Data Lake by creating ETL pipelines using Pig and Hive.
●Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access (see the partitioning sketch at the end of this section).
●Created and modified shell scripts for scheduling data cleansing scripts and ETL load processes.
●Implemented Flume for real-time data ingestion into the Hadoop Data Lake, efficiently streaming log data into HDFS and Hive for analysis, and integrated Flume with existing ETL pipelines for seamless data flow and real-time processing.
●Involved in functional, integration, regression, smoke, and performance testing; tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
●Designed and developed applications in PySpark using Python to compare the performance of Spark with Hive.
●Written and executed Test Cases and reviewed with Business & Development Teams.
●Implemented a defect-tracking process using JIRA by assigning bugs to the development team.
●Automated regression testing with the Qute tool, reducing manual effort and increasing team productivity.
Environment: Hadoop, MapReduce, HDFS, Pig, MySQL, UNIX Shell Scripting, Spark, SSIS, JSON, Hive, Sqoop, Flume.
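Illustrative sketch (not project code): the Hive partitioning pattern referenced above, driven from PySpark with Hive support enabled; the trades_part and trades_staging tables and their columns are hypothetical, and bucketing is omitted for brevity.

# PySpark sketch: partitioned Hive table with a dynamic-partition insert.
# Database, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive_partitioning")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS trades_part (
        trade_id STRING,
        symbol   STRING,
        notional DOUBLE
    )
    PARTITIONED BY (trade_date STRING)
    STORED AS ORC
""")

# Let the partition value come from the data itself (dynamic partitioning).
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    INSERT OVERWRITE TABLE trades_part PARTITION (trade_date)
    SELECT trade_id, symbol, notional, trade_date
    FROM trades_staging
""")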
Client: Aegis Technical Services, India Oct 2014 – Jul 2016
Role: Hadoop Developer
Responsibilities:
●Imported data using Sqoop from Oracle into HDFS on a regular basis (see the import sketch at the end of this section).
●Created Hive tables and worked with them using HiveQL; experienced in defining job flows.
●Involved in creating Hive tables, loading data, and writing Hive queries that run internally as MapReduce jobs; developed a custom file system plugin for Hadoop so it can access files on the data platform.
●The custom file system plugin allows Hadoop MapReduce programs, HBase, Pig, and Hive to work unmodified and access files directly.
●Used Pig as an ETL tool for transformations, event joins, filters, and pre-aggregations before storing the data in HDFS.
●Designed and implemented a MapReduce-based large-scale parallel relation-learning system.
●Set up and benchmarked Hadoop/HBase clusters for internal use.
Environment: Hadoop, MapReduce, HDFS, Hive, HBase, Java, DB2, MS Office
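Illustrative sketch (not project code): a scheduled Sqoop incremental import from Oracle to HDFS, wrapped in Python so a scheduler can call it; the JDBC URL, credential file, table, and last-value handling are placeholders.

# Sketch: incremental Sqoop import from Oracle to HDFS, invoked from Python
# (e.g., by a cron/Oozie wrapper). Connection details are placeholders.
import subprocess

SQOOP_CMD = [
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@dbhost:1521/ORCLPDB",   # hypothetical URL
    "--username", "etl_user",
    "--password-file", "hdfs:///user/etl/.oracle_pwd",
    "--table", "ORDERS",
    "--target-dir", "/data/raw/orders",
    "--incremental", "append",
    "--check-column", "ORDER_ID",
    "--last-value", "0",          # normally tracked between runs
    "--num-mappers", "4",
]

def run_import():
    # Raise if Sqoop exits non-zero so the scheduler can flag the failure.
    subprocess.run(SQOOP_CMD, check=True)

if __name__ == "__main__":
    run_import()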