Bandhavi P
Data Scientist / Data Engineer
********.***@*****.*** +1-513-***-**** LinkedIn
Summary of Experience:
●Over 10 years of experience in the analysis, design, development, management, and implementation of various stand-alone and client-server enterprise applications.
●Strong experience in Python, Scala, SQL, PL/SQL and Restful web services.
●Worked on generating various dashboards in Tableau/Power BI using various data sources like HANA, Snowflake, Salesforce, Oracle, SQL server, Excel, MS Access.
●Replaced existing MapReduce jobs and Hive scripts with Spark SQL and Spark DataFrame transformations for more efficient data processing (an illustrative sketch follows this summary).
●Hands-on experience in creating tables, views, and stored procedures in Snowflake.
●Proficient in Machine Learning algorithms and Predictive Modeling, including Regression Models, Decision Trees, Random Forests, Sentiment Analysis, Naïve Bayes Classifiers, SVM, and Ensemble Models.
●Experience in developing Spark applications using the Spark RDD, Spark SQL, and DataFrame APIs.
●Worked with real-time data processing and streaming techniques using Spark streaming and Kafka.
●Experience in moving data into and out of HDFS and Relational Database Systems (RDBMS) using Apache Sqoop.
●Worked with R, SPSS, and Python to develop neural network algorithms and cluster analysis.
●Hands-on Experience in Data Acquisition and Validation, and Data Governance.
●Strong experience using HDFS, MapReduce, Hive, Spark, Sqoop, Oozie, and HBase.
●Established and executed the Data Quality Governance Framework, including an end-to-end process and data quality framework for assessing decisions that ensure the suitability of data for its intended purpose.
●Good understanding of data modelling (dimensional and relational) concepts such as star schema modelling and fact and dimension tables.
●Experience in manipulating/analysing large datasets and finding patterns and insights within structured and unstructured data.
●Hands-on experience with Python and libraries such as NumPy, Pandas, Matplotlib, Seaborn, scikit-learn, and SciPy.
●Proficient in data visualization tools such as Tableau, Python Matplotlib, R Shiny to create visually powerful and actionable interactive reports and dashboards.
●Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HiveQL, and used UDFs from the Piggybank UDF repository.
●Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
●Experience setting up the AWS data platform: AWS CloudFormation, development endpoints, AWS Glue, EMR, Jupyter/SageMaker notebooks, Redshift, S3, and EC2 instances.
●Experience with Snowflake Multi-Cluster Warehouses
●Database design, modeling, migration, and development experience using stored procedures, triggers, cursors, constraints, and functions in MySQL, MS SQL Server, DB2, and Oracle.
●Strong understanding of Java Virtual Machines and multi-threading processes.
●Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase.
●Experience with Software development tools such as JIRA, Play, GIT.
●Used Informatica PowerCenter for ETL (extraction, transformation, and loading) of data from heterogeneous source systems into target databases.
●Strong experience with ETL and/or orchestration tools (e.g. Talend, Oozie, Airflow)
●Experience developing Kafka producers and Kafka consumers for processing millions of streaming events per second.
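A minimal PySpark sketch of the Hive-to-Spark migration pattern referenced above; the database, table, and column names are hypothetical placeholders, not code from any client engagement.

```python
# Minimal sketch of replacing a Hive aggregation script with Spark SQL /
# the DataFrame API. Database, table, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-to-spark-migration")
    .enableHiveSupport()  # lets Spark SQL read existing Hive tables
    .getOrCreate()
)

# Equivalent of a Hive GROUP BY script, expressed with the DataFrame API
daily_totals = (
    spark.table("sales_db.transactions")  # hypothetical Hive table
    .where(F.col("amount") > 0)
    .groupBy("region", "txn_date")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.count("*").alias("txn_count"),
    )
)

# Persist the result as a managed table for downstream reporting
daily_totals.write.mode("overwrite").saveAsTable("sales_db.daily_totals")
```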
IT Skills:
•Programming Languages: SQL, PL/SQL, Python, PySpark, Pig Latin, HiveQL, Scala, UNIX Shell Scripting
•Big Data Tools: Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala, HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper
•Machine Learning: Linear Regression, Logistic Regression, Naive Bayes, Decision Trees, Random Forest, Support Vector Machines (SVM), K-Means Clustering, K-Nearest Neighbors (KNN), Gradient Boosting Trees, AdaBoost, PCA, LDA, Natural Language Processing
•Python Libraries: NumPy, Matplotlib, NLTK, Statsmodels, scikit-learn, SOAP, SciPy
•Cloud Management: MS Azure, Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena
•Data Visualization: Tableau, Python (Matplotlib, Seaborn), R(ggplot2), Power BI, QlikView
•Databases: Oracle 12c/11g/10g, MySQL, MS SQL Server, DB2, Snowflake
•NoSQL Databases: MongoDB, HBase, Cassandra
•Operating Systems: Windows, UNIX, Sun Solaris
PROFESSIONAL EXPERIENCE:
Client: AMEX, Phoenix, AZ Dec 2022 – Present
Role: Senior Data Scientist
Responsibilities:
●Led the end-to-end development of a credit risk and fraud detection model using advanced machine learning algorithms (Logistic Regression, KNN, Gradient Boosting) to enhance AMEX’s risk management systems, reducing fraudulent transactions by 18% and improving loan default prediction accuracy by 25%.
●Designed and implemented scalable data pipelines leveraging Python, SQL, and AWS to ingest and process millions of transaction records daily, ensuring high-quality, real-time data for accurate risk assessment.
●Collaborated with cross-functional teams (Risk, Fraud Prevention, IT, Marketing) to integrate machine learning models into AMEX’s enterprise-wide risk management framework, aligning model outputs with business goals and regulatory requirements.
●Built an ensemble model architecture combining Random Forest, XGBoost, and LightGBM, improving model robustness and predictive accuracy for both fraud detection and credit risk forecasting in high-transaction environments (a hedged sketch follows this list).
●Developed feature engineering and selection strategies using domain expertise and advanced statistical methods (PCA, Lasso Regression), improving model interpretability and ensuring the accurate prediction of customer behavior patterns.
●Automated and optimized model validation and performance tracking through A/B testing, ensuring that models adhered to performance and regulatory standards, and provided monthly reports on model efficacy to senior leadership.
●Implemented model deployment and integration strategies, collaborating with IT teams to deploy models within AMEX’s production environment and enabling automated decision-making processes for fraud detection and credit scoring.
●Created real-time dashboards and monitoring tools using Python (Dash/Streamlit) and Power BI to visualize model outputs, ensuring that business users could quickly assess risk factors and take timely action on potential fraud cases or high-risk accounts.
●Mentored and guided a team of 3 junior data scientists, providing training on model development, best practices, and industry standards, fostering a culture of data-driven decision-making within the team.
●Presented model results and actionable insights to senior stakeholders (VPs, Directors, Risk Managers), influencing key business decisions that contributed to a 15% reduction in credit risk exposure and a 10% increase in fraud detection efficiency.
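A hedged sketch of the soft-voting ensemble pattern referenced above (Random Forest + XGBoost + LightGBM); the synthetic dataset, feature counts, and hyperparameters are illustrative assumptions, not production values.

```python
# Hedged sketch of a soft-voting fraud ensemble; data and settings are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Synthetic stand-in for engineered, imbalanced transaction features
X, y = make_classification(n_samples=20_000, n_features=30, weights=[0.97], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42)),
        ("xgb", XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=42)),
        ("lgbm", LGBMClassifier(n_estimators=300, random_state=42)),
    ],
    voting="soft",  # average predicted probabilities across the three models
)
ensemble.fit(X_train, y_train)

# ROC AUC is a reasonable headline metric for an imbalanced fraud problem
print("ROC AUC:", roc_auc_score(y_test, ensemble.predict_proba(X_test)[:, 1]))
```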
Environment: Python, Scikit-learn, SciPy, NumPy, Pandas, Matplotlib, Seaborn, PostgreSQL, MongoDB, Azure Databricks, Azure Data Lake, Azure Data Storage, Power BI, DevOps, Oracle, Git.
Client: State of Ohio, Columbus, OH Mar 2021 – Nov 2022
Role: Sr. Data Engineer
Responsibilities:
●Built the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources using SQL and big data technologies such as Hadoop Hive and Azure Data Lake Storage.
●Installed Oozie workflow engine to run multiple Hive and Pig jobs which run independently with time and data availability. Created Partitioned Hive tables and worked on them using HiveQL.
●Performed data integrity checks, data cleansing, exploratory analysis, and feature engineering using Python.
●Developed various solution-driven views and dashboards with different chart types, including Pie Charts, Bar Charts, Tree Maps, Circle Views, Line Charts, Area Charts, and Scatter Plots in Power BI. Created ad-hoc reports for users in Power BI Desktop by connecting various data sources, multiple views, and associated reports.
●Ingested real-time and near-real-time (NRT) streaming data into HDFS using Kafka. Wrote Kafka configuration files for importing streaming log data into HBase, and imported several transactional logs from web servers with Kafka to ingest the data into HDFS.
●Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
●Involved in data ingestion into HDFS using Sqoop for full loads and Kafka for incremental loads from a variety of sources such as web servers, RDBMS, and data APIs.
●Experienced in handling administration activities using Cloudera Manager.
●Created an architectural solution that leverages the best Azure analytics tools to solve the specific needs of the Chevron use case.
●Wrote PySpark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation.
●Implemented Spark Kafka streaming to pick up data from Kafka and send it to the Spark pipeline (an illustrative streaming sketch follows this list).
●Experienced in working with different join patterns and implementing both map-side and reduce-side joins.
●Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory, Azure Databricks, and NiFi to ingest data from REST APIs, SFTP servers, and data lakes into Kafka brokers.
●Used various sources, such as SQL Server and Oracle, to pull data into Power BI. Involved in the installation of Power BI Report Server. Built dashboards that provide insights into Snowflake usage statistics and created various reports in Power BI based on the client's needs. Used the Query Editor in Power BI to perform operations such as fetching data from different file sources.
●Built pipelines to move hashed and un-hashed data from XML files to Data lake.
●Involved in developing Impala scripts for extraction, transformation, loading of data into data warehouse.
●Responsible for bringing data into HBase using the HBase shell as well as the HBase client API.
●Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
●Implemented the workflows using Apache Oozie framework to automate tasks.
●Created Live and Extract dashboards on top of Snowflake DW
●Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data. Involved in transforming data from Mainframe tables to HDFS and HBase tables using Sqoop.
●Worked on UDFs using Python for data cleansing and performed validation of machine learning models using R.
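An illustrative PySpark Structured Streaming sketch of the Kafka-to-Spark pipeline referenced above; the broker address, topic name, event schema, and storage paths are assumed placeholders.

```python
# Illustrative Kafka -> Spark Structured Streaming sketch; broker, topic,
# schema, and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-ingest").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "transactions")               # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes; decode it and apply a business-rule filter
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .where(F.col("amount") > 0)
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "/mnt/datalake/transactions")                    # placeholder mount
    .option("checkpointLocation", "/mnt/datalake/_chk/transactions")  # required for recovery
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```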
Environment: Snowflake, Python, PySpark, Pandas, Hadoop, Cloudera, Kafka, HBase, HDFS, Power BI, MapReduce, YARN, Hive, Pig, Sqoop, Oozie, Solr, Azure, Data Factory, Data Bricks, HDInsight, PL/SQL, MySQL
Client: Tenet Healthcare Corporation, Dallas, TX Jun 2019 – Feb 2021
Role: Data Engineer
Responsibilities:
●Developed Pig UDFs to manipulate data according to business requirements and worked on developing custom Pig loaders.
●Responsible for creating SQL data sets for Power BI recurring and Ad-hoc Reports. Designed, developed and tested Power BI visualizations for dashboard and Ad-hoc reporting solutions by connecting from different data sources and databases.
●Used several Python libraries, including wxPython, NumPy, and Matplotlib.
●Worked on implementation of a log producer in Scala that watches for application logs, transforms incremental logs and sends them to a Kafka and Zookeeper based log collection platform.
●Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.
●Created ETL using Python APIs, AWS Glue, Terraform, and GitLab to consume data from different source systems (Smartsheet, Quickbase, Google Sheets) into Snowflake.
●Developed Spark scripts and UDFs using both the Spark DSL and Spark SQL queries for data aggregation and querying, and wrote data back into RDBMS through Sqoop.
●Involved in migrating tables from RDBMS into Hive tables using SQOOP and later generate visualizations using Power BI.
●Expert in Power BI reports and dashboards, publishing to end users for executive-level business decisions; performed super-user training in Basic and Advanced Power BI Desktop.
●Transformed data using AWS Glue dynamic frames with PySpark, cataloged the transformed data using crawlers, and scheduled the job and crawler using the workflow feature (a hedged Glue job sketch follows this list).
●Developed data pipeline programs with Spark Scala APIs and data aggregations with Hive, and formatted data (JSON) for visualization.
●Worked on developing ETL processes (DataStage, Open Studio) to load data from multiple data sources into HDFS using Kafka and Sqoop, and performed structural modifications using MapReduce and Hive.
●Used the DataStax Spark connector to store data in and retrieve data from the Cassandra database.
●Built scalable distributed data solutions using the Hadoop ecosystem and developed Spark Streaming programs to process near-real-time data from Kafka with both stateless and stateful transformations.
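A hedged AWS Glue job sketch of the dynamic-frame pattern referenced above; the catalog database, table name, column mappings, and S3 path are assumed placeholders.

```python
# Hedged AWS Glue job sketch; database, table, mappings, and S3 path are placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table registered by a Glue crawler
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db",              # placeholder catalog database
    table_name="smartsheet_export",  # placeholder source table
)

# Rename and retype columns before loading downstream
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("id", "string", "record_id", "string"),
        ("amt", "string", "amount", "double"),
    ],
)

# Write curated data to S3 as Parquet; a second crawler can catalog this path
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/"},
    format="parquet",
)
job.commit()
```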
Environment: Apache Spark, MapReduce, Snowflake, Apache Pig, Python, Java, SSRS, HBase, AWS, Power BI, Cassandra, PySpark, Apache Kafka, Hive, Sqoop, Apache Oozie, Zookeeper, ETL, UDF.
Client: Daiwa Derivatives, Jersey City, NJ Jan 2017 – May 2019
Role: Hadoop Developer
Responsibilities:
●Developed testing scripts in Python and prepared test procedures, analyzed test results data and suggested improvements of the system and software.
●Responsible for data extraction and data ingestion from different data sources into the Hadoop Data Lake by creating ETL pipelines using Pig and Hive.
●Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access (a brief partitioning sketch follows this list).
●Create/Modify shell scripts for scheduling various data cleansing scripts and ETL load processes.
●Implemented Flume for real-time data ingestion into the Hadoop Data Lake, efficiently streaming log data into HDFS and Hive for analysis, and integrated Flume with existing ETL pipelines for seamless data flow and real-time processing.
●Involved in functional testing, integration testing, regression testing, smoke testing, and performance testing. Tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
●Experience in designing and developing applications in PySpark using python to compare the performance of Spark with Hive
●Written and executed Test Cases and reviewed with Business & Development Teams.
●Implemented the defect tracking process using the JIRA tool by assigning bugs to the development team.
●Automated regression testing with the Qute tool, reducing manual effort and increasing team productivity.
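A brief PySpark sketch of the Hive partitioning and bucketing pattern referenced above; the database, table, columns, and bucket count are hypothetical.

```python
# Minimal sketch of Hive partitioning/bucketing via Spark SQL; names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-partitioning").enableHiveSupport().getOrCreate()

# Partition by trade date and bucket by account for faster point lookups
spark.sql("""
    CREATE TABLE IF NOT EXISTS trades_db.trades_partitioned (
        trade_id   STRING,
        account_id STRING,
        notional   DOUBLE
    )
    PARTITIONED BY (trade_date STRING)
    CLUSTERED BY (account_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Dynamic partition insert from a staging table (partition column listed last)
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE trades_db.trades_partitioned PARTITION (trade_date)
    SELECT trade_id, account_id, notional, trade_date
    FROM trades_db.trades_staging
""")
```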
Environment: Hadoop, Map Reduce, HDFS, Pig, MySQL, UNIX Shell Scripting, Spark, SSIS, JSON, Hive, Sqoop, Flume.
Client: Aegis Technical Services, India Oct 2014 – Jul 2016
Role: Hadoop Developer
Responsibilities:
●Used Sqoop to load data from Oracle into HDFS on a regular basis.
●Creating Hive tables and working on them using Hive QL. Experienced in defining job flows.
●Involved in creating Hive tables, loading the data, and writing hive queries that will run internally in a MapReduce way. Developed a custom File System plugin for Hadoop so it can access files on Data Platform.
●The custom File System plugin allows Hadoop MapReduce programs, HBase, Pig and Hive to work unmodified and access files directly.
●Used Pig as an ETL tool to perform transformations, event joins, filters, and some pre-aggregations before storing the data into HDFS (an illustrative equivalent sketch follows this list).
●Designed and implemented MapReduce-based large-scale parallel relation-learning system.
●Set up and benchmarked Hadoop/HBase clusters for internal use.
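The Pig-based filter/join/pre-aggregation flow referenced above, rendered in PySpark purely for illustration (not the original Pig Latin); paths and column names are hypothetical.

```python
# Pig-style filter/join/pre-aggregate flow, shown in PySpark for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pig-style-etl").getOrCreate()

events = spark.read.option("header", True).csv("hdfs:///data/raw/events/")      # placeholder path
accounts = spark.read.option("header", True).csv("hdfs:///data/raw/accounts/")  # placeholder path

# FILTER, JOIN, and GROUP/pre-aggregate, as the Pig script would, before storing to HDFS
curated = (
    events.where(F.col("status") == "ACTIVE")
    .join(accounts, on="account_id", how="inner")
    .groupBy("account_id", "event_date")
    .agg(F.count("*").alias("event_count"))
)

curated.write.mode("overwrite").parquet("hdfs:///data/curated/event_counts/")
```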
Environment: Hadoop, MapReduce, HDFS, Hive, HBase, Java, DB2, MS Office