
Machine Learning Data Science

Location:
Minneapolis, MN
Posted:
February 25, 2025


Sr Data Engineer

Sai Sanjay G

Email: *********@*****.***

Phone: 763-***-****

Professional Summary:

• Around 10 years of technical IT experience in designing, developing, and maintaining enterprise analytical solutions using big data technologies.

• Proven expertise in Retail, Finance, and Healthcare domains.

• Specialized in the Big Data ecosystem, including data acquisition, ingestion, modeling, storage, analysis, integration, processing, and database management.

• Data Science enthusiast with strong problem-solving, debugging, and analytical capabilities.

• Proficient in deploying machine learning models and supporting data science initiatives using Databricks.

• Expert in Oracle database design and administration.

• Extensive experience with Databricks for scalable data pipelines and optimizing Apache Spark applications.

• Proficient in using database tools like MongoVUE, MongoDB Compass, SAP HANA, and Toad.

• Strong working experience with NoSQL databases (HBase, Cassandra, MongoDB, DynamoDB, and Cosmos DB) and their integration with Hadoop clusters.

• Proficient in developing end-to-end ETL pipelines using Snowflake, Alteryx, and Apache NiFi for both relational and non-relational databases.

• Developed custom-built ETL solutions, batch processing, and real-time data ingestion pipelines using PySpark and Shell Scripting.

• Hands-on experience with AWS (EC2, Redshift, EMR, S3, CloudWatch) and Azure (SQL Data Warehouse, Machine Learning Studio).

• Developed and managed data ingestion, transformation, and processing workflows in cloud environments using Databricks, Airflow, and Oozie.

• Utilized Kubernetes and Docker for CI/CD, with experience building and running Docker images containing multiple microservices.

• Proficient in data modeling (Dimensional and Relational), including Star-Schema and Snowflake Schema Modeling.

• Designed and implemented scalable data architectures and data staging layers for BI applications and machine learning algorithms.

• Expertise in Java programming and understanding of OOP concepts, I/O, Collections, Exception Handling, Lambda Expressions, and Annotations.

• Experience in writing code in R and Python for data manipulation, statistical analysis, and data munging.

• Improved query performance by transitioning log storage from Cassandra to Azure SQL Data Warehouse.

• Applied data governance and security practices within Databricks to ensure compliance with industry standards and regulations.

• Collaborated with data scientists and architects to create data marts and conducted complex data analysis.

• Developed and maintained documentation for data extraction, integration procedures, and functional specifications.

Education:

Bachelor’s degree - Vellore Institute of Technology, Vellore, TN (2009-2013)

Computer Science and Engineering

Certifications:

AWS Certified Solutions Architect

https://www.credly.com/badges/e4ca7e60-9920-48dd-9821-07f270fd7971/public_url

Technical Skills:

Hadoop Components / Big Data

HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, Zookeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Airflow, Snowflake

Languages

Scala, Python, SQL, HiveQL

IDE Tools

Eclipse, IntelliJ, PyCharm

Cloud platform

AWS (Lambda, DynamoDB, S3, EC2, EMR, RDS), MS Azure (Azure Databricks, ADF, Azure Data Explorer, Azure HDInsight, ADLS), GCP

Databases

Oracle, SQL Server, MySQL, MS Access, NoSQL databases (HBase, Cassandra, MongoDB)

Big Data Technologies

Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Machine Learning, Pandas, NumPy, Seaborn, Impala, Zookeeper, Flume, Airflow, Informatica, Snowflake, DataBricks, Kafka, Cloudera

Data Analysis Libraries:

Pandas, NumPy, SciPy, Scikit-learn, NLTK, Plotly, Matplotlib

Containerization

Docker, Kubernetes

CI/CD Tools

Jenkins, Bamboo, GitLab

Software Methodologies

Agile, Scrum, Waterfall

Development Tools

Eclipse, PyCharm, IntelliJ, SSMS, Microsoft Office Suite

Programming Languages

Python (Pandas, Scipy, NumPy, Scikit-Learn, Stats Models, Matplotlib, Plotly, Seaborn, Keras, TensorFlow, PyTorch), PySpark, T-SQL/SQL, PL/SQL, HiveQL, Scala, UNIX Shell Scripting

Databases

MS-SQL, Oracle and DB2

NoSQL Databases

Cassandra, PostgreSQL, MongoDB, and Azure Cosmos DB

Reporting Tools / ETL Tools

Power BI, Tableau, DataStage, Pentaho, Informatica, Cognos, Talend, Azure Data Factory, Azure Databricks, Arcadia, SSIS, SSRS, SSAS, ER Studio

Version Control Tools

GitHub, Azure DevOps, SVN, Bitbucket

Professional Experience

Verizon - Irving, Texas. Jan 2023 – Present

Sr Data Engineer

Developed a real-time data ingestion and transformation pipeline for telecom log files and customer metrics using Kafka and Spark in a multi-cloud environment. I worked on streaming data into Snowflake using AWS Glue, resolving scaling challenges by optimizing Snowflake’s architecture to handle high-volume, real-time data processing.
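
A minimal sketch of the Kafka-to-Spark streaming ingestion described above, assuming PySpark Structured Streaming; the broker address, topic name, schema, and S3 paths are illustrative placeholders rather than details from the actual project:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("telecom-log-ingest").getOrCreate()

# Assumed shape of a telecom log event (placeholder fields).
log_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
       .option("subscribe", "telecom_logs")                  # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Kafka values arrive as bytes; parse the JSON payload into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), log_schema).alias("e"))
          .select("e.*"))

# Land micro-batches as Parquet in S3 so a downstream loader (e.g. Glue or Snowpipe)
# can pick them up for Snowflake.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/telecom/logs/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/telecom-logs/")
         .outputMode("append")
         .start())
query.awaitTermination()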

Responsibilities:

• Installed, configured, and maintained data pipelines using tools like Apache Kafka and Confluent Kafka.

• Developed and managed data pipelines to load, transform, and model data in Snowflake and AWS environments.

• Automated data loading workflows in Snowflake using Snowpipe and AWS for near real-time data ingestion (see the Snowpipe sketch after this list).

• Migrated an on-premises application to AWS, utilizing services like EC2, S3, and EMR for data processing and storage.

• Designed and implemented data architectures leveraging Snowflake's multi-cluster shared data architecture, ensuring high performance and scalability for complex data workloads.

• Integrated Apache Kafka for real-time data ingestion and used Spark Streaming to move data continuously from Kafka to HDFS.

• Transformed business problems into Big Data solutions and defined strategic roadmaps.

• Collaborated with stakeholders to align data strategies with business objectives, leading to a 30% increase in project delivery speed.

• Designed and executed Master Data Management (MDM) integration with treasury data warehouses.

• Developed real-time log analytics pipelines using Confluent Kafka, Storm, Elasticsearch, Logstash, Kibana, and Greenplum.

• Automated near real-time data ingestion workflows using Snowpipe, significantly reducing data latency and increasing the availability of actionable insights for business users.

• Implemented comprehensive monitoring solutions that reduced operational risks and improved system reliability by 25%.

• Developed complex transformation logic using Snowflake’s SQL capabilities and stored procedures, optimizing data pipelines for performance and cost-effectiveness.

• Designed and developed ETL processes in AWS Glue for migrating campaign data into AWS Redshift.

• Developed solutions leveraging ETL tools like Informatica and Python for process improvements.

• Automated data transfer processes from mainframe file systems to cloud-based or on-premises data storage.

• Developed and optimized Apache Spark applications within Databricks for high-performance data processing and analytics.

• Collaborated with ETL tools like Talend and Informatica to ensure smooth data flows into Snowflake, leveraging their integration capabilities to handle diverse data sources effectively.

• Collaborated with data scientists to deploy machine learning models into production environments, utilizing Spark MLlib for scalable model training and evaluation. Developed pipelines to automate data preparation and feature engineering for AI-driven analytics.

• Achieved a 40% performance improvement in data processing tasks by optimizing Spark applications and leveraging Databricks features.

• Utilized Spark Streaming to collect data from AWS S3 buckets and perform transformations and aggregations on the fly.

• Developed custom PySpark scripts for data manipulation and transformation tasks.

• Managed Databricks clusters for efficient resource utilization and high availability.

• Optimized stored procedures in Snowflake for automated data transformations and processing.

• Monitored and optimized database performance using AWR, ADDM, and ASH reports.

• Enhanced the release process with IBM Udeploy to expedite project transitions to production.

• Streamlined deployment cycles, reducing time to market for new features and enhancements by 20%.

• Created and maintained documentation for data extraction and integration procedures.

• Involved in data mapping specifications and system test plan execution for data warehouses.

• Developed automated regression scripts for ETL process validation across multiple databases.
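
The Snowpipe automation mentioned above can be sketched as follows; this is an illustrative setup issued through the Snowflake Python connector, with hypothetical account, stage, table, and integration names:

import snowflake.connector

# Placeholder connection details; in practice these would come from a secrets manager.
conn = snowflake.connector.connect(
    account="example_account",
    user="etl_user",
    password="***",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

# External stage over the S3 landing bucket (hypothetical URL and integration).
cur.execute("""
    CREATE STAGE IF NOT EXISTS raw_logs_stage
    URL = 's3://example-bucket/telecom/logs/'
    STORAGE_INTEGRATION = s3_int
    FILE_FORMAT = (TYPE = PARQUET)
""")

# Snowpipe with auto-ingest: S3 event notifications trigger near real-time loads.
cur.execute("""
    CREATE PIPE IF NOT EXISTS raw_logs_pipe AUTO_INGEST = TRUE AS
    COPY INTO raw_telecom_logs
    FROM @raw_logs_stage
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")

cur.close()
conn.close()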

Environment: AWS Management Console, EMR, PySpark, S3, Kinesis, AWS Lambda, Scrum, Git, Glue, Step Functions, Spark, Informatica, Talend, ELK Stack, MDM, Snowflake, QuickSight, Redshift, OLTP, OLAP, RDS, DynamoDB, Apache Airflow, SQL Server, Python, SageMaker, Shell Scripting, XML, Unix.

Deutsche Bank, NY. May 2021 – Dec 2022

Sr Data Engineer/Python developer

Migrated financial transaction data and customer profiles from on-premises to Azure using Snowflake for improved analytics and reporting. I optimized ETL processes with Talend and Informatica, overcoming real-time data replication issues by configuring Snowflake’s multi-cluster architecture, ensuring consistent performance under heavy workloads.
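
As an illustration of the multi-cluster configuration mentioned above, here is a minimal sketch using the Snowflake Python connector; the warehouse name, size, and cluster bounds are assumptions, not the actual settings used:

import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account", user="etl_user", password="***", role="SYSADMIN"
)

# Multi-cluster warehouse: extra clusters spin up under concurrent ETL / BI load
# and suspend when idle, keeping query performance consistent and costs bounded.
conn.cursor().execute("""
    CREATE WAREHOUSE IF NOT EXISTS REPORTING_WH
    WAREHOUSE_SIZE = 'LARGE'
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4
    SCALING_POLICY = 'STANDARD'
    AUTO_SUSPEND = 300
    AUTO_RESUME = TRUE
""")
conn.close()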

Responsibilities:

• Designed and built statistical models and feature extraction systems to solve business problems related to the company’s data pipeline, communicating solutions to executive stakeholders.

• Researched and implemented various machine learning algorithms using R, Python, and Scala, including decision trees, ALS, logistic and linear regressions, and facial recognition algorithms.

• Developed machine learning models for classification, regression, and deep learning use cases using Python.

• Implemented version control and collaborative practices for ETL jobs developed in Talend and Informatica, ensuring smooth updates and maintenance of data pipelines feeding into Snowflake.

• Optimized resource usage and cost management strategies by leveraging the cloud-native capabilities of Snowflake and the ETL efficiencies provided by Talend and Informatica on Azure.

• Conducted statistical analysis on healthcare data and performed data cleaning, feature scaling, and feature engineering using pandas and NumPy (a short sketch of this workflow follows this list).

• Leveraged Talend and Informatica for ETL processes to facilitate seamless data integration into Snowflake hosted on Azure, allowing for efficient handling of various data sources including on-premises databases and cloud storage.

• Created data visualizations using Python and Tableau, including scatter plots, bar charts, and histograms.

• Integrated Databricks with various data sources like Azure Data Lake and relational databases for comprehensive data analysis.

• Developed automated workflows using Talend and Informatica to load data into Snowflake on Azure, ensuring near real-time data availability for analytics and reporting purposes.

• Developed and executed data migration projects using Qlik Replicate, generating reports on data replication performance and success rates.

• Configured Snowflake’s multi-cluster architecture to handle variable data workloads, ensuring consistent query performance.

• Implemented data pipelines and real-time monitoring dashboards using Azure services including Data Lake, Databricks, and SQL Data Warehouse.

• Built a robust data warehousing solution using Snowflake on Azure, integrating it with Talend and Informatica to enable comprehensive analytics capabilities and support business intelligence initiatives.

• Migrated on-premises virtual machines and data to Azure cloud services using Azure Migrate for assessment and planning, Azure Site Recovery for disaster recovery, and Azure Blob Storage for scalable data storage during the transition.

• Configured Azure Virtual Machines, managed Azure Active Directory (AD) users and roles, and created Azure Data Factory pipelines for data orchestration using Azure SDK for Python.

• Optimized Spark SQL queries and data processing tasks in Databricks environments to improve performance and reduce execution time.

• Developed REST APIs and packages for feature extraction and complex prediction/forecasting algorithms on time series data.

• Developed Python applications for Google Analytics aggregation and reporting, utilizing Django for configuration management.

• Developed preprocessing pipelines for DICOM and non-DICOM images using Azure Data Factory and Azure Machine Learning, and presented analytical insights on medical data through Azure Synapse Analytics.

• Configured and managed REST APIs using Flask and integrated various data sources, including Java, JDBC, RDBMS, Shell Scripting, Spreadsheets, and Text files.

• Utilized DevOps technologies such as Jenkins, Docker, and Kubernetes for continuous integration and deployment.

• Implemented Agile Methodology for building internal applications and wrote unit test cases in Python and Objective-C.

• Applied data governance and security practices within Databricks to ensure compliance with organizational and regulatory standards.

• Collected data needs and requirements by interacting with other departments and performed preliminary data analysis using descriptive statistics.

• Worked closely with Data Scientists to understand data requirements for experiments and developed solutions to meet those needs.

• Conducted data analysis and reporting, handling data formats such as JSON and XML, and generating reports using various graph methods.
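
A short, hypothetical sketch of the cleaning, feature-scaling, and classification workflow referenced above, using pandas and scikit-learn; the dataset, column names, and model choice are illustrative only:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("transactions_sample.csv")              # placeholder extract
df = df.dropna(subset=["amount", "tenure_months", "is_fraud"])
df["log_amount"] = np.log1p(df["amount"])                 # simple engineered feature

X = df[["log_amount", "tenure_months"]]
y = df["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features on the training split only, then apply the same scaling to the test split.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
print(classification_report(y_test, model.predict(X_test_s)))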

Environment: Azure Portal, Azure HDInsight, Talend, Informatica, PySpark, Azure Data Lake Storage, Azure Data Factory, Azure Stream Analytics, Azure DevOps, Git, Azure Logic Apps, Azure Synapse Analytics, Spark, Azure Databricks, Azure Monitor, Azure MDM, Power BI, Azure Data Lake Analytics, OLTP, OLAP, Azure Cosmos DB, Azure SQL Database, Snowflake, Azure Functions, Apache Airflow, SQL Server, Python, Shell Scripting, XML, Unix.

Girl Scouts – NY. Jun 2019 – Apr 2021

Sr Data Engineer

Built a scalable data platform for retail data, integrating POS transactions, customer surveys, and store traffic across geographies. I used Snowflake and GCP to implement ETL pipelines and tackled the challenge of real-time data ingestion by utilizing Google Pub/Sub and Dataflow, enabling seamless streaming analytics and timely insights.
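
A minimal sketch of a streaming pipeline of the shape described above, assuming the Apache Beam Python SDK running on Dataflow; the Pub/Sub topic, BigQuery table, and event fields are placeholders:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "ReadPOS" >> beam.io.ReadFromPubSub(topic="projects/example-proj/topics/pos-events")
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "KeepFields" >> beam.Map(lambda e: {"store_id": e["store_id"],
                                           "sku": e["sku"],
                                           "amount": float(e["amount"]),
                                           "event_ts": e["event_ts"]})
     | "WriteBQ" >> beam.io.WriteToBigQuery(
           "example-proj:retail.pos_events",
           schema="store_id:STRING,sku:STRING,amount:FLOAT,event_ts:TIMESTAMP",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))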

Responsibilities:

• Collaborated with the business team for requirement gathering and test case development.

• Played a crucial role in designing and developing a common architecture for retail data across geographies, including point of sales, store traffic, labor, customer surveys, and audit data.

• Developed a data platform from scratch, involved in requirement gathering and analysis, and documented business requirements.

• Leveraged Snowflake as a cloud data warehouse solution on GCP, integrating it with various GCP services such as BigQuery, Cloud Storage, and Dataflow for seamless data processing and analytics.

• Developed a common framework using Apache Beam to ingest data from various sources (e.g., Teradata to Google Cloud Storage, Google Cloud Storage to BigQuery).

• Utilized Talend and Informatica to design and implement ETL pipelines that efficiently ingest data into Snowflake from diverse sources, including on-premises databases, GCP services, and third-party APIs.

• Ingested real-time data using Pub/Sub and Dataflow, and processed data from JSON and CSV files using Dataflow, storing output in Parquet format on Google Cloud Storage.

• Utilized Google Cloud Dataflow for real-time analytics on streamed data.

• Designed and implemented multiple ETL solutions with various data sources using SQL scripting, GCP Data Fusion, Python, Shell scripting, and scheduling tools.

• Implemented collaborative practices using Talend and Informatica, enabling cross-functional teams to contribute to data integration processes, ensuring alignment with business objectives.

• Built ETL pipelines on BigQuery for business intelligence, developed Spark applications using Dataproc and Spark-SQL for data extraction, transformation, and aggregation, and performed data wrangling of XML and web feeds using Python, Unix, and SQL (see the BigQuery sketch after this list).

• Used Cloud Composer for scheduling and orchestration of data pipelines, and Talend Integration Suite for ETL mapping.

• Implemented version control and collaboration practices within Dataproc notebooks, provided technical support and troubleshooting for Dataproc environments, and optimized Spark SQL queries and data processing tasks.

• Created functions and assigned roles in Cloud Functions for event-driven processing, and used GCP services (Google Cloud Storage, Dataproc, BigQuery, Compute Engine) for various data tasks.

• Configured BigQuery's multi-cluster architecture for consistent query performance, developed custom-built input adapters using Apache Beam, Hive, and Dataflow, and used Data Studio for automated multi-dimensional cubes.

• Developed reusable Apache Beam scripts and functions for data processing and leveraged them in different data pipelines.

• Developed machine learning models leveraging data stored in Snowflake, utilizing Talend for data preparation and transformation, and deploying the models using GCP services for actionable insights.

• Conducted performance analysis and optimization of data processes, made recommendations for continuous improvement, and collaborated with data scientists to understand data requirements for experiments.

• Implemented machine learning algorithms for business problems, conducted data cleaning, feature scaling, and feature engineering using pandas and numpy, and created data visualizations using Python and Data Studio.

• Monitored and configured orchestration tools like Kubernetes and GKE, set up the Cloud Composer service in GCP, and used the ELK Stack (Elasticsearch, Kibana) for monitoring and logging.

• Provided knowledge transition to support teams and ensured data governance and security practices within Dataproc.
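
The BigQuery ETL step referenced above might look like the following sketch using the google-cloud-bigquery client; the project, dataset, and column names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="example-proj")          # placeholder project

# Write the aggregated result into a reporting table, replacing the previous load.
job_config = bigquery.QueryJobConfig(
    destination="example-proj.retail_mart.daily_store_sales",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = """
    SELECT store_id,
           DATE(event_ts) AS sales_date,
           SUM(amount)    AS total_sales,
           COUNT(*)       AS txn_count
    FROM `example-proj.retail.pos_events`
    GROUP BY store_id, sales_date
"""
client.query(sql, job_config=job_config).result()          # block until the job finishes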

Environment: BigQuery, Dataproc, Dataflow, Apache Beam, Spark SQL, Google Cloud Storage, Looker, Google Cloud Composer, Talend, Informatica, Cloud Pub/Sub, Cloud Functions, GKE, Python, Google Cloud AI & ML, Apache Kafka.

Altruista Health, Hyderabad, India. Jan 2017 – Mar 2019

Data Engineer

Automated healthcare data ingestion and real-time streaming pipelines to process patient records and monitoring data for operational insights. I developed end-to-end ETL pipelines with PySpark and optimized DAGs in Apache Airflow to improve processing speed, overcoming unstructured data handling challenges through Spark optimizations.
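
A minimal sketch of the Airflow orchestration described above, assuming Airflow 2.x with the Apache Spark provider installed; the DAG id, schedule, pool, and script paths are placeholders rather than the actual pipeline definitions:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="patient_records_etl",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args=default_args,
) as dag:

    # Both tasks share an Airflow pool so concurrent DAG runs cannot
    # oversubscribe the Spark cluster.
    ingest = SparkSubmitOperator(
        task_id="ingest_raw_records",
        application="/opt/etl/ingest_records.py",     # placeholder PySpark script
        pool="spark_pool",
    )
    transform = SparkSubmitOperator(
        task_id="transform_records",
        application="/opt/etl/transform_records.py",
        pool="spark_pool",
    )

    ingest >> transform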

Responsibilities:

• Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks to automate the ingestion, processing, and delivery of structured and unstructured data in both batch and real-time streaming modes using Python.

• Built data warehouse structures, creating fact, dimension, and aggregate tables through dimensional modeling with Star and Snowflake schemas.

• Applied transformations to data loaded into Spark DataFrames and performed in-memory computation to generate the output response.

• Troubleshot and tuned Spark applications and Hive scripts to achieve optimal performance.

• Used the Spark DataFrame API to perform analytics on Hive data and DataFrame operations to run required data validations.

• Built end-to-end ETL models to sort large volumes of customer feedback and derive actionable insights and tangible business solutions.

• Optimized workflows by building DAGs in Apache Airflow to schedule the ETL jobs and implemented additional components in Apache Airflow like Pool, Executors, and multi-node functionality.

• Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.

• Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, used the Spark engine and Spark SQL for data analysis, and provided the results to data scientists for further analysis.

• Prepared scripts to automate the ingestion process using PySpark from various sources such as APIs, AWS S3, Teradata, and Snowflake.

• Created a business category mapping system that automatically maps customers' business category information to any source website's category system. Category platforms include Google, Facebook, Yelp, Bing etc.

• Developed a data quality control model to monitor business information change over time. The model flags outdated customer information using different APIs for validation and updates it with correct data.

• Monitored the sentiment prediction model for customer reviews and ensured a high-performance ETL process.

• Performed data cleaning, pre-processing, and modeling using Spark and Python.

• Implemented secure, real-time, data-driven REST APIs for data consumption using AWS (Lambda, API Gateway, Route 53, Certificate Manager, CloudWatch, Kinesis) and Snowflake.

• Developed automation scripts to transfer data from on-premises clusters to Google Cloud Platform (GCP).

• Loaded file data from the ADLS server to Google Cloud Platform buckets and created Hive tables for end users.

• Performed tuning and optimization of long-running Spark jobs and queries (Hive/SQL).

• Implemented Real-time streaming of AWS CloudWatch Logs to Splunk using Kinesis Firehose.

• Developed, using object-oriented methodology, a dashboard to monitor network access points and network performance metrics with Django, Python, MongoDB, and JSON.

Environment: Hive, Spark SQL, Talend, Informatica, PySpark, EMR, Tableau, Pentaho, Sqoop, AWS, Presto, Python, Snowflake, Teradata, Azure AAS & SSAS, Kafka.

Phenom People, India. May 2014 – Dec 2016

Data Engineer

Architected large-scale data processing pipelines to analyze customer behavior data, such as clickstreams and logs, in a Hadoop ecosystem. I developed MapReduce and Hive applications, and automated workflows using Oozie, addressing scalability and performance challenges by optimizing Hive queries and Spark jobs for better efficiency.
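
A minimal sketch of the Hive dynamic-partitioning pattern mentioned above, issued through Spark SQL with Hive support; the database, table, and column names are illustrative, and bucketing is omitted for brevity:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("clickstream-partitioning")
         .enableHiveSupport()
         .getOrCreate())

# Allow Hive-style dynamic partitioning for the insert below.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.clickstream_daily (
        user_id STRING,
        page    STRING,
        ts      TIMESTAMP
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
""")

# Dynamic partition insert: each row lands in its event_date partition, so
# date-filtered queries scan only the relevant partitions.
spark.sql("""
    INSERT OVERWRITE TABLE analytics.clickstream_daily PARTITION (event_date)
    SELECT user_id, page, ts, date_format(ts, 'yyyy-MM-dd') AS event_date
    FROM analytics.clickstream_raw
""")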

Responsibilities:

• Extensive hands-on experience architecting and designing data warehouses/databases, data modeling, and building SQL objects such as tables, views, user-defined/table-valued functions, stored procedures, triggers, and indexes.

• Created HBase tables from Hive and wrote HiveQL statements to access HBase table data.

• Developed complex Hive scripts for processing data and created dynamic partitions and bucketing in Hive to improve query performance.

• Developed MapReduce applications using the Hadoop MapReduce programming framework and used compression techniques to optimize MapReduce jobs.

• Developed Pig UDFs to analyze customer behavior and Pig Latin scripts for processing data in Hadoop.

• Scheduled automated tasks with Oozie for loading data into HDFS through Sqoop and pre-processing the data with Pig and Hive.

• Experienced in building data warehouses on the Azure platform using Azure Databricks and Data Factory.

• Worked with the production support team to resolve issues with the CDH cluster and data ingestion.

• Worked in Azure environment for development and deployment of Custom Hadoop Applications.

• Designed and implemented scalable cloud data and analytics solutions for various public and private cloud platforms using Azure.

• Designed and implemented distributed data processing pipelines using Apache Spark, Hive, Python, Airflow DAGs and other tools and languages in Hadoop Ecosystem.

Environment: Python, MySQL, PostgreSQL, Hadoop (Hive), AWS (S3, EMR), Tableau, Pentaho, Docker, Kafka.


