Manaswini Gande
+1-928-***-**** *********************@*****.*** www.linkedin.com/in/gandemanaswini/
Professional Experience
BMO Bank (Coolsnail Technologies) Azure Data Engineer Chicago, Illinois, USA Sep 2024 – Present
• Developed and optimized scalable ADF pipelines and ETL workflows to ingest, transform, and load financial data into ADLS and Azure Synapse Analytics for real-time banking operations.
• Implemented performance tuning using ADF, Azure Databricks, and PySpark to ensure low-latency, high-throughput data pipelines supporting mission-critical workloads.
• Designed and managed data models and data warehouses using Azure Synapse, SQL Server, and Azure SQL Database to support key business intelligence and analytics requirements for financial reporting and regulatory compliance.
• Established and managed Azure Storage Accounts, Blob Storage, and Data Lake Gen2, ensuring secure and scalable data storage solutions for sensitive financial transactions and customer data.
• Worked closely with data science teams to operationalize machine learning models using Azure Machine Learning, Databricks MLflow, and Azure Synapse Analytics, embedding predictive models into banking data pipelines for fraud detection and risk assessment.
• Streamlined CI/CD pipelines for data engineering workflows using Azure DevOps and Git, enabling continuous integration and seamless deployment of data solutions in a secure, version-controlled environment.
• Automated data extraction, cleaning, and orchestration in banking pipelines using Python, SQL, Scala, Pandas, and NumPy, delivering high-quality datasets for analytics and reporting.
• Leveraged Power BI to build interactive dashboards and financial reports using data from Azure Synapse.
• Collaborated with engineering teams to implement Azure Kubernetes Service (AKS) and Docker for containerized solutions, enabling scalable deployments.
Regeneron Pharmaceuticals AWS Data Engineer Tarrytown, New York, USA Jan 2024 – Aug 2024
• Utilized Sqoop and Impala on Amazon EMR to migrate data between relational databases and the Hadoop ecosystem.
• Developed and managed ETL pipelines using Informatica and Python-based frameworks to extract, transform, and load data from multiple pharmaceutical systems into Snowflake and Amazon S3, ensuring data integrity.
• Wrote optimized SQL queries and used Python for data extraction, transformation, and processing from relational databases like MySQL, PostgreSQL, and NoSQL databases like MongoDB to support advanced pharmaceutical analytics.
• Utilized PySpark in AWS EMR to perform large-scale data transformations, enabling analytics and insights from pharmaceutical datasets for clinical trials, drug research, and regulatory reporting.
• Designed and implemented real-time data streaming solutions using Kafka and Flume to integrate structured and unstructured pharmaceutical data into Snowflake and Amazon RDS.
• Ensured data security, integrity, and compliance with regulatory standards like HIPAA by implementing encryption and access control mechanisms in AWS Glue and Amazon S3. Maintained audit logs and secured sensitive healthcare data.
• Automated and orchestrated ETL workflows using Apache Airflow and AWS Glue, ensuring seamless scheduling and execution of data processing pipelines in cloud environments.
• Managed distributed data storage using HBase and Cassandra on AWS EMR, coordinated using Zookeeper to ensure high availability and fault tolerance.
• Collaborated with data scientists and analysts to deliver predictive insights using AWS EMR and PySpark. Built and deployed models for drug efficacy studies, patient risk analysis, and healthcare outcome predictions.
• Monitored, troubleshot, and optimized ETL performance using Amazon Redshift, SQL tuning, and EMR profiling to ensure cost-efficient data processing and high-performance analytics.
Infosys GCP Data Engineer Hyderabad, India Feb 2022 – Aug 2023
• Used Sqoop on Dataproc to migrate legacy on-prem relational data into Google Cloud Storage and processed it further using Hive for integration with downstream analytics pipelines.
• Developed and optimized ETL and ELT pipelines using Cloud Dataflow (Apache Beam) to process and validate structured and semi-structured data from GCS into BigQuery.
• Designed Hive and Spark-based workflows using Cloud Dataproc to transform large-scale datasets and support advanced analytics use cases across marketing and operations.
• Wrote Python scripts for preprocessing, validation, and transformation of data files ingested into BigQuery using Dataflow and scheduled via Composer.
• Deployed and managed batch and streaming workflows using Apache Airflow in Cloud Composer, leveraging custom Python operators and BigQuery/Dataproc hooks.
• Used Cloud Shell for provisioning services, running deployments, and automating operational tasks.
• Implemented row-level security policies in Tableau connected to BigQuery to ensure secure data access and compliance.
• Built dashboards and reports in Looker Studio (Data Studio) for monitoring project billing, service utilization, and cost optimization strategies.
• Collaborated with data scientists and analysts to provision datasets and build feature engineering pipelines for ML models using BigQuery and Dataproc.
Thales Group Data Engineer Intern Bangalore, India Mar 2020 – Jan 2022
• Developed batch data transformation jobs using PySpark on AWS EMR to clean, enrich, and join large pharmaceutical datasets from multiple systems.
• Utilized Spark SQL and DataFrames to implement business rules, perform aggregations, and calculate metrics required for reporting and downstream analytics.
• Processed semi-structured and unstructured data stored in HDFS and Amazon S3, transforming them into structured formats using custom Python scripts.
• Executed table-to-table transformation workflows within Spark to support CDC (Change Data Capture) pipelines and incremental data loads.
• Integrated Kafka with Spark jobs to process real-time and near-real-time message streams for ingesting updates from operational systems.
• Leveraged MapReduce on Hadoop to process image and binary files for classification, metadata tagging, and storage in optimized partitions.
• Designed data parsing logic to normalize raw logs, nested JSON, and CSV inputs into consistent, query-ready tables stored in S3 and Hive.
• Performed format conversion tasks including Parquet, ORC, and Avro for compression optimization and query efficiency in downstream tools.
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, YARN, MapReduce, Spark, Kafka, Kafka Connect, Hive, Airflow, StreamSets, Sqoop, HBase, Flume, Pig, Ambari, Oozie, ZooKeeper, NiFi, Sentry
Hadoop Distributions: Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP
Cloud Environments: Amazon Web Services (AWS), Microsoft Azure, GCP
Databases: MySQL, Oracle, Teradata, MS SQL Server, DB2, MongoDB
AWS: EC2, EMR, S3, Redshift, Lambda, Kinesis, Glue, Data Pipeline
Microsoft Azure: Databricks, Data Lake, Blob Storage, SQL Database, SQL Data Warehouse, Cosmos DB
Operating Systems: Linux, Unix, Windows 10/8/7, Windows Server 2008/2003, macOS
Reporting Tools/ETL Tools: Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI, Arcadia, DataStage
Programming Languages: Python (Pandas, SciPy, NumPy, Scikit-Learn, Statsmodels, Matplotlib, Plotly, TensorFlow, PyTorch), PySpark, T-SQL/SQL, PL/SQL, HiveQL, Scala, UNIX Shell Scripting, C#, Java, Angular
Version Control: Git, SVN, Bitbucket
Development Tools: Eclipse, NetBeans, IntelliJ, Hue, Microsoft Office
EDUCATION
• Master's in Computer Science, Northern Arizona University – Arizona, USA (Aug 2023 – May 2025)
• Bachelor's in Computer Science and Engineering – Telangana, India (Aug 2017 – Sep 2021)
CERTIFICATIONS
• Microsoft Certified: Azure Data Fundamentals
• Google Cloud Certified: Associate Cloud Engineer