MANOJ SAI
Sr. Data Engineer
Phone: +1-561-***-**** Email: ************@*****.*** LinkedIn: www.linkedin.com/in/manoj-sai-a0b36b134
CAREER HIGHLIGHTS:
• 9+ years in data engineering and ETL, specializing in building and optimizing pipelines with Python and SQL.
• Experienced with Azure and AWS cloud platforms for scalable data processing and storage.
• Led the migration of data to Azure Synapse Analytics using Python, ensuring better scalability and faster reporting.
• Created real-time data ingestion pipelines using Azure Event Hubs, processing transactional and market data for business insights.
• Automated schema evolution using Python and SQL, simplifying the adaptation of data structures in Azure Synapse Analytics.
• Developed ETL workflows in Azure Databricks and Azure Data Factory using Scala, enabling efficient large-scale data processing.
• Hands-on with AWS services, including EMR, EC2, S3, Lambda, Glue, and Redshift for cloud operations.
• Used AWS Glue for transforming and migrating data from on-prem systems to the cloud.
• Built CI/CD pipelines with Azure Functions, automating deployment of data pipelines and improving efficiency.
• Optimized Tableau dashboards to provide actionable insights on network performance and system health.
• Migrated large datasets from on-prem SQL systems to Azure Synapse Analytics using Azure Data Factory, ensuring easy access and analysis.
• Validated data accuracy after migration with SQL scripts, ensuring consistency between legacy SQL databases and Azure Synapse Analytics.
• Optimized AWS Redshift queries to improve data retrieval speed and reporting performance for large datasets.
• Built Python scripts to detect anomalies in the incoming data, improving data quality and reliability.
• Experience in working with Azure Blob Storage, Azure Data Lake Storage and Azure SQL Database.
• Automated data pipelines with Bash, PySpark, and Scala, managing large datasets across cloud environments.
• Developed Power BI dashboards for real-time financial, healthcare, and operational insights.
• Knowledge of PL/SQL, T-SQL, MongoDB & Cassandra for data transformation and integrity.
• Experienced in HIPAA-compliant data management and real-time data processing.
• Used AWS S3 for storing processed data, improving data access and integration across systems.
• Migrated data to Snowflake using Hadoop on AWS EMR, making data processing faster and more efficient.
• Experienced in using DynamoDB for fast storage and real-time access to critical data.
• Stored real-time log data in HBase which helped in performance monitoring.
• Deployed machine learning models with Azure Machine Learning and Azure Synapse Analytics, using predictive insights for fraud detection and analysis.
• Optimized SQL queries by refining joins, partitioning strategies, and indexing to boost performance.
• Automated ETL workflows using Azure Data Factory, improving the timeliness of data processing and reporting.
• Built fault-tolerant data pipelines using Azure Event Hubs, enabling reliable data flows with error-handling mechanisms.
• Used Azure Stream Analytics for real-time data processing to detect and respond to streaming data instantly.
• Created Python scripts to process and clean complex datasets, integrating real-time feeds into cloud platforms.
SKILLS SUMMARY:
Languages & Scripting: Python, SQL, T-SQL, PL/SQL, Bash, Scala
Cloud Platforms: Azure, AWS
Cloud Services: Azure Synapse Analytics, Azure Event Hubs, Azure Databricks, Azure Data Factory, Azure Functions, Azure Kubernetes Service, Azure Machine Learning, Azure Blob Storage, Azure Key Vault, Azure Stream Analytics, AWS Lambda, AWS S3, AWS EMR, AWS Glue, AWS IAM, AWS Redshift Spectrum, AWS EC2
Big Data & Streaming: Apache Spark, Apache Flink, Apache Kafka, Apache Beam, Apache NiFi, Hadoop, YARN
Databases & Warehousing: Azure Synapse Analytics, Azure SQL Database, Snowflake, Redshift, PostgreSQL, MySQL, Oracle, SQL Server, MongoDB, Cassandra, HBase, Delta Lake, Data Warehousing, DynamoDB
ETL: Azure Data Factory, Apache Airflow, Control-M
CI/CD Automation: Docker, Jenkins, Git, Unix Shell Scripts
AI & ML: Azure Machine Learning, PyTorch
Monitoring & Reporting: Grafana, Power BI, Tableau, Crystal Reports, Cognos
Security & Compliance: HIPAA, FHIR, Data Security
EMPLOYMENT HISTORY:
Client: Ameriprise Financial Jan 2024 – Present
Title: Senior Data Engineer
Key Responsibilities:
• Led the migration of legacy financial data to Azure Synapse Analytics using Python scripts, improving scalability and reducing query times.
• Created real-time data ingestion pipelines using Azure Event Hubs, processing large volumes of transactional and market data from external feeds and financial sources.
• Automated schema evolution using Python and SQL for handling financial data changes.
• Implemented ETL workflows in Azure Databricks and Azure Data Factory using Scala for efficient data processing, ensuring high-performance data transformations.
• Developed CI/CD pipelines with Azure DevOps and Azure Functions for automated deployment of data pipelines, reducing deployment time.
• Used Azure Data Factory to migrate and transform large datasets from on-prem SQL systems to cloud-based data platforms, improving accessibility and reporting efficiency.
• Validated data accuracy post-migration using SQL scripts, ensuring consistency between on-prem SQL and Azure Synapse Analytics.
• Built Python scripts for anomaly detection in financial datasets (illustrative sketch below).
• Implemented batch data processing using Azure Synapse Analytics and Azure Data Factory, optimizing unstructured data processing for improved analytics and downstream application use.
• Deployed machine learning models in Azure Machine Learning (Azure ML) using SynapseML, providing predictive insights for fraud detection.
• Optimized complex SQL joins in Azure Synapse Analytics to enhance query performance for financial transaction analysis.
• Used Azure Data Factory and DBT to automate ETL workflows, ensuring timely financial data processing and reporting.
• Built fault-tolerant pipelines in Azure Event Hubs, integrating automated retries and error-handling mechanisms.
• Converted legacy SQL-based financial reports to Azure Synapse Analytics by redesigning queries for better performance.
• Used Azure Machine Learning to automate model deployment and monitoring, enabling continuous integration of new data from Azure Synapse Analytics.
• Integrated Azure Machine Learning with Azure Event Hubs to trigger model predictions in real-time as new financial transactions arrived.
• Automated data ingestion from various third-party financial APIs using Azure Functions and Apache NiFi.
• Developed Apache Spark pipelines to automate the transformation of raw financial data into structured formats for analysis in Azure Synapse Analytics.
• Containerized PySpark based data pipelines using Docker and deployed them on Azure Kubernetes Service to simplify deployment in the cloud.
• Optimized data pipeline workflows to handle increasing data volume, scaling processing power dynamically using Azure Databricks and Azure Data Factory.
• Used SQL queries, Python scripts, and DBT models to monitor and troubleshoot pipeline performance.
• Implemented fault-tolerant data pipelines with automated retries and error-handling mechanisms in Azure Event Hubs, ensuring continuous data flow.
• Gathered reporting requirements and integrated real-time financial data into dashboards with the help of Azure Synapse Analytics.
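Illustrative sketch (not client code): a minimal example of the kind of anomaly flagging described above, using a rolling z-score over transaction amounts with pandas. The column names, window size, and threshold are assumptions for this example only.

```python
# Illustrative sketch only: rolling z-score anomaly flagging for transaction amounts.
# Column names ("txn_ts", "txn_amount"), window, and threshold are assumptions.
import pandas as pd

def flag_anomalies(df: pd.DataFrame, window: int = 100, threshold: float = 3.0) -> pd.DataFrame:
    """Flag rows whose amount deviates more than `threshold` rolling std-devs from the rolling mean."""
    df = df.sort_values("txn_ts").copy()
    rolling = df["txn_amount"].rolling(window=window, min_periods=10)
    zscore = (df["txn_amount"] - rolling.mean()) / rolling.std()
    df["is_anomaly"] = zscore.abs() > threshold
    return df

# Example usage with synthetic data:
# txns = pd.DataFrame({"txn_ts": pd.date_range("2024-01-01", periods=500, freq="min"),
#                      "txn_amount": 100.0})
# flagged = flag_anomalies(txns)
```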
Tools & Technologies: Azure Synapse Analytics, Azure Event Hubs, Azure Databricks, Azure Data Factory, Azure DevOps, Azure Functions, Azure Machine Learning (Azure ML), Azure Kubernetes Service (AKS), DBT (Data Build Tool), Power BI, Apache NiFi, Apache Spark, Docker, SQL (T-SQL), Python, Scala.
Client: Rivian Automobile Oct 2022 – Nov 2023
Title: Cloud Data Engineer
Key Responsibilities:
• Created real-time data pipelines using Azure Databricks and Apache Spark to process vehicle telemetry data such as GPS and sensor readings.
• Stored and managed telemetry data with PostgreSQL on Azure Database for PostgreSQL, ensuring fast access and low-latency retrieval for performance insights.
• Developed automated ETL workflows with Azure Data Factory, moving telemetry data to Azure Synapse Analytics for easier analysis and faster decision-making.
• Built a data lake using Azure Data Lake Storage, using Delta Lake on Databricks to store both structured and unstructured vehicle data for easy access.
• Set up Azure Event Hubs to ingest real-time data from vehicle sensors, triggering automated alerts and maintenance actions through Azure Functions.
• Optimized data pipelines using Apache Spark on Azure Databricks, utilizing PySpark for faster data transformations across large datasets.
• Built PySpark scripts to process different types of vehicle data and keep it consistent.
• Improved PySpark job performance by using caching and broadcasting (illustrative sketch below).
• Integrated SQL with Python to automate the extraction of vehicle performance metrics.
• Used SQL queries in Azure Synapse Analytics to support Azure Machine Learning model training, aggregating historical vehicle data into structured datasets.
• Monitored and optimized performance of both T-SQL and PL/SQL blocks for efficient data processing.
• Debugged Spark jobs using Python for efficient data processing.
• Developed Bash scripts to automate the rotation and cleanup of vehicle telemetry logs in Azure Data Lake Storage.
• Used Scala with Apache Spark for batch processing telemetry data.
• Automated data processing using Bash scripts, managing cloud resources to improve workflow and reduce manual work.
• Processed real-time data with Azure Event Hubs and Apache Kafka.
• Developed Spark scripts using Scala for data cleansing, validation, and transformation.
• Created stored procedures in SQL for Azure Database for PostgreSQL, automating vehicle data aggregation from multiple sensors.
• Used Python’s Pandas library to clean the telemetry data and then process it.
• Developed SQL views in Azure Synapse Analytics to monitor vehicle diagnostics and real-time operational data.
• Used Azure Stream Analytics for real-time data stream processing, detecting vehicle sensor anomalies and triggering timely maintenance actions.
• Built machine learning models with PyTorch and Azure Machine Learning, using telemetry data to predict vehicle issues and improve fleet management.
• Ensured secure data pipelines with Azure Key Vault, encrypting telemetry data both in transit and at rest to maintain security.
• Orchestrated data workflows with Azure Data Factory and Apache Airflow, automating tasks and managing ETL jobs between Azure Databricks and Azure Synapse Analytics.
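Illustrative sketch (not client code): a minimal PySpark example of the caching and broadcast-join pattern referenced above. The storage paths, column names, and schemas are assumptions for this example.

```python
# Illustrative sketch: cache a reused telemetry DataFrame and broadcast a small
# vehicle-metadata lookup table. Paths, columns, and schemas are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("telemetry-enrichment").getOrCreate()

telemetry = spark.read.format("delta").load("/mnt/datalake/telemetry")          # large fact data
vehicles = spark.read.format("delta").load("/mnt/datalake/vehicle_metadata")    # small dimension

# Cache telemetry because it feeds multiple downstream aggregations.
telemetry = telemetry.filter(F.col("sensor_value").isNotNull()).cache()

# Broadcast the small dimension table to avoid shuffling the large side of the join.
enriched = telemetry.join(F.broadcast(vehicles), on="vehicle_id", how="left")

daily_stats = (
    enriched.groupBy("vehicle_id", F.to_date("event_ts").alias("event_date"))
    .agg(F.avg("sensor_value").alias("avg_sensor_value"),
         F.max("speed_kph").alias("max_speed_kph"))
)
daily_stats.write.mode("overwrite").format("delta").save("/mnt/datalake/daily_vehicle_stats")
```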
Tools & Technologies: Azure Databricks, Apache Spark, PostgreSQL, Azure Database for PostgreSQL, Azure Data Factory, Azure Synapse Analytics, Azure Data Lake Storage, Delta Lake, Azure Event Hubs, Azure Functions, PySpark, SQL, Azure Machine Learning, T-SQL, PL/SQL, Bash, Scala, Apache Kafka, Azure Stream Analytics, PyTorch, Azure Key Vault, Apache Airflow.
Client: Adventist Health May 2021 – Aug 2022
Title: Data Engineer
Key Responsibilities:
• Migrated large volumes of healthcare data to Snowflake from legacy systems using Hadoop on AWS EMR, ensuring seamless integration and smooth data flow.
• Used PostgreSQL for storage and retrieval of structured data during the migration.
• Worked on optimizing Redshift queries to handle large data sets more efficiently, which helped in reducing load times.
• Built real-time data pipelines with Apache Kafka and PySpark to stream patient data from medical devices into AWS cloud, ensuring data was transferred quickly with minimal delay.
• Used AWS Glue to move and transform healthcare data from on-prem systems to AWS, improving data access and reporting for clinical teams.
• Used Apache Flink for real-time stream processing, enabling low-latency data handling for patient monitoring.
• Automated data migration workflows using Python scripts to extract, clean, and load healthcare data from legacy systems to AWS cloud services.
• Developed AWS Lambda functions to process incoming patient data in real-time, setting up automated alerts based on health metrics and reducing response time for critical issues.
• Used DynamoDB to quickly access the patient data in real time.
• Streamed data into AWS Lambda using Apache Kafka, enabling efficient, low-latency processing.
• Implemented FHIR validation in Python to ensure that all incoming patient data met FHIR standards before being processed and stored (illustrative sketch below).
• Set up a centralized data lake using AWS S3 to store both structured and unstructured data processed with Hadoop HDFS and PySpark on AWS EMR, improving data management.
• Created Power BI dashboards to track key metrics such as patient intake, bed availability, and treatment times.
• Set up Apache Flink for real-time processing of patient telemetry data from wearable devices.
• Built complex PL/SQL stored procedures to automate healthcare data transformation tasks.
• Used FHIR API to securely access and update patient data in real time from various sources.
• Automated data transfers from on-prem relational databases into Hadoop using Sqoop.
• Implemented AWS IAM roles and policies, ensuring the system met HIPAA compliance and security requirements for healthcare data protection.
• Built FHIR-compliant data validation scripts using Python to ensure the integrity of patient data.
• Created external Hive tables on AWS EMR to enable effective querying of large healthcare datasets stored in Hadoop.
• Created Python scripts to automate the extraction of patient data from external healthcare APIs, transforming it into a clean format.
• Enhanced data transformation processes with Python, SQL, and PySpark on AWS EMR, utilizing Hadoop for distributed data processing.
• Containerized applications using Docker to ensure seamless deployment and consistency across all environments.
• Developed parameterized SQL queries in AWS Redshift for efficient reporting on patient data.
• Used Hive on AWS EMR to organize and manage healthcare data.
• Managed complex ETL workflows with Apache Airflow, scheduling jobs and managing data pipelines.
• Set up monitoring for the data pipelines using Grafana, ensuring system health and quick identification of issues.
• Configured and executed Redshift Spectrum queries to access data stored in AWS S3.
• Automated patient data validation and transformation using Python and PySpark, ensuring accurate and timely integration with healthcare systems.
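Illustrative sketch (not client code): a lightweight structural check on an incoming FHIR Patient resource before downstream processing. The rules shown cover only a few example fields, not the full FHIR specification.

```python
# Illustrative sketch: basic structural checks on a FHIR Patient resource (dict form).
# The checks are examples only and do not implement the complete FHIR standard.
import re
from typing import Dict, List

def validate_patient(resource: Dict) -> List[str]:
    """Return a list of validation errors; an empty list means the resource passed these checks."""
    errors = []
    if resource.get("resourceType") != "Patient":
        errors.append("resourceType must be 'Patient'")
    if not resource.get("id"):
        errors.append("missing resource id")
    if not resource.get("name"):
        errors.append("at least one name entry expected")
    birth_date = resource.get("birthDate")
    if birth_date and not re.fullmatch(r"\d{4}(-\d{2}(-\d{2})?)?", birth_date):
        errors.append("birthDate must be YYYY, YYYY-MM, or YYYY-MM-DD")
    return errors

# Example:
# errors = validate_patient({"resourceType": "Patient", "id": "123",
#                            "name": [{"family": "Doe"}], "birthDate": "1980-05-01"})
# if errors: raise ValueError(f"FHIR validation failed: {errors}")
```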
Tools & Technologies: AWS EMR, Snowflake, Hadoop, PostgreSQL, Redshift, Apache Kafka, PySpark, AWS Glue, Apache Flink, Python, AWS Lambda, AWS S3, HIPAA, FHIR, Power BI, PL/SQL, Hive, AWS IAM, Sqoop, Docker, AWS Redshift Spectrum, Apache Airflow, Grafana.
Client: Bluegrass Cellular Feb 2020 – Mar 2021
Title: Data Analyst
Key Responsibilities:
• Gathered and examined real-time network performance data directly from SQL databases, helping the network engineering team pinpoint issues causing service slowdowns.
• Used Sqoop to import network data from relational databases into Hadoop-based systems for easier integration with real-time data processing through Apache Kafka.
• Automated data processing with AWS Lambda, reducing the delays that manual handling would have introduced.
• Used DynamoDB for faster retrieval of real-time customer information.
• Used AWS S3 to store processed data which enabled easy retrieval when needed.
• Stored important network log data in MongoDB and Hive, making it easier to access key performance indicators.
• Used HBase to store real-time network log data.
• Created Tableau dashboards to visualize metrics like traffic volume, call drop rates, and response times, pulling data directly from AWS Redshift for faster reporting.
• Stored network data in Snowflake, ensuring efficient access for reporting.
• Used Git for version control and Jenkins to automate deployment of dashboard scripts.
• Troubleshot Flink pipelines to ensure smooth network data processing.
• Stored real-time network log data in Cassandra, ensuring low-latency access for performance monitoring.
• Processed telecom data with Apache Spark on AWS EC2 for predictive analysis.
• Cleaned large datasets using Python, removing corrupted or incomplete records.
• Managed resources and optimized task scheduling using YARN on Amazon EMR, ensuring efficient execution of distributed jobs.
• Ran large-scale network traffic analyses with Apache Spark, identifying emerging patterns that could affect future bandwidth allocation.
• Integrated MySQL and Oracle into the pipeline for seamless data sync.
• Managed both historical and incoming data in AWS S3, optimizing storage for fast access without compromising performance.
• Set up real-time alerts using AWS Lambda, enabling quick responses to any unusual network events, helping to avoid disruptions.
• Worked to optimize data flow and improve network responsiveness using Apache Flink, addressing performance data gaps.
• Used Apache Flink's event-driven processing to handle high-volume network data.
• Developed custom Python scripts that filtered network logs and automatically identified performance issues, speeding up troubleshooting (illustrative sketch below).
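Illustrative sketch (not client code): a minimal Python filter that flags log rows exceeding a latency threshold. The log format (CSV with timestamp, cell_site, latency_ms columns) and the threshold are assumptions for this example.

```python
# Illustrative sketch: scan network log rows and yield entries whose latency exceeds a threshold.
# The CSV layout and threshold value are assumptions, not the production log format.
import csv
from typing import Dict, Iterator

LATENCY_THRESHOLD_MS = 200.0

def flag_slow_sites(log_path: str) -> Iterator[Dict[str, str]]:
    """Yield log rows whose latency exceeds the threshold."""
    with open(log_path, newline="") as fh:
        for row in csv.DictReader(fh):
            try:
                if float(row["latency_ms"]) > LATENCY_THRESHOLD_MS:
                    yield row
            except (KeyError, ValueError):
                continue  # skip malformed rows rather than failing the whole scan

# Example:
# for row in flag_slow_sites("network_logs.csv"):
#     print(row["timestamp"], row["cell_site"], row["latency_ms"])
```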
Tools & Technologies: AWS Lambda, AWS S3, MongoDB, Hive, HBase, Tableau, AWS Redshift, Snowflake, Git, Jenkins, Apache Flink, Cassandra, Apache Spark, AWS EC2, Python, YARN, Amazon EMR, MySQL, Oracle.
Client: Essar Oil Sep 2017 – Nov 2019
Title: ETL Developer
Key Responsibilities:
• Created ETL workflows using Informatica and SSIS to extract, transform, and load data from various systems (SQL Server, Oracle) into the central data warehouse.
• Used PL/SQL and T-SQL for complex data transformations and stored procedures, which helped maintain high data integrity and improved query performance.
• Implemented Unix Shell Scripts to automate routine data extraction and loading tasks, reducing manual intervention and improving efficiency.
• Applied Data Warehousing principles to design, build, and optimize large-scale data storage solutions, facilitating efficient reporting and analysis.
• Created reports using Crystal Reports and Cognos, delivering actionable insights on business performance.
• Performed data cleansing and transformation in SQL to ensure consistent and accurate data for downstream applications.
• Used Python’s Pandas library to cleanse and transform large datasets.
• Worked with Oracle and SQL Server databases, performing data extraction, optimization, and indexing for faster data retrieval.
• Created SQL queries to support data validation and reporting requirements, ensuring consistency across all systems.
• Executed SQL queries through Python connectors such as pyodbc and SQLAlchemy to extract data from Oracle and SQL Server (illustrative sketch below).
• Implemented data models and schemas to improve the data structure for reporting and analysis.
• Used Git version control to track changes in the data models.
• Set up job scheduling with Control-M, automating the ETL workflows to ensure timely and accurate data delivery.
• Used Python to extract job execution logs from Control-M.
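Illustrative sketch (not client code): a minimal example of pulling data from SQL Server into pandas via SQLAlchemy with the pyodbc driver. The connection string, table, and column names are placeholders, not client systems.

```python
# Illustrative sketch: parameterized extraction from SQL Server using SQLAlchemy + pyodbc.
# The connection string, schema, and columns are placeholders for the example.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine(
    "mssql+pyodbc://user:password@dw-server/StagingDB?driver=ODBC+Driver+17+for+SQL+Server"
)

query = text("""
    SELECT order_id, order_date, amount
    FROM dbo.sales_orders
    WHERE order_date >= :start_date
""")

with engine.connect() as conn:
    # Named bind parameter keeps the query safe and reusable across date ranges.
    df = pd.read_sql(query, conn, params={"start_date": "2019-01-01"})

print(df.head())
```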
Tools & Technologies: Informatica, SSIS, SQL Server, Oracle, PL/SQL, T-SQL, Unix Shell Scripts, Data Warehousing, Crystal Reports, Cognos, Python, Pandas, pyodbc, SQLAlchemy, Git, Control-M.
Client: Craftsvilla Aug 2015 – Jul 2017
Title: SQL Developer
Key Responsibilities:
• Created SQL queries for extracting data to support business KPIs and reporting requirements.
• Transformed raw data from different sources into structured formats for easy analysis and reporting.
• Automated data extraction and loading using SSIS, reducing the time spent on manual data processing.
• Improved query performance and optimized SQL Server for faster report generation and better responsiveness.
• Designed and maintained Power BI dashboards for real-time business insights, helping teams to make swift data-driven decisions.
• Resolved data inconsistencies, ensuring all data across business systems was accurate and reliable.
• Provided support by troubleshooting data issues and addressing any discrepancies.
• Ensured data security by setting up access controls and permissions in SQL Server, protecting sensitive business information.
• Validated report accuracy and ensured data integrity was maintained across all systems.
• Managed and processed Change Orders (CO) to accommodate updated project requirements, ensuring seamless integration of changes into the reporting system.
• Monitored SQL jobs regularly and handled any failures by resolving them quickly to ensure smooth data processing.
Tools & Technologies: SQL, T-SQL, SSIS, SQL Server, Power BI, Data Validation, Data Security.