
Sr. Data Engineer

Location:
Chicago, IL
Salary:
125000
Posted:
December 18, 2025

Resume:

Syed Sohail

Sr. Data Analyst / Engineer

Email: ************@*****.***

Phone: +1-773-***-****

PROFESSIONAL SUMMARY:

Overall 7+ years of experience in Data Analysis, ETL Development, Data Modeling, and Data Warehousing technologies, with a focus on Azure big data solutions including Azure HDInsight, Azure Databricks, and Azure Synapse Analytics.

Strong experience in Business and Data Analysis, Data Profiling, Data Migration, Data Conversion, Data Quality, Data Integration, Metadata Management Services, and Configuration Management.

Collaborate with database administrators to ensure database security and compliance with regulations.

Conduct impact analysis of data model changes in ERwin and assess their potential effect on the overall system architecture.

Designed and implemented Azure Function Apps as middleware to connect and orchestrate bespoke solutions, facilitating seamless data flow and integration across custom applications.

Built and maintained ETL pipelines using Azure Data Factory, PySpark, and custom scripts to ingest, process, and transform large volumes of structured and unstructured data.

Design and develop conceptual, logical, and physical data models using Azure Databricks to support large-scale data processing and analytics.

Work closely with ETL developers to integrate data from various sources into the data warehouse, leveraging PL/SQL for seamless data migration.

Assist in the migration of data models to cloud-based environments, ensuring compatibility with Java applications.

Review and refine reporting processes to enhance efficiency, accuracy, and accessibility across departments.

Work with stakeholders to identify data needs and provide insights through automated reporting pipelines in Azure DevOps.

Operationalized machine learning models in production using Azure Machine Learning Service, integrating with other Azure services for complete data solutions.

Implemented data quality checks, validation processes, and automated data quality assurance pipelines to ensure data accuracy and reliability.

Tuned Spark applications in Azure for performance, memory utilization, and batch interval time.

Tuned SQL queries and stored procedures in Azure Synapse Analytics and other databases, achieving significant performance gains.

Strong experience in Data Modeling, with expertise in creating Star and Snowflake schemas, FACT and Dimension tables, and Physical and Logical data models using Erwin and Embarcadero.

Collaborate with data engineers to build scalable data pipelines using Azure DevOps and other Azure services like Data Factory.

Perform root cause analysis on data issues, leveraging Azure DevOps logs and tools for troubleshooting.

Ability to collaborate with peers in both business and technical areas to deliver optimal business process solutions in line with corporate priorities. Experienced working with Excel Pivot tables and VBA macros for various business scenarios.

Strong experience in interacting with stakeholders/customers, gathering requirements through interviews, workshops, and existing system documentation or procedures, defining business processes, and identifying and analyzing risks using appropriate templates and analysis tools.

Experienced in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Good understanding of the need for cost-based optimization for applications in AWS.

Experience in building and architecting multiple data pipelines, including ETL and ELT processes for data ingestion and transformation in GCP using BigQuery, Dataproc, Cloud SQL, and Datastore.

Experience with Google Cloud Platform (GCP) components, Google Container Builder, GCP client libraries, and the Cloud SDK.

Experience in building Power BI reports to improve data analysis.

Extensive experience in developing applications that perform Data Processing tasks using Teradata, Oracle, SQL Server, and MySQL databases.

Strong experience in Software Development Life Cycle (SDLC) including Requirements Analysis, Design Specification and Testing as per Cycle in both Waterfall and Agile methodologies.

Experience working with Big Data technologies, including Cloudera and Hortonworks distributions, with expertise in Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.

Proficient in programming languages such as Scala, Java, Python, SQL, T-SQL, and R.

Deep understanding of Machine Learning techniques, specifically related to Time series and Regression.

Experience with database manipulation (ETL), SQL programming, and reporting (Tableau, Power BI, and SSRS).

TECHNICAL SKILLS:

Big Data Technologies:

Azure HDInsight, Azure Synapse Analytics, Azure Data Factory, Azure Data Lake Analytics, Azure Functions, Azure Event Hubs, Azure Data Lake Storage Gen2

Databases:

Oracle, MySQL, SQL Server, MongoDB, Cassandra, DynamoDB, PostgreSQL, Teradata, Cosmos DB, Azure SQL Database.

Programming:

Python, PySpark, Scala, Java, C, C++, Shell scripting, Perl scripting, SQL

Cloud Technologies:

AWS, Microsoft Azure, GCP

Frameworks:

Django REST framework, MVC, Hortonworks

Tools:

Visual Studio, SQL*Plus, TOAD, SQL Navigator, Query Analyzer, SQL Server, SQL Assistant, Eclipse, Postman, Tableau 2020.3/2019/10.x/9.x (Desktop, Server, Public), Business Intelligence Development Studio (SSRS), SSIS, Alteryx, Qlik Sense, Einstein Analytics, Power BI, Crystal Reports, Cognos.

Versioning tools:

SVN, Git, GitHub

Operating Systems:

Unix, Linux, Windows, MAC, Solaris

Database Modelling:

Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling, Teradata.

Monitoring Tool:

Apache Airflow

Visualization/ Reporting:

Tableau, ggplot2, Matplotlib, SSRS and Power BI

EDUCATION:

Master of Science in Business Analytics, Trine University.

EXPERIENCE:

Client: Harmony Cares, Troy, Michigan Nov 2023 – Current

Role: Sr. Data Analyst/ Engineer

Project Description: Harmony Cares is a leading provider of in-home healthcare services, dedicated to delivering compassionate, personalized, and high-quality care to individuals who face challenges accessing traditional medical facilities.

Responsibilities:

Designed, implemented, and optimized big data solutions on the AWS platform, which is crucial for data engineering, analytics, machine learning, and ETL processes.

Analyzed and debugged a large-scale Tableau dashboard issue involving missing staffing agent data for a specific time interval.

Investigated and validated SQL source queries across complex historical employee datasets to identify data gaps and inconsistencies.

Developed complex workflows using Azure Logic Apps to automate data ingestion, transformation, and delivery processes, ensuring seamless integration with Azure Data Factory and other cloud services.

Developed bespoke middleware using Azure Function Apps to handle complex business logic, enabling real-time data processing and integration between diverse systems.

Designed and implemented Azure Logic Apps to automate workflows and integrate data across multiple Azure services, enhancing data processing efficiency.

Leveraged Azure Data Lake Analytics for serverless U-SQL queries on data lake storage, enabling ad hoc analysis and reducing time to insights.

Designed and deployed machine learning models using Azure Machine Learning Studio for customer churn prediction and personalized marketing.

Developed MLOps pipelines using Azure Machine Learning Service to streamline model deployment and lifecycle management.

Implemented real-time data processing and model inference using Azure Databricks and Azure Event Hubs.

Automated data preparation and feature engineering using PySpark and Azure Data Factory, optimizing model training efficiency.

Collaborated with QA and business users to interpret Tableau outputs, trace data lineage, and validate root causes.

Developed SQL filters and temporary logic to isolate missing records using fields like Employee ID, DOCT, and position history.

Documented key findings, query logic, and validation checkpoints to support reproducibility and transparency.

Participated in cross-functional syncs to coordinate updates, discuss blockers, and ensure traceability across dashboards and data sources.

Proposed improvements to Tableau report logic and suggested steps to prune and optimize slow-performing backend data views.

Implemented ADF parameterization for dynamic pipeline orchestration across environments.

Authored custom data profiling scripts in PySpark for validation and schema enforcement.
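
A minimal sketch of this kind of PySpark profiling and schema check (illustrative only; the path, column names, and expected schema are hypothetical placeholders, not the project's actual objects):

```python
# Illustrative PySpark profiling/validation sketch; the path, columns, and
# expected schema are hypothetical examples.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("profile-raw-events").getOrCreate()

expected = StructType([
    StructField("employee_id", StringType(), False),
    StructField("event_ts", TimestampType(), True),
    StructField("status", StringType(), True),
])

df = spark.read.parquet("/mnt/datalake/raw/events")  # hypothetical path

# Schema enforcement: fail fast if required columns are missing.
missing = [f.name for f in expected.fields if f.name not in df.columns]
if missing:
    raise ValueError(f"Schema check failed, missing columns: {missing}")

# Basic profiling: total row count plus null counts per expected column.
profile = df.select(
    [F.count(F.when(F.col(f.name).isNull(), f.name)).alias(f"{f.name}_nulls")
     for f in expected.fields]
    + [F.count(F.lit(1)).alias("row_count")]
)
profile.show(truncate=False)
```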

Implemented Data Quality checks and exception logging with Azure Functions and Event Grid.

Configured and maintained staging & production environments using Azure DevOps pipelines.

Collaborated with ML teams to prepare training data pipelines using Azure Databricks & MLlib.

Defined warehouse sizing and role-based access controls in Snowflake.

Developed Spark Streaming pipelines in Databricks to handle event updates into Delta Lake.

Enabled schema enforcement and error handling for malformed event records via PySpark logic.
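
A condensed sketch of a Structured Streaming flow of this shape, assuming JSON events landing in cloud storage; the paths, event schema, and checkpoint location are placeholders:

```python
# Hedged sketch: stream JSON events, quarantine malformed records, and append
# valid rows to a Delta table. All paths and the schema are illustrative.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("events-to-delta").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType(), True),
    StructField("event_ts", TimestampType(), True),
    StructField("payload", StringType(), True),
])

# Read the raw feed as text so malformed JSON does not kill the stream.
raw = spark.readStream.text("/mnt/datalake/landing/events")  # hypothetical path
parsed = raw.withColumn("data", F.from_json("value", event_schema))

def write_batch(batch_df, batch_id):
    # Valid rows go to the Delta table; rows that failed parsing are quarantined.
    good = batch_df.filter("data IS NOT NULL").select("data.*")
    bad = batch_df.filter("data IS NULL").select("value")
    good.write.format("delta").mode("append").save("/mnt/delta/events")
    bad.write.mode("append").text("/mnt/delta/events_quarantine")

(parsed.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .start())
```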

Developed Palantir Foundry workflows for data integration, transformation, and analytics use cases across multiple teams.

Implemented Delta Lake architecture on Databricks for versioned, ACID-compliant, and optimized data lake storage.

Implemented fully automated data pipelines integrating Palantir Foundry with AWS services for real-time data ingestion and transformation.

Queried and filtered sensitive datasets using Teradata/SQL Server to validate ticket-reported anomalies.

Developed source-level queries and data extraction procedures for validation against Tableau visual outputs.

Participated in exploratory sessions identifying opportunities for anomaly detection using statistical rules.

Evaluated data completeness and structure requirements for alerting pipelines (Elastic + Kibana).

Contributed to Jira-based task tracking and review processes for multi-team collaboration on fraud/data quality metrics.

Environment: Hadoop, HDFS, Hive, Core Java, Spark, Scala, Azure SQL Database, Azure Synapse Analytics, Azure Blob Storage, Azure HDInsight, Azure Databricks, Azure Data Factory, Azure Data Lake Storage, Apache Kafka, Apache Flume, Python, PySpark, Unix/Linux Shell Scripting.

Client: Citi Bank, Texas Jan 2022 – Oct 2023

Role: Sr. Data Analyst / Engineer

Responsibilities:

Design and develop data models, data structures, and ETL jobs for data acquisition and manipulation.

Perform complex data analysis using Teradata to generate insights for business decision-making.

Generate database schemas and DDL scripts from ERwin models for deployment in various database management systems.

Ensure data integrity and consistency across models by applying ERwin’s data governance and standardization features.

Optimize performance of data models in Azure Databricks by leveraging Spark clusters and partitioning strategies.
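
One hedged illustration of the partitioning approach mentioned above (table and column names are hypothetical; the OPTIMIZE/ZORDER step assumes a Databricks Delta Lake environment):

```python
# Illustrative only: partition a large fact table by date and cluster by a
# frequently filtered key. Table and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("model-optimization").getOrCreate()

fact = spark.table("raw.transactions")          # hypothetical source table

(fact.repartition("txn_date")                   # reduce small-file skew on write
     .write.format("delta")
     .mode("overwrite")
     .partitionBy("txn_date")
     .saveAsTable("analytics.fact_transactions"))

# Databricks-specific Delta maintenance; requires a Databricks/Delta environment.
spark.sql("OPTIMIZE analytics.fact_transactions ZORDER BY (customer_id)")
```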

Write and optimize SQL queries to extract and manipulate data from Teradata databases.

Collaborate with data engineers and software developers to integrate data models into applications, ensuring efficient data retrieval and manipulation using Python.

Perform data profiling and cleansing tasks, leveraging Python's data manipulation capabilities to improve data quality and consistency.

Collaborate with stakeholders to gather requirements and translate them into data modeling solutions aligned with Java applications.

Implement ETL processes using Java-based tools to streamline data ingestion from various sources into the data warehouse.

Write and maintain shell scripts to automate data extraction, transformation, and loading (ETL) processes, enhancing data pipeline efficiency.

Design and implement data models to support data warehousing initiatives, ensuring alignment with business requirements.

Develop logical and physical data models that optimize data storage, retrieval, and processing within the data warehouse environment.

Collaborate with cross-functional teams to gather requirements and translate them into effective data modeling solutions.

Design and develop data models for relational databases, ensuring efficient data organization and access while utilizing Unix shell scripting for automation.

Work closely with ETL developers to integrate data from various sources into the data warehouse, leveraging PL/SQL for seamless data migration.

Document data models, data flow processes, and PL/SQL procedures for knowledge sharing and compliance purposes.

Collaborate with software engineers to design and optimize data access patterns, ensuring efficient integration between data models and Java applications.

Utilize Java for data transformation processes, enhancing data quality and consistency across multiple sources.

Develop a deep understanding of data sources, implement data standards, and maintain data quality and master data management.

Conduct performance tuning of PL/SQL queries to optimize data retrieval and processing times within the database.

Support data migration and remediation efforts by writing and executing PL/SQL scripts for data cleansing and transformation.

Expert in developing JSON definitions for deploying data-processing pipelines in Azure Data Factory (ADF).

Expert in using Databricks with Azure Data Factory (ADF) to process large volumes of data.

Maintain version control and data model documentation in Azure Databricks to ensure alignment with data governance policies.

Collaborate with data scientists to structure datasets in Azure Databricks for machine learning and advanced analytics.

Collaborate with development teams to define and implement data workflows within Azure DevOps pipelines.

Design and maintain data integration solutions using Azure DevOps for automating ETL processes.

Perform data modeling tasks in Azure Databricks to integrate data from various sources, including Azure Data Lake, SQL databases, and external APIs.

Perform model version control and manage updates to data models using ERwin’s collaboration and version management capabilities.

Perform data analysis to support business decisions, translating insights into concise reports.

Analyze data and metrics generated from Azure DevOps environments to optimize software delivery processes.

Leverage Azure DevOps pipelines to automate data extraction, transformation, and loading into cloud databases.

Worked on data analytics projects to support business strategy and improve operational efficiency, utilizing advanced SQL and Python for data extraction and analysis.

Conduct deep-dive data analysis to provide actionable insights that enhance customer retention and improve product offerings.

Developed streaming pipelines using Azure Event Hubs and Stream Analytics to analyze dealer efficiency and open-table counts from data coming in from IoT-enabled poker and other pit tables.
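
As one hedged alternative sketch of consuming such a feed, this reads an Event Hub from Spark through its Kafka-compatible endpoint rather than through a Stream Analytics job; the namespace, hub name, payload fields, and connection string are hypothetical placeholders:

```python
# Hedged sketch: read IoT table telemetry from Event Hubs via its Kafka-compatible
# endpoint and count open tables per 5-minute window. Names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("table-telemetry").getOrCreate()

connection = "Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=..."  # redacted
jaas = ("org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{connection}";')

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "mynamespace.servicebus.windows.net:9093")
          .option("subscribe", "pit-table-telemetry")
          .option("kafka.security.protocol", "SASL_SSL")
          .option("kafka.sasl.mechanism", "PLAIN")
          .option("kafka.sasl.jaas.config", jaas)
          .load())

events = stream.selectExpr("CAST(value AS STRING) AS body", "timestamp")
open_tables = (events
    .withColumn("table_id", F.get_json_object("body", "$.table_id"))
    .withColumn("is_open", F.get_json_object("body", "$.is_open").cast("boolean"))
    .filter("is_open")
    .groupBy(F.window("timestamp", "5 minutes"))
    .agg(F.approx_count_distinct("table_id").alias("open_tables")))

(open_tables.writeStream
    .outputMode("complete")
    .format("memory")
    .queryName("open_table_counts")
    .start())
```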

Collaborate with application architects on migrating Infrastructure as a Service (IaaS) applications to Platform as a Service (PaaS).

Environment: Azure, Azure Data Lake, Azure Data Factory (ADF), Azure Databricks, Dataflow, Python, PySpark, Synapse, Azure Data Studio, Microsoft SQL Server Management Studio 18, Oracle, SQL Server, Teradata, Snowflake.

Client: GE Healthcare, Chicago, Illinois April 2020 – Dec 2021

Role: Sr. Data Engineer

Responsibilities:

Extensively used PySpark to implement transformations deployed in Azure HDInsight for the ingestion, hygiene, and identity resolution process.

Work with large datasets to create summaries and detailed reports that guide business strategies.

Collaborate with data engineers and architects to define data structures and models within Azure Databricks environments.

Use Azure Databricks to create and maintain data pipelines that support real-time and batch data processing.

Developed and operationalized machine learning pipelines using Azure Machine Learning Service, improving decision-making processes.

Leveraged Snowflake ML capabilities to train and deploy machine learning models for fraud detection and customer segmentation.

Automated the ETL-to-ML workflow using Azure Data Factory and PySpark, enhancing the efficiency of data preparation for model training.

Utilized Azure Cognitive Services for text analytics and sentiment classification, enabling customer feedback analysis.

Designed and implemented data models in Azure Synapse Analytics for efficient storage and analysis of financial data.

Developed and deployed scalable real-time data processing applications using Azure Stream Analytics.

Migrated legacy MapReduce programs to Spark transformations in Azure for improved performance and maintainability.

Proficient in SQL, Power BI, and other visualization tools.

Designed and optimized data warehouses on Azure Synapse Analytics and Azure SQL Data Warehouse.

Designed and implemented scalable, high-performance data pipelines on Azure for various industries.

Developed complex workflows using Azure Logic Apps to automate data ingestion, transformation, and delivery processes, ensuring seamless integration with Azure Data Factory and other cloud services.

Developed bespoke middleware using Azure Function Apps to handle complex business logic, enabling real-time data processing and integration between diverse systems.

Designed and implemented Azure Logic Apps to automate workflows and integrate data across multiple Azure services, enhancing data processing efficiency.

Leveraged Azure Data Lake Analytics for serverless U-SQL queries on data lake storage, enabling ad hoc analysis and reducing time to insights.

Developed and optimized Cosmoscope scripts to perform advanced data exploration and transformation, ensuring high performance and scalability for complex queries in Azure environments.

Implemented Azure Synapse Analytics's security features (Transparent Data Encryption, Azure Active Directory integration) to ensure compliance with data protection regulations.

Integrated Azure Active Directory (AAD) for secure authentication and authorization of data engineering workflows, ensuring compliance with enterprise security policies.

Leveraged Azure Active Directory to enforce data governance policies, ensuring that only authorized users and applications could access critical data assets.

Implemented automated data quality checks and validation using Python and PySpark within Azure environments.

Developed and tested machine learning models with Azure Machine Learning Studio for financial forecasting and risk assessment.

Utilized AWS CLI to automate backups of ephemeral data stores to S3 buckets and EBS, and to create nightly AMIs of mission-critical production servers as backups.

Provide insights through detailed reports to support business forecasting and planning.

Develop optimized data models in Azure Databricks to enable efficient querying and data retrieval for analytics.

Implement data transformations using Databricks’ notebooks and Spark SQL to structure raw data for analysis.

Work with stakeholders to develop key performance indicators (KPIs) and metrics for monitoring business performance.

Developed a deduplication module for sales and marketing contact data of Confidential; executed Hive queries on Parquet tables stored in Hive to perform data analysis. Developed a REST API using the Flask framework (Python) for the front end (UI) to consume.
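
A small, hedged sketch of the contact-deduplication pattern this describes (the match key, column names, and survivorship rule are illustrative assumptions, not the project's actual logic):

```python
# Illustrative dedup sketch: keep the most recently updated record per
# normalized email. Columns and the survivorship rule are assumptions.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("contact-dedup").getOrCreate()

contacts = spark.read.parquet("/mnt/datalake/curated/contacts")  # hypothetical path

normalized = contacts.withColumn("email_key", F.lower(F.trim("email")))

w = Window.partitionBy("email_key").orderBy(F.col("last_updated").desc())

deduped = (normalized
           .withColumn("rn", F.row_number().over(w))
           .filter("rn = 1")
           .drop("rn", "email_key"))

deduped.write.mode("overwrite").parquet("/mnt/datalake/curated/contacts_deduped")
```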

Collaborate with DevOps engineers to integrate data sources and automate data flow between different systems.

Designed and built interactive dashboards in Power BI and Tableau for cross-functional teams, enabling real-time decision-making and reporting.

Created predictive models to forecast sales trends, using statistical techniques like regression analysis and time-series forecasting.
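
A minimal, hedged sketch of a trend-plus-lag regression for monthly sales forecasting (the CSV file and column names are hypothetical; real work would also include seasonality and model validation):

```python
# Hedged sketch of a simple trend + lag regression for monthly sales.
import pandas as pd
from sklearn.linear_model import LinearRegression

sales = pd.read_csv("monthly_sales.csv", parse_dates=["month"])  # hypothetical file
sales = sales.sort_values("month").reset_index(drop=True)

# Feature engineering: linear trend index plus a one-month lag of sales.
sales["t"] = range(len(sales))
sales["lag_1"] = sales["sales"].shift(1)
sales = sales.dropna()

X, y = sales[["t", "lag_1"]], sales["sales"]
model = LinearRegression().fit(X, y)

# One-step-ahead forecast from the latest observed month.
next_X = pd.DataFrame({"t": [sales["t"].iloc[-1] + 1],
                       "lag_1": [sales["sales"].iloc[-1]]})
print("Next-month forecast:", model.predict(next_X)[0])
```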

Implement data cleansing and validation processes to ensure the accuracy and integrity of data in Teradata systems.

Assist in database design and architecture discussions to optimize performance for Teradata.

Implement data validation and quality checks within Azure DevOps CI/CD pipelines to ensure data accuracy.

Evaluated REST API calls through Python scripts as well as Postman. Implemented Chef cookbooks for OS component configuration to keep AWS server templates minimal, and wrote Chef cookbooks for various configurations to modularize and optimize product configuration.

Created an Azure ML Studio pipeline with Python module code to execute Naive Bayes and boosted classification (machine learning algorithms) for persona mapping.
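
For comparison, a hedged scikit-learn sketch of the classification step itself (Naive Bayes versus a boosted model), rather than the Azure ML Studio pipeline; the feature extract and persona labels are hypothetical:

```python
# Illustrative persona-classification sketch; the input file and columns are
# hypothetical, and features are assumed to be numeric.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("persona_features.csv")        # hypothetical extract
X = data.drop(columns=["persona"])
y = data["persona"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

for name, model in [("naive_bayes", GaussianNB()),
                    ("boosted", GradientBoostingClassifier())]:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy={acc:.3f}")
```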

Loaded (ingested) data from Salesforce, SAP, SQL Server, and Teradata into Azure Data Lake using Azure Data Factory.

Implemented business rules for deduplication of contacts using Spark transformations with PySpark. Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.

Implemented ETL processes using Alteryx and SQL to streamline data extraction and transformation from multiple sources into a unified format.

Versioned artifacts with timestamps and deployed artifacts to servers in the AWS cloud using Ansible and Jenkins.

Involved in setting up JIRA as a defect-tracking system and configured various workflows, customizations, and plugins for the JIRA bug/issue tracker.

Created a file-based data lake using Azure Blob Storage, Azure Data Factory, and Azure HDInsight. Used HBase for data storage and retrieval.

Client: IMG Group, Indianapolis, IN Sep 2018 - March 2020

Role: Data Analyst

Project Description: IMG built a solid foundation by providing multiple innovative insurance, reinsurance, and international medical management programs.

Responsibilities:

Participated in the analysis, design, and development phases of the Software Development Lifecycle (SDLC). Worked in an Agile environment with sprint planning meetings, scrum calls, and retrospective meetings every sprint; used JIRA for project management and GitHub for version control.

Design and develop conceptual, logical, and physical data models using ERwin to support business requirements.

Collaborate with stakeholders to gather data requirements and translate them into ERwin data models.

Extensively used PySpark to implement transformations deployed in Azure HDInsight for the ingestion, hygiene, and identity resolution process.

Participate in architecture discussions, contributing to the design of scalable and robust data solutions that integrate seamlessly with Java applications.

Optimize data retrieval performance through effective indexing strategies and Java-based caching mechanisms.

Created a file-based data lake using Azure Blob Storage, Azure Data Factory, and Azure HDInsight. Used HBase for data storage and retrieval.

Optimize SAS programs for performance improvements and faster processing times.

Design and develop data models for relational databases, ensuring efficient data storage and retrieval processes using PL/SQL.

Write, optimize, and maintain PL/SQL scripts to implement data transformations, validations, and business rules within the data model.

Troubleshoot and resolve any discrepancies or issues in data reporting and analytics.

Create and configure build and release pipelines in Azure DevOps for deploying data analysis tools and scripts.

Manage and version control data analysis scripts and reports using Azure Repos within Azure DevOps.

Collaborate with other teams to integrate reporting solutions with existing systems.

Create and maintain dashboards and reports in SAS for real-time data monitoring.

Create scalable data models to handle big data workloads in Azure Databricks using Delta Lake for optimized storage and processing.

Work closely with business stakeholders to translate business requirements into data models and ensure their implementation in Azure Databricks.

Created an Azure ML Studio pipeline with Python module code to execute Naive Bayes and boosted classification (machine learning algorithms) for persona mapping.

Loaded (ingested) data from Salesforce, SAP, SQL Server, and Teradata into Azure Data Lake using Azure Data Factory.

Implemented business rules for deduplication of contacts using Spark transformations with PySpark. Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.

Developed graph database nodes and relationships using the Cypher language. Developed a Spark job using Spark DataFrames to flatten JSON documents into flat files.
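
A short, hedged sketch of the JSON-flattening pattern referred to here (the input path and nested field names are assumptions for illustration):

```python
# Illustrative sketch: flatten nested JSON documents to a flat file with
# Spark DataFrames. Path and field names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

docs = spark.read.json("/mnt/datalake/raw/policies")   # hypothetical path

flat = (docs
        .select("policy_id",
                F.col("holder.first_name").alias("holder_first_name"),
                F.col("holder.last_name").alias("holder_last_name"),
                F.explode_outer("coverages").alias("coverage"))
        .select("policy_id", "holder_first_name", "holder_last_name",
                F.col("coverage.type").alias("coverage_type"),
                F.col("coverage.limit").alias("coverage_limit")))

flat.write.mode("overwrite").option("header", True).csv("/mnt/datalake/flat/policies")
```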

Developed microservices using AWS Lambda to make API calls to third-party vendors such as Melissa and StrikeIron. Involved in supporting cloud instances running Linux and Windows on AWS, with experience in Elastic IP, Security Groups, and Virtual Private Cloud in AWS.

Extensive experience configuring Amazon EC2, Amazon S3, Amazon Elastic Load Balancing, IAM, and Security Groups in public and private subnets in a VPC, along with other AWS services; managed network security using load balancers, auto-scaling, security groups, and NACLs.

Worked on OpenShift PaaS product architecture and created OpenShift namespaces for on-prem applications migrating to the cloud.

Experience working with on-premises network, application, and server monitoring tools such as Nagios, Splunk, and AppDynamics, and with the CloudWatch monitoring tool on AWS.

Involved in setting up JIRA as a defect-tracking system and configured various workflows, customizations, and plugins for the JIRA bug/issue tracker.


