Purna Sai K
: *************@*****.*** : +1-773-***-****
LinkedIn: www.linkedin.com/in/purna-sai-k-809a65166
SUMMARY
• Over 8 years of experience in Data Architecture, ETL workflows, and Data Visualization, with deep expertise in handling large-scale data systems, optimizing workflows, and transforming raw data into actionable insights across multiple industries, including finance and healthcare, specializing in Azure Data Factory, Databricks, PySpark, and SQL.
• 8+ years of experience in designing and deploying scalable cloud-based data pipelines using Python, PySpark, SQL, Azure Data Factory, and Databricks, with a focus on financial and healthcare domains.
• Proficient in building cloud-native data products, designing robust data pipelines, and enabling analytics for supply chain and financial systems using modern tools like Snowflake, Power BI, and Kubernetes.
• Hands-on experience with data lake architecture, metadata management, data quality, and version control (Git/GitHub), with strong focus on data governance and pipeline automation.
• Built scalable ETL pipelines in PySpark and AWS Glue to process large healthcare datasets and load them into Redshift for analytics and reporting (see the Redshift load sketch after this summary).
• Automated data ingestion from various structured and semi-structured sources (CSV, JSON, XML) stored in S3, leveraging Glue and Lambda.
• Optimized complex SQL queries and Spark jobs for improved performance and lower compute costs across EMR clusters.
• Worked on migration of on-prem data workflows to AWS cloud stack, enhancing data availability and reliability for internal stakeholders.
• Experienced in Medicaid data reporting, ETL pipeline development, and report automation using Oracle, PL/SQL, BIRT, Power BI, and Azure Synapse, ensuring compliance with CMS standards and state-level healthcare programs.
• Proficient in managing SQL and NoSQL databases including MySQL, PostgreSQL, MongoDB, and Redis for scalable, high-performance data solutions.
• Experienced in designing scalable ETL processes using Apache Spark, PySpark, and SQL to deliver robust data solutions.
• Worked on developing and enhancing data ingestion and processing frameworks using Azure Data Factory, Databricks, AWS, and Google Cloud Platform (GCP).
• Proficient in designing and managing enterprise data pipelines and models using Azure Data Factory (ADF), Microsoft MDS, and Snowflake, with expertise in SQL, stored procedures, and data warehousing concepts, ensuring smooth integration across cloud platforms.
• Experienced in using Power BI and Tableau to create insightful dashboards and visualizations that improve executive reporting and support data-driven decision-making.
• Experienced in machine learning, big data technologies, and development using R and Python, along with proficiency in Linux, SQL, and Git/GitHub.
• Extensive knowledge of the Hadoop ecosystem components, combined with hands-on experience in processing and analyzing large-scale datasets using Spark, Hive, and YARN.
• Created efficient data models for optimized storage and retrieval using Apache Hive, Apache Parquet, and Snowflake.
• Expertise in Azure Data Factory (ADF) Mapping Data Flow components such as Join, Lookup, Filter, and Sort to simplify data processing and transformations.
• Successfully migrated multiple SSIS packages to the cloud from on-premises using a "Lift and Shift" strategy, with minimal configuration changes as required.
• Developed ETL pipelines to enable smooth data integration between OLAP and OLTP environments, ensuring timely and precise data flow for analytics and reporting.
• Proficient in creating automated regression scripts in Python to validate ETL processes across various databases, including Oracle, SQL Server, Hive, and MongoDB (see the validation sketch after this summary).
• Worked on programming languages like Python, Scala, and Java, with practical experience in containerization technologies such as Docker and Kubernetes to facilitate cloud deployments.
• Strong knowledge of relational and NoSQL databases, including PostgreSQL, Oracle, and MongoDB, along with Infrastructure as Code (IaC) experience using Terraform.
• Developed and deployed reporting solutions using Oracle SQL and PL/SQL to extract valuable insights from JSON and XML data, seamlessly integrating with Tableau and other BI tools to create interactive dashboards and support data-driven decision-making.
• Implemented data security measures and access controls for Oracle-based Jasper Reports, ensuring regulatory compliance and secure data access.
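Illustrative sketch for the PySpark/Glue-to-Redshift bullet above: a minimal PySpark job that reads raw claim files from S3 and loads them into Redshift over JDBC. The bucket, table, and connection values are hypothetical placeholders, and the production Glue jobs used project-specific configuration and the Redshift JDBC driver on the classpath.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims_etl").getOrCreate()

# Read raw claim files landed in S3 (hypothetical bucket/prefix).
claims = spark.read.option("header", True).csv("s3://example-raw-zone/claims/")

# Basic cleansing: drop records without a claim id and standardize dates.
cleaned = (
    claims
    .filter(F.col("claim_id").isNotNull())
    .withColumn("service_date", F.to_date("service_date", "yyyy-MM-dd"))
)

# Load into a Redshift staging table over JDBC (endpoint/credentials are placeholders).
(cleaned.write
    .format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/analytics")
    .option("dbtable", "staging.claims")
    .option("user", "etl_user")
    .option("password", "********")
    .option("driver", "com.amazon.redshift.jdbc.Driver")
    .mode("append")
    .save())
```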
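A minimal sketch of the automated ETL validation pattern referenced above, assuming generic DB-API 2.0 connections; the connection factories and table mappings are hypothetical, and the production framework targeted Oracle, SQL Server, Hive, and MongoDB with database-specific drivers.

```python
# Cross-database row-count reconciliation (illustrative only).
# get_source_conn / get_target_conn are hypothetical factories returning
# DB-API 2.0 connections (e.g., via cx_Oracle, pyodbc, or PyHive).

def row_count(conn, table):
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")   # table names come from trusted config
    return cur.fetchone()[0]

def validate_load(source_conn, target_conn, mappings):
    """mappings: list of (source_table, target_table) pairs to compare."""
    failures = []
    for src, tgt in mappings:
        src_n, tgt_n = row_count(source_conn, src), row_count(target_conn, tgt)
        if src_n != tgt_n:
            failures.append((src, tgt, src_n, tgt_n))
    return failures

# Example usage (hypothetical):
# failures = validate_load(get_source_conn(), get_target_conn(),
#                          [("SRC.ORDERS", "DWH.ORDERS")])
# assert not failures, f"Row-count mismatches: {failures}"
```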
TECHNICAL SKILLS
Visualization Tools: Tableau, Power BI, Matplotlib, Seaborn, ggplot2 (R), QlikView
ETL Tools: Informatica PowerCenter 8.1.1/7.1.2, Matillion, SSIS, DataStage 7.5, ODI, Alteryx, Talend
Big Data Tools and Frameworks: Apache Spark, Hadoop, Azure Data Lake Storage, Azure Data Factory, Azure Databricks, Apache Kafka, Apache Airflow
Database Systems: MySQL, Microsoft SQL Server, PostgreSQL, MongoDB, Redis, DynamoDB
Cloud Technologies and Services: Amazon Redshift, Snowflake, BigQuery, EC2, Amazon S3, AWS Glue, Amazon Kinesis, AWS Lambda
Programming Languages: SQL, Python, HTML
EXPERIENCE
Fidelity Investments, Boston, Massachusetts (Remote) | Jan 2024 – May 2025
Senior Data Engineer / Database Administrator/Developer / Data Platform Specialist
• Designed and built a financial data lake using Azure Data Lake and Azure Synapse Analytics to store and analyze large-scale transactional data in real time, streamlining financial data processing workflows.
• Integrated CI/CD practices using GitHub and Jenkins for automated deployment and version control of data pipelines and cloud data solutions.
• Built and automated enterprise-scale data pipelines using Azure Data Factory, Databricks, and PySpark to process high-volume financial datasets from various sources into a centralized Azure Data Lake.
• Orchestrated ETL workflows using Apache Airflow, enabling reliable scheduling, retry logic, and dependency management (see the Airflow DAG sketch after this section).
• Integrated CI/CD practices using GitHub and Azure for automated deployment and version control of data solutions.
• Worked on designing and managing enterprise data pipelines and models using Azure Data Factory (ADF), Microsoft MDS, and Snowflake, with expertise in SQL, stored procedures, and data warehousing concepts, ensuring smooth integration across cloud platforms.
• Developed a comprehensive data validation framework to ensure data integrity across Azure SQL, Snowflake, and Azure Data Lake, incorporating stored procedures and views for transformation logic.
• Worked on the creation of cloud-based data solutions for investment infrastructure, modernizing the company's investment data systems by migrating on-premises systems to AWS to enhance data processing, scalability, and operational efficiency.
• Built a real-time fraud detection model using Python, Spark Streaming, and Databricks to process streaming data and generate alerts, enhancing risk monitoring.
• Simplified data integration from multiple sources, enabling real-time analysis and reporting of investment portfolios and other financial services.
• Created and implemented complex data solutions using AWS services such as EMR, S3, and Glue, ensuring efficient data processing and integration.
• Developed Kafka producers to stream data from external REST APIs into Kafka topics (see the Kafka producer sketch after this section).
• Utilized both relational (PostgreSQL, MySQL) and NoSQL (MongoDB, Redis) databases for high-throughput transactional and analytics workloads; designed and managed scalable, cloud-based data pipelines using AWS services such as Glue, EKS, Redshift, and Lambda, together with PySpark and Snowflake.
• Developed data processing workflows with Spark, and AWS EMR to manage large datasets and enhance performance in distributed computing environments.
• Created and deployed data pipelines with PySpark and Airflow to streamline data processing and automate workflows effectively.
• Leveraged S3, DynamoDB, and RDS for optimized data management and storage, ensuring the security and scalability of solutions.
• Created and implemented comprehensive data quality checks using custom SQL scripts, Python frameworks, and automation tools to maintain high data integrity and consistency.
• Created and managed data pipelines using AWS Glue and Apache Airflow to orchestrate ETL processes, ensuring data accuracy and timely availability for stakeholders.
• Utilized relational databases such as MySQL to efficiently store and manage structured data.
• Applied Infrastructure as Code (IaC) principles using tools like Terraform to automate the deployment and management of AWS resources, ensuring consistency and reliability in the cloud environment.
• Implemented data pipelines with Docker and Kubernetes, ensuring scalability, reliability, and fault tolerance within a distributed cloud infrastructure.
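Illustrative Airflow sketch for the orchestration bullet above, showing the scheduling, retry, and dependency patterns used; the DAG id, schedule, and extract/load callables are hypothetical placeholders rather than the actual production workflow.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**_):
    ...  # placeholder for the real extraction step

def load(**_):
    ...  # placeholder for the real load step

default_args = {
    "owner": "data-eng",
    "retries": 3,                          # retry logic for transient failures
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="finance_daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",         # reliable daily scheduling
    catchup=False,
    default_args=default_args,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task              # dependency management
```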
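A minimal sketch of the Kafka producer pattern referenced above, using the kafka-python and requests libraries; the REST endpoint, topic name, broker addresses, and polling cadence are hypothetical placeholders.

```python
import json
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],                      # placeholder brokers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

API_URL = "https://api.example.com/v1/transactions"          # hypothetical endpoint

while True:
    resp = requests.get(API_URL, timeout=30)
    resp.raise_for_status()
    for record in resp.json():
        producer.send("transactions.raw", value=record)      # stream into the topic
    producer.flush()
    time.sleep(60)                                           # simple polling loop
```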
United Health, Chicago, Illinois | Jan 2023 – Dec 2023
Data Engineer / BI Developer / Snowflake Specialist
• Designed and optimized financial data pipelines in a Large-Scale Data Transformation and Cloud Migration initiative, improving reporting efficiency and enabling real-time analytics by transitioning from on-premises SQL databases to a cloud-based infrastructure.
• Worked on design and implementation of cloud data architecture and dimensional data models in Snowflake and Databricks, optimizing performance and scalability of reporting systems.
• Developed data pipelines to support healthcare data product deployment with focus on compliance, data integrity, and operational monitoring using Azure Data Factory and Databricks.
• Designed production-ready pipelines and automated jobs using PySpark and Azure Data Lake, aligned with enterprise data governance frameworks.
• Developed and maintained Medicaid-related ETL workflows and operational reports using Oracle PL/SQL, meeting compliance requirements for CMS and internal stakeholders.
• Delivered recurring and ad-hoc Medicaid reports (daily, weekly, monthly) for financial and case management data using Power BI and BIRT, ensuring accuracy and timeliness for audits and DHHS reporting.
• Built advanced data pipelines and dashboards using Azure Synapse and Power BI to support modernization efforts and transition Medicaid data workloads into cloud architecture.
• Collaborated with agile teams and release management to validate reporting components, manage source control, and coordinate production deployments in support of Medicaid initiatives.
• Designed and managed enterprise-level data pipelines and models utilizing Azure Data Factory (ADF), Microsoft MDS, and Snowflake. Demonstrated strong expertise in SQL, stored procedures, and data warehousing principles to ensure seamless integration across cloud environments.
• Worked on the migration of on-premises SQL databases to a cloud-based infrastructure with Snowflake and Azure Data Factory, ensuring smooth data integration and optimized performance.
• Developed and enhanced ETL pipelines using Azure Data Factory and PySpark for incremental data ingestion, boosting processing efficiency and minimizing cloud expenses.
• Leveraged RESTful APIs to provide access to key functionalities, enabling external applications to retrieve portfolio and transaction data.
• Developed Snowflake-based data integration using Azure Data Factory, implementing scalable solutions aligned with modern cloud data warehousing best practices.
• Implemented delta ingestion, data quality frameworks, and table-level security with Databricks and Azure Data Factory, enhancing incremental data processing efficiency (see the Delta merge sketch after this section).
• Built Power BI dashboards with real-time data refresh from Azure Data Lake, improving data visualization and executive reporting, while working closely with cross-functional teams to address technical challenges and ensure compliance with data policies.
• Successfully migrated on-premises SQL databases to Snowflake using Azure Data Factory, while designing an optimized database architecture and creating a solution for efficient data transfer from Azure BLOB storage to SQL Server.
• Designed and managed ETL workflows with SSIS, ensuring efficient and high-quality data processing and transformation.
• Created and maintained enterprise data pipelines by implementing ETL processes with MS SQL Server, SSIS, and Azure Data Factory, ensuring smooth data extraction, transformation, and loading.
• Utilized Azure Synapse Analytics for data warehousing and advanced querying, enhancing financial data analysis, reporting capabilities, and delivering quicker insights.
• Developed a distributed data processing system with Hadoop and Spark, enabling the analysis of massive volumes of financial data to support advanced analytics and drive innovation in trading strategies.
• Created and deployed a cost-efficient data archiving solution with Azure Blob Storage, ensuring long-term data accessibility, regulatory compliance, and reduced storage expenses.
• Collaborated with business users to collect requirements, design visualizations, and deliver training for self-service BI tools, improving user experience and accessibility.
• Created migration strategies for transitioning traditional systems to Azure, using lift-and-shift methods and third-party tools to ensure smooth transitions.
• Enhanced existing data movement frameworks by incorporating automation and new features to improve reliability and performance.
• Prioritized automation to minimize manual intervention and streamline deployment processes across data pipelines.
• Recommended cost-efficient Azure data infrastructure architectures, offering suggestions to optimize spending and resource allocation.
• Applied Python in CI/CD pipelines to automate the deployment of database solutions and BI applications on Azure, ensuring continuous integration, testing, and smooth delivery of updates.
• Created and implemented advanced data processing pipelines using Apache Flink, enabling complex data transformations and analytics.
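A minimal sketch of the delta-ingestion pattern referenced above, assuming a Databricks environment with Delta Lake; the landing path, target table, and key column are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()   # provided automatically on Databricks

# Incremental batch landed by Azure Data Factory (path and schema are placeholders).
updates = spark.read.format("parquet").load("/mnt/landing/claims_increment/")

target = DeltaTable.forName(spark, "curated.claims")

# Upsert the increment: update changed rows, insert new ones.
(target.alias("t")
    .merge(updates.alias("s"), "t.claim_id = s.claim_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```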
Prisol Soft Technologies Pvt. Ltd., Hyderabad, India | June 2020 – July 2022
Sr. Data Analyst / Data Administrator / ETL Specialist
• Worked on the development of a Real-Time Inventory and Sales Analytics System using Azure Data Factory (ADF) to improve inventory management efficiency and offer valuable insights into sales performance.
• Designed, developed, and maintained reliable data pipelines using Apache Spark and Scala to efficiently process and analyze data, crucial for transforming and aggregating information from various sources such as point-of-sale (POS) systems and warehouse management systems (WMS).
• Created and managed ETL processes with Apache Spark and Hive to extract, transform, and load data into the Hadoop ecosystem, designed to process both batch and streaming data for near real-time inventory updates as sales occurred.
• Developed MapReduce (YARN) jobs for data cleaning, access, and validation, enabling efficient management of large datasets and ensuring data integrity.
• Utilized Google Cloud Platform (GCP) to deploy, manage, and monitor large-scale data processing systems, ensuring optimal availability and performance.
• Created and maintained ETL processes using Talend and Bash scripting, along with big data technologies like Hive, Impala, and Spark to build data ingestion pipelines that manage real-time updates from multiple sources.
• Utilized containerization technologies like Docker and Kubernetes to improve application deployment, scalability, and reliability.
• Improved data processing and analytics by using Node.js for server-side scripting and API integrations, in combination with Azure services.
• Transferred data from MS SQL Server and Teradata to HDFS using SQOOP.
• Used Azure Kubernetes Service (AKS) to deploy and manage containerized applications, enabling the efficient implementation of microservices architectures.
• Created CSV files and JSON scripts to deploy the pipeline in Azure Data Factory (ADF) for processing data.
• Delivered continuous support and troubleshooting for ETL processes, swiftly identifying and addressing issues to reduce downtime and ensure smooth data flow.
• Applied Agile methodology and Scrum to ensure the on-time delivery of high-quality results.
• Transferred existing self-service and ad-hoc reports to Power BI.
• Implemented complex business logic using dbt (data build tool) to transform raw data into curated data models for analytics consumption.
• Enabled efficient storage optimization by partitioning and bucketing datasets in Delta Lake, facilitating faster reads in downstream applications (see the Delta storage sketch after this section).
• Utilized Google BigQuery to perform ad-hoc analysis on massive datasets, improving turnaround time for sales and marketing insights.
• Created high-performance visual dashboards using Tableau, empowering stakeholders with interactive views of inventory turnover, sales trends, and regional demand patterns.
• Built scalable data ingestion connectors using Apache Beam, ensuring consistent data flow from RESTful APIs into the cloud data warehouse.
• Applied data masking and encryption techniques with Azure Purview and GCP DLP to ensure data privacy compliance across all ingestion layers.
• Engineered robust data validation frameworks using Apache NiFi, streamlining complex ingestion workflows and ensuring schema consistency across real-time pipelines.
• Leveraged Kafka to build fault-tolerant data streaming services that supported scalable real-time analytics and synchronized transactional data across systems.
• Designed and maintained multi-node HBase clusters to support low-latency read/write access for time-series inventory tracking applications.
• Implemented scalable metadata management using Apache Atlas, improving data governance and traceability across the analytics ecosystem.
• Applied advanced performance tuning techniques in Presto to reduce query execution time across large-scale analytical datasets.
• Automated monitoring and alerting for ETL pipelines using Prometheus and Grafana, increasing transparency into data flow health and reducing incident response times.
• Built event-driven microservices with Flume to capture and transport high-velocity log data from distributed sources into a centralized data lake.
• Developed a metadata-driven orchestration layer using Airflow, allowing flexible dependency management and scheduling across diverse batch and streaming pipelines (see the orchestration sketch after this section).
• Integrated Cloud Functions in serverless architectures to trigger data processing logic dynamically in response to data lake updates and ingestion events.
• Focused on data migration using SQL, Azure Storage, and Azure Data Factory. Developed data integration and technical solutions for Azure Data Lake Analytics, Azure Data Lake Storage, Azure SQL databases, and Azure SQL Data Warehouse to support Synapse Analytics and reporting, improving marketing strategies.
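Illustrative sketch for the Delta Lake storage-optimization bullet above: a partitioned Delta write, with clustering on frequently filtered columns shown as an optional OPTIMIZE/ZORDER step on Databricks; paths and column names are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical curated sales DataFrame (the staging path is a placeholder).
sales_df = spark.read.format("parquet").load("/mnt/staging/sales/")

# Write as a partitioned Delta table so downstream readers can prune
# partitions when filtering on date and region.
(sales_df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("sale_date", "region")
    .save("/mnt/delta/sales"))

# Optional clustering on another frequently filtered column (Databricks):
# spark.sql("OPTIMIZE delta.`/mnt/delta/sales` ZORDER BY (store_id)")
```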
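A minimal sketch of the metadata-driven orchestration idea noted above: Airflow tasks generated from a configuration dictionary so dependencies can change without rewriting DAG code. The pipeline metadata and the run_pipeline callable are hypothetical.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical pipeline metadata; in practice this would come from config files.
PIPELINES = {
    "pos_sales": {"depends_on": []},
    "wms_stock": {"depends_on": []},
    "inventory_kpis": {"depends_on": ["pos_sales", "wms_stock"]},
}

def run_pipeline(name, **_):
    print(f"running {name}")          # placeholder for the real ingestion logic

with DAG(
    dag_id="metadata_driven_orchestration",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    tasks = {
        name: PythonOperator(
            task_id=name,
            python_callable=run_pipeline,
            op_kwargs={"name": name},
        )
        for name in PIPELINES
    }
    # Wire task dependencies from the metadata.
    for name, meta in PIPELINES.items():
        for upstream in meta["depends_on"]:
            tasks[upstream] >> tasks[name]
```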
SBI Life Insurance Co. Ltd., Hyderabad, India | June 2017 – May 2020
Data Analyst / SQL Server
• Collaborated on a Customer Retention and Policy Analytics initiative, along with other strategic initiatives, to improve data-driven decision-making and streamline operational processes.
• Analyzed customer behavior and policy renewals and identified trends in policy cancellations, using SQL and Python for data analysis and pattern recognition to pinpoint customers at high risk of churn, while proposing strategies to enhance retention rates.
• Maintained consistent data quality by conducting regular audits, resulting in a 20% decrease in reporting errors.
• Utilized Excel functions for data sorting, analysis, and generating statistical reports.
• Leveraged SQL and Python to analyze extensive organizational data, uncovering potential cost-saving opportunities.
• Developed several Directed Acyclic Graphs (DAGs) to automate ETL pipelines, optimizing data workflows and improving the efficiency of data processing tasks.
• Utilized Informatica Repository Manager to manage all repositories (development, test, and validation) and participated in migrating folders between repositories.
• Created a predictive model using machine learning techniques, leading to a 35% improvement in sales forecast accuracy.
• Worked on cross-departmental effort to implement a new data warehousing architecture, resulting in a 25% reduction in data retrieval times.
• Developed new business intelligence reports in Tableau, enabling data-driven decision-making.
• Proficient in using tools like Microsoft Excel and Access to create dashboards and reports.
• Effectively used data validation tools and SQL queries to thoroughly investigate data discrepancies, providing recommendations for process improvements that boost overall data quality.
• Conducted root cause analysis on data latency issues using Databricks, resulting in a 30% improvement in pipeline reliability.
• Optimized query performance for large datasets using Spark SQL, reducing processing time by over 40% for key analytical workloads (see the Spark tuning sketch below).
• Built interactive dashboards and visual analytics using Power BI, providing leadership with real-time insights into operational KPIs.
• Streamlined data ingestion processes from third-party APIs using Python and RESTful services, ensuring consistent and scalable integration with internal systems.
• Designed custom data models and fact/dimension tables using Star Schema principles, enhancing reporting capabilities and data discoverability.
• Applied clustering and classification techniques using Scikit-learn to segment customers based on lifetime value and behavior patterns (see the segmentation sketch below).
• Maintained CI/CD workflows for data pipeline deployments using Git and Azure DevOps, ensuring stable and traceable releases.
• Documented data lineage and transformation logic with Collibra, improving transparency and reducing onboarding time for new analysts.
• Integrated and transformed semi-structured data from JSON sources to structured formats, enabling downstream analytics in the enterprise data lake.
• Developed materialized views and custom functions in PostgreSQL to support high-throughput reporting needs in a government healthcare setting.
• Participated in sprint planning and backlog grooming within an Agile development framework, consistently delivering sprint goals on time.
• Validated complex business rules in source-to-target mappings using BIRT reports, aligning data output with Medicaid compliance requirements.
• Created data reconciliation scripts with Shell scripting to ensure completeness and accuracy across staging and production environments.
• Created data masking solutions using Informatica Data Masking to protect sensitive customer and healthcare information during testing and development.
• Performed data extraction from legacy systems using SQL Loader, facilitating successful migration to modern data platforms.
• Conducted regression testing and performance tuning in Teradata, optimizing queries across massive transactional datasets.
• Developed reusable ETL components using PL/SQL Packages to standardize business logic across modules and promote code reusability.
• Supported business units with ad-hoc reporting and data requests by developing quick-turnaround solutions using Jupyter Notebooks.
• Contributed to the analysis, specification, design, implementation, and testing stages of the Software Development Life Cycle (SDLC).
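Illustrative sketch of one Spark SQL tuning pattern referenced above: broadcasting a small dimension table so the join avoids shuffling the large fact table. Table paths and column names are hypothetical, and the quoted gains depend on actual data volumes.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

policies = spark.read.parquet("/data/policies/")    # large fact table (placeholder)
branches = spark.read.parquet("/data/branches/")    # small dimension (placeholder)

# Broadcast the small dimension to avoid a shuffle of the large side.
joined = policies.join(broadcast(branches), "branch_id")

joined.createOrReplaceTempView("policies_enriched")
spark.sql("""
    SELECT branch_region, COUNT(*) AS active_policies
    FROM policies_enriched
    WHERE status = 'ACTIVE'
    GROUP BY branch_region
""").show()
```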
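A minimal sketch of the scikit-learn segmentation approach mentioned above, assuming a hypothetical customer feature file and column names; the actual work involved broader feature engineering and model validation.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer features (lifetime value, tenure, claim frequency).
customers = pd.read_csv("customer_features.csv")
features = customers[["lifetime_value", "tenure_months", "claims_per_year"]]

# Scale the features, then cluster customers into behavioral segments.
scaled = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(scaled)

print(customers.groupby("segment")["lifetime_value"].describe())
```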