Manohar Polapally
Data Engineer
******************@*****.*** +1-989-***-****
www.linkedin.com/in/-manohar
Professional Summary
Data Engineer with 5+ years of experience building batch and real-time data pipelines, data warehouses, and streaming platforms across healthcare, consulting, and HR technology domains. Worked with Python, PySpark, and SQL to build ETL pipelines and process large-scale datasets on AWS and Azure. Built streaming solutions using Kafka and Spark Streaming, set up Airflow workflows, and designed dimensional models in Snowflake and Redshift. Familiar with data quality validation, data lake architecture, and dbt for transformation. Collaborated with analytics and business teams to define data requirements and deliver clean, reliable datasets.
Technical Skills
Programming Languages: Python, SQL, PySpark, Scala, Bash, T-SQL
Big Data: Apache Spark, PySpark, Hadoop, Hive, Apache Kafka, Kafka Streams
Orchestration: Apache Airflow, AWS Glue, Azure Data Factory, AWS Step Functions
Cloud - AWS: S3, Glue, Redshift, EMR, Athena, Lambda, Kinesis, CloudWatch, IAM
Cloud - Azure: Data Factory, Databricks, Synapse Analytics, Blob Storage, Event Hubs
Data Warehousing: Snowflake, Amazon Redshift, Azure Synapse, Teradata
Databases: PostgreSQL, MySQL, SQL Server, MongoDB, DynamoDB, Cosmos DB
Data Modeling: Star Schema, Snowflake Schema, Dimensional Modeling, ER Modeling, dbt
Streaming: Apache Kafka, Kinesis, Spark Streaming, Event Hubs, Change Data Capture
Data Quality: Great Expectations, Pandas Profiling, Custom Validation, Data Lineage
DevOps: Git, GitHub, Docker, Kubernetes, Jenkins, Terraform, CI/CD Pipelines
Visualization: Tableau, Power BI, QuickSight
ETL Tools: Informatica, Talend, AWS Glue, dbt, Custom Python Pipelines
Methodologies: Agile/Scrum, Data Governance, Data Lakehouse, HIPAA, PCI-DSS
Professional Experience
CVS Health July 2024 – Present
Data Engineer
• Built data pipelines in Python and PySpark to process pharmacy claims, member enrollment, and prescription transaction data from multiple source systems into the enterprise data lake on AWS S3.
• Developed real-time streaming pipelines with Kafka and Spark Streaming to ingest pharmacy dispensing events and claim submissions for operational monitoring dashboards.
• Designed a Snowflake data warehouse with star-schema dimensional models for claims reporting, pharmacy performance tracking, and member utilization analytics.
• Wrote AWS Glue ETL jobs that pull data from S3, apply transformations and validation checks, and load into Redshift and Snowflake.
• Set up data quality checks with Great Expectations on incoming healthcare data, catching schema mismatches, null values, and referential integrity issues before loading.
• Built Airflow DAGs to orchestrate multi-step healthcare data pipelines with dependency management, retry logic, and failure alerting.
• Implemented incremental loading with change data capture for patient demographics and provider network data, handling slowly changing dimensions.
• Created data lineage documentation tracking data from source systems through transformations to final reporting tables for compliance and auditing.
• Integrated healthcare data in HL7 and FHIR formats into the pipeline, processing clinical records for downstream analytics teams.
• Configured encryption at rest and in transit, access controls, and audit logging across AWS services for HIPAA-compliant data handling.
• Optimized Spark jobs with partition tuning and broadcast joins to improve processing times on large pharmacy transaction datasets.
• Built reusable Python utility libraries for data cleaning, schema validation, and logging, shared across multiple pipeline teams.
• Worked with data analysts, platform engineers, and business stakeholders to define data requirements and deliver datasets for reporting.
Environment: Python, PySpark, Kafka, Spark Streaming, AWS (S3, Glue, Redshift, Kinesis), Snowflake, Airflow, Great Expectations, HL7, FHIR, HIPAA.
JPMorgan Chase Dec 2020 – Aug 2023
Data Engineer
• Developed data pipelines in Python and Spark to process banking transaction data, account updates, and customer information from core banking systems into the enterprise analytics platform.
• Built Azure Data Factory pipelines orchestrating data ingestion from multiple banking source systems into Azure Data Lake Storage, implementing reusable, parameterized pipeline templates.
• Designed dimensional models in Azure Synapse Analytics using star schemas, supporting financial reporting dashboards, risk analytics, and regulatory compliance reporting.
• Implemented real-time transaction monitoring pipelines with Azure Event Hubs and Spark Streaming to process high-volume payment transactions and flag anomalies for fraud detection.
• Developed dbt models with incremental transformation logic, automated testing, and documentation for the analytics layer.
• Built data validation and reconciliation processes in Python to compare source banking system records with data warehouse tables, ensuring accuracy for financial reporting.
• Implemented PCI-DSS-compliant data handling, configuring encryption at rest and in transit, tokenization of card data, and role-based access controls across Azure services.
• Created Power BI dashboards tracking pipeline health metrics including data freshness, processing latency, row counts, and error rates for operations teams.
• Optimized Spark jobs with partition pruning, broadcast joins, and resource tuning, improving execution times on large-scale banking datasets.
• Developed API integrations in Python connecting data pipelines with downstream risk management and compliance systems for real-time data access.
• Participated in code reviews, contributed to team documentation, and collaborated with analytics engineers on data contract definitions.
Environment: Python, Spark, Azure (Data Factory, Synapse, Event Hubs, Data Lake), dbt, Power BI, PCI-DSS, Banking.
Best Buy Jan 2019 – Dec 2020
Data Engineer
• Built batch and streaming data pipelines in Python and SQL to process e-commerce transactions, product catalog updates, inventory events, and customer activity logs into the analytics platform.
• Developed ETL processes with AWS Glue and Python to extract data from order management, inventory management, and website event tracking systems into S3 and Redshift.
• Designed a Redshift data warehouse with star-schema dimensional models supporting retail analytics for sales performance, inventory turnover, customer segmentation, and regional reporting.
• Implemented streaming pipelines with AWS Kinesis to process real-time website clickstream data and order submission events, powering live customer behavior dashboards.
• Built a data lake architecture on AWS S3 with raw, processed, and curated layers, partitioned by date, region, and product category.
• Developed data validation scripts in Python to ensure completeness and accuracy of retail data before loading into downstream warehouses and reporting systems.
• Created Airflow workflows orchestrating daily and hourly data jobs, managing pipeline dependencies, retry logic, and failure alerting for critical retail data flows.
• Optimized Redshift table designs with distribution keys, sort keys, and materialized views, improving query response times for analyst dashboards.
• Built an Athena ad-hoc query layer on top of the S3 data lake, enabling analytics teams to query raw and processed data without provisioning infrastructure.
• Developed automated schema evolution handling for product catalog data, accommodating changes in upstream systems without breaking downstream pipelines.
• Created Tableau dashboards connecting to Redshift displaying sales KPIs, inventory metrics, and seasonal trend analysis for merchandising and marketing teams.
• Collaborated with marketing analytics, merchandising, and platform engineering teams defining data requirements and delivering reliable datasets for business reporting.
Environment: Python, AWS (S3, Glue, Redshift, Kinesis, Athena), Airflow, Tableau, SQL, E-commerce, Retail Analytics.
Education
Master of Science in Information Systems
Central Michigan University, MI