Pranitha K
Sr. Data Engineer
************@*****.*** +1-737-***-****
PROFESSIONAL SUMMARY
Over 8 years of experience as a Data Engineer designing, developing, and managing large-scale batch and real-time data pipelines in cloud (AWS, Azure, GCP) and on-prem environments.
Proficient in Apache Spark (Scala/PySpark), Snowflake, Databricks, Hadoop ecosystem (Hive, HDFS, Pig), Kafka, and advanced SQL for high-volume data processing.
Strong background in data modeling (Star/Snowflake schemas, SCD Types 1 & 2), ETL/ELT development, and orchestration with Airflow, Azure Data Factory, and Cloud Composer.
Hands-on experience with BigQuery, Redshift, Azure Synapse, NoSQL databases (Cassandra, MongoDB, DynamoDB), and data lakehouse architectures.
Skilled in data governance, observability, and quality frameworks; adept at optimizing pipeline performance, cost, and reliability.
Delivered actionable insights using Tableau, Power BI, and BI frameworks; collaborated in Agile teams with CI/CD (Git, Jenkins, Terraform, Docker, Kubernetes) to drive measurable business outcomes.
TECHNICAL SKILLS
Programming & Scripting: Scala, Python, Java, SQL, R, Shell Scripting
Big Data & Processing Frameworks: Apache Spark (Core, SQL, Streaming), Hadoop, MapReduce, YARN, Pig, Hive, HDFS, Impala, Apache Kafka
Cloud Platforms & Services: GCP (Dataproc, BigQuery, Cloud Storage, Cloud Composer, JupyterLab), AWS (EC2, S3, EMR, Redshift, DynamoDB, Aurora), Azure (Data Factory, Data Lake, Databricks, Synapse, Azure SQL DB)
ETL Tools: Informatica PowerCenter, Informatica Cloud (IICS), Sqoop
Data Warehousing & Databases: Snowflake, Hive, SQL Server, Oracle, PostgreSQL, MongoDB, Cassandra, HBase
Orchestration & Workflow Tools: Apache Airflow, Cloud Composer, Automic, Concord, Oozie
Data Visualization & Reporting: Tableau, Power BI, SSAS, Excel
CI/CD & Infrastructure: GitHub, Jenkins, Maven, IntelliJ, Git, Docker, Kubernetes, Terraform
Monitoring & Scheduling Tools: YARN Resource Manager, Informatica Workflow Monitor, ServiceNow
Modeling & Architecture: Star Schema, Snowflake Schema, Slowly Changing Dimensions (SCDs), OLTP, OLAP, Data Lakehouse (Delta Lake)
Data Governance & Quality: DPAT, GDP Portal, Data Observability
Additional Tools & Technologies: Teamcenter (PLM), Salesforce, XML, Flat Files, WSDL, Druid
CERTIFICATIONS
Microsoft Certified: Fabric Data Engineer Associate - July 2025
PROFESSIONAL EXPERIENCE
Client: Walmart, Bentonville, AR Jun 2023 – Present Sr. Data Engineer
Responsibilities:
Designed and implemented scalable ETL pipelines using Scala and Apache Spark on GCP to support the Know Your Store Insights initiative, enabling near real-time analysis of store-level sales, inventory turnover, and operational KPIs across thousands of retail locations.
Orchestrated end-to-end data workflows with Apache Airflow DAGs on Cloud Composer, automating Spark processing on Dataproc for daily and weekly refreshes of merchandising, pricing, and supplier performance datasets.
Leveraged GCP services including Cloud Storage, Dataproc, BigQuery, and IAM to securely process high-volume retail data, applying granular access controls to protect sensitive supplier and transactional information while optimizing storage costs.
Developed Spark-based transformation logic to integrate foundational GCP tables with complex retail business rules, producing curated Hive datasets in Parquet format for downstream analytics on category performance, seasonal trends, and supply chain bottlenecks.
Designed partitioned Hive tables (by date and week) to efficiently manage billions of daily POS and inventory records, while using Spark SQL aggregations to load non-partitioned summary tables consumed by BI teams and vendor scorecard dashboards (an illustrative sketch of this partitioned-write pattern follows this role's Environment line).
Automated on-demand Dataproc cluster provisioning via Automic, tuning executor/memory configurations to handle peak seasonal loads such as Black Friday and holiday sales, reducing batch runtimes by up to 40% using insights from YARN Resource Manager.
Partnered with Data Science teams to deliver model-ready datasets for demand forecasting, shrinkage prediction, and product assortment optimization; conducted in-depth exploratory analysis in JupyterLab to validate data integrity and investigate anomalies.
Performed load and stress testing using Automaton to ensure UI responsiveness under concurrent supplier queries, maintaining sub-second response times for high-priority dashboards.
Designed and optimized Druid ingestion specs to deliver low-latency rollups of Hive data for supplier-facing portals, enabling real-time drill-down into SKU-level performance, sales by region, and vendor compliance metrics.
Managed Druid cluster scaling and tuning to handle multi-billion-row datasets with concurrent queries from merchandising, replenishment, and vendor management teams.
Collaborated with API teams to integrate Druid as a performance layer for UI applications, ensuring instant access to daily updated KPIs for thousands of suppliers and category managers.
Enforced data quality with DPAT checks and GDP Portal DQ rules, ensuring schema adherence for critical datasets; implemented automated environment sync from production to development for consistent testing.
Delivered CI/CD pipelines via Automic, Concord, and GitHub, supporting rapid deployment of new data workflows; managed semi-structured product master and supplier reference data in MongoDB and Cosmos DB.
Documented workflows, transformations, and governance processes in Confluence; actively participated in Agile ceremonies, sprint planning, and code reviews in Jira to ensure timely, quality deliverables.
Environment: GCP (Cloud Storage, Dataproc, BigQuery, Cloud Composer, IAM), Apache Spark (Scala/PySpark, Spark SQL, Streaming), Hive (Partitioned/Non-Partitioned), Druid, MongoDB, Cosmos DB, YARN Resource Manager, DPAT, GDP Portal, Automic, Concord.
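Illustrative sketch (not production code) of the date/week-partitioned curation pattern described above; the bucket path, database, table, and column names are hypothetical placeholders, not actual Walmart objects.

# Minimal PySpark sketch: aggregate POS data and write a date/week-partitioned Hive table in Parquet.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("store-insights-curation")
         .enableHiveSupport()
         .getOrCreate())

# Read foundational POS data landed in Cloud Storage (hypothetical path).
pos = spark.read.parquet("gs://example-bucket/foundation/pos_transactions/")

curated = (pos
           .withColumn("sales_date", F.to_date("transaction_ts"))
           .withColumn("wm_week", F.weekofyear("sales_date"))
           .groupBy("store_id", "item_id", "sales_date", "wm_week")
           .agg(F.sum("sale_amount").alias("net_sales"),
                F.sum("quantity").alias("units_sold")))

# Partition by date and week so downstream queries prune to the relevant slices.
(curated.write
        .mode("overwrite")
        .format("parquet")
        .partitionBy("sales_date", "wm_week")
        .saveAsTable("store_insights.daily_store_item_sales"))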
Client: Sephora, CA May 2022 – Jun 2023 Sr. Data Engineer
Responsibilities:
Designed and implemented scalable batch and real-time data pipelines using Apache Spark (Scala/PySpark), Kafka, Hive, HBase, and Azure Data Factory to deliver highly personalized marketing campaigns, including targeted email, SMS, push notifications, and in-app offers to millions of Beauty Insider members.
Automated ingestion of web logs, loyalty program interactions, and omni-channel engagement data into Azure Data Lake and Hadoop clusters via Airflow DAGs and ADF pipelines, ensuring high availability and accuracy for time-sensitive promotions.
Integrated with marketing platforms Epsilon (SFTP flat file drops, API GET requests) and Braze (API POST/GET calls) to trigger real-time and scheduled campaigns such as birthday rewards, back-in-stock alerts, price-drop notifications, and post-purchase follow-ups.
Managed raw and curated data zones in Azure Blob Storage, applying secure access controls and retention policies to meet data privacy standards while enabling seamless consumption by Azure Data Factory and Databricks pipelines.
Built Databricks pipelines to preprocess customer and behavioral event data into JSON payloads mapped to marketing templates, streaming them via Kafka topics for sub-second delivery of high-priority (P0) campaigns like limited-time offers (a sketch of this payload-to-Kafka pattern follows this role's Environment line).
Developed optimized Spark applications (Spark SQL, RDDs, UDFs) for aggregating customer engagement metrics, segmenting audiences, and preparing campaign-ready datasets, reducing job runtimes and cloud costs through executor/memory tuning.
Architected dimensional models (Star/Snowflake schemas) in Snowflake and Hive to power Power BI and Tableau dashboards, providing marketing teams with real-time insights on campaign impressions, click-through rates, and ROI by customer segment.
Implemented PII masking and encryption in Azure Synapse to safeguard sensitive customer data, documenting data handling procedures for recurring compliance audits under privacy regulations.
Configured Azure Monitor and Log Analytics for proactive monitoring of ADF pipelines, Databricks jobs, and streaming workloads, ensuring timely alerts for SLA adherence.
Developed reusable ADF components (linked services, datasets, templates) to standardize ingestion patterns from RDBMS, APIs, and local systems into Azure Data Lake, improving maintainability and reducing onboarding time for new data sources.
Orchestrated incremental and full data loads from RDBMS into Hive using Apache Oozie, meeting strict SLAs to ensure campaign data was ready for downstream execution systems.
Maintained CI/CD pipelines via Azure DevOps and Jenkins, deploying reusable ADF templates, and collaborating in Agile sprints via Jira for backlog grooming, story refinement, code reviews, and production support.
Environment: Hadoop, Apache Spark (Scala/PySpark, Spark SQL, Streaming), Hive, HBase, Kafka, Pig, Azure Data Factory, Azure Data Lake, Azure Synapse, Azure Databricks, Azure SQL DB, Azure Blob Storage, Epsilon, Braze, Airflow, Oozie, Power BI, Tableau, Snowflake, Sqoop, HDInsight, PowerShell, Bash, Jenkins, Azure DevOps, Jira.
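Illustrative sketch, under assumed names, of the payload-to-Kafka pattern described above; the Delta path, topic, broker address, and field names are hypothetical.

# Minimal PySpark sketch: build JSON campaign payloads and publish them to Kafka.
# Assumes the spark-sql-kafka connector is available on the cluster.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("p0-campaign-payloads").getOrCreate()

events = spark.read.format("delta").load("/mnt/curated/beauty_insider_events")

payloads = (events
            .filter(F.col("campaign_priority") == "P0")
            .select(
                F.col("member_id").cast("string").alias("key"),
                F.to_json(F.struct("member_id", "template_id",
                                   "offer_code", "expires_at")).alias("value")))

# Publish payloads to the topic consumed by the downstream campaign delivery layer.
(payloads.write
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("topic", "p0-campaign-payloads")
         .save())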
Avanade, Hyderabad, India June 2018 – Dec 2021 Data Engineer
Client: AIG (American International Group) Jan 2020 – Dec 2021
Responsibilities:
Migrated on-premises SQL Server and SSIS ETL workloads to Azure Data Factory and Azure Synapse, cutting infrastructure costs by 40% and improving policy and claims data refresh from daily to hourly, enabling near real-time insights for underwriting and claims teams.
Designed and built Azure Databricks PySpark workflows to process and unify policy, claims, and customer engagement data from multiple legacy and third-party systems into an Azure Data Lake (Delta/Parquet), supporting enterprise-scale analytics and regulatory reporting.
Engineered incremental ingestion pipelines in ADF Mapping Data Flows using watermarking and Change Data Capture (CDC) logic for low-latency updates across high-volume transactional datasets (an illustrative watermark/upsert sketch follows this role's Environment line).
Developed Power BI models and interactive dashboards to deliver claims analytics, risk scoring, and branch performance KPIs, accelerating operational decision-making across underwriting, fraud detection, and customer retention initiatives.
Implemented role-based access control (RBAC), PII masking, and encryption in Azure Synapse to meet IRDAI compliance and internal AIG data protection standards.
Configured Azure Key Vault for secure storage of credentials, keys, and connection strings, integrating with ADF and Databricks to enforce secure access controls.
Utilized Azure Event Hubs for near real-time ingestion of application event streams into Databricks Structured Streaming, enabling instant policy change notifications and claims status updates.
Managed resource provisioning for ADF, Databricks, and Blob Storage using ARM templates, ensuring version-controlled infrastructure-as-code deployments.
Designed batch and incremental ingestion workflows to move data from Azure Blob Storage into downstream analytical systems via ADF, ensuring timely availability for BI and actuarial teams.
Leveraged Azure Purview for metadata management, lineage tracking, and compliance audits across Blob Storage, Data Lake, and Synapse assets, improving discoverability and governance.
Developed automated validation scripts in Databricks to verify data completeness, transformation accuracy, and referential integrity; integrated Azure Monitor alerts to proactively detect and address pipeline failures.
Automated ADF pipeline and Databricks notebook deployments via Azure DevOps CI/CD, incorporating unit and integration testing to reduce release cycles by 50%.
Environment: Azure Data Factory, Azure Data Lake, Delta Lake, Azure Databricks, Azure Synapse, Azure SQL DB, Power BI, PySpark, Spark SQL, CDC, Parquet, Azure Monitor, Azure Purview, Azure DevOps, Git.
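Illustrative sketch of the watermark-based incremental upsert pattern referenced above, expressed in PySpark with Delta Lake for readability (the production pipelines used ADF Mapping Data Flows); table, path, and key names are assumptions.

# Minimal PySpark/Delta sketch: pull rows changed since the last watermark and merge them.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims-incremental-load").getOrCreate()

# Last successfully processed watermark, persisted in a hypothetical control table.
last_watermark = (spark.table("control.load_watermarks")
                  .filter("table_name = 'claims'")
                  .agg(F.max("watermark_ts"))
                  .collect()[0][0])

# Read only rows modified since the previous run from the landed source extract.
changes = (spark.read.parquet("/mnt/datalake/raw/claims/")
           .filter(F.col("modified_ts") > F.lit(last_watermark)))

# Upsert changes into the curated Delta table on the business key.
target = DeltaTable.forName(spark, "curated.claims")
(target.alias("t")
       .merge(changes.alias("s"), "t.claim_id = s.claim_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())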
Client: Humana (U.S. health insurer) June 2018 – Dec 2019
Responsibilities:
Designed and implemented a secure AWS S3 data lake with bronze, silver, and gold layers using Parquet, partitioning, compression, and lifecycle policies, reducing storage and scan costs by 30% while meeting HIPAA security standards through IAM roles, KMS encryption, Lake Formation, and fine-grained S3 bucket policies.
Built ingestion pipelines for healthcare datasets including EDI X12 837/835 claims, eligibility and enrollment data, provider records, and HL7/FHIR events, leveraging Amazon Kinesis for streaming, AWS DMS for change data capture from Oracle/RDS, and AWS Glue/SFTP for scheduled batch loads, alongside Kafka for clickstream and IoT data ingestion.
Migrated legacy Teradata and Hadoop ETL workloads to EMR Spark and AWS Glue, implementing cleansing, PHI de-identification, and feature engineering processes that reduced nightly batch runtimes by 35%.
Developed analytics marts in Amazon Redshift covering claims, eligibility, and Member 360 views, and optimized ad-hoc querying through Athena, improving performance for actuarial and care management teams by 2–3x.
Orchestrated and monitored workflows using Apache Airflow on EC2 in combination with AWS Step Functions, integrating SQS and SNS for real-time messaging and alerts to maintain 99.9% SLA compliance.
Delivered low-latency eligibility snapshots and care-gap indicators via API Gateway, AWS Lambda, and DynamoDB caching, providing millisecond-level responses for care coordination applications.
Implemented AWS Glue Data Catalog for centralized schema management and metadata discovery across Redshift and Athena, ensuring consistent data definitions and enabling cross-service integration.
Automated AWS infrastructure provisioning with Terraform and monitored system health with Amazon CloudWatch, tracking job runtimes, API latencies, and pipeline metrics to proactively resolve performance issues.
Established robust data quality and operational monitoring using AWS Deequ for freshness, completeness, and referential integrity checks, supplemented by reconciliation reports and visual dashboards for stakeholders (a sketch of such checks follows this role's Environment line).
Environment: AWS S3, Amazon Redshift, Amazon Athena, Amazon EMR (Spark), AWS Glue, AWS DMS, Amazon Kinesis (Data Streams & Firehose), Kafka, AWS Step Functions, Apache Airflow, Terraform, AWS IAM, AWS KMS, AWS Lake Formation, DynamoDB, API Gateway, AWS Lambda, AWS SQS, AWS SNS, Amazon CloudWatch, AWS Deequ, HL7, FHIR, EDI X12 837/835.
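Illustrative sketch of the kind of completeness and integrity checks described above, written against PyDeequ (the Python wrapper for Deequ) as an assumption; the S3 path, column names, and Spark version are hypothetical.

# Minimal PyDeequ sketch: declare constraints on a claims dataset and report pass/fail results.
import os
os.environ["SPARK_VERSION"] = "3.3"  # PyDeequ reads this to select a compatible Deequ jar (assumed version)

from pyspark.sql import SparkSession
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = SparkSession.builder.appName("claims-dq-checks").getOrCreate()
claims = spark.read.parquet("s3://example-bucket/silver/claims_837/")

check = (Check(spark, CheckLevel.Error, "claims completeness and integrity")
         .isComplete("claim_id")
         .isUnique("claim_id")
         .isComplete("member_id")
         .isNonNegative("billed_amount"))

result = (VerificationSuite(spark)
          .onData(claims)
          .addCheck(check)
          .run())

# Surface per-constraint status for reconciliation reports and dashboards.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)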
Client: Hansa Solutions, Hyderabad, India June 2017 – May 2018 Data Engineer
Responsibilities:
Designed and implemented end-to-end ETL in Informatica PowerCenter and IICS to integrate policy administration, claims, underwriting, billing, and reinsurance data from multiple core insurance systems into the enterprise data warehouse.
Built reusable mappings/sessions/transformations to extract from Oracle, SQL Server, and flat files, applied insurance business rules and cleansing, and loaded curated layers into Teradata and Snowflake for downstream operational and regulatory analytics.
Automated daily and near real-time ingestion for first notice-of-loss (FNOL), claim status updates, policy endorsements, and broker/TPA feeds using IICS/PowerCenter and Azure Data Factory, landing in Azure Blob Storage and Azure Data Lake to meet strict SLA targets.
Parameterized Workflow Manager jobs for dynamic Dev/QA/Prod configurations and scheduled with Control-M; monitored and triaged with Informatica Workflow Monitor, using pushdown optimization and partitioning to improve throughput and cut deployment time by nearly 30%.
Implemented insurance-specific data quality checks in Informatica Data Quality (IDQ) (policy/claim keys, effective/expiry timestamps, premium/reserve validations, contact fields) before publishing to BI layers.
Engineered subject-area marts and semantic views for Tableau and Power BI dashboards (loss ratios, combined ratio, claim cycle time, renewal conversion, broker performance) in partnership with BI and actuarial teams.
Created Python and Shell utilities for file handling, audit/recon reports, and SLA alerts; versioned code in Git and collaborated via Jira within an Agile delivery model (a sketch of the reconciliation utility pattern follows this role's Environment line).
Environment: Informatica PowerCenter, IICS, Azure Data Factory, Azure Data Lake, SQL Server, Oracle, Power BI, Python, Shell Scripting, Control-M, Git, Jira, Agile.
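Illustrative sketch of the audit/reconciliation utility pattern described above; table names, SMTP host, and addresses are hypothetical, and any DB-API 2.0 connections (e.g., Oracle and SQL Server drivers) can be passed in.

# Minimal Python sketch: compare source vs. target row counts and raise an SLA alert on drift.
import smtplib
from email.message import EmailMessage

def row_count(conn, table):
    # Count rows in a single table via a DB-API cursor.
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    return cur.fetchone()[0]

def reconcile(source_conn, target_conn, table_pairs, tolerance=0):
    # Compare source vs. target row counts for each mapped table pair.
    mismatches = []
    for src_table, tgt_table in table_pairs:
        src, tgt = row_count(source_conn, src_table), row_count(target_conn, tgt_table)
        if abs(src - tgt) > tolerance:
            mismatches.append((src_table, tgt_table, src, tgt))
    return mismatches

def send_sla_alert(mismatches, smtp_host="smtp.example.com"):
    # Email a simple audit/recon summary when counts drift beyond tolerance.
    msg = EmailMessage()
    msg["Subject"] = "ETL reconciliation mismatch"
    msg["From"] = "etl-audit@example.com"
    msg["To"] = "data-ops@example.com"
    msg.set_content("\n".join(f"{s} -> {t}: source={a}, target={b}"
                              for s, t, a, b in mismatches))
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)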
EDUCATION
Master of Science, Data Science Arlington, TX
University of Texas at Arlington Jan 2022 – May 2023
Bachelor of Technology, Computer Science and Engineering Hyderabad, Telangana, India
Sreenidhi Institute of Science and Technology June 2014 – May 2018