PRAVALIKA BYRI
Role: Data Engineer
Experience: *+ Years
Mail : *************@*****.***
LinkedIn: www.linkedin.com/in/pravalika-b-174691246
PROFESSIONAL SUMMARY
Data Engineer with 5+ years of hands-on experience architecting and developing data ingestion and transformation pipelines using Apache Spark, Python, Oracle, and cloud-based tools across multiple industries.
Skilled in Oracle RDBMS for managing, storing, and querying large-scale structured data efficiently.
Expert in building and optimizing ETL/ELT pipelines and managing large-scale batch and streaming data processing using PySpark, Azure Data Factory, and Kafka to ensure low-latency and high-throughput systems.
Designed and maintained data lake architectures on AWS S3 and Azure Data Lake Gen2, ensuring secure, scalable, and cost-effective storage of both structured and semi-structured data.
Proficient in writing efficient SQL and Spark SQL queries, including advanced joins, window functions, and aggregations, to extract insights from massive datasets across various domains.
Developed and maintained data models using Snowflake, Redshift, and SQL Server, including star schema, snowflake schema, and normalized models for analytics and reporting use cases.
Automated daily and hourly data ingestion workflows using Apache Airflow, Azure Data Factory, and Bash scripting, reducing manual errors and improving data freshness (an illustrative Airflow DAG sketch appears at the end of this summary).
Collaborated with cross-functional teams to implement data validation rules, data reconciliation checks, and quality assurance frameworks using Great Expectations and custom Python scripts.
Implemented SCD Type 1 and 2 logic using PySpark to track slowly changing attributes in customer and transaction datasets, improving the auditability and compliance of analytics (an illustrative SCD Type 2 sketch appears at the end of this summary).
Proficient in using Delta Lake for maintaining ACID-compliant tables on Databricks, enabling time-travel queries, upserts, and scalable merge operations on high-volume datasets.
Integrated REST APIs with Flask and deployed them using Azure Functions and AWS Lambda, allowing business users to securely access curated datasets via endpoints.
Worked with business analysts and data scientists to curate datasets using Apache Hive and Databricks notebooks, improving the accessibility and reusability of core data assets.
Built data lineage and cataloguing systems using Azure Purview and Collibra, enabling data discovery and metadata management across multiple cloud and on-prem environments.
Created robust monitoring and alerting systems using Prometheus, Grafana, and Azure Monitor to track data pipeline failures, latencies, and system health.
Led data migration projects from legacy ETL tools like SSIS to modern cloud-native tools, reducing maintenance overhead and improving pipeline observability.
Managed infrastructure as code using Terraform, provisioning cloud resources such as Azure Synapse, Databricks Clusters, and AWS EMR in a repeatable and scalable way.
Participated in daily Agile ceremonies, including sprint planning and retrospectives, contributing user stories and ensuring timely delivery of sprint commitments.
Tuned Spark jobs by optimizing partitions, caching, broadcast joins, and shuffling strategies to improve performance of ETL jobs by over 60% in some cases.
Developed role-based access controls (RBAC) using Azure Active Directory and AWS IAM to ensure secure access to pipelines, storage, and analytics workspaces.
Created dashboards in Power BI and Tableau connected to Snowflake and Redshift, enabling executives to gain real-time visibility into key business KPIs.
Worked closely with DevOps teams to set up CI/CD pipelines and integrate code quality checks, linting, and automated testing to improve the reliability of deployments.
Designed custom alerting workflows using Slack, Email, and PagerDuty integrations for real-time monitoring of critical pipeline errors and system outages.
Migrated legacy on-prem data platforms to AWS Redshift and Azure Synapse, reducing cost and increasing scalability with auto-scaling and serverless architecture.
Developed Jupyter notebooks for ad-hoc analysis, exploratory data profiling, and POC development, frequently collaborating with analytics and marketing teams.
Participated in SOC 2 audits by providing logging, access control documentation, and pipeline architecture diagrams, ensuring platform compliance and certification.
Provided documentation and internal training to junior engineers and analysts, increasing productivity and reducing onboarding time across data engineering teams.
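The SCD Type 2 and Delta Lake bullets above refer to merge-based change tracking. Below is a minimal PySpark sketch of that pattern, assuming a Delta Lake environment (delta-spark) and hypothetical paths, keys, and columns (customer_id, address); it is illustrative only, not production logic.

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("scd2_sketch").getOrCreate()

DIM_PATH = "/mnt/curated/dim_customer"                      # hypothetical Delta path
updates = spark.read.parquet("/mnt/raw/customers_daily")    # hypothetical daily snapshot

# Rows that are brand new or whose tracked attribute (address) changed.
current = spark.read.format("delta").load(DIM_PATH).where("is_current = true")
changed_or_new = (updates.alias("s")
    .join(current.alias("t"),
          F.col("s.customer_id") == F.col("t.customer_id"), "left_outer")
    .where("t.customer_id IS NULL OR t.address <> s.address")
    .select("s.*")
    .cache())  # cache so both steps below work from the same set of rows

# Step 1: close out the current version of changed customers.
(DeltaTable.forPath(spark, DIM_PATH).alias("t")
    .merge(changed_or_new.alias("s"),
           "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false", "end_date": "current_date()"})
    .execute())

# Step 2: append the new versions as the current records.
(changed_or_new
    .withColumn("is_current", F.lit(True))
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").save(DIM_PATH))
```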
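The Airflow automation bullet above refers to scheduled ingestion with retries. This is a minimal illustrative Airflow 2.x DAG with hypothetical task commands and names, showing the scheduling and retry configuration such a workflow typically uses.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

# Retry twice with a 10-minute delay before alerting (values are illustrative).
default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_sales_ingestion",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract_sales.py")
    load = BashOperator(task_id="load", bash_command="python load_to_warehouse.py")
    extract >> load
```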
TECHNICAL SKILLS
Languages: Python, SQL, T-SQL, PL/SQL, Scala, Shell Scripting, Bash
Big Data Tools: Apache Spark, PySpark, Hive, Kafka, HDFS, Sqoop, Flume, Airflow
ETL Tools: Azure Data Factory, SSIS, Informatica, Apache NiFi
Cloud Platforms: Azure (Data Lake, Synapse, Key Vault, Databricks), AWS (S3, Lambda, EMR), GCP
Databases: SQL Server, Oracle, Teradata, Snowflake, Redshift, PostgreSQL
Data Warehousing: Dimensional Modeling, Star/Snowflake Schema, SCD, Fact/Dimension Tables
Visualization Tools: Power BI, Tableau, R, SSRS
Streaming Tools: Kafka, Azure Event Hubs, AWS Kinesis, Spark Streaming
DevOps & Monitoring: Azure DevOps, GitHub Actions, Jenkins, Grafana, Azure Monitor, Docker
Security & Governance: Data Masking, Encryption, RBAC, Azure Key Vault, Azure Purview
Others: Delta Lake, Parquet, Avro, ORC, Jupyter, Flask, REST APIs, SSAS, ML integration
PROFESSIONAL EXPERIENCE
M&T Bank (Oct 2024 – Present)
Data Engineer
Built real-time data pipelines using Kafka, Confluent Schema Registry, and Spark Structured Streaming to track customer transactions and loan activity (an illustrative streaming sketch appears at the end of this role).
Developed compliance-ready ETL/ELT pipelines using AWS Glue, S3, and Redshift, aligning with internal governance and external regulations.
Modernized legacy batch ETL workflows into event-driven architectures, significantly improving data freshness and reducing latency.
Designed and implemented Data Vault models in Snowflake, enabling scalable enterprise data warehousing and reporting.
Enforced data security through RBAC, encryption, and masking of sensitive fields (e.g., SSNs, account numbers) in Snowflake.
Created robust Python-based auditing frameworks and reconciliation logic for pipeline traceability and anomaly detection.
Automated ingestion of unstructured documents (PDFs, statements) using AWS Textract, Lambda, and S3, enabling digital onboarding.
Reduced credit risk pipeline runtimes by over 50% through Python-based feature engineering and Spark job tuning.
Built dashboards in Power BI and Tableau for risk, compliance, and transaction analytics consumed by executives.
Implemented centralized logging and alerting using CloudWatch, Lambda, and SNS for pipeline monitoring.
Mentored junior engineers on Spark optimization, testing, and Agile best practices in a Scrum environment.
Designed multi-region disaster recovery plans for Snowflake and Redshift to meet enterprise RTO/RPO standards.
Collaborated with fraud teams to curate multi-source datasets (clickstream, transaction, location) for ML model training.
Designed and managed scalable data pipelines on Google Cloud Platform (GCP) using BigQuery and Cloud Dataflow.
Implemented a Cloud Storage data lake and integrated Pub/Sub for real-time data ingestion.
Built CI/CD pipelines using Terraform, GitHub Actions, and Databricks CLI for infrastructure as code deployments.
Created technical documentation, ERDs, and conducted training sessions to align engineering and compliance teams.
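The first bullet of this role pairs Kafka with Spark Structured Streaming. The sketch below illustrates that pattern with hypothetical broker, topic, schema, and path names, parsing JSON payloads in place of the Confluent Schema Registry integration for brevity; it requires the spark-sql-kafka connector.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("txn_stream_sketch").getOrCreate()

# Simplified transaction schema (the real payload would come from the Schema Registry).
txn_schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the raw transaction topic from Kafka (broker and topic are placeholders).
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "customer-transactions")
       .load())

# Parse the JSON payload and aggregate per-account totals over 5-minute windows.
parsed = (raw
          .select(F.from_json(F.col("value").cast("string"), txn_schema).alias("txn"))
          .select("txn.*"))
totals = (parsed
          .withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"), "account_id")
          .agg(F.sum("amount").alias("total_amount")))

# Persist finalized windows to a Delta table for downstream analytics.
query = (totals.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/mnt/checkpoints/txn_totals")
         .start("/mnt/curated/txn_totals"))
```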
Moderna (Aug 2021 – Aug 2023)
Data Engineer
Developed secure, scalable ETL pipelines using Azure Databricks and Azure Data Factory to process global COVID-19 clinical trial data.
Implemented ACID-compliant Delta Lake tables with time travel and schema enforcement for managing evolving medical datasets.
Built ingestion pipelines for patient metadata and genomic results using Python, Pandas, and PySpark to support real-time vaccine analytics (an illustrative ingestion sketch appears at the end of this role).
Standardized field mappings and vocabulary by collaborating with research teams to define data dictionaries and improve interoperability.
Secured sensitive clinical data using Azure Blob Storage, ADLS Gen2, and Key Vault in compliance with HIPAA and FDA regulations.
Integrated FHIR APIs and HL7 feeds using Azure Functions to enable seamless ingestion of external electronic health records.
Created dimension tables for demographics, vaccine batches, and test results, powering clinical trial analytics via SQL and Power BI.
Designed real-time data pipelines using Azure Event Grid, Delta Lake, and Databricks notebooks to update lab data dynamically.
Optimized Spark configurations, reducing pipeline run times by 35% through techniques like partition pruning and broadcast joins.
Ensured data traceability and quality with audit logging, pipeline retries, and reconciliation checks for regulatory reporting.
Mentored junior engineers on data lake architecture, unit testing (pytest), and Spark best practices via code reviews.
Integrated genomic and clinical metadata using Python and BioPython to enable precision medicine and vaccine profiling.
Developed Power BI dashboards to visualize trial enrollment, efficacy trends, and adverse events across research sites.
Built reusable, parameterized pipelines in ADF to streamline ingestion workflows and maintain consistency across datasets.
Used Azure Purview for data cataloging, lineage tracking, and compliance auditing; supported DevOps CI/CD automation for deployments.
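The ingestion and Delta Lake bullets above rely on schema enforcement and time travel. This minimal sketch shows that flow with hypothetical ADLS Gen2 paths and column names; it is illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType

spark = SparkSession.builder.appName("trial_ingest_sketch").getOrCreate()

# Explicit schema so malformed clinical files fail fast instead of drifting silently.
schema = StructType([
    StructField("patient_id", StringType(), nullable=False),
    StructField("site_id", StringType()),
    StructField("visit_date", DateType()),
    StructField("result_code", StringType()),
])

# Hypothetical ADLS Gen2 containers; Delta enforces the schema on write.
raw_path = "abfss://raw@clinicaldata.dfs.core.windows.net/lab_results/"
curated_path = "abfss://curated@clinicaldata.dfs.core.windows.net/lab_results/"

raw = spark.read.csv(raw_path, header=True, schema=schema)
raw.write.format("delta").mode("append").save(curated_path)

# Time travel: reproduce what a report saw at an earlier table version.
as_of_v3 = spark.read.format("delta").option("versionAsOf", 3).load(curated_path)
```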
Walmart (Jan 2019 – July 2021)
Data Engineer
Designed and developed scalable ETL pipelines using Apache Spark, Scala, and Hive to process retail transaction data from 5,000+ stores into a centralized data lake.
Ingested real-time customer behavior data using Kafka and Spark Streaming to support dynamic inventory optimization and sales analytics.
Ensured high data quality with PyDeequ and Great Expectations for clean, reliable reporting and ML workflows.
Managed schema evolution and partitioning in Hive external tables to optimize query performance on historical sales data.
Deployed pipelines with Apache Airflow, enabling DAG-level visibility, retries, and SLA tracking for time-sensitive workflows.
Tuned Spark and Hive jobs by optimizing joins, shuffles, and broadcasts, reducing processing time by over 40% (an illustrative broadcast-join sketch appears at the end of this role).
Migrated ETL workloads from Informatica to Databricks, achieving a cloud-native architecture and reducing costs.
Utilized AWS Glue, S3, and Athena for petabyte-scale analytics over clickstream and product catalog datasets.
Designed star schemas and fact-dimension models with Erwin Data Modeler for scalable sales and supply chain reporting.
Implemented SCD Type 2 logic in PySpark to track historical changes in supplier and pricing data.
Developed CI/CD pipelines using Jenkins and Git for automated Spark job deployments across environments.
Created Tableau and Power BI dashboards for weekly KPI tracking and regional sales performance analysis.
Enabled ingestion of third-party XML, JSON, and CSV supplier data using Scala and Spark.
Collaborated with data scientists to deliver curated datasets for recommendation engines and CLV models.
Applied AWS KMS and CloudHSM for PII/PCI data encryption and compliance with internal security protocols.
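The tuning bullet above mentions broadcast joins. The sketch below shows the basic PySpark pattern of broadcasting a small dimension table so the large fact table is not shuffled; bucket paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sales_enrichment_sketch").getOrCreate()

# Large fact table of transactions and a small store dimension (placeholder paths).
transactions = spark.read.parquet("s3://example-bucket/retail/transactions/")
stores = spark.read.parquet("s3://example-bucket/retail/dim_stores/")

# Broadcasting the small dimension avoids shuffling the large fact table.
enriched = transactions.join(F.broadcast(stores), on="store_id", how="left")

# Write partitioned output for downstream reporting (sale_date is a placeholder column).
(enriched.write.mode("overwrite")
    .partitionBy("sale_date")
    .parquet("s3://example-bucket/retail/enriched_transactions/"))
```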