LAKSHMI KIRANMAI REDDY VOGGU
+1-469-***-**** *************@*****.*** www.linkedin.com/in/kiranmai-voggu
PROFESSIONAL SUMMARY
Experienced Data Engineer with 5+ years building scalable ETL pipelines using Python and Apache Airflow. Expertise in automating data ingestion, transformation, and validation on AWS and Azure. Proven ability to optimize pipeline performance and collaborate with data science teams to deliver production-ready datasets. Adept at integrating weather datasets for robust forecasting and analytics.
TECHNICAL SKILLS
• Programming Languages: Python, SQL, Scala (basic), Shell Scripting, Java (Intermediate), HTML
• Big Data Technologies: Apache Spark, Apache Flink, Hadoop (HDFS, MapReduce), Apache Hive, Impala, Kafka, HBase, Sqoop, Oozie
• Data Processing & ETL: PySpark, Apache Airflow, Azure Data Factory, Databricks, Ab Initio, Informatica (basic), SSIS, ETL Tools, Data Pipeline Development
• Cloud Platforms: Azure (Data Lake Storage, Synapse, ADF, Key Vault, Functions, Monitor), AWS (S3, EC2, Lambda, Redshift, Glue)
• Database: Snowflake, SQL Server, PostgreSQL, MongoDB, Teradata, MySQL, Oracle (PL/SQL), Data Warehousing
• Data Formats: JSON, CSV, Parquet, Avro, ORC
• Job Scheduling & Automation: Apache Airflow, Oozie, Cron, Terraform
• Version Control / CI-CD: Git, Bitbucket, GitHub Actions, Jenkins
• Machine Learning & AI: Scikit-learn, MLlib, Random Forest, Clustering (K-Means), Regression, Model Deployment (basic)
• Visualization Tools: Power BI, Tableau, Looker
• Operating Systems: Windows, Linux, Unix
• Development Methodologies: Agile (Scrum, Kanban), Waterfall
• Tools & Utilities: Jupyter Notebooks, Databricks Notebooks, Visual Studio Code, Postman, Azure Monitor, Slack Alerts, Containerization Tools, JIRA
PROFESSIONAL EXPERIENCE
Client - Northern Trust Jul 2024 - Present
Data Engineer
• Designed and developed a scalable Azure-based data architecture using Python-integrated Azure Data Factory, Databricks (PySpark), and Delta Lake, enabling end-to-end high-volume transactional pipelines for credit risk analytics, business intelligence reporting, and regulatory compliance workflows.
• Collaborated with product owners, architects, and BI teams during requirement-gathering sessions to identify ingestion sources, define data strategy, address data quality issues, and translate business rules into robust, Python-enhanced ETL architecture blueprints, documenting workflows with clear technical writing for cross-team understanding.
• Developed automated ETL pipelines in Azure Data Factory and Apache Airflow to ingest CSV, JSON, AVRO, XML, and TXT files from FTP, SQL Server, Postgres, NoSQL databases, and external REST APIs into Azure Data Lake Gen2.
• Built modular transformation layers in Databricks using PySpark and custom UDFs, with schema enforcement, null handling, and error logic to support analytics, ML pipelines, and database queries.
• Migrated on-prem SQL Server and legacy Informatica pipelines to Azure Data Lake, improving change management, reducing ETL dependencies and costs, and leveraging BigQuery for advanced analytics.
• Implemented a real-time ingestion layer using Apache Kafka to stream transaction events into Delta tables, supporting fraud detection systems and incident response workflows, with an architecture designed for deployment on Kubernetes clusters.
• Applied Delta Lake ACID capabilities to build CDC logic for inserts, updates, and deletes across financial records, ensuring complete traceability, support for time-travel queries, and alignment with data governance and compliance requirements.
• Optimized database and storage performance by curating datasets in Parquet format, leveraging compression, partitioning, and Bloom filters within Azure Data Lake Storage Gen2.
• Developed analytical models in Azure Synapse using dedicated SQL pools, integrated Synapse pipelines with ADF triggers, and connected outputs to Power BI dashboards for business intelligence use cases with secure row-level security.
• Integrated non-relational databases like MongoDB to support flexible schema storage for regulatory documents and trading logs, enabling efficient querying and analytics on unstructured JSON fields.
• Enforced rigorous data governance with column masking, row-level access controls, GDPR-compliant encryption, and role-based security using Azure Key Vault, TLS, and storage encryption.
• Planned and executed a comprehensive change management strategy for an on-prem to cloud migration, transitioning OLAP cubes and reports from SSAS to Azure Synapse and Power BI with updated dimensions and measures.
• Built centralized logging with Log4j in PySpark and automated CI/CD using Azure DevOps, GitHub Actions, and Jenkins; managed deliverables via Jira to ensure reliable, on-time data solutions.
• Collaborated with stakeholders and cross-functional teams to resolve data issues, provide documentation, and drive smooth cloud adoption with clear communication.
Client - Ericsson Nov 2020 - Jul 2023
Software Engineer
• Designed real-time ingestion pipelines using Apache NiFi, Kafka, and Spark Streaming to process millions of CDRs and usage logs, enabling low-latency telecom data processing across network zones.
• Built a Hadoop-based data lake with Hive and HBase to store high-volume subscriber, recharge, and tower connection data for batch analytics and SLA reporting.
• Created analytical datasets in Hive and Spark SQL for ARPU analysis, churn forecasting, and customer segmentation based on usage and behavioral patterns.
• Integrated over 100 systems including OSS/BSS, CRM, and portals via REST APIs and FTP, enabling centralized and synchronized telecom data flows.
• Implemented Delta Lake architecture with schema enforcement and ACID compliance to improve reliability in prepaid and postpaid datasets.
• Developed network health dashboards using Power BI and Grafana to monitor call drops, packet loss, and signal anomalies in real time, integrating AWS CloudWatch metrics for proactive monitoring and alerting.
• Built ML pipelines in PySpark for churn prediction and LTV scoring, sourcing features from usage and complaint data to enhance retention strategies as part of predictive data solutions.
• Utilized Great Expectations for data quality checks on CDRs and recharge data, triggering alerts for missing partitions and schema mismatches.
• Replaced legacy Oracle ETL processes with optimized Spark pipelines, reducing execution time by 90% and enabling near real-time behavioral insights aligned with customer needs.
• Developed conformed dimensions, SCD logic, and OLAP cubes using dbt and Hive, standardizing reports across departments and ensuring smooth cross-functional data access.
• Managed Airflow DAGs with dependency handling and failure notifications to ensure reliable workflow execution.
• Applied SHA256 hashing to tokenize PII data (e.g., MSISDN, IMSI), ensuring compliance with data privacy regulations and internal security standards.
• Built and deployed Flask and Lambda REST APIs to expose recharge and usage data for internal and customer-facing portals, supporting real-time insights, targeted promotions, and scalable data solutions.
Client - Liberty Mutual Jan 2019 - Nov 2020
Software Engineer
• Architected secure batch ingestion pipelines using AWS Glue, Spark-Scala on EMR, and S3, extracting transactional data from legacy Oracle systems with built-in error handling and audit traceability.
• Conducted requirement workshops with insurance risk analysts and compliance teams to translate policy and claims needs into structured ETL pipeline designs, ensuring auditability and comprehensive data lineage.
• Developed ingestion workflows for CSV, JSON, AVRO, and TXT files sourced from SFTP and REST APIs, utilizing Glue Catalog and Athena to enable schema evolution and ad-hoc querying.
• Implemented Kafka-based streaming pipelines for real-time insurance claims ingestion, enriching events via Spark-Scala and storing high-availability lookup data in Cassandra.
• Migrated legacy Oracle and Teradata tables to AWS S3, Redshift, and Snowflake using Parquet/ORC formats, enhancing analytics scalability and performance.
• Constructed OLAP cubes using Hive on EMR to support multi-dimensional queries for actuarial teams and visualized premium revenue and loss ratios in Looker dashboards.
• Automated data quality checks using PyTest and custom Scala validators, ensuring data freshness, schema integrity, and value compliance across production datasets.
• Designed and deployed REST APIs with robust authentication and schema validation to deliver enriched insurance and market data to downstream microservices.
• Enabled fast, interactive queries over curated data zones using Athena, reducing ETL overhead for audit and analytics teams.
• Integrated Docker to facilitate consistent Spark job testing and streamlined CI workflows; automated EMR deployments via GitLab, Jenkins, and branching strategies.
• Coordinated a seamless on-prem to cloud migration, re-engineering SSIS and Oracle ETLs into serverless Spark and Glue pipelines with zero downtime.
• Scheduled complex workflows using Airflow on AWS, orchestrating tasks across Glue, EMR, Kafka, and Redshift with robust retry logic and failure handling.
• Enforced row-level security and column masking in Redshift to safeguard sensitive policy and actuarial data, in compliance with insurance standards.
• Applied data governance policies including IAM controls, audit logging, and encryption with AWS KMS to uphold SOC 2 compliance and secure data assets.
EDUCATION
East Texas A&M University
Master's, Computer and Information Sciences
CERTIFICATIONS
• Microsoft Certified: Azure Data Engineer Associate
• Databricks Certified Data Engineer Associate