Soujith Polireddy
United States ******.*********@*****.*** 919-***-**** soujit-reddy-polireddy-420b3260 https://github.com/vspolireddy
SUMMARY
Senior Data Engineer with 9+ years of experience designing and deploying large-scale data platforms and streaming pipelines in cloud-native environments. Proven expertise in building real-time and batch data pipelines using Apache Spark, Kafka (Confluent), and AWS services. Strong track record of improving data latency, observability, and scalability while enabling business-critical insights and fraud detection capabilities. Adept at leading migrations from legacy systems, optimizing data platforms for performance, and contributing to enterprise data mesh and MLOps strategies.
EXPERIENCE
Sr. Data and Analytics Engineer
Blue Cross Blue Shield NC August 2022 - October 2024, Remote
•Architected and developed real-time Python Confluent Kafka pipelines to process claims, billing codes, and provider data, reducing latency by 70% and enabling early fraud detection through policy-based streaming analytics.
•Migrated 100+ TB of structured/unstructured data from on-prem HDFS to Snowflake via Azure Databricks (PySpark), improving scalability for actuarial modeling and reducing query times by 50%.
•Developed standardized Helm charts and Kubernetes YAML templates for deploying Apache Spark, Kafka, and Cassandra microservices, accelerating onboarding by 40% and ensuring consistent CI/CD delivery across teams.
•Built reusable Docker images with Apache Spark, ML libraries, and observability agents; published to AWS ECR, cutting deployment time by 60% across environments.
•Automated ingestion workflows in Python using AWS Lambda, S3, Kinesis, DynamoDB, and Boto3, adding schema evolution support and eliminating manual intervention for real-time data delivery.
•Created a real-time ML job metadata polling and alerting service using Lambda + SNS, pushing events to Slack, Teams, and email and reducing MTTR by 50% for data science workflows.
•Supported secure MLOps initiatives by integrating AWS SageMaker into the data pipeline stack, ensuring reproducibility, versioning, and compliance in healthcare model deployments.
Sr. Data Engineer
Nike (MGB) August 2021 - September 2022, Remote
•Developed scalable, modular PySpark ETL pipelines via Apache Airflow on Kubernetes (EKS) to support digital commerce in Korea and Japan, boosting throughput by 30%.
•Refactored 50 fragile legacy workflows into resilient Airflow DAGs with SLA alerting and retry policies, improving pipeline reliability and observability.
•Implemented JVM-level Spark performance dashboards using Prometheus and Grafana, reducing debugging and memory leak resolution time by 60%.
•Containerized pipeline components with Docker and deployed via CI/CD pipelines on Amazon EKS, improving deployment consistency across dev, QA, and prod.
•Contributed to Nike’s global data mesh by aligning data products with domain boundaries and enforcing decentralized governance practices.
Sr. Data and Analytics Engineer
Blue Cross Blue Shield NC July 2020 - August 2021, Durham, NC
•Refactored legacy Dataiku Python notebooks into modular, testable, object-oriented PySpark pipelines, reducing data processing times by 30% and increasing unit test coverage.
•Transformed outdated ETL pipelines to operate on Kubernetes-hosted Apache Spark, resulting in streamlined operations with automated CI/CD pipelines and a 30% increase in scalability.
•Built and containerized Spark APIs on Amazon EKS, integrating with Helm, Prometheus, and Grafana for performance monitoring and cost visibility.
•Reduced operational waste by orchestrating Spark jobs with Argo Workflows, improving job scheduling efficiency and utilization of compute resources.
•Created PromQL-based monitoring dashboards to surface JVM bottlenecks and long-running job risks in advance.
•Engineered end-to-end data pipeline solutions that enhanced data accuracy and reliability by 25%, utilizing big data technologies such as Apache Spark, Kafka, and Hive to streamline analytics processes.
Software Engineer III (Data)
LexisNexis July 2017 - June 2020, Raleigh, NC
•Led migration of 20+ years of legal content and news archives from MarkLogic XML stores to a modernized Apache Solr-based search engine, reducing reindexing time from weeks to under 5 hours using Apache Spark on AWS EMR.
•Implemented and optimized Scala Spark-based ETL processes on AWS for XML data transformations, achieving an 80% acceleration in reindexing operations.
•Developed event-driven microservices in Python and Java using Knative on Kubernetes, enabling serverless ingestion, enrichment, and validation at scale.
•Engineered reliable real-time ingestion pipelines using AWS Kinesis, Lambda, SNS, and SQS, with full dead-letter queue (DLQ) support and idempotent retry logic.
•Refined Spark job performance in YARN/EMR environments through data caching strategies and efficient resource management, reducing overall costs by 25%.
•Automated deployment of Spark jobs and infrastructure using Jenkins and AWS CloudFormation, reducing release cycles and increasing deployment success rates.
Big Data Developer
UnitedHealth Group (Optum) March 2016 - November 2017, Raleigh, NC
•Engineered data processing scripts using Pig and Spark for CDC, facilitating raw data ingestion and transformation; played a key role on the team managing a MapR 5.2 Hadoop cluster, integrating Talend to enhance Big Data PaaS solutions within the platform architecture.
•Advanced data extraction and transformation processes by designing and implementing Hive and Pig scripts, enriched with custom Python and Java UDFs.
•Optimized Hadoop job scheduling and management using Talend/TAC, and performed advanced querying and analytical operations using HBase and Hive.
•Developed scalable data pipelines on the Hadoop ecosystem, leveraging Kafka and Sqoop for seamless data import/export and achieving a 30% boost in data throughput.
Java Developer & Summer Intern
Fidelity Investments July 2014 - August 2015, Raleigh, NC
•Implemented comprehensive database solutions, optimizing complex query execution by utilizing Oracle, SQL, and PL/SQL in a structured development environment.
•Developed object-oriented Java/J2EE components to extract database information, transform it into JSON, and integrate it into an AngularJS-based front end.
•Crafted and executed Unix scripts for rigorous validation of extensive datasets, comprising millions of rows within CSV files, as integral components of daily batch processing tasks.
•Devised and implemented a multi-threading solution in Java for a high-frequency trading application, enhancing data processing speed by 30% during real-time financial analysis.
EDUCATION
Bachelor's in Computer Science
East Carolina University • Greenville, NC • 2014
Bachelor's in Data Science (coursework)
University of Wisconsin
SKILLS
Technical Skills: Spark, PySpark, Scala, Java, Python, Kafka, Apache Airflow, AWS (S3, DynamoDB, Lambda, Kinesis, SageMaker, EMR,
CloudFormation), Azure Databricks, Kubernetes, Knative, Helm charts, YAML, Docker, RESTful Web Services, JSON, SQL, PL/SQL, Hive, Pig, HBase,
Talend/TAC, MapR Hadoop, Informatica, Jenkins, Prometheus, Grafana, Argo Workflows, Shell Scripting
Soft Skills: Problem Solving, Innovation, Collaboration, Communication, Leadership, Analytical Thinking, Attention to Detail, Adaptability, Continuous
Learning