Saipriya P
940-***-**** ********.*****@*****.*** www.linkedin.com/in/saipriya-reddy-p
SUMMARY
Senior Data Engineer with over five years of experience architecting and optimizing scalable, cloud-based data pipelines and warehouses using Big Data technologies including Hadoop, Spark, and Kafka across AWS, Azure, and GCP. Designed robust data solutions leveraging tools such as Databricks, Snowflake, and Azure Data Factory to ensure efficient data integration and transformation. Proven expertise in both relational and NoSQL databases with a strong background in data security and compliance. Experienced in Agile environments with a track record of mentoring peers and driving data-driven decision-making.
SKILLS
• Big Data Processing: Apache Hadoop, Apache Spark, Apache Flink, Apache Hive, Apache HBase
• Data Streaming: Apache Kafka, Amazon Kinesis, Apache Beam
• Cloud Platforms: AWS (S3, Redshift, DynamoDB, Lambda, Athena, ECS, EKS, EMR, CloudFormation), Google Cloud Platform (BigQuery, Dataflow, Pub/Sub), Microsoft Azure (Synapse Analytics, Data Lake, Cosmos DB, Azure ML, Azure Data Factory)
• ETL & Data Integration: Talend, Informatica
• Databases: MySQL, PostgreSQL, Oracle, SQL Server, Cassandra, MongoDB, DynamoDB
• Data Warehousing: Amazon Redshift, Google BigQuery, Snowflake
• Containerization & Orchestration: Docker, Kubernetes
• DevOps & CI/CD: Jenkins, GitLab CI, Docker
• Data Visualization: Tableau, Power BI, Looker
• Programming Languages: Python, SQL, Java
• Version Control: Git, CVS
WORK EXPERIENCE
New York Life Jul 2023 - Present
Data Engineer New York
• Designed and implemented scalable data models using AWS Redshift and PostgreSQL to support efficient querying and BI reporting.
• Integrated AWS Glue and Athena for scalable data preparation and querying, enabling analysis of both structured and unstructured enterprise data.
• Managed data lakes and warehouses on AWS (S3, Redshift) and Google BigQuery to optimize query performance and support real-time decision-making.
• Ensured compliance with GDPR and HIPAA through robust data encryption, masking, and role-based access control.
• Developed scalable data processing applications in PySpark and Spark (Scala), optimizing performance for high-volume workloads.
• Utilized AWS cloud technologies to implement cost-effective and scalable solutions.
• Automated ETL processes using AWS Lambda, ensuring scalable data transformation workflows.
• Managed diverse data formats in AWS Redshift by leveraging Redshift Spectrum and COPY commands to efficiently load, partition, and transform data.
• Implemented version control with Git to ensure seamless collaboration and accurate code tracking.
• Collaborated with product teams in an Agile environment to identify business needs and deliver data-driven solutions.
• Engineered and deployed scalable ETL/ELT pipelines, ensuring reliability, efficiency, and maintainability for high-volume data processing.
• Automated CI/CD pipelines with Jenkins and Kubernetes, reducing deployment time for analytics models while maintaining pipeline reliability.
• Implemented robust data validation, monitoring, and governance frameworks to maintain accuracy and quality across ingestion and storage layers.
• Monitored critical data pipelines using AWS CloudWatch and Prometheus, ensuring uptime and identifying performance optimization opportunities.
• Partnered with data scientists and ML engineers to operationalize AI/ML workflows by deploying models with Vertex AI/Gemini for scalable production support.
• Supported business intelligence teams by designing data models and providing reliable data access through AWS S3 and Redshift.
• Monitored and fine-tuned Apache Spark jobs to enhance performance and ensure successful data processing within SLA.
Dish Network Jan 2020 - Aug 2022
Data Engineer
• Developed and implemented ETL pipelines using Apache Kafka and Apache Spark for data ingestion and transformation from multiple sources to Hadoop clusters.
• Managed the transition of legacy databases to Cassandra and PostgreSQL, improving query performance by 25%.
• Developed and deployed machine learning models using PyTorch for customer churn prediction and integrated them into production for real-time insights.
• Optimized data models in AWS Redshift to enable fast and efficient data retrieval, enhancing analytics and reporting capabilities.
• Established real-time streaming data pipelines with Apache Kafka and Hadoop to support on-demand analytics of customer behavior.
• Created interactive dashboards in Tableau to empower executives and product teams with data-driven insights.
• Collaborated with data analysts to develop predictive models and establish KPIs for monitoring customer interactions and engagement.
• Optimized Hadoop clusters for large-scale data processing, reducing query times by 40% through performance tuning and storage enhancements.
• Constructed and maintained data aggregation and transformation jobs using Apache Hive to enable efficient processing of large datasets for BI reporting.
• Monitored data pipelines, optimizing performance through automation and troubleshooting to ensure data integrity and reliability.
• Configured and monitored Redshift Workload Management (WLM) to optimize resource allocation and enhance query performance for concurrent users.
• Integrated Redshift with AWS Glue Data Catalog and Amazon Athena to streamline metadata management and improve data discoverability.
• Ensured data security and compliance in Redshift by implementing IAM-based access controls, encryption with AWS KMS, and auditing through AWS CloudTrail.
Biogen Jan 2019 - Jan 2020
Data Engineer
• Designed and implemented a scalable Healthcare Data Lakehouse using Azure Data Lake Storage (ADLS) to store Electronic Medical Records (EMR), claims data, and other healthcare datasets efficiently.
• Built end-to-end data ingestion pipelines using Azure Data Factory (ADF) to extract structured and unstructured data from multiple healthcare sources, ensuring efficient data ingestion at scale.
• Developed data processing workflows using PySpark to clean, enrich, and aggregate EMR and claims data.
• Implemented predictive analytics models using TensorFlow and PyTorch on Azure HDInsight, applying Scikit-learn and NumPy for clustering, decision trees, and healthcare scheduling optimization.
• Designed and maintained CI/CD pipelines using Azure DevOps and Jenkins, automating the build, testing, and deployment of data solutions across production environments.
• Deployed containerized workloads using Docker and Azure Kubernetes Service (AKS), ensuring scalability and real-time analytics capabilities.
• Monitored data pipeline performance with Azure Monitor, Splunk, and Prometheus, setting up custom dashboards and automated alerts to track ETL performance, error rates, and data latency, ensuring high availability and reliability.
• Executed advanced SQL queries (Window Functions, CTEs, Subqueries) to perform deep transactional data analysis, ensuring accurate compliance-driven insights across various departments.
• Developed interactive dashboards in Tableau and Power BI to monitor claims metrics and fraud alerts, driving a 25% increase in data-driven decision-making.
• Collaborated with cross-functional teams and business stakeholders to gather, document, and translate business requirements into technical specifications, ensuring alignment across project lifecycles.
EDUCATION
University of North Texas Denton, TX May 2024
M.S., Computer Science
CERTIFICATIONS
• Neo4j Certified Professional
• Microsoft Certified: Fabric Data Engineer Associate
• Oracle Certified Professional: MySQL 8.0 Database Developer