
Senior Data Engineer with 10+ Years in Big Data and Cloud Systems

Location: Phoenix, AZ
Posted: March 23, 2026



NAVYA REDDY

Senior Data Engineer

301-***-**** | *************@*****.***

PROFESSIONAL SUMMARY

●Senior Data Engineer with 10+ years of professional IT and Big Data consulting experience in analyzing and processing real-time data streams using Azure Stream Analytics, AWS Kinesis, and Apache Spark Streaming.

●Practical experience in developing and implementing a Big Data Management Platform (BMP) utilizing Hadoop 2.x, HDFS, MapReduce/Yarn/Spark, Hive, Pig, Oozie, Apache Nifi, Airflow, Talend, Sqoop, and other components of the Hadoop ecosystem for data storage and retrieval purposes.

●Comprehensive understanding of the Hadoop architecture and the various daemons within Hadoop clusters, including the NameNode, DataNode, ResourceManager, NodeManager, and Job History Server.

●Managed MongoDB databases on various cloud platforms, including AWS, Azure, and GCP, utilizing their respective services for enhanced functionality.

●Practical experience with Hortonworks distribution, Cloudera distribution, MapR, and Amazon EMR.

●Proficient in building custom apps, workflows, and automations using Power Apps, Power Automate, and Power BI.

●Spearheaded the design, development, and deployment of serverless applications leveraging AWS Lambda and API Gateway to drive critical business logic and API functionalities, enhancing operational agility and scalability.

●Led the implementation of robust data integration and ETL processes using AWS Glue, orchestrating the seamless extraction, transformation, and loading of diverse data sets from disparate sources into AWS data lakes and data warehouses, ensuring data consistency and integrity.

●Demonstrated expertise in designing scalable end-to-end architectures to address business challenges by leveraging various Azure services such as HDInsight, Data Factory, Data Lake, Databricks, and Machine Learning Studio.

●Designed and orchestrated serverless ETL workflows using AWS Glue, AWS Step Functions, and AWS Lambda, enabling event-driven data processing and automation (a brief sketch of this pattern follows this summary).

●Integrated Azure CosmosDB with other Azure services like Azure Functions, Azure Logic Apps, and Azure Stream Analytics to build robust, end-to-end data processing pipelines.

●Strong understanding of cloud-native architectures and services, with expertise in deploying and managing Medallion architecture and Databricks on cloud platforms such as Azure and AWS.

●Proficient in designing, implementing, and optimizing big data processing workflows using Databricks.

●Designed and maintained databases and data stores on the AWS cloud platform, including RDS, S3, EC2, Redshift, and Athena, to store and manage large datasets, ensuring data integrity and accessibility.

●Proficient in configuring Azure Linked Services and Integration Runtimes to establish pipelines using Azure Data Factory (ADF) and automate them using Azure Scheduler.

●Designed and managed MongoDB databases, optimizing schema design and data modeling to enhance performance and scalability for various applications.

●Experience in developing web-based client/server applications, designing and developing professional web applications using front-end technologies such as HTML, CSS, jQuery, Bootstrap, and Angular 2, and back-end technologies such as Servlets, JSP, JDBC, Spring, Hibernate, Spring MVC, and Web Services.

●Proficient in optimizing CosmosDB performance, scalability, and reliability for various use cases.

●Hands-on experience in coding MapReduce/YARN programs using Java, Scala, and Python for analyzing Big Data, and strong experience in building data pipelines using Big Data technologies.

●Experience in creating real-time data streaming solutions using Spark Core, Spark SQL, Kafka, Spark Streaming, and Apache Storm.

●Experience in importing and exporting data between HDFS and relational databases such as MySQL, Teradata, Oracle, and DB2 using Sqoop, along with experience in data formats such as JSON, Avro, Parquet, RC, and ORC and compression codecs such as Snappy and Gzip.

●Experience in ETL of data from multiple sources like Flat files, Databases, and integration with popular NoSQL databases for huge volumes of data.

●Experience in data processing, including collecting and aggregating data from various sources using Apache Kafka and Flume.

●Strong experience in designing dimensional data models and lakehouse architectures (Medallion architecture, Iceberg/Delta concepts) to support analytics and machine learning use cases.

●Experience in data governance, data quality, and observability frameworks, ensuring reliability and compliance of mission-critical systems.

●Exposure to MLOps workflows, GenAI/LLM concepts including prompt engineering and Retrieval-Augmented Generation (RAG) for building intelligent data-driven applications.

●Hands-on experience in instrumenting APIs and capturing user behavioral and transactional data for analytics and reporting.
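The serverless ETL bullet above (AWS Glue, Step Functions, Lambda) refers to an event-driven pattern; a minimal sketch is shown below, with a Lambda function reacting to S3 object-created events and dispatching a Glue job run. The Glue job name, bucket wiring, and arguments are hypothetical, not taken from any specific project.

import boto3
import urllib.parse

# Hypothetical Glue job name; a real deployment would read this from configuration.
GLUE_JOB_NAME = "curated-zone-etl"

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Triggered by S3 ObjectCreated notifications; dispatches one Glue job run per new object."""
    runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Hand the landed file to the Glue job as run arguments.
        response = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        runs.append(response["JobRunId"])
    return {"started_job_runs": runs}

In a fuller build-out, Step Functions would typically sit in front of the Glue run to handle retries and downstream states; the handler above covers only the trigger-and-dispatch step.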

TECHNICAL SKILLS

Big Data & Distributed Systems

PySpark, Kafka, Spark streaming, Hadoop, MapReduce, YARN, Hive, HDFS

Cloud Platforms

Snowflake, AWS (S3, Redshift, EMR, SNS, Lambda, EKS, Athena, Glue, CloudWatch, IAM), Terraform, Kubernetes, Docker

Languages

SQL, PL/SQL, Python, HiveQL, Scala, Java

ETL & Workflow Orchestration

Informatica, SSIS, Sqoop, Flume, Apache Oozie, Apache Airflow, Apache Flink

Databases

Teradata, MS SQL Server 2016/2014/2012, Azure SQL DB, Azure Synapse, MS Excel, MS Access, Oracle 11g/12c, Cosmos DB

Scheduling

IBM Tivoli, Oozie, Airflow

CI/CD & Version Control

Jenkins, Git, GitHub, VSS

Reporting

Tableau, Power BI, MS Excel

Operating Systems

Windows (XP/7/8/10/11), UNIX, Linux, Ubuntu, CentOS

Data Modeling & Architecture

Dimensional Modeling, Star/Snowflake Schema, Medallion Architecture, Lakehouse (Delta Lake, Iceberg)

Machine Learning & GenAI

MLOps Concepts, Feature Engineering, Model Data Pipelines, Prompt Engineering, RAG (Retrieval-Augmented Generation), LLM Integration (basic exposure)

Data Governance & Quality

Data Validation Frameworks, Data Lineage, Data Catalogs, Schema Evolution, Data Privacy & Compliance

Query Engines & Analytics

Trino (Presto), OLAP Systems, Near Real-Time Analytics

WORK EXPERIENCE

Client: Vanguard

September 2023 – Present

Role: Senior Data Engineer

Responsibilities:

•Designed and implemented scalable Snowflake data ingestion frameworks, creating internal and external stages to load financial transaction and market data from AWS S3 into Snowflake tables.

•Created and managed transient, temporary, and permanent tables in Snowflake, optimizing storage and query performance for investment analytics workloads.

•Tuned Snowflake virtual warehouses by configuring multi-cluster settings and auto-scaling to support high-concurrency queries from reporting and analytics teams.

•Developed complex SnowSQL and SQL queries for ETL processes, transforming large volumes of trading, portfolio, and client data from AWS S3 and Amazon Redshift into Snowflake.

•Implemented Snowpipe for near real-time ingestion of market feeds and trading transactions, ensuring continuous data availability for downstream analytics (a brief sketch follows this role's environment list).

•Designed and implemented dimensional data models (star/snowflake schemas) to standardize financial business metrics across trading, portfolio, and customer analytics datasets.

•Built and optimized lakehouse architecture using Snowflake and Databricks, enabling scalable analytics and machine learning workloads.

•Built distributed data processing workflows using Databricks and Apache Spark, performing large-scale transformations on financial datasets.

•Processed semi-structured data (JSON, XML) in Snowflake using regular expressions and built-in functions to extract and standardize market data feeds.

•Developed scalable ETL/ELT workflows using AWS Glue, AWS Lambda, and PySpark to move and transform data between AWS data lakes and Snowflake environments.

•Built real-time and near real-time (NRT) streaming pipelines using AWS Kinesis and Spark Streaming to process market data and transaction events for low-latency analytics use cases.

•Instrumented API-based data ingestion pipelines using AWS Lambda and API Gateway to capture transactional and user behavioral data.

•Optimized query performance and reduced cloud costs by converting raw data formats to Parquet, implementing partitioning strategies, and improving query patterns in AWS Athena.

•Utilized AWS Athena for ad-hoc analysis of S3-based financial data lakes and integrated curated datasets into Snowflake for reporting and compliance analytics.

•Implemented data governance and data quality frameworks, including validation, schema enforcement, and monitoring to ensure high data reliability and compliance.

•Designed data observability mechanisms using AWS CloudWatch and custom alerting frameworks to track pipeline failures, latency, and performance bottlenecks.

•Worked on schema evolution and metadata management to maintain consistency and integrity across distributed data systems.

•Collaborated with data science teams to support ML feature engineering pipelines and data preparation workflows for predictive analytics and advanced insights.

•Managed and scheduled data workflows using Apache Airflow, enabling automated pipeline execution, dependency management, and monitoring.

•Deployed and managed containerized data services using Docker and Amazon EKS (Kubernetes) to support scalable data pipeline orchestration.

•Implemented CI/CD pipelines using Jenkins and Git, automating deployment of data pipelines, Snowflake scripts, and Databricks notebooks.

•Designed big data processing solutions using Hive, Spark, and Hadoop ecosystem tools integrated with Snowflake for large-scale financial data processing.

•Implemented monitoring and alerting using AWS CloudWatch, ensuring reliability and performance of cloud-based data pipelines.

•Optimized database performance and integrated data across platforms including PostgreSQL, Oracle, and SQL Server for enterprise reporting and analytics.

•Maintained version control and collaboration using Git and GitLab, enabling continuous integration and streamlined development across data engineering teams.

Environment: AWS (EC2, S3, EBS, RDS, Redshift, EMR, Glue, Lambda, API Gateway, Kinesis, Athena, IAM, CloudWatch), Snowflake, Databricks, Spark (PySpark, Spark SQL, Spark Streaming), Kafka, Airflow, Hive, Hadoop, Python, SQL, Scala, Java, Docker, Kubernetes (EKS), Jenkins, Git, Data Modeling, Lakehouse Architecture, OLAP Systems
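A minimal sketch of the Snowpipe setup referenced above, using the Snowflake Python connector; the account, warehouse, stage, and table names are placeholders rather than actual project objects.

import snowflake.connector

# Connection parameters, stage, and table names are placeholders, not real project objects.
conn = snowflake.connector.connect(
    account="xy12345",
    user="etl_user",
    password="***",
    warehouse="LOAD_WH",
    database="MARKET_DATA",
    schema="RAW",
)

create_pipe = """
CREATE OR REPLACE PIPE RAW.MARKET_FEED_PIPE
  AUTO_INGEST = TRUE
AS
COPY INTO RAW.MARKET_FEED
FROM @RAW.MARKET_FEED_STAGE            -- external stage pointing at the S3 landing prefix
FILE_FORMAT = (TYPE = 'JSON')
"""

with conn.cursor() as cur:
    cur.execute(create_pipe)  # Snowpipe then loads newly arriving S3 files continuously
conn.close()

With AUTO_INGEST enabled, Snowflake picks up S3 event notifications for the stage location and runs the COPY continuously, which is what provides the near real-time behavior described in the bullet.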

Client: CVS Health, Chicago, IL

May 2021 – September 2023

Role: Senior AWS Data Engineer

Responsibilities:

●Designed and implemented scalable ETL pipelines using AWS Glue to extract, transform, and load data from multiple sources into data lakes and data warehouses, supporting enterprise analytics.

●Built end-to-end data integration pipelines to unify customer, CRM, and marketing datasets into curated analytics layers.

●Developed custom Python and PySpark transformation scripts in AWS Glue to cleanse, standardize, validate, and process large-scale datasets.

●Implemented serverless architectures using AWS Lambda, API Gateway, DynamoDB, and S3 to automate workflows and enable event-driven data processing.

●Designed and orchestrated complex workflows using AWS Step Functions, integrating services like Lambda, SNS, and S3 for reliable pipeline execution.

●Created Lambda deployment pipelines and configured event triggers for automated data ingestion and processing.

●Designed and maintained real-time and near real-time (NRT) streaming pipelines using Amazon Kinesis, Apache Kafka, and Apache Flink to process IoT and customer behavior data (a brief sketch follows this role's environment list).

●Integrated API-based ingestion systems to capture real-time transactional and application data.

●Developed and optimized Apache Spark jobs using Scala, Java, and PySpark for large-scale distributed data processing and analytics.

●Leveraged Spark and EMR clusters for efficient data ingestion, transformation, and aggregation across high-volume datasets.

●Designed data models and curated layers to support analytics and reporting use cases across customer and marketing domains.

●Built pipelines supporting machine learning workflows, including feature extraction, transformation, and data preparation for downstream ML use cases.

●Implemented data governance and quality frameworks, including validation rules, schema enforcement, metadata management, and compliance standards.

●Created and maintained AWS Glue Data Catalog for schema discovery, data lineage, and governance.

●Built and managed data pipelines integrating multiple data sources including APIs, relational databases, file systems, and cloud storage.

●Developed advanced SQL queries and transformations in Snowflake, Redshift, and relational databases for analytics and reporting.

●Created external tables and optimized partitioned datasets using Hive, Athena, and Redshift for improved query performance.

●Led migration of legacy batch workloads to cloud-native architectures using AWS Data Pipeline, improving scalability and resource utilization.

●Optimized AWS resource usage and implemented secure coding practices using IAM roles, KMS encryption, and credential management.

●Automated ETL workflows using Apache Airflow and AWS services for scheduling, monitoring, and dependency management of pipelines.

●Monitored system performance and ensured reliability using Amazon CloudWatch metrics, alerts, and logging frameworks.

●Worked with distributed query engines and OLAP systems to enable faster analytical querying and reporting.

●Collaborated with business and marketing teams to translate requirements into scalable data engineering solutions.

Environment: AWS (EC2, S3, EMR, Glue, Lambda, API Gateway, Kinesis, DynamoDB, Redshift, Athena, Step Functions, IAM, CloudWatch), Spark (PySpark, Spark SQL, Spark Streaming), Kafka, Flink, Airflow, Snowflake, Hive, Hadoop, Python, SQL, Scala, Java, Docker, Jenkins, Git, Data Modeling, OLAP Systems
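A minimal sketch of the NRT streaming pattern described above, shown with PySpark Structured Streaming reading from Kafka; the topic name, event schema, and S3 paths are illustrative, and the Kinesis and Flink variants follow the same read-parse-land shape.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

# Topic name, schema, and S3 paths are illustrative; requires the spark-sql-kafka package.
spark = SparkSession.builder.appName("customer-events-nrt").getOrCreate()

event_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "customer-events")
       .load())

# Kafka delivers bytes; cast the value to string and parse it against the expected schema.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Land parsed events as partitioned Parquet for downstream Athena/Redshift consumption.
(events.writeStream
 .format("parquet")
 .option("path", "s3a://analytics-bucket/curated/customer_events/")
 .option("checkpointLocation", "s3a://analytics-bucket/checkpoints/customer_events/")
 .partitionBy("event_type")
 .outputMode("append")
 .start()
 .awaitTermination())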

Client: Fidelity Investments, New Jersey

April 2020 – April 2021

Role: Senior Data Engineer

Responsibilities:

●Designed and implemented scalable ETL pipelines using AWS Glue and PySpark to process large volumes of financial and trading data from multiple enterprise systems.

●Built and maintained a centralized data lake on Amazon S3 to store structured and semi-structured investment, portfolio, and customer datasets.

●Developed real-time data ingestion pipelines using Amazon Kinesis and AWS Lambda to process trading transactions and market data feeds.

●Implemented data warehousing solutions using Amazon Redshift, designing optimized schemas and improving query performance for financial reporting and analytics.

●Developed complex SQL and SnowSQL queries to transform and load financial data into Snowflake for enterprise reporting and analytics.

●Designed and optimized Snowflake virtual warehouses and implemented multi-cluster configurations to support high concurrency financial analytics workloads.

●Implemented Snowpipe for near real-time ingestion of market feeds and transactional data into Snowflake.

●Built distributed data processing pipelines using Databricks and Apache Spark for large-scale transformation of financial datasets.

●Processed semi-structured financial data such as JSON and XML using Spark and Snowflake functions to support downstream analytics.

●Automated ETL workflow orchestration using Apache Airflow to manage scheduling, dependencies, and monitoring of data pipelines.

●Developed Python and PySpark scripts for complex data transformations, aggregation, and cleansing of financial datasets.

●Optimized AWS Athena queries and implemented partitioning strategies on S3 to reduce query costs and improve performance.

●Implemented data quality validation frameworks to ensure accuracy and reliability of financial transaction and portfolio data (a brief sketch follows this role's environment list).

●Secured sensitive financial data using AWS IAM roles, KMS encryption, VPC configurations, and S3 bucket policies to meet regulatory standards.

●Built REST-based data services using AWS Lambda and API Gateway to support internal analytics and reporting applications.

●Implemented monitoring and alerting using Amazon CloudWatch to ensure reliability and performance of production data pipelines.

●Containerized data applications using Docker and deployed analytics workloads on Amazon EKS for scalable data processing.

●Developed CI/CD pipelines using Jenkins and Git to automate deployment of ETL workflows and infrastructure updates.

●Collaborated with investment analysts, risk management teams, and compliance stakeholders to deliver reliable data solutions in an Agile environment.

Environment: AWS services (Glue, EMR, S3, Redshift, Lambda, CloudWatch, CloudFormation), Teradata, Snowflake, Apache Kafka, Apache NiFi, Apache Spark, Apache Airflow, Terraform, AWS DataBrew.
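A minimal sketch of the data quality validation referenced above, assuming hypothetical S3 paths, column names, and rules; a production framework would externalize the rule set and feed metrics into monitoring.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transaction-quality-checks").getOrCreate()

# Paths, column names, and rules are illustrative only.
txns = spark.read.parquet("s3a://finance-lake/curated/transactions/")

# Basic rule set: required keys present and amounts positive.
passes_rules = (
    F.col("transaction_id").isNotNull()
    & F.col("account_id").isNotNull()
    & F.col("trade_date").isNotNull()
    & (F.col("amount") > 0)
)

# Treat NULL rule results (e.g. NULL amounts) as failures rather than letting them slip through.
flagged = txns.withColumn("passes", F.coalesce(passes_rules, F.lit(False)))
valid = flagged.filter(F.col("passes")).drop("passes")
rejected = flagged.filter(~F.col("passes")).drop("passes")

# Quarantine failures for investigation; publish only validated records downstream.
rejected.write.mode("append").parquet("s3a://finance-lake/quarantine/transactions/")
valid.write.mode("overwrite").parquet("s3a://finance-lake/validated/transactions/")

print(f"validated={valid.count()} rejected={rejected.count()}")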

Client: HCA Healthcare

January 2018 – March 2020

Role: ETL Developer

Responsibilities:

●Designed and developed scalable ETL pipelines to extract, transform, and load healthcare data from multiple hospital systems into centralized data warehouses.

●Built batch data ingestion workflows to move large volumes of patient records, billing data, and clinical datasets from relational databases into enterprise data platforms.

●Developed complex SQL queries and transformation logic to cleanse, standardize, and validate healthcare datasets for downstream analytics.

●Implemented data integration processes to consolidate data from Electronic Health Record (EHR) systems, insurance claims systems, and hospital operational databases.

●Developed and optimized ETL workflows using Python and SQL to support large-scale healthcare reporting and analytics requirements.

●Created and maintained staging, transformation, and reporting layers within the data warehouse environment.

●Processed semi-structured healthcare datasets such as JSON and XML and converted them into structured formats for analytics (a brief sketch follows this role's environment list).

●Designed data models and schemas to support hospital performance dashboards and clinical reporting systems.

●Implemented data validation and quality checks to ensure accuracy, consistency, and reliability of healthcare data.

●Automated ETL workflows and job scheduling to ensure timely data processing and reporting.

●Optimized SQL queries and indexing strategies to improve query performance and reduce reporting latency.

●Collaborated with healthcare analysts, reporting teams, and database administrators to gather requirements and deliver data integration solutions.

●Maintained documentation for ETL processes, data flows, and system configurations to support operational transparency and audits.

●Monitored ETL jobs and system performance, troubleshooting failures and ensuring reliable production data pipelines.

Environment: SQL, Python, ETL, Data Warehousing, Healthcare Data Systems, JSON, XML, Linux, Shell Scripting, Oracle, SQL Server, Git, Agile/Scrum.
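A minimal sketch of the JSON-to-structured conversion described above, using only the Python standard library; the field names, nesting, and file locations are illustrative rather than drawn from the actual EHR feeds.

import csv
import json
from pathlib import Path

# File locations, field names, and nesting are illustrative placeholders.
SOURCE = Path("landing/patient_visits.json")   # newline-delimited JSON export
TARGET = Path("staging/patient_visits.csv")
COLUMNS = ["patient_id", "visit_id", "admit_date", "discharge_date", "total_charges"]

def flatten(record: dict) -> dict:
    """Map one nested visit record onto the flat staging layout."""
    return {
        "patient_id": record.get("patient", {}).get("id"),
        "visit_id": record.get("visit_id"),
        "admit_date": record.get("dates", {}).get("admitted"),
        "discharge_date": record.get("dates", {}).get("discharged"),
        "total_charges": record.get("billing", {}).get("total"),
    }

with SOURCE.open() as src, TARGET.open("w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=COLUMNS)
    writer.writeheader()
    for line in src:
        if line.strip():                       # skip blank lines in the export
            writer.writerow(flatten(json.loads(line)))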

Client: I-Verve Infoweb Pvt. Ltd., Hyderabad, India

May 2015 – July 2016

Role: Hadoop Developer

Responsibilities:

•Designed and developed data processing solutions using Hadoop ecosystem tools including HDFS, Hive, and MapReduce to handle large-scale datasets.

•Developed ETL workflows to ingest data from relational databases and flat files into HDFS using Sqoop and Flume.

•Implemented batch processing jobs using MapReduce to transform and aggregate large volumes of structured and semi-structured data.

•Created Hive tables and developed HiveQL queries for data analysis, reporting, and downstream analytics.

•Built data ingestion pipelines to move enterprise application data into Hadoop clusters for centralized storage and processing.

•Optimized Hadoop jobs by tuning MapReduce configurations and improving query performance in Hive.

•Processed and transformed large datasets using Pig scripts for data cleansing and aggregation tasks.

•Implemented partitioning and bucketing strategies in Hive tables to improve query performance and storage efficiency (a partitioning sketch follows this role's environment list).

•Developed shell scripts to automate data ingestion, job scheduling, and monitoring of Hadoop workflows.

•Integrated Hadoop with relational databases such as MySQL and Oracle for data import and export operations.

•Performed data validation and implemented data quality checks to ensure consistency and reliability of processed datasets.

•Monitored Hadoop cluster performance and job execution using tools such as Ambari and Hadoop job trackers.

•Collaborated with data analysts and development teams to design data models supporting reporting and analytics.

•Worked closely with system administrators to manage Hadoop cluster resources and maintain system stability.

•Maintained version control and collaborated with development teams using Git for source code management.

Environment: Amazon EMR, AWS Glue, Amazon S3, Redshift, Athena, Kinesis, Step Functions, IAM, KMS, VPC, Hive, Spark, Scala, Sqoop, Kafka, HBase, Cassandra, Linux, Shell Scripting.
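A minimal sketch of the Hive partitioning strategy mentioned above, expressed as HiveQL submitted through a Hive-enabled Spark session; the database, table, and column names are illustrative, and the same statements run unchanged in the Hive CLI.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("orders-hive-ddl")
         .enableHiveSupport()
         .getOrCreate())

# Partition the table by business date so queries filtering on date prune whole directories.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders_part (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
""")

# Load one day of staged data into its own partition.
spark.sql("""
    INSERT OVERWRITE TABLE sales.orders_part PARTITION (order_date = '2016-01-15')
    SELECT order_id, customer_id, amount
    FROM sales.orders_staging
    WHERE order_date = '2016-01-15'
""")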

EDUCATION:

Master’s in Data Science - University of Arizona - December 2017

Bachelor's in Computer Science - CVR College of Engineering - 2015


