Senior Data Engineer with Cloud & Big Data Expertise

Location:
Irving, TX
Salary:
120000
Posted:
January 14, 2026

Resume:

Madhav Vaddepalli

Sr. Data Engineer

LinkedIn: www.linkedin.com/in/v-madhav

Email: *****************@*****.***

Contact: 469-***-****

PROFESSIONAL SUMMARY:

•7+ years of IT experience spanning Database Development, ETL Development, Data Modeling, Report Development, and Big Data Technologies across diverse domains including Healthcare, Pharmaceuticals, Finance, and Retail.

•Proficiently utilized cloud platforms such as Azure and AWS, leveraging services like Data Lake, Glue, Databricks, S3, EC2, EMR, Redshift, and Snowflake to architect and manage scalable, secure data storage and processing systems.

•Demonstrated expertise in data warehousing technologies like Amazon Redshift and Snowflake, as well as proficient handling of data storage and retrieval for analytics applications.

•Utilized ETL solutions such as AWS Glue, Informatica Intelligent Cloud Services, and Talend to streamline data integration and transformation operations.

•Proficient in designing, implementing, and optimizing data pipelines on the Databricks platform, ensuring efficient extraction, transformation, and loading (ETL) of large datasets.

•Specialized in Data warehousing, Data modeling, Data integration, Data Migration, ETL process, and Business Intelligence, with expertise in SSIS, Informatica ETL, and reporting tools.

•Engineered advanced data ingestion and transformation pipelines, automated ETL processes, and enhanced data quality and reliability using Python, PySpark, and Scala.

•Developed reliable, scalable data processing pipelines utilizing a variety of programming languages and technologies including Python, PySpark, R, Scala, SQL, and shell scripting.

•Proficient in Python OOP concepts and experienced in developing Spark applications using Spark Core, Spark SQL, and Spark Streaming APIs (see the brief PySpark sketch at the end of this summary).

•Demonstrated expertise in Big Data technologies such as Hadoop Ecosystem, Spark, Hive, MapReduce, Sqoop, Flume, and Zookeeper, efficiently processing large and complex datasets.

•Experienced in message-based and distributed systems using AWS SQS and SNS for event-driven data pipelines and asynchronous processing.

•Applied AI/ML techniques using Python, Scikit-learn, and AWS SageMaker to develop predictive models and automate analytical workflows.

•Exhibited data visualization and reporting skills using technologies like Power BI, Tableau, and QuickSight to facilitate informed decision-making.

•Collaborated with data science teams to implement machine learning pipelines and model monitoring integrated with Snowflake and Databricks for continuous training and evaluation.

•Proficient in job scheduling and workflow orchestration using Apache Airflow, enhancing automation and operational efficiency.

•Possessed advanced knowledge of database management systems including Oracle, MongoDB, and MS SQL Server, ensuring data integrity, security, and performance.

•Implemented data validation and integrity constraints within Oracle databases using PL/SQL to ensure data accuracy and consistency.

•Designed and implemented NoSQL database solutions using MongoDB, integrating them with data pipelines to handle semi-structured data efficiently.

•Experienced in designing and implementing innovative data solutions on AWS, including data lakes, real-time analytics, and AI-driven insights, ensuring scalability and accessibility for decision-making.

•Led significant migration projects, focusing on security and compliance, utilizing encryption mechanisms, and adhering to regulatory standards.

•Collaborated closely with data scientists and business analysts, translating complex data analysis requirements into actionable data models and insights.

•Applied version control and CI/CD practices using Git, GitLab, Azure DevOps, and Jenkins to ensure code integrity and streamlined deployment processes.

•Managed and optimized CI/CD pipelines, facilitating efficient code integration and deployment across various environments.

•Proficient in Agile project management methodologies, utilizing tools like Jira and Kanban to enhance team productivity and project visibility.

•Skilled in containerization and microservices deployment using Docker and Kubernetes, ensuring consistent, portable, and scalable data pipeline environments across cloud platforms.
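
The following is a minimal, illustrative PySpark sketch of the Spark Core and Spark SQL usage summarized above; the S3 paths, view name, and column names are hypothetical placeholders rather than details of any specific engagement.

# Minimal PySpark sketch (placeholder paths and columns): read raw CSV data,
# clean it, aggregate with Spark SQL, and persist a curated Parquet output.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_rollup_sketch").getOrCreate()

# Read raw CSV data (path and schema are illustrative only).
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("s3://example-bucket/raw/transactions/"))

# Basic cleansing: drop duplicates and rows missing a transaction id.
clean = raw.dropDuplicates(["txn_id"]).filter(F.col("txn_id").isNotNull())

# Register a temp view and aggregate with Spark SQL.
clean.createOrReplaceTempView("transactions")
daily = spark.sql("""
    SELECT txn_date, region, COUNT(*) AS txn_count, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY txn_date, region
""")

# Persist the curated result as partitioned Parquet for downstream analytics.
daily.write.mode("overwrite").partitionBy("txn_date").parquet(
    "s3://example-bucket/curated/transactions_daily/")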

TECHNICAL SKILLS:

Operating Systems

Unix, Linux, Mac OS, Windows, Ubuntu

Big Data

Hadoop, Spark, MapReduce, YARN, Hive, Pig, Sqoop, Oozie, Maven

Hadoop Ecosystem

HDFS, MapReduce, YARN, Oozie, ZooKeeper, JobTracker, TaskTracker, NameNode, DataNode, Cloudera

Cloud Computing Tools

AWS - EC2, S3, Athena, Glue, RDS, Redshift, EMR, IAM, Lambda, Step Functions, QuickSight, Snowflake, Databricks

Data Ingestion

Spark Streaming, Kafka, Kinesis, Sqoop, Flume

ETL/ELT

PySpark, Informatica PowerCenter, Talend, Pentaho, Microsoft SSIS, Ab Initio

Databases

Amazon RDS, SQL Server, MySQL, Oracle, PostgreSQL 9.3, Snowflake Cloud DB, Teradata 12/14, DB2 10.5, MS Access, Netezza

SQL Server Tools

SQL Server Management Studio, Export and Import Wizard, Enterprise Manager, Query Analyzer, SQL Profiler, DTS

NoSQL Databases

HBase, Cassandra, DynamoDB, MongoDB

Programming Languages

Python, SQL, R, Java, Scala, PL/SQL, SAS

IDE

Jupyter Notebook, PyCharm, IntelliJ, Eclipse, Visual Studio, IDLE

Web Services

RESTful, SOAP, Oracle Forms Server, WebLogic 8.1/10.3, WebSphere MQ 6.0

Reporting BI Tools

MS Excel, Tableau, Tableau Server and Reader, Power BI, QlikView, Crystal Reports, SSRS, Splunk

Automation/Validation Tools

Atlas Workflows, Postman, Jenkins, GitLab CI/CD

Methodologies

Agile, Waterfall

PROFESSIONAL EXPERIENCE:

Safeco Insurance, Seattle, WA November 2022 - Present

AWS Data Engineer

Responsibilities

Contributed to an Apache Spark data processing project, ingesting data from RDBMS and streaming sources, and developed Spark applications in Python on AWS EMR.

Designed and deployed multi-tier applications on AWS using CloudFormation and services such as EC2, Route 53, S3, RDS, and DynamoDB, focusing on high availability, fault tolerance, and auto-scaling.

Connected and integrated various data sources such as S3, Amazon Redshift, RDS, Athena, and third-party databases with QuickSight for data analysis and visualization.

Implemented AWS Step Functions to manage multiple microservices within financial payment processing workflows.

Built end-to-end ETL pipelines using PySpark, AWS Glue, and Snowflake to process multi-source financial data, improving analytics efficiency and data accessibility across teams.

Designed and optimized Snowflake and Redshift architectures integrated with S3, leveraging partitioning, compression, and concurrency tuning for high-performance query execution.

Automated ingestion and transformation workflows using AWS Lambda, Step Functions, and Glue Jobs, minimizing manual intervention and ensuring timely data delivery for analytics.

Built CI/CD automation using Terraform, Jenkins, and CodePipeline, enabling infrastructure-as-code provisioning and continuous integration for data engineering workflows.

Designed and developed batch processing and real-time processing solutions using ADF, Databricks clusters, and stream analytics.

Migrated legacy ETL jobs to Databricks notebooks running on Azure, modernizing batch jobs into streaming pipelines using Spark Structured Streaming.

Established proactive monitoring through CloudWatch and Datadog, tracking job metrics and error rates to improve reliability of Glue and Snowflake pipelines.

Designed reusable Spark modules for tokenization, masking (PII/PHI), and validation logic across multiple data domains.

Built streaming ingestion from Kafka to Delta tables with schema enforcement and watermarking logic (see the streaming sketch after this section).

Developed SQL-based dashboards in Databricks SQL and integrated them with BI tools like Tableau and Power BI.

Developed interactive dashboards using R Shiny and RStudio, integrating real-time AWS and Snowflake data to support business analytics and executive reporting.

Created data pipelines using Linked Services, Datasets, and Pipelines in Azure Data Factory and Azure Databricks to extract, transform, and load data from a variety of sources, including Azure SQL, Blob Storage, and Azure SQL Data Warehouse.

Implemented monitoring and alerting with Amazon CloudWatch, AWS CloudTrail, and Datadog, ensuring pipeline reliability.

Performed performance tuning in Hive and Redshift, migrating legacy ETL processes to scalable, cost-effective AWS solutions.

Designed and implemented CI/CD pipelines using AWS CodePipeline, Git, and Cloud Build for automated data workflows, model versioning, and consistent deployment across environments.

Designed and implemented end-to-end ETL pipelines using Python, AWS Glue, and Snowflake’s SnowSQL, integrating S3 and Athena for metadata-driven transformation and querying.

Designed and optimized data warehousing solutions in Snowflake and Redshift, implementing dimensional models and partitioning strategies to improve analytical performance.

Utilized AWS Glue catalog with a crawler to retrieve data from S3 and perform SQL operations using AWS Athena.

Implemented an ETL framework using PySpark to load standardized data into Hive and HBase, orchestrating Sqoop-based ingestion of structured and semi-structured data into HDFS zones.

Developed event-driven AWS Lambda functions integrated with S3 triggers and CloudWatch monitoring to automate Glue ETL pipelines and ensure real-time operational reliability.

Engineered distributed PySpark jobs on AWS EMR to process multi-schema CSV and JSON datasets, optimizing transformations with DataFrames and Spark SQL for analytics readiness.

Processed Parquet data using PySpark and Impala queries, integrating Spark Streaming pipelines with RDDs and DataFrames to enable both batch and real-time analytics.

Developed Spark applications using Scala for high-performance ETL and transformation tasks, improving efficiency for large-scale financial data processing.

Built streaming data pipelines using Apache Kafka, AWS Kinesis, and Glue ETL to consolidate server logs, enable real-time analytics, and deliver low-latency insights into AWS data lakes.

Prepared model-ready datasets by cleaning and engineering features in Python and Scikit-learn, collaborating with Data Science teams to train models on Spark EMR clusters.

Utilized Informatica Data Quality (IDQ) for data profiling, cleansing, matching, deduplication, and validation to ensure consistent, high-quality datasets for analytics and reporting.

Optimized SQL performance across Oracle, MySQL, and Redshift databases and developed interactive Tableau and Power BI dashboards for cross-platform business reporting.

Deployed and configured MongoDB clusters on AWS EC2 instances, integrating NoSQL data sources with analytics pipelines to support semi-structured data ingestion and storage.

Leveraged Hadoop ecosystem components (HDFS, Hive, and Oozie) for managing data pipelines, scheduling workflows, and archiving historical data for compliance analytics.

Deployed Spark streaming and ETL jobs on Kubernetes clusters, enabling horizontal scaling and optimizing resource allocation for real-time processing.

Containerized ETL and data processing jobs using Docker, enabling reproducible environments and faster deployment within AWS and Databricks clusters.

Maintained version control using GitLab, ensuring traceability and accountability in the development lifecycle. Collaborated with cross-functional teams to implement GitLab CI/CD pipelines.
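
Below is a minimal, illustrative Spark Structured Streaming sketch of the Kafka-to-Delta ingestion pattern with schema enforcement and watermarking referenced in this section; the broker address, topic, schema, and storage paths are placeholders, and it assumes the Delta Lake libraries are configured on the cluster.

# Minimal sketch (assumed topic, schema, and paths) of streaming ingestion from
# Kafka into a Delta table with an explicit schema and event-time watermark.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka_to_delta_sketch").getOrCreate()

# Explicit schema so malformed events are rejected rather than silently inferred.
event_schema = StructType([
    StructField("payment_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "payments")                    # placeholder topic
       .load())

# Parse the Kafka value, enforce the schema, and bound late data with a watermark.
events = (raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .withWatermark("event_time", "15 minutes"))

# Append to a Delta table with checkpointing for fault-tolerant, exactly-once writes.
query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "s3://example-bucket/checkpoints/payments/")
         .outputMode("append")
         .start("s3://example-bucket/delta/payments/"))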

Northern Trust Bank, Chicago, IL December 2019 – October 2022

Cloud Data Engineer

Responsibilities

•Designed and implemented real-time data streaming pipelines using AWS Kinesis to collect and process financial transaction data and market feeds, ensuring low-latency ingestion and processing for timely insights.

•Utilized Apache Kafka and Spark Streaming to handle high-throughput data ingestion and real-time analytics, optimizing the processing of financial market data and trading metrics.

•Integrated Flume and Sqoop to facilitate seamless movement of structured and unstructured financial data, supporting ETL workflows for transactional systems and data lakes.

•Designed and maintained ELT pipelines to extract financial data from multiple transactional systems, transform it using PySpark, and load it into Snowflake and Redshift for analytics.

•Optimized and streamlined complex SQL and PL/SQL procedures and queries across Oracle, Teradata, and PostgreSQL to accelerate financial reporting and data transformations.

•Engineered Python-based ETL pipelines with AWS Glue and PySpark, processing large volumes of financial data at scale, with automated data cleaning, transformation, and enrichment.

•Established a scalable data lake architecture on Amazon S3 and Glacier, providing cost-effective storage for structured financial datasets, market data, and unstructured reports.

•Automated complex ETL orchestration scripts and data-quality checks using Python, Pandas, and Boto3 for efficient AWS data pipeline operations.

•Developed and managed DAGs in Apache Airflow to orchestrate daily financial ETL jobs, monitor dependencies, and automate data refresh cycles (see the Airflow sketch after this section).

•Built and maintained enterprise data warehouse models in Redshift and Snowflake, supporting business reporting, analytics, and regulatory compliance needs.

•Integrated Databricks notebooks with AWS S3 and Snowflake for high-volume data transformation, standardizing financial reporting models and ensuring data accuracy across environments.

•Utilized Hadoop Distributed File System (HDFS) for large-scale financial data archiving and parallel processing of historical transaction logs and market feeds.

•Developed scalable batch and streaming applications using PySpark and Scala integrated with Kafka topics to process market tick data in real-time.

•Leveraged AWS Redshift, Snowflake, and Teradata for optimized operational data storage and retrieval, supporting high-performance queries and analytics on financial data.

•Developed and maintained RESTful APIs for seamless integration between financial applications, ensuring secure and scalable data exchange between systems.

•Led the migration of raw data to AWS S3 and performed refined data processing, leveraging AWS EMR for large-scale data transformation and movement.

•Utilized serverless architecture with API Gateway, Lambda, and DynamoDB, deploying AWS Lambda functions triggered by events from S3 buckets.

•Created external tables with partitions using Hive, AWS Athena, and Redshift for efficient data organization and querying.

•Designed and automated ETL/ELT workflows using AWS Glue, Lambda, EMR, and Redshift to extract, transform, and load large-scale financial data efficiently.

•Developed and maintained intricate SQL scripts, indexes, views, and complex queries for comprehensive data analysis and extraction.

•Managed relational databases on AWS RDS, writing optimized SQL queries for financial reporting and performing regular maintenance to ensure high availability.

•Implemented security policies with AWS IAM, KMS, and CloudTrail to ensure compliance with financial industry standards, including encryption, access control, and audit logging.

•Collaborated in Agile sprints, ensuring timely delivery of features related to financial data processing, including real-time market data pipelines and financial reporting systems.

•Optimized data processing with Apache Spark on AWS EMR, accelerating processing for financial transaction logs, market data, and portfolio analytics.

•Developed interactive dashboards with AWS QuickSight, Tableau, and Splunk to visualize financial metrics, market trends, and real-time operational performance.

•Integrated CI/CD pipelines and proactive monitoring using Jenkins, GitHub Actions, CloudWatch, and Nagios to ensure continuous delivery and operational reliability of financial data workflows.

•Implemented comprehensive data governance and security frameworks using AWS Lake Formation, KMS, and Apache Atlas, ensuring financial data privacy, lineage, and regulatory compliance.

•Automated infrastructure provisioning and containerized deployments using Terraform, CloudFormation, Docker, and Kubernetes for scalable, reproducible financial data pipelines.

•Optimized SQL and ETL processes for financial data to ensure faster report generation, better decision-making, and efficient querying of large-scale datasets.

•Leveraged Git and Bitbucket for version control, Maven for dependency management, and implemented unit testing frameworks to ensure pipeline reliability before deployment.

•Administered Oracle databases in RDS, handling query optimization, backup strategies, and access control policies.
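
Below is a minimal, illustrative Apache Airflow sketch of a daily ETL DAG of the kind referenced in this section; the DAG id, schedule, task names, and callables are hypothetical placeholders.

# Minimal Airflow sketch: a daily extract-transform-load DAG with explicit
# task dependencies (placeholder callables stand in for real ETL steps).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():      # pull source extracts (placeholder)
    pass

def transform():    # run PySpark / SQL transformations (placeholder)
    pass

def load():         # load curated data into the warehouse (placeholder)
    pass

with DAG(
    dag_id="daily_financial_etl_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Run the steps strictly in sequence.
    t_extract >> t_transform >> t_load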

The Home Depot, Atlanta, Georgia July 2018 – November 2019

Big Data Engineer

Responsibilities

Created data pipelines to gather, clean, and refine data using Hive and Spark, and used Spark Streaming APIs to perform transformations and actions on the fly, building a common learner data model that receives data from AWS Kinesis in near real time and persists it for downstream use.

Created dashboards in Tableau that were published to the internal team for review and further data analysis and customization using filters and actions

Designed and developed scalable and cost-effective architecture in AWS Big Data services for the data life cycle of collection, ingestion, storage, processing, and visualization

Collaborated with cross-functional teams to integrate Atlas workflow automation with existing Spark and Kafka processes, improving testing speed and data reliability.

Built and maintained data warehouses using AWS Redshift, optimizing database performance and query execution times.

Created AWS pipelines that extracted customers' big data from various sources into Hadoop HDFS, including data from Excel, flat files, Oracle, SQL Server, Teradata, and server logs.

Configured Linux across multiple Hadoop environments, setting up Dev, Test, and Prod clusters with consistent configurations.

Developed reports and dashboards in Splunk and used the Splunk Machine Learning Toolkit to build a clustering model for log detection.

Developed Spark applications in Python (PySpark) in a distributed environment to load a large number of CSV files with different schemas into Hive ORC tables (see the PySpark sketch after this section).

Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity.

Successfully mapped source data to target data structures and implemented complex data transformations using Talend components.

Developed, using object-oriented methodology, a dashboard to monitor all network access points and network performance metrics using Django, Python, MongoDB, and JSON.

Used LINQ to query and manipulate data in C# collections, databases, and XML documents.

Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Python and NoSQL databases such as HBase and Cassandra.

Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on data in HDFS.

Designed data warehousing architectures integrating Hive, Redshift, and Snowflake for centralized reporting and near real-time analytics across business domains.

Developed solutions leveraging ETL tools and identified opportunities for process improvements using Informatica and Python.

Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, using the Spark engine and Spark SQL for data analysis, and provided the results to data scientists for further analysis.

Developed Python scripts to find vulnerabilities in SQL queries through SQL injection testing, permission checks, and analysis. Developed an Admin API to manage and inspect topics, brokers, and other Kafka objects.

Gained hands-on experience integrating Talend with various data sources, databases, cloud services, APIs, and data warehouses.

Published interactive data visualization dashboards, reports, and workbooks on Looker.

Worked on automating infrastructure setup and AWS resource configuration using Terraform and Ansible, helping migrate legacy and monolithic systems smoothly to the AWS cloud environment.

Implemented GitHub Actions workflows to deploy Terraform templates into AWS.

Developed and deployed Dockerized data applications for Spark and Hive workloads, improving development consistency and environment isolation.

Used Kubernetes to orchestrate Docker-based data microservices for data ingestion and reporting workflows.

Developed Python and PySpark programs for data analysis on Cloudera, and Hortonworks Hadoop clusters.

Developed and designed a system to collect data from multiple portals using Kafka and process it using Spark Streaming and PySpark on EMR, landing raw data in S3/HDFS, curating it into Hive/Parquet, and loading downstream into Redshift for analytics and dashboards.
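
Below is a minimal, illustrative PySpark sketch of loading CSV files with differing schemas into a partitioned Hive ORC table, as referenced in this section; the HDFS path, database, table, and column names are hypothetical placeholders.

# Minimal sketch: read varying-schema CSVs, keep a common set of columns, and
# append the result to a partitioned Hive table stored as ORC.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv_to_hive_orc_sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read a batch of CSV files; schemas vary per file, so columns are reconciled
# after the read rather than assumed up front.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/raw/orders/*.csv"))

# Keep only the columns shared across source files (illustrative column list).
common_cols = ["order_id", "store_id", "order_date", "amount"]
df = df.select([c for c in common_cols if c in df.columns])

# Append into a partitioned Hive table stored as ORC.
(df.write
 .mode("append")
 .format("orc")
 .partitionBy("order_date")
 .saveAsTable("analytics.orders_orc"))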

EDUCATION:

Master of Science in Computer Science, Fitchburg State University.

CERTIFICATION:

AWS Certified Solutions Architect Associate.


