Chandu Kondepati
Sr. Data Engineer
*****************@*****.*** LinkedIn 203-***-****
ABOUT ME:
9+ years of experience in systems analysis and in developing, deploying, and managing big data applications, data warehousing, the Hadoop ecosystem, AWS cloud data engineering, data visualization, reporting, and data quality solutions.
●Hands-on experience with Databricks: Proficient in using Databricks for building and managing scalable data pipelines, performing data transformations, and enabling real-time analytics.
●Developed and maintained scalable data pipelines using Apache Beam and GCP Dataflow for efficient data processing.
●Built and optimized BigQuery data models for analytical and reporting purposes.
●Designed and implemented data storage solutions using GCP Databases like BigQuery, Cloud SQL (MySQL), and Firestore.
●Developed data transformation scripts using Python, Scala, and Java for data processing and manipulation.
●Created automation scripts for pipeline deployment and monitoring using Shell Scripting and Bash.
●Implemented error handling, logging, and monitoring solutions using tools like Splunk and ELK Stack.
●Developed ETL workflows and data ingestion pipelines from various data sources into GCP environments.
●Designed and administered Tableau dashboards for executive reporting and performance monitoring.
●Managed Tableau Server configurations, user permissions, and performance tuning.
●Managed application deployment with zero downtime using CodeDeploy.
●Designed and implemented CI/CD pipelines using AWS CodePipeline for automated build, test, and deployment processes.
●Integrated with CodeCommit, GitHub, Bitbucket, and Jenkins for source control management.
●Advanced Analytics: Leveraged Databricks for data lakehouse architecture, integrating structured and unstructured data from multiple sources into Snowflake and Azure Data Lake.
●Experienced in using Amazon S3 Glacier for secure, durable, and cost-effective long-term data storage.
●Implemented data archival solutions using S3 Glacier to reduce storage costs.
●Configured Lifecycle Policies to automate data movement from Amazon S3 to S3 Glacier for archival purposes.
●Feature Engineering & Management: Designed and implemented scalable feature pipelines using Tecton, ensuring efficient feature storage, transformation, and retrieval for real-time and batch ML applications.
●Created benchmarking reports and comparative analytics to support client performance reviews.
●Developed KPI tracking systems to monitor operational data quality and business impact.
●Expertise in developing SSIS/DTS Packages to extract, transform, and load (ETL) data from heterogeneous sources into data warehouses and data marts.
●Skilled in performance tuning of SSIS packages by implementing parallel execution, removing unnecessary sorting, and using optimized queries and stored procedures.
●Proficient in writing Spark RDD transformations, actions, and Data Frames for structured data processing.
●Knowledgeable in data warehousing concepts with experience in Redshift; completed a Proof of Concept (POC) for DW implementation with Matillion and Redshift.
●Experience with GCP services: BigQuery, Cloud Composer, Airflow, Cloud SQL, Cloud Storage, Cloud Functions, and Dataflow.
●Skilled in writing queries and creating tables, views, and partitions in BigQuery (see the illustrative sketch after this summary).
●Strong experience in ETL migration from on-premises systems to GCP BigQuery.
●Extensive experience with AWS services: S3, IAM, EC2, EMR, Kinesis, VPC, DynamoDB, Redshift, Amazon RDS, Lambda, Athena, Glue, DMS, QuickSight, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SQS, and other services in the AWS ecosystem.
●Hands-on expertise with S3, EC2, RDS, EMR, Redshift, SQS, and Glue for scalable data storage, compute, and ETL processes.
●Deep understanding of data warehousing implementation in Redshift and proficient in real-time data processing with Spark streaming using Kafka.
●Experience in SQL database migration to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, Azure Blob Storage, Azure Data Lake Storage Gen2, and Azure SQL Data Warehouse with a focus on access control and permissions.
●Skilled in developing end-to-end scalable architectures with Azure components: Data Lake, Key Vault, HDInsight, Azure Monitoring, Azure Synapse, Function App, Data Factory, and Event Hubs.
●Proficient in using Azure Data Factory and PySpark to create, develop, and deploy high-performance ETL pipelines.
●Knowledgeable in Spark Architecture including Spark Core, Spark SQL, Data Frames, Spark Streaming, Driver Node, Worker Node, Stages, Executors, and Tasks.
●Experienced in Spark streaming for receiving real-time data with Kafka.
●Strong understanding of Tableau for data visualization design and development.
●Developed custom UDFs for Pig and Hive to integrate Python functions within Pig Latin and HiveQL (HQL), including usage of Piggybank UDF Repository.
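The BigQuery table design work noted above can be illustrated with a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and column names below are hypothetical placeholders rather than details from any specific engagement.

```python
# Hypothetical sketch: creating a date-partitioned, clustered BigQuery table
# with the google-cloud-bigquery client. All names are illustrative placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

table_id = "my-project.analytics.daily_transactions"  # hypothetical table
schema = [
    bigquery.SchemaField("transaction_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
    bigquery.SchemaField("event_date", "DATE", mode="REQUIRED"),
]

table = bigquery.Table(table_id, schema=schema)
# Partition by event_date so queries filtering on date scan only the
# relevant partitions.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
)
# Cluster within each partition on customer_id to reduce bytes scanned
# for customer-level lookups and aggregations.
table.clustering_fields = ["customer_id"]

table = client.create_table(table)
print(f"Created {table.full_table_id}")
```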
TECHNICAL SKILLS
Programming Languages
Python, SQL, PL/SQL, Shell scripts, Scala, Unix
Scripting Languages
JavaScript, Python, Shell Script
Web Servers
Apache Tomcat 4.1/5.0
Big Data Tools
Hadoop, Apache Spark, MapReduce, Flink, PySpark, Hive, YARN, Kafka, Flume, Oozie, Airflow, Databricks, Tecton, Zookeeper, Sqoop, HBase, CodeDeploy, CodePipeline
Cloud Services
Amazon Web Services (AWS) - AWS Glue, S3, Redshift, EC2, EMR, DynamoDB, Data Lake, AWS Lambda, CloudWatch, S3 Glacier, H2O MLOps
GCP
BigQuery, Cloud Composer, Airflow, Cloud SQL, Cloud Storage, Cloud Functions, and Dataflow
Azure
Azure Data Lake, Data Factory, Synapse, Key Vault, HDInsight, Azure SQL Database, Azure SQL Data Warehouse, Azure Blob Storage, Event Hubs, Function App, Azure Monitor
ETL/Data warehouse Tools
Informatica, Talend, DataStage, Power BI, and Tableau
Version Control & Containerization tools
SVN, Git, Bitbucket, Docker, Jenkins, CVS, CodeCommit, GitHub, Apache Log4j, TOAD, ANT, Maven, JUnit, JMock, Mockito, REST HTTP Client, JMeter, Cucumber, Aginity
Databases
Oracle, MySQL, MongoDB, and DB2
Operating Systems
Ubuntu, Windows, and Mac OS
Methodologies
Agile/Scrum and Traditional Waterfall
PROFESSIONAL EXPERIENCE
Client: CIBC Bank, Chicago April 2023 to May 2025
Insight Global
Role: Sr. Data Engineer
Description: At CIBC Bank in Chicago, as a Senior Data Engineer, I was responsible for architecting and implementing scalable data solutions to enhance financial analytics and reporting capabilities. I leveraged advanced data processing frameworks, including Apache Spark and Kafka, to efficiently handle large volumes of transaction data. My role involved optimizing ETL pipelines to ensure high data quality and reliability while using SQL and Python to streamline data workflows. I collaborated closely with cross-functional teams, including data scientists and business analysts, to develop and maintain data models that supported real-time analytics and decision-making processes.
Responsibilities:
●ETL & Data Pipelines: Built and optimized ETL pipelines using Tecton, Apache Spark, and SQL, ensuring high-performance feature computation and real-time data processing.
●Configured automated approvals and notifications using AWS SNS and Lambda functions.
●Reduced deployment time and enhanced release management using CodePipeline's parallel and sequential stages.
●Implemented end-to-end CI/CD pipelines for both containerized applications and serverless functions using AWS ECS/EKS and Lambda.
●Experience in configuring deployment strategies like Rolling Updates, Blue/Green Deployments, and Canary Releases.
●Implemented CI/CD pipelines for automated deployment and monitoring using Cloud Build and Cloud Functions.
●Performed data validation, error handling, and root cause analysis to ensure accurate data processing.
●Feature Store Development: Managed feature definitions, transformations, and serving with Tecton to provide consistent feature availability across training and production environments.
●Cloud & Infrastructure Management: Deployed and managed Tecton on AWS, GCP, and Azure, integrating it with Databricks, Snowflake, and MLflow for end-to-end ML model deployment.
●Performed data retrieval operations using Expedited, Standard, and Bulk Retrieval options based on business needs.
●Ensured compliance with regulatory and data retention policies using S3 Glacier’s immutable storage.
●Monitored and optimized storage costs by leveraging Storage Class Analysis and implementing tiered storage solutions.
●Established data governance and security policies using Databricks Unity Catalog, enforcing role-based access controls (RBAC).
●Experienced in Agile/Scrum methodology, providing ongoing support and maintenance for implemented data engineering projects.
●Utilized Python for seamless data transfer from on-premises clusters to Google Cloud Platform (GCP).
●Conducted in-depth analysis of business challenges to identify effective solutions, leveraging big data analytics, automation techniques, and data engineering best practices.
●Developed enterprise-level solutions, integrating batch processing with Apache Pig and real-time streaming using Spark Streaming, Apache Kafka, and Apache Flink.
●Completed a Proof of Concept (POC) for implementing a cloud data warehouse on BigQuery (GCP).
●Developed PySpark scripts for data processing in AWS Glue and EMR environments (illustrative sketch after this list).
●Experienced with orchestration and data pipelines, utilizing AWS Step Functions, AWS Data Pipeline, and AWS Glue.
●Built data pipelines in Azure Data Factory (ADF) using datasets, linked services, and pipelines for ELT processes with various sources including Azure SQL, Blob Storage, and Azure SQL Data Warehouse.
●Managed data migration from Teradata to Google Cloud Platform (BigQuery) during POC phase.
●Monitored and troubleshot Spark and Databricks clusters, estimating cluster size and optimizing performance.
●Processed massive volumes of data using Spark, calculating essential metrics and insights.
●Developed stored procedures, lookup transformations, Execute Pipeline activities, and data flows in Azure Data Factory (ADF).
●Created internal dashboards using Azure Synapse with PowerPivot and Power Query for metrics tracking.
●Proficient in Google Cloud Platform (GCP), with experience in BigQuery, Cloud Composer, Airflow, Cloud SQL, Cloud Storage, Cloud Functions, and Dataflow.
●Used Python to transfer data from on-premises systems to GCP.
●Developed ETL processes to move on-premises data to GCP BigQuery for data warehousing and analytics.
●Created, queried, and optimized tables in BigQuery for scalable data analysis.
●Used Google Cloud Storage (GCS), Dataflow, and DataProc for data processing and storage in GCP.
●Built and optimized real-time data pipelines in GCP with Apache Flink and Kafka.
●Completed POC for migrating Teradata workloads to GCP BigQuery.
●Hands-on experience with GCP services such as Cloud Composer and Airflow for orchestrating data pipelines.
●Automated infrastructure monitoring with Datadog dashboards, leveraging Terraform for deployment.
●Deployed Lambda functions integrated with API Gateway to handle data submission and processing on AWS.
●Built end-to-end ETL pipelines in Azure Data Factory (ADF) to extract, transform, and load data from various sources.
●Designed and implemented interactive dashboards with Azure Synapse, using PowerPivot and Power Query for data visualization and reporting.
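A minimal sketch of the kind of AWS Glue PySpark job referenced above; the Glue Data Catalog database, table, column names, and S3 paths are hypothetical placeholders, not the actual client pipeline.

```python
# Hypothetical AWS Glue PySpark job: read a catalog table, aggregate, and
# write partitioned Parquet to S3. All names are illustrative placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog (placeholder names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="finance_db", table_name="raw_transactions"
)

# Convert to a Spark DataFrame for SQL-style transformations.
df = dyf.toDF()
daily = (
    df.filter(F.col("status") == "POSTED")
      .groupBy("account_id", "txn_date")
      .agg(F.sum("amount").alias("daily_total"))
)

# Write the curated output back to S3 as date-partitioned Parquet.
(daily.write.mode("overwrite")
      .partitionBy("txn_date")
      .parquet("s3://example-curated-bucket/daily_totals/"))

job.commit()
```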
Environment: AWS Glue, S3, Graph DB, IAM, EC2, RDS, Flink, Redshift, Azure, Data Warehouse, GCP, Lambda, Boto3, Terraform, DynamoDB, Apache Spark, Kinesis, Athena, Hive, Sqoop, Python.
Client: Eastern Bank, Boston, MA March 2021 to April 2023
(Amplified Sourcing)
Role: Sr Data Engineer
Description: At Eastern Bank in Boston, MA, as a Senior Data Engineer, I was responsible for designing and implementing scalable data solutions to enhance the bank's analytics capabilities. I managed the complete Big Data flow, ingesting data into HDFS and optimizing complex ETL pipelines to ensure data accuracy and reliability for reporting and analysis. Utilizing technologies such as Apache Spark, Hive, and SQL, I processed and transformed large datasets to support real-time decision-making. I also worked extensively with Google Cloud Platform (GCP), leveraging services like BigQuery and Dataflow to create efficient and secure data architectures. Collaborating with cross-functional teams, including data scientists and business analysts, I developed advanced data models and real-time dashboards that provided critical insights into customer behavior and operational performance. My contributions significantly improved the bank's data-driven decision-making processes and overall operational efficiency.
Responsibilities:
●Collaborated with product teams to create various store-level metrics and supported data pipelines written in GCP’s big data stack, ensuring scalability and efficiency.
●Designed and implemented real-time and batch feature pipelines using Tecton, ensuring high availability and low-latency access for ML models.
●Developed feature transformations leveraging Tecton’s declarative framework, integrating Python, SQL, and Apache Spark.
●Deployed H2O AutoML models for automated hyperparameter tuning and model selection.
●Integrated H2O.ai models with Databricks & Snowflake, optimizing data transformations and predictions.
●Normalized data according to business needs by performing data cleansing, modifying data types, and applying various transformations using Apache Spark, Scala, and GCP Dataproc.
●Improved query performance by partitioning and clustering high-volume tables on key fields in BigQuery.
●Implemented scalable infrastructure and platform for large-scale data ingestion, aggregation, integration, and analytics in Hadoop using Apache Spark and Hive.
●Created scripts for data modeling, import, and export, with extensive experience in deploying, managing, and developing MongoDB clusters. Developed JavaScript for DML operations in MongoDB.
●Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL to support enterprise-scale analytics.
●Architected and implemented medium-to-large scale BI solutions on Azure using Azure Data Platform services such as Azure Data Lake, Azure Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, and NoSQL databases.
●Designed and executed migration strategies for traditional systems to Azure, utilizing methods such as Lift and Shift, Azure Migrate, and third-party tools.
●Configured and monitored shard sets in MongoDB, analyzing data distribution, selecting shard keys for optimal performance, and performing architecture and capacity planning for clusters. Implemented scripts for MongoDB import/export, dump, and restore operations.
●Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design. Managed sharded collections and shard keys based on project requirements, ensuring availability, performance, and scalability.
●Designed and developed data warehouse models using the Snowflake schema, optimizing them for analytics and reporting.
●Efficiently uploaded and downloaded data to and from GCP Cloud Storage using command-line tools and client libraries.
●Developed PySpark scripts for the migration of large datasets, ensuring minimal downtime and high performance. Analyzed SQL scripts and implemented solutions using PySpark.
●Used Spark SQL on top of PySpark to perform data cleansing, validation, transformations, and executed these programs with the Python API.
●Processed and loaded real-time and batch data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python (see the Dataflow sketch after this list).
●Developed reusable Apache Flink modules for serializing and deserializing AVRO data by applying schemas to ensure consistency.
●Indexed processed data and created dashboards and alerts in Splunk for operational support and actionable insights.
●Designed layered architectures in Hadoop, modularizing designs and developing framework scripts to accelerate development. Created reusable shell scripts for Hive, Sqoop, Flink, and Pig jobs, with standardized error handling, logging, and metadata management processes.
●Worked on Google Pub/Sub message partitioning and configured replication factors to improve reliability and performance.
●Demonstrated the ability to work seamlessly in both GCP and Azure cloud environments, leveraging their respective tools and services effectively.
●Gained hands-on experience in AWS services to support multi-cloud strategies and architecture.
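A minimal Apache Beam (Cloud Dataflow) sketch of the Pub/Sub-to-BigQuery streaming pattern referenced above; the project, topic, table, and field names are hypothetical placeholders, and runner options would be supplied at launch time.

```python
# Hypothetical Beam streaming pipeline: Pub/Sub JSON messages into BigQuery.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_message(message: bytes) -> dict:
    """Decode a JSON Pub/Sub payload into a BigQuery row dict (placeholder fields)."""
    record = json.loads(message.decode("utf-8"))
    return {
        "event_id": record.get("event_id"),
        "customer_id": record.get("customer_id"),
        "event_ts": record.get("event_ts"),
    }


def run():
    # --runner=DataflowRunner, --project, --region, etc. are passed at launch.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/example-project/topics/customer-events")
            | "ParseJson" >> beam.Map(parse_message)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "example-project:analytics.customer_events",
                schema="event_id:STRING,customer_id:STRING,event_ts:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()
```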
Environment: GCP Console, Cloud Storage, BigQuery, Dataproc, Apache Spark, PySpark, MongoDB, Data Warehouse, Hadoop, Hive, Flink, Scala, Cloud SQL, Snowflake, Shell Scripting, Pub/Sub, SQL Server 2016/2012, T-SQL, SSIS, Azure Data Lake, Azure Data Factory, Stream Analytics, Azure SQL DW, Azure HDInsight, Databricks, Power BI, PowerShell, Oracle, Teradata, Airflow, Splunk, Git, Docker.
Client: CSAA Ins Group, Walnut Creek, CA September 2017 to February 2021 (SmartSource Inc.)
Role: Data Engineer
Description: The project aimed to enhance CSAA's customer experience and increase customer satisfaction by leveraging data-driven insights to personalize interactions and services for its policyholders. The objective was to develop a customer behavior analysis and personalization platform that applied data engineering techniques to analyze customer interactions, preferences, and behaviors and deliver personalized recommendations and services.
Responsibilities:
●Performed Spark Streaming and batch processing using Scala and Python, ensuring high efficiency and scalability in data processing pipelines.
●Utilized Hive integrated with Spark for comprehensive data cleansing and transformation tasks.
●Developed and managed data pipelines using Scala, Kafka, and AWS Kinesis, enabling structured, processed, and transformed data workflows.
●Built scalable distributed data solutions using Amazon EMR clusters, specifically leveraging EMR 5.6.1 for optimized performance.
●Migrated Hive and MapReduce jobs from on-premises MapReduce to the AWS Cloud environment using EMR and Qubole for enhanced reliability and scalability.
●Conducted performance testing with Apache JMeter and developed dashboards using Grafana to visualize results and monitor system performance.
●Participated in creating advanced data and analytics-driven solutions, deploying scalable algorithms to drive predictive analytics using big data tools across AWS and Azure platforms.
●Designed and implemented real-time data feeds and microservices using AWS Kinesis, AWS Lambda, Kafka, and Spark Streaming to optimize analytics and improve customer experience (see the streaming sketch after this list).
●Enhanced data assessment confidence by leveraging and integrating existing tools and algorithms from multiple sources.
●Performed ETL testing using SSIS Tester, automating unit and integration testing to ensure data integrity and pipeline efficiency.
●Designed and built a robust SSIS/ETL framework from scratch to streamline data ingestion and transformation workflows.
●Utilized AWS Glue Catalog with crawlers to ingest data from Amazon S3 and perform SQL queries for actionable insights.
●Developed and deployed Spark and Scala code within a Hadoop cluster running on GCP, optimizing data processing capabilities.
●Conducted extensive testing of Snowflake, identifying the most efficient ways to utilize cloud resources for cost-effective analytics.
●Developed optimized Spark code using Python and Spark SQL to load, transform, and process large datasets efficiently.
●Implemented business-specific requirements by writing UDFs (User-Defined Functions) in Scala and PySpark.
●Developed JSON scripts to automate and deploy pipelines in Azure Data Factory (ADF), enabling seamless data integration and processing using SQL Activities.
●Processed and managed vast volumes of structured, semi-structured, and unstructured data for analytics and reporting.
●Created workflows using Apache Oozie to ingest data from various sources into Hadoop, ensuring smooth data pipeline execution.
●Developed and optimized Spark jobs using Scala and PySpark, leveraging Spark SQL for efficient querying and data transformation.
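A minimal PySpark sketch of the Kafka-based streaming pattern referenced above, shown here with Structured Streaming; the broker, topic, schema fields, and S3 paths are hypothetical placeholders.

```python
# Hypothetical PySpark Structured Streaming job: consume a Kafka topic and
# write windowed aggregates to S3. All names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("policy-events-stream").getOrCreate()

event_schema = StructType([
    StructField("policy_id", StringType()),
    StructField("event_type", StringType()),
    StructField("premium", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the raw Kafka stream; the value column arrives as bytes.
raw = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker-1:9092")
         .option("subscribe", "policy-events")
         .option("startingOffsets", "latest")
         .load()
)

# Parse the JSON payload, then aggregate premiums per event type in
# 5-minute event-time windows with a 10-minute watermark for late data.
events = raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e")).select("e.*")
windowed = (
    events.withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"), "event_type")
          .agg(F.sum("premium").alias("total_premium"))
)

query = (
    windowed.writeStream
            .outputMode("append")
            .format("parquet")
            .option("path", "s3a://example-bucket/streaming/premium_agg/")
            .option("checkpointLocation", "s3a://example-bucket/checkpoints/premium_agg/")
            .start()
)
query.awaitTermination()
```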
Environment: HDFS, Spark, Scala, PySpark, ADF (Azure Data Factory), Kafka, AWS, Pig, SBT, Sqoop, Maven, Zookeeper, AWS Glue, AWS Lambda, Amazon S3, Docker, Terraform, Snowflake, Apache Oozie, Grafana, Apache JMeter, SQL Server, SSIS.
Client: Radian Group, Philadelphia, PA Jan 2015 to August 2017
(NorthHill Technology Resources, LLC)
Role: Data Engineer
Project Description: At Radian Group in Philadelphia, PA, as a Data Engineer, I was responsible for designing and implementing robust software solutions to enhance the company's data processing capabilities. I developed and optimized Python-based applications that integrated with various data sources, enabling seamless data extraction, transformation, and loading (ETL) processes. Utilizing frameworks such as Django and Flask, I built scalable web applications that provided users with intuitive interfaces for data analysis and reporting. I also implemented automation scripts for regression testing using Selenium WebDriver, ensuring the reliability of our applications. My role involved collaborating with cross-functional teams to gather requirements and translate them into technical specifications, as well as leveraging libraries like Pandas and NumPy for data manipulation and analysis.
Responsibilities:
●Designed and built a custom and generic ETL framework using Apache Spark and Scala, enabling dynamic transformations and scalable data processing.
●Performed complex data transformations based on business requirements using Spark and Scala.
●Configured Spark jobs for weekly and monthly executions using AWS Data Pipeline, ensuring timely processing and reporting.
●Developed Python scripts for custom ETL processes, integrating with AWS Lambda for real-time trigger-based transformations.
●Executed complex SQL queries using Spark SQL for joins, aggregations, and validations in large-scale datasets.
●Designed and implemented data ingestion pipelines using Python, AWS Glue, and Amazon S3, integrating data into Amazon Redshift.
●Developed advanced Mapplets and complex transformations using Informatica to load data into Data Marts, Enterprise Data Warehouses (EDW), and Operational Data Stores (ODS).
●Created and optimized SSIS packages to dynamically process source file names using the For Each Loop Container.
●Leveraged SSIS Data Flow Transformations such as Lookup, Merge, Data Conversion, and Sort for efficient ETL processes.
●Automated data ingestion and validation pipelines using AWS S3, Python, and checksum validation for data integrity.
●Built independent reusable components for AWS S3 connections to extract and load data into Amazon Redshift, ensuring high throughput and reliability.
●Designed ETL workflows to dynamically adapt to changing source schemas using AWS Glue Catalog and Python.
●Developed and deployed AWS Lambda functions to integrate with SNS and trigger notifications for ETL job completions or errors (see the Lambda sketch after this list).
●Optimized ETL pipelines for both batch and real-time processing, leveraging AWS Kinesis, Lambda, and Python.
●Worked on data ingestion pipelines for large datasets using Spark, Scala, and Python, ensuring high performance and reliability.
●Collaborated with cross-functional teams to integrate Python scripts for advanced data preprocessing and enrichment.
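A minimal sketch of the Lambda-to-SNS notification pattern referenced above; the topic ARN, environment variable, and event fields are hypothetical placeholders.

```python
# Hypothetical AWS Lambda handler: publish an ETL completion/failure
# notification to an SNS topic via boto3. All names are placeholders.
import json
import os
import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ.get(
    "ETL_TOPIC_ARN",
    "arn:aws:sns:us-east-1:123456789012:etl-notifications",  # placeholder ARN
)


def lambda_handler(event, context):
    """Triggered at the end of an ETL job; relays the job status to SNS."""
    job_name = event.get("job_name", "unknown-job")
    status = event.get("status", "UNKNOWN")

    subject = f"ETL {job_name}: {status}"
    message = json.dumps({
        "job_name": job_name,
        "status": status,
        "details": event.get("details", {}),
    })

    # Publish so subscribers (email, SQS, etc.) are alerted to the outcome.
    response = sns.publish(TopicArn=TOPIC_ARN, Subject=subject, Message=message)
    return {"statusCode": 200, "messageId": response["MessageId"]}
```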
Environment: Apache Spark, Scala, Python, AWS Lambda, Amazon Redshift, Amazon S3, AWS Glue, Cassandra, Zeppelin, DBeaver, AWS Kinesis, SSIS, SQL Server, Informatica, Alteryx 11, Shell Scripting.
Educational Qualifications:
Master's: Computer Science, University of Bridgeport (2014 to 2015)
Bachelor's: Computer Science, Sri Vasavi Engineering College (May 2014)