
Data Engineer

Location: Jersey City, NJ
Posted: May 16, 2025

SAI ESWAR TALLAPANENI

***********@*****.*** +1-929-***-****

PROFESSIONAL SUMMARY:

Results-driven Senior Data & Cloud Engineering Professional with 9+ years of experience designing, implementing, and optimizing data solutions across multiple cloud platforms.

Expert in building scalable data pipelines and ETL/ELT workflows that process petabyte-scale datasets.

Proficient in developing and deploying end-to-end big data engineering solutions using Apache Spark, Hadoop, and cloud-native services.

Experienced in designing and implementing real-time data streaming architectures using Kafka, Spark Structured Streaming, and cloud messaging services.

Skilled in data warehouse design and implementation using Snowflake, Redshift, BigQuery, and Azure Synapse Analytics.

Strong background in cloud infrastructure architecture across AWS, Azure, and GCP environments.

Adept at creating and maintaining CI/CD pipelines for data engineering workflows using Jenkins, GitHub Actions, and Azure DevOps.

Experienced in containerization technologies including Docker and Kubernetes for microservices deployment.

Proficient in programming languages including Python, Scala, SQL, Java, and R for data manipulation and analysis.

Expert in data modeling techniques including dimensional modeling, data vault, and entity-relationship diagrams.

Skilled in implementing data governance frameworks and ensuring data quality and compliance.

Experience in designing and implementing data lakes using cloud storage solutions like S3, Azure Blob Storage, and Google Cloud Storage.

Knowledgeable in MLOps and machine learning pipeline development using MLflow, SageMaker, and Vertex AI (a brief MLflow tracking sketch follows this summary).

Proficient in orchestrating complex data workflows using Apache Airflow, AWS Step Functions, and Azure Data Factory.

Experience in implementing serverless architectures using AWS Lambda, Azure Functions, and Google Cloud Functions.

Strong skills in database administration and optimization across SQL and NoSQL databases.

Expertise in implementing data security best practices and ensuring GDPR, CCPA, HIPAA, and industry-specific compliance.

Skilled in performance tuning of big data applications and SQL queries for optimal efficiency.

Demonstrated ability to lead technical teams and mentor junior engineers on data engineering best practices.

Experience in implementing Infrastructure as Code (IaC) using Terraform for cloud resource provisioning.

Adept at translating business requirements into technical solutions and communicating complex technical concepts to stakeholders.

Proficient in data visualization techniques using Tableau, Power BI, and other BI tools.

Strong analytical skills for identifying patterns, trends, and insights from large datasets.

Skilled in creating and maintaining technical documentation for data products and pipelines.

Proficient in agile methodologies and project management tools for efficient delivery of data engineering projects.

Experience in designing and implementing disaster recovery and business continuity solutions for data platforms.

Knowledgeable in implementing data mesh and data fabric architectures for large enterprises.
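
To make the MLflow experience above concrete, the sketch below shows minimal experiment tracking for a single training run. It is illustrative only: the experiment name, model choice, and metric are assumptions, not details from this resume.

```python
# Minimal MLflow tracking sketch: log parameters, a metric, and the fitted
# model for one training run. Experiment name and model are illustrative only.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("demo-risk-model")  # hypothetical experiment name

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    mlflow.log_params(params)                 # hyperparameters for this run
    mlflow.log_metric("test_auc", auc)        # evaluation metric
    mlflow.sklearn.log_model(model, "model")  # persist the fitted model artifact
```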

TECHNICAL SKILLS:

Programming Languages: Python, Scala, SQL, Java, R, Shell Scripting

Python Libraries: Pandas, NumPy, scikit-learn, Matplotlib, Seaborn, PySpark, Beautiful Soup, PyTorch, TensorFlow

Big Data Technologies: Apache Spark, Hadoop, HDFS, MapReduce, Apache Beam, Apache Flink, HBase, Hive, Pig, Sqoop

Streaming & Messaging: Apache Kafka, Spark Structured Streaming, Azure Event Hubs, AWS Kinesis, RabbitMQ, Pub/Sub

ETL/ELT Tools: Airflow, Informatica, SSIS, dbt, AWS Glue, Databricks Delta Live Tables, DataStage, Microsoft Fabric

Cloud Platforms - AWS: S3, EMR, EC2, Lambda, Redshift, Glue, DynamoDB, CloudWatch, SageMaker, Step Functions

Cloud Platforms - Azure: Databricks, Data Factory, Event Grid, Functions, Blob Storage, Synapse Analytics, HDInsight

Cloud Platforms - GCP: BigQuery, Dataflow, Dataproc, GCS, Cloud Functions, Pub/Sub, Dataprep, Vertex AI

Data Warehousing/Lakes: Snowflake, AWS Redshift, Google BigQuery, Azure Synapse, Data Lake Design Patterns

Databases: MySQL, PostgreSQL, SQL Server, Oracle, MongoDB, Cassandra, DynamoDB, HBase, Neo4j

Data Modeling: Star Schema, Snowflake Schema, Data Vault, Dimensional Modeling, Entity-Relationship Modeling

Orchestration & DevOps: Apache Airflow, Kubernetes, Docker, Jenkins, GitHub Actions, Terraform, CI/CD Pipelines

Visualization Tools: Tableau, Power BI, QlikView, Looker, Google Data Studio, D3.js, Excel

Machine Learning & AI: Regression, Classification, Clustering, Feature Engineering, MLflow, Model Deployment

Data Formats: JSON, Avro, Parquet, ORC, CSV, XML, Protocol Buffers

Version Control & CI/CD: Git, GitHub, GitLab, Jenkins, Azure DevOps, BitBucket

PROFESSIONAL EXPERIENCE:

Standard Chartered Bank, New York, NY

Data Engineer Mar 2023 - Present

Responsibilities:

Led the design and implementation of a cloud-native data platform on AWS, enhancing data processing capabilities while reducing operational costs.

Architected and deployed a real-time fraud detection system using Kafka, Spark Structured Streaming, and AWS Lambda functions, reducing fraud incidents.

Designed and implemented data lake architecture on AWS S3 using data lake design patterns, enabling storage and analysis of petabytes of financial transaction data.

Created and maintained ETL/ELT pipelines using AWS Glue and dbt for transforming raw data into analytics-ready formats.

Implemented automated data quality checks and validation frameworks using custom Python libraries to ensure data integrity.

Developed a metadata management system using AWS Glue Data Catalog and custom solutions for data lineage tracking.

Optimized Spark jobs on EMR clusters, improving processing efficiency by 40% and reducing processing time for critical batch jobs from hours to minutes (a representative tuning sketch follows this section).

Designed and implemented a data mesh architecture to enable domain-oriented data ownership and governance.

Optimized Flink jobs for low latency and fault tolerance, leveraging keyed state, windowing, and checkpointing mechanisms.

Developed Dataflows and Pipelines in Microsoft Fabric to ingest data from core banking systems, transaction logs, CRM platforms, and regulatory feeds with built-in transformations and data quality rules.

Implemented serverless data processing workflows using AWS Step Functions and Lambda for cost-efficient execution of intermittent workloads.

Developed custom PySpark libraries for standardized data processing patterns across the organization.

Designed and implemented Snowflake data warehouse architecture, including security controls, resource management, and optimization.

Created a centralized logging and monitoring solution using CloudWatch, Elasticsearch, and custom dashboards for real-time insight into data pipeline health.

Implemented data governance controls and PII data handling procedures in compliance with financial regulations and internal policies.

Led the migration of on-premises Hadoop workloads to AWS EMR, reducing infrastructure costs by 55% while improving scalability.

Implemented normalized PostgreSQL schemas and optimized complex SQL queries for real-time financial reporting and compliance auditing.

Designed and implemented ML pipelines using SageMaker for credit risk modeling and customer segmentation.

Created data visualization solutions using Tableau connected to Redshift and S3 for executive dashboards and business intelligence.

Implemented encryption and security controls for data at rest and in transit across all data platforms.

Led development of high-performance PL/SQL procedures and packages to process real-time and batch trading data for reconciliation and audit trails.

Designed and implemented a data catalog solution for self-service data discovery using AWS Glue Data Catalog and custom metadata repositories.

Optimized Redshift cluster performance through distribution key selection, sort key optimization, and query performance tuning.

Implemented robust CI/CD pipelines for data engineering and ML workflows using tools like AWS CodePipeline, Jenkins, and Git, ensuring version control, automated testing, and smooth deployment of ML models.

Developed automated testing frameworks for data pipelines using pytest and custom validation tools.

Implemented Infrastructure as Code using Terraform for automated provisioning and configuration of AWS resources.

Led cross-functional teams in designing and implementing data solutions for regulatory reporting and compliance initiatives.
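
As a companion to the Spark optimization bullet above, here is a minimal PySpark sketch of the kind of tuning typically involved on EMR: broadcasting a small dimension table, relying on adaptive query execution, and controlling output file counts. The bucket paths, table layout, and column names are hypothetical, not drawn from this role.

```python
# Minimal PySpark tuning sketch: broadcast the small lookup table, enable AQE,
# and repartition before writing to limit small files. Paths and columns are
# hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("transaction-batch")
    .config("spark.sql.adaptive.enabled", "true")  # adaptive query execution
    .getOrCreate()
)

transactions = spark.read.parquet("s3://example-bucket/raw/transactions/")  # large fact data
merchants = spark.read.parquet("s3://example-bucket/raw/merchants/")        # small dimension

daily_totals = (
    transactions
    .join(F.broadcast(merchants), "merchant_id")   # broadcast join avoids shuffling the large side
    .groupBy("merchant_id", F.to_date("txn_ts").alias("txn_date"))
    .agg(F.sum("amount").alias("total_amount"))
)

(
    daily_totals
    .repartition("txn_date")                       # limit small output files per date partition
    .write.mode("overwrite")
    .partitionBy("txn_date")
    .parquet("s3://example-bucket/curated/daily_totals/")
)
```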

Target, Minneapolis, MN

Data Engineer Jan 2021 - Feb 2023

Responsibilities:

Designed and implemented a cloud data platform on Azure to support retail analytics and inventory management systems.

Developed and maintained data pipelines using Azure Data Factory and Databricks Delta Live Tables for ETL/ELT operations.

Architected real-time data streaming solutions using Azure Event Hubs and Spark Structured Streaming for processing point-of-sale and inventory data.

Implemented a modern data warehouse using Azure Synapse Analytics with optimized star schema design for retail analytics.

Created a data lake architecture on Azure Blob Storage with bronze, silver, and gold layer organization for different data processing stages.

Led the design and development of real-time streaming data pipelines using Apache Flink, enabling high-throughput, low-latency processing of event-driven data.

Developed machine learning pipelines using Azure Databricks and MLflow for demand forecasting and product recommendation engines.

Implemented automated orchestration workflows using Apache Airflow on Azure for scheduling and monitoring data processing jobs (an illustrative DAG sketch follows this section).

Designed and deployed a containerized microservices architecture using Docker and Azure Kubernetes Service for data processing applications.

Developed and optimized ETL mappings using Informatica PowerCenter for data extraction from Oracle and flat files into the data warehouse.

Created custom Python and Scala libraries for standardized data processing across the organization.

Implemented data quality frameworks using Great Expectations and custom validation rules to ensure accuracy of retail data.

Migrated legacy ETL processes from DataStage to Azure Data Factory, optimizing data workflows and enhancing pipeline performance in cloud environments.

Developed Power BI dashboards and reports for sales analytics, inventory management, and executive decision support.

Optimized Spark jobs on Databricks, reducing processing costs by 30% while improving performance for large-scale data transformations.

Designed and implemented a metadata management system for data lineage tracking and impact analysis.

Created a unified customer data platform integrating online and in-store purchase data using Azure data services.

Implemented security controls and row-level security in Azure Synapse Analytics for appropriate data access management.

Developed CI/CD pipelines using Azure DevOps for automated testing and deployment of data solutions.

Created a data governance framework ensuring compliance with retail industry standards and privacy regulations.

Implemented automated monitoring and alerting for data pipeline health using Azure Monitor and Log Analytics.

Developed strategies for handling and processing semi-structured data from various retail systems using Spark and Azure functions.

Implemented a feature store for machine learning models using Delta Lake to standardize feature engineering.

Created automated documentation generation for data pipelines and data models using custom Python tools.

Implemented a self-service data platform using Azure Synapse Analytics workspaces for business analysts.

Developed custom connectors for integrating legacy retail systems with the modern cloud data platform.

Implemented cost monitoring and optimization strategies for Azure resources, reducing monthly cloud costs by 25%.

Led the implementation of a disaster recovery strategy for critical retail data assets with defined RPO and RTO metrics.
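
To illustrate the Airflow orchestration bullet above, the following is a minimal DAG sketch with a daily extract-then-transform dependency. The DAG id, schedule, and task callables are hypothetical placeholders; a production pipeline would invoke Data Factory or Databricks operators rather than plain Python functions.

```python
# Minimal Airflow DAG sketch: a daily extract -> transform chain. The DAG id,
# schedule, and callables are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_sales():
    # Placeholder: pull point-of-sale extracts into the landing zone.
    print("extracting sales data")


def transform_sales():
    # Placeholder: trigger the Spark/Databricks transformation step.
    print("transforming sales data")


with DAG(
    dag_id="retail_sales_daily",     # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_sales", python_callable=extract_sales)
    transform = PythonOperator(task_id="transform_sales", python_callable=transform_sales)

    extract >> transform             # transform runs only after a successful extract
```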

Mastercard, New York, NY

Data Engineer Sep 2018 - Nov 2020

Responsibilities:

Designed and implemented a hybrid cloud data platform leveraging Google Cloud Platform services for payment processing analytics.

Developed streaming data pipelines using Apache Kafka, Google Pub/Sub, and Dataflow for real-time transaction processing (an illustrative Beam pipeline sketch follows this section).

Architected and built a data lake solution on Google Cloud Storage using Avro and Parquet file formats for efficient storage and querying.

Implemented ETL workflows using Apache Beam for processing payment transaction data from multiple global sources.

Created and maintained BigQuery data warehouse for analytics, reporting, and ad-hoc querying of financial transaction data.

Developed custom machine learning models using TensorFlow and Vertex AI for fraud detection and risk assessment.

Implemented data quality validation frameworks using custom Python libraries and BigQuery stored procedures.

Created data visualization dashboards using Looker connected to BigQuery for business intelligence and executive reporting.

Designed and implemented a global data replication strategy across multiple GCP regions for disaster recovery.

Developed containerized data processing applications using Docker and GKE (Google Kubernetes Engine).

Implemented security controls including encryption, access management, and audit logging for PCI compliance.

Created automated data pipeline orchestration using Cloud Composer (managed Apache Airflow) for scheduling and monitoring.

Developed data models and schemas optimized for financial transaction analytics in BigQuery.

Implemented change data capture (CDC) processes using Dataflow for synchronizing database changes to the data lake.

Created custom Python libraries for standardized data transformation and validation across multiple payment systems.

Developed serverless data processing workflows using Cloud Functions for event-driven architectures.

Implemented data governance controls and policies for handling sensitive payment card information in compliance with industry regulations.

Created automated testing frameworks for data pipelines and models using pytest and custom validation tools.

Designed and implemented a metadata management system for data lineage tracking and impact analysis.

Optimized BigQuery performance through partitioning, clustering, and query optimization techniques.

Developed CI/CD pipelines using Jenkins and Google Cloud Build for automated testing and deployment.

Implemented cost monitoring and optimization strategies for GCP resources, reducing cloud expenditure by 30%.

Created comprehensive technical documentation and runbooks for all data engineering solutions.

Developed disaster recovery procedures and conducted regular disaster recovery testing for critical data systems.

Led cross-functional teams in implementing data solutions for regulatory reporting and compliance initiatives.
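
To illustrate the streaming pipelines described above, here is a minimal Apache Beam (Python SDK) sketch that reads events from Pub/Sub, windows them, and counts per key; with the appropriate runner options it could run on Dataflow. The subscription path and field names are hypothetical, and the final print step stands in for a real sink such as BigQuery.

```python
# Minimal Apache Beam (Python SDK) streaming sketch: read JSON events from
# Pub/Sub, window them, and count events per merchant. Subscription path and
# field names are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/example-project/subscriptions/transactions"  # hypothetical
        )
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "KeyByMerchant" >> beam.Map(lambda event: (event["merchant_id"], 1))
        | "OneMinuteWindows" >> beam.WindowInto(FixedWindows(60))
        | "CountPerMerchant" >> beam.CombinePerKey(sum)
        | "Emit" >> beam.Map(print)  # stand-in for a BigQuery or GCS sink
    )
```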

AT&T, Dallas, TX

Big Data Engineer Mar 2016 - Aug 2018

Responsibilities:

Designed and implemented on-premises Hadoop data platform for processing and analyzing network telemetry and customer data.

Developed ETL pipelines using MapReduce, Pig, and Hive for transforming raw network logs into structured analytics datasets.

Created and maintained HBase and Cassandra NoSQL databases for storing and querying high-volume telecommunications data.

Implemented Apache Kafka for real-time streaming of network performance metrics and customer usage patterns.

Developed Spark applications in Scala and Python for processing telecommunications data at scale (an illustrative log-parsing sketch follows this section).

Created and optimized SQL queries for data extraction and analysis from relational databases including Oracle and MySQL.

Implemented data quality checks and validation frameworks using custom Java and Python scripts.

Developed shell scripts for automating data ingestion and processing workflows on Hadoop clusters.

Created data visualization solutions using Tableau connected to Hadoop data sources for network performance monitoring.

Implemented security controls including Kerberos authentication and encryption for Hadoop ecosystem components.

Developed and maintained Sqoop jobs for data transfer between relational databases and HDFS.

Created and optimized Hive tables and partitioning strategies for efficient querying of large telecommunications datasets.

Implemented automated monitoring and alerting for Hadoop clusters using Nagios and custom scripts.

Developed data models for telecommunications data in both relational databases and Hadoop ecosystem.

Created comprehensive documentation for data pipelines, schemas, and cluster configurations.

Implemented disaster recovery procedures for Hadoop clusters including regular backups and failover testing.

Developed custom UDFs (User Defined Functions) in Hive and Pig for specialized data transformations.

Created and maintained continuous integration pipelines for deploying code to production Hadoop environments.

Implemented data archiving and retention policies for managing the lifecycle of telecommunications data.

Developed capacity planning and scaling strategies for Hadoop clusters based on projected data growth.

Created custom dashboards for monitoring cluster performance and resource utilization.

Implemented data partitioning and bucketing strategies in Hive for optimizing query performance.

Developed automated data ingestion workflows for streaming log files from network devices to HDFS.

Created and maintained data dictionaries and metadata repositories for telecommunications data elements.

Led the initial proof-of-concept and planning for cloud migration of on-premises Hadoop workloads to AWS.
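
As an illustration of the Spark log-processing work above, the sketch below parses raw network log lines and writes them out as a partitioned Hive table. The log format, regular expressions, HDFS path, and table name are all hypothetical assumptions.

```python
# Minimal PySpark sketch: parse raw network log lines with regexp_extract and
# write a partitioned Hive table. Log layout, regexes, paths, and table name
# are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("network-log-parse")
    .enableHiveSupport()
    .getOrCreate()
)

raw = spark.read.text("hdfs:///data/raw/network_logs/")  # one log line per row in "value"

parsed = (
    raw.select(
        F.regexp_extract("value", r"^(\S+)", 1).alias("device_id"),
        F.regexp_extract("value", r"\[(\d{4}-\d{2}-\d{2})", 1).alias("event_date"),
        F.regexp_extract("value", r"status=(\d+)", 1).cast("int").alias("status_code"),
    )
    .where(F.col("device_id") != "")                     # drop unparseable lines
)

(
    parsed.write
    .mode("overwrite")
    .partitionBy("event_date")                           # partitioned for efficient Hive queries
    .saveAsTable("telemetry.network_events")             # hypothetical Hive database.table
)
```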

EDUCATION:

Master of Science in Computer Engineering

University of Cincinnati Dec 2016

CERTIFICATIONS:

Google Cloud Certified - Professional Data Engineer

AWS Certified Solutions Architect - Associate

Certified Kubernetes Administrator


