
Data Engineer Senior

Location:
Arlington, TX
Posted:
September 10, 2025

Resume:

PREM KUMAR

SENIOR DATA ENGINEER

Email: *****.***@*****.***

Direct# +1-469-***-****

PROFESSIONAL SUMMARY:

Over 5 years of experience as a Cloud Data Engineer, specializing in building scalable, secure, and high-performance data pipelines across AWS, GCP, and Azure.

Proven expertise in Snowflake data warehousing, including automated data ingestion workflows using Snowflake SQL procedures, handling VARIANT/JSON/CSV formats, and performance tuning via partitioning, clustering, and materialized views. Skilled in building and orchestrating Azure Data Factory (ADF) pipelines for moving data between SQL Server, Snowflake, and other cloud platforms.

Expertise in designing end-to-end ETL workflows for moving structured and unstructured files across cloud systems using AWS Glue, Lambda, Step Functions, EventBridge, and GCP Dataflow.

Proficient in Python (including pandas), PySpark, and Apache Beam for data transformation, cleansing, profiling, and enrichment in both batch and streaming scenarios.

Hands-on experience with Terraform to automate infrastructure provisioning and enforce IaC best practices across cloud environments.

Extensive experience with data lakes and data warehouses including Amazon S3, GCS, Snowflake, BigQuery, Redshift, and Azure Data Lake.

Strong command of PostgreSQL for both operational analytics and transactional integration within cloud pipelines.

Skilled in orchestrating data workflows using Airflow, Cloud Composer, and AWS Step Functions for automation and dependency management.

Skilled in containerization and orchestration using Docker, Kubernetes, and Google Kubernetes Engine (GKE) for deploying scalable data processing applications.

Implemented enterprise-grade data security measures using IAM, DLP, encryption protocols, and fine-grained access controls.

Adept at real-time streaming with Kafka, Cloud Pub/Sub, Kinesis, and Apache Flink for building low-latency, fault-tolerant pipelines.

Experience building and exposing APIs for enterprise data consumption and enabling seamless integration with BI and reporting tools.

Proficient in monitoring and debugging data pipelines using AWS CloudWatch, Google Cloud Operations Suite, and custom alerting mechanisms.

Committed to continuous improvement of pipeline performance and cost optimization through query tuning, partitioning, clustering, and autoscaling.

Familiar with CI/CD practices using Git, Jenkins, GitHub Actions, and Bitbucket for managing code and deployment pipelines.

Effective collaborator with cross-functional teams including architects, analysts, DevOps, and security to deliver robust, scalable, and compliant data solutions.

Open to adopting new tools and frameworks; experienced with pandas for exploratory data analysis, debugging, and quick data validation.

TECHNICAL SKILLS:

Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Kafka Connect, Impala, StreamSets, Oozie, Airflow, Apache Calcite, ZooKeeper, Amazon Web Services, Microsoft Azure, Azure Synapse Analytics, Azure Data Lake Storage, Azure Data Factory, Azure Blob Storage, Microsoft Fabric.

Hadoop Distributions: Apache Hadoop 1.x/2.x, Cloudera CDP, Hortonworks HDP.

Languages: Python, Scala, Java, R, Pig Latin, HiveQL, Shell Scripting.

Software Methodologies: Agile, Waterfall (SDLC).

IDEs: Eclipse, NetBeans, IntelliJ IDEA, Spring Tool Suite.

Databases: MySQL, Oracle, DB2, PostgreSQL, DynamoDB, MS SQL Server, Snowflake (SQL procedures, VARIANT handling, performance optimization), Aurora.

NoSQL: HBase, MongoDB, Cassandra.

ETL/BI: Power BI, Tableau, Talend, Informatica, Looker, Hex.

Version Control: Git, SVN, Bitbucket.

Operating Systems: Windows (XP/7/8/10), Linux (Ubuntu), Unix, macOS.

Cloud Technologies: Amazon Web Services (EC2, S3, SQS, SNS, Lambda, EMR, CodeBuild, CloudWatch), GCP, Microsoft Dynamics 365.

PROFESSIONAL EXPERIENCE:

KEYBANK – DALLAS, TX. JUL 2024 – PRESENT

SENIOR DATA ENGINEER

DESCRIPTION: As a Senior Data Engineer at KeyBank, I specialize in designing and optimizing data pipelines within Google Cloud Platform (GCP) and Amazon Web Services (AWS). My expertise includes building scalable data solutions for batch and real-time processing using Cloud Dataflow, AWS Glue, Snowflake, AWS Athena, and Informatica. I have developed real-time streaming pipelines with Cloud Pub/Sub and Amazon Kinesis, optimized query performance with data partitioning and clustering, and utilized Google Cloud Storage and Amazon S3 for cost-effective storage.

I orchestrate workflows using Cloud Composer and AWS Step Functions, ensuring secure operations through IAM and DLP tools, while also implementing monitoring and alerting solutions with CloudWatch and Google Cloud Operations Suite. With Python and PySpark, I efficiently process large datasets and integrate Microsoft Dynamics 365 to enhance CRM automation, improve reporting, and streamline workflows.

Additionally, I have worked with Looker and Hex for data visualizations, dashboards, and business intelligence reports, providing real-time insights for decision-making. I implemented a semantic layer using Apache Calcite to ensure efficient data access, optimize queries, and deliver a unified view of data across disparate systems.

RESPONSIBILITIES:

Developed balancing routines and reconciliation processes for Program Integrity data marts, ensuring accuracy and completeness of financial claims datasets. Automated data flows between SQL Server and Snowflake using Azure Data Factory (ADF), integrating data validation and error-handling mechanisms.
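
For illustration, a minimal sketch of such a balancing check, comparing source and target row counts with pyodbc and the Snowflake Python connector (all connection settings and table names below are hypothetical placeholders):

# Hypothetical reconciliation check: compare SQL Server source vs. Snowflake
# target row counts. Connection parameters and table names are placeholders.
import pyodbc
import snowflake.connector

def sql_server_count(table: str) -> int:
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sql-host;DATABASE=claims;UID=user;PWD=secret"  # placeholder
    )
    with conn:
        return conn.cursor().execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def snowflake_count(table: str) -> int:
    conn = snowflake.connector.connect(
        account="my_account", user="user", password="secret",      # placeholder
        warehouse="ETL_WH", database="CLAIMS_DB", schema="PUBLIC",  # placeholder
    )
    try:
        cur = conn.cursor()
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]
    finally:
        conn.close()

if __name__ == "__main__":
    src, tgt = sql_server_count("dbo.claims"), snowflake_count("CLAIMS")
    if src != tgt:
        raise SystemExit(f"Reconciliation failed: source={src}, target={tgt}")
    print(f"Row counts match: {src}")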

Designed and deployed scalable data pipelines using GCP Cloud Dataflow, AWS Glue, and Apache Beam, supporting both batch and real-time processing of structured and unstructured data.
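
A minimal Apache Beam (Python SDK) batch pipeline of this kind is sketched below; the bucket paths and record layout are assumptions, and the same code can target Cloud Dataflow by supplying DataflowRunner options:

# Minimal Apache Beam batch pipeline; paths and the CSV layout are
# illustrative placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_csv(line: str) -> dict:
    # Assumes a simple "id,amount,status" layout for illustration.
    parts = line.split(",")
    return {"id": parts[0], "amount": float(parts[1]), "status": parts[2]}

def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://example-bucket/input/*.csv",
                                             skip_header_lines=1)
            | "Parse" >> beam.Map(parse_csv)
            | "KeepSettled" >> beam.Filter(lambda r: r["status"] == "SETTLED")
            | "Format" >> beam.Map(lambda r: f'{r["id"]},{r["amount"]}')
            | "Write" >> beam.io.WriteToText("gs://example-bucket/output/settled")
        )

if __name__ == "__main__":
    run()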

Migrated ETL workflows from legacy platforms such as Teradata and on-prem SQL systems to modern cloud-based solutions like BigQuery, Snowflake, and Redshift, ensuring zero data loss and optimal performance.

Built automated file-based ETL pipelines using AWS S3, Lambda, EventBridge, and Step Functions, enabling secure and efficient movement of data across cloud ecosystems.
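
A simplified sketch of the Lambda piece, assuming an S3 notification event shape and placeholder bucket names:

# Sketch of an S3-triggered Lambda handler (invoked via an S3 notification or
# an EventBridge rule); bucket names and prefixes are hypothetical.
import urllib.parse
import boto3

s3 = boto3.client("s3")
TARGET_BUCKET = "example-curated-bucket"  # placeholder

def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Copy the incoming file into the curated zone, then remove the original.
        s3.copy_object(
            Bucket=TARGET_BUCKET,
            Key=f"landing/{key}",
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3.delete_object(Bucket=bucket, Key=key)
    return {"status": "ok"}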

Developed and managed ETL processes using Informatica PowerCenter and Informatica Cloud (IICS) for high-volume data ingestion and transformation from enterprise systems into cloud data platforms.

Utilized Terraform for consistent, secure, and scalable infrastructure provisioning across AWS, GCP, and hybrid cloud environments.

Engineered robust, reusable data frameworks using Python (pandas), PySpark, and Scala, supporting transformation, validation, enrichment, and data quality assurance at scale.

Integrated PostgreSQL, Teradata, and NoSQL sources as both ingestion and output points in pipelines, enabling support for transactional and analytical workloads.

Developed event-driven real-time streaming pipelines using Cloud Pub/Sub, Amazon Kinesis, and Kafka, powering low-latency ingestion and insights.

Performed data quality checks using pandas, Informatica Data Quality (IDQ), and PySpark, with alerting via CloudWatch, Google Cloud Monitoring, and custom Slack/email notifiers.
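
As an example of the style of check involved, a small pandas sketch with assumed column names and thresholds:

# Illustrative pandas data-quality checks; column names and rules are
# assumptions, not the actual production logic.
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> list[str]:
    issues = []
    if df["claim_id"].isnull().any():
        issues.append("null claim_id values found")
    if df["claim_id"].duplicated().any():
        issues.append("duplicate claim_id values found")
    if (df["amount"] < 0).any():
        issues.append("negative amounts found")
    return issues

if __name__ == "__main__":
    sample = pd.DataFrame(
        {"claim_id": ["A1", "A2", None], "amount": [120.0, -5.0, 30.0]}
    )
    for issue in run_quality_checks(sample):
        print("ALERT:", issue)  # in production this would feed Slack/email alerting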

Implemented and maintained semantic data layers with Apache Calcite, unifying cross-system queries and improving logical data abstraction and performance.

Developed and optimized dashboards using Looker, Hex, and Power BI, supporting operational analytics, executive KPIs, and real-time business insights.

Built APIs and microservices to expose curated datasets to downstream applications, enabling real-time data consumption by internal platforms and client-facing products.

Implemented disaster recovery and cross-region backup workflows using AWS Backup, Cloud Storage Snapshots, and automated replication mechanisms.

Partnered with DevOps teams to enable CI/CD automation for ETL deployments using GitHub Actions, Terraform, and Jenkins, increasing agility and reducing deployment risks.

Documented complex data flows, built comprehensive metadata catalogs, and maintained lineage tracking using Collibra, Informatica EDC, and Data Intelligence Cloud (DIC) to support compliance and governance.

Conducted enablement sessions and internal workshops on Terraform, pandas, GCP Storage architecture, and Informatica integration best practices.

Utilized Fivetran for low-code/automated ingestion of SaaS application data (e.g., Salesforce, Microsoft Dynamics 365) into cloud data lakes and warehouses.

Configured and optimized Snowflake data models with partitioning, clustering, and materialized views, improving cost-efficiency and performance for analytical queries. Also built Snowflake SQL stored procedures for automated transformations, managing VARIANT/JSON data, and optimizing large financial data loads.
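
For illustration, a sketch of driving such Snowflake logic from Python; the procedure, stage, and table names are hypothetical, and the query shows typical VARIANT flattening:

# Sketch of invoking a Snowflake stored procedure and querying VARIANT data
# through the Snowflake Python connector; all object names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="user", password="secret",       # placeholder
    warehouse="ETL_WH", database="FINANCE_DB", schema="RAW",     # placeholder
)
try:
    cur = conn.cursor()
    # Hypothetical procedure that loads staged JSON files into RAW.TXN_VARIANT.
    cur.execute("CALL LOAD_TXN_FILES('@TXN_STAGE')")
    # Flatten the VARIANT column into typed columns for downstream modeling.
    cur.execute("""
        SELECT v:transaction_id::STRING  AS transaction_id,
               v:amount::NUMBER(18, 2)   AS amount,
               f.value:code::STRING      AS adjustment_code
        FROM RAW.TXN_VARIANT,
             LATERAL FLATTEN(input => v:adjustments) f
    """)
    for row in cur.fetchmany(10):
        print(row)
finally:
    conn.close()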

Orchestrated and scheduled complex workflows using GCP Cloud Composer and AWS Step Functions, integrating dependencies across source systems, APIs, and downstream storage layers.
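
A minimal Airflow DAG of the kind Cloud Composer schedules, with placeholder task logic and IDs:

# Minimal Airflow DAG; task bodies and identifiers are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**_):
    print("pull files from the source system")

def load(**_):
    print("load curated data into the warehouse")

with DAG(
    dag_id="example_daily_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task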

Provisioned and managed containerized workloads using GKE and EKS, enabling scalable processing for ML models and high-throughput batch pipelines.

Designed data models and partitioning strategies in BigQuery, AWS Redshift, and Snowflake, significantly improving query response times and storage efficiency.
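
As a sketch of the partitioning and clustering setup, using the google-cloud-bigquery client with an assumed project, dataset, and schema:

# Creates a date-partitioned, clustered BigQuery table; the table ID and
# schema below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.analytics.transactions",  # placeholder table ID
    schema=[
        bigquery.SchemaField("transaction_id", "STRING"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
        bigquery.SchemaField("event_date", "DATE"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(field="event_date")
table.clustering_fields = ["customer_id"]

client.create_table(table, exists_ok=True)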

Integrated Microsoft Dynamics 365 data with cloud pipelines using Informatica and Fivetran, supporting real-time analytics and business reporting.

Deployed machine learning models using Databricks MLflow, integrated with feature engineering pipelines built in PySpark, supporting predictive analytics at scale.

Leveraged Microsoft Fabric for implementing end-to-end ingestion, processing, and reporting pipelines, streamlining data flow and governance in a unified architecture.

ENVIRONMENT: Cloud Dataflow, BigQuery, Cloud Pub/Sub, Cloud Composer, Google Cloud Storage, Dataproc, Google Kubernetes Engine (GKE), Google Cloud Functions, Google Data Studio, Informatica, Cloud Armor, IAM, Data Loss Prevention (DLP), Terraform, SQL Server Integration Services (SSIS), Teradata, Google Cloud Operations Suite (formerly Stackdriver), AWS Glue, AWS Redshift, Amazon Kinesis, AWS S3, AWS EMR, Elastic Kubernetes Service (EKS), AWS Lambda, AWS WAF, AWS Step Functions, AWS CloudWatch, Apache Calcite, Apache Flink, Fivetran, Apache Beam, AWS CodePipeline, Hadoop, Python, Microsoft Azure, Microsoft Fabric, Microsoft Dynamics 365, Power BI, Looker, Hex, Databricks.

HSBC – GRAND RAPIDS, MI. JAN 2023 – JUN 2024

DATA ENGINEER

DESCRIPTION: As a Data Engineer at HSBC, I specialize in building and optimizing data pipelines to support financial data initiatives. I designed and implemented end-to-end data engineering solutions using Big Data technologies like Spark and Kafka to process large volumes of structured and unstructured data. I leveraged GCP tools such as BigQuery, Dataproc, and Cloud Composer to develop scalable pipelines for data ingestion, transformation, and storage. My role involved orchestrating workflows with Airflow, implementing real-time streaming solutions, and ensuring data security and compliance using IAM. Additionally, I focused on optimizing ETL workflows, managing data models, and enabling high-speed analytics for decision-making. I also utilized Hadoop for processing large-scale datasets and Python for automating workflows and creating efficient data transformations.

RESPONSIBILITIES:

Built ADF pipelines with balancing checks to extract, transform, and load medical claims data between on-prem SQL Server and Snowflake. Implemented validation scripts in Python to ensure complete and accurate data loads, supporting compliance and operational accuracy.

Engineered scalable ETL pipelines in GCP using Dataflow, Dataproc, and PySpark, processing high-volume financial datasets for both batch and streaming workloads.
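
A representative PySpark batch job is sketched below; bucket paths and column names are placeholders:

# Illustrative PySpark batch job of the kind run on Dataproc; paths and
# columns are assumed for the example.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims_batch_etl").getOrCreate()

claims = (
    spark.read.option("header", True).csv("gs://example-bucket/raw/claims/*.csv")
    .withColumn("amount", F.col("amount").cast("double"))
    .filter(F.col("amount").isNotNull())
)

daily_totals = (
    claims.groupBy("claim_date", "product_line")
    .agg(F.sum("amount").alias("total_amount"), F.count("*").alias("claim_count"))
)

daily_totals.write.mode("overwrite").parquet("gs://example-bucket/curated/daily_totals/")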

Integrated Teradata legacy systems with modern GCP pipelines, using BigQuery federated queries and Apache Beam to enable a hybrid analytics solution during cloud migration phases.

Automated ingestion from cloud storage, SFTP, and SaaS applications using Airflow, Fivetran, and Python, handling schema drift and ensuring compatibility with BigQuery and Snowflake targets.

Built near real-time data streams using Kafka, Cloud Pub/Sub, and Apache Flink, driving real-time dashboards for clinical operations and patient analytics.

Utilized Terraform and Cloud Composer to automate infrastructure provisioning and workflow orchestration, enforcing consistency across staging, dev, and prod environments.

Applied Apache Hudi to enable efficient upserts and change data capture (CDC) mechanisms in cloud data lakes, improving analytics freshness.
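
A sketch of a Hudi upsert from PySpark, assuming the Hudi Spark bundle is available on the cluster and using placeholder paths and key fields:

# CDC-style Hudi upsert; table path, record key, and precombine field are
# placeholders, and the Hudi Spark bundle must be on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hudi_upsert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

updates = spark.read.parquet("gs://example-bucket/cdc/claims_changes/")

hudi_options = {
    "hoodie.table.name": "claims",
    "hoodie.datasource.write.recordkey.field": "claim_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "claim_date",
    "hoodie.datasource.write.operation": "upsert",
}

(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("gs://example-bucket/lake/claims/"))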

Designed reporting datasets in Snowflake, implementing performance optimizations using clustering keys and materialized views for large analytical queries.

Used Informatica PowerCenter to extract and transform legacy financial data into GCP buckets, standardizing data formats and cleansing for downstream analytics.

Developed secure, tokenized RESTful APIs using Python/Flask and deployed via Cloud Run, enabling internal users to retrieve curated datasets on demand.
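
A stripped-down sketch of such a service, with a simplified bearer-token check and an in-memory stand-in for the curated data source:

# Minimal Flask service of the kind deployed to Cloud Run; the token check,
# dataset source, and route are simplified placeholders.
import os

from flask import Flask, jsonify, request, abort

app = Flask(__name__)
API_TOKEN = os.environ.get("API_TOKEN", "change-me")  # placeholder secret handling

# Stand-in for curated data that would normally come from BigQuery or Snowflake.
CURATED = {"daily_totals": [{"date": "2024-01-01", "total": 1532.75}]}

@app.route("/datasets/<name>")
def get_dataset(name: str):
    if request.headers.get("Authorization") != f"Bearer {API_TOKEN}":
        abort(401)
    if name not in CURATED:
        abort(404)
    return jsonify(CURATED[name])

if __name__ == "__main__":
    # Cloud Run supplies the PORT environment variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))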

Performed advanced data transformation prototyping in pandas and Jupyter, validating complex logic prior to production implementation.

Supported analytics and visualization efforts by collaborating with BI teams to fine-tune BigQuery SQL, resulting in reduced query cost and improved dashboard performance.

Created real-time and batch dashboards in Google Data Studio, Tableau, and Looker, supporting both executive and operational reporting.

Implemented robust security controls using GCP IAM, VPC Service Controls, and Data Loss Prevention (DLP).

Conducted cross-region failover testing for pipelines using GCS snapshots, ensuring RPO/RTO alignment for disaster recovery scenarios.

Maintained thorough documentation for pipeline SLAs, lineage (with Collibra integration), dependencies, and runbooks to support platform reliability and audit readiness.

Migrated ETL logic from Teradata BTEQ scripts to Python and Spark, ensuring functional parity, improved performance, and cloud compatibility.

ENVIRONMENT: Dataproc, BigQuery, Cloud Storage, Teradata, Cloud Composer, Cloud Pub/Sub, Cloud Functions, IAM, Data Factory, Data Lake, Synapse Analytics, Power BI, Spark, Informatica, Databricks, Flink, Fivetran, Kafka, Airflow, PySpark, Scala, Parquet, ORC, Python, SQL, SQL Server Integration Services (SSIS), Google Cloud Platform (GCP).

MYNTRA – BANGALORE, INDIA. MAR 2020 – JUL 2022

DATA ENGINEER

DESCRIPTION: As a Data Engineer at Myntra, I focused on developing and optimizing data pipelines to support the company's analytics and reporting infrastructure. My responsibilities included designing and implementing ETL workflows using Apache Spark and leveraging AWS services such as S3, EC2, RDS, and Redshift to process and store large-scale datasets. I implemented Kafka for real-time data streaming, ensuring timely insights for critical business use cases. Additionally, I collaborated with analytics teams to ensure data quality, consistency, and reliability, while automating workflows using Python and PySpark to enhance efficiency and reduce operational overhead.

RESPONSIBILITIES:

Developed Terraform scripts to automate the provisioning of AWS resources such as EC2, S3, Auto Scaling Groups, ELB, and CloudWatch Alarms.

Engineered data ingestion pipelines with Kafka to enable real-time data streaming and processed transactional data for fraud detection models.
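
For illustration, a minimal consumer using the kafka-python client, with placeholder brokers, topic, and a toy rule standing in for the fraud-detection model:

# Illustrative Kafka consumer feeding a downstream fraud check; brokers,
# topic, and the scoring rule are placeholders (kafka-python client).
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "transactions",                       # placeholder topic
    bootstrap_servers=["broker:9092"],    # placeholder brokers
    group_id="fraud-detection",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    txn = message.value
    # Toy rule standing in for the real fraud-detection model.
    if txn.get("amount", 0) > 10_000:
        print(f"flagging transaction {txn.get('id')} for review")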

Utilized AWS S3 for scalable data storage and created backup and archival strategies to ensure data durability and cost efficiency.

Processed large datasets using Spark and optimized performance by leveraging broadcast variables, partitioning, and efficient joins during the ingestion process.

Designed Hive UDFs and queries to preprocess data ingested into AWS S3, later loaded into Redshift for analytics. Implemented Talend for ETL processes to automate data extraction, transformation, and loading from various data sources into Databricks.

Integrated Talend with Azure Databricks to manage and orchestrate complex data pipelines and data integration workflows.

Ensured data security and compliance by implementing role-based access controls (RBAC) and data encryption for Azure Databricks workspaces.

Designed data models using star and snowflake schemas in Amazon Redshift, optimizing them for reporting and analytics use cases.

Optimized query performance and data modeling in Amazon Redshift, ensuring efficient analytics and reporting capabilities for large-scale datasets.

Automated data ingestion and ETL workflows using AWS Glue, enabling seamless integration of structured and semi-structured data into Redshift.

Created serverless ETL pipelines using AWS Glue, integrating data from S3 and loading it into Redshift for reporting.
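
A skeleton of such a Glue PySpark job is sketched below; the catalog database, Glue connection, and table names are placeholders:

# Glue job skeleton reading a cataloged S3 dataset and writing to Redshift;
# catalog, connection, and table names are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

orders = glue_context.create_dynamic_frame.from_catalog(
    database="ecommerce_raw", table_name="orders"
)

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=orders,
    catalog_connection="redshift-connection",   # placeholder Glue connection
    connection_options={"dbtable": "public.orders", "database": "analytics"},
    redshift_tmp_dir="s3://example-temp-bucket/glue/",
)

job.commit()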

Implemented batch and streaming ETL processes, optimizing cluster configurations in EMR for cost efficiency and performance.

Designed and implemented scalable ETL pipelines on AWS, using Spark for data transformation and Redshift for analytics and reporting.

Used AWS SNS to send notifications for ETL job status updates, ensuring smooth communication and monitoring during pipeline execution.
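
A small boto3 helper along these lines, with a placeholder topic ARN:

# Publishes an ETL status notification to SNS; the topic ARN is a placeholder.
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:etl-job-status"  # placeholder

def notify_job_status(job_name: str, status: str) -> None:
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f"ETL job {job_name}: {status}",
        Message=f"Job {job_name} finished with status {status}.",
    )

if __name__ == "__main__":
    notify_job_status("orders_daily_load", "SUCCEEDED")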

Leveraged NoSQL databases such as DynamoDB for storing and retrieving semi-structured and high-velocity transactional data, enabling fast access patterns and supporting scalable, low-latency applications within ETL pipelines.

Configured AWS Lambda with Aurora to develop serverless architectures, enabling efficient API proxy integrations.

Automated test strategies for data workflows, ensuring data quality and reliability in the AWS environment.

Developed custom Python scripts for data transformation, cleansing, and integration tasks to streamline ETL processes.

Engineered data models and optimized performance using PySpark for large-scale data transformation tasks, improving ETL job execution speed.

ENVIRONMENT: S3, Redshift, RDS, EC2, Lambda, SNS, Glue, EMR, Terraform, Kafka, Spark, PySpark, Hive, Scala, Python, SQL.


