Koushik Peddi
682-***-**** ****************@*****.*** USA LinkedIn
PROFESSIONAL SUMMARY
• Data Engineer with 4+ years of experience delivering scalable, cloud-native data solutions across finance, retail, and higher education, with a focus on real-time streaming, batch processing, and automated ETL pipelines.
• Proficient in building distributed data systems using GCP (BigQuery, Dataflow, Pub/Sub), Apache Spark, Kafka, and Airflow to support mission-critical analytics, fraud detection, and machine learning workflows.
• Skilled in Python, SQL, and Scala for designing transformation logic, ensuring data quality, and implementing governance practices using Data Catalog, encryption, and access control models.
• Adept at collaborating with cross-functional teams to optimize performance, reduce infrastructure cost, and deliver actionable insights through robust, compliant, and production-grade pipelines.
TECHNICAL SKILLS
Big Data & Distributed Computing: Apache Spark, Apache Flink, Hadoop, Dataproc, EMR, HDFS, Hive, HBase, Presto, Trino
Cloud Platforms: Google Cloud Platform (BigQuery, Dataflow, Cloud Composer, Pub/Sub, Dataform), AWS (S3, Redshift, Glue, MSK, Kinesis), Azure Data Factory
ETL & Workflow Orchestration: Apache Airflow, Apache Beam, Cloud Composer, Luigi, DBT, NiFi
Data Processing & Programming: PySpark, Spark SQL, Scala, Python, SQL, Bash, Shell Scripting
Data Modeling & Warehousing: Snowflake, BigQuery, Redshift, Star/Snowflake Schemas, Partitioning, Clustering, Schema Evolution
Streaming & Real-Time Processing: Apache Kafka, Spark Structured Streaming, GCP Pub/Sub, AWS Kinesis, Flink
DevOps & CI/CD: Git, GitHub, GitLab, Cloud Build, GitHub Actions, Terraform (IaC)
Data Governance & Security: Data Catalog, Cloud Logging, IAM Policies, Row-Level Security, Encryption (at-rest/in-transit)
Monitoring & Observability: Stackdriver, Prometheus, Grafana, ELK Stack
Databases: PostgreSQL, MySQL, Oracle, NoSQL (MongoDB, Cassandra)
Business Intelligence & Visualization: Tableau, Power BI, Looker, Google Data Studio
PROFESSIONAL EXPERIENCE
Data Engineer Sep 2024 - Present
American Express Phoenix, AZ (Onsite)
• Automated the ingestion and transformation of high-volume financial transactions by building Python-based ETL pipelines in Apache Airflow, which reduced manual dependencies and enabled consistent, SLA-driven data delivery across analytics layers.
• Engineered partitioned and clustered BigQuery tables to organize multi-terabyte datasets, optimizing cost-efficiency and accelerating credit risk and spend pattern queries used by analytics teams.
• Authored scalable PySpark scripts for cleansing and joining datasets from internal systems and third-party APIs, enabling enriched data flow to downstream pipelines powering fraud detection models.
• Implemented SLA thresholds, alerting policies, and failure recovery logic within Airflow DAGs, ensuring robust orchestration and cutting critical job downtime by over 70% during peak usage cycles.
• Leveraged Data Catalog to enforce metadata tagging, table ownership, and data classification across BigQuery assets, streamlining governance for audit teams and enabling better data discoverability.
• Optimized Dataproc-based Spark SQL workloads by applying caching strategies, repartitioning large datasets, and selecting join algorithms that minimized shuffle operations, resulting in 2x faster transformations.
• Partnered with machine learning engineers to deliver curated feature sets from transactional logs, shortening model training lead time and improving accuracy by ensuring timely delivery of labeled datasets.
• Tuned Airflow worker configurations and scheduled job batches using dynamic resource pools and priority queues, helping reduce GCP infrastructure costs while maintaining end-to-end pipeline performance.
Data Engineer Sep 2023 - Aug 2024
ValueLabs Texas (Remote)
• Built real-time analytics pipelines using GCP Pub/Sub and Apache Beam on Dataflow to process streaming customer and fraud signals from APIs and webhooks.
• Automated batch transformations of log data using Apache Spark on Dataproc, enabling hourly ingestion into BigQuery and improving report freshness by 75%.
• Developed DAGs in Cloud Composer to schedule and orchestrate both streaming and batch workflows across GCS, BigQuery, and Pub/Sub topics.
• Applied schema enforcement, type casting, and transformation layers to ensure structured ingestion of raw JSON and CSV feeds.
• Built Scala and PySpark jobs for business logic enforcement and aggregation, enabling better product segmentation and campaign performance analytics.
• Integrated GitHub Actions with Cloud Build to automate deployment of Beam jobs, reducing manual errors and accelerating feature release cycles.
• Built analytical data marts in Snowflake by transforming curated datasets using SQL and scheduled Snowpipe loads, enabling near real-time reporting for campaign performance and customer engagement metrics.
• Applied role-based access controls, row-level security, and dynamic data masking within Snowflake to protect sensitive customer attributes, ensuring full compliance with organizational data privacy standards.
Student Assistant - Data Engineering Aug 2022 - May 2023
The University of Texas at Arlington
• Built real-time ingestion pipelines using Amazon Kinesis Data Streams and AWS Lambda to capture transactional data for fraud monitoring, reducing response time to critical events across campus systems.
• Managed pipeline orchestration using AWS Step Functions and Glue Workflows, improving data delivery transparency and ensuring reliable handoffs between batch and streaming stages.
• Developed scalable Spark jobs on AWS EMR to transform structured and semi-structured log data before loading into Snowflake, supporting high-volume analytics with optimized query performance.
• Applied schema detection and enforcement using AWS Glue Crawlers and metadata tagging, improving consistency in incoming academic datasets and preventing ingestion disruptions.
• Automated CI/CD deployments through GitHub Actions and AWS CodePipeline, enabling the data team to release transformation scripts and infrastructure changes with higher confidence and reduced rollout time.
• Authored modular PySpark scripts and Snowflake SQL transformations to generate research and enrollment KPIs, empowering analysts to deliver insights without manual data prep.
• Implemented role-based access control and column masking within Snowflake to safeguard sensitive student information, aligning data usage policies with FERPA and institutional governance standards.
• Streamed behavioral indicators using Spark Structured Streaming on EMR and published enriched datasets to Snowflake, enabling live performance dashboards for deans and department leads.
Data Engineer Aug 2019 - Dec 2021
Cognizant Technology Solutions Hyderabad, India
• Transitioned legacy ETL workflows to Azure by integrating Data Factory, Azure Functions, and Synapse Analytics, which streamlined refresh schedules and reduced ongoing maintenance by 40%.
• Leveraged Azure Databricks and Spark to build scalable batch processing solutions for financial transaction data, decreasing transformation latency and improving system throughput across reporting cycles.
• Automated orchestration of data ingestion from Blob Storage to Synapse using Data Factory pipelines, enabling SLA-aware retry mechanisms and consistent pipeline behavior during failure recovery.
• Designed parameter-driven Python scripts to enforce schema integrity, eliminate duplication, and ensure clean ingestion into Synapse tables, improving downstream reporting consistency and audit reliability.
• Converted legacy Hive queries into optimized Spark SQL jobs using caching strategies, custom partitions, and join tuning in Databricks, reducing complex query runtimes in business-critical dashboards.
• Enforced enterprise data security by implementing dynamic row-level access policies and column encryption in Synapse, supporting both governance controls and GDPR-aligned compliance.
• Delivered internal knowledge sessions and authored reusable framework documentation for Azure-based data orchestration, which improved adoption and delivery efficiency across cloud migration projects.
• Architected a real-time fraud monitoring pipeline using Azure Stream Analytics and Event Hubs, optimizing ingestion performance to support immediate event detection and reactive analytics for high-velocity streams.
EDUCATION
The University of Texas at Arlington Dec 2023
Master of Science in Information Systems
VNR Vignana Jyothi Institute of Engineering and Technology, India May 2020
Bachelor of Technology in Electronics and Instrumentation Engineering
CERTIFICATIONS & TRAINING
• Google Cloud Professional Data Engineer - Coursera
• Python for Data Engineering - Coursera
• Apache Spark with Scala - Udemy
PROJECTS
Fraud Detection Real-Time Streaming Pipeline
Tech Stack: Apache Kafka, Spark Streaming, Scala, GCP Pub/Sub, BigQuery
• Developed a real-time streaming pipeline in Spark using Scala and Kafka to process transaction events, enabling continuous fraud signal ingestion and delivering alerts to downstream systems within seconds.
• Applied time-windowed aggregations, schema validation, and anomaly detection logic to classify outlier behavior and compute fraud risk scores pushed to BigQuery for further modeling.
• Integrated the pipeline into Airflow with SLA monitoring, dynamic retry triggers, and Slack-based incident alerts, ensuring continuous observability and response handling across all stages.
Data Lakehouse Migration & Optimization
Tech Stack: Apache Spark, GCP BigQuery, Airflow, Python, Data Catalog
• Executed end-to-end migration of raw, processed, and curated data layers from on-prem HDFS storage to GCP using Cloud Storage and BigQuery, allowing the organization to unify reporting and analytics under a scalable lakehouse architecture.
• Rewrote batch transformations in PySpark for improved modularity and applied BigQuery partitioning and clustering to cut query scan size by over 40% across business-critical dashboards.
• Configured metadata management using Data Catalog by applying table tags, owners, and lineage tracking, improving asset discoverability and compliance for internal audit and regulatory teams.