PROFESSIONAL SUMMARY
Data Engineer with over * years of IT experience across diverse industries, specializing in designing and developing scalable big data pipelines, data lakes, and ETL/ELT workflows on cloud platforms like AWS, Azure, and GCP.
Proven expertise with Apache Spark, PySpark, Apache Beam, Kafka, Hive, Airflow, and DataStage, integrating large-scale data from RDBMS, NoSQL, flat files, and APIs.
Hands-on experience with Hadoop ecosystem (HDFS, YARN, Sqoop, Hive, Pig, HBase, Impala), and major Hadoop distributions including Cloudera, Hortonworks, and Amazon EMR.
Skilled in data ingestion, modeling (Star/Snowflake), performance tuning, and processing structured and semi-structured data using Spark SQL, MapReduce, Spark Streaming, and Spark MLlib.
Proficient in programming with Python, Scala, SQL, Java, and UNIX Shell, with practical exposure to machine learning and predictive analytics using libraries such as Scikit-learn, TensorFlow, and Keras, and experience implementing models such as Linear/Logistic Regression, Decision Trees, Random Forest, and K-Means Clustering.
Well-versed in cloud-native data services such as AWS Glue, Redshift, Azure Data Factory, GCP BigQuery, Dataflow, and Pub/Sub, as well as orchestration and data movement tools such as Apache Airflow, SSIS, and Flume.
Adept at working with NoSQL databases (MongoDB, HBase, Cassandra) and traditional RDBMS (Oracle, MySQL, SQL Server, DB2).
Experienced in data visualization and reporting using Tableau, Power BI, Cognos, and Excel (Pivot Tables, VLOOKUP, Macros). Collaborative and agile-focused team player familiar with DevOps tools (Git, Jenkins, Docker, Kubernetes) and delivery frameworks like Agile/Scrum via Jira and Confluence.
TECHNICAL SKILLS
Hadoop/Big Data Technologies: HDFS, Hive, Pig, Sqoop, Yarn, Spark, PySpark, Spark SQL, Apache Beam, Kafka, Flink
Hadoop Distributions: Hortonworks, Cloudera Hadoop, Amazon EMR, Google Cloud Dataproc
Cloud Platforms: AWS (S3, Redshift, Glue, EC2, Data Pipeline, IAM, Lambda, Kinesis), Azure (ADF, Data Lake, SQL, Synapse, Databricks, Event Hubs), GCP (BigQuery, Cloud Storage, Dataflow, Pub/Sub, Cloud SQL, Cloud Functions, Stackdriver, Dataproc)
Languages: Python, Scala, R, SQL, PL/SQL, T-SQL, UNIX Shell Script, C, C++, COBOL
Databases: Teradata, Oracle 9i/10g, DB2, SQL Server, MySQL 4.x/5.x, PostgreSQL, Snowflake, MongoDB, BigQuery, Redshift, Redis
Query Engines & Lakehouse: Presto, Trino, Apache Iceberg, Delta Lake, Scuba
ETL Tools: IBM InfoSphere Information Server V8, V8.5, V9.1, Apache Airflow, Azure Data Factory, AWS Glue, SSIS, Cloud Dataflow, DataStage, Apache NiFi, Talend, Informatica, Pentaho
Reporting & Visualization: Tableau, Power BI, Looker, Cognos, SSRS, Excel (Pivot Tables, VLOOKUP, Macros), R Shiny, ggplot2, Matplotlib, Seaborn
DevOps & CI/CD Tools: Git, GitHub, Jenkins, GitHub Actions, Docker, Kubernetes, Terraform, GitLab
Machine Learning & Analytics: Scikit-learn, TensorFlow, Keras, Keras2DML, Pandas, NumPy, Logistic Regression, Decision Trees, Random Forest, Naive Bayes, MLflow, Feature Store
Tools: Teradata SQL Assistant, PyCharm, Autosys, Erwin Data Modeler, Cloud Shell, bq, gsutil
Operating Systems & Orchestration: Linux, Unix, z/OS, Windows, Luigi, Prefect, Oozie
Project Management & Collaboration: Jira, Confluence, SharePoint, Agile/Scrum
SOFT SKILLS: Strong Analytical & Problem-Solving Skills, Effective Communication, Agile & Scrum Practices, Stakeholder Management, Adaptability, Attention to Detail, Time Management, Mentoring & Knowledge Sharing, Teamwork, Confidentiality
PROFESSIONAL EXPERIENCE
Wells Fargo, Texas Jan 2025 – Present
Azure Data Engineer
Responsibilities:
• Architected and implemented end-to-end ETL pipelines using Azure Data Factory, SSIS, and HDInsight for ingesting data from source systems into Azure SQL DB, Azure Data Lake, and Azure Synapse.
• Led one-time large-scale data migration projects from on-premises Oracle to Azure SQL Data Warehouse, including schema design, star schema modeling, and optimization.
• Developed data profiling, cleansing, and rollback frameworks with automated batch pipeline restarts; implemented data masking, encryption, and the SSIS integration runtime (IR) for secure and reliable workflows.
• Used Azure Databricks and PySpark for data transformation, curation, and building scalable ELT jobs; created Databricks notebooks and managed high-concurrency Spark clusters.
• Built real-time streaming pipelines using Kafka and Spark Structured Streaming; handled ingestion from Azure Event Hubs and batch processing with Hive bucketing and partitioning (illustrative streaming sketch below).
• Developed and managed ADF pipelines with lookup, stored procedure, data flow, Azure Function, and copy activities, ingesting data from Azure Blob Storage.
• Integrated hybrid solutions with AWS EMR, Redshift, S3, Athena, and CloudWatch; created mapping documents, defined distribution and sort keys in Redshift, and performed ingestion to downstream systems.
• Automated workflows using Python scripting; performed data curation, built reusable PySpark functions and packages, and used U-SQL, Spark SQL, and T-SQL for transformation logic.
• Created reports and dashboards using Tableau and Splunk for regression analysis and KPI visualization; collaborated with business and IT teams to deliver fast insights.
Environment: Data Factory, Azure Databricks, Azure Synapse, Azure Data Lake, Azure SQL DB, Kafka, Hive, HDInsight, AWS EMR, Redshift, Spark, Tableau, Splunk, Python, SQL, Git.
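Representative code sample: a minimal PySpark Structured Streaming sketch of the Event Hubs ingestion pattern referenced above, assuming the Event Hubs Kafka-compatible endpoint; the namespace, topic, schema fields, and storage paths are hypothetical placeholders rather than actual project values.

# Minimal sketch only: ingest events from Azure Event Hubs over its Kafka-compatible
# endpoint with Spark Structured Streaming and land curated parquet in the data lake.
# The namespace, topic, schema, and paths are hypothetical placeholders; the cluster
# needs the spark-sql-kafka connector package available.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("eventhub-stream-ingest").getOrCreate()

event_schema = StructType([
    StructField("account_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "my-namespace.servicebus.windows.net:9093")
    .option("subscribe", "transactions")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            "username=\"$ConnectionString\" password=\"<EVENT_HUBS_CONNECTION_STRING>\";")
    .option("startingOffsets", "latest")
    .load()
)

# Parse the JSON payload and keep only the fields needed downstream.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_date", col("event_ts").cast("date"))
)

# Write the curated stream to the lake, partitioned by event date, with checkpointing.
query = (
    events.writeStream.format("parquet")
    .option("path", "abfss://curated@mydatalake.dfs.core.windows.net/transactions/")
    .option("checkpointLocation", "abfss://curated@mydatalake.dfs.core.windows.net/_chk/transactions/")
    .partitionBy("event_date")
    .outputMode("append")
    .start()
)
query.awaitTermination()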
CVS Health, Richardson, TX Nov 2023 – Dec 2024
AWS Data Engineer
Responsibilities:
• Built and optimized AWS Data Pipelines to load data from S3 into Redshift, and integrated multiple heterogeneous sources including Oracle, CSV, Excel, and flat files.
• Developed robust ETL pipelines using PySpark, Spark SQL, and AWS Glue, implementing data transformations, sensitive-field protection using hashing algorithms, and performance tuning (illustrative hashing sketch below).
• Created and maintained complex data models and metadata repositories using Erwin and Kimball methodology, and implemented Star/Snowflake schemas.
• Implemented Azure Data Factory (ADF V2) pipelines and activities for ingesting and transforming data into Azure Data Lake, SQL DW, and Databricks environments.
• Migrated on-premises data from SQL Server, Oracle, DB2, and MongoDB to cloud platforms (ADLS, Snowflake) using ADF, SSIS, PowerShell, and U-SQL.
• Designed and deployed Python-based APIs for revenue tracking and analytics; integrated data validation and cleansing mechanisms for trusted data.
• Created interactive dashboards and KPIs using Power BI and Tableau, performing complex DAX calculations and real-time comparisons with DirectQuery.
• Authored complex T-SQL queries, stored procedures, views, and triggers; optimized database performance with indexing and execution plan tuning.
• Performed ETL testing and data quality checks, including SCD Type 2 logic, and validated warehouse loads across staging and production layers.
• Automated workflows using Airflow and shell scripts, integrating job orchestration and monitoring with CI/CD pipelines (GitHub, Jenkins).
• Supported enterprise-level data governance, access control, and data security using IAM roles, encryption, and metadata classification.
• Delivered daily and monthly MIS reports and ad hoc analytics to business teams; collaborated cross-functionally with DevOps and BI teams.
Environment: AWS (S3, Redshift, Glue, Data Pipeline, EC2), Azure (Data Factory, Data Lake, SQL, Synapse, Databricks), Snowflake, Spark (PySpark, Spark SQL), Hadoop, Hive, Pig, DataStage, SSIS, Erwin, SQL Server, Oracle, PostgreSQL, MySQL, Python, Scala, Shell Scripting, SQL (T-SQL, PL/SQL), Power BI, Tableau, SSRS, Cognos BI, Git, GitHub, Jenkins, Confluence, Jira, SharePoint, Advanced Excel (Pivot Tables, VLOOKUP, Macros)
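Representative code sample: a minimal PySpark sketch of the sensitive-field hashing transformation referenced above, assuming a simple S3-to-S3 curation step ahead of a downstream Redshift load; the bucket names and column names are hypothetical placeholders.

# Minimal sketch only: hash sensitive member columns with SHA-256 before the curated
# layer is written back to S3 for a downstream Redshift COPY. Paths and column
# names are hypothetical placeholders, not actual project values.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sha2, to_date

spark = SparkSession.builder.appName("claims-curation").getOrCreate()

claims = spark.read.parquet("s3://raw-bucket/claims/")

curated = (
    claims
    # Replace direct identifiers with deterministic SHA-256 digests so joins
    # across datasets still work without exposing raw values.
    .withColumn("member_id_hash", sha2(col("member_id").cast("string"), 256))
    .withColumn("ssn_hash", sha2(col("ssn").cast("string"), 256))
    .drop("member_id", "ssn")
    .withColumn("service_date", to_date(col("service_date")))
    .filter(col("claim_amount") > 0)
)

# Land the curated dataset partitioned by service date for the warehouse load.
(
    curated.write.mode("overwrite")
    .partitionBy("service_date")
    .parquet("s3://curated-bucket/claims/")
)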
Maryah Bank, India Jul 2021 – May 2023
Cloud Data Engineer
Responsibilities:
• Designed and maintained scalable ETL pipelines using Apache Airflow, Spark, and Python to ingest data from multiple sources into GCP BigQuery and AWS Redshift (illustrative DAG sketch below).
• Led migration of on-premises data systems to Google Cloud Platform (GCP) and implemented data warehousing using BigQuery, Cloud SQL, and GCS buckets.
• Built and optimized Spark-based ETL jobs for transforming high-volume sales, analytics, and healthcare data, reducing job run time by 30%.
• Implemented fact and dimensional modeling (Star and Snowflake schemas) and handled Slowly Changing Dimensions (SCDs) to support a robust reporting infrastructure.
• Developed configurable data delivery pipelines using Python, enabling scheduled updates to customer-facing stores; integrated Docker, GitHub, and CI/CD for versioning and deployment.
• Designed and managed Cloud Dataflow pipelines using Apache Beam, and handled orchestration with Cloud Functions and Cloud Shell for event-driven data processing.
• Performed data profiling, cleansing, and validation; wrote performance-tuned SQL for reporting in Hive, Spark SQL, and traditional RDBMS (MySQL, PostgreSQL, SQL Server).
• Worked on DataOps pipelines and analytics automation by integrating TensorFlow optimization and ML models for decision-making use cases.
• Collaborated cross-functionally using Jira, Confluence, and Agile methodologies to translate business needs into actionable data workflows.
• Provided operational support by monitoring Hadoop and Hive workloads, ensuring SLAs were met, and documenting design and logic in clear specification reports.
Environment: GCP (BigQuery, GCS, Cloud Functions, Cloud SQL, Cloud Dataflow, Dataproc, Apache Beam), AWS (EC2, S3, Redshift), Apache Airflow, Docker, Spark, PySpark, Hive, Hadoop, SQL Server, PostgreSQL, MySQL, Python, Scala, GitHub, TensorFlow, JIRA, Confluence, CI/CD, Advanced SQL
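Representative code sample: a minimal Apache Airflow DAG sketch of the GCS-to-BigQuery ingestion pattern referenced above, assuming the Google provider operators; the bucket, dataset, and table names are hypothetical placeholders.

# Minimal sketch only: load daily files from a GCS bucket into a BigQuery staging
# table, then run a SQL transform into a fact table. Bucket, dataset, and table
# names are hypothetical placeholders, not actual project values.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="daily_transactions_to_bigquery",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:

    # Load the day's landed CSV files into a truncated staging table.
    load_staging = GCSToBigQueryOperator(
        task_id="load_staging",
        bucket="bank-landing-zone",
        source_objects=["transactions/{{ ds }}/*.csv"],
        destination_project_dataset_table="analytics.stg_transactions",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )

    # Append the staged rows for the run date into the reporting fact table.
    build_fact = BigQueryInsertJobOperator(
        task_id="build_fact",
        configuration={
            "query": {
                "query": (
                    "INSERT INTO analytics.fact_transactions "
                    "SELECT * FROM analytics.stg_transactions "
                    "WHERE transaction_date = '{{ ds }}'"
                ),
                "useLegacySql": False,
            }
        },
    )

    load_staging >> build_fact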
Cyient, India May 2020 – Jun 2021
Data Engineer
Responsibilities:
• Designed and built scalable ETL/ELT pipelines on Google Cloud Platform (GCP) using BigQuery, Cloud Dataflow (Apache Beam), Cloud Functions, and GCS to ingest and process real-time and batch data.
• Automated ingestion from Google Pub/Sub into BigQuery using Python-based pipelines and Apache Airflow DAGs for orchestration and monitoring (illustrative Beam sketch below).
• Engineered near real-time pipelines using Apache Spark and Cloud Dataproc, enabling event-driven analytics on high-volume datasets.
• Developed Python scripts for data cleansing, transformation, integration, and migration, incorporating complex SQL logic and PL/SQL stored procedures.
• Implemented Docker containers for microservice deployment and integrated them with CI/CD pipelines via GitHub Actions; containerized jobs on Kubernetes.
• Built machine learning models (Logistic Regression, Decision Trees) for customer behavior prediction using Scikit-learn, Pandas, and NumPy.
• Applied dimensional modeling techniques (Star and Snowflake schemas, SCD types) to data warehouse solutions, optimizing query performance with indexes, aggregates, and materialized views.
• Used Confluence and Jira for Agile tracking; produced data visualizations with Matplotlib, Seaborn, and heatmaps for business reporting.
• Managed structured and unstructured data at scale using Hadoop, Hive, and Spark SQL across hybrid environments (AWS and GCP).
• Used GCP tools such as bq, gsutil, Cloud Shell, Stackdriver, and Data Catalog for data auditing, quality checks, and lineage tracing.
Environment: GCP (BigQuery, Dataflow, Pub/Sub, GCS, Cloud Functions), Apache Airflow, Spark, Hadoop, Hive, Docker, Kubernetes, Python, SQL, PL/SQL, Scikit-learn, Matplotlib, Seaborn, GitHub, Apache Beam, AWS EC2/S3, Airflow, Confluence, Jira
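Representative code sample: a minimal Apache Beam sketch of the Pub/Sub-to-BigQuery streaming pattern referenced above; the project, subscription, table, and field names are hypothetical placeholders, and a real deployment would run on the DataflowRunner.

# Minimal sketch only: read JSON events from a Pub/Sub subscription and stream
# them into a BigQuery table. All resource names are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_event(message: bytes) -> dict:
    """Decode a Pub/Sub message payload into a BigQuery-ready row."""
    event = json.loads(message.decode("utf-8"))
    return {
        "device_id": event.get("device_id"),
        "metric": event.get("metric"),
        "value": float(event.get("value", 0)),
        "event_ts": event.get("event_ts"),
    }


def run() -> None:
    # Streaming mode is required for the unbounded Pub/Sub source.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/device-events"
            )
            | "ParseJson" >> beam.Map(parse_event)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="my-project:analytics.device_events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()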
EDUCATION
Master's in Computer Science, University of Bridgeport – 2024