VERONICA RUTH
Sr. Data Engineer
****************@*****.*** 216-***-**** https://www.linkedin.com/in/veronica-ruth-412136256/
Professional Summary:
9+ years of experience in Data Engineering with a strong foundation in building scalable ETL/ELT pipelines, data lakes, and cloud-based analytics solutions.
Proficient in Python, PySpark, Scala, SQL, and T-SQL, with deep experience in Spark, Kafka, Hive, Airflow, and cloud-native orchestration tools.
Expertise in Azure (ADF, Databricks, Synapse, Functions, Cosmos DB) and AWS (Glue, Lambda, Redshift, S3) for modern data platform development.
Hands-on experience with Snowflake, Teradata, and SQL Server for cloud data warehousing and performance-tuned analytics.
Integrated diverse data sources including Oracle, flat files, JSON, NoSQL (Cassandra, HBase) into structured warehouses and lakehouses.
Built and automated end-to-end data pipelines, implemented real-time streaming with Kafka and Azure Event Hub, and enabled high-volume batch processing.
Applied Data Science and ML using Scikit-learn, Pandas, NumPy, MLflow, and built predictive pipelines using PySpark and Python.
Developed POCs for Generative AI and LLM integration using OpenAI, LangChain, Hugging Face, including LLM-driven SQL generation, metadata documentation, and semantic search with FAISS.
Created impactful dashboards using Power BI and Tableau, and implemented data governance through validation checks, profiling, and compliance with GDPR and CCPA.
Strong in DevOps and CI/CD practices with Git, GitHub, Docker, Azure DevOps, and passionate about bridging Data Engineering with AI-driven innovation.
AI / GenAI Projects (Proof of Concept)
LLM-Powered Pipeline Summaries (POC): Developed a proof-of-concept using OpenAI GPT API and Azure Functions to automatically summarize data pipeline errors and logs in natural language, aiming to improve triage efficiency for DevOps teams.
Natural Language to SQL Generator (Prototype): Built a prototype with LangChain and Snowflake to translate plain-English business questions into executable SQL queries, supporting self-service analytics for business users.
Automated Metadata Documentation (Internal POC): Used OpenAI and Python to auto-generate table and column descriptions for data lake assets, streamlining metadata management and reducing manual documentation efforts.
Text Analytics for Feedback (Exploration): Leveraged Hugging Face Transformers to test sentiment classification and summarization on customer feedback data, exploring applications in voice-of-customer analysis.
Vector-Based Semantic Search (POC): Implemented a semantic search prototype using FAISS and sentence-transformer embeddings to retrieve relevant documents and KPIs from internal wikis and reporting systems (sketch below).
Note: These projects were implemented as internal prototypes and POCs to evaluate the integration of GenAI and LLMs into enterprise data engineering workflows.
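Illustrative sketch of the semantic-search POC approach; the model name, documents, and query are placeholder assumptions rather than production values. It shows the core pattern: embed documents with a sentence-transformer model and index them in FAISS for nearest-neighbor retrieval.

# Minimal semantic-search sketch (placeholder model, documents, and query).
import faiss
from sentence_transformers import SentenceTransformer

docs = [
    "Monthly churn KPI definition and owning team",
    "Runbook for the nightly transaction-ingestion pipeline",
    "Data-retention policy for customer PII",
]

model = SentenceTransformer("all-MiniLM-L6-v2")            # assumed embedding model
embeddings = model.encode(docs, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(embeddings.shape[1])             # inner product == cosine on normalized vectors
index.add(embeddings)

query = model.encode(["Where is the churn KPI documented?"],
                     normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, 2)                       # top-2 most similar documents
for score, doc_id in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {docs[doc_id]}")

Normalizing the embeddings lets the flat inner-product index behave as cosine similarity, which keeps the prototype simple before moving to a larger approximate index.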
Technical Skills:
Programming: Python, Scala, PySpark, SQL, T-SQL, Spark-SQL, Java (basic), R (basic)
Big Data Frameworks: Apache Spark, Hadoop, Hive, Pig, MapReduce, Sqoop, Flume, HDFS, Kafka
ETL & Orchestration: Azure Data Factory, SSIS, Informatica PowerCenter/IICS, Talend, DataStage, Apache Airflow
Cloud Platforms: Azure (Synapse, Databricks, Blob Storage, Functions, Logic Apps, Event Hub); AWS (EC2, S3, Redshift, Glue, Lambda, Step Functions, EMR)
Data Warehousing: Snowflake, Teradata, Redshift, Azure Synapse Analytics, SQL Server, Netezza
Data Modeling: Star Schema, Snowflake Schema, Dimensional & Relational Modeling, Normalization
DevOps & CI/CD: Azure DevOps, Git, GitHub, Docker, ARM Templates
Data Visualization: Power BI, Tableau, Qlik
Data Governance: Data Quality Checks, Profiling, GDPR, CCPA Compliance, Secure Shares (Snowflake)
Databases: SQL Server, Oracle, PostgreSQL, Cassandra, HBase, Cosmos DB
Other Tools: Ab Initio, SSDT, OBIEE, Visual Studio, StrongView
AI & Data Science Tools: Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn, Jupyter Notebooks, TensorFlow (basic), Great Expectations, MLflow, Predictive Modeling, Model Monitoring
Professional Experience
Sr. Data Engineer
Navy Federal Credit Union, FL Nov 2023-Present
Designed and implemented a real-time analytics platform using Azure Databricks and Spark Structured Streaming to monitor banking project statuses with live data feeds (illustrative sketch at the end of this role).
Orchestrated complex ETL workflows in Azure Data Factory, leveraging dynamic scaling and performance tuning to manage fluctuating banking data loads.
Developed SQL Server Integration Services (SSIS) packages for data warehousing solutions, enabling consistent data availability for the dashboard's reporting features.
Architected a data lakehouse pattern on Azure Synapse Analytics, blending the capabilities of data lakes and warehouses to provide a unified data management platform for the dashboard.
Developed and maintained ETL processes within Azure Data Factory, incorporating data from multiple banking systems to provide a holistic view of project statuses.
Designed data models and data flows to optimize the storage and retrieval of project-related data, ensuring efficient operation of the banking dashboard.
Developed and optimized transformations and stored procedures in Snowflake for processing high-volume historical transaction data.
Integrated Snowflake with Azure Data Lake/Blob Storage and third-party APIs for ingesting external financial feeds.
Employed Azure Functions and serverless architecture to replace custom scripts and workflows from Informatica, enhancing automation and reducing maintenance overhead.
Led the migration of data integration processes from Informatica PowerCenter to Azure Data Factory, ensuring minimal disruption to ongoing operations and improved performance.
Orchestrated complex ETL workflows in Apache Airflow, automating data ingestion, cleansing, and loading from core banking systems.
Built dynamic DAGs for multi-branch transaction processing, reducing manual intervention and increasing throughput by 40%.
Built Python-based microservices for API-driven data ingestion and integrated Scikit-learn-based ML models using PySpark to provide predictive insights into customer churn and transaction anomalies.
Established continuous monitoring and alerting systems for the dashboard’s data pipelines, using Azure DevOps to track performance metrics and trigger notifications for any anomalies.
Implemented CI/CD with Azure DevOps, including automated testing and IaC using ARM templates.
Applied data governance and compliance standards specific to the banking industry, ensuring the dashboard adhered to regulations like GDPR.
Environment: Azure Databricks, Spark Structured Streaming, Azure Data Factory, SSIS, Azure Synapse Analytics, Snowflake, Azure Data Lake, Azure Functions, Informatica PowerCenter, Apache Airflow, Python, REST APIs, PySpark, Azure DevOps, ARM Templates, SQL Server, S3, Git, CI/CD, GDPR, CCPA
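Illustrative sketch of the real-time ingestion pattern behind the project-status dashboard; the broker address, topic, schema, and target table are assumptions rather than the production configuration, and the Kafka connector is assumed to be available on the Databricks cluster.

# PySpark Structured Streaming sketch: read status events from Kafka, parse the
# JSON payload, and append to a Delta table consumed by the dashboard.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("project-status-stream").getOrCreate()

event_schema = StructType([
    StructField("project_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")    # assumed broker
       .option("subscribe", "project-status-events")        # assumed topic
       .load())

events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
             .select("e.*"))

query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/project_status")  # assumed path
         .outputMode("append")
         .toTable("analytics.project_status_live"))                        # assumed table

query.awaitTermination()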
Sr. Data Engineer
Kuvare Holdings, Rosemont, IL June 2023-October 2023
Designed and developed data storage, integration, transformation, and distribution strategies capable of handling large volumes of structured and unstructured data.
Monitored and maintained database security by implementing access controls, authentication, audit trails, and other robust security measures.
Integrated data from multiple sources including legacy systems, cloud applications, proprietary file formats, APIs, and event sourcing platforms.
Developed, implemented, and maintained Data Quality Services to ensure high levels of data accuracy and consistency.
Worked with both relational and dimensional data models, particularly in Azure SQL Server environments.
Built ETL/ELT pipelines using SSIS, Azure Data Factory, Azure Function Apps, Logic Apps, Databricks, Apache Spark, and Azure Synapse Analytics.
Implemented event-driven architectures using Azure Event Hub and Apache Kafka for real-time data streaming and processing.
Handled structured, semi-structured, and unstructured data, including relational (SQL) tables, JSON, and YAML.
Collaborated with stakeholders to gather ETL requirements and translate them into technical specifications.
Maintained best practices in software development for analytics pipelines and data engineering codebases.
Deployed distributed storage systems like Hadoop Distributed File System (HDFS) and Azure Blob Storage for scalable storage solutions.
Applied partitioning and indexing techniques to improve query performance on large datasets.
Enforced encryption for data at rest and in transit; performed regular security audits and vulnerability assessments.
Designed and implemented data virtualization for real-time access to heterogeneous data sources using normalization/denormalization techniques.
Embedded data validation and quality checks within Informatica workflows and pipelines.
Designed profiling and cleansing routines to detect and resolve data inconsistencies.
Created optimized relational database schemas to support efficient reporting and querying.
Applied star and snowflake schema models for dimensional data modeling in warehousing environments.
Orchestrated data pipelines using tools like Apache Airflow and Azure Data Factory.
Used Change Data Capture (CDC) to minimize processing time by capturing only modified records (sketch at the end of this role).
Built microservices-based data mesh systems using event-driven architectures for scalable, modular pipelines.
Managed diverse data formats such as XML, Avro, Parquet, and ORC, including advanced queries on unstructured data using Azure Data Lake Analytics.
Environment: Azure Data Factory, SSIS, Azure Function Apps, Azure Logic Apps, Azure Databricks, Apache Spark, Azure Synapse Analytics, Azure Event Hub, Apache Kafka, Azure SQL Server, Hadoop Distributed File System (HDFS), Azure Blob Storage, Apache Airflow, Informatica, SQL, JSON, YAML, XML, Avro, Parquet, ORC, Data Virtualization, Data Quality Services, Star Schema, Snowflake Schema, Change Data Capture (CDC), Microservices Architecture, Azure Data Lake Analytics
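Simplified sketch of the CDC-driven load referenced above, reduced to a high-watermark query so only modified rows are re-processed; the connection string, table, and column names are placeholder assumptions.

# High-watermark incremental extraction: pull only rows changed since the last run.
from datetime import datetime

import pyodbc  # assumes an ODBC driver for the source SQL Server is installed

def load_incremental(conn_str: str, last_watermark: datetime) -> datetime:
    conn = pyodbc.connect(conn_str)
    cursor = conn.cursor()
    cursor.execute(
        "SELECT policy_id, status, modified_at "
        "FROM dbo.policies WHERE modified_at > ?",     # assumed table and columns
        last_watermark,
    )
    new_watermark = last_watermark
    for policy_id, status, modified_at in cursor.fetchall():
        print(policy_id, status, modified_at)           # hand off to the downstream sink (stubbed)
        new_watermark = max(new_watermark, modified_at)
    conn.close()
    return new_watermark                                # persist as the next run's starting point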
Sr. Data Engineer
Vizient, Chicago, IL December 2022-May 2023
Responsibilities:
Played a pivotal role in migrating the Clinical Supply Chain platform from on-premises infrastructure to the Azure Cloud environment.
Collaborated with cross-functional teams and stakeholders to capture business requirements and ensure a smooth data migration.
Built Teradata ELT frameworks to streamline data ingestion from multiple sources.
Modeled HIPAA-compliant datasets in Snowflake using secure views, masking policies, and RBAC to ensure patient data privacy and access control.
Leveraged legacy Teradata load utilities and applied SQL optimization strategies to improve efficiency and scalability.
Developed Azure Databricks notebooks to enrich and refine data processing workflows, applying SQL queries for data manipulation and transformation.
Integrated Airflow with data quality validation scripts (Python/Great Expectations) to ensure data completeness and consistency across care records (sketch at the end of this role).
Orchestrated end-to-end data pipelines via Azure Data Factory, coordinating data flow between on-premises and cloud systems.
Executed Spark jobs and Hive queries within the Azure Databricks environment, contributing to efficient data processing and insightful analytics.
Implemented robust security measures, including Azure Key Vault, to protect sensitive information throughout the migration.
Upheld compliance with stringent data protection regulations and industry standards across the clinical supply chain.
Established a comprehensive SFTP framework, ensuring both security and efficiency in our data transfer operations across on-premises and Azure Cloud environments.
Integrated Qlik and Power BI to elevate data visualization and analytics, delivering actionable insights to stakeholders.
Participated in continuous improvement initiatives within a CI/CD environment, deploying applications through Docker containers.
Created detailed technical documentation for our data engineering processes, ensuring smooth knowledge transfer and project handovers.
Used Power BI to generate comprehensive reports, offering valuable insights into our migration progress and maintaining high data quality metrics.
Environment: Azure Cloud, Azure Databricks, Azure Data Factory, Spark, Hive, Teradata, Snowflake, SQL, RBAC, Azure Key Vault, Airflow, Python, Great Expectations, Power BI, Qlik, Docker, SFTP
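Illustrative sketch of the Airflow validation step referenced above, assuming Airflow 2.x; plain pandas checks stand in here for the Great Expectations suites used in the actual workflow, and the DAG id, file path, and column names are assumptions.

# Daily DAG that fails fast when care-record completeness or consistency checks break.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def validate_care_records(path: str = "/data/incoming/care_records.csv") -> None:
    df = pd.read_csv(path)
    missing_ids = df["patient_id"].isna().sum()                            # completeness check
    duplicate_visits = df.duplicated(["patient_id", "visit_date"]).sum()   # consistency check
    if missing_ids or duplicate_visits:
        raise ValueError(
            f"Validation failed: {missing_ids} null patient_ids, "
            f"{duplicate_visits} duplicate visits"
        )

with DAG(
    dag_id="care_record_quality_checks",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="validate_care_records", python_callable=validate_care_records)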
Sr. Data Engineer
Macy's, New York, NY June 2020 to Nov 2022
Responsibilities:
Generated data extracts as CSV files on AWS EC2, stored them in AWS S3, and loaded them into AWS Redshift as structured tables.
Used AWS S3 buckets to store files and ingested them into Snowflake tables using Snowpipe; managed deltas via data pipelines (sketch at the end of this role).
Migrated data from on-premises to AWS storage buckets and developed Python scripts to transfer and extract data via REST APIs into S3.
Deployed applications using AWS EC2 standard deployment techniques, worked on AWS infrastructure automation, and utilized Docker in a CI/CD environment.
Ingested and transformed data using AWS Lambda, Glue, and Step Functions for structured processing.
Created secure data sharing mechanisms using Snowflake Secure Shares to enable vendor and supplier collaboration.
Automated ingestion of POS and e-commerce data using Apache Airflow DAGs integrated with Snowflake, including parameterized DAGs for scaling.
Designed and developed ETL pipelines in Talend/DataStage to support data extraction, transformation, and loading into the EDW.
Partnered with ETL developers to ensure that data is cleansed and the data warehouse remains current.
Created tables and views in Snowflake using SQL, which were consumed directly by end users; also automated AWS Glue table creation through YAML-based stack templates.
Extracted data from Netezza using Python scripts and transferred it to AWS S3; also developed Lambda functions with IAM roles and triggers (SQS, EventBridge, SNS).
Built infrastructure for optimal ETL from various data sources using SQL, ensuring reliability and scalability.
Designed and developed end-to-end data pipelines using Ab Initio, implementing complex transformations via its component library.
Provided production support for SSIS, SQL Server, stored procedures, Matillion jobs, interim marts, and Snowflake; developed predictive reports using Python and Tableau.
Worked extensively with Kafka and followed Agile/Scrum methodology for project and team execution.
Environment: PySpark, SQL, Spark, Python, Snowflake, Databricks, Airflow, Kafka, Tableau, GitHub, AWS EMR/EC2/S3/Redshift, ETL, Git, CI/CD, Scala, Data pipelines, Data analysis, Data visualization
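Simplified sketch of the S3-to-Snowflake ingestion referenced above; an explicit COPY INTO from an external stage stands in for the Snowpipe auto-ingest used in production, and the account, credentials, stage, and table names are placeholder assumptions.

# Load staged POS files from S3 into Snowflake via an external stage.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345",        # assumed account identifier
    user="LOAD_SVC",          # assumed service user
    password="***",
    warehouse="LOAD_WH",
    database="SALES",
    schema="RAW",
)

try:
    conn.cursor().execute(
        """
        COPY INTO raw.pos_transactions
        FROM @raw.s3_pos_stage                    -- external stage pointing at the S3 bucket
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
        """
    )
finally:
    conn.close()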
Data Engineer
Nationwide, Columbus, OH May 2017 to April 2019
Responsibilities:
Designed scalable distributed data solutions using Hadoop after conducting requirement analysis and identifying relevant data sources.
Developed MapReduce jobs in Pig and Hive for cleansing, preprocessing, and logic implementation using Pig scripts and UDFs.
Developed Spark-SQL and Scala queries to analyze structured data (CSV, text files) within the Spark ecosystem (PySpark sketch at the end of this role).
Implemented AWS cloud integrations by deploying Hadoop clusters on EC2, managing RDS instances, and using S3 for data storage.
Collaborated with Hadoop Administrators to configure and optimize clusters with Hadoop, MapReduce, and HDFS, including patch and update management.
Utilized Pig, Hive, HBase, and Sqoop for data analysis, HDFS/Hive import/export, and AWS S3 dependency creation.
Used Sqoop for bidirectional data transfers between HDFS and RDBMS; data was processed further in AWS Redshift for analytics.
Created Hive tables to store PII data across portfolios for downstream analysis.
Designed ETL pipelines using Informatica PowerCenter and Informatica IICS; installed and configured Oozie for workflow automation; supported email campaigns and StrongView platform training with AWS-backed data storage.
Environment: Hadoop, Pig, Hive, HBase, Sqoop, MapReduce, HDFS, OBIEE, Spark, Spark-SQL, Scala, Linux, RDBMS, Azure Data Factory, Oozie, UDFs, Oracle.
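PySpark counterpart of the Spark-SQL analysis referenced above (the original work also used Scala); the file path and column names are placeholder assumptions.

# Register a CSV extract as a temporary view and aggregate it with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("claims-analysis").getOrCreate()

claims = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("hdfs:///data/claims/claims_2018.csv"))     # assumed HDFS path

claims.createOrReplaceTempView("claims")

summary = spark.sql("""
    SELECT state, COUNT(*) AS claim_count, SUM(paid_amount) AS total_paid
    FROM claims
    GROUP BY state
    ORDER BY total_paid DESC
""")
summary.show(10)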
Data Engineer
Couth Infotech Pvt. Ltd, Hyderabad, India October 2014 to February 2017
Responsibilities:
Developed data integration programs in Hadoop and RDBMS environments using Spark/Scala and Python, with regex-based processing in Hive on Linux/Windows platforms.
Designed and executed ETL workflows to extract, transform, and load large datasets into CSVs using Python and SQL; applied in-memory processing, partitioning, broadcasting, and joins for optimized performance.
Built models leveraging statistical and probability principles and performed data modeling in Azure Analysis Services using Visual Studio (SSDT).
Created and validated schemas for Parquet and Avro formats; implemented partitioning, dynamic partitioning, and bucketing in Hive for efficient data processing (sketch at the end of this role).
Conducted a POC comparing Apache Hive and Impala to evaluate performance for batch analytics and supported both structured and NoSQL sources.
Utilized Tableau for daily reporting on Hive datasets and collaborated with cross-functional teams to ensure data accuracy, availability, and reporting reliability.
Environment: MapR, Hadoop MapReduce, HDFS, Spark, Hive, Pig, SQL, Sqoop, Flume, Oozie, Java 8, Eclipse, HBase, Shell Scripting, Scala.
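Illustrative PySpark sketch of the Hive partitioning and bucketing pattern referenced above; the staging path, database, table, and column names are assumptions.

# Write an events extract as a Hive table partitioned by date and bucketed by
# customer, so date filters prune partitions and customer joins shuffle less.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("events-partitioned-load")
         .enableHiveSupport()
         .getOrCreate())

events = spark.read.parquet("hdfs:///staging/events/")     # assumed staging path

(events.write
 .mode("overwrite")
 .format("parquet")
 .partitionBy("event_date")        # partition pruning on date filters
 .bucketBy(16, "customer_id")      # bucketing for join efficiency
 .sortBy("customer_id")
 .saveAsTable("analytics.events_bucketed"))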
Education Details:
Bachelor of Technology in Computer Science and Engineering
Jawaharlal Nehru Technological University – 2014