Vidyarupa Sarupuri *************@*****.***
SUMMARY
8+ years of IT experience as a Data Engineer across Azure, Databricks, Hadoop, AWS, Teradata, DataStage, Spark, and Hive.
Hands-on experience migrating existing data processing workflows, ETL pipelines, and analytics applications to Databricks.
Experience in rewriting/refactoring code to leverage Databricks APIs, libraries, and features (e.g., Spark SQL, DataFrame API).
Experience in building ETL data pipelines leveraging Spark and Spark SQL.
Expertise in building and optimizing data lakehouse architectures using Databricks.
Strong background in reorganizing data lakes, rewriting pipelines, and migrating legacy workflows into modern frameworks.
Proficiency in SQL across several dialects (MySQL, PostgreSQL, SQL Server, and Oracle).
Practical experience with Delta Live Tables (DLT) for building reliable, automated ETL pipelines.
Implemented Unity Catalog for governance, lineage tracking, and role-based data access across enterprise datasets.
Designed and delivered bronze-silver-gold layer architectures ensuring scalable and auditable data workflows.
Collaborated with cross-functional teams to deliver curated datasets for analytics, BI, and machine learning use cases.
Proven ability to tune Databricks Spark clusters and pipelines for cost efficiency and performance.
Strong communication and collaboration skills, with experience working in Agile delivery environments.
Proficient in converting SQL queries into Spark transformations using DataFrames and Datasets (a brief sketch follows this summary).
Experience integrating data from various sources, including Oracle.
Excellent team player with strong analytical and problem-solving skills and very good verbal and written communication.
Quick to learn new concepts and technologies, with strong interpersonal skills.
Good working experience with Agile/Scrum methodologies, including participation in scrum calls for project analysis and development.
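A minimal sketch of the kind of SQL-to-Spark conversion referenced above; the table and column names are hypothetical and used only for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Original SQL (hypothetical table and columns):
#   SELECT customer_id, SUM(amount) AS total_amount
#   FROM transactions
#   WHERE txn_date >= '2024-01-01'
#   GROUP BY customer_id

# Equivalent DataFrame transformation
transactions = spark.table("transactions")
totals = (
    transactions
    .filter(F.col("txn_date") >= "2024-01-01")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)
totals.show()
```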
EDUCATION
B. Tech (Bachelor of Technology) from JNTU (Jawaharlal Nehru Technological University) in May 2009.
CERTIFICATIONS & AWARDS
Certified Databricks Associate Data Engineer.
Received “Live Wire” Award for Active Performer.
TECHNICAL SKILLS
Technologies
Python, SQL, Scala, Big Data, Hadoop, Hive
Automation Tools & OS
AutoSys, Unix, Windows, MacOS
Tools
GitHub, DataStage 11.7, DataStage 8.x
Databases
Oracle, MySQL, Teradata, SQL, Cosmos DB, Lake Federation, Vector Search
Cloud
Databricks, AWS, Azure
EXPERIENCE
Concentrix GPO July 2024 – Present
Client: GPO – Internal | Role: Sr. Data Engineer
Implemented Azure Databricks integration with Azure Cosmos DB endpoint to enable seamless data flow for real-time analytics and processing.
Configured and optimized Spark clusters in Databricks for efficient data access and query performance from Cosmos DB, reducing data processing time.
Developed and automated ETL pipelines in Databricks to extract, transform, and load data from Cosmos DB for analysis and reporting (see the ingestion sketch after this section).
Designed and implemented data pipelines in Databricks to support large-scale analytics, ensuring seamless ingestion and transformation of structured and unstructured data.
Reorganized data lakes to align with medallion architecture, optimizing query performance and storage costs.
Applied SQL and PySpark transformations to cleanse, enrich, and validate critical enterprise datasets.
Implemented Delta Live Tables (DLT) pipelines for automated ETL workflows, enabling real-time analytics.
Integrated Unity Catalog for governance, lineage, and access control across multi-cloud data assets.
Migrated and refactored legacy pipelines into Databricks workflows, leveraging Spark SQL and APIs for scalability.
Configured and optimized Databricks Spark clusters, reducing processing time and infrastructure costs.
Built Delta tables and views to provide reusable, consistent datasets for reporting and analytics teams.
Collaborated with cross-functional teams to define requirements and deliver curated datasets in Databricks.
Developed data quality rules and monitoring checks using DLT expectations, ensuring high accuracy.
Streamlined data ingestion from Cosmos DB and external APIs into Databricks, automating validation processes.
Designed dashboards in Databricks to visualize pipeline metrics, usage trends, and system performance.
Implemented automated job scheduling within Databricks Workflows, ensuring reliability and SLAs.
Partnered with ML teams to deliver curated data layers for model training and LLM Gateway testing.
Provided technical mentorship on Databricks best practices, Unity Catalog policies, and SQL optimizations.
Applied SQL and PySpark within Databricks to analyze and visualize key LLM Gateway metrics, such as request volume, response times, and failure rates.
Applied SQL techniques to validate and clean data, including handling missing values, duplicates, and inconsistencies, ensuring the accuracy and quality of datasets used in reporting.
Implemented data exploration using Databricks dashboards.
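A hedged sketch of the Cosmos DB-to-Databricks ingestion pattern described above, using the Azure Cosmos DB Spark connector; the endpoint, secret scope, database, container, and target table names are placeholders.

```python
# `spark` and `dbutils` are available by default in Databricks notebooks; the Azure Cosmos DB
# Spark connector is assumed to be installed on the cluster.
cosmos_config = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": dbutils.secrets.get("cosmos-scope", "cosmos-key"),
    "spark.cosmos.database": "analytics_db",
    "spark.cosmos.container": "events",
}

# Read the container into a DataFrame
events_df = spark.read.format("cosmos.oltp").options(**cosmos_config).load()

# Land the raw data as a bronze Delta table for downstream transformation
(events_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("bronze.cosmos_events"))
```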
Concentrix Catalyst Sept 2023 – July 2024
Client: Evernorth Healthcare | Role: Sr. Data Engineer
Led migration of ETL workflows and analytics applications into Databricks, ensuring business continuity.
Worked on different phases of data validation: Extraction, Transformation, Comparison, identification of discrepancies, and publishing a final report.
Designed and developed Delta Live Tables (DLT) pipelines for incremental ingestion and transformation (illustrated in the sketch after this section).
Applied SQL transformations to build standardized data marts supporting healthcare analytics.
Refactored legacy Hadoop jobs into Databricks pipelines, improving maintainability and performance.
Implemented Unity Catalog governance policies to ensure compliance with healthcare regulations.
Reorganized data lakes into bronze, silver, and gold layers, reducing redundancy and improving traceability.
Built reusable Spark SQL libraries to simplify transformations across multiple healthcare data domains.
Integrated Snowflake and Databricks for hybrid storage and compute optimization.
Developed data validation frameworks to ensure consistency between the Hadoop and Databricks platforms.
Designed test cases and automated validations for migrated pipelines, ensuring accuracy of results.
Collaborated with business teams to define and implement reporting solutions on curated Delta tables.
Developed Git-based CI/CD workflows for Databricks notebooks, enabling version control and deployments.
Enhanced pipeline performance through partitioning, indexing, and caching strategies (see the tuning sketch after this section).
Supported real-time analytics use cases by integrating streaming data into Delta tables.
Trained new team members on Databricks data lakehouse concepts and SQL optimization techniques.
Extensively worked on data validation as part of cloud migration delivery.
Identified the business requirements, goals, and objectives for migrating to Databricks.
Assessed the compatibility of existing systems and applications with Databricks.
Planned the migration strategy, timeline, and resources required for the migration process.
Analyzed existing data sources, formats, and structures, and transformed the data to ensure compatibility with Databricks.
Evaluated data quality and integrity to identify issues or inconsistencies.
Determined the data migration approach (e.g., batch, incremental, or streaming migration).
Rewrote or refactored code to leverage Databricks APIs, libraries, and features (e.g., Spark SQL, DataFrame API).
Tested the migrated code and applications to ensure functionality, performance, and compatibility with Databricks.
Mentored the team on the data validation process.
Actively trained new members of the team and got them up to speed.
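A simplified, hedged sketch of a Delta Live Tables pipeline with data quality expectations, along the lines of the incremental ingestion and bronze/silver layering described above; the landing path, table names, and columns are illustrative.

```python
import dlt
from pyspark.sql import functions as F

# Bronze: incremental ingestion of raw claim files with Auto Loader (path is a placeholder)
@dlt.table(comment="Raw claims ingested from the landing zone")
def bronze_claims():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/claims/")
    )

# Silver: cleansed, deduplicated records; rows failing the expectations are dropped
@dlt.table(comment="Validated claims")
@dlt.expect_or_drop("valid_claim_id", "claim_id IS NOT NULL")
@dlt.expect_or_drop("non_negative_amount", "claim_amount >= 0")
def silver_claims():
    return (
        dlt.read_stream("bronze_claims")
        .withColumn("ingest_date", F.current_date())
        .dropDuplicates(["claim_id"])
    )
```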
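A brief sketch of the partitioning, file-compaction, and caching tuning referenced above, under the assumption of Delta tables; table and column names are illustrative only.

```python
# Partition a Delta table on a commonly filtered column
claims_df = spark.table("bronze.claims")
(claims_df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("service_year")
    .saveAsTable("silver.claims_partitioned"))

# Compact small files and co-locate rows by a frequent join/filter key
spark.sql("OPTIMIZE silver.claims_partitioned ZORDER BY (member_id)")

# Cache a frequently queried slice and materialize it once for repeated downstream queries
hot = spark.table("silver.claims_partitioned").filter("service_year = 2023").cache()
hot.count()
```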
OSF Healthcare March 2022 – May 2023
Client: OSF Healthcare | Role: Data Engineer
Migrated data lake pipelines from Hadoop to Databricks, ensuring scalability and maintainability.
Rewrote ETL workflows using PySpark and SQL in Databricks, reducing processing overhead.
Designed Delta Lake structures for consistent, auditable, and versioned datasets.
Implemented DLT pipelines to simplify ingestion and enforce data quality expectations.
Applied Unity Catalog policies to secure patient data and manage healthcare data lineage (see the governance sketch after this section).
Developed SQL-based transformations and views for downstream BI and reporting applications.
Reorganized data lake architecture to support efficient queries and analytics workloads.
Built automated pipelines to ingest Teradata and external files into Databricks.
Created job automation frameworks in Databricks for daily and weekly ETL runs.
Integrated Databricks pipelines with PostgreSQL targets to deliver curated healthcare datasets.
Conducted data quality checks and reconciliations to validate completeness and accuracy.
Collaborated with healthcare domain experts to implement reporting solutions on claims and patient data.
Tuned Spark SQL jobs for cost and performance optimization in the Databricks environment.
Designed audit logging and monitoring solutions for critical healthcare pipelines.
Provided documentation and training on new Databricks pipelines for operations and business teams.
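A hedged sketch of Unity Catalog-style access control of the kind described above: granting read access to an analyst group and exposing a masked view over a sensitive table. The catalog, schema, table, column, and group names are placeholders, and the statements assume the executing principal has sufficient privileges.

```python
# Grant an analyst group read access to a gold-layer table
spark.sql("GRANT USE CATALOG ON CATALOG healthcare TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA healthcare.gold TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE healthcare.gold.patient_claims TO `data_analysts`")

# Dynamic view that redacts patient names for users outside a privileged group
spark.sql("""
    CREATE OR REPLACE VIEW healthcare.gold.patient_claims_masked AS
    SELECT
      claim_id,
      CASE WHEN is_account_group_member('phi_readers')
           THEN patient_name ELSE 'REDACTED' END AS patient_name,
      claim_amount
    FROM healthcare.gold.patient_claims
""")
```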
HCL Nov 2009 – May 2014
Client: CBA | Domain: Banking | Role: ETL Developer
Project Description:
The DS-UPGRADE project was undertaken to overcome constraints in the CBA production environment and to speed up long-running production streams that load bulk data into the CBA warehouse.
The program is designed to deliver value early and continuously, stage by stage, by approaching the overall streams as several discrete projects aligned to the same goals.
The intent is for the bank to have a business platform suitable for delivering its mission of being Australia's finest financial services organization through excelling in customer service, and to be positioned to extend this internationally to meet new business objectives.
The DS-UPGRADE program aims to modernize the bank's warehouse systems by replacing the old SunOS with the more efficient AIX and upgrading DataStage 8.0 to DataStage 8.1. Because AIX and DataStage 8.1 belong to the same vendor, better coordination between the two applications is expected, increasing the efficiency of the bank's warehouse.
Roles and Responsibilities
Designed and developed ETL pipelines in Databricks to process large-scale banking data from multiple source systems.
Performed Unit Testing and Regression testing when needed along with Documentation.
Created JIL files with AutoSys and Unix commands for running the DataStage jobs.
Scheduled AutoSys jobs.
Built Delta Lake tables and views to support reporting, risk analysis, and regulatory compliance requirements.
Automated pipeline scheduling and monitoring using Databricks Workflows.
Prepared migration documents, checklists, and test cases for clients.
Developed PL/SQL packages, procedures, and functions to populate data into staging tables.
Developed mappings and workflows to load data into Oracle tables.
Provided support to the development team for the creation of SQL queries and received ongoing coaching and mentorship on database design.
Migrated legacy ETL workflows into Databricks for improved scalability and performance.
Coordinated with senior database developers to create stored procedures, manage data feeds processed through the ETL pipeline, and maintain the SQL Server platform.
Wrote complex SQL queries on Oracle tables to pull data for analysis purposes.
Involved in ETL development using native Oracle tools (SQL*Loader, Oracle PL/SQL).
HVC (High-Value Customer)
Client: CBA | Domain: Banking | Role: ETL Developer
Description:
This provides a high-level summary of the HVC program. The program aims to grow Premier Banking by identifying Premier Customers: customers who hold accounts with CBA and with Other Financial Institutions (OFIs) and who, in business terms, are treated as Group Level customers. The intention at the Group level is to have more satisfied HVCs who do more with the Bank; business development, retention, and product-per-customer growth will lead to higher revenues.
The program was developed to identify these customers from COMMSEE and to calculate the below items.
1. Share of Wallet
2. Tiered Service Model
These calculations help the business identify what type of strategic policy can be provided to each group. Several new functions will be implemented in R27 HVC to achieve this goal:
Share of Wallet (SOW)
SOW takes data from the CommSee balance sheet and data acquired from the FHC process. The SOW calculation identifies and stores the percentage of business held with CBA versus Other Financial Institutions (OFIs). The outcome feeds TSM for allocating a tier to each relationship. SOW reports show detailed account information grouped into five banking categories.
Tiered Service Model (TSM)
TSM allocates a tier to each relationship based on data from SOW and the number of open CBA accounts per relationship. Tiering data feeds the SOW/TSM reports, which drive the portfolio review and rebalancing processes.
Roles & Responsibilities:
Understood the business functionality and analyzed business requirements.
Developed the jobs based on the mapping documents.
Developed Job Sequences for creating the flow of the jobs to load the target system.
Tested the job sequences under different conditions and verified the results of the jobs.
Prepared unit test cases and tested the developed code against all possible scenarios.
Analyzed existing jobs that were producing errors or running slowly and modified them to produce accurate results with optimal performance.
Trained new joiners on the client's coding and development standards.