Yashwantej D
Email: ************@*****.***
Phone: 330-***-**** Cleveland, OH
https://www.linkedin.com/in/yashwantanalyst/
PROFESSIONAL SUMMARY:
Data Engineer with 5+ years of specialized experience in Databricks, AWS, and modern data stacks (Airflow, S3, Spark, Lakehouse). Experienced in cloud migrations, PVC-to-SaaS transitions, and scalable data pipelines. Skilled in PL/SQL, Python, and data visualization tools like Tableau, with a strong background in healthcare, finance, and e-commerce domains. Committed to leveraging expertise in data engineering to drive efficient data solutions and support business analytics.
Effective team member, collaborative, and comfortable working independently.
Utilized Amazon Athena for ad hoc queries and analysis of raw and curated datasets stored in S3.
Proficient in using tools like Erwin and Power Designer for model visualization, lineage, and versioning.
Developed complex SQL and Snowflake SQL scripts for data validation, transformation, and aggregation in support of analytics initiatives.
Experienced in developing Snowflake-based data warehouses with emphasis on scalability, performance tuning, and cost optimization.
Efficient with the Hive data warehouse: creating tables, distributing data through partitioning and bucketing strategies, and writing and optimizing HiveQL queries.
Experience in ingestion, storage, querying, processing, and analysis of big data, with hands-on experience in Apache Spark, Spark SQL, and Hive.
Built ETL pipelines and automated workflows in Alteryx to extract, transform, and load data from diverse sources (SQL Server, XML, CSV, Excel), enabling smooth migration, modeling, and dashboarding in Tableau.
Experience designing, developing, and deploying projects across the AWS suite, including Amazon S3, Redshift, EMR, Glue, Lambda, Athena, Step Functions, CloudWatch, and QuickSight.
Designed, tested, and maintained the data management and processing systems using Spark, AWS, Hadoop, and shell scripting.
Designed and implemented CI/CD pipelines using Azure DevOps for automated build, test, and deployment workflows across multiple environments.
Integrated code repositories (Git), build pipelines, and release pipelines to ensure rapid, consistent software delivery with rollback strategies.
Created custom YAML pipelines with automated testing, versioning, artifact generation, and deployment to Azure App Services and Kubernetes clusters.
Developed and standardized data dictionaries for enterprise datasets, including field definitions, data types, permissible values, and business context.
Implemented monitoring, alerting, and logging for ingestion to ensure reliability and traceability of data flows.
Experience with inferential statistics, hypothesis testing, and statistical modeling in Python and SAS.
Built regression-based predictive models in SAS for behavior analysis (e.g., negative binomial or NBD regression), extracting key variables to predict consumer purchase behavior.
Created and maintained data mapping documents to trace lineage, validate transformations, and support audit/compliance requirements.
Designed and maintained scalable data platforms for real-time analytics using Databricks, Spark, Azure, and Kafka/Kinesis, processing millions of events per minute.
Built and optimized ELT/ETL pipelines with Fivetran, Airflow, Databricks Workflows, and DBT, ensuring automated, hardened, and production-grade data workflows.
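Illustrative sketch of the kind of Airflow-plus-DBT orchestration described above (assumes Airflow 2.4+ and a dbt project available on the worker; the DAG id, task ids, and paths are hypothetical, not taken from an actual engagement):

# Minimal Airflow DAG sketch: stage raw files, then run DBT models.
# dag_id, task ids, and the dbt project path are hypothetical placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def stage_raw_files(**context):
    """Placeholder extract step; a real pipeline would call boto3, an API, or a Fivetran sync."""
    print(f"Staging raw files for run {context['ds']}")

with DAG(
    dag_id="daily_elt_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="stage_raw_files", python_callable=stage_raw_files)
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/project && dbt run --select staging+",
    )
    stage >> dbt_run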
PROJECTS: https://github.com/Yashds691543
Winery Business (Vineyard and Sales): Designed ER diagrams, normalized schema, implemented full SDLC
Python Search Engine: GUI-based IR system using TF-IDF, cosine similarity, Tkinter
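Illustrative sketch of the TF-IDF and cosine-similarity ranking behind the search engine project (toy corpus and function names are illustrative, not the repository's actual code):

# Toy sketch of TF-IDF ranking with cosine similarity using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "red wine from the estate vineyard",
    "white wine sales by region",
    "grape harvest and vineyard yield report",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs)  # TF-IDF weights per document

def search(query, top_k=2):
    """Rank documents by cosine similarity between the query and document vectors."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(docs[i], float(scores[i])) for i in ranked]

print(search("vineyard wine"))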
EDUCATION:
MS in Data Science, University of Memphis, TN (GPA: 3.70) 08/2022 – 05/2024
BS in Computer Science, JNTU, India (GPA: 3.10) 06/2016 – 06/2020
TECHNICAL SKILLS:
Analysis and Modelling Tools: Erwin r9.6/r9.5/r9.1/r8.x, Sybase PowerDesigner, Hackolade v2.5.2, Oracle Designer, BPwin, ER/Studio, MS Access 2000, Oracle, star-schema and snowflake-schema modeling, fact and dimension tables, pivot tables, Tableau, Power BI, TIBCO Spotfire, Business Objects.
OLAP Tools: Tableau, SAP BO, SSAS, Business Objects, and Crystal Reports 9.
Programming Languages & Libraries: Python (Pandas, NumPy, SciPy, PySpark, TensorFlow), T-SQL, PL/SQL, Teradata SQL, C++, Java, HTML, UNIX shell scripting, Snowflake, Apache Airflow, Data Fusion, Dataprep, Informatica, Fivetran
Big Data: Apache Spark, Hadoop, HDFS, YARN, Hive, Sqoop, MapReduce, Tez, Ambari, Zookeeper, Data warehousing, Starburst, Collibra
ETL Tools: SSIS, Pentaho, Informatica PowerCenter 9.7/9.6/9.5, DataStage, Talend, Snowflake.
Databases: MySQL, SQL Server, DB2, Cassandra, Teradata, BigQuery, Druid, Oracle 11g/12c, SAP HANA, Amazon RDS, Snowflake
Cloud: Amazon Web Services (AWS) – S3, Redshift, EMR, Glue, Lambda, Athena, Step Functions, RDS, EC2, CloudWatch; Azure – Synapse, Data Factory (ADF).
Methodologies: Agile (Scrum), Waterfall.
Data Visualization Tools: Looker, Power BI, Microsoft Excel (pivot tables, graphs, charts, dashboards)
Tools: Automic, Hue, Looker, IntelliJ IDEA, Eclipse, Maven, ZooKeeper, VMware, PuTTY, DbVisualizer.
Certification: Google Data Analytics Professional Certificate (Link) August 2024.
Version Control & Protocols: Git, GitHub, SFTP, SSH, LINUX, UNIX.
PROFESSIONAL EXPERIENCE:
Quantum Code, Kissimmee, FL Sep 2024 – Present
BIOGEN (Contract)
Data Engineer
Responsibilities:
Built ETL pipelines from data lake to various databases, improving data accessibility and processing efficiency
Designed and developed data applications using Hadoop, HDFS, Hive, Spark, Scala, and other tools, enhancing data processing capabilities
Made data available in Amazon Redshift by performing ELT with AWS Glue to move data from Amazon S3 into Redshift tables
Developed jobs using SSIS and Pentaho tools to transform and load the data to data marts.
Architected solutions in AWS for a large claims data mart, optimizing query execution in Redshift for Medicare and commercial populations
Wrote complex SQL queries on Redshift involving joins, user-defined functions (UDFs), and scheduled jobs to help the reporting team generate reports on bonus and HSA incentives and compare daily step activity across the population.
Performed claims (EDI) and billing analysis, working with ICD-9/-10, CPT, DRG coding, Medicaid/Medicare claims, and EDI formats to support NCQA and accreditation reporting
Managed clinical trial and research data, building and validating datasets (CRFs, tables, listings) for phases I-IV studies in compliance with regulatory guidelines
Ensured data integrity and regulatory compliance, implementing HIPAA-aligned processes, validating data via QA testing, and maintaining policies around access and confidentiality
Managed data migrations and conversions, handling legacy data load for contracts, rates, billing events, suppliers/customers, and invoices utilizing ETL tools, APIs, and SFTP feed mechanisms
Automated ingestion workflows from on-premises systems (Hadoop, Oracle) into AWS using Airflow in conjunction with Sqoop, Spark, and Cloud Dataflow, enabling reliable migration of 500+ tables to S3/Dataflow
Integrated Alteryx with Tableau reporting, using Alteryx for data cleansing, blending, and creating extracts (e.g., Tableau Data Extracts) to power interactive dashboards with blended and optimized data sources.
Developed statistical forecasting and predictive models using SAS tools like SAS Studio, SAS Enterprise Guide, and PROC SQL to analyze customer behavior and operational demand.
Developed and deployed predictive models using Python (scikit-learn, XGBoost) to forecast trends, detect anomalies, and support decision-making.
Developed base and consumption tables in the data lake and facilitated data movement to Teradata, improving data accessibility and integration
Developed streaming data architectures using Spark Streaming, Kafka, Flume, Kinesis, and Flink, enabling near real-time event processing at scale
Employed Databricks DLT and Auto Loader, alongside Delta Lake, to build robust, event-driven pipelines with efficient schema enforcement and incremental loading (see the sketch at the end of this section)
Developed a proof-of-concept prototype with fast iterations, maintaining design documentation, test cases, monitoring, and performance evaluations using Git, PuTTY, Maven, Confluence, ETL tooling, Automic, ZooKeeper, and Cluster Manager, improving project documentation and testing efficiency
Developed automated Power BI dashboards using DAX and data modelling techniques to visualize claims and incentive data across Medicare and commercial populations.
Wrote Python scripts to parse and normalize data from diverse file formats, including JSON, XML, and CSV before loading into Redshift and Snowflake.
Used SFTP and Git to manage data and maintain version control for ETL scripts and pipeline configurations.
Passionate about staying current with the latest data technologies, including Databricks DLT, Delta Lake, Auto Loader, DBT, and new open-source frameworks, while championing best practices across engineering teams.
Built metadata pipelines and standardized business logic across vault layers using tools like Airflow and Python in Snowflake environments.
Led Databricks PVC-to-SaaS workspace migration, optimizing cluster configurations, mount points, and workspace policies for scalable multi-tenant data processing
Presented actionable insights from healthcare eligibility and claims data to non-technical business stakeholders, helping improve patient program engagement.
Built semantic models with row-level security and integrated Power BI with curated datasets in Redshift and Snowflake, enabling self-service reporting while ensuring compliance with data access policies.
Automated validations between different databases using shell scripting and reported data quality to users with the Aorta and Unified Data Pipeline frameworks, enhancing data accuracy and user trust
Ensured ETL processes succeeded and data loaded successfully in Snowflake DB, leading to seamless data availability for analytics
Worked extensively in Linux environments to deploy, schedule, and monitor big data workflows using Shell scripting, Airflow, and Oozie
Troubleshot issues related to data pipeline failures or slowness using MapReduce, Tez, Hive, or Spark, ensuring SLA adherence and minimizing downtime.
Optimized Hive scripts by reengineering the DAG logic to use minimal resources and provide high throughput.
Worked with business users to resolve discrepancies like error records and duplicate records across tables, writing complex PL/SQL queries to validate reports.
Integrated Collibra for metadata management and Starburst for federated querying across AWS and Databricks, ensuring unified governance and auditability
Created data ingestion processes to maintain the Global Data Lake on Amazon S3 and Redshift.
Delivered Power BI dashboards that helped Medicare stakeholders track population-level trends and improved visibility into HSA incentive engagement.
Hands-on experience with the Hadoop ecosystem (HDFS, Hive, MapReduce, HBase, Impala, Spark)
Followed Agile methodology and used JIRA for sprint planning and task management, improving team collaboration and project delivery timelines
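Illustrative sketch of the Auto Loader ingestion pattern referenced in this section (assumes a recent Databricks runtime where the cloudFiles source and an ambient spark session are available; bucket paths and table locations are hypothetical):

# Minimal Databricks Auto Loader sketch; "spark" is the ambient SparkSession in a
# Databricks notebook. Paths are hypothetical placeholders.
raw_path = "s3://example-bucket/raw/claims/"              # landing zone (placeholder)
bronze_table = "s3://example-bucket/delta/bronze/claims"  # Delta target (placeholder)
checkpoint = "s3://example-bucket/_checkpoints/claims"

stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint)      # schema inference/evolution
    .load(raw_path)
)

(
    stream.writeStream.format("delta")
    .option("checkpointLocation", checkpoint)             # incremental, exactly-once
    .option("mergeSchema", "true")
    .trigger(availableNow=True)                           # process new files, then stop
    .start(bronze_table)
)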
University of Memphis, Memphis, TN May 2023 – May 2024
Graduate Assistant
Responsibilities:
Responsible for collecting, cleaning, labelling, and conceptualizing large databases.
Utilized SQL queries, Python libraries, MS Access, and Excel to filter and clean data.
Created analysis reports and maintained communication with lab researchers.
Provided a solution using Hive and Sqoop (to export/import data), replacing traditional ETL with HDFS for faster loads to target tables.
Created Hive tables, partitions, and buckets; performed analytics using Hive ad-hoc queries (see the sketch after this list).
Created UDFs and Oozie workflows to Sqoop data from source to HDFS and into target tables.
Imported data from multiple sources using Sqoop, transformed with Hive, loaded into HDFS.
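Illustrative sketch of the partitioned and bucketed Hive tables and ad-hoc analytics referenced above, issued through Spark SQL with Hive support (table and column names are hypothetical):

# Minimal sketch of a partitioned, bucketed Hive table plus an ad-hoc query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-demo").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lab_events (
        event_id STRING,
        subject_id STRING,
        value DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (subject_id) INTO 16 BUCKETS
    STORED AS ORC
""")

# Ad-hoc analytic query against a single partition (benefits from partition pruning).
spark.sql("""
    SELECT subject_id, AVG(value) AS avg_value
    FROM lab_events
    WHERE event_date = '2024-01-15'
    GROUP BY subject_id
""").show()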
NFS IT Solutions, Hyderabad, India June 2020 - July 2022
Data Engineer / Hadoop Developer
Responsibilities:
Applied principles and best practices of Software Configuration Management (SCM) in Agile, Scrum, and Waterfall methodologies, enhancing project efficiency and collaboration
Investigated, analyzed, recommended, configured, installed, and tested new hardware and software, leading to improved system performance and reliability
Verified and validated Business Requirements Document, Test Plan, and Test Strategy documents, ensuring alignment with project goals and reducing errors
Experience working with Git for branching, tagging, and merging, and maintained the Git source code repository
Performed advanced procedures such as text analytics and processing, using the in-memory computing capabilities of Spark with Scala.
Followed Agile methodology and participated in sprints and daily Scrums to deliver software tasks on time, coordinating with onsite and offshore teams
Built and maintained AWS Infrastructure as Code (IaC) using Terraform and CloudFormation, enabling consistent, repeatable deployments across dev, QA, and production environments.
Performed log analysis, process debugging, and performance tuning for Spark and Hive jobs within Hadoop clusters running on Linux-based infrastructure
Worked with OBIEE to build analytical reports on contract billing and supplier data, pulling from Oracle backends; customized RPD layers and subject areas to align with finance and procurement business logic.
Configured and extended contract-to-cash workflows, including advanced billing scenarios like milestone billing, project-based billing, advance billing, write-offs, and intercompany transactions across multiple countries (Canada, Mexico) using PaaS/SaaS UI customization and REST APIs
Supported OBIEE dashboard migration efforts by translating logic into Tableau and Power BI, ensuring consistency in KPIs and maintaining lineage from source systems through to report outputs.
Enhanced workflow orchestration across big data environments, coordinating Oozie and Airflow for Hadoop/Spark task scheduling, and debugging pipelines with end-to-end accountability
Implemented data lake storage on Amazon S3 for structured and semi-structured data ingestion, enhancing data accessibility and scalability
Developed and deployed AWS Lambda functions to automate triggers for data validation and transformation workflows, and monitored job performance and failures using Amazon CloudWatch alerts for SLA breaches
Contributed to PoC for migrating legacy Hive workloads to EMR-based Spark pipelines on AWS, facilitating improved processing efficiency and scalability
Utilized Spark for interactive queries, processing streaming data, and integrating with popular NoSQL databases, enhancing data processing speed and flexibility
Performed detailed source-to-target data mapping for ETL pipelines across disparate systems, ensuring alignment with business rules and transformation logic.
Worked with numerous file formats like Text, Sequence files, Avro, Parquet, ORC, JSON, XML files, and flat files using MapReduce programs, improving data processing versatility and efficiency
Initiated predictive modeling and advanced analytics projects, utilizing machine learning to forecast patient no-shows or readmissions, enhancing preventive care strategies
Led end-to-end SaaS billing platform implementation for enterprise financial systems (e.g. NetSuite, Oracle ERP), customizing billing logic, invoice generation, credit/debit memo handling, and payment rules to align with company-specific processes and finance policies
Created and optimized dimensional models (star/snowflake schemas) and logical and physical models to support business intelligence and reporting use cases.
Worked with stakeholders, product owners, and BI teams to gather data requirements, translate them into logical/physical data models, and deliver actionable insights.
Cataloged S3-based datasets into AWS Glue Data Catalog using AWS Glue Crawlers, enhancing data accessibility for downstream analytics
Analyzed SQL scripts and designed a Scala-based solution, improving data processing efficiency
Resolved performance issues in Hive and Pig scripts by analyzing joins, groups, and aggregations, leading to more efficient MapReduce jobs
Stored data into Spark RDD and performed in-memory data computation, delivering outputs precisely matching the requirement.
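Illustrative sketch of the RDD-based in-memory computation described in the last bullet (the file path and record layout are hypothetical placeholders):

# Minimal PySpark RDD sketch: cache a parsed RDD in memory and run two computations
# over it without re-reading the source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical pipe-delimited billing records: contract_id|event_type|amount
records = sc.textFile("hdfs:///data/billing/events/*.txt")

parsed = (
    records.map(lambda line: line.split("|"))
    .filter(lambda parts: len(parts) == 3)
    .map(lambda parts: (parts[0], float(parts[2])))
    .cache()                                    # keep the parsed RDD in memory
)

total_by_contract = parsed.reduceByKey(lambda a, b: a + b)  # aggregate amounts per contract
event_count = parsed.count()                                # reuses the cached data

print(event_count)
print(total_by_contract.take(5))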