Yashwantej D
Email: ************@*****.***
Phone: 330-***-**** Cleveland, OH
https://www.linkedin.com/in/yashwantanalyst/
PROFESSIONAL SUMMARY:
Data Engineer with 5+ years of specialized experience in Databricks, AWS, and modern data stacks (Airflow, S3, Spark, Lakehouse). Experienced in cloud migrations, PVC-to-SaaS transitions, and scalable data pipelines. Skilled in PL/SQL, Python, and data visualization tools like Tableau, with a strong background in healthcare, finance, and e-commerce domains. Committed to leveraging expertise in data engineering to drive efficient data solutions and support business analytics.
Effective team member, collaborative, and comfortable working independently.
Utilized Amazon Athena for ad hoc queries and analysis of raw and curated datasets stored in S3.
Proficient in using tools like Erwin and Power Designer for model visualization, lineage, and versioning.
Developed complex SQL and Snowflake SQL scripts for data validation, transformation, and aggregation in support of analytics initiatives.
Experienced in developing Snowflake-based data warehouses with emphasis on scalability, performance tuning, and cost optimization.
Efficient with the Hive data warehouse: creating tables, distributing data through partitioning and bucketing strategies, and writing and optimizing HiveQL queries.
Experience in ingestion, storage, querying, processing, and analysis of big data, with hands-on experience in Apache Spark, Spark SQL, and Hive.
Built ETL pipelines and automated workflows in Alteryx to extract, transform, and load data from diverse sources (SQL Server, XML, CSV, Excel), enabling smooth migration, modeling, and dashboarding in Tableau.
Experience designing, developing, and deploying projects across the AWS suite, including Amazon S3, Redshift, EMR, Glue, Lambda, Athena, Step Functions, CloudWatch, and QuickSight.
Designed, tested, and maintained the data management and processing systems using Spark, AWS, Hadoop, and shell scripting.
Designed and implemented CI/CD pipelines using Azure DevOps for automated build, test, and deployment workflows across multiple environments.
Integrated code repositories (Git), build pipelines, and release pipelines to ensure rapid, consistent software delivery with rollback strategies.
Created custom YAML pipelines with automated testing, versioning, artifact generation, and deployment to Azure App Services and Kubernetes clusters.
Developed and standardized data dictionaries for enterprise datasets, including field definitions, data types, permissible values, and business context.
Implemented monitoring, alerting, and logging for ingestion to ensure reliability and traceability of data flows.
Experience with inferential statistics, hypothesis testing, and statistical modeling in Python and SAS.
Built regression-based predictive models in SAS for behavior analysis (e.g., negative binomial or NBD regression), extracting key variables to predict consumer purchase behavior.
Created and maintained data mapping documents to trace lineage, validate transformations, and support audit/compliance requirements.
Designed and maintained scalable data platforms for real-time analytics using Databricks, Spark, Azure, and Kafka/Kinesis, processing millions of events per minute.
Built and optimized ELT/ETL pipelines with Fivetran, Airflow, Databricks Workflows, and DBT, ensuring automated, hardened, and production-grade data workflows.
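Illustrative sketch of the kind of Airflow-plus-DBT orchestration described above (assumes Airflow 2.4+ and a dbt project available on the worker; the DAG id, task ids, and paths are hypothetical, not taken from an actual engagement):

# Minimal Airflow DAG sketch: stage raw files, then run DBT models.
# dag_id, task ids, and the dbt project path are hypothetical placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def stage_raw_files(**context):
    """Placeholder extract step; a real pipeline would call boto3, an API, or a Fivetran sync."""
    print(f"Staging raw files for run {context['ds']}")

with DAG(
    dag_id="daily_elt_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="stage_raw_files", python_callable=stage_raw_files)
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/project && dbt run --select staging+",
    )
    stage >> dbt_run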
PROJECTS: https://github.com/Yashds691543
Winery Business (Vineyard and Sales): Designed ER diagrams, normalized schema, implemented full SDLC
Python Search Engine: GUI-based IR system using TF-IDF, cosine similarity, Tkinter
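Illustrative sketch of the TF-IDF and cosine-similarity ranking behind the search engine project (toy corpus and function names are illustrative, not the repository's actual code):

# Toy sketch of TF-IDF ranking with cosine similarity using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "red wine from the estate vineyard",
    "white wine sales by region",
    "grape harvest and vineyard yield report",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs)  # TF-IDF weights per document

def search(query, top_k=2):
    """Rank documents by cosine similarity between the query and document vectors."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(docs[i], float(scores[i])) for i in ranked]

print(search("vineyard wine"))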
EDUCATION:
MS in Data Science, University of Memphis, TN (GPA: 3.70) 08/2022 – 05/2024
BS in Computer Science, JNTU, India (GPA: 3.10) 06/2016 – 06/2020
TECHNICAL SKILLS:
Analysis and Modelling Tools: Erwin r9.6/r9.5/r9.1/r8.x, Sybase PowerDesigner, Hackolade v2.5.2, Oracle Designer, BPwin, ER/Studio, MS Access 2000, Oracle, star-schema and snowflake-schema modeling, fact and dimension tables, pivot tables, Tableau, Power BI, TIBCO Spotfire, Business Objects.
OLAP Tools: Tableau, SAP BO, SSAS, Business Objects, and Crystal Reports 9.
Programming Languages & Libraries: Python (Pandas, NumPy, SciPy, PySpark, TensorFlow), T-SQL, PL/SQL, Teradata SQL, C++, Java, HTML, UNIX shell scripting, Snowflake, Apache Airflow, Data Fusion, Dataprep, Informatica, Fivetran
Big Data: Apache Spark, Hadoop, HDFS, YARN, Hive, Sqoop, MapReduce, Tez, Ambari, Zookeeper, Data warehousing, Starburst, Collibra
ETL Tools: SSIS, Pentaho, Informatica PowerCenter 9.7/9.6/9.5, DataStage, Talend, Snowflake.
Databases: MySQL, SQL Server, DB2, Cassandra, Teradata, BigQuery, Druid, Oracle 11g/12c, SAP HANA, Amazon RDS, Snowflake
Cloud: Amazon Web Services (AWS) – S3, Redshift, EMR, Glue, Lambda, Athena, Step Functions, RDS, EC2, CloudWatch; Azure – Synapse, Data Factory (ADF).
Methodologies: Agile (Scrum), Waterfall.
Data Visualization Tools: Looker, Power BI, Microsoft Excel (pivot tables, graphs, charts, dashboards)
Tools: Automic, Hue, Looker, IntelliJ IDEA, Eclipse, Maven, ZooKeeper, VMware, PuTTY, DbVisualizer.
Certification: Google Data Analytics Professional Certificate (Link) August 2024.
Version Control & Protocols: Git, GitHub, SFTP, SSH, LINUX, UNIX.
PROFESSIONAL EXPERIENCE:
Quantum Code, Kissimmee, FL Sep 2024 – Present
BIOGEN (Contract)
Data Engineer
Responsibilities:
Built ETL pipelines from data lake to various databases, improving data accessibility and processing efficiency
Designed and developed data applications using Hadoop, HDFS, Hive, Spark, Scala, and other tools, enhancing data processing capabilities
Made data available in Amazon Redshift by performing ELT with AWS Glue to move data from Amazon S3 into Redshift tables
Developed jobs using SSIS and Pentaho tools to transform and load the data to data marts.
Architected solutions in AWS for a large claims data mart, optimizing query execution in Redshift for Medicare and commercial populations
Wrote complex SQL queries on Redshift involving joins, user-defined functions (UDFs), and scheduled jobs to help the reporting team generate reports on bonus and HSA incentives and compare daily step activity across the population.
Performed claims (EDI) and billing analysis, working with ICD-9/-10, CPT, DRG coding, Medicaid/Medicare claims, and EDI formats to support NCQA and accreditation reporting
Managed clinical trial and research data, building and validating datasets (CRFs, tables, listings) for phases I-IV studies in compliance with regulatory guidelines
Ensured data integrity and regulatory compliance, implementing HIPAA-aligned processes, validating data via QA testing, and maintaining policies around access and confidentiality
Managed data migrations and conversions, handling legacy data load for contracts, rates, billing events, suppliers/customers, and invoices utilizing ETL tools, APIs, and SFTP feed mechanisms
Automated ingestion workflows from on-premises systems (Hadoop, Oracle) into AWS using Airflow in conjunction with Sqoop, Spark, and Cloud Dataflow, enabling reliable migration of 500+ tables to S3/Dataflow
Integrated Alteryx with Tableau reporting, using Alteryx for data cleansing, blending, and creating extracts (e.g., Tableau Data Extracts) to power interactive dashboards with blended and optimized data sources.
Developed statistical forecasting and predictive models using SAS tools like SAS Studio, SAS Enterprise Guide, and PROC SQL to analyze customer behavior and operational demand.
Developed and deployed predictive models using Python (scikit-learn, XGBoost) to forecast trends, detect anomalies, and support decision-making.
Developed base and consumption tables in the data lake and facilitated data movement to Teradata, improving data accessibility and integration
Developed streaming data architectures using Spark Streaming, Kafka, Flume, Kinesis, and Flink, enabling near real-time event processing at scale
Employed Databricks DLT and Auto Loader, alongside Delta Lake, to build robust, event-driven pipelines with efficient schema enforcement and incremental loading (see the sketch at the end of this section)
Developed a proof-of-concept prototype with fast iterations, maintaining design documentation, test cases, monitoring, and performance evaluations using Git, PuTTY, Maven, Confluence, ETL tooling, Automic, ZooKeeper, and Cluster Manager, improving project documentation and testing efficiency
Developed automated Power BI dashboards using DAX and data modelling techniques to visualize claims and incentive data across Medicare and commercial populations.
Wrote Python scripts to parse and normalize data from diverse file formats, including JSON, XML, and CSV before loading into Redshift and Snowflake.
Used SFTP and Git to manage data and maintain version control for ETL scripts and pipeline configurations.
Passionate about staying current with the latest data technologies, including Databricks DLT, Delta Lake, Auto Loader, DBT, and new open-source frameworks, while championing best practices across engineering teams.
Built metadata pipelines and standardized business logic across vault layers using tools like Airflow and Python in Snowflake environments.
Led Databricks PVC-to-SaaS workspace migration, optimizing cluster configurations, mount points, and workspace policies for scalable multi-tenant data processing
Presented actionable insights from healthcare eligibility and claims data to non-technical business stakeholders, helping improve patient program engagement.
Built semantic models with row-level security and integrated Power BI with curated datasets in Redshift and Snowflake, enabling self-service reporting while ensuring compliance with data access policies.
Automated validations between different databases using shell scripting and reported data quality to users with the Aorta and Unified Data Pipeline frameworks, enhancing data accuracy and user trust
Ensured ETL processes succeeded and data loaded successfully in Snowflake DB, leading to seamless data availability for analytics
Worked extensively in Linux environments to deploy, schedule, and monitor big data workflows using Shell scripting, Airflow, and Oozie
Troubleshot issues related to data pipeline failures or slowness using MapReduce, Tez, Hive, or Spark, ensuring SLA adherence and minimizing downtime.
Optimized Hive scripts by reengineering the DAG logic to use minimal resources and provide high throughput.
Worked with business users to resolve discrepancies like error records and duplicate records across tables, writing complex PL/SQL queries to validate reports.
Integrated Collibra for metadata management and Starburst for federated querying across AWS and Databricks, ensuring unified governance and auditability
Created data ingestion processes to maintain the Global Data Lake on Amazon S3 and Redshift.
Delivered Power BI dashboards that helped Medicare stakeholders track population-level trends and improved visibility into HSA incentive engagement.
Hands-on experience with the Hadoop ecosystem (HDFS, Hive, MapReduce, HBase, Impala, Spark)
Followed Agile methodology and used JIRA for sprint planning and task management, improving team collaboration and project delivery timelines
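Illustrative sketch of the Auto Loader ingestion pattern referenced in this section (assumes a recent Databricks runtime where the cloudFiles source and an ambient spark session are available; bucket paths and table locations are hypothetical):

# Minimal Databricks Auto Loader sketch; "spark" is the ambient SparkSession in a
# Databricks notebook. Paths are hypothetical placeholders.
raw_path = "s3://example-bucket/raw/claims/"              # landing zone (placeholder)
bronze_table = "s3://example-bucket/delta/bronze/claims"  # Delta target (placeholder)
checkpoint = "s3://example-bucket/_checkpoints/claims"

stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint)      # schema inference/evolution
    .load(raw_path)
)

(
    stream.writeStream.format("delta")
    .option("checkpointLocation", checkpoint)             # incremental, exactly-once
    .option("mergeSchema", "true")
    .trigger(availableNow=True)                           # process new files, then stop
    .start(bronze_table)
)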
University of Memphis, Memphis, TN May 2023 – May 2024
Graduate Assistant
Responsibilities:
Responsible for collecting, cleaning, labelling, and conceptualizing large databases.
Utilized SQL queries, Python libraries, MS Access, and Excel to filter and clean data.
Created analysis reports and maintained communication with lab researchers.
Provided a solution using Hive and Sqoop (to export/import data), replacing traditional ETL with HDFS for faster loads to target tables.
Created Hive tables, partitions, and buckets; performed analytics using Hive ad-hoc queries (see the sketch after this list).
Created UDFs and Oozie workflows to Sqoop data from source to HDFS and into target tables.
Imported data from multiple sources using Sqoop, transformed with Hive, loaded into HDFS.
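Illustrative sketch of the partitioned and bucketed Hive tables and ad-hoc analytics referenced above, issued through Spark SQL with Hive support (table and column names are hypothetical):

# Minimal sketch of a partitioned, bucketed Hive table plus an ad-hoc query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-demo").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lab_events (
        event_id STRING,
        subject_id STRING,
        value DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (subject_id) INTO 16 BUCKETS
    STORED AS ORC
""")

# Ad-hoc analytic query against a single partition (benefits from partition pruning).
spark.sql("""
    SELECT subject_id, AVG(value) AS avg_value
    FROM lab_events
    WHERE event_date = '2024-01-15'
    GROUP BY subject_id
""").show()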
NFS IT Solutions, Hyderabad, India June 2020 - July 2022
Data Engineer / Hadoop Developer
Responsibilities:
Applied principles and best practices of Software Configuration Management (SCM) in Agile, Scrum, and Waterfall methodologies, enhancing project efficiency and collaboration
Investigated, analyzed, recommended, configured, installed, and tested new hardware and software, leading to improved system performance and reliability
Verified and validated Business Requirements Document, Test Plan, and Test Strategy documents, ensuring alignment with project goals and reducing errors
Experience working with Git for branching, tagging, and merging, and maintained the Git source code repository
Performed advanced procedures such as text analytics and processing, using the in-memory computing capabilities of Spark with Scala.
Followed Agile methodology and participated in sprints and daily Scrums to deliver software tasks on time, coordinating with onsite and offshore teams
Built and maintained AWS Infrastructure as Code (IaC) using Terraform and CloudFormation, enabling consistent, repeatable deployments across dev, QA, and production environments.
Performed log analysis, process debugging, and performance tuning for Spark and Hive jobs within Hadoop clusters running on Linux-based infrastructure
Worked with OBIEE to build analytical reports on contract billing and supplier data, pulling from Oracle backends; customized RPD layers and subject areas to align with finance and procurement business logic.
Configured and extended contract-to-cash workflows, including advanced billing scenarios like milestone billing, project-based billing, advance billing, write-offs, and intercompany transactions across multiple countries (Canada, Mexico) using PaaS/SaaS UI customization and REST APIs
Supported OBIEE dashboard migration efforts by translating logic into Tableau and Power BI, ensuring consistency in KPIs and maintaining lineage from source systems through to report outputs.
Enhanced workflow orchestration across big data environments, coordinating Oozie and Airflow for Hadoop/Spark task scheduling, and debugging pipelines with end-to-end accountability
Implemented data lake storage on Amazon S3 for structured and semi-structured data ingestion, enhancing data accessibility and scalability
Developed and deployed AWS Lambda functions to automate triggers for data validation and transformation workflows, and monitored job performance and failures using Amazon CloudWatch alerts for SLA breaches
Contributed to PoC for migrating legacy Hive workloads to EMR-based Spark pipelines on AWS, facilitating improved processing efficiency and scalability
Utilized Spark for interactive queries, processing streaming data, and integrating with popular NoSQL databases, enhancing data processing speed and flexibility
Performed detailed source-to-target data mapping for ETL pipelines across disparate systems, ensuring alignment with business rules and transformation logic.
Worked with numerous file formats like Text, Sequence files, Avro, Parquet, ORC, JSON, XML files, and flat files using MapReduce programs, improving data processing versatility and efficiency
Initiated predictive modeling and advanced analytics projects, utilizing machine learning to forecast patient no-shows or readmissions, enhancing preventive care strategies
Led end-to-end SaaS billing platform implementation for enterprise financial systems (e.g. NetSuite, Oracle ERP), customizing billing logic, invoice generation, credit/debit memo handling, and payment rules to align with company-specific processes and finance policies
Created and optimized dimensional models (star/snowflake schemas) and logical and physical models to support business intelligence and reporting use cases.
Worked with stakeholders, product owners, and BI teams to gather data requirements, translate them into logical/physical data models, and deliver actionable insights.
Cataloged S3-based datasets into AWS Glue Data Catalog using AWS Glue Crawlers, enhancing data accessibility for downstream analytics
Analyzed SQL scripts and designed a Scala-based solution, improving data processing efficiency
Resolved performance issues in Hive and Pig scripts by analyzing joins, groups, and aggregations, leading to more efficient MapReduce jobs
Stored data into Spark RDD and performed in-memory data computation, delivering outputs precisely matching the requirement.
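Illustrative sketch of the RDD-based in-memory computation described in the last bullet (the file path and record layout are hypothetical placeholders):

# Minimal PySpark RDD sketch: cache a parsed RDD in memory and run two computations
# over it without re-reading the source.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical pipe-delimited billing records: contract_id|event_type|amount
records = sc.textFile("hdfs:///data/billing/events/*.txt")

parsed = (
    records.map(lambda line: line.split("|"))
    .filter(lambda parts: len(parts) == 3)
    .map(lambda parts: (parts[0], float(parts[2])))
    .cache()                                    # keep the parsed RDD in memory
)

total_by_contract = parsed.reduceByKey(lambda a, b: a + b)  # aggregate amounts per contract
event_count = parsed.count()                                # reuses the cached data

print(event_count)
print(total_by_contract.take(5))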