Jagadeesh Nadimpalli
Irving, TX 572-***-**** *********.**********@*******.*** LinkedIn GitHub
(Open to Relocate)
SUMMARY
Data Engineer with 5+ years of experience building scalable pipelines, trusted datasets, and analytics-ready data solutions across enterprise, cloud, and AI-supporting environments. Experienced in batch processing, data modeling, warehouse modernization, and knowledge platforms that support internal RAG and analytics workflows. Actively seeking high-impact data engineering opportunities.
AWARDS, RECOGNITION, CERTIFICATIONS
● Google Cloud Professional Data Engineer
● Snowflake SnowPro Core Certification
● Databricks Data Engineer Associate
Associated with TCS
● Recognized with the TCS ‘On the Spot Award’, ‘Special Achievement Award’, and ‘Best Team Award’ for outstanding contribution, successful migration delivery, and exceptional team performance.
PROFESSIONAL EXPERIENCE
NATIONAL SNOW AND ICE DATA CENTER Remote, US
Data Engineer Apr 2024 - Present
Project 1: Data Pipeline for Internal RAG-Based Knowledge Platform
An end-to-end platform initiative focused on transforming fragmented internal documentation into trusted, retrieval-ready knowledge assets that powered internal AI assistant workflows and improved knowledge access across teams.
● Delivered a medallion pipeline that ingested and normalized 2TB+ of PDFs, Markdown, Confluence pages, and technical documents, establishing a scalable Delta Lake foundation for an internal AI assistant using Databricks, GCS, Cloud Composer, and BigQuery.
● Developed a distributed Spark-based text normalization framework that transforms raw scientific documents into governed, metadata-rich records, reducing multi-source document inconsistency by 40% using schema-driven PySpark logic on Databricks.
● Engineered section-aware chunking to transform standardized documents into embedding-ready, metadata-rich retrieval records with stable identifiers, delivering ~12M governed chunks into a BigQuery vector database layer that enabled production-ready semantic search for internal AI assistant workloads (a chunking sketch follows this project's bullets).
● Implemented multi-stage data validation controls that reduced invalid downstream records by 30%, ensuring only trusted, traceable, embedding-ready records reached BigQuery vector database tables for semantic retrieval through Airflow, PySpark, and Python-based quarantine checks.
● Built a content-hash-based incremental refresh pipeline that reduced chunk regeneration and re-embedding workload by 70%, keeping AI assistant knowledge current without full reprocessing using Cloud Composer and Databricks watermarking logic.
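A minimal sketch of the content-hash refresh idea from the bullet above, assuming a simple JSON manifest as a stand-in for the production Delta/BigQuery state; paths and names are illustrative, not the production code.
```python
# Hedged sketch: re-process a document only when its content hash differs
# from the last recorded run. The manifest file is an illustrative stand-in.
import hashlib, json, pathlib

MANIFEST = pathlib.Path("refresh_manifest.json")  # assumed local stand-in

def content_hash(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def docs_to_refresh(doc_paths):
    seen = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    changed = []
    for p in map(pathlib.Path, doc_paths):
        h = content_hash(p)
        if seen.get(str(p)) != h:   # new or modified since the last run
            changed.append(p)
            seen[str(p)] = h
    MANIFEST.write_text(json.dumps(seen, indent=2))
    return changed  # only these go back through chunking + embedding
```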
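A minimal sketch of section-aware chunking with stable identifiers, as referenced above: split Markdown on headings, window long sections, and hash document, section, and text into a stable chunk ID. All names (SectionChunk, chunk_markdown, max_chars) are illustrative assumptions.
```python
import hashlib
import re
from dataclasses import dataclass

@dataclass
class SectionChunk:
    chunk_id: str      # stable hash of doc id, section heading, and content
    section: str       # nearest heading, e.g. "Methods"
    text: str

def chunk_markdown(doc_id: str, markdown: str, max_chars: int = 1200):
    """Split a Markdown document on headings, then window long sections."""
    chunks, section, buf = [], "ROOT", []
    for line in markdown.splitlines() + ["# __END__"]:  # sentinel flushes last section
        heading = re.match(r"^#{1,6}\s+(.*)", line)
        if heading:
            body = "\n".join(buf).strip()
            for i in range(0, max(len(body), 1), max_chars):
                text = body[i:i + max_chars]
                if text:
                    digest = hashlib.sha256(
                        f"{doc_id}|{section}|{text}".encode()
                    ).hexdigest()[:16]
                    chunks.append(SectionChunk(digest, section, text))
            section, buf = heading.group(1), []
        else:
            buf.append(line)
    return chunks
```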
Project 2: Operational Excellence for Internal RAG Platform
Strengthening production reliability, governance, observability, and scale readiness for an internal RAG platform through controlled operations, trusted data delivery, and disciplined release practices supporting ongoing AI assistant workloads.
● Engineered platform observability that reduced manual health-monitoring effort by 50%, improving operational visibility across AI platform teams through structured BigQuery operational marts and run summaries modeled from Cloud Composer, Cloud Monitoring, and Databricks pipeline metadata.
● Operationalized incident-response workflows that reduced mean time to recovery by 40%, enabling faster recovery for AI assistant workloads through idempotent reruns, BigQuery quarantine visibility, and documented recovery procedures across Cloud Composer and Databricks.
● Implemented governance, lineage, and access controls across a production RAG data platform, improving auditability and downstream data access through Unity Catalog, BigQuery role-based access, and least-privilege service account policies.
● Standardized CI/CD release discipline across Cloud Composer, Databricks, and BigQuery, reducing deployment risk by 30% through GitHub version control, approval-gated environment promotion, and rollback controls.
● Optimized scale-ready refresh operations, reducing reprocessing overhead by 45% and preserving refresh and query performance as corpus volume grew, using Databricks autoscaling, BigQuery partitioning and clustering, and GCS lifecycle controls.
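For illustration, a hedged sketch of the partitioning-and-clustering setup mentioned in the bullet above, using the google-cloud-bigquery client; the project, dataset, and column names are placeholders rather than the production schema.
```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumed project id

schema = [
    bigquery.SchemaField("chunk_id", "STRING"),
    bigquery.SchemaField("source_doc", "STRING"),
    bigquery.SchemaField("ingested_at", "TIMESTAMP"),
]
table = bigquery.Table("my-project.rag_marts.chunks", schema=schema)
# Partition by ingestion date so refresh jobs scan only recent partitions...
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="ingested_at"
)
# ...and cluster by source document so per-document lookups prune blocks.
table.clustering_fields = ["source_doc"]
client.create_table(table, exists_ok=True)
```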
Project 3: Production Data Operations & ML Workflow Automation
Managed production workflows for cryosphere products, including Snow Today, Sea Ice Today, Ice Sheets, and Sea Ice Index, ensuring trusted data availability for scientists, institutions, technical users, and the public across NSIDC’s NASA and NOAA data ecosystem.
● Engineered 10 incremental batch Jenkins pipelines for unstructured satellite and structured sensor data ingestion, maintaining 95% data availability for downstream climate products using Bash scripting and failure recovery on Linux
● Delivered timely satellite and sensor data to an on-premises production data layer for 5+ data products, keeping maps, graphs, downloads, and downstream analysis current for analytics, ML, and public users through Jenkins-driven Linux workflows
● Built a post-ingestion Python validation framework with product-specific integrity checks, reducing stale data releases by 45% to ensure data trust for downstream users through file existence, naming, freshness, and completeness checks on Linux
● Automated data freshness monitoring by scheduling product-specific Python checks in Jenkins 3 times daily, reducing manual effort by 80%, improving observability, and enabling proactive incident triage through Slack and email alerts across 30+ pipelines (a minimal freshness-check sketch follows this project's bullets)
● Established a trusted ground truth for internal classification workflows by screening 4,000+ Antarctic satellite images each month across 27 ice shelves and isolating ~100 cloud-free samples using Jenkins, SCP, and Linux
● Engineered an ML-preprocessing pipeline for 1,500 Antarctic satellite images, resolving a 14:1 class imbalance to 2,800+ samples, standardizing unstructured imagery for model compatibility using Python, augmentation, and grayscale-to-RGB conversion
● Spearheaded a proof of concept for automated 1,500-image classification, validating data quality with 85% accuracy to demonstrate automation feasibility for internal developers using Python, transfer learning, and TensorFlow/Keras on CPU-only infrastructure
● Designed a scalable AWS batch pipeline to support training on 48,000 images and automate classification of 4,000 monthly images, reducing projected review time to ~20 minutes through S3, SageMaker, Step Functions, and DataSync
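A minimal sketch of the product-freshness checks described in the validation and monitoring bullets above; directories, thresholds, and the alerting hook (a non-zero exit that Jenkins turns into Slack/email alerts) are illustrative assumptions.
```python
import pathlib, sys, time

PRODUCT_DIRS = {
    "sea_ice_index": ("/data/products/sea_ice_index", 24 * 3600),  # max age (s)
    "snow_today": ("/data/products/snow_today", 24 * 3600),
}

def newest_mtime(directory: str) -> float:
    files = list(pathlib.Path(directory).glob("*"))
    return max((f.stat().st_mtime for f in files), default=0.0)

def main() -> int:
    stale, now = [], time.time()
    for product, (directory, max_age) in PRODUCT_DIRS.items():
        if now - newest_mtime(directory) > max_age:
            stale.append(product)
    if stale:
        print(f"STALE: {', '.join(stale)}")  # Jenkins surfaces this via alerts
        return 1  # non-zero exit marks the scheduled job as failed
    print("all products fresh")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```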
TATA CONSULTANCY SERVICES Hyderabad, India
Data Engineer, Analytics Aug 2021 - Sep 2023
Client: Apple Inc, California
Project 1: Enterprise Data Warehouse Engineering & Analytics Delivery
Engineered the core warehouse pipelines, data models, and semantic datasets that enabled global retail reporting, Tableau dashboards, and downstream BI analysis.
● Delivered Snowflake ELT workflows across 15+ Python DAGs, maintaining a 98% on-time refresh rate for Tableau dashboards and BI users, by orchestrating task dependencies, retries, and SLA monitoring in Apache Airflow
● Built Python RESTful APIs exposing curated Snowflake datasets to BI tools, improving data access performance by 30% through low-latency endpoints and enforced RBAC governance
● Built scalable ETL/ELT pipelines across structured and semi-structured sources, improving processing efficiency by 33% through parallelized ingestion, incremental loading, and reusable pipeline templates
● Developed 15 fact and dimension tables across Teradata and Snowflake, reducing downstream query complexity for BI analysts by 40%, by implementing Kimball star schemas, staging tables, CTEs, window functions, and semantic-layer design
● Delivered curated semantic tables and views for Tableau dashboards and ad hoc BI analysis, improving analyst data usability by 45%, by translating business requirements into SQL models and analytics-ready datasets
● Automated core ingestion for daily batch files, improving source-data readiness for global sales reporting across 30+ downstream semantic tables, by loading CSV/JSON data from Amazon S3 through external stages and serverless Snowpipe
● Optimized Snowflake analytical queries by 30%, reducing dashboard latency to accelerate business decision-making by leveraging micro-partition pruning, automatic clustering, caching, and Query Profile analysis
● Applied Snowflake Time Travel and zero-copy cloning in UAT workflows, reducing validation setup time by 40%, by querying prior data states with AT/BEFORE and creating isolated test copies for troubleshooting
● Implemented incremental loading across Teradata and Snowflake workflows, reducing full-table reloads and protecting reporting SLAs by using batch_sk, watermark timestamps, SQL-based change logic, AutoSys, and Airflow DAGs (a watermark-loading sketch follows this project's bullets)
● Delivered batch retail pipelines across 25+ Teradata ETL workflows, maintaining a 95% on-time refresh rate for global reporting, by orchestrating core and semantic stored procedures with SQL-based job controls and AutoSys scheduling
● Improved Teradata reporting performance by 20%, stabilizing read-heavy dashboard and ad hoc workloads for stakeholders, using EXPLAIN, COLLECT STATISTICS, PI/PPI/SI indexing, partition pruning, and join optimization
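A hedged sketch of the watermark-based incremental load referenced above, written against a generic DB-API connection (the Snowflake Python connector follows this interface); the control table, schemas, and column names are assumptions.
```python
def incremental_load(conn, batch_sk: int):
    cur = conn.cursor()
    # 1. Read the last committed high-water mark for this workflow.
    cur.execute(
        "SELECT max_loaded_ts FROM etl_control.watermarks WHERE job = %s",
        ("retail_sales",),
    )
    last_ts = cur.fetchone()[0]
    # 2. Insert only source rows that arrived after the watermark.
    cur.execute(
        """
        INSERT INTO core.sales_fact
        SELECT *, %s AS batch_sk
        FROM staging.sales
        WHERE updated_ts > %s
        """,
        (batch_sk, last_ts),
    )
    # 3. Advance the watermark, then commit both steps together.
    cur.execute(
        "UPDATE etl_control.watermarks SET max_loaded_ts = "
        "(SELECT COALESCE(MAX(updated_ts), %s) FROM staging.sales) "
        "WHERE job = %s",
        (last_ts, "retail_sales"),
    )
    conn.commit()
```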
Project 2: Enterprise Data Platform Modernization (Teradata to Snowflake Migration)
Delivered a multi-phase modernization of global retail sales data from a legacy warehouse to a cloud-native platform, improving performance and scalability while ensuring zero-downtime reporting continuity.
● Modernized 30+ legacy Teradata stored procedures into Snowflake-compatible Airflow DAGs, enabling scalable cloud orchestration for global reporting, using SnowConvert and Apache Airflow task dependencies and workflow patterns
● Migrated 7 years of historical retail sales data (1.5TB+) into AWS Snowflake, standardizing the historical source of truth for global sales reporting and analytics, using Amazon S3, external stages, and serverless Snowpipe loading
● Achieved 100% data parity across 30+ migrated tables, ensuring migration accuracy and stakeholder trust, using a Python-based cross-platform validation framework to compare Teradata and Snowflake data (a parity-check sketch follows this project's bullets)
● Improved analytics performance in Snowflake by 50% over legacy Teradata, enabling faster insights for BI analysts, validated through benchmark testing of response latency across core analytical workloads.
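An illustrative sketch of the cross-platform parity check referenced above: compare a row count plus a column aggregate per table across both warehouses. Both connections are assumed DB-API compatible, and the checked column is a placeholder.
```python
# Same query on both platforms; the aggregated column is a placeholder
# for whatever numeric column exists in each migrated table.
CHECK_SQL = "SELECT COUNT(*), SUM(net_amount) FROM {table}"

def table_parity(td_conn, sf_conn, tables):
    mismatches = []
    for table in tables:
        td_cur = td_conn.cursor()
        td_cur.execute(CHECK_SQL.format(table=table))
        sf_cur = sf_conn.cursor()
        sf_cur.execute(CHECK_SQL.format(table=table))
        if td_cur.fetchone() != sf_cur.fetchone():
            mismatches.append(table)
    return mismatches  # empty list == full parity across migrated tables
```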
Project 3: Operational Excellence and Governed Data Delivery
Strengthened trusted analytics operations for global retail sales reporting across Teradata and Snowflake by automating data quality, improving production reliability, and managing controlled releases with governed access for downstream BI users.
● Built an automated data quality framework, reducing manual validation effort by 2+ hours/day and increasing stakeholder trust in dashboards, using 10+ Python and SQL checks for freshness, nulls, and duplicates
● Automated timezone normalization to UTC for retail transactions, reducing manual processing time by 90% and preventing DST-related reporting errors, by developing a robust Python pipeline with reusable conversion logic (a minimal UTC-conversion sketch follows this project's bullets)
● Strengthened pipeline reliability and observability across 25+ workflows, sustaining 98% SLA adherence and reducing dashboard disruptions by enforcing idempotent reruns, monitoring freshness, and configuring email and Slack alerts in Airflow and AutoSys
● Implemented controlled CI/CD for 10+ pipeline-logic and schema changes, achieving 100% UAT-validated production releases and protecting warehouse stability, using GitHub version control and approval-driven production deployments
● Built a metadata lineage tool for 30+ pipelines, accelerating impact analysis by 40% and reducing onboarding time by parsing procedures, DAGs, and job configs with Python into dependency maps
● Implemented data governance and access controls across datasets, achieving 100% compliance and ensuring secure analytics access for downstream users by enforcing PII masking, RBAC, and secure database views using SQL
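A minimal sketch of the DST-safe UTC conversion referenced above, using Python's standard zoneinfo; the store-to-timezone mapping and the timestamp format are illustrative assumptions.
```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical mapping from store id to its IANA timezone.
STORE_TZ = {"SFO01": "America/Los_Angeles", "NYC07": "America/New_York"}

def to_utc(store_id: str, local_ts: str) -> datetime:
    naive = datetime.strptime(local_ts, "%Y-%m-%d %H:%M:%S")
    # Localize with the store's zone, then convert; zoneinfo applies the
    # correct DST offset for the given date automatically.
    aware = naive.replace(tzinfo=ZoneInfo(STORE_TZ[store_id]))
    return aware.astimezone(ZoneInfo("UTC"))

print(to_utc("NYC07", "2023-03-12 03:30:00"))  # DST transition day handled
```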
PROJECTS
Wikipedia Articles Summarization and Conversion to Audio
● Built an autoscaling microservices system on GCP using Kubernetes, Redis, and BigQuery, exposing a REST API that scrapes, summarizes (via LLMs), and converts Wikipedia articles to audio, supporting 200+ users with sub-3s median latency for cached requests
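A hedged sketch of the caching path behind the sub-3s cached responses: Redis is checked before re-scraping and re-summarizing; summarize_article is a hypothetical stand-in for the scrape + LLM step.
```python
import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379)  # assumed local instance

def summarize_article(title: str) -> str:
    # Placeholder for the real scrape + LLM summarization call.
    return f"summary of {title}"

def get_summary(title: str) -> str:
    key = "summary:" + hashlib.sha1(title.encode()).hexdigest()
    if (hit := cache.get(key)) is not None:
        return hit.decode()            # cached path: no scrape, no LLM call
    summary = summarize_article(title)
    cache.set(key, summary, ex=86400)  # expire after a day
    return summary
```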
Self-State Patent Citation Analysis Using PySpark
● Developed a scalable PySpark pipeline on GCP Dataproc to process 6M+ U.S. patent records, optimizing joins and aggregations with 3x speedup, validating self-citation metrics across cluster environments using RDDs and DataFrame APIs
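An illustrative PySpark sketch of the self-state citation join: attach each patent's state to both sides of a citation pair and flag same-state citations. File and column names follow the classic NBER patent/citation files and are assumptions here.
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("self-state-citations").getOrCreate()

patents = spark.read.csv("apat.csv", header=True)    # PATENT, POSTATE, ...
citations = spark.read.csv("cite.csv", header=True)  # CITING, CITED

states = patents.select("PATENT", "POSTATE")
enriched = (
    citations
    # State of the citing patent.
    .join(states.withColumnRenamed("PATENT", "CITING")
                .withColumnRenamed("POSTATE", "CITING_STATE"), "CITING")
    # State of the cited patent.
    .join(states.withColumnRenamed("PATENT", "CITED")
                .withColumnRenamed("POSTATE", "CITED_STATE"), "CITED")
    .withColumn("SAME_STATE",
                (F.col("CITING_STATE") == F.col("CITED_STATE")).cast("int"))
)
enriched.groupBy("CITING_STATE").agg(F.sum("SAME_STATE")).show()
```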
Netflix Movies & TV Dashboard Using Tableau
● Designed an interactive Tableau dashboard using enriched Netflix metadata (8,800+ titles), uncovering global genre trends and enabling drilldowns through Python-preprocessed KPIs, calculated fields, maps, and dynamic filtering actions
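A minimal sketch of the Python preprocessing that could feed such a dashboard: explode multi-genre titles and derive per-genre title counts by year. Column names follow the public Netflix titles dataset; the exact KPI set shown is an assumption.
```python
import pandas as pd

df = pd.read_csv("netflix_titles.csv")           # ~8,800 titles
df["genre"] = df["listed_in"].str.split(", ")    # one row per title, many genres
by_genre = df.explode("genre")                   # one row per (title, genre)
kpis = (
    by_genre.groupby(["release_year", "genre"])
            .size()
            .reset_index(name="title_count")
)
kpis.to_csv("genre_kpis.csv", index=False)       # Tableau data source
```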
SKILLS
● Programming & Scripting: Python (Pandas, NumPy, Scikit-learn, PyTorch), SQL, Bash/Shell, R, SnowSQL, REST APIs
● Data Engineering & Pipelines: Airflow, Apache Spark, Snowflake, Databricks, Teradata, Hadoop, Palantir Foundry
● Analytics Engineering: Data Modeling, Dimensional Modeling, Fact and Dimension Tables, Semantic Layer
● Databases: PostgreSQL, MySQL, Microsoft SQL Server, PL/SQL, NoSQL
● Cloud Platforms: Google Cloud Platform (GKE, Cloud Composer, Dataproc, BigQuery, GCS), AWS (S3, EC2)
● Developer Tools & Workflow: Git, Docker, Kubernetes, Jenkins, JIRA, VS Code
● Visualization & Reporting: Tableau, Excel (Charts, PivotTables), Matplotlib, Seaborn, Plotly
EDUCATION
University of Colorado Boulder Boulder, US
Master of Science in Data Science, GPA: 3.97 / 4.00 Aug 2023 - May 2025
Jawaharlal Nehru Technological University Kakinada Kakinada, India
Bachelor of Technology (B.Tech), Computer Science and Engineering Aug 2017 - Jun 2021