Sanjit Thapa Magar
Email: ******.***@*****.*** | Phone: 469-***-**** | LinkedIn: https://www.linkedin.com/in/sanjit-m-49aa79236/
Professional Summary
Results-oriented Data Engineer with over 7 years of progressive experience designing, developing, and managing large-scale data engineering solutions across cloud and on-premises environments. Proven expertise in architecting robust ETL/ELT pipelines, automating data workflows, and integrating structured, semi-structured, and unstructured data with Apache Spark, Kafka, Snowflake, Airflow, Databricks, and dbt. Demonstrated success implementing data lakes and warehousing solutions on AWS and Azure, driving operational efficiency and accelerating analytics for enterprise clients in healthcare and financial services. Skilled in data quality, governance, and infrastructure automation, with a solid foundation in Python, SQL, and cloud DevOps practices. Strong grasp of modern data stack tools such as Delta Lake, Great Expectations, and Apache Atlas. Recognized for mentoring junior engineers, collaborating across functions, and delivering high-impact data platforms that support real-time and batch processing for critical business insights.
Key Achievements
Reduced ETL job processing time by 40% through Spark optimizations and efficient memory management.
Migrated on-prem data pipelines to AWS Glue and Snowflake, improving data availability by 60%.
Designed real-time Kafka-based ingestion architecture that decreased event latency by 70%.
Created cost-efficient data lakes on AWS S3 and Azure Data Lake Storage, reducing cloud spend by 30%.
Led data governance initiatives, improving data accuracy and compliance across the organization.
Implemented Delta Lake and Great Expectations to enhance pipeline reliability and data validation.
Technical Skills
Operating Systems: Unix, Linux (Ubuntu), Windows
Cloud Platforms: AWS (EC2, Lambda, EMR, Glue, S3, Redshift, Athena, CloudFormation), Azure (ADF, Synapse, Databricks, Data Lake Gen2, Cosmos DB, AKS)
Big Data Technologies: Apache Spark, Hadoop, Hive, Kafka, Kafka Connect, Presto, Trino, Sqoop, Flume, HDFS, MapReduce, ZooKeeper
ETL Tools: Informatica PowerCenter, SSIS, AWS Glue, Azure Data Factory, dbt
Databases: Snowflake, Redshift, MySQL, PostgreSQL, SQL Server, Oracle, MongoDB, DynamoDB
Programming: Python (Pandas, NumPy, PySpark, pytest), Scala, SQL, PL/SQL, KQL
Workflow Orchestration: Apache Airflow, Apache Oozie
CI/CD & DevOps: Git, Jenkins, GitLab, Docker, Kubernetes, GitHub Actions, Azure DevOps
Infrastructure as Code: Terraform, AWS CloudFormation, ARM Templates
Visualization, Monitoring & Issue Tracking: Kibana, Azure Monitor, Elasticsearch, Looker, Tableau, JIRA
Governance & Metadata: Apache Atlas, Collibra, Alation
Education
Master of Science in Business Analytics, Dec 2024 – Texas A&M University-Commerce
Bachelor of Science in Mathematics, Dec 2021 – Texas A&M University-Commerce
Professional Experience
Senior Data Engineer May 2023 – Present
Truist Financial – Charlotte, NC
Created scalable data lakes with enforced encryption policies and partitioning logic, enhancing accessibility while maintaining security and cost-efficiency.
Built and optimized ETL pipelines using AWS Glue, Apache Airflow, and Python, automating over 40 complex workflows to reduce manual effort and human error.
Developed serverless data processing workflows using AWS Lambda, Amazon S3, and DynamoDB, achieving a 95% success rate for real-time event processing across distributed systems.
Implemented Apache Hudi and Delta Lake to support incremental data ingestion, enabling efficient backfills and consistent analytics snapshots.
Designed automated ingestion frameworks provisioned with AWS CloudFormation and Terraform, improving ingestion efficiency and reducing environment setup time by 50%.
Built and maintained real-time Kafka Connect pipelines integrating more than 12 internal systems, ensuring smooth event stream processing and business continuity.
Integrated over 20 external REST APIs and system connectors into unified pipelines using Python, expanding the range of data sources available for analytics.
Improved PostgreSQL performance by writing optimized functions and leveraging parallel query execution capabilities.
Integrated AWS Athena and Amazon Redshift Spectrum for ad-hoc querying capabilities, significantly reducing query response time and enabling real-time analytics on large-scale datasets stored in S3.
Designed complex Snowflake workflows using dbt, enabling 150+ stakeholders to access consistent, governed data.
Built data profiling modules with Great Expectations, reducing data quality issues in reports by 60%.
Deployed automated CI/CD pipelines using GitHub Actions, Jenkins, and Kubernetes, achieving 98% deployment success rate.
Implemented advanced monitoring and logging solutions using AWS CloudWatch, Kibana, and Elasticsearch, improving issue detection and resolution turnaround by 40%.
Led adoption of Apache Atlas for cataloging 10+ data domains, enabling improved data lineage visibility.
Partnered with data scientists to provision real-time ML feature stores, accelerating model inference.
Conducted training sessions for junior developers on best practices for pipeline design and unit testing.
Technologies Used: AWS Glue, Apache Airflow, Lambda, S3, DynamoDB, Redshift, Kafka, Kinesis, Python, Snowflake, dbt, Jenkins, GitHub Actions, Kubernetes, Great Expectations, Apache Atlas, AWS CloudFormation, Terraform, AWS Athena, Redshift Spectrum, CloudWatch, Elasticsearch, Kibana
Big Data Developer Aug 2020 – Mar 2023
Cerner Corporation – Kansas City, MO
Designed end-to-end ingestion pipelines from Azure Blob to Snowflake using Azure Data Factory (ADF), Snowpipe, and metadata-driven ingestion logic.
Built complex data marts and views in Azure Synapse, optimizing joins, partitions, and CTAS queries for better performance.
Created modular dbt packages with dynamic configurations across dev/test/prod environments, shortening onboarding time for new team members.
Developed and integrated data anomaly detection layers using Great Expectations and KQL rules in Azure Monitor.
Implemented event-driven workflows using Azure Functions triggered by Cosmos DB and Event Grid, improving notification systems.
Designed 50+ PySpark pipelines for cleaning and aggregating clinical data, delivering daily insights for operations.
Developed schema registry workflows to manage schema evolution across streaming and batch pipelines, ensuring backward compatibility.
Automated Spark job deployment via Databricks CLI, decreasing pipeline setup time for new projects.
Implemented version-controlled Terraform scripts for infrastructure provisioning, reducing onboarding time by 40%.
Built robust error-handling and retry logic across ingestion pipelines to guarantee SLA adherence.
Integrated Azure Key Vault into data pipelines for secure credential management and improved compliance posture.
Contributed to internal knowledge bases, documenting best practices for data reliability, schema evolution, and CI/CD patterns.
Engaged with cross-functional healthcare teams to ensure HIPAA-compliant pipeline designs and data masking.
Technologies Used: Informatica PowerCenter, Apache Flume, Kafka, HDFS, Presto, dbt, PySpark, SQL Server, Snowflake, Azure Synapse, Azure Data Factory, Databricks, Cosmos DB, Azure Functions, Terraform, Apache Atlas, Jenkins, Great Expectations, Elasticsearch, JIRA, Kibana
Data Engineer Jan 2017 – Jun 2020
WinCo Foods – Boise, ID
Designed and supported more than 30 high-throughput batch pipelines for sales, inventory, and supplier data using Informatica PowerCenter and Python scripts.
Modernized legacy MapReduce workflows into PySpark-based modules, improving maintainability and performance.
Built reconciliation frameworks using SQL Server and Snowflake to ensure data parity and compliance for audits.
Used Apache Flume and Kafka to move log files and transactional data to HDFS, allowing real-time log analysis.
Maintained Presto queries for ad hoc analysis across large Parquet datasets, reducing query times by 50%.
Designed and maintained dbt models with dimensional modeling techniques supporting dashboards used by 300+ stores.
Built automated validation scripts using Great Expectations and integrated them into Jenkins CI pipelines.
Created Kibana-based dashboards to visualize job execution metrics, which reduced SLA breaches by 45%.
Developed centralized metadata stores and Apache Atlas-based lineage views, streamlining audits and data access requests.
Built alerting and monitoring tools with Elasticsearch and JIRA, significantly improving issue resolution turnaround time.
Led a team-wide initiative to implement code versioning standards and enforce peer code reviews, increasing deployment stability.
Collaborated with retail analytics team to deliver daily executive dashboards, ensuring alignment with business KPIs.
Technologies Used: Informatica PowerCenter, Apache Flume, Kafka, HDFS, Presto, dbt, PySpark, SQL Server, Snowflake, Apache Atlas, Jenkins, Great Expectations, Elasticsearch, JIRA, Kibana