Bikash Thapa Magar
Senior Data Engineer
Farmers Branch, TX ***************@*****.*** 614-***-****
Professional Summary:
Results-driven Data Engineer with 7+ years of experience building scalable, high-performance data and web solutions across the Banking, E-commerce, Regulatory Compliance, and Operational Risk domains.
• Proven track record of architecting large-scale data pipelines that process structured and unstructured data from complex sources, including clinical trials, legal documents, academic research, and financial filings.
• Expert in developing automated, fault-tolerant ETL workflows using Spark, Snowflake, dbt, Python, and SQL, with a strong focus on performance optimization and cost efficiency.
• Skilled in building data ingestion and preprocessing systems that support real-time machine learning applications and advanced analytics.
• Hands-on experience with cloud platforms (AWS, Azure), containerization (Docker, Kubernetes), and orchestration tools (Airflow, Jenkins, Azure DevOps) to streamline deployments through CI/CD pipelines.
• Proficient in delivering end-to-end data products, from raw ingestion to insightful dashboards in Power BI, Tableau, and AWS QuickSight.
• Experienced in architecting and optimizing data pipelines and data warehouse ecosystems (Snowflake, Redshift, BigQuery, Hive, HBase) on cloud platforms (AWS, GCP, Azure), with strong proficiency in SQL, Python, and performance tuning for large-scale AI/ML workloads.
• Hands-on experience with advanced database technologies and NoSQL systems, including MongoDB, Cassandra, Redis, and PostgreSQL, alongside building chatbot-specific pipelines, integrating NLP models, and supporting conversational AI metrics to enhance reliability and user experience.
• Known for translating business needs into robust data models, collaborating effectively in Agile environments, and mentoring junior developers.
• Committed to data quality, security, and governance, with practical experience implementing secure transmission protocols, compliance workflows, and audit traceability across the enterprise.
Technical Skills:
Programming Languages: Python (Pandas, NumPy, SQLAlchemy, Scikit-learn, PyTorch), R (tidyverse, ggplot2, shiny), SQL, Java, Scala, Data Structures
AI/ML: Generative AI, Prompt Engineering, Deep Learning, Data Mining, LangChain, LangGraph, TensorFlow
Big Data & Processing: PySpark, HDFS, MapReduce, Hive, Spark SQL, Kafka, Airflow, Hadoop, Data Security, Data Ingestion, Data Warehouse, Data Lake, DataBricks, Data Governance, Data Pipelines
Cloud Platforms: AWS (SageMaker, EMR, S3, Lambda, Redshift, Athena, Glue), GCP (BigQuery), Azure (Azure SQL, Azure Synapse Analytics, Azure Data Factory)
Data Visualization & BI Tools: Power BI, Tableau, SAP Analytics Cloud, SAP Lumira, Matplotlib, Seaborn, AWS QuickSight
Databases: Snowflake, PostgreSQL, MongoDB, SQL Server
DevOps & CI/CD: Git, Jenkins, GitHub Actions, GitLab CI/CD, JIRA, Linux, Confluence, uDeploy, Artifactory
Containerization & Orchestration: Docker, Kubernetes
Infrastructure-as-Code: Terraform, AWS CloudFormation, Boto3
Development Frameworks: FastAPI, Flask, Spring Boot, React, Redux, Django
Mathematics: Probability and Statistics, Linear Algebra, Calculus, Geometry, Analytical Thinking, Data Science, Machine Learning, Problem-solving Skills
Professional Experience
Pfizer – Remote Sep 2022 – Present
Senior Data Engineer
Responsibilities:
• Developed and enhanced scalable ETL workflows using Apache Spark and Airflow to process multi-format data sources such as JSON, XML, Parquet, and delimited files, supporting finance and risk data pipelines.
• Designed and supported real-time ingestion pipelines using Azure Service Bus to ingest financial transactions and regulatory data, enabling responsive trading and compliance platforms.
• Applied machine learning, natural language processing, and causal inference to standardize complex datasets and generate actionable insights in pharmaceutical and healthcare domains.
• Led a cross-functional team of data engineers and analysts to deliver a compliant, enterprise-grade data lake platform that consolidated clinical, operational, and financial datasets across business units.
• Mentored junior engineers on Spark performance tuning, PySpark best practices, and secure data pipeline design aligned with HIPAA compliance standards.
• Tuned Spark job performance and optimized cluster configurations to handle petabyte-scale data volumes, improving financial query performance for data science and reporting teams by over 40%.
• Integrated portfolio, accounting, and investment datasets from client-facing systems into unified analytical views to support onboarding of institutional partners.
• Built end-to-end pipelines on Google Cloud Platform (GCP) using Azure SQL, Cloud Composer (Airflow), DataProc, and Pub/Sub to enable real-time business intelligence on real estate and healthcare asset flows.
• Created star-schema based dimensional models to support healthcare and financial operations reporting across internal BI tools.
• Implemented pytest-driven validation layers across ETL pipelines to ensure accuracy, consistency, and compliance with internal QA standards.
• Defined and documented enterprise-wide ingestion and transformation rules using Python, PySpark, dbt, and Snowflake for complex financial instruments like Derivatives, Fixed Income, and Futures.
• Automated large-scale batch ETL flows using PySpark, adding monitoring and data quality checkpoints, which reduced end-to-end processing time by 40% and ensured reliable ingestion of investment and insurance data.
• Used Delta Lake on Databricks along with Snowflake to manage real-time and batch processing while maintaining ACID properties for compliance and availability.
• Designed a centralized data lake architecture that consolidated multiple financial and operational systems into a single source of truth, enabling unified analytics for cross-departmental insights.
• Collaborated with data governance and compliance teams to embed HIPAA-aligned practices into data pipelines, including encryption, PII masking, and audit logging of healthcare-related financial data.
• Participated in agile-based data delivery cycles, working closely with business users and compliance officers to translate complex financial and regulatory requirements into reliable data pipelines.
• Collaborative and forward-thinking data engineer with a track record of designing robust schemas, ensuring data health and governance, and integrating seamlessly with enterprise platforms.
Capital One – Remote May 2020 – Aug 2022
Senior Data Engineer
Responsibilities:
• Managed and maintained ETL processes and jobs across development, UAT, and production environments, ensuring data consistency and operational stability.
• Created and optimized complex ETL workflows using PySpark, Hive, and Spark SQL, handling data ingestion from relational databases and semi-structured formats such as Parquet and JSON.
• Led the design and implementation of a scalable data infrastructure on AWS, migrating legacy on-premise data warehouses to a modern data lake architecture, resulting in a 30% reduction in processing time and a 15% decrease in operational costs.
• Utilized CI/CD processes with Jenkins and Azure DevOps to streamline the build and version control of data pipelines, improving deployment efficiency and reducing manual error.
• Created and managed service tickets for all ETL and job-related tasks, ensuring proper tracking and resolution of issues.
• Designed and developed automated ETL pipelines and Python scripts, including job scheduling and dependency management, to support data workflows and business reporting.
• Built real-time Spark Streaming pipelines integrated with Kafka, enabling event-driven data processing and analytics.
• Developed and managed data lakes using AWS S3, Redshift, and Glue, supporting large-scale analytics and BI reporting.
• Containerized Spark and ML jobs using Docker and deployed them on Kubernetes clusters, enabling elastic scaling and resource efficiency.
• Implemented infrastructure as code using Terraform, Boto3, and AWS CloudFormation to provision secure and scalable cloud environments with strict IAM policies and VPC segregation.
• Developed CI/CD pipelines using Jenkins to automate build, test, and deployment processes for microservices on AWS EKS and Kubernetes.
• Designed and implemented automated data reconciliation and governance processes for financial datasets using SQL and PySpark, ensuring data quality and compliance.
• Designed and maintained data models employing star and snowflake schemas, performing query optimization and performance tuning within the Snowflake data warehouse.
Signify Health – Dallas, TX Jan 2018 – April 2020
Data Engineer – Data Warehouse Modernization & Analytics Enablement
Responsibilities:
• Built data archival processes using HDFS and Hive to optimize storage and improve read performance on large datasets.
• Created validation routines using Python and SQL to verify the consistency of ingested healthcare data before transformation.
• Developed lineage tracking for ETL jobs to ensure transparency from raw ingestion to final reporting layers.
• Designed ingestion workflows for patient analytics data using Hive external tables and Sqoop imports from relational databases.
• Worked on upgrading the legacy data environment using Hadoop components such as HDFS, Hive, and Spark to meet new reporting and data volume demands.
• Built ingestion workflows using Sqoop to move relational data into Hadoop and transformed data using Hive and PySpark.
• Developed Spark applications using RDD and DataFrame APIs for cleaning, validating, and enriching operational, web, and marketing datasets.
• Created Hive schemas based on star schema models to support SEO and marketing performance reporting.
• Managed data validation using Python scripts and SQL queries to ensure data quality throughout the pipeline.
• Scheduled ETL workflows with Apache Oozie and evaluated Apache Airflow for DAG-based orchestration improvements.
• Designed custom Hive UDFs in Java to handle campaign-specific parsing, tagging, and classification logic.
• Performed reconciliation between raw input data and final datasets using row counts and checksum comparison methods.
• Developed and deployed a full-stack web application using Java, Angular 6, and JavaScript within a microservices-based architecture, enabling real-time campaign metrics visualization and driving business impact.
Certifications/Awards
AWS Certified: AWS Certified Solutions Architect – Associate
AWS Certified: AWS Certified Data Engineer
Insta Award: Recognized as one of the Best Developers at Infosys.
Research:
NVIDIA CUDA PSSL: Researched and developed an algorithm to compute line segment intersections in computer chips and network paths, achieving a 25% improvement over the plane sweep algorithm. Applied this to optimize geometric applications in real-time systems. Implemented a bounding box on the Per Segment Sweep Line (PSSL) using the parallelization provided by CUDA Thrust on GPUs.
Breast Cancer Detection: Developed a machine learning model using Python, Scikit-learn, and TensorFlow, achieving 97% accuracy on the Breast Cancer Wisconsin Diagnostic Dataset.
Education:
• Master of Computer Science
Southern Illinois University Edwardsville