Name: Mandeep Adhikari, PhD – Senior Data Engineer
Email: *******.***********@*****.***
Phone: (408) 673-7412
Professional Summary:
Data Engineer with 8+ years of experience in ETL development, big data analytics, data modeling, and the design of scalable data solutions on cloud platforms including AWS, Azure, and GCP.
Expertise in big data technologies, including Apache Spark (Core, SQL, Streaming, MLlib, GraphX), Hadoop (HDFS, MapReduce, Hive, Pig), Flume, and Kafka for batch and real-time data processing.
Proficient in Python, PySpark, Scala, and SQL for data pipelines and machine learning, with a strong foundation in statistics and data science.
Hands-on experience with Docker, Kubernetes, and Terraform for containerization, microservices architecture, and infrastructure as code.
Strong understanding of the Software Development Life Cycle (SDLC), with expertise in requirements analysis, design specification, and testing under both Waterfall and Agile methodologies.
Excellent communication skills, collaborating effectively with cross-functional teams, stakeholders, and business analysts to ensure alignment between technical solutions and business objectives.
Designed and implemented distributed systems leveraging the Hadoop ecosystem components, including HDFS, MapReduce, and YARN, ensuring efficient data storage and processing.
Proficient in structuring data into medallion layers (Bronze, Silver, Gold) using Databricks and designing OLAP data models to enable seamless data transformation with improved quality and scalability.
Experienced in developing and orchestrating ETL pipelines using Python, PySpark, Talend, MapReduce, and Cleo for data cleansing, preprocessing, advanced analytics, and API integrations.
Developed and implemented data models using tools like ER/Studio, Erwin, and Sybase Power Designer, including Dimensional, Relational, Star Schema, and Snowflake designs.
Skilled in Tableau and Power BI, delivering compelling data visualization and enabling data-driven decisions.
Implemented Apache Kafka pipelines for real-time data streaming, optimizing processing efficiency.
Extensive experience with cloud platforms such as Azure (Data Factory, Data Lake, Synapse Analytics, Databricks), AWS (EMR, Redshift, S3, Lambda), and GCP (BigQuery, Cloud Dataflow), enabling scalable cloud-based solutions.
Skilled in implementing CI/CD pipelines using Azure DevOps and automating workflows with tools like Apache Airflow for robust scheduling and orchestration.
Experienced in leveraging big data technologies, including Apache Spark, Hadoop, and Databricks, to build data lakes, warehouses, and advanced analytics pipelines.
Strong background in version control and build automation using Git, GitHub, Jenkins, and Artifactory, ensuring streamlined collaboration and deployment.
Proficient in managing NoSQL (MongoDB, DynamoDB, Cassandra, HBase) and SQL databases with advanced query optimization skills (T-SQL, PL/SQL).
Developed Airflow workflows to automate and optimize ETL processes, improving data pipeline efficiency.
Proficient in data security and compliance (GDPR, HIPAA) and skilled in Agile development practices.
Technical Skills:
Databases: Snowflake, AWS RDS, Teradata, Oracle, MySQL, Microsoft SQL Server, PostgreSQL, OLAP
NoSQL Databases: MongoDB, HBase, Hive, Cassandra, DynamoDB
Programming Languages: Python, R, SQL, Scala, Java, MATLAB
Cloud Technologies: Azure, AWS, GCP
Data Formats: CSV, JSON, Parquet, Excel, XML
Querying Languages: SQL, NoSQL, PostgreSQL, MySQL, Microsoft SQL Server
CI/CD Integration Tools: Jenkins, Git, GitLab, Azure DevOps, AWS CodePipeline
Scalable Data Tools: Databricks, Hadoop, Hive, Apache Spark, Flume, Medallion architecture, Pig, MapReduce, Sqoop
Operating Systems: Red Hat Linux, Unix, Windows, macOS
Reporting & Visualization: Tableau, Power BI, Looker, Matplotlib
Professional Experience
Senior Data Engineer – Johnson & Johnson, Santa Clara, CA June 2022 – Present
Designed and optimized ETL pipelines using PySpark, processing large-scale healthcare datasets from Medicare, Medicaid, and commercial lines to support advanced analytics.
Developed Scala workflows for extracting and transforming data from cloud systems using Apache Spark.
Built a near real-time data ingestion pipeline with Spark Streaming, integrated with Flink, Kafka, AWS Kinesis, and APIs, enabling low-latency data processing and real-time analytics.
Deployed and orchestrated Spark applications using Docker and Kubernetes, enabling scalable deployments.
Utilized Terraform to provision cloud infrastructure and automate deployment pipelines, streamlining workflows.
Architected a PySpark Data Ingestion framework to cleanse, aggregate, and deduplicate source data, loading processed data into Hive tables (see the sketch after the Environment line below).
Designed and implemented scalable batch processing workflows using the Hadoop ecosystem, including HDFS and MapReduce, to support large-scale data transformations.
Migrated on-premises data from SQL Server, Oracle, and MongoDB to Azure Data Lake using Azure Data Factory, SQL Azure, SSIS, and PowerShell, ensuring seamless transitions and data integrity.
Developed and deployed Spark applications on AWS EMR, leveraging EC2 instances for cluster management, S3 for data storage, and Redshift for analytics, enabling efficient large-scale data processing.
Optimized data pipelines by integrating AWS Glue for ETL workflows, Snowpipe for seamless data ingestion into Snowflake, Lambda for serverless processing, and Athena for interactive queries over S3 stored datasets.
Integrated Data Build Tool (DBT) into data transformation workflows to standardize SQL-based transformations, ensuring scalable and reliable analytics pipelines with effective metadata management.
Automated workflow orchestration using AWS Step Functions, simplifying data pipeline management and enhancing operational efficiency.
Designed and implemented scalable Azure-based solutions, leveraging Azure Data Factory, Azure Data Lake, and Azure HDInsight while optimizing workflows with tools like Informatica and U-SQL.
Designed and implemented Star and Snowflake schemas, creating facts, dimensions, and aggregate tables to enhance data warehouse performance and support analytics workflows.
Configured and managed Kafka brokers, implementing pipelines to aggregate web log data across multiple servers into Spark Streaming, supporting downstream systems and engineering operations.
Automated data workflows with Apache Airflow DAGs, optimizing scheduling, execution, and monitoring of pipelines.
Authored SQL scripts for transformation tasks, enabling seamless integration into analytics pipelines.
Collaborated with business analysts and data scientists to design models supporting complex machine learning algorithms, predictive analytics, and business intelligence.
Supervised a team of 3+ engineers, managing task allocation and code reviews and ensuring adherence to project timelines and quality standards.
Created and maintained Tableau dashboards and Power BI reports featuring advanced data visualizations, enabling actionable insights and business decision-making.
Used GitHub for version control, ensuring effective collaboration and deployment of code repositories.
Strengthened data security and backup strategies, ensuring compliance with GDPR and HIPAA standards.
Utilized Jira for bug tracking and agile task management, contributing to efficient project delivery.
Followed Agile practices by participating in SCRUM meetings and sprint planning to align with project goals while providing updates to stakeholders to support decision-making.
Environment: Apache Spark, Scala, PySpark, Python, Hadoop (HDFS, MapReduce, Hive), AWS (EMR, EC2, S3, Redshift, ElastiCache, Glue, Lambda, Kinesis, Athena, Step Functions), Azure (Data Factory, Data Lake, HDInsight), Docker, Kubernetes, Snowpipe, Snowflake, Kafka, Terraform, DBT, Tableau, Power BI, SQL, PL/SQL, Apache Airflow, GitHub, Jira, Agile.
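The PySpark ingestion framework above can be illustrated with a minimal sketch: cleanse, deduplicate on a business key, aggregate, and load into Hive. The table and column names (staging.claims_raw, claim_id, amount, service_date, payer, curated.claims_daily) are hypothetical placeholders, not the project's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal cleanse/aggregate/deduplicate flow; all table and column names are illustrative.
spark = (SparkSession.builder
         .appName("claims-ingestion-sketch")
         .enableHiveSupport()
         .getOrCreate())

raw = spark.table("staging.claims_raw")

cleaned = (raw
           .dropna(subset=["claim_id"])                       # drop records missing the business key
           .withColumn("amount", F.col("amount").cast("double"))
           .dropDuplicates(["claim_id"]))                     # deduplicate on the business key

daily_totals = (cleaned
                .groupBy("service_date", "payer")
                .agg(F.sum("amount").alias("total_amount")))

# Load the processed output into a Hive table for downstream analytics.
daily_totals.write.mode("overwrite").saveAsTable("curated.claims_daily")
```

Deduplicating on the key before aggregating keeps the Hive target idempotent when the job is rerun.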
Data Engineer – PNC Bank, Dallas, TX Jan 2020 – May 2022
Developed high-performance Spark code using PySpark, Scala, and Spark-SQL/Streaming, optimizing large-scale data transformations and workflows within the Hadoop ecosystem.
Structured data into Medallion architecture layers (Bronze, Silver, Gold) using Azure Data Lake, ensuring high data quality and scalability for advanced analytics.
Engineered scalable ETL pipelines to process structured and unstructured data formats using PySpark, Talend, Matillion, AWS Glue, and GCP Dataflow, enabling seamless multi-cloud data integration.
Containerized ETL pipelines with Docker and orchestrated them using Kubernetes for scalability and fault tolerance. Automated infrastructure provisioning with Terraform, streamlining multi-cloud resource management.
Automated data workflows using Bash scripting in Linux/Unix environments, supporting ETL pipelines and ensuring efficient multi-cloud integration.
Migrated data from Oracle, SQL Server, and MongoDB to Azure Data Lake and GCP BigQuery using Azure Data Factory (ADF V1/V2) and Google Cloud Storage, ensuring seamless workflows and data availability.
Integrated Sqoop and Flume for seamless data ingestion from relational databases and streaming sources into HDFS, supporting batch and real-time processing workflows.
Configured and managed Kafka clusters for real-time data ingestion and streaming, integrating with Spark Streaming and enabling downstream analytics pipelines.
Performed data transformations and analytics using PySpark, Spark Core, Spark SQL, Java, and GCP BigQuery, enabling scalable and efficient data processing across systems.
Developed custom MapReduce programs, Hive UDFs, and HBase tables to process queries and store large datasets, optimizing workflows for business analytics.
Architected and implemented large-scale BI solutions using Azure Synapse Analytics, Databricks, GCP BigQuery, and Cloud Pub/Sub, delivering scalable and reliable analytics pipelines.
Designed and implemented Star and Snowflake schemas, creating Fact and Dimension tables with tools like Erwin, supporting advanced analytics and reporting.
Built interactive dashboards and reports using Power BI and Tableau, enabling actionable insights and strategic decision-making.
Automated workflows using Apache Airflow DAGs, orchestrating end-to-end data pipelines for efficient execution and monitoring (see the DAG sketch after the Environment line below).
Designed MongoDB schemas with efficient data structures for document modeling, indexing, and tuning, improving database performance and query efficiency.
Conducted comprehensive data analysis, cleaning, and transformation using Python libraries (pandas, numpy) to optimize machine learning models and predictive analytics workflows.
Leveraged AWS Glue and S3 for ETL workflows and scalable data storage, complementing multi-cloud data processing strategies.
Utilized Git for version control and implemented CI/CD pipelines using Jenkins and GitLab, ensuring streamlined deployments and efficient collaboration across teams.
Used Jira for issue tracking and task management to align with project goals.
Environment: Apache Spark, Scala, PySpark, Python (pandas, numpy), Hadoop (HDFS, MapReduce, Hive, HBase), Azure (Data Factory, Data Lake, Synapse Analytics, Databricks), GCP (BigQuery, Dataflow, Pub/Sub, Cloud Storage), AWS (Glue, S3, EMR, EC2, Redshift, Lambda), Docker, Kubernetes, Terraform, Jenkins, GitLab, Sqoop, Flume, Linux, Bash scripting, Tableau, Power BI, Kafka, Airflow, MongoDB, SQL, PL/SQL, Git, Jira, Agile.
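The Airflow orchestration pattern above can be sketched as a small DAG, assuming Airflow 2.x; the DAG id, schedule, and spark-submit commands (ingest_bronze.py, refine_silver.py, publish_gold.py) are hypothetical stand-ins for the Medallion-layer jobs.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative DAG skeleton; task commands and schedule are placeholders.
default_args = {"owner": "data-eng", "retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_medallion_refresh",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",   # nightly at 02:00
    default_args=default_args,
    catchup=False,
) as dag:
    ingest_bronze = BashOperator(task_id="ingest_bronze",
                                 bash_command="spark-submit ingest_bronze.py")
    refine_silver = BashOperator(task_id="refine_silver",
                                 bash_command="spark-submit refine_silver.py")
    publish_gold = BashOperator(task_id="publish_gold",
                                bash_command="spark-submit publish_gold.py")

    # Bronze -> Silver -> Gold: a failed upstream layer halts downstream publishing.
    ingest_bronze >> refine_silver >> publish_gold
```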
Data Engineer – Western Union, Starkville, MS Aug 2018 – Dec 2019
Designed and implemented ETL workflows using AWS Glue and GCP Dataflow, processing structured and semi-structured data formats (JSON, CSV, Parquet) for multi-cloud integration.
Built and managed data pipelines to transform, aggregate, and load datasets into AWS S3, Snowflake, and RDS using Matillion for batch and streaming workflows.
Automated data ingestion into Snowflake using Snowpipe, optimizing integration pipelines sourced from AWS S3 for real-time reporting and analytics.
Developed Spark applications using Scala, PySpark, and Spark SQL, optimizing batch processing efficiency, troubleshooting issues, and supporting workflows within the Hadoop ecosystem.
Configured and managed Kafka clusters for real-time data ingestion, integrating with Spark Streaming to process application logs and support downstream systems.
Migrated datasets from on-premises sources (SQL Server, Oracle) to AWS S3 and GCP BigQuery, leveraging Google Cloud Storage for scalable and reliable data integration.
Utilized AWS services such as EMR, EC2, RDS, Redshift, and Lambda to architect scalable big data solutions and enable advanced analytics.
Authored Hive queries and created HDFS tables, enabling efficient querying and transformation compatible with MapReduce-based processes.
Automated repetitive workflows using Bash scripting in Linux/Unix environments, streamlining ETL tasks and enhancing operational efficiency.
Performed data validation, cleaning, and transformation using Python (pandas, numpy) and R, ensuring high-quality datasets for analytics and machine learning workflows (see the pandas sketch after the Environment line below).
Authored complex SQL queries using PL/SQL and T-SQL, enhancing data quality and supporting profiling efforts in relational and dimensional data warehouses.
Designed advanced visualizations in Tableau, including Heat Maps, Geographic Maps, and Cross Tabs, facilitating actionable business insights and decision-making.
Applied Azure Data Lake and Data Factory for targeted ingestion tasks, integrating hybrid-cloud pipelines and supporting complex transformation workflows.
Defined IAM roles and policies to secure AWS resource access, ensuring compliance and operational safety.
Collaborated with cross-functional teams of engineers and analysts to design scalable pipelines, ensuring alignment with business goals and operational requirements.
Contributed to Agile practices, actively participating in SCRUM meetings, sprint planning, and iterative reviews to align deliverables with project goals and broader business strategy.
Employed GitHub for version control, streamlining collaboration and deployment processes.
Environment: Apache Spark, Scala, PySpark, Python (pandas, numpy), R, Hadoop (HDFS, MapReduce, Hive), AWS (S3, Glue, EMR, RDS, Redshift, Lambda, Snowflake, Snowpipe, IAM), GCP (BigQuery, Dataflow, Cloud Storage), Azure (Data Lake, Data Factory), Kafka, Tableau, Bash scripting, Linux/UNIX, SQL, GitHub, Agile.
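The pandas/numpy validation-and-cleaning step above can be sketched as follows; the file and column names (transactions.csv, txn_id, txn_date, amount) are illustrative assumptions, not the actual dataset.

```python
import numpy as np
import pandas as pd

# Illustrative validation/cleaning pass; file and column names are placeholders.
df = pd.read_csv("transactions.csv", parse_dates=["txn_date"])

# Basic quality checks before transformation.
assert df["txn_id"].is_unique, "duplicate transaction IDs found"
missing_rate = df.isna().mean()
print(missing_rate[missing_rate > 0])          # report columns with missing values

# Cleaning: normalize amounts, drop unusable rows, flag outliers.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.dropna(subset=["txn_id", "amount"])
df["is_outlier"] = np.abs(df["amount"] - df["amount"].mean()) > 3 * df["amount"].std()

# Persist the cleaned dataset for downstream analytics and ML workflows.
df.to_parquet("transactions_clean.parquet", index=False)
```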
Data Engineer – Heifer International, Kathmandu, Nepal Feb 2016 – Jan 2018
Contributed to the analysis, design, development, and maintenance phases of data engineering projects, focusing on optimizing workflows and core modules.
Developed Spark Streaming jobs in Scala to consume data from Kafka topics, perform data transformations, and insert processed data into HBase storage (see the streaming sketch after the Environment line below).
Engineered batch processing workflows using Apache Spark and MapReduce, ensuring efficient handling of large-scale datasets within the Hadoop ecosystem.
Configured and deployed a Hadoop Cluster on AWS, setting up core components for distributed processing and scalable storage solutions.
Designed and implemented Spark workflows to support batch processing, streaming, and interactive queries, laying the foundations for the data science team to implement machine learning applications.
Developed Hive queries and optimized MapReduce jobs using compression techniques to enhance query performance and storage efficiency.
Authored Pig Latin scripts to extract, transform, and load data from log files into HDFS, enabling downstream analytics and reporting.
Worked with NoSQL databases, including HBase, MongoDB, and Cassandra, to efficiently store and query large, distributed datasets.
Performed advanced data analysis using Python (pandas, numpy), R, and SQL, ensuring clean, structured datasets for analytics workflows.
Automated ETL pipelines using Apache Spark, reducing processing latency and ensuring data reliability.
Collaborated with cross-functional teams in SCRUM meetings to ensure timely delivery of high-quality solutions.
Environment: Apache Spark, Scala, Python (pandas, numpy), PySpark, Hadoop (HDFS, MapReduce, Hive, Pig), AWS (EC2, S3), Kafka, NoSQL (HBase, MongoDB, Cassandra), SQL, JSON, XML, CSV, Agile.
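The streaming jobs above were written in Scala and landed data in HBase; as an approximation only, the Python sketch below shows an equivalent Kafka-to-sink flow with Spark Structured Streaming, using a Parquet sink and placeholder broker, topic, and paths.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Sketch of a Kafka-to-sink streaming job; broker, topic, and paths are placeholders.
spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Kafka delivers key/value as binary; decode the value and apply a simple transformation.
parsed = (events
          .selectExpr("CAST(value AS STRING) AS value", "timestamp")
          .withColumn("event_length", F.length("value")))

query = (parsed.writeStream
         .format("parquet")                     # stand-in sink; the original jobs wrote to HBase
         .option("path", "/data/events_out")
         .option("checkpointLocation", "/data/checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```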
Education: PhD in Bioengineering, University of Hawaii at Manoa, Honolulu, HI, USA