Mohammed Majeed
Email: ***************@*****.*** Phone: +1-469-***-****
SUMMARY OF EXPERIENCE:
Experienced Data Engineer and Hadoop Developer with 7+ years of experience building and optimizing large-scale data solutions across cloud (Azure, AWS) and on-premise (Cloudera Hadoop) ecosystems. Proficient in Apache Spark (Core, SQL, Streaming), Hive, Databricks, and Delta Lake, with a strong track record of developing robust ETL/ELT pipelines and implementing real-time streaming applications using Kafka and Spark Structured Streaming. Hands-on expertise in data lakehouse architectures, data governance (Unity Catalog, Apache Ranger, PII/PHI masking), and infrastructure automation using Terraform and CI/CD tools such as Azure DevOps and GitHub Actions. Collaborative team player with a passion for performance improvement, cost optimization, and high availability of data systems.
TECHNICAL SKILLS:
Big Data & Cloud Platforms:
Hadoop (HDFS, MapReduce, Hive, Sqoop), Apache Spark (Core, SQL, Streaming), Databricks, Delta Lake, Azure (Blob Storage, Synapse Analytics, ADF, SQL Pools), AWS (S3, Glue, IAM)
Data Engineering & Processing:
Databricks Auto Loader, Delta Live Tables, Spark Streaming, Structured Streaming, ETL/ELT pipeline design, Data Lakehouse architectures
Databases:
Teradata, PostgreSQL, Snowflake, MySQL, Hive, Oracle, Delta Lake, DynamoDB, MongoDB
Data Governance & Security:
Unity Catalog, RBAC, ABAC, Data Lineage, Data Masking (PII/PHI), Azure Key Vault, AWS IAM
ETL Tools:
Informatica PowerCenter, Apache Oozie, Apache Spark & PySpark, Hive, Sqoop, Azure Data Factory, AWS Glue, Apache Airflow
Orchestration & Workflow Automation:
Databricks Workflows, Azure Data Factory (ADF), Apache Airflow, AutoSys, AWS Step Functions, Oozie
Infrastructure as Code & CI/CD:
Terraform (Databricks, Azure, AWS), GitHub Actions, Azure DevOps
Optimization & Performance Tuning:
Delta Lake optimizations, Caching, Partitioning & Bucketing, Z-Ordering, Oracle SQL tuning, AQE (Adaptive Query Execution), resource-aware tuning on Spark/YARN
Programming & Scripting Languages:
Python, Scala, SQL, Shell Scripting, Bash
Version Control & DevOps:
Git, Bitbucket Pipelines, Jenkins, GitHub, Azure Repos
PROFESSIONAL EXPERIENCE:
Data Engineer / Hadoop Developer
PNC Bank, Farmers Branch, Texas. Jul 2023 - Jul 2025
• Developed and maintained scalable, high-performance data pipelines using PySpark, Hive, and Cloudera Hadoop to support critical reporting and analytics applications across multiple lines of business.
• Designed and implemented end-to-end ETL workflows using Oozie, UNIX Shell scripting, and Informatica PowerCenter to automate data ingestion, cleansing, and transformation of large datasets from Oracle and Teradata into Hadoop and Snowflake environments.
• Wrote optimized HiveQL and Impala SQL queries for distributed processing of financial and operational data, reducing query execution time by over 40% on large Hive tables through partitioning and bucketing.
• Migrated legacy data pipelines to Hadoop-based infrastructure, ensuring secure, accurate, and zero-downtime transformation of data assets while maintaining full compliance with PII/PHI masking standards.
• Integrated Azure Synapse Analytics with Databricks for seamless data transformation workflows, utilizing SQL Pools to improve query performance.
• Implemented role-based access control (RBAC) and audit logging on Hadoop data assets using tools like Apache Ranger, ensuring data governance policies aligned with PNC’s Enterprise Risk Management.
• Built data ingestion and transformation pipelines in Scala using Apache Spark, processing terabytes of structured and unstructured data.
• Conducted performance tuning of PySpark jobs on the Cloudera platform using YARN resource optimization, Spark caching, and adaptive query execution, while designing and scheduling complex batch workflows in AutoSys to manage dependencies and optimize job execution across the data pipeline.
• Developed reusable Python and shell-based utilities for data validation, error handling, and metadata-driven ingestion, improving developer productivity and consistency across multiple projects.
Data Engineer
Johnson & Johnson, New Brunswick, NJ. Jan 2023 - June 2023
• Developed and optimized ETL pipelines using Spark/Scala and Informatica best practices in AWS Databricks, processing 70–80 billion records daily and reducing job execution time by 40% through parallelization and caching.
• Optimized Spark job execution, reducing runtime from 20 hours to 3 hours through DataFrame caching (df.cache()), partitioning, and Auto Loader optimizations.
• Optimized resource allocation using the Databricks Spark UI and YARN UI, leading to significant cost savings.
• Implemented incremental data processing using Delta Lake, significantly reducing storage costs.
• Developed dashboards in Databricks to visualize daily production job status, improving monitoring and issue resolution efficiency.
• Migrated large datasets across Teradata, Hive, Databricks Delta, AWS S3, and SingleStore (MemSQL).
• Automated infrastructure deployment using Terraform, provisioning Databricks clusters, Unity Catalog, ACLs, and S3 storage, reducing manual effort by 30%.
• Designed and implemented Unity Catalog for centralized data governance, enabling fine-grained access control (RBAC/ABAC), data lineage tracking, and audit logging.
• Designed Git-based CI/CD workflows for automated and reproducible deployments and version control.
Big Data Engineer
Infosys, Hyderabad, India. Jul 2017 – Dec 2021
• Designed and deployed high-volume ETL pipelines using Python, Spark, and SQL.
• Migrated legacy MapReduce jobs to Spark, achieving a 4x performance improvement.
• Processed structured and unstructured data using Hive, Snowflake, and SQL transformations.
• Developed custom Scala UDFs for complex data transformations and built Kafka-based streaming applications for real-time event processing.
• Integrated Tableau for real-time analytics, enabling business decision-making with live dashboards.
• Designed and deployed modular, reusable data engineering frameworks in Python and Shell, improving code reuse across projects.
• Designed and implemented fault-tolerant data pipelines using Apache Spark and Kafka, ensuring high availability and zero data loss during node or cluster failures.
EDUCATION:
Master of Science in Business Analytics, Trine University, 2023
Bachelor of Technology in Computer Science, JNTUH, 2017
CERTIFICATIONS:
AWS Certified Machine Learning Engineer - Associate (In Progress)