Abdul Ayaaz Shaik
DATA ENGINEER | BIG DATA | CLOUD DATA PLATFORMS
USA | +1-470-***-**** | ****************@*****.*** | LinkedIn
SUMMARY
Data Engineer with 4+ years of experience designing and building scalable data platforms, ETL pipelines, and real-time streaming systems across cloud environments. Expertise in batch and streaming data processing using Apache Spark, PySpark, Kafka, and Airflow to process large-scale datasets for analytics and machine learning. Experienced in modern data lakehouse architectures using Delta Lake, Apache Iceberg, and Snowflake across AWS, Azure, and Databricks. Proven ability to optimize data pipelines, improve data quality, and enable reliable analytics and reporting for enterprise platforms.
TECHNICAL SKILLS
Programming: Python, SQL
Big Data Processing: Apache Spark, PySpark, Apache Flink
Streaming & Messaging: Apache Kafka, AWS Kinesis, Confluent Schema Registry
Data Engineering: ETL Pipelines, Batch Processing, Streaming Data Processing, Data Modeling
Data Lakehouse: Delta Lake, Apache Iceberg, Apache Hudi, Parquet, ORC
Orchestration: Apache Airflow, Prefect
Data Warehouses: Snowflake, Amazon Redshift, Google BigQuery
Databases: PostgreSQL, MySQL, SQL Server, MongoDB
Cloud Platforms: AWS, Azure, GCP, Databricks
Data Quality & Observability: Great Expectations, Monte Carlo, OpenLineage
DevOps & Infrastructure: Terraform, Docker, Kubernetes, CI/CD
Visualization: Tableau, Power BI
WORK EXPERIENCE
Nefroverse Technologies, USA | Data Engineer | Jan 2025 – Present
Architected and launched 25+ production-grade batch and streaming data pipelines using Airflow, PySpark, and AWS Glue to ingest data from databases, APIs, and file sources into a centralized cloud data lake, automating ingestion workflows and reducing manual processing effort by 45%.
Engineered real-time streaming pipelines using Kafka and Spark Structured Streaming to process over 600K events per day with low latency, enabling near real-time dashboards and improving data freshness for business reporting by 80%.
Established a scalable cloud lakehouse architecture on Amazon S3 integrated with Redshift, Snowflake, and BigQuery, applying partitioning, clustering, and columnar storage (Parquet/ORC) to improve query performance by 38% and reduce storage costs by 27%.
Consolidated data from heterogeneous sources including MySQL, PostgreSQL, MongoDB, Cassandra, and HBase using CDC-based incremental pipelines, schema standardization, and automated monitoring, achieving 99.9% pipeline uptime across production data workflows.
Deployed a dbt-based transformation layer with 50+ modular models, adding automated testing, documentation, and lineage tracking to enable version-controlled transformations and reduce analyst SQL complexity by 40%.
Enhanced large-scale Spark workloads through partition tuning, broadcast joins, adaptive query execution, and memory configuration, cutting end-to-end batch processing time by 25% and lowering cloud compute costs.
Flipkart, India | Data Engineer | Dec 2022 – Dec 2023
Supported the development of a cloud-native AWS data lake on Amazon S3 storing over 5 TB of structured and semi-structured data, implementing partitioning strategies and optimized storage formats to improve analytical query performance by 35%.
Constructed and maintained scalable PySpark ETL pipelines to ingest and transform more than 25 GB of data per day from multiple sources including databases, object storage, APIs, and message queues, applying standardization, validation, and deduplication logic to achieve 99.5% data accuracy across large data batches.
Established data governance and data lineage tracking across ingestion and transformation layers to ensure transparency, regulatory compliance, and improved data trust for enterprise analytics.
Built and orchestrated 20+ production-grade data workflows using Apache Airflow with dependency management, SLA monitoring, and alerting, automating complex reporting pipelines and reducing manual data preparation effort by 45%.
Enabled event-driven and near real-time ingestion pipelines using serverless compute and message-based triggers, making data available within 5 minutes of arrival and reducing operational intervention in pipeline execution by 40%.
Introduced data quality validation and monitoring across bronze, silver, and gold layers to maintain consistent and reliable datasets for analytics and reporting.
Automated CI/CD pipelines for data engineering workloads using Git-based version control and automated testing, enabling controlled deployments and rollbacks, cutting deployment time by 40% and improving production pipeline reliability.
Delivered scalable batch and streaming data pipelines to unify multi-source operational data into a centralized analytics layer, enabling high-volume reporting workloads without impacting transactional systems.
Miraicoders Technology, India | Jr. Data Engineer | Jul 2021 – Dec 2022
Assisted in the development of an enterprise data platform by building ETL pipelines, implementing transformation logic, and helping define source-to-target mappings for analytics workflows.
Prepared ETL workflows and curated datasets supporting BI dashboards and operational reporting used by leadership and analytics teams.
Produced Python-based feature engineering pipelines using Pandas, NumPy, and Scikit-learn to process 10M+ records per run, enabling the Data Science team to reduce model training prep time by 60% and deliver production-ready models faster.
Facilitated the migration of legacy SSIS-based ETL workflows to a scalable Python and Spark-based processing framework with improved monitoring and error handling.
Refined Spark data processing workloads and pipeline configurations under senior guidance to improve performance and reduce cloud resource usage.
Contributed to cloud cost optimization efforts by improving Spark workload configurations, enabling auto-scaling, and optimizing storage usage across data pipelines.
Developed a high-performance MySQL data access layer to support both transactional and analytical workloads using optimized indexing, stored procedures, and materialized views, reducing average query latency by 22% and improving system concurrency under peak user traffic.
EDUCATION
Master of Science in Computer Science
Avila University, Kansas City, MO, USA
Bachelor of Science in Electronics & Communication Engineering
VIT AP University, Amaravati, India
CERTIFICATIONS
AWS Data Engineer Associate
Databricks Spark
SnowPro Core