Data Engineer Business Intelligence

Location:
Jersey City, NJ
Posted:
September 10, 2025

Resume:

Intioz Shaik

+1-732-***-**** | ******.**@*****.*** | IntiozShaik-Linkedin | www.intiozshaik.com

PROFESSIONAL SUMMARY

Experienced Data Engineer specializing in scalable ETL pipelines and real-time data processing systems built with Apache Spark, Python, and AWS services. Proven track record of integrating diverse data sources into centralized data lakes to strengthen business intelligence and decision-making. Implemented cloud-native data flows and automation that improved data reliability and team productivity. Eager to apply this data engineering expertise to deliver innovative solutions and insights in the target role.

TECHNICAL SKILLS

• Programming Languages: Python, Scala, SQL, Shell Scripting

• Big Data Frameworks: Apache Spark (PySpark/Scala), Hadoop, Hive, Kafka, HBase

• Cloud Platforms: AWS (S3, Glue, Redshift, Lambda, Athena, EMR), Azure (ADF, Blob Storage), GCP (BigQuery, Cloud Storage)

• ETL & Orchestration: Apache Airflow, AWS Glue, SSIS, dbt, NiFi, Talend

• Data Warehousing: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse Analytics

• Databases: MySQL, PostgreSQL, SQL Server, Teradata, Oracle

• Data Visualization: Power BI, Tableau, Looker, Grafana, Excel (Pivot Tables, Power Query, VBA)

• Machine Learning: scikit-learn, MLlib, XGBoost, Prophet, Pandas, NumPy, ARIMA, K-means, Regression, Classification, Clustering

• Data Processing Formats: Parquet, ORC, JSON, Avro, CSV, Delta Lake, Apache Hudi

• Data Governance & Security: Apache Atlas, AWS Glue Catalog, IAM, KMS, Data Masking, Encryption, HIPAA, GDPR Compliance

• DevOps & CI/CD: Git, Bitbucket, Jenkins, Docker, Terraform (basic), Kubernetes

• Documentation & Tools: Confluence, JIRA, MS Office, Lucidchart, ERwin, dbt Docs

• Soft Skills: Data Storytelling, Business Acumen, Critical Thinking, Cross-functional Collaboration, Agile/Scrum, Mentoring, Problem Solving

PROFESSIONAL EXPERIENCE

Capital One Jul 2024 - Present

Data Engineer – ETL & Streaming USA

• Designed and implemented scalable ETL pipelines using Apache Spark and Airflow to process terabytes of customer transaction data, enabling real-time fraud detection and credit risk modeling with high accuracy.

• Developed Delta Lake tables on Databricks, ensuring ACID compliance, enabling schema evolution, and supporting both batch and streaming workloads for real-time analytics and time-travel queries.

• Utilized AWS Glue, S3, and Athena to transform and query financial data, creating Parquet datasets to optimize performance for downstream BI tools and reduce storage overhead.

• Captured and streamed real-time credit card activity using Kafka, ingesting event metadata into the centralized data lake to support immediate alerting, investigation, and risk scoring.

• Built serverless workflows using AWS Lambda and Step Functions, automating credit analysis pipelines with orchestration logic that significantly reduced processing time and manual interventions.

• Created Power BI dashboards by directly connecting to Redshift and Snowflake, enabling visualization of loan performance, application status, and delinquency trends for real-time executive monitoring.

• Automated ingestion of third-party credit data and economic indicators using Python and API calls, transforming external datasets into a standardized format for credit risk decision modeling.

• Wrote complex SQL queries with CTEs, window functions, and multi-table joins to facilitate enrichment, segmentation, and performance analytics across billions of financial records.

• Built a data lakehouse architecture with Databricks, Unity Catalog, and Hive Metastore, offering centralized metadata management and secure multi-team access with role-based controls.

• Developed and deployed ML model pipelines on Databricks, integrating them with APIs to support real-time scoring for customer segmentation, fraud detection, and creditworthiness evaluation.

• Performed cost tuning on Amazon Redshift Spectrum, optimizing query execution with partitioning, pruning, and result caching, reducing daily query costs by over 30%.

• Used Amazon EMR to run large-scale Spark jobs for batch workloads, employing autoscaling and spot instances to manage compute resources effectively and minimize operational costs.

• Integrated CI/CD pipelines using AWS CodePipeline, Jenkins, and CloudFormation, automating deployment of Spark applications and infrastructure provisioning across dev, staging, and production environments.

• Scheduled and monitored workflows using Apache Airflow DAGs, incorporating error handling, retry mechanisms, and notification systems to meet strict SLA requirements and reduce failure rates (a brief illustrative sketch appears at the end of this role).

• Documented data workflows and logic using Confluence, dbt docs, and markdown templates, creating accessible references for engineering, analytics, and compliance teams.

• Created data validation frameworks in PySpark to detect schema mismatches, null propagation, and drift across batches and streams, boosting trust in pipeline outputs.

• Implemented AWS Glue Data Catalog to provide unified metadata access for Athena, Redshift, and EMR, supporting schema discovery, version control, and governance.

• Designed cross-account data sharing policies using AWS Lake Formation, enforcing row-level security and enabling governed collaboration across departments.

• Monitored and audited pipeline activity using AWS CloudWatch and CloudTrail, configuring custom alarms, dashboards, and logs for visibility and proactive resolution.

• Built Looker dashboards displaying customer behavior, onboarding drop-offs, approval funnel conversions, and underwriting cycle times to drive improvements across operations and strategy.

• Participated in Agile sprint ceremonies, collaborated with stakeholders to define user stories, delivered demos, and aligned technical solutions with business KPIs and regulatory mandates.
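
The following is a brief, illustrative sketch of the Airflow retry/alerting pattern referenced in the bullets above; the DAG id, task names, schedule, and notify_on_failure callback are hypothetical placeholders rather than actual Capital One assets.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # Hypothetical alert hook: in practice this would post to Slack, PagerDuty, or email.
    print(f"Task {context['task_instance'].task_id} failed; alerting on-call.")


def run_transaction_etl(**kwargs):
    # Hypothetical placeholder for submitting the Spark/Glue transformation job.
    print("Submitting transaction ETL job...")


default_args = {
    "owner": "data-engineering",
    "retries": 3,                              # retry transient failures before alerting
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_on_failure,  # fires once retries are exhausted
    "sla": timedelta(hours=2),                 # surfaces SLA misses in Airflow's SLA report
}

with DAG(
    dag_id="transaction_etl_daily",            # hypothetical DAG id
    start_date=datetime(2024, 7, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract_and_load = PythonOperator(
        task_id="extract_and_load",
        python_callable=run_transaction_etl,
    )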

UnitedHealth Group Jan 2021 - Jan 2023

Cloud Data Engineer – ETL Integration

• Developed robust, scalable data ingestion pipelines using Apache Spark and Kafka to process patient records, claims, provider information, and insurance data in compliance with HIPAA regulations.

• Leveraged AWS Glue for building serverless ETL jobs to clean, transform, and standardize structured and unstructured healthcare data coming from multiple third-party vendors and internal systems.

• Utilized Amazon S3 and Parquet file format to store transformed datasets efficiently, enabling downstream analytics teams to perform faster queries using Athena and Redshift Spectrum.

• Implemented Apache Airflow DAGs for scheduling and managing batch jobs that load millions of medical claims records daily into secure cloud data warehouses like Amazon Redshift and Azure Synapse.

• Designed and maintained Delta Lake tables on Databricks to manage slowly changing dimensions (SCD) and incremental loads for critical patient and provider dimension tables.

• Created streaming data pipelines with Spark Structured Streaming and Kafka Connect to ingest real-time data feeds from EHR systems, enabling up-to-date analytics for care coordination teams.

• Built interactive dashboards using Power BI and Tableau, powered by optimized datasets in Azure SQL Database, providing real-time visibility into provider performance, claim denials, and treatment outcomes.

• Developed robust data validation scripts in PySpark and SQL to apply quality checks such as null detection, data profiling, duplicate detection, and referential integrity before pushing data to production layers (a brief illustrative sketch appears at the end of this role).

• Migrated legacy ETL pipelines from Informatica and on-premise SQL Server to AWS Glue and ADF, improving data refresh times, lowering maintenance costs, and increasing scalability.

• Used AWS Lake Formation and IAM policies to enforce row-level security and fine-grained access control, ensuring only authorized users can view PHI/PII data in accordance with HIPAA.

• Participated in the development of a healthcare data lake on Azure Data Lake Gen2, organizing raw, curated, and trusted layers to support enterprise-wide analytics and predictive modeling.

• Worked closely with data scientists to provide clean, structured datasets in AWS Redshift and Azure SQL, accelerating machine learning model development for early disease prediction and fraud detection.

• Performed data lineage tracking using Apache Atlas and AWS Glue Data Catalog, ensuring transparency of data flow from ingestion to reporting, meeting data governance requirements.

• Enabled schema evolution and versioning by integrating Apache Hudi with Spark, supporting incremental upserts of member data and reducing full table refresh costs and complexity.

• Deployed CI/CD pipelines using AWS CodePipeline and GitHub Actions for automated deployment of PySpark jobs, schema updates, and configuration changes to ETL processes.

• Orchestrated workflows using AWS Step Functions and integrated with Lambda functions to handle dynamic transformations, external API calls, and conditional routing of medical data flows.

• Automated metadata capture and reporting using Azure Purview, ensuring data assets were cataloged and tagged with data owner, sensitivity, and classification information.

• Configured alerting and monitoring with CloudWatch, Datadog, and Azure Monitor to detect anomalies in data volume, job duration, and schema mismatches in near real-time.

• Collaborated with business analysts and compliance officers to identify key KPIs, define SLAs for data delivery, and ensure traceability of source data to downstream reports.

• Led migration of large datasets from on-prem Oracle and Teradata to Amazon Redshift using AWS DMS, achieving faster performance and reducing dependency on expensive legacy systems.

• Worked with Azure Key Vault and AWS Secrets Manager to secure sensitive credentials used in ETL pipelines, ensuring compliance with data security best practices in healthcare.

• Conducted peer code reviews, documentation walkthroughs, and contributed to reusable modules and data pipeline frameworks, enhancing team productivity and standardization across the enterprise.
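
The following is a minimal PySpark sketch of the pre-publication quality checks described in the bullets above (null, duplicate, and referential-integrity checks); the S3 paths, table names, and columns are hypothetical examples, not actual UnitedHealth Group assets.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims_quality_checks").getOrCreate()

# Hypothetical curated-layer inputs.
claims = spark.read.parquet("s3://example-bucket/curated/claims/")
providers = spark.read.parquet("s3://example-bucket/curated/providers/")

# Null checks on required columns.
required_cols = ["claim_id", "member_id", "provider_id", "claim_amount"]
null_counts = claims.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in required_cols]
).first().asDict()

# Duplicate detection on the business key.
duplicate_count = claims.groupBy("claim_id").count().filter(F.col("count") > 1).count()

# Referential integrity: every claim must reference a known provider.
orphan_count = claims.join(providers, "provider_id", "left_anti").count()

failures = {col: n for col, n in null_counts.items() if n > 0}
if failures or duplicate_count or orphan_count:
    raise ValueError(
        f"Quality checks failed: nulls={failures}, "
        f"duplicates={duplicate_count}, orphan_claims={orphan_count}"
    )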

Wayfair Aug 2018 - Dec 2020

GCP Data Engineer – Cloud ETL & Analytics

• Designed and implemented end-to-end ETL pipelines on Google Cloud Platform (GCP) using Cloud Dataflow and Apache Beam, enabling seamless processing of structured and semi-structured data from multiple sources (a brief illustrative sketch appears at the end of this role).

• Used Google Cloud Dataproc with Apache Spark (Scala) to handle large-scale batch processing of customer clickstream and transactional data, improving data transformation speed and reducing infrastructure cost.

• Ingested data in JSON, CSV, and AVRO formats from Google Cloud Storage (GCS) and applied schema normalization and metadata enrichment for standardized input to analytical layers.

• Implemented GCP-native security and compliance tools like Data Loss Prevention (DLP) and Cloud KMS to automatically detect and encrypt sensitive PII/PCI fields across GCS and BigQuery, enhancing regulatory compliance and data governance.

• Built ingestion frameworks in Python to pull data from REST APIs, including order, product, and inventory services, and stored the results in GCS for further processing and historical tracking.

• Migrated critical retail datasets from on-premise Oracle to Google BigQuery, streamlining business reporting workflows and reducing query latency for BI teams by over 40%.

• Translated and optimized legacy Teradata SQL logic into BigQuery Standard SQL, ensuring analytical accuracy and improving query efficiency with partitioning and clustering strategies.

• Developed real-time data ingestion pipelines using Kafka to publish events into Google Pub/Sub, supporting scalable, low-latency processing for user behavior and product interactions.

• Scheduled and monitored complex ETL DAGs using Apache Airflow hosted on Cloud Composer, managing interdependent job runs, alerts, retries, and SLA compliance across all data pipelines.

• Used Hive on Dataproc to aggregate and transform large-scale inventory, product, and pricing data, integrating multiple sources into a unified warehouse for data science usage.

• Stored and accessed long-term datasets in Parquet and ORC formats within GCS, improving processing efficiency and lowering storage cost for frequently accessed and archived datasets.

• Designed and implemented data validation frameworks using Great Expectations and Python to apply schema checks, null validations, and business logic checks before data publication.

• Implemented column-level encryption and row-level data masking in BigQuery to comply with GDPR regulations and secure sensitive customer attributes from unauthorized access.

• Managed operational metadata, job logs, and pipeline exceptions using Log4j, integrated with Cloud Logging (formerly Stackdriver) to ensure real-time observability and issue triage.

• Enforced IAM-based access control policies for GCS buckets, BigQuery datasets, and Composer workflows to protect PII and manage fine-grained role-based access in compliance frameworks.

• Used Google BigQuery as the core data warehouse, enabling ad-hoc analysis for finance and marketing teams and supporting interactive dashboards via Looker for executive reporting.

• Designed and containerized Python transformation services using Docker and deployed them on Google Kubernetes Engine (GKE) to auto-scale under high-volume traffic during promotional events.

• Employed Cloud Build to automate CI/CD workflows for Airflow DAGs, Python jobs, and infrastructure-as-code deployments, improving development velocity and production reliability.

• Maintained all data pipelines, code artifacts, and workflow scripts in GitHub, ensuring collaboration, version control, and audit tracking across the entire data engineering team.

• Participated in Agile ceremonies using Jira, contributed to sprint planning, backlog grooming, and retrospectives while aligning deliverables with evolving product and analytical requirements.

• Contributed to architectural planning for a GCP-based data lakehouse, leveraging GCS, BigQuery, and Dataproc for hybrid workloads with structured and unstructured data at scale.

• Conducted performance optimization and cost analysis using BigQuery Query Plan Explanation, GCP Billing Reports, and Data Catalog lineage, enabling the team to reduce monthly costs by optimizing query patterns and storage classes.
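
The following is a minimal Apache Beam (Python SDK) sketch of the GCS-to-BigQuery ingestion pattern referenced in the Dataflow bullet above; the bucket, project, dataset, and field names are hypothetical placeholders, and on Dataflow the pipeline options would carry the runner, project, and region flags.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_order(line):
    # Parse a JSON order record and keep only the fields the warehouse table expects.
    record = json.loads(line)
    return {
        "order_id": record["order_id"],
        "customer_id": record["customer_id"],
        "order_total": float(record.get("order_total", 0.0)),
    }


def run():
    options = PipelineOptions()  # hypothetical: runner/project/region supplied at deploy time
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromGCS" >> beam.io.ReadFromText("gs://example-bucket/raw/orders/*.json")
            | "ParseJson" >> beam.Map(parse_order)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "example-project:retail.orders",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()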

EDUCATION

Saint Peter's University – Jersey City, New Jersey

Master of Science, Business Analytics

CERTIFICATIONS

• Certified Kubernetes Administrator (CKA)


