
Data Engineer Cloud

Location:
Buford, GA
Posted:
March 10, 2025

Resume:

Professional Summary:

Results-driven Senior Snowflake Data Engineer with 7+ years of expertise in cloud data warehousing, ETL/ELT processes, and big data solutions. Strong proficiency in Snowflake architecture, query optimization, performance tuning, Snowpipe, Streams & Tasks, and cost-effective data modeling.

Expert in SQL, Python, dbt, and Apache Airflow, building scalable, automated data pipelines across AWS (S3, Glue, Redshift), Azure (ADF, Synapse), and GCP (BigQuery, Dataflow) for end-to-end Snowflake implementations. Extensive experience in data migration (Teradata, Oracle, Redshift to Snowflake), ELT strategies, and cloud cost optimization.

Deep knowledge of data security, governance (RBAC, data masking, encryption), and compliance (GDPR, HIPAA). Passionate about designing high-performance, secure, and scalable data solutions to drive business insights and innovation.

Key Expertise:

Snowflake & Cloud Data Warehousing

Expertise in Snowflake architecture, including Time Travel, Zero-Copy Cloning, Streams & Tasks, Query Acceleration

Hands-on experience in performance tuning, clustering, materialized views, and warehouse scaling

Data migration from Teradata, Oracle, Redshift to Snowflake

Cost optimization strategies for Snowflake query performance

Configured RBAC, data masking, and encryption for data security
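
A minimal sketch of this kind of RBAC and dynamic-masking setup, using the snowflake-connector-python client; the account, role, database, and column names are hypothetical placeholders rather than values from a specific engagement.

```python
# Sketch: create a read-only role, grant it access, and attach a dynamic
# masking policy to a PII column. Names (ANALYST_RO, SALES_DB, CUSTOMERS.EMAIL)
# are hypothetical placeholders.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",                    # hypothetical account identifier
    user="etl_admin",
    password=os.environ["SNOWFLAKE_PWD"],
    role="SECURITYADMIN",
    warehouse="ADMIN_WH",
)

statements = [
    "CREATE ROLE IF NOT EXISTS ANALYST_RO",
    "GRANT USAGE ON DATABASE SALES_DB TO ROLE ANALYST_RO",
    "GRANT USAGE ON SCHEMA SALES_DB.PUBLIC TO ROLE ANALYST_RO",
    "GRANT SELECT ON ALL TABLES IN SCHEMA SALES_DB.PUBLIC TO ROLE ANALYST_RO",
    # Masking policy: privileged roles see the raw value, everyone else sees a redaction.
    """
    CREATE MASKING POLICY IF NOT EXISTS SALES_DB.PUBLIC.EMAIL_MASK AS (val STRING)
    RETURNS STRING ->
      CASE WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val ELSE '***MASKED***' END
    """,
    "ALTER TABLE SALES_DB.PUBLIC.CUSTOMERS MODIFY COLUMN EMAIL "
    "SET MASKING POLICY SALES_DB.PUBLIC.EMAIL_MASK",
]

cur = conn.cursor()
try:
    for sql in statements:
        cur.execute(sql)
finally:
    cur.close()
    conn.close()
```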

ETL/ELT Development & Data Pipelines

Developed automated ETL/ELT pipelines using dbt, Apache Airflow, Python, and SnowSQL

Implemented Snowpipe for real-time and batch ingestion from AWS S3 and Azure Blob Storage

Built Delta Live Tables (DLT) on Databricks for real-time data analytics.

Built DBT Core models (staging, intermediate, fact, and dimension tables) to transform raw event data into structured reporting tables.

Used incremental DBT models to process high-velocity streaming data from Kafka and Snowflake Streams & Tasks.

Implemented DBT tests (unique, not null, referential integrity) to ensure data accuracy across transformations.
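
A minimal sketch of the dbt-plus-Airflow orchestration described above: a nightly DAG that builds the dbt models and then runs the dbt tests. The project path, target name, and schedule are hypothetical.

```python
# Sketch: nightly Airflow DAG that runs dbt models and then dbt tests.
# The project directory, target name, and schedule are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/dbt/snowflake_project"   # hypothetical dbt project path

default_args = {"retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="dbt_snowflake_nightly",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",   # run at 02:00 daily
    catchup=False,
    default_args=default_args,
) as dag:

    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"cd {DBT_DIR} && dbt run --target prod",
    )

    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"cd {DBT_DIR} && dbt test --target prod",
    )

    dbt_run >> dbt_test   # tests only run after models build successfully
```

Keeping the test task downstream of the build task means a failed schema or uniqueness test blocks downstream consumers instead of silently publishing bad data.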

AtScale for BI Analytics

Designed and deployed AtScale semantic models on Snowflake to support self-service analytics for business users.

Standardized KPIs and business metrics across Power BI, Tableau, and Looker, ensuring consistency.

Enabled role-based access control (RBAC) in AtScale to manage data governance and security compliance.

Configured AtScale’s aggregate awareness to accelerate complex queries by 80%, reducing load on Snowflake.

Created pre-aggregated summary tables in Snowflake to improve performance for daily, monthly, and yearly reports.

Improved BI dashboard response times by dynamically selecting the most granular data level available.
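
A minimal sketch of the pre-aggregated roll-up tables mentioned above, built through the snowflake-connector-python client; the source fact table, columns, and warehouse names are hypothetical.

```python
# Sketch: build daily and monthly roll-up tables from a raw fact table so BI
# dashboards hit small aggregates instead of the full fact table.
# Table and column names (FACT_SALES, ORDER_TS, AMOUNT, REGION) are hypothetical.
import os
import snowflake.connector

ROLLUPS = {
    "AGG_SALES_DAILY": "DAY",
    "AGG_SALES_MONTHLY": "MONTH",
}

conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password=os.environ["SNOWFLAKE_PWD"],
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
    schema="REPORTING",
)

with conn.cursor() as cur:
    for table_name, grain in ROLLUPS.items():
        cur.execute(f"""
            CREATE OR REPLACE TABLE {table_name} AS
            SELECT
                DATE_TRUNC('{grain}', ORDER_TS) AS PERIOD_START,
                REGION,
                SUM(AMOUNT) AS TOTAL_SALES,
                COUNT(*)    AS ORDER_COUNT
            FROM RAW.SALES.FACT_SALES
            GROUP BY 1, 2
        """)
conn.close()
```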

Cloud Platforms & Big Data Technologies

AWS (S3, Glue, Redshift, EMR, Lambda, RDS, EC2)

Azure (ADF, Synapse, Databricks, Data Lake, Storage, Azure DW)

GCP (BigQuery, Dataflow, Pub/Sub, Cloud Storage, Cloud SQL)

Hadoop Ecosystem (HDFS, MapReduce, Hive, Pig, Oozie, Flume, Cassandra)

Apache Spark (Scala, PySpark, Spark SQL, Spark Streaming, MLlib, GraphX)

Data Modeling & Database Management

Designed Star Schema & Snowflake Schema models for efficient analytics

Worked with SQL, PL/SQL, NoSQL databases (DynamoDB, MongoDB, Cassandra, CosmosDB)

Experience optimizing SQL queries, indexing strategies, and materialized views.

Data Streaming & Real-Time Processing

Built real-time data processing feeds using Kafka, Spark Structured Streaming, Azure Stream Analytics

Hands-on experience in streaming analytics for real-time decision-making

Used NiFi and Kafka for processing system logs and real-time analytics
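
A minimal sketch of a Kafka-to-Delta Structured Streaming job of the kind described above; broker addresses, topic, event schema, and storage paths are hypothetical.

```python
# Sketch: PySpark Structured Streaming job that reads JSON events from Kafka,
# parses them, and appends to a Delta table with checkpointing.
# Broker addresses, topic, schema, and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("clickstream_ingest").getOrCreate()

event_schema = (
    StructType()
    .add("user_id", StringType())
    .add("event_type", StringType())
    .add("amount", DoubleType())
    .add("event_ts", TimestampType())
)

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "clickstream")
    .option("startingOffsets", "latest")
    .load()
)

events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
       .select("e.*")
)

query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/clickstream")  # enables restart/recovery
    .outputMode("append")
    .start("/mnt/delta/bronze/clickstream")
)
query.awaitTermination()
```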

Infrastructure as Code & DevOps

Terraform for Snowflake and cloud infrastructure automation

Experience with CI/CD tools (Jenkins, Docker, Concourse, Bitbucket, Azure DevOps)

Implemented version control for data models using Git

Data Governance, Security & Compliance

Strong understanding of RBAC, data masking, encryption for securing sensitive data

Compliance expertise in GDPR, HIPAA, CCPA

Implemented metadata management, access controls, and data lifecycle management

Data Visualization & Reporting

Built Tableau and Power BI dashboards for business insights

Integrated Looker, SSRS, and ad-hoc SQL reports for data visualization

Machine Learning & AI Integration

Experience in integrating ML models into data pipelines using TensorFlow, PyTorch

Developed predictive models to enhance business insights

Databricks & Apache Spark

Implemented data transformation and de-duplication logic to ensure high data quality.

Developed DataFrames and performed data analysis using Spark SQL, including ad-hoc joins and broadcast joins for optimized queries.

Built and optimized Delta Live Table (DLT) frameworks, successfully implementing four dashboards using DLT.

Designed and deployed real-time data processing pipelines using Spark Structured Streaming and Apache Kafka for dynamic data insights.
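
A minimal sketch of the de-duplication and broadcast-join pattern described above, in PySpark; table paths and column names are hypothetical.

```python
# Sketch: de-duplicate raw records and join a large fact DataFrame to a small
# dimension with an explicit broadcast join. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("dedupe_and_join").getOrCreate()

orders = spark.read.parquet("/mnt/raw/orders")        # large fact data
customers = spark.read.parquet("/mnt/raw/customers")  # small dimension

# Keep only the latest record per order_id (simple de-duplication rule).
latest = Window.partitionBy("order_id").orderBy(col("updated_at").desc())
orders_dedup = (
    orders.withColumn("rn", row_number().over(latest))
          .filter(col("rn") == 1)
          .drop("rn")
)

# Broadcast the small dimension so the join avoids shuffling the fact table.
enriched = orders_dedup.join(broadcast(customers), on="customer_id", how="left")

enriched.write.mode("overwrite").parquet("/mnt/curated/orders_enriched")
```

Broadcasting the dimension table keeps the large fact data in place, which is what makes this join pattern cheap compared with a shuffle join.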

Real-Time Streaming & Kafka

Deep expertise in Kafka architecture, including topic partitioning, replication, and fault tolerance.

Utilized Kafka for real-time activity tracking and log aggregation to monitor system performance.

Designed real-time streaming pipelines using NiFi, Kafka, Spark, and HBase, processing system logs in real time.

Optimized streaming jobs by fine-tuning Kafka consumer configurations, checkpointing mechanisms, and resource allocation.
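
A minimal sketch of a tuned Kafka consumer for log aggregation, using the confluent-kafka Python client with manual offset commits; the brokers, group id, topic, and processing function are hypothetical.

```python
# Sketch: tuned Kafka consumer for a log-aggregation feed, with manual offset
# commits so records are only acknowledged after successful processing.
# Broker list, group id, topic, and the handler are hypothetical.
from confluent_kafka import Consumer, KafkaError

conf = {
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "group.id": "log-aggregator",
    "auto.offset.reset": "earliest",   # start from the beginning on first run
    "enable.auto.commit": False,       # commit offsets manually after processing
    "max.poll.interval.ms": 300000,    # allow slow batches without a rebalance
    "fetch.min.bytes": 65536,          # trade a little latency for larger fetches
}

def process_log_line(payload: bytes) -> None:
    """Placeholder for real log handling (parse, enrich, write to a sink)."""
    print(payload.decode("utf-8", errors="replace"))

consumer = Consumer(conf)
consumer.subscribe(["system-logs"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            if msg.error().code() == KafkaError._PARTITION_EOF:
                continue
            raise RuntimeError(msg.error())
        process_log_line(msg.value())
        consumer.commit(message=msg, asynchronous=False)
finally:
    consumer.close()
```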

SQL & NoSQL Databases

Strong command over SQL with the ability to write complex queries for structured data analysis.

Designed and optimized HBase tables, implementing offset management and de-duplication strategies.

Extensive experience with NoSQL databases, including DynamoDB, CosmosDB, MongoDB, Cassandra, and HBase, ensuring scalable and resilient data storage solutions.
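
A minimal sketch of an idempotent DynamoDB write using a conditional put, one way to enforce the de-duplication described above; the table name, key, and region are hypothetical.

```python
# Sketch: idempotent writes to DynamoDB using a conditional put, so replayed
# events don't create duplicate items. Table, key, and region are hypothetical.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb", region_name="us-east-1").Table("events")

def put_event_once(event: dict) -> bool:
    """Insert the event only if its event_id has not been seen before."""
    try:
        table.put_item(
            Item=event,
            ConditionExpression="attribute_not_exists(event_id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False   # duplicate delivery; safely ignored
        raise

put_event_once({"event_id": "evt-123", "source": "kafka", "payload": "{}"})
```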

Technical Skills:

Programming Languages: Python, Scala, Java, Golang

Scripting: Shell scripting within UNIX/Linux environments

Cloud Platforms: AWS (EMR, EC2, RDS, S3, Lambda, Glue, Redshift), Azure (Data Lake, Storage, SQL, Databricks), Google Cloud Platform

Big Data Technologies: Hadoop ecosystem (HDFS, MapReduce, Hive, Pig, Oozie, Flume, Cassandra), Spark (Scala integration, PySpark, RDDs, Data Frames, Spark SQL, Spark MLlib, Spark GraphX)

ETL Tools: Microsoft SQL Server Integration Services (SSIS), Informatica PowerCenter, SnowSQL, Talend

Data Modeling: Star Schema, Snowflake Schema

Databases: SQL Server, NoSQL (DynamoDB, MongoDB), Oracle (PL/SQL)

Data Visualization Tools: Tableau, Power BI

Infrastructure as Code: Terraform

API Development: JWT, OAuth2, API keys

Microservices Framework: Spring Boot

CI/CD Tools: Jenkins, Docker, Concourse, Bitbucket

Version Control Systems: Git, SVN, Bamboo

Testing Tools: Apache JMeter, QuerySurge, Talend Data Quality

Network Protocols: DNS, TCP/IP, VPN

Real-time data Streaming: Apache Kafka, Apache Storm

Machine Learning & AI Platforms: TensorFlow, PyTorch

Methodologies: Agile, Scrum, Test-Driven Development (TDD)

Professional Experience:

Client: Altice, Bethpage, NY May 2024 – Present

Role: Sr Data Engineer

Responsibilities:

Designed and implemented FDA-regulated US state reporting solutions on Azure Cloud, leading Teradata retirement and data migration to GCP (BigQuery, Cloud Storage, Cloud SQL).

Built Azure Synapse pipelines for ingesting inbound data via APIs, processing through SQL pools and Spark pools for data cleansing and transformation.

Developed real-time streaming solutions using Kafka, Synapse Analytics, and Spark Structured Streaming to process high-velocity data.

Created and optimized ETL pipelines in Apache Airflow (GCP) and Azure Data Factory (ADF) to automate data integration workflows.

Implemented Delta Live Tables (DLT) in Databricks with Medallion architecture, auto-inferred JSON schema, and real-time streaming live tables for enhanced data availability (a DLT sketch appears at the end of this section).

Designed high-performance Snowflake data models, optimizing query performance with clustering, materialized views, and warehouse scaling.

Migrated HDFS and Hive workloads to Snowflake and BigQuery, ensuring cost-efficient ELT transformations.

Designed and implemented AtScale semantic layer on Snowflake to support self-service analytics for BI tools like Power BI, Tableau, and Looker.

Optimized query performance using AtScale Aggregate Awareness, reducing Snowflake query execution time by 80% through pre-aggregated roll-ups.

Standardized business metrics and KPIs across multiple BI tools, ensuring cross-platform consistency and reducing data discrepancies.

Integrated AtScale security with Snowflake’s RBAC model, providing role-based access control and enforcing data governance policies.

Reduced Snowflake compute costs by 30% by pushing heavy aggregation queries to AtScale’s semantic layer, improving query efficiency.

Applied Machine Learning models in Databricks using TensorFlow/PyTorch, integrating predictive analytics into data pipelines.

Optimized MongoDB performance through indexing, bulk operations, and schema design for large-scale data ingestion.

Integrated DBT with Apache Airflow to schedule and orchestrate Snowflake transformations in an automated pipeline.

Developed Jinja macros and DBT hooks to configure table partitioning dynamically, reducing query execution times by 40%.

Implemented CI/CD pipelines in DBT & Git for version-controlled data modeling and automated deployments.

Leveraged AtScale's push-down optimization to execute queries efficiently on Snowflake without moving data.

Developed and maintained CI/CD pipelines for cloud-based data solutions using Terraform, Jenkins, and Git.

Led end-to-end data migration from Teradata to Snowflake and BigQuery, optimizing performance and reducing cloud costs.

Designed Snowflake-based ELT pipelines using dbt and SnowSQL, implementing role-based access control (RBAC), data masking, and encryption for security compliance.

Built real-time analytics solutions by integrating Snowflake Streams & Tasks with Kafka and Spark Streaming.

Fine-tuned Snowflake warehouse performance, leveraging clustering, auto-scaling, and query optimization strategies to improve execution speed.

Implemented Machine Learning workflows in Databricks, training predictive models on customer behavior, fraud detection, and anomaly detection using Spark MLlib, TensorFlow, and PyTorch.

Developed feature engineering pipelines in Databricks and Snowflake, enabling ML models to process structured and semi-structured data efficiently.

Created time-series forecasting models using Azure ML and GCP Vertex AI, integrating predictive analytics into business intelligence dashboards.

Developed modular DBT transformations (staging, intermediate, and fact/dimension models) to convert raw event data into structured tables in Snowflake.

Implemented DBT incremental models to optimize streaming data ingestion from Kafka and Snowflake Streams & Tasks, reducing data processing time by 60%.

Built CI/CD pipelines for DBT models using Git and Jenkins, ensuring automated deployments and version control.

Enhanced data quality using DBT tests (unique, not null, referential integrity), ensuring clean and accurate reporting data.

Orchestrated DBT workflows with Apache Airflow, scheduling transformations and maintaining lineage tracking.

Designed a governed semantic layer using AtScale, allowing business users to interact with trusted, curated data sources.

Integrated AtScale and DBT models, ensuring that transformations performed in DBT are reflected in AtScale’s semantic layer for analytics.

Developed a unified data governance framework that ensured KPI standardization across finance, sales, and marketing teams.

Automated lineage tracking in AtScale, mapping semantic layer elements to source tables in Snowflake.

Implemented column-level security and data masking policies, ensuring compliance with data privacy regulations.

Automated data quality checks using Great Expectations and dbt, ensuring high accuracy in Snowflake and Databricks datasets.

Designed streaming data pipelines in GCP (Dataflow, Pub/Sub) and Azure (Synapse, Event Hubs) for near real-time data ingestion and transformation.

Integrated Snowflake with Looker, Tableau, and Power BI to create interactive dashboards and self-service analytics for business users.

Built high-performance, scalable ETL solutions using Apache Airflow, Azure Data Factory, and Terraform for infrastructure automation.
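
A minimal sketch of the Medallion-style Delta Live Tables pipeline referenced in this section (bronze, silver, gold) using the Databricks dlt Python API; the source path, schema hints, expectations, and table names are hypothetical, and the code only runs inside a Databricks DLT pipeline.

```python
# Sketch: Medallion-style Delta Live Tables pipeline (bronze -> silver -> gold).
# Source path, expectations, and table names are hypothetical.
# `spark` is provided implicitly by the DLT pipeline runtime.
import dlt
from pyspark.sql.functions import col, to_date, sum as _sum

@dlt.table(comment="Raw JSON events loaded as-is with Auto Loader")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.inferColumnTypes", "true")   # auto-inferred JSON schema
        .load("/mnt/raw/events/")
    )

@dlt.table(comment="Cleansed events with basic quality expectations")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount >= 0")
def silver_events():
    return (
        dlt.read_stream("bronze_events")
        .withColumn("event_date", to_date(col("event_ts")))
        .dropDuplicates(["event_id"])
    )

@dlt.table(comment="Daily revenue aggregate for reporting")
def gold_daily_revenue():
    return (
        dlt.read("silver_events")
        .groupBy("event_date", "region")
        .agg(_sum("amount").alias("total_amount"))
    )
```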

Environment: Azure Data Factory, Azure Databricks, Azure Synapse, Azure Data Lake, Azure DW, Azure Event Hubs, GCP (BigQuery, Cloud Storage, Cloud SQL, Dataflow, Dataproc, Pub/Sub, Vertex AI), AWS (S3, Glue, Redshift, Lambda, EC2), Snowflake (Streams & Tasks, Snowpipe, clustering, RBAC, data masking), PostgreSQL (pgAdmin 4), DBeaver, MongoDB, CosmosDB, DynamoDB, Cassandra, Oracle, Teradata, Apache Spark (PySpark, Spark SQL, MLlib, Structured Streaming), Kafka, HDFS, Hive, NiFi, Airflow, dbt, Great Expectations (data quality), SnowSQL, Talend, Terraform, Jenkins, Docker, Git, Azure DevOps, Looker, Power BI, Tableau, SharePoint, TensorFlow, PyTorch, Azure ML, scikit-learn, MLflow.

Client: UPS, Atlanta, GA Jun 2022 – April 2024

Role: Sr. Data Engineer

Responsibilities:

Directed and orchestrated complex data engineering initiatives in the insurance sector from inception through maturity, covering planning, strategy, execution, and maintenance while integrating Agile and Waterfall methodologies to balance flexibility and rigor in project management.

Engineered and maintained robust multi-node clusters on AWS, leveraging EC2 instances for scalable computing. Implemented comprehensive monitoring and alerting systems utilizing CloudWatch and CloudTrail, ensuring operational excellence and security across EBS, EC2, ELB, RDS, S3 and SNS services, while enforcing best practices in data security within S3 buckets.

Experience with Google Cloud components, Google Container Builder, and GCP client libraries.

Spearheaded the modernization of legacy insurance claims database systems and Informatica ETL processes to AWS Cloud, Redshift and Snowflake platforms, innovating with asynchronous task management tools like Celery, RabbitMQ and Redis to enhance performance and scalability.

Extended AWS DynamoDB functionality by integrating with Lambda for serverless operations and developing Spark scripts for AWS Glue jobs and EMR processing (a Glue job sketch appears at the end of this section), automating workflows with Python to achieve operational efficiency and reliability.

Mastered data transfer operations with Sqoop, facilitating seamless data exchange between Snowflake, Oracle and DB2 systems in the insurance sector. Refined database management capabilities with advanced SQL and PL/SQL scripting, deepening proficiency in Snowflake's database architecture.

Developed a governed semantic layer using AtScale, allowing business users to query pre-modeled data without writing complex SQL.

Enabled federated querying across Snowflake and Redshift, allowing seamless data access through AtScale’s virtualization layer.

Built dynamic cubes and hierarchies in AtScale to support hierarchical rollups, drill-downs, and multi-level aggregation for sales and finance reports.

Leveraged AtScale’s caching mechanisms to improve query response times, reducing dashboard load times by 50%.

Ensured compliance with enterprise data governance standards, mapping AtScale semantic models to GDPR and HIPAA regulatory policies.

Experience building Snowpipe ingestion and developing transformation logic on top of Snowpipe loads.

Employed BigQuery for efficient data warehousing solutions, optimizing data integration, loading, and transformation processes from Google Cloud Storage, Cloud Pub/Sub, and heterogeneous external databases, ensuring data fluidity and accessibility.

Leveraged a comprehensive suite of technologies including Spark, PySpark, Hive, Hadoop and Scala for a broad spectrum of data-related tasks from analytics, ingestion and integrity verification to the management of diverse data formats such as JSON, CSV, Parquet and Avro in the insurance analytics domain.

Migrated legacy ETL processes to DBT Core, reducing pipeline execution time and improving maintainability.

Refactored DBT models to align with Snowflake best practices, using materialized views and incremental processing to reduce warehouse consumption.

Enabled automated DBT documentation, ensuring data lineage visibility for analysts and business users.

Integrated DBT with AtScale, allowing transformed DBT models to be consumed directly by the semantic layer without additional processing.

Implemented DBT exposures and seeds to provide curated datasets for machine learning models and advanced analytics.

Architected and administered real-time data streaming infrastructures using Apache Kafka, facilitating immediate data processing and insights, essential for real-time decision-making and analytics in insurance.

Pioneered automation in data collection from a variety of sources such as APIs, AWS S3, Teradata and Redshift, utilizing PySpark and Scala. Implemented Oozie workflows for strategic job orchestration, enhancing the software development lifecycle's efficiency.

Created compelling and interactive dashboards and reports with Power BI, transforming raw insurance data into actionable insights and decision-support tools.

Good understanding of the Snowflake cloud platform, with hands-on experience using Zero-Copy Cloning and Time Travel.

Enabled cross-platform metric consistency by centralizing KPI logic in the AtScale semantic layer.

Reduced dashboard response times by 40% by leveraging AtScale’s in-memory processing and optimized query execution paths.

Implemented data classification and tagging policies in AtScale, allowing security teams to audit and manage data access effectively.

Built a metadata-driven approach to data modeling, reducing manual transformations and increasing standardization.

Created automated alerts for data anomalies, integrating AtScale and DBT quality checks with Snowflake monitoring tools.

Conceptualized and evaluated advanced dimensional data models, employing Star and Snowflake schemas and integrating industry-standard methodologies advocated by Ralph Kimball and Bill Inmon to optimize data warehouse design and functionality in insurance.

Developed and instituted sophisticated logging, monitoring and error management frameworks within REST APIs, enhancing system reliability and operational transparency.

Executed the deployment of microservices architecture on Kubernetes clusters, leveraging Jenkins for robust CI/CD pipelines and utilized Jira for effective project management and issue tracking, facilitating agile and efficient development cycles in the insurance domain.

Demonstrated version control using Git, ensuring meticulous code management practices and fostering a culture of transparency and collaboration in the development workflow.

Applied advanced testing strategies and tools, including Apache JMeter, to rigorously validate the accuracy, performance and integrity of ETL processes and data migrations, ensuring the highest data quality standards in insurance data handling.
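
A minimal sketch of the kind of AWS Glue Spark job referenced in this section: read a Glue Data Catalog table, retype a few columns, and write Parquet to S3. The database, table, and bucket names are hypothetical.

```python
# Sketch: skeleton of an AWS Glue Spark job that reads a catalog table,
# retypes a few columns, and writes Parquet to S3.
# Database, table, and bucket names are hypothetical.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table registered in the Glue Data Catalog.
claims = glue_context.create_dynamic_frame.from_catalog(
    database="insurance_raw", table_name="claims"
)

# Keep and retype only the columns the downstream model needs.
mapped = ApplyMapping.apply(
    frame=claims,
    mappings=[
        ("claim_id", "string", "claim_id", "string"),
        ("claim_amount", "string", "claim_amount", "double"),
        ("claim_date", "string", "claim_date", "date"),
    ],
)

glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://insurance-curated/claims/"},
    format="parquet",
)

job.commit()
```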

Environment: Agile, Waterfall, AWS (EC2, CloudWatch, CloudTrail, EBS, ELB, RDS, S3, SNS, DynamoDB, Lambda, Glue, EMR), GCP (BigQuery, Cloud Storage, Cloud Pub/Sub), Informatica ETL, Redshift, Snowflake, Celery, RabbitMQ, Redis, Spark (PySpark), Python, Sqoop, Oracle, DB2, SQL, PL/SQL, Hive, Hadoop, Scala, JSON, CSV, Parquet, Avro, Apache Kafka, APIs, Teradata, Oozie, Power BI, Star Schema, Snowflake Schema, REST APIs, Kubernetes, Jenkins, Jira, Git, Apache JMeter.

Client: HTC Global Services, India Dec 2019 – Dec 2021

Role: Data Engineer

Responsibilities:

Full Data Pipeline (FDPL): built configurable, metadata-driven, customized frameworks. Developed ETL pipelines using Databricks to build Silver/Gold layer tables with PySpark.

Experience sizing clusters for development and integrating Git with Azure DevOps. Ingested data from various sources using Azure Data Factory and Azure Web Apps.

Migrated the Databricks ETL jobs to Azure Synapse Spark pools. The ETL handles structured data from the Oracle source system, ingested through Synapse pipelines using the Oracle SQL connector.

Designed a star schema for customer analytics, resulting in a 20% reduction in query response time.

Implemented snowflake schema for a financial reporting system enhancing data integrity and ease of maintenance.

Collaborated with marketing teams to align dimensional models with campaign tracking requirements, enabling more accurate performance analysis.

Built Synapse Spark ETL notebooks to standardize data coming from vendor-specific SFTP accounts and write the resulting data into Azure Blob Storage.

Used Azure Synapse dedicated SQL pools to build a data model of fact and dimension tables for KPIs based on 24-hour delta records.

Built Synapse Data Flows for standard KPI model building, reading data from shared data sources (internal data teams/pods) and loading it into SQL pools (MPP); the data is then pulled into Power BI for dashboarding.

Migrated legacy batch workflows from on-premises systems (XML, CSV, XLS files) to the Data Lake using Synapse pipelines.

Data integration: applied business rules and made data available to different consumers using Databricks Spark. Built a data delivery framework for data-driven caching, ad-hoc data access, and vendor and API integration.

Built a data-driven caching layer for delivering the Sales Dashboard using Power BI and Snowflake. Built a standardized, automated vendor integration model (recognized as a standard template by the engineering org).

Migrated two data marts from Teradata into Snowflake using SnowSQL and external storage integration. Used the COPY INTO command to ingest large files from Azure Blob Storage into a Snowflake stage and automated the ingestion process with SnowSQL commands and shell scripts.

Conducted daily code reviews, provided low-level architecture designs for the Azure pipelines, and interacted with the senior leadership team (SLT) to gather additional requirements (stretch goals) and deliver demos to cross-functional teams.

Developed workflows using Databricks Delta Live Tables and used MERGE SQL to perform change data capture for SCD Type-2 tables (sketched at the end of this section). Used OPTIMIZE with Z-ordering for data compaction and VACUUM commands to manage the dataset lifecycle.

Created mount points on Databricks to connect to Blob Storage, retrieve data, and perform data analysis using PySpark on Databricks clusters.

Enabled Git on Databricks for versioning and used widgets to set script parameters. Implemented sub-routines using remote Databricks job execution in case of master job failure.

Implemented a streaming pipeline on clickstream data by connecting Databricks with Azure Event Hubs and Azure Stream Analytics.

Migrated Apache Hive tables/models from Hadoop to Databricks on Azure and implemented access policies to restrict user access at the table and schema level for all Databricks tables. Worked on Snowpark to convert the PySpark jobs into equivalent Snowflake Snowpark jobs.

Built shell scripts to load data from SFTP and land it in HDFS as raw data. Built Python jobs to run data quality checks using dbt and Great Expectations.
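
A minimal sketch of the MERGE-based SCD Type-2 pattern referenced in this section, using the single-MERGE "null merge key" approach on Delta, followed by OPTIMIZE/VACUUM maintenance; table and column names are hypothetical, and a Databricks/Delta Lake runtime is assumed.

```python
# Sketch: SCD Type-2 change data capture into a Delta dimension with one MERGE.
# Changed customers appear twice in the source: once with a NULL merge key
# (always inserted as the new current row) and once keyed on customer_id
# (matches and closes out the old current row). Names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    MERGE INTO dim_customer AS tgt
    USING (
        SELECT NULL AS merge_key, u.*
        FROM staged_customer_updates u
        JOIN dim_customer d
          ON u.customer_id = d.customer_id AND d.is_current = true
        WHERE u.address <> d.address OR u.segment <> d.segment
        UNION ALL
        SELECT u.customer_id AS merge_key, u.*
        FROM staged_customer_updates u
    ) AS src
    ON tgt.customer_id = src.merge_key AND tgt.is_current = true
    WHEN MATCHED AND (tgt.address <> src.address OR tgt.segment <> src.segment) THEN
      UPDATE SET is_current = false, end_date = current_date()
    WHEN NOT MATCHED THEN
      INSERT (customer_id, address, segment, start_date, end_date, is_current)
      VALUES (src.customer_id, src.address, src.segment, current_date(), NULL, true)
""")

# Periodic maintenance: compact small files and expire old snapshots.
spark.sql("OPTIMIZE dim_customer ZORDER BY (customer_id)")
spark.sql("VACUUM dim_customer RETAIN 168 HOURS")
```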

Environment: Azure Data Factory, Azure Databricks, Spark, Kafka, Log Analytics, Azure DevOps (Git & CI/CD), HDP, Hive, Sqoop, Oracle GoldenGate, Teradata, Google Campaign Manager, Python, shell scripting, Snowflake, Azure Web Apps, Azure App Service, Azure Datahub.

Client: Mach Solutions, India Aug 2017 – Nov 2019

Role: Data Analyst/Data Engineer

Responsibilities:

Built Databricks jobs to implement data quality checks and transform the sales data received from front-end applications.

Built Databricks jobs to read real-time feeds from online sales and process them through Spark Structured Streaming.

Built ETL jobs to read from SFTP accounts, extract files to Databricks DBFS storage, and process them with PySpark.

Built Data Factory jobs using the Auto Loader interface and various ingestion pipelines for Oracle and Salesforce, landing the data into Azure Data Lake Storage Gen1 (ADLS Gen1).

Built interdependent jobs to capture KPIs and build metrics for sales funnels, then redirected the output to Power BI dashboards.

Built streaming jobs using Spark Streaming that transfer data from transaction servers into Cassandra tables for real-time workloads; the tables are then used by machine learning models to verify transaction authenticity.

Wrote Spark jobs to convert Oracle materialized views, generating the datasets faster for the Tableau dashboards. The ETL for these jobs runs on EMR, which triggers downstream jobs on successful execution using Step Functions; developed the end-to-end automation framework to implement this functionality.

Built NiFi pipelines to transform raw JSON data into separate datasets by category. Installed the NiFi platform on EC2 and redeployed the workflows from the on-prem NiFi instance to the EC2 NiFi instance.

Migrated historical data to S3 and developed a reliable mechanism for processing incremental updates. The data was migrated from the Hadoop cluster using the DistCp command to move large datasets to AWS S3.

Merged the raw datasets of the transactional vertical based on business criteria using Spark to create a master dataset for dashboard building. Used Python UDFs to implement business logic, embedding the UDFs into map transformations with PySpark (sketched at the end of this section).

Developed Spark jobs using Scala on top of YARN for interactive and batch analysis. Developed UNIX shell scripts to load many files into HDFS from the Linux file system.

Experience querying data using Spark SQL for faster processing of datasets. Offloaded data from the EDW into the Hadoop cluster using Sqoop. Developed Sqoop scripts for importing and exporting data to HDFS and Hive.

Configured cron jobs to run raw-layer data extraction jobs on a fixed schedule. Also used Oozie to schedule Sqoop and Hive jobs to build the data model in Hive.
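
A minimal sketch of a Python UDF embedded in a PySpark transformation, as referenced in this section; the risk-banding rule, paths, and column names are hypothetical.

```python
# Sketch: business-logic Python UDF applied inside a PySpark transformation,
# of the kind used to enrich the merged transactional dataset.
# The categorization rule, paths, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("master_dataset_build").getOrCreate()

def risk_band(amount):
    """Hypothetical business rule: bucket a transaction amount into a risk band."""
    if amount is None:
        return "unknown"
    if amount >= 10000:
        return "high"
    if amount >= 1000:
        return "medium"
    return "low"

risk_band_udf = udf(risk_band, StringType())

transactions = spark.read.parquet("s3a://analytics-raw/transactions/")

master = (
    transactions
    .withColumn("risk_band", risk_band_udf(col("amount")))
    .filter(col("status") == "SETTLED")
)

master.write.mode("overwrite").parquet("s3a://analytics-curated/master_transactions/")
```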

Environment: AWS, S3, Sqoop, Kafka, Spark, Spark SQL, Hive, Linux, Oozie, Java, Scala, Eclipse, Tableau, UNIX shell scripting, PuTTY.


