
Senior Data Engineer – Lakehouse & Real-Time Pipelines



Sravan Thotapally

******************@*****.*** +1-314-***-****

PROFESSIONAL SUMMARY

Data Engineer with 6+ years of experience designing, building, and optimizing large-scale data pipelines, ETL/ELT frameworks, and cloud-native solutions across the banking, insurance, and telecom domains. Proven expertise in modern data architectures (Lakehouse, Medallion), batch and real-time ingestion, data modeling, cloud platforms (AWS, Azure, GCP), and enterprise-grade tools such as Databricks, Snowflake, Kafka, Airflow, and dbt. Adept in Python, SQL, Spark, and Java for scalable data transformations and advanced analytics. Skilled at cross-functional collaboration with data scientists, analysts, and DevOps teams to deliver secure, reliable, high-performance solutions.

TECHNICAL SKILLS

Languages: Python, SQL, Scala, PySpark, Java, Shell

Big Data: Apache Spark, Hadoop, Hive, HBase, MapReduce, Kafka, Flink

ETL Tools: Apache Airflow, Azure Data Factory, AWS Glue, Informatica, SSIS

Data Warehousing: Snowflake, Redshift, BigQuery, Synapse Analytics

Cloud Platforms: AWS (Glue, S3, Lambda, EMR, Redshift), Azure (ADF, Databricks, Synapse, Blob), GCP (Dataflow, BigQuery, Composer, GCS)

Data Lakehouse: Databricks, Delta Lake, Unity Catalog, Lake Formation

CI/CD & DevOps: Git, GitLab, Jenkins, Docker, Kubernetes, Terraform

Monitoring & Logging: CloudWatch, Stackdriver, Azure Monitor

Databases: PostgreSQL, MySQL, Oracle, SQL Server, MongoDB, Cassandra

BI Tools: Power BI, Tableau, Looker

Security: OAuth2, IAM, RBAC, VPC, Data Encryption (KMS/CMK)

Other: dbt, AI/ML, JIRA, Confluence

PROFESSIONAL EXPERIENCE

Bank of America, Plano, TX

Senior Data Engineer Jan 2024 – Present

Designed and implemented a centralized CSDR data lakehouse on Azure Databricks, integrating 25+ trading and settlement systems and improving compliance data accessibility by 40%.

Built real-time Kafka and PySpark pipelines processing over 10M daily transactions, achieving 99.8% on-time delivery for regulatory surveillance data.
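
For illustration, a minimal PySpark Structured Streaming sketch of such a Kafka-to-lakehouse feed; the broker, topic, schema, and paths below are placeholders rather than the production values, and the Kafka source assumes the spark-sql-kafka package is available on the cluster:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("txn-surveillance-stream").getOrCreate()

# Hypothetical transaction schema; the real feed carries many more fields.
schema = (StructType()
          .add("txn_id", StringType())
          .add("account_id", StringType())
          .add("amount", DoubleType())
          .add("event_time", TimestampType()))

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "settlement-transactions")    # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Kafka values arrive as bytes; cast to string and parse the JSON payload.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("t"))
          .select("t.*"))

# Land the parsed stream in a Delta table; the checkpoint gives restartability.
query = (parsed.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/txns")  # placeholder path
         .outputMode("append")
         .start("/mnt/delta/bronze/transactions"))               # placeholder path

query.awaitTermination()
```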

Developed automated reconciliation and data quality checks using Great Expectations, reducing compliance data mismatches by 95%.
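
A sketch of how such checks can look with the classic (pre-1.0) Great Expectations pandas API; the columns and expectations are illustrative, not the actual compliance suite:

```python
import great_expectations as ge
import pandas as pd

# Hypothetical reconciliation extract; real data would come from the lakehouse.
df = pd.DataFrame({
    "txn_id": ["T1", "T2", "T3"],
    "amount": [100.0, 250.5, 75.25],
    "settlement_status": ["SETTLED", "FAILED", "SETTLED"],
})

gdf = ge.from_pandas(df)

# Core integrity checks: keys present and unique, statuses from a closed set.
gdf.expect_column_values_to_not_be_null("txn_id")
gdf.expect_column_values_to_be_unique("txn_id")
gdf.expect_column_values_to_be_in_set(
    "settlement_status", ["SETTLED", "FAILED", "PENDING"])
gdf.expect_column_values_to_be_between("amount", min_value=0)

results = gdf.validate()
print(results.success)  # False if any expectation failed
```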

Delivered CSDR performance metrics and dashboards (settlement fail rates, buy-in timeliness, penalty accuracy) in Power BI, enabling 35% faster regulatory reporting.

Built Lakehouse pipelines using Delta Lake and implemented Medallion Architecture (Bronze, Silver, Gold) on Azure Databricks.
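
A condensed sketch of the Bronze/Silver/Gold promotion pattern on Delta Lake, with illustrative paths and columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("medallion-batch").getOrCreate()

# Bronze: raw ingested records, stored as-is (paths are placeholders).
bronze = spark.read.format("delta").load("/mnt/delta/bronze/trades")

# Silver: cleansed and conformed - deduplicate and enforce basic validity.
silver = (bronze.dropDuplicates(["trade_id"])
          .filter(F.col("amount").isNotNull() & (F.col("amount") > 0)))
silver.write.format("delta").mode("overwrite").save("/mnt/delta/silver/trades")

# Gold: business-level aggregate consumed by reporting.
gold = (silver.groupBy("counterparty", F.to_date("event_time").alias("trade_date"))
        .agg(F.count("*").alias("trade_count"),
             F.sum("amount").alias("gross_amount")))
gold.write.format("delta").mode("overwrite").save("/mnt/delta/gold/trades_daily")
```

Each layer is independently queryable, so data quality issues can be traced back through Silver to the untouched Bronze records.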

Created automated ETL + ML orchestration workflows with Airflow, ensuring end-to-end reproducibility of AI/ML processes.
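
A minimal DAG sketch of such an ETL-then-train ordering, assuming a recent Airflow 2.x; the task bodies and DAG id are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; in the real workflow these would trigger
# Databricks jobs and model-training entry points.
def run_etl():
    print("run Spark ETL job")

def train_model():
    print("retrain model on fresh features")

def publish_metrics():
    print("publish evaluation metrics")

with DAG(
    dag_id="etl_ml_daily",  # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    etl = PythonOperator(task_id="etl", python_callable=run_etl)
    train = PythonOperator(task_id="train", python_callable=train_model)
    publish = PythonOperator(task_id="publish", python_callable=publish_metrics)

    # ETL must finish before training; metrics publication comes last.
    etl >> train >> publish
```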

Ingested data from 20+ source systems via CDC using Kafka Connect and Debezium, pushing to Azure Data Lake Storage Gen2.
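
One way a Debezium source connector can be registered through the Kafka Connect REST API; the connector below is a hypothetical SQL Server example with placeholder hosts, credentials, and tables:

```python
import json
import requests

# Hypothetical Debezium 2.x SQL Server connector registration.
connector = {
    "name": "trades-cdc",
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "database.hostname": "sqlserver.internal",   # placeholder host
        "database.port": "1433",
        "database.user": "cdc_user",                 # placeholder credentials
        "database.password": "********",
        "database.names": "trading",
        "table.include.list": "dbo.trades,dbo.settlements",
        "topic.prefix": "cdc",
        "schema.history.internal.kafka.bootstrap.servers": "broker:9092",
        "schema.history.internal.kafka.topic": "schema-history.trading",
    },
}

# Kafka Connect exposes connector management over REST (endpoint is a placeholder).
resp = requests.post(
    "http://connect.internal:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```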

Developed reusable PySpark notebooks for batch and streaming ETL and deployed them as ADF pipelines and Databricks jobs.

Authored complex SQL queries, stored procedures, and views in Snowflake and Synapse to support fraud detection, reporting, and analytics.

Designed and optimized Snowflake and Synapse warehouse schemas (fact/dimension tables) for regulatory and business reporting.

Created dbt models in Snowflake for transformation and metrics logic, powering 100+ Power BI dashboards.

Integrated structured (RDBMS), semi-structured (JSON, Avro), and streaming (Kafka topics) data into unified Lakehouse layers.

Partnered with data science teams to design real-time model scoring infrastructure by consuming predictions through Kafka topics and exposing REST APIs for fraud risk classification.

Developed feature engineering pipelines in Databricks using PySpark and Delta Lake, automating model input generation for supervised ML models aimed at transaction anomaly detection.

Optimized performance in Synapse Analytics through distribution strategy tuning, caching, and partitioning.

Implemented Unity Catalog and enabled fine-grained access controls and column-level masking for sensitive PII.

Designed a data quality framework using Great Expectations and integrated its validations into CI/CD in Azure DevOps.

Collaborated with ML teams to operationalize fraud detection models using Kafka REST proxy and Python APIs.

Implemented GitLab CI/CD pipelines to automate deployment of 200+ Databricks notebooks and PySpark jobs, reducing release cycle time by 40%.

Created Terraform scripts to provision Azure infrastructure, ensuring consistency and scalability.

Designed and implemented machine learning pipelines in Databricks using PySpark and MLlib for classification, regression, and clustering tasks, improving fraud detection accuracy.
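
A compact sketch of such an MLlib classification pipeline; the feature columns, label, and path are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("fraud-clf").getOrCreate()

# Hypothetical engineered features read from a gold Delta table.
df = spark.read.format("delta").load("/mnt/delta/gold/txn_features")
features = ["amount", "txn_count_24h", "avg_amount_7d"]

assembler = VectorAssembler(inputCols=features, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(labelCol="is_fraud", featuresCol="features")

# Chain feature assembly, scaling, and the classifier into one pipeline.
pipeline = Pipeline(stages=[assembler, scaler, lr])
train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

auc = BinaryClassificationEvaluator(labelCol="is_fraud").evaluate(model.transform(test))
print(f"AUC: {auc:.3f}")
```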

Integrated TensorFlow and Scikit-learn models into production ETL workflows for real-time scoring via Kafka streaming.

Tools & Technologies:

Kafka, Spark, Databricks, Delta Lake, Azure Data Factory, Synapse, Snowflake, dbt, Python, SQL, Airflow, Power BI, GitLab, Tableau, Azure Monitor, Unity Catalog, Great Expectations, Azure DevOps, MLflow, Scikit-learn, REST APIs

Unified Life Insurance Co – Overland Park, KS

Big Data Engineer Jan 2023 – Dec 2023

Built modular PySpark jobs in Azure Databricks to process actuarial, claims, and policy data from ADLS Gen2 to Snowflake and Azure Synapse Analytics.

Automated schema evolution and partition management in Delta Lake and Snowflake tables, enabling self-healing ingestion pipelines.
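
The Delta Lake side of that self-healing behavior can be as simple as enabling schema merging on write; this sketch uses placeholder paths and a hypothetical partition column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# Hypothetical incoming batch whose schema may have gained new columns.
incoming_df = spark.read.parquet("/mnt/landing/policies/batch")  # placeholder path

(incoming_df.write
 .format("delta")
 .mode("append")
 .option("mergeSchema", "true")  # additive columns are merged, not rejected
 .partitionBy("policy_year")     # hypothetical partition column
 .save("/mnt/delta/silver/policies"))
```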

Created dbt models on Snowflake and Azure Synapse for generating curated datasets used by actuaries and business analysts.

Developed Python-based data preprocessing scripts and embedded feature engineering logic in Databricks ETL workflows to feed life insurance underwriting ML models hosted in Azure Machine Learning.

Designed normalized and dimensional schemas in Snowflake for policy, claims, and actuarial analytics.

Modeled Snowflake tables and Synapse structures with partitioning strategies tailored for actuarial and claims data.

Developed and tuned SQL queries, views, and stored procedures for RDBMS workloads, ensuring optimized access to large insurance datasets.

Supported actuarial data scientists by automating training data generation, feature selection, and accuracy tracking across model versions using Snowflake, Azure Data Explorer, and Pandas profiling.

Built near-real-time Azure Functions to run validations using Great Expectations, triggering alerts via Azure Event Grid and Azure Monitor.

Implemented Azure Key Vault encryption, RBAC-based data access, Snowflake role-based security, and Purview classification for HIPAA-aligned compliance.

Automated metadata lineage and data quality reporting pipelines, shared with stakeholders via Power BI dashboards.

Maintained GitLab-based version control for 50+ ETL workflows, ensuring consistent artifact deployment across dev, test, and prod.

Developed reusable SQL logic for reporting pipelines and actuarial models in Snowflake.

Enabled cost-saving measures by refactoring workflows to use Databricks spot VMs and optimizing Snowflake warehouse sizing and scheduling.

Performed advanced anomaly detection using Python-based statistical methods and integrated the checks into ADF, Databricks, and Snowflake pipelines.
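
A minimal example of the kind of statistical check this can involve; a z-score rule is an assumption here, since the exact method is not specified:

```python
import pandas as pd

def flag_anomalies(df: pd.DataFrame, col: str, z_thresh: float = 3.0) -> pd.DataFrame:
    """Flag rows whose value lies more than z_thresh standard
    deviations from the column mean (a simple z-score rule)."""
    mean, std = df[col].mean(), df[col].std()
    out = df.copy()
    out["z_score"] = (out[col] - mean) / std
    out["is_anomaly"] = out["z_score"].abs() > z_thresh
    return out

# Illustrative claim amounts with one obvious outlier (98000).
claims = pd.DataFrame({"claim_amount": [
    1200, 950, 1100, 1050, 1000, 980, 1120, 990, 1070, 1010, 930, 98000]})
print(flag_anomalies(claims, "claim_amount"))
```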

Maintained Git-based CI/CD pipelines for data artifact deployment using Azure DevOps, monitored with Azure Monitor, Snowflake Resource Monitors, and Azure Policy.

Provisioned and managed Azure and Snowflake infrastructure resources using Terraform, enabling automated, consistent, and version-controlled deployments.

Tools & Technologies:

Azure Data Factory (ADF), Azure Databricks, Azure Synapse Analytics, Snowflake, ADLS Gen2, Azure Data Explorer, Azure Machine Learning, Azure Functions, Azure Event Grid, Azure Logic Apps, Power BI, dbt, GitLab, Python, Pandas, NumPy, Scikit-learn, Great Expectations, Delta Lake, Terraform, Azure Key Vault, Azure Purview, Azure DevOps, RBAC, Git, Databricks Delta Live Tables, Snowflake Role-Based Security, Snowflake Resource Monitors.

Vodafone India – Mumbai, India

Data Engineer Jan 2021 – Aug 2022

Built distributed Spark and Flink jobs to process over 2 TB of data daily from network logs, call detail records (CDRs), and tower performance datasets.

Developed near-real-time customer churn prediction pipelines using Kafka and Flink, orchestrated via Airflow.

Designed normalized relational schemas and dimensional models (star/snowflake) to structure telecom datasets, improving query efficiency, scalability, and supporting advanced analytics for customer behavior and billing systems.

Authored and optimized SQL queries, views, and stored procedures to improve database performance and ensure reliable reporting across high-volume datasets.

Designed and scheduled Airflow DAGs for 100+ ETL jobs with GCP Composer, ensuring 99.9% SLA adherence.

Wrote advanced SQL queries to process telecom datasets, including customer churn, CDRs, and tower performance.

Designed partitioned and clustered BigQuery datasets for network KPIs and customer churn insights.
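
A sketch of defining such a table with the google-cloud-bigquery client; the project, dataset, and columns are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="telecom-analytics")  # placeholder project

schema = [
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("tower_id", "STRING"),
    bigquery.SchemaField("region", "STRING"),
    bigquery.SchemaField("drop_rate", "FLOAT"),
]

table = bigquery.Table("telecom-analytics.network.kpi_daily", schema=schema)

# Partition by date and cluster by the columns most often filtered on,
# so queries prune partitions and scan less data.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_date")
table.clustering_fields = ["region", "tower_id"]

client.create_table(table)
```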

Tuned Spark jobs using broadcast joins, adaptive query execution, and caching for reduced latency.
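
These three levers look roughly like this in PySpark; paths and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (SparkSession.builder
         .appName("cdr-enrichment")
         # Adaptive query execution re-plans joins and partition counts at runtime.
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         .getOrCreate())

cdrs = spark.read.parquet("/data/cdrs")         # large fact table (placeholder)
towers = spark.read.parquet("/data/tower_dim")  # small dimension (placeholder)

# Broadcast the small dimension so the join avoids shuffling the large side.
enriched = cdrs.join(broadcast(towers), "tower_id")

# Cache only when the result is reused by multiple downstream actions.
enriched.cache()
enriched.count()
```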

Built Looker dashboards for operations and customer insights using data from BigQuery and GCS.

Set up logging and alerting using Stackdriver and integrated with Slack for real-time support.

Configured IAM policies and VPC Service Controls for data governance and cross-project access restriction.

Designed streaming aggregation pipelines to monitor tower health and alert on drop-rate threshold breaches in real time.

Implemented cost-tracking and usage optimization using GCP billing APIs and budget alerts.

Tools & Technologies:

Apache Spark, Flink, Kafka, Airflow, BigQuery, GCS, Dataflow, Composer, Looker, Stackdriver, GCP IAM, Slack, Python, SQL, Java.

ICICI Bank – Hyderabad, India

Data Analyst May 2019 – Dec 2020

Developed 150+ SSIS packages to extract and transform customer, transaction, and loan data from Oracle into SQL Server DW.

Authored optimized T-SQL scripts and stored procedures to support business-critical dashboards and compliance reporting.

Designed relational database schemas and created stored procedures, triggers, and indexes in SQL Server and Oracle to optimize banking operations.

Performed root cause analysis on data quality issues, reducing error rates in AML/KYC processing pipelines by 80%.

Developed normalized and dimensional models (star/snowflake) in SQL Server DW to support credit, lending, and compliance reporting.

Developed and optimized complex SQL queries, stored procedures, and performance tuning in SSMS to support banking operations and reporting needs.

Automated Excel-based reconciliation reports using Python and Excel VBA for audit readiness.
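
A pandas sketch of the reconciliation idea; file names and columns are hypothetical, and the Excel VBA side is not shown:

```python
import pandas as pd

# Hypothetical ledger vs. statement reconciliation inputs.
ledger = pd.read_excel("ledger.xlsx")        # columns: txn_id, amount
statement = pd.read_excel("statement.xlsx")  # columns: txn_id, amount

# Outer join keeps rows that exist on only one side; indicator marks them.
merged = ledger.merge(statement, on="txn_id", how="outer",
                      suffixes=("_ledger", "_stmt"), indicator=True)

# Breaks: rows missing on either side, or matched rows with amount mismatches.
breaks = merged[(merged["_merge"] != "both") |
                (merged["amount_ledger"] != merged["amount_stmt"])]

breaks.to_excel("reconciliation_breaks.xlsx", index=False)
```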

Built dimensional models (star/snowflake) to support lending and credit card portfolio analytics.

Created KPIs and trend dashboards in Power BI for delinquency rates, collections, and default forecasts.

Tuned long-running ETL jobs by introducing batch windows, pre-staging, and parallelization in SSIS.

Conducted monthly data refresh audits, ensuring regulatory accuracy and GDPR alignment for PII.

Led business requirements gathering and translated functional specs into SQL/ETL logic for BI teams.

Documented schema metadata, field mapping, and process flowcharts to support cross-team collaboration.

Tools & Technologies:

SSIS, SSMS, SQL Server, Oracle, Power BI, T-SQL, Python, Excel VBA, Git, JIRA, MS Visio

EDUCATION

Master of Science in Big Data Analytics and IT

University of Central Missouri, Warrensburg, MO

CERTIFICATIONS

Microsoft Certified: Fabric Data Engineer Associate

AWS Certified Data Engineer – Associate

Google Cloud Certified – Professional Data Engineer

SnowPro Core Certification


