Thumma Srikanth
Data Engineer
• ******.*@**************.*** • +1-913-***-**** • LinkedIn • Open to Relocate
SUMMARY
Data Engineer with 5+ years of experience designing and optimizing scalable data pipelines in finance and insurance domains. Skilled in Python
(PySpark, Pandas), SQL, Apache Spark, Kafka, Delta Lake, Databricks, Airflow, dbt, Snowflake, and AWS. Built real-time streaming pipelines processing 1.1M+ daily transactions, enhanced ETL workflows that cut the daily data refresh from 5 hours to 1 hour 50 minutes, integrated ML model datasets that boosted operational efficiency by 2.5x, and optimized reporting pipelines, reducing query runtime from 2 hours to 40 minutes. Experienced in building batch and near-real-time pipelines and integrating LLM datasets (BERT, GPT, LangChain, Transformers, Generative AI) to deliver high-performance, audit-ready data solutions.
SKILLS
Programming & Frameworks: Scala, Java, dbt, Apache Beam, Jupyter, Airflow, SQL, PySpark, Pandas, NumPy, Python
Big Data & Analytics: Databricks, Delta Lake, Hadoop, Apache Spark, Snowflake, Event-driven Analytics
Cloud & Data Platforms: AWS (S3, Redshift, Glue, EMR, Lambda), GCP BigQuery, Azure Data Factory
Data Engineering & Storage: Kafka, ETL Pipelines, PostgreSQL, MySQL, MongoDB, Cassandra, Redis, Data Lake Architecture
Data Modeling & Optimization: Star & Snowflake Schemas, Partitioning, Indexing, Query Performance Tuning, Materialized Views
Pipeline & Workflow Orchestration: Batch & Streaming Processing, DAG Management, CI/CD Automation, Data Lineage, Workflow Scheduling
Security & Compliance: Encryption, Role-based Access Control (RBAC), GDPR, HIPAA, SOC 2
Visualization & Reporting: Tableau, Power BI, Interactive Dashboards
AI & LLM Integration: BERT, GPT, LangChain, Transformers, Generative AI for NLP and predictive analytics
PROFESSIONAL EXPERIENCE
Citigroup | Nov 2023 – Present | USA
Data Engineer
● Developed real-time streaming pipelines with Kafka and Spark Structured Streaming to process over 1.1 million daily transactions and account updates, enabling faster fraud detection, portfolio monitoring, and operational reporting for risk and compliance teams.
● Designed end-to-end ETL pipelines using Python, PySpark, and Apache Spark to aggregate loan origination, credit scoring, and transactional datasets from multiple banking systems, improving daily data refresh from 5 hours to 1 hour 50 minutes.
● Integrated ML model datasets for credit scoring, fraud detection, and transaction anomaly analysis using Python, SQL, and Airflow, enhancing predictive decision-making and improving internal operational efficiency by 2.5x.
● Orchestrated large-scale workflow automation using AWS Glue and S3, ingesting transactional, customer, and account data from multiple APIs and databases, creating audit-ready datasets for finance, compliance, and operations teams.
● Optimized internal reporting and analytics pipelines on Databricks and Delta Lake, implementing partitioning and incremental updates, reducing query runtime for credit risk, portfolio performance, and customer analytics datasets from 2 hours to 40 minutes.
● Built data validation and quality frameworks with Great Expectations and Airflow to monitor loan, credit-approval, and transaction pipelines, ensuring compliance with SLAs across 30+ branches.
● Collaborated with data architects and cross-functional teams to design star and snowflake schemas in Snowflake and PostgreSQL, supporting dashboards for portfolio performance, transaction velocity, and risk exposure matrices.
● Executed historical transaction aggregation and irregularity identification workflows, reducing reconciliation errors from 500+ entries per week to under 20, improving reporting accuracy for finance controllers and auditors.
LTIMindtree | Feb 2020 – Jun 2023 | India
Data Engineer
● Engineered ETL workflows with Apache Airflow and Python to consolidate life, health, and auto insurance datasets, creating dependency matrices to automate data validation and ensure end-to-end traceability across 10+ insurance business units.
● Ingested streaming insurance transaction data with Apache Kafka, building feature-engineered datasets for predictive claim modeling and anomaly detection workflows covering over 120,000 customer risk segments for actuarial teams.
● Constructed real-time policy and claims ingestion pipelines using NiFi and Flink, processing data from multiple insurance systems, cutting claim backlog from 120 pending cases to under 15 daily and enabling near real-time updates for analytics dashboards.
● Built advanced data orchestration frameworks using dbt and Prefect for premium calculation, underwriting, and claims datasets, generating lineage matrices that improved consistency across multiple reporting and actuarial systems.
● Architected and implemented high-performance data lakes with Delta Lake and HDFS, creating partitioned and schema-evolved structures to support risk modeling, fraud detection, and actuarial analytics pipelines for over 1M policy records.
● Automated internal insurance reporting and management dashboards using Python (Pandas, NumPy) and Airflow, defining report matrices that reduced manual processing time and improved audit-readiness for regulatory compliance.
● Partnered with actuarial analysts to design relational and dimensional models in Snowflake and PostgreSQL, creating data matrices for policy lifecycle, claims frequency, and risk exposure, supporting downstream predictive insurance models.
● Enhanced batch and near real-time data pipelines using Spark Structured Streaming and Python, constructing dependency matrices to improve claim processing throughput, reducing queue length from 500 pending claims to under 50 per cycle across insurance channels.
● Established data integrity and verification frameworks using Great Expectations, monitoring incoming policy, claims, and payment transactions, and generating error matrices that ensured compliance with internal SLAs across 50+ insurance branches.
● Formulated ML-ready datasets by integrating structured and unstructured insurance data, including customer claims, underwriting notes, and historical policy information, leveraging Python and Spark to construct model input matrices for claim fraud prediction.
● Partnered with cross-functional teams to maintain metadata, audit logs, and lineage documentation, generating traceability matrices to track transformations and ensure regulatory compliance in insurance reporting workflows.
● Developed Python-based ingestion modules to integrate external actuarial and claims APIs, mapping external data sources to internal insurance datasets and improving data reconciliation efficiency by 75% for underwriting and policy servicing teams.
EDUCATION
Master of Science in Computer and Information Systems Security | University of Central Missouri