
Data Quality Business Intelligence

Location: Dallas, TX
Posted: October 15, 2025

Sai Sruthi T

Email: ***************@*****.*** | Contact No: +1-972-***-****

Results-driven SQL and Business Intelligence Developer with 5+ years of experience delivering cloud-native, big data, AI/ML, and analytics solutions across the banking, healthcare, telecom, and insurance domains.

Skilled in designing ETL/ELT pipelines, SQL optimization, real-time streaming, data transformation, and data governance frameworks, as well as building star/snowflake schemas and optimizing OLAP cubes for analytical dashboards.

Hands-on expertise with Azure, AWS, GCP, Snowflake, and Lakehouse migrations, enabling scalable and secure data platforms. Designed KPI-driven dashboards in Power BI/Tableau for delinquency analysis, telecom usage, and healthcare denial rates, enabling business teams to track performance in real time.

Proficient in Python, R, and Unix/Linux scripting, applying statistical tools, feature engineering, and ML/AI infrastructure for predictive analytics.

Adept in data quality assurance, data quality monitoring, and metadata-driven frameworks supporting compliance (HIPAA, PCI, SOX). Experienced in data warehousing and enterprise reporting, enabling faster business decision-making.

Strong experience in business requirements gathering, agile methodologies, debugging, and analytical skills, ensuring alignment between business goals and data engineering solutions.

Experienced in batch, streaming, and near real-time ingestion from SFTP, REST APIs, FTP, RabbitMQ, and on-prem databases. Hands-on exposure to SAP and ERP system integrations, supporting financial and operational data pipelines.

Hands-on experience building data quality rules, CDC logic, and validation layers in ADF Mapping Data Flows, PySpark, and Apache Storm, supporting real-time alerting for network events, claim anomalies, and delinquency triggers.

Skilled in queue backlog monitoring, shell-based restart automation, schema drift handling, and audit logging using ADF, Hive, Control-M, and Azure Monitor, enabling high-availability pipelines and compliance support.

Experienced in Git-based CI/CD, version control, and environment promotion of ADF JSON pipelines, Bitbucket-based notebooks, and SQL stored procedures using Azure DevOps and Jenkins.

Familiar with parsing HL7, X12 EDI (837/835) healthcare files and telecom CDR/EDR formats using PySpark and Hive for reporting, plan optimization, and fraud detection use cases.

TECHNICAL SKILLS

SQL Development: T-SQL, PL/SQL, CTEs, Stored Procedures, Functions, Views, Indexing, Triggers, Query Optimization.

Cloud & Big Data: Azure (ADF, ADLS Gen2, Synapse, Databricks, Logic Apps), AWS (S3, Glue, EMR, Redshift, Lambda, RDS), GCP (BigQuery, DataProc, Pub/Sub), Snowflake.

Data Platforms: Snowflake, MS SQL Server, Oracle 12c/11g, PostgreSQL, Teradata, DB2, Netezza, NoSQL (MongoDB, DynamoDB, HBase), Lakehouse Migration, Database Architecture.

ETL & Orchestration: Azure Data Factory, Apache Airflow, Control-M, Informatica, Talend, NiFi, Alteryx, Apache Storm, Sqoop, SSIS, Dataiku.

Big Data Frameworks: Apache Spark (PySpark, Spark SQL, MLlib, DataFrame API), Hadoop (HDFS, Hive, Pig, Oozie, MapReduce), Kafka, RabbitMQ, Flume, Zookeeper, ElasticSearch.

Programming & Scripting: Python (pandas, NumPy, scikit-learn, regex, Web Scraping, API Deployment), R, Shell Scripting, Scala, Java, Linux/Unix Scripting.

BI & Visualization: Power BI, Tableau, Google Data Studio, Excel (Pivot Tables, VLOOKUP, Data Models, Statistical Tools), Looker.

DevOps & Version Control: Git, Jenkins, Docker, Kubernetes, Terraform, Azure DevOps, CI/CD pipelines, Debugging Skills.

Specialized Data Handling: HL7, X12 (837/835) EDI healthcare files, CDR/EDR telecom logs, JSON, XML, Parquet, Avro, CSV.

Monitoring & Security: Azure Monitor, Prometheus, Grafana, RBAC, Data Masking, Column-level Security, Data Quality Assurance, Data Quality Monitoring, Data Governance.

AI & ML: Feature Engineering, Machine Learning, Generative AI Services, AI/ML Infrastructure.

Data Warehousing & OLAP: Star/Snowflake schema, Fact/Dimension modeling, Cube creation, Aggregations, Performance Tuning.

Enterprise Systems: SAP ECC, SAP HANA, SAP BW, ERP data extraction, IDoc/Flat File integration, SAP-to-Cloud migration.

PROFESSIONAL EXPERIENCE

Client: Ally Financial, Dallas, TX Feb 2024 – Present

Role: SQL Data Engineer

Responsibilities:

Developed metadata-driven ETL pipelines in Azure Data Factory, ingesting 40K+ daily approved loan application files via SFTP into ADLS Gen2 Bronze/Silver/Gold layers.

Built reusable ADF pipelines using parameterized datasets, Lookup, ForEach, and dynamic metadata control for scalable ingestion from multiple loan origination systems and third-party feeds.

Designed and managed Self-Hosted Integration Runtime (IR) to securely connect with on-prem SQL Server and SFTP endpoints. Migrated legacy SSIS workflows into Azure ADF pipelines, ensuring seamless modernization.

Applied ADF Mapping Data Flows for validation (credit score thresholds, consent flags, nulls/duplicates) and schema drift handling for evolving loan file structures.

Integrated Databricks PySpark notebooks for cleansing, enrichment, and file format conversions (CSV to Parquet), with audit logging in Azure SQL (file name, record counts, runtime, status), enhancing data quality and traceability.

Created CDC logic using approval date and hash-based comparisons to enable incremental refresh and reduce data latency.
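
A minimal PySpark sketch of the hash-based CDC comparison described above; the table and column names (loans_staging, loans_silver, loan_id) are hypothetical placeholders, not the client's actual schema.

```python
# Hedged sketch: hash-based change detection for incremental refresh.
# Assumes the silver table persists row_hash from previous loads.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("loan_cdc_sketch").getOrCreate()

hash_cols = ["loan_id", "customer_id", "approval_date", "loan_amount", "status"]

staged = (spark.table("loans_staging")
          .withColumn("row_hash",
                      F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in hash_cols]), 256)))

current = (spark.table("loans_silver")
           .select("loan_id", F.col("row_hash").alias("existing_hash")))

# New or changed rows: no match on loan_id, or the hash differs from the stored one.
changed = (staged.join(current, on="loan_id", how="left")
           .where(F.col("existing_hash").isNull() | (F.col("row_hash") != F.col("existing_hash"))))

changed.write.mode("append").saveAsTable("loans_silver_updates")
```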

Modeled reporting-ready star schemas in Synapse Dedicated SQL Pools (Customer, Branch, Product dimensions + FactLoan tables), optimized for Power BI dashboards. Applied Azure Key Vault + MFA for secrets and secure credential management.

Tuned Synapse queries with HASH distribution, columnstore indexes, and partitioning on ApprovalDate to support visuals on 100M+ records. Supported Lakehouse migration strategy from on-prem SQL Server to Delta Lake/Snowflake.

Developed stored procedures, UDFs, triggers, and materialized views in SQL Server and Synapse to improve analytical query performance. Collaborated in agile delivery cycles, translating business intelligence requirements into technical specifications. Logged data lineage and metadata for compliance with SOX and regulatory audits.

Configured PolyBase external tables and COPY INTO for high-volume loads from ADLS into Synapse.

Built Power BI dashboards with drill-through capabilities for delinquency buckets (1–15, 16–30, 31+ days late) and credit score trends. Applied feature engineering and ML models in PySpark for delinquency risk prediction.

Integrated Azure Monitor and Logic Apps for real-time alerts (email/Teams) on pipeline failures and SLA breaches.

Automated deployments with CI/CD pipelines in Azure DevOps (ADF JSONs, PySpark notebooks, SQL scripts).

Migrated legacy SSIS packages to Azure Medallion architecture, reducing maintenance overhead by 40%.

Enabled secure data sharing pipelines for collections and risk teams using row-level security and column masking.

Automated historical backfill jobs with PySpark, ADF triggers, and UNIX shell scripts for FTP-based edge cases.

Deployed Dockerized ETL scripts for low-latency ingestion scenarios, orchestrated with Kubernetes jobs.

Used Terraform scripts to provision Synapse pools, ADLS storage accounts, and IR hosts.

Extended pipelines with Snowflake staging tables for federated reporting across business units.

Implemented PySpark MLlib models (logistic regression, random forest) to predict high-risk delinquencies and feed results into Power BI dashboards. Extended pipelines to include data governance checks and quality monitoring dashboards.
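
For illustration, a minimal PySpark MLlib sketch of the kind of delinquency-risk model described above; the feature columns and the loan_features table are assumed placeholders.

```python
# Hedged sketch of a delinquency-risk classifier; schema is illustrative.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("delinquency_risk_sketch").getOrCreate()
df = spark.table("loan_features")  # assumed: numeric features + binary label "is_delinquent"

assembler = VectorAssembler(
    inputCols=["credit_score", "loan_amount", "dti_ratio", "months_on_book"],
    outputCol="features")
rf = RandomForestClassifier(labelCol="is_delinquent", featuresCol="features", numTrees=100)

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, rf]).fit(train)

# Area under ROC on the held-out split; scores can then be written out for Power BI.
auc = BinaryClassificationEvaluator(labelCol="is_delinquent").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")
```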

Environment: Azure Data Factory, Azure Data Lake Gen2, Azure Synapse Analytics, Azure SQL Database, Azure Monitor, Azure Logic Apps, Azure Blob Storage, Azure DevOps, Databricks (PySpark, Notebooks), Apache Airflow, UNIX Shell Scripting, Python, Git, SFTP, SQL Server, Power BI, PolyBase, Parquet, JSON, CSV, Docker (edge cases), Self-hosted Integration Runtime (IR).

Client: Molina Healthcare, Remote Sep 2022 – Jan 2024

Role: SQL Data Engineer

Responsibilities:

Ingested EDI 837/835 claim files via secure SFTP and parsed CPT, ICD-10, service dates, and billing codes using PySpark DataFrames. Built data quality assurance and governance frameworks ensuring HIPAA compliance.

Designed ETL workflows using AWS Glue and S3 for staging claim files before PySpark transformations.

Built metadata-driven ETL frameworks in PySpark to standardize claim ingestion across multiple provider groups and states, enabling faster onboarding. Applied R and Python statistical tools for fraud detection and denial rate analysis.

Integrated CMS and state Medicaid REST APIs using Python clients for real-time eligibility verification, storing JSON responses in MongoDB for audit.

Implemented data quality pipelines (null handling, duplicate removal, regex checks, referential validation) in PySpark.
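
A minimal PySpark sketch of the data quality checks described above (null handling, duplicate removal, regex validation); table and column names are illustrative, and the code-format checks are simplified.

```python
# Hedged sketch of claim data-quality validation; schema is illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims_dq_sketch").getOrCreate()
claims = spark.table("claims_staging")  # assumed staging table

required = ["claim_id", "member_id", "cpt_code", "service_date"]

cleaned = (claims
           .dropna(subset=required)                       # reject rows missing key fields
           .dropDuplicates(["claim_id", "service_date"])  # drop resubmitted duplicates
           .withColumn("cpt_valid", F.col("cpt_code").rlike(r"^\d{5}$"))        # simplified CPT shape check
           .withColumn("icd_valid", F.col("icd10_code").rlike(r"^[A-Z]\d{2}")))  # rough ICD-10 shape check

valid = cleaned.where(F.col("cpt_valid") & F.col("icd_valid")).drop("cpt_valid", "icd_valid")
rejected = cleaned.where(~(F.col("cpt_valid") & F.col("icd_valid")))

valid.write.mode("overwrite").saveAsTable("claims_validated")
rejected.write.mode("overwrite").saveAsTable("claims_exceptions")  # feeds the exception queue
```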

Applied rule-based fraud detection logic (invalid codes, inactive coverage, out-of-network providers) with PySpark UDFs.

Routed failed claims into exception queues in MongoDB, storing errors in schema-flexible JSON for audit and reporting.

Created Airflow DAGs for batch and event-driven workflows, with retries, SLA monitoring, conditional branching, and automated failure notifications, enhancing workflow reliability. Deployed fraud detection microservices on AWS Lambda, enabling event-driven claim validations and reducing processing time.
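
A minimal Airflow 2.x sketch showing the retry, SLA, and branching patterns described above; DAG, task, and callable names are illustrative placeholders.

```python
# Hedged sketch of a claims-ingestion DAG with retries, an SLA, and conditional branching.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator, BranchPythonOperator
from airflow.operators.empty import EmptyOperator

def choose_path(**context):
    # Hypothetical check: branch to reprocessing when the run is triggered with a flag.
    return "reprocess_claims" if (context["dag_run"].conf or {}).get("reprocess") else "load_claims"

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),
    "email_on_failure": True,
    "email": ["data-eng-oncall@example.com"],  # placeholder alert address
}

with DAG(
    dag_id="claims_ingestion_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    branch = BranchPythonOperator(task_id="choose_path", python_callable=choose_path)
    load = PythonOperator(task_id="load_claims", python_callable=lambda: print("load"))
    reprocess = PythonOperator(task_id="reprocess_claims", python_callable=lambda: print("reprocess"))
    done = EmptyOperator(task_id="done", trigger_rule="none_failed_min_one_success")

    branch >> [load, reprocess] >> done
```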

Integrated CloudWatch monitoring with Airflow DAGs for proactive alerts on ETL job failures, which improved response times and minimized downtime.

Deployed validation scripts in Docker containers and orchestrated them with Kubernetes clusters for horizontal scaling during peak provider load. Leveraged AI/ML infrastructure in Kubernetes for scalable fraud detection pipelines.

Implemented HIPAA compliance controls with field-level masking, RBAC, and AES-256 encryption for PHI attributes.

Built Snowflake staging and curated layers for claims reporting and downstream analytics.

Leveraged Redshift clusters for analytical workloads on processed claim datasets.

Delivered Tableau dashboards showing denial rates, eligibility failures, and top rejection patterns across providers.

Partnered with denial management teams to embed business rules into pre-adjudication filters, reducing false rejections.

Logged end-to-end lineage and audit metadata (file arrival, validation stage, API status, final adjudication flag).

Automated ETL testing with PyTest and integrated into Jenkins CI/CD pipelines to detect schema drift early.
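
A minimal PyTest sketch of the schema-drift check described above; the expected column contract and the fixture path are assumptions for illustration.

```python
# Hedged sketch: fail the CI build when an incoming feed no longer matches the agreed schema.
import pytest
from pyspark.sql import SparkSession

EXPECTED_COLUMNS = {
    "claim_id": "string",
    "member_id": "string",
    "cpt_code": "string",
    "billed_amount": "double",
    "service_date": "date",
}

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("schema_drift_tests").getOrCreate()

def test_claims_schema_matches_contract(spark):
    df = spark.read.parquet("tests/fixtures/claims_sample.parquet")  # assumed test fixture
    actual = {f.name: f.dataType.simpleString() for f in df.schema.fields}
    missing = set(EXPECTED_COLUMNS) - set(actual)
    drifted = {c: (EXPECTED_COLUMNS[c], actual[c])
               for c in EXPECTED_COLUMNS if c in actual and actual[c] != EXPECTED_COLUMNS[c]}
    assert not missing, f"Columns missing from feed: {missing}"
    assert not drifted, f"Column types drifted: {drifted}"
```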

Monitored ETL jobs and cluster health using Prometheus + Grafana, exposing Airflow and Spark job metrics.

Used Kafka producers/consumers for streaming eligibility updates and real-time claim adjudication triggers.
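
A minimal sketch of the eligibility-update flow using the kafka-python client; the broker address, topic name, and message shape are illustrative assumptions.

```python
# Hedged sketch: publish and consume eligibility updates that trigger re-adjudication.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["localhost:9092"]  # assumed broker address

# Producer side: publish an eligibility update as JSON.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("eligibility-updates", {"member_id": "M12345", "status": "ACTIVE"})
producer.flush()

# Consumer side: trigger downstream adjudication logic per message.
consumer = KafkaConsumer(
    "eligibility-updates",
    bootstrap_servers=BROKERS,
    group_id="claims-adjudication",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")))

for message in consumer:
    update = message.value
    print(f"Re-evaluating pending claims for member {update['member_id']}")
```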

Processed high-volume claims data in Parquet format, registering Spark SQL tables for downstream reuse.

Implemented machine learning models in PySpark MLlib to flag potentially fraudulent claims (outlier detection).

Collaborated across Claims Ops, Compliance, and Billing teams to ensure HIPAA-secure, audit-ready pipelines.

Configured IAM roles, encryption, and data governance policies on AWS to maintain HIPAA compliance, resulting in enhanced data security and regulatory adherence.

Environment: Python, PySpark, SQL, EDI 837, ICD-10, CPT, Apache Airflow, Apache Spark, MongoDB, PostgreSQL, REST APIs, CMS API, Jenkins, Git, Docker, Kubernetes, Prometheus, Grafana, Tableau, Parquet, SFTP, Linux, CI/CD, HIPAA Compliance, RBAC, Data Masking, Data Partitioning, Exception Handling Frameworks, Log Management.

Client: AT&T, India May 2020 – Jul 2022

Role: ETL Developer

Responsibilities:

Built near real-time data ingestion pipelines with RabbitMQ + Apache Storm, processing high-volume telecom CDR/EDR records from FTP streams with sub-minute SLAs.

Developed Storm topologies (spouts and bolts) for schema validation, deduplication, and enrichment (tower ID to location mapping). Incorporated ElasticSearch for near real-time querying of high-volume CDR/EDR logs.

Modeled telecom star schema (call sessions, SIM plans, tower metadata) in Hive for downstream reporting and analytics.

Designed Hive staging and integration tables, partitioned by eventDate and tower region to improve query performance.
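
A minimal sketch of the partitioned Hive staging layout described above, expressed through Spark SQL in Python; the database, table, and column names are illustrative.

```python
# Hedged sketch: partitioned CDR staging table plus a query that benefits from partition pruning.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdr_staging_sketch").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS telecom.cdr_staging (
        call_id      STRING,
        msisdn       STRING,
        tower_id     STRING,
        duration_sec INT,
        bytes_used   BIGINT
    )
    PARTITIONED BY (event_date DATE, tower_region STRING)
    STORED AS ORC
""")

# Filtering on the partition columns lets the engine skip irrelevant partitions
# instead of scanning the full table.
daily_usage = spark.sql("""
    SELECT tower_region, SUM(bytes_used) AS total_bytes
    FROM telecom.cdr_staging
    WHERE event_date = DATE '2022-01-15' AND tower_region = 'SOUTH'
    GROUP BY tower_region
""")
daily_usage.show()
```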

Logged invalid records and pipeline metrics (row counts, error codes, timestamps) into HBase tables, enabling SQL-style diagnostics through Apache Phoenix.

Automated recovery with shell scripts and cron jobs for topology restarts, backlog cleanup, and HDFS archival of expired usage logs, improving system reliability. Modernized ETL processes with SSIS and Storm orchestration for enhanced telecom usage reporting. Used Flume agents to ingest weblogs and device telemetry data into Hadoop in near real-time.

Scheduled workflows in Control-M, orchestrating Storm jobs, Hive transformations, and Netezza export processes.

Tuned Hive queries with partition pruning, vectorized ORC file format, and Tez execution engine, reducing query time by 35%. Implemented data governance and lineage tracking for audit and compliance.

Built Sqoop imports from Oracle into HDFS for subscriber metadata, enabling joins with usage records.

Implemented Oozie workflows to chain multi-step ETL jobs across Hive, Pig, and Sqoop.

Configured Zookeeper for distributed coordination of Kafka brokers and Storm clusters, ensuring high availability.

Exported cleaned and aggregated usage summaries into Netezza for fraud detection, billing accuracy, and marketing insights.

Integrated with CI/CD pipelines in Jenkins + Bitbucket, automating Storm topology deployments, Hive scripts, and SQL jobs. Built Unix/Linux automation scripts to monitor and restart backlogged pipelines.

Created audit dashboards in Phoenix SQL to monitor failed jobs, backlog size, and retry counts, reducing issue resolution time by 40%. Applied security policies with Kerberos authentication for Hadoop cluster access.

Processed 100M+ CDR/EDR rows daily, supporting downstream fraud teams, network optimization, and churn analysis.

Built shell-based file watchers to detect FTP arrivals and trigger ingestion events in RabbitMQ automatically.
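
The watchers above were shell-based; as a rough Python analogue, the sketch below polls a landing directory and publishes an ingestion event to RabbitMQ with pika. The path, polling interval, and queue name are assumptions.

```python
# Hedged Python analogue of a shell-based FTP file watcher publishing ingestion events.
import json
import time
from pathlib import Path
import pika

LANDING_DIR = Path("/data/ftp/cdr_landing")   # assumed FTP landing directory
QUEUE = "cdr-ingestion-events"                # assumed queue name

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE, durable=True)

seen = set()
while True:
    for path in LANDING_DIR.glob("*.cdr"):
        if path.name not in seen:
            channel.basic_publish(
                exchange="",
                routing_key=QUEUE,
                body=json.dumps({"file": path.name, "size": path.stat().st_size}),
                properties=pika.BasicProperties(delivery_mode=2))  # persistent message
            seen.add(path.name)
    time.sleep(30)  # poll interval; a production watcher would also handle rotation and cleanup
```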

Environment: FTP, RabbitMQ, Apache Storm, HDFS, Hive, HBase, Phoenix, Netezza, Shell Scripting, Cron, Control-M, Jenkins, Bitbucket, SQL

CERTIFICATIONS

Microsoft Azure Data Fundamentals – DP-900

Microsoft Fabric Data Engineer Associate – DP-700

Microsoft Power BI Data Analyst Associate – PL-300

EDUCATION

Master of Science in Computer Science

University of North Texas | CGPA: 3.75/4.0

PROJECTS

Early Depression Detection - Python, MLflow, CNN, Regression, Tableau

Built a machine learning pipeline to identify early signs of depression from 1.6M+ tweets; applied regression and CNN models, tracked experiments with MLflow, and visualized outcomes in Tableau.

Amazon Bookstore Application - Java, HTML, CSS, MongoDB, AWS, Docker, Git

Developed a full-stack e-commerce bookstore with AWS-hosted backend; containerized services using Docker and applied debugging strategies for smooth deployment.


