Krishna Tejaswi Tallada
Data Engineer
************************@*****.*** | 361-***-**** | USA | LinkedIn

Summary
Experienced Data Engineer with 6+ years of expertise in designing and optimizing scalable data pipelines, real-time streaming, and cloud-based data architectures. Proficient in big data technologies, ETL orchestration, data governance, and machine learning support. Skilled in improving data quality, ensuring compliance, and enabling data-driven decision-making across diverse industries using tools like Spark, Kafka, Airflow, and the AWS/GCP/Azure platforms.

Technical Skills
• Programming & Scripting: Python, SQL, Scala, Java, R, Bash, Shell Scripting, Pandas, NumPy, Dask, PySpark
• Big Data & Distributed Systems: Apache Spark (Catalyst optimizer, Tungsten, adaptive query execution, partition pruning, caching), Apache Flink, Hadoop Ecosystem (HDFS, MapReduce, YARN, Hive, HBase), Apache Beam, Presto, Apache Impala, Apache Druid, Spark Streaming, Dask
• Data Architecture & Modeling: Medallion Architecture (Bronze, Silver, Gold), Dimensional Modeling, Data Vault, Star & Snowflake Schema, Event-Driven Architecture, Microservices, Real-Time & Batch Processing, SCD Type 2
• Data Ingestion & Streaming: Apache Kafka, AWS Kinesis, Apache NiFi, Apache Pulsar, Flume, Logstash, Sqoop, Spark Streaming, Google Pub/Sub, Azure Event Hubs, Amazon MSK, Apache Storm, Debezium
• Cloud Platforms & Services: AWS (Lambda, Glue, Redshift, EMR, S3, CloudWatch), Azure (Data Factory, Synapse Analytics, HDInsight), GCP (BigQuery, Dataflow, Pub/Sub, Dataproc), Snowflake, Azure Pipelines, Cloud Build
• Data Warehousing & Storage: Snowflake, Apache Hive, Apache Druid, Delta Lake, Apache Iceberg, HDFS, Amazon Redshift, Teradata, Cloud SQL
• ETL & Workflow Orchestration: Apache Airflow, Prefect, dbt, Talend, Informatica MDM, Sqoop, Luigi, Databricks ETL, Azure Data Factory (ADF)
• Data Governance & Quality: Great Expectations, Apache Atlas, OpenLineage, Automated Data Lineage, Metadata Management, GDPR, HIPAA, CCPA Compliance, RBAC, Encryption, SOC 2, PCI DSS
• NoSQL & Search Databases: MongoDB, Cassandra, Elasticsearch, Redis, DynamoDB
• Machine Learning Support & Tools: Hugging Face, PySpark MLlib, BERT, TensorFlow, Scikit-learn, MLflow, Hugging Face Datasets, LightGBM
• Visualization & BI Tools: Power BI, Tableau, Apache Superset, Looker, Data Studio
• Data Security & Compliance: Encryption, Role-Based Access Control, Data Masking, Network Security

Certifications
Microsoft Certified: Azure Data Engineer Associate
Databricks Certified Data Engineer Associate
Professional Experience
Data Engineer, MetLife 10/2024 – Present Remote, USA
• Built a real-time, AI-driven insurance claim risk assessment and personalization platform using Kafka, Kinesis, and AWS Lambda. Collaborated with senior data engineers and actuarial teams to ensure AI explainability, regulatory compliance, and scalable data contracts.
• Developed real-time streaming pipelines with Apache Flink and Spark Streaming to detect fraudulent insurance claims in near real time, reducing detection latency by 40% and enabling faster responses to high-risk behavior and potential financial losses.
• Built robust ETL pipelines with Airflow and Spark to process policyholder data, claims history, and customer interactions. Applied SQL, Pandas, and NumPy for customer segmentation and risk scoring models using LightGBM, enhancing data quality and decreasing model training time by 30%.
• Designed a low-latency analytics layer on AWS Redshift and Apache Druid to deliver real-time insurance risk dashboards with sub-second query speeds and 99.9% uptime, supporting millions of daily policy evaluations and underwriting decisions.
• Created high-quality training datasets from anonymized insurance claim records and customer data for LightGBM-based claim fraud detection and risk scoring models. Leveraged Dask and Hugging Face Datasets to scale preprocessing, increasing model accuracy by 45% and cutting data preparation time to a third.
• Integrated Great Expectations and OpenLineage into CI/CD pipelines for continuous validation of mission-critical data workflows (see the validation sketch below). Collaborated with senior data engineers to reduce data anomalies by 38%, improve audit readiness, and boost confidence in ML and BI systems across insurance products.
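A minimal sketch of how such a Great Expectations gate can run inside a CI/CD stage, assuming the legacy (pre-1.0) Pandas API; the file path and column names are hypothetical, not taken from the work above:

    # CI gate: validate a claims extract and fail the build on violations.
    # Assumes great_expectations < 1.0 (legacy Pandas API); names are illustrative.
    import sys
    import pandas as pd
    import great_expectations as ge

    def validate_claims(path: str) -> bool:
        df = ge.from_pandas(pd.read_csv(path))  # wrap a DataFrame with expectation methods
        checks = [
            df.expect_column_values_to_not_be_null("claim_id"),
            df.expect_column_values_to_be_unique("claim_id"),
            df.expect_column_values_to_be_between("claim_amount", min_value=0),
        ]
        return all(r.success for r in checks)

    if __name__ == "__main__":
        # A non-zero exit code fails the pipeline stage, blocking bad data downstream.
        sys.exit(0 if validate_claims("claims_extract.csv") else 1)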
Data Engineer, Mercedes-Benz R&D 01/2022 – 08/2023 Bangalore, India
• Designed a centralized sales data lakehouse on GCP with BigQuery, Cloud Storage, and Dataproc, integrating CRM, POS, SAP, and inventory data. Collaborated with sales, analytics, and product teams through requirement gathering, improving data completeness by 40%.
• Built scalable ETL pipelines using Apache Beam (Dataflow), PySpark, Pandas, and Airflow, reducing batch processing runtime by 60%. Developed reusable Python and SQL modules with thorough documentation, boosting developer productivity and pipeline maintainability.
• Engineered real-time streaming pipelines leveraging Apache Kafka, Google Pub/Sub, and Dataflow, cutting data latency by 80%, enabling dynamic KPI dashboards, real-time alerts, and increasing operational responsiveness across dealerships by 35%.
• Implemented Master Data Management (MDM), standardizing vehicle, customer, and dealer entities and improving data integrity by 55%. Used CDC with Debezium and Kafka plus SCD Type 2 in BigQuery to ensure accurate historical sales tracking.
• Ensured data quality using Great Expectations within Airflow, reducing pipeline failures by 45%. Enforced data governance and security best practices, including RBAC and encryption, increasing compliance adherence by 50% and markedly improving data trustworthiness.
• Modeled data with star schemas and SCD Type 2 (see the SCD sketch below), automated SQL transformations using dbt, and set up CI/CD via Cloud Build. Delivered datasets to Looker and Data Studio, increasing report generation speed by 65%, with strong team collaboration and documentation.
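A minimal sketch of the SCD Type 2 pattern referenced above, run against BigQuery from Python; the dataset, table, and column names are illustrative assumptions, not the production schema:

    # SCD Type 2 upkeep in BigQuery driven by a CDC staging table.
    # Step 1 expires current rows whose attributes changed; step 2 inserts
    # a fresh "current" version for every new or changed business key.
    from google.cloud import bigquery

    client = bigquery.Client()

    expire = """
    UPDATE `sales.dim_customer` d
    SET d.is_current = FALSE, d.valid_to = CURRENT_TIMESTAMP()
    WHERE d.is_current AND EXISTS (
      SELECT 1 FROM `staging.customer_cdc` s
      WHERE s.customer_id = d.customer_id AND s.attr_hash != d.attr_hash)
    """

    insert = """
    INSERT INTO `sales.dim_customer`
      (customer_id, name, segment, attr_hash, valid_from, valid_to, is_current)
    SELECT s.customer_id, s.name, s.segment, s.attr_hash,
           CURRENT_TIMESTAMP(), TIMESTAMP '9999-12-31', TRUE
    FROM `staging.customer_cdc` s
    LEFT JOIN `sales.dim_customer` d
      ON d.customer_id = s.customer_id AND d.is_current
    WHERE d.customer_id IS NULL OR s.attr_hash != d.attr_hash
    """

    for stmt in (expire, insert):
        client.query(stmt).result()  # block until each DML job completes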
Data Engineer, Health Catalyst 07/2018 – 12/2021 Hyderabad, India
• Designed and deployed scalable data pipelines using Talend, Apache NiFi, and Informatica MDM to integrate diverse enterprise data, improving data completeness by 45%. Collaborated closely with cross-functional teams and participated in requirement-gathering sessions.
• Developed robust ingestion workflows with ADF and Sqoop, moving high-volume data into HDFS and Hive environments. Reduced ingestion errors by 50%, accelerated data availability, and maintained detailed pipeline documentation for operational support.
• Created optimized PySpark ETL jobs on Azure Databricks to transform and aggregate transactional and IoT data, reducing processing time by 70%. Produced technical documentation to ensure maintainability and facilitate knowledge transfer.
• Implemented Master Data Management (MDM) processes to standardize customer, product, and vendor identifiers across systems, increasing data integrity by 60%, and documented MDM workflows for audit and compliance purposes.
• Built real-time data streaming pipelines using Apache Kafka and Spark Streaming, cutting data latency by 80% and providing timely operational insights. Documented the streaming architecture and best practices for team reference (see the streaming sketch below).
• Prepared multi-source enterprise data for predictive analytics and machine learning models, boosting model accuracy by 30%, supporting initiatives like churn prediction and demand forecasting, with clear documentation of data preparation steps.
• Enhanced data quality and governance by integrating Great Expectations for validation and Apache Atlas for lineage tracking. Modeled data in Snowflake, reducing report generation time by 65%, while ensuring thorough documentation for compliance.
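A minimal sketch of the Kafka-to-Spark Structured Streaming pattern described above; the broker address, topic name, event schema, and sink paths are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath:

    # Read events from Kafka, parse JSON payloads, and land them with
    # checkpointing so the stream recovers exactly where it left off.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("ops-stream").getOrCreate()

    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_ts", TimestampType()),
    ])

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker address
        .option("subscribe", "transactions")               # assumed topic name
        .load()
        .select(from_json(col("value").cast("string"), event_schema).alias("e"))
        .select("e.*")
    )

    query = (
        events.writeStream.format("parquet")
        .option("path", "/data/bronze/transactions")
        .option("checkpointLocation", "/chk/transactions")
        .start()
    )
    query.awaitTermination()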
Education
Texas A&M University–Corpus Christi, Corpus Christi, TX, USA
Master of Science (MS), Computer Science, 09/2023 – 05/2025

Projects
Telematics Drive Data Analytics Platform - Power BI, Apache Superset, GCP, Python, Report Server, Cloud SQL
• Developed dynamic dashboards and automated GCP data pipelines for telematics data ingestion, transformation, and storage, enhancing visibility into vehicle system failures and supporting detailed drive cycle analysis.
• Migrated analytics infrastructure from on-premises to GCP using VMs, Cloud Storage, and Cloud SQL, integrating Python scripts to detect performance trends and support platform-wide deployment for engineering and product teams (see the ingestion sketch below).
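A minimal sketch of one telematics ingestion step under stated assumptions: pull a drive-cycle export from Cloud Storage and append it to a Cloud SQL (Postgres) table. The bucket, file, columns, and connection string are all hypothetical:

    import pandas as pd
    import sqlalchemy
    from google.cloud import storage  # blob.open() requires google-cloud-storage >= 1.38

    def ingest(bucket_name: str, blob_name: str, db_url: str) -> None:
        # Download the raw telematics export from GCS as a file-like object.
        blob = storage.Client().bucket(bucket_name).blob(blob_name)
        df = pd.read_csv(blob.open("rt"))

        # Light transformation: flag rows carrying a diagnostic trouble code
        # for downstream drive-cycle and failure analysis.
        df["fault_flag"] = df["dtc_code"].notna()

        # Append into Cloud SQL; dashboards read from this table.
        engine = sqlalchemy.create_engine(db_url)
        df.to_sql("drive_cycles", engine, if_exists="append", index=False)

    # Example call (connection details are placeholders):
    # ingest("telematics-raw", "2024/05/drive01.csv",
    #        "postgresql+psycopg2://user:pass@10.0.0.5:5432/telematics")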
Sales Performance Insights (PepsiCo) - Power BI, Tableau, DAX, Teradata, Azure Pipelines, Power Automate
• Designed and deployed hierarchical sales dashboards by extracting and transforming Teradata data, highlighting underperforming store zones and enabling actionable GTM strategy improvements.
• Integrated complex DAX logic with real-time alerting via Power Automate and ensured dashboard efficiency with incremental refresh and optimized queries, deployed seamlessly using Azure Pipelines.

Leadership
Employee of the Year – Rec Sports, Texas A&M University
Facility Supervisor – Rec Sports, Texas A&M University (2024–2025)
Facility Staff – Rec Sports, Texas A&M University (2023–2024)
Received 'One Cognizant Award' for outstanding contributions to ETL workflow optimization and cross-team collaboration