Post Job Free

Data Engineer Real-Time

Location:
Houston, TX
Posted:
July 23, 2025

Resume:

YAKKALI PAVAN

DATA ENGINEER

******************@*****.*** +1-832-***-**** Houston, Texas

PROFESSIONAL SUMMARY

Results-driven Data Engineer with 5 years of experience designing, building, and optimizing scalable data pipelines, ETL processes, and cloud-based architectures. Proficient in SQL, Python, Spark, and modern data warehousing solutions, with a strong background in big data processing and real-time analytics.

WORK HISTORY

AWS Data Engineer

Mohawk Industries – Dallas, TX 07/2024 - Current

• Improved data integrity by 40% across healthcare and retail pipelines by applying rigorous data validation, telemetry monitoring, and business logic using Python, SQL, and Snowflake.

• Developed scalable real-time and batch ETL/ELT pipelines with dbt, SnowSQL, and AWS Glue, enabling self- service analytics for product, finance, and sales teams.

• Integrated Kafka and Kinesis for distributed streaming, enabling fault-tolerant ingestion across AWS and GCP and reducing downtime by 35%.

• Optimized large data sets and processing with Spark on Snowflake, HDFS, and ClickHouse, improving reporting performance by 50%; visualized insights with Power BI and Tableau for UX-focused dashboards.

• Automated multi-environment CI/CD deployment with Terraform, GitHub, and Jenkins; reduced release cycle time by 60% and enhanced release management transparency with JIRA and Confluence.

• Supported application design and backend microservices development using Python and Node.js APIs to accelerate downstream data access for ML and reporting.

• Contributed to cloud-native security policies and RBAC controls, enforcing DevOps and data governance best practices.
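By way of illustration, a row-level validation check of the kind described in this role could be sketched in plain Python before records are loaded into a warehouse such as Snowflake. All field names and rules below are hypothetical examples, not details taken from this résumé:

```python
# Minimal sketch of a row-level data-validation step of the kind used
# before loading records into a warehouse. Field names ("id", "amount",
# "event_ts") and the rules are illustrative assumptions only.

def validate_record(record, required_fields=("id", "amount", "event_ts")):
    """Return a list of validation errors for a single record."""
    errors = []
    for field in required_fields:
        if record.get(field) in (None, ""):
            errors.append(f"missing field: {field}")
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        errors.append("amount must be non-negative")
    return errors

def split_valid_invalid(records):
    """Partition records into (valid, invalid) for downstream loading."""
    valid, invalid = [], []
    for rec in records:
        (invalid if validate_record(rec) else valid).append(rec)
    return valid, invalid
```

In a real pipeline the invalid partition would typically be routed to a quarantine table for telemetry and root-cause analysis rather than discarded.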

• Environment: AWS (Glue, Lambda, Kinesis, S3, Redshift, EMR, CloudWatch), Azure (Azure Data Factory, Azure Synapse, Azure Data Lake, Azure Databricks), Snowflake, Apache Spark, Kafka, Airflow, APIs, NLP, CI/CD, Distributed Computing, Information Systems, dbt, Data Architecture, Domain Knowledge, Data Security, Orchestration, Networking, Debugging, Agile development, Version Control, Fivetran.

GCP Data Engineer

Cholamandalam General Insurance – Chennai, India 05/2022 - 07/2023

• Developed scalable ETL pipelines using Apache Beam and Cloud Dataflow, reducing manual effort by 50% and boosting reliability for claims analytics.

• Designed BigQuery and Tableau dashboards, enabling executive stakeholders to make faster data-driven decisions.

• Applied design thinking to schema governance, automating validation via PySpark and improving schema consistency across platforms by 30%.

• Led cross-functional collaboration to improve data onboarding, enhancing SLA adherence and reducing onboarding latency by 25%.

• Established cloud computing best practices using GCP-native tools, improving data cataloging, security visibility, and compliance tracking.

• Environment: GCP (Cloud Storage, Cloud Functions, Data Fusion, Dataflow, BigQuery, Dataproc, DAGs, Cloud Composer, Pub/Sub), Apache Beam, Kafka, Cluster, Data Factory, EC2, EMR, ETL, Hive, Python, S3, Scala, Tableau, Airflow, Databricks.
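The schema-governance validation described above can be sketched, reduced to plain Python for illustration (the expected schema and type names here are made-up examples, not the actual claims schema):

```python
# Hypothetical sketch of a schema-consistency check like the automated
# PySpark validation described above. The expected schema below is an
# illustrative assumption, not taken from the résumé.

EXPECTED_SCHEMA = {"claim_id": "string", "claim_amount": "double", "filed_on": "date"}

def schema_drift(actual_schema, expected_schema=EXPECTED_SCHEMA):
    """Report columns that are missing, unexpected, or have a changed type."""
    missing = sorted(set(expected_schema) - set(actual_schema))
    unexpected = sorted(set(actual_schema) - set(expected_schema))
    type_mismatch = sorted(
        col for col in set(actual_schema) & set(expected_schema)
        if actual_schema[col] != expected_schema[col]
    )
    return {"missing": missing, "unexpected": unexpected, "type_mismatch": type_mismatch}
```

Running such a check on every incoming table, and failing the pipeline on non-empty drift, is one way the 30% schema-consistency improvement above could be enforced.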

GCP Data Engineer

Carat Lane – Chennai, India 01/2022 - 04/2022

• Reduced manual data operations by 75% by automating ingestion and transformation using Databricks, PySpark, and Airflow, while ensuring SLA-compliant delivery across sources.

• Designed and deployed cloud-native pipelines on Azure and GCP, enabling dynamic autoscaling and reliable performance even during high-traffic periods.

• Implemented data governance and masking policies to protect sensitive customer information and ensure compliance with HIPAA, PII, and internal data security standards.

• Standardized schema designs and scheduling patterns across pipelines, leading to improved maintainability and lower failure rates in production.
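A masking policy of the kind described above (protecting PII for HIPAA-style compliance) might look like the following sketch; the sensitive field names and the hashing scheme are hypothetical examples only:

```python
# Illustrative sketch of a column-masking policy for sensitive customer
# data. The SENSITIVE_FIELDS set and the token format are assumptions
# made for this example, not details from the résumé.
import hashlib

SENSITIVE_FIELDS = {"email", "phone", "ssn"}

def mask_value(value):
    """Replace a sensitive value with a stable, non-reversible token."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return f"masked:{digest[:12]}"

def mask_record(record, sensitive_fields=SENSITIVE_FIELDS):
    """Return a copy of the record with sensitive fields masked."""
    return {
        key: mask_value(val) if key in sensitive_fields else val
        for key, val in record.items()
    }
```

Using a deterministic hash (rather than a random token) keeps masked values joinable across tables while still hiding the raw PII.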

• Environment: Airflow, Apache, API, Azure, BigQuery, Data Factory, GCP, HBase, HDFS, Data Lake, Microsoft, PySpark, Python, SDK, Spark, SQL, Sqoop, VPC.

Data Engineer

Amrutanjan Healthcare – Chennai, India 03/2020 - 12/2021

• Designed and delivered high-volume ETL pipelines to support analytics and ML use cases, processing terabytes of structured and semi-structured data using Redshift, PySpark, and Azure Data Factory; accelerated sales reporting by 60%.

• Improved infrastructure scalability by 20% by containerizing Spark-based pipelines with Docker and Kubernetes, enabling modular development and more flexible CI/CD workflows.

• Developed complete BI solutions, from data modeling and transformation to visualization, by building reporting layers in SQL and dashboards in Tableau, empowering product, marketing, and leadership teams with real-time insights.

• Led efforts to integrate a secure, cloud-native data warehouse, combining Azure Synapse, Kafka, and Databricks into a unified analytics platform to support batch and streaming workloads.
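The extract-transform-load pattern behind pipelines like these can be sketched in plain Python; the source rows, derived fields, and in-memory "warehouse" below are hypothetical stand-ins for the Redshift and Data Factory components named above:

```python
# Minimal extract-transform-load sketch mirroring the pipeline pattern
# described in this role. All field names ("sku", "units", "unit_price")
# are illustrative assumptions.

def extract(rows):
    """Extract: yield raw input rows (here, from an in-memory source)."""
    yield from rows

def transform(rows):
    """Transform: normalize keys and compute a derived revenue field."""
    for row in rows:
        yield {
            "sku": str(row["sku"]).upper(),
            "revenue": row["units"] * row["unit_price"],
        }

def load(rows, warehouse):
    """Load: append transformed rows to the target table."""
    warehouse.extend(rows)
    return warehouse

def run_pipeline(source_rows):
    warehouse = []
    return load(transform(extract(source_rows)), warehouse)
```

Keeping each stage as a generator, as here, is what lets the same pipeline shape scale from in-memory tests to Spark jobs processing terabytes.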

• Environment: API, Azure Databricks, Cassandra, CI/CD, Data Factory, Docker, EC2, EMR, ETL, Hive, Java, Jenkins, Kafka, Kubernetes, PySpark, Python, Redshift, S3, SAS, Spark, Spark SQL, SQL, Tableau, dbt.

TECHNICAL SKILLS:

• Cloud Technologies: AWS (S3, Glue, Redshift, EMR, Lambda, Kinesis, Athena, EC2, CloudWatch), Azure (Data Lake, ADF, Synapse, Databricks), GCP (BigQuery, Dataflow, Dataproc, Pub/Sub).

• ETL & Orchestration: Apache Airflow, Dagster, Talend, Informatica, AWS Glue, dbt, Apache NiFi, Azure Data Factory (ADF), Databricks, Fivetran.

• Big data technologies: Hadoop (HDFS, MapReduce, Hive), Spark (PySpark, Spark SQL), Kafka, Cloudera, Apache Flink.

• Databases: MySQL, SQL, SSIS, NoSQL, PostgreSQL, MongoDB, DynamoDB, Redis, Cassandra, HBase.

• Data warehousing and modelling: Snowflake, SnowSQL, Data Modelling, Data Infrastructure, Dimensional Modelling, Data Governance, BigQuery, Redshift, Azure Synapse, Data Security, Security Policies.

• Programming & scripting: Python (Object-Oriented Programming), Java, Scala, SQL, Bash, Shell scripting, JSON, JDBC, Hibernate, R.

• Gen AI: RAG, MCP, LLMs (Large Language Models), AI/ML (Artificial Intelligence).

• DevOps & containerization: Git, GitLab, Jenkins, Terraform, Ansible, Maven, CI/CD, Docker, Kubernetes.

• Machine learning integration: AWS SageMaker, Jupyter, Azure Machine Learning, Databricks notebooks, TensorFlow, scikit-learn.

• Security & compliance: PII, RBAC, MFA, DLP, HIPAA, Encryption, GDPR, Apache Atlas.

• Visualization & BI tools: Tableau, Power BI, Looker, Qlik.

• Operating systems & developer tools: Windows, Linux, Unix, Shell scripting, Eclipse, Dreamweaver, SQL, Azure Data Studio.

• Data engineering skills: Data processing, Orchestration, Data monitoring, Metadata management, Data structures, Data quality, Data Analytics, Data Automation, Data Mapping, Root cause Analysis, Business Automation, Advanced Analytics.

• Data analysis & optimization: Query performance tuning, ETL pipeline optimization, Data manipulation, Improving Efficiency, Product Quality.

• Problem-solving skills: Data-driven decision making, Process optimization, Stakeholder collaboration.

• Soft skills: Analytical thinking, Problem-solving, Troubleshooting, Collaboration, Communication skills, Leadership, Mentorship, Adaptability, Creativity, Strong critical thinking, Accuracy, Reliability, accountability, Confident, Customer service, Work independently, Detail Attention, Customer Satisfaction.

• Collaboration & agile tools: Jira, Confluence, Microsoft Teams, Slack.

CERTIFICATIONS:

• ITIL Foundation Certificate in IT Service Management

• GCP Cloud Data Engineer

• AWS Associate Data Engineer

EDUCATION:

Master of Science in Engineering Data Science, University of Houston, Texas.

LinkedIn: linkedin.com/in/pavan-yakkali-67a1b526b


