GOWTHAM YANGALASETTY
570-***-**** Plano, TX *******************@*****.*** LinkedIn
SUMMARY
Experienced Data Engineer with a proven track record of architecting scalable, low-latency data pipelines for operational intelligence and real-time analytics. Specialized in streaming data processing with Google Cloud Pub/Sub, Apache Spark Structured Streaming, and Redis to enable timely insights and predictive modeling. Proficient in orchestrating robust batch and streaming workflows with Apache Airflow and in managing analytical data lakes on Google Cloud Storage (GCS) and BigQuery. Integrated Cloud Monitoring, Cloud Logging, Grafana, and Great Expectations for proactive pipeline monitoring and data-quality enforcement. Skilled in GCP Dataproc, Spark SQL, and Scala.
TECHNICAL SKILLS
Programming Languages: Python, PySpark, Java, Scala, SQL.
Data Integration & Modeling: ETL/ELT Pipelines, Data Modeling, ER Modeling, Data Warehousing, Data Pipelines, Data Lakehouse Architecture, Data Governance, Spark Streaming.
Data Engineering & Big Data Frameworks: Apache Spark, dbt, Hadoop, Hive, Kafka, Airflow, Flink, Druid, Beam, Singa, HDFS, Spark SQL.
ML & AI: PyTorch, TensorFlow, Scikit-learn, XGBoost, Hugging Face Transformers, LangChain, OpenAI API, Pinecone, CrewAI (or similar) for orchestrating autonomous agents.
Cloud Technologies: GCP (Pub/Sub, Cloud Dataflow, Cloud Functions, GCS, Vertex AI Endpoints, Cloud Run, Cloud Monitoring, Cloud Logging, IAM, DLP).
Analytics and BI Tools: Power BI, Tableau, MS Excel, Grafana, Streamlit, EDA, Time Series Analysis, A/B Testing, Forecasting, Predictive Modeling, Statistical Modeling.
PROFESSIONAL EXPERIENCE
Data Engineer, Honeywell, US May 2024 – Present
• Developed real-time data processing pipelines that ingest and analyze large-scale user interaction logs with Apache Kafka and Spark Structured Streaming on GCP Dataproc, enabling event tracking and durable storage in BigQuery and Hive for auditing and analytics (Kafka sketch after this role's bullets).
• Designed low-latency, context-aware data access layers using Redis as a cache to accelerate dynamic user-profile retrieval, cutting response time for downstream analytics services by over 60% (Redis sketch after this role's bullets).
• Built and orchestrated hybrid batch/streaming workflows with Apache Airflow on Cloud Composer, using dynamic DAGs for parameterized pipelines to automate GCS-to-HDFS synchronization, incremental Hive partitioning, metadata propagation, and scheduled archival in multi-tenant environments (Airflow sketch after this role's bullets).
• Deployed and monitored ML models in production using Docker containers and Kubernetes pods integrated with AWS SageMaker endpoints and GCP Vertex AI for hybrid cloud experimentation.
• Used Terraform for provisioning AWS infrastructure (S3, Lambda, IAM) and Google Cloud components (BigQuery, Pub/Sub) to automate infrastructure management and cross-cloud data pipelines.
• Contributed to dbt transformations on BigQuery, building modular SQL models with automated tests and lineage documentation, improving analytics data quality and governance for BI consumption.
• Played a key role in a cross-cloud POC that integrated Kafka and Spark Structured Streaming on GCP Dataproc with downstream storage in BigQuery and Hive, validating sub-second event tracking and durable replay for audit and analytics.
• Managed CI/CD workflows for Spark jobs and SQL-based logic using GitHub and Cloud Build, modularizing pipeline components for reproducible deployments, version tracking, and environment-specific configuration management.
• Transformed nested JSON application-telemetry logs into flattened, denormalized tables using Spark (Scala) on GCP Dataproc (flattening sketch below).
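The sketches below illustrate the patterns behind this role's bullets; broker, topic, table, bucket, and schema names are placeholders, not Honeywell values. First, the Kafka-to-BigQuery path as a minimal PySpark Structured Streaming job, assuming the Kafka source and spark-bigquery connector packages are on the classpath:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("interaction-log-ingest").getOrCreate()

# Illustrative event schema; the real logs carried many more fields.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "user-interactions")          # placeholder topic
       .load())

# Kafka values arrive as bytes; cast and parse into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", event_schema).alias("e"))
          .select("e.*"))

# The spark-bigquery connector loads each micro-batch into BigQuery,
# staging through the temporary GCS bucket.
query = (events.writeStream
         .format("bigquery")
         .option("table", "analytics.user_events")        # placeholder table
         .option("temporaryGcsBucket", "staging-bucket")  # placeholder bucket
         .option("checkpointLocation", "gs://staging-bucket/chk/user_events")
         .outputMode("append")
         .start())
query.awaitTermination()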
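A minimal cache-aside sketch for the Redis profile layer, assuming redis-py; the key scheme, TTL, and loader callback are illustrative:

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
PROFILE_TTL_SECONDS = 300  # assumed freshness window

def get_profile(user_id: str, load_from_warehouse) -> dict:
    """Serve a user profile from Redis when warm, else fall back and cache."""
    key = f"profile:{user_id}"               # hypothetical key scheme
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit: skip the slow path
    profile = load_from_warehouse(user_id)   # caller-supplied slow lookup
    r.setex(key, PROFILE_TTL_SECONDS, json.dumps(profile))
    return profile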
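A condensed sketch of the per-tenant dynamic DAG pattern, assuming Airflow 2.4+ (for the schedule argument) and the distcp and hive CLIs on the workers:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

TENANTS = ["tenant_a", "tenant_b"]  # assumed tenant registry

with DAG(
    dag_id="gcs_to_hive_sync",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    for tenant in TENANTS:
        # Copy the run date's landing files from GCS into the tenant's HDFS zone.
        sync = BashOperator(
            task_id=f"sync_{tenant}",
            bash_command=(
                f"hadoop distcp gs://landing/{tenant}/{{{{ ds }}}}/ "
                f"hdfs:///warehouse/{tenant}/dt={{{{ ds }}}}/"
            ),
        )
        # Register the new partition so Hive sees the fresh data.
        add_partition = BashOperator(
            task_id=f"add_partition_{tenant}",
            bash_command=(
                f"hive -e \"ALTER TABLE {tenant}.events "
                f"ADD IF NOT EXISTS PARTITION (dt='{{{{ ds }}}}')\""
            ),
        )
        sync >> add_partition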
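The production flattening job was written in Scala; this is a minimal PySpark equivalent of the same explode-and-select step, with an invented telemetry schema:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flatten-telemetry").getOrCreate()

logs = spark.read.json("gs://telemetry/raw/")  # placeholder path

flat = (logs
        .withColumn("event", F.explode("events"))  # unnest the event array
        .select(
            "device_id",
            F.col("event.name").alias("event_name"),
            F.col("event.ts").alias("event_ts"),
            F.col("context.app.version").alias("app_version"),  # struct field path
        ))

flat.write.mode("overwrite").parquet("gs://telemetry/flattened/")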
Data Engineer, LTI Mindtree, India Jul 2020 – Aug 2023
• Optimized Spark-based data transformation workflows using PySpark and Sparklens, reducing job runtimes by 30% and improving distributed execution efficiency across Hadoop clusters.
• Automated secure ingestion pipelines over FTP/SFTP into HDFS and orchestrated ETL jobs via Apache Airflow and Hive, ensuring timely, consistent availability of structured data for critical downstream analytics and reporting (SFTP sketch after this role's bullets).
• Built and maintained data-matching logic in PySpark, improving entity-resolution accuracy by 25% across healthcare claim datasets (matching sketch after this role's bullets).
• Improved query performance for reporting and pipeline workloads through Spark SQL optimizations and static and dynamic Hive partitioning for better partition pruning and scan efficiency (partitioning sketch after this role's bullets).
• Performed comprehensive data validation using PySpark transformations and unit tests embedded in Apache Airflow DAGs, catching anomalies early in the pipeline and improving data quality and reliability by 40% (validation sketch after this role's bullets).
• Collaborated with product and analytics teams to build interactive Power BI dashboards, implemented row-level security using role-based filters, and led UAT to verify data accuracy and compliance with business rules.
• Used Git and GitHub for code versioning and collaboration, implemented branching strategies for team workflows, and integrated Spark jobs with basic CI/CD pipelines to support consistent development and deployment.
• Supported enterprise data governance initiatives by applying encryption best practices and implementing fine-grained access-control policies on HDFS and Hive-managed datasets, ensuring alignment with organizational security and compliance standards.
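The sketches below illustrate the patterns behind this role's bullets; hosts, credentials, paths, column names, and thresholds are all placeholders. The SFTP-to-HDFS pull, assuming paramiko and the hdfs CLI on the worker:

import subprocess
import paramiko

transport = paramiko.Transport(("sftp.partner.example", 22))  # placeholder host
transport.connect(username="ingest", password="***")          # secrets via a vault in practice
sftp = paramiko.SFTPClient.from_transport(transport)

local_path = "/tmp/claims_feed.csv"
sftp.get("/outbound/claims_feed.csv", local_path)  # pull the partner drop
sftp.close()
transport.close()

# Land the file in HDFS for the downstream Hive/Airflow jobs.
subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, "/landing/claims/"], check=True)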
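The matching logic, reduced to its core shape: a blocking join on a cheap key followed by a Levenshtein threshold:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claim-matching").getOrCreate()

claims = spark.read.parquet("hdfs:///claims/incoming")     # placeholder paths
members = spark.read.parquet("hdfs:///reference/members")

# Blocking on date of birth bounds the pairwise comparisons.
candidates = claims.alias("c").join(
    members.alias("m"),
    F.col("c.dob") == F.col("m.dob"),
)

# Keep pairs whose names differ by at most two edits.
matches = candidates.where(
    F.levenshtein(F.col("c.full_name"), F.col("m.full_name")) <= 2
).select("c.claim_id", "m.member_id")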
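Dynamic Hive partitioning driven from Spark SQL, assuming a Hive-enabled SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Allow Hive to derive partition values from the data itself.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

# Rows are routed to partitions by the trailing claim_date column.
spark.sql("""
    INSERT OVERWRITE TABLE reporting.claims_by_day
    PARTITION (claim_date)
    SELECT claim_id, payer, amount, claim_date
    FROM staging.claims
""")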
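And the shape of a validation gate embedded in an Airflow DAG (again assuming Airflow 2.4+); the checks shown are illustrative:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def validate_partition(ds, **_):
    # Import inside the task so the DAG file parses without a Spark runtime.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet(f"hdfs:///warehouse/claims/dt={ds}")  # placeholder path
    null_ids = df.filter(df.claim_id.isNull()).count()
    if df.count() == 0 or null_ids:
        # Raising fails the task and halts everything downstream of the gate.
        raise ValueError(f"validation failed for dt={ds}: {null_ids} null claim ids")

with DAG(dag_id="claims_quality_gate", start_date=datetime(2021, 1, 1),
         schedule="@daily", catchup=False) as dag:
    PythonOperator(task_id="validate_claims", python_callable=validate_partition)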
EDUCATION
Master's in Information Technology Aug 2023 – May 2025 Saint Francis College, Brooklyn, NY
Bachelor's in Mechanical Engineering Jul 2017 – Jun 2021 Sree Vidhyanikethan Engineering College, Tirupathi, India