HARSHA NARASIMHA MURTHY
Location: Dallas, TX | Phone: +1-469-***-**** | Email: ***************@*******.*** | LinkedIn
Professional Summary
Data Engineer with 4+ years of hands-on experience designing and maintaining scalable ETL/ELT pipelines, data lakes, and cloud-based data architectures across industries including telecom, energy, and public transit.
Proficient in Python, SQL, Apache Airflow, Power BI, MySQL, Snowflake, and AWS/Azure ecosystems, with deep expertise in data modeling, pipeline orchestration, and business intelligence solutions.
Built batch and streaming data workflows with AWS Glue, Databricks, and Kafka, enabling near real-time insights and cutting reporting latency by up to 80%.
Strong background in data validation and testing using tools like Great Expectations, ensuring data integrity, schema consistency, and production readiness across ingestion zones.
Delivered multiple end-to-end analytics projects, including transit reliability dashboards and GitHub activity lakehouses, leveraging modern data engineering practices and cloud-first design.
Excellent communicator and collaborator with experience working cross-functionally with DevOps, QA, and domain experts, aligning data delivery with strategic business needs.
Experience
UBER, TX
Data Engineer Jan 2025 – Apr 2025
Built ELT pipelines using Python and SQL to process structured and semi-structured data from internal APIs and flat files into Snowflake.
Designed schema models in Snowflake and implemented dbt transformations for dimensional tables supporting business metrics.
Developed batch ingestion workflows in AWS Glue and automated scheduling using Airflow DAGs for daily data loads (see the sketch below).
Used Apache Spark on Databricks to clean and transform raw datasets from S3, improving query efficiency for reporting layers.
Configured data validation logic using Great Expectations to enforce schema consistency across ingestion zones.
Monitored data pipelines using CloudWatch and implemented retry logic with Lambda functions to handle ingestion failures.
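The Airflow scheduling, Great Expectations checks, and retry handling described in this role might look roughly like the following. This is a minimal sketch, assuming Airflow 2.x with the Amazon provider and the classic (0.x-style) Great Expectations API; the DAG ID, Glue job, staging path, and column names are all hypothetical.

```python
# Minimal sketch: daily Glue batch load, validated with Great Expectations,
# with task retries to absorb transient ingestion failures.
from datetime import datetime, timedelta

import great_expectations as ge
import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator


def validate_extract():
    # Hypothetical staging extract; enforce the schema contract before loading.
    df = ge.from_pandas(pd.read_parquet("/tmp/staging/daily_extract.parquet"))
    df.expect_column_values_to_not_be_null("record_id")
    df.expect_column_values_to_be_of_type("amount", "float64")
    if not df.validate().success:
        raise ValueError("Schema validation failed; aborting the load")


with DAG(
    dag_id="daily_batch_load",  # hypothetical
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # Retry failed tasks instead of failing the whole daily load outright.
    default_args={"retries": 3, "retry_delay": timedelta(minutes=10)},
) as dag:
    run_glue = GlueJobOperator(task_id="run_glue_batch", job_name="daily_ingest_job")
    validate = PythonOperator(task_id="validate_extract", python_callable=validate_extract)
    run_glue >> validate
```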
Infosys, India
Data Engineer Dec 2021 – Jul 2023
Built scalable ETL pipelines using Python, SQL, and Azure Data Factory to ingest call detail records (CDRs), billing logs, and CRM feeds into Azure Synapse for downstream analytics.
Developed and deployed PySpark jobs in Databricks to standardize high-volume usage data from multiple vendor systems, improving consistency across KPIs.
Designed star and snowflake schemas in Snowflake to support high-concurrency reporting on customer churn, usage trends, and billing anomalies.
Integrated real-time Kafka streams into Azure Data Lake Gen2, implementing checkpointing and watermarking to manage out-of-order events and late-arriving data (see the sketch below).
Implemented Delta Lake architecture with Bronze, Silver, and Gold layers to enforce data lineage and enable reprocessing of failed batches.
Built reusable CI/CD pipelines using Azure DevOps, automating deployment of notebooks, Spark jobs, and infrastructure templates.
Authored modular dbt models for customer segmentation, recharge behavior, and tariff plan mapping, enabling analysts to self-serve insights.
Created operational dashboards in Power BI to monitor pipeline health, row-level latency, and file arrival metrics, reducing downtime across critical loads.
Tuned Spark job configurations and optimized partitioning logic to reduce data transformation latency by over 30%.
Worked cross-functionally with DevOps, QA, and domain SMEs to align data delivery with business requirements and release cycles.
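The checkpointing and watermarking pattern described above might look roughly like this in Spark Structured Streaming. A minimal sketch, assuming a Databricks runtime with Delta Lake; the broker address, topic, payload schema, and lake paths are all hypothetical.

```python
# Minimal sketch: Kafka -> Bronze Delta stream with a watermark for
# out-of-order events and a checkpoint for exactly-once recovery.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("cdr-bronze-stream").getOrCreate()

# Hypothetical CDR payload schema.
cdr_schema = StructType([
    StructField("call_id", StringType()),
    StructField("subscriber_id", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumption
    .option("subscribe", "cdr-events")                 # assumption
    .load()
    .select(F.from_json(F.col("value").cast("string"), cdr_schema).alias("cdr"))
    .select("cdr.*")
    # Tolerate up to 15 minutes of event-time lateness before dropping state.
    .withWatermark("event_time", "15 minutes")
    .dropDuplicates(["call_id", "event_time"])
)

query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "abfss://lake@acct.dfs.core.windows.net/_chk/cdr")  # assumption
    .outputMode("append")
    .start("abfss://lake@acct.dfs.core.windows.net/bronze/cdr")
)
```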
KPMG, India
Data Analyst Jul 2019 – Nov 2021
Collected, cleaned, and transformed large-scale operational data using MySQL to identify inefficiencies and boost production performance.
Developed interactive Power BI dashboards to visualize key metrics, enabling faster data-driven decision-making for plant leadership.
Implemented Apache Airflow pipelines to automate ETL workflows, streamlining data ingestion from ERP and IoT sources.
Collaborated with cross-functional teams on root cause analysis of downtime events, applying data mining techniques to reduce machine failure rates.
Used Python and SQL to perform statistical analysis, trend forecasting, and anomaly detection in smelting and mining operations (see the sketch below).
Maintained version control of analytics scripts and reports using Git, ensuring smooth collaboration and code reproducibility.
Created consolidated reports combining SCADA and ERP data, enhancing operational visibility and supporting predictive maintenance strategies.
Delivered weekly insights to stakeholders, supporting strategic planning through detailed reports and KPI tracking dashboards.
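The anomaly-detection work above might reduce to something like a rolling z-score flag over sensor readings. A minimal sketch; the sensor column, file name, window size, and threshold are all hypothetical.

```python
# Minimal sketch: flag readings more than `threshold` rolling standard
# deviations from the rolling mean as anomalies.
import pandas as pd


def flag_anomalies(df: pd.DataFrame, window: int = 24, threshold: float = 3.0) -> pd.DataFrame:
    rolling = df["furnace_temp"].rolling(window)  # hypothetical sensor column
    zscore = (df["furnace_temp"] - rolling.mean()) / rolling.std()
    return df.assign(zscore=zscore, is_anomaly=zscore.abs() > threshold)


# Usage: hourly smelting sensor readings indexed by timestamp (file is hypothetical).
readings = pd.read_csv("sensor_readings.csv", parse_dates=["ts"], index_col="ts")
flagged = flag_anomalies(readings)
print(flagged[flagged["is_anomaly"]].head())
```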
Technical Skills
Core Analytics: Data Cleaning, Data Wrangling, Trend Analysis, Statistical Analysis, Predictive Modeling, Data Mining, Report Automation
Programming & Scripting: Python (Pandas, NumPy, Matplotlib, Seaborn, TensorFlow, PyTorch), SQL (Joins, CTEs, Subqueries), R, VBA, Shell Scripting
Visualization Tools: Power BI, Tableau, Google Data Studio, Excel (Pivot Tables, VLOOKUP, Power Query)
Databases & Data Management: MySQL, PostgreSQL, Oracle, MongoDB, Google BigQuery, Snowflake, Redshift, SQL, NoSQL, Data Modeling, Data Warehousing
ETL, Integration & Orchestration: Apache Airflow, Talend, SSIS, Azure Data Factory, AWS Glue, Power Query, Apache Kafka, AWS Step Functions
Cloud Platforms: AWS (S3, RDS, Redshift, EC2, VPC, Lambda), Azure (Data Lake, Data Factory, VMs, Functions), GCP (BigQuery)
Model Deployment: Flask, FastAPI, Docker
Testing & Validation: A/B Testing, Cross-Validation, Data Validation, ROC-AUC
Operating Systems & Virtualization: Linux (Ubuntu, CentOS), Windows Server, macOS, VMware, Hyper-V
Networking & Security: TCP/IP, DNS, VPNs, Firewalls, IDS/IPS, Network Troubleshooting
Automation & Configuration Management: Ansible, Puppet, Terraform, Kubernetes, Docker
DevOps & CI/CD Tools: Jenkins, Git, GitLab, JIRA, CI/CD Pipelines
Monitoring & Performance: Nagios, Zabbix, Prometheus, Grafana, ELK Stack
Education
University of Texas at Dallas, TX
M.S. in Information Technology and Management, Aug 2023 – May 2025
M.S. Ramaiah Institute of Technology, India
Bachelor of Technology in Mechanical Engineering, Aug 2017 – Aug 2021
Projects
Public Transit Reliability Dashboard: End-to-End Data Pipeline Mar 2025 – Present
Developed a real-time pipeline using Python, Kafka, Spark, and Airflow to ingest GTFS feeds into AWS Redshift, processing 50K+ daily records for transit reliability insights (see the sketch below).
Enabled Power BI dashboards tracking on-time rates and delays across 120+ bus routes, reducing reporting lag by 80% and improving transit planning accuracy.
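The GTFS ingestion step above might start with something like the following, using the gtfs-realtime-bindings and kafka-python packages. A minimal sketch; the feed URL, broker address, and topic name are all hypothetical.

```python
# Minimal sketch: poll a GTFS-realtime TripUpdates feed and publish
# per-trip delay records to Kafka for downstream Spark processing.
import json

import requests
from google.transit import gtfs_realtime_pb2
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumption
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

feed = gtfs_realtime_pb2.FeedMessage()
# Hypothetical transit-agency feed URL.
feed.ParseFromString(requests.get("https://transit.example/gtfs-rt/tripupdates").content)

for entity in feed.entity:
    if entity.HasField("trip_update"):
        tu = entity.trip_update
        producer.send("transit-trip-updates", {  # hypothetical topic
            "trip_id": tu.trip.trip_id,
            "route_id": tu.trip.route_id,
            "delay": tu.stop_time_update[0].arrival.delay if tu.stop_time_update else None,
        })
producer.flush()
```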
GitHub Activity Lakehouse: End-to-End Data Engineering Pipeline Jan 2025 – Mar 2025
Engineered a data lake on AWS S3 using AWS Glue and Airflow, storing and transforming GitHub event logs (~10GB/day) into partitioned Parquet format (see the sketch below).
Modeled and loaded curated data into Snowflake, enabling Tableau dashboards for 100+ repo insights on contributor activity, commit velocity, and issue frequency.
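The partitioned-Parquet transformation above might look roughly like this in PySpark. A minimal sketch; the S3 paths are hypothetical, and the selected fields assume GitHub Archive-style event JSON.

```python
# Minimal sketch: flatten raw GitHub event JSON and write it as Parquet
# partitioned by date and event type so downstream engines can prune partitions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("github-events-lakehouse").getOrCreate()

# Raw event logs landing zone (path is hypothetical).
raw = spark.read.json("s3://example-lake/raw/github_events/")

curated = raw.select(
    "id",
    "type",
    F.col("repo.name").alias("repo_name"),
    F.col("actor.login").alias("actor_login"),
    F.to_date("created_at").alias("event_date"),
)

(curated.write
    .mode("overwrite")
    .partitionBy("event_date", "type")
    .parquet("s3://example-lake/curated/github_events/"))
```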
Building a Scalable ETL Pipeline for Sales Data Analytics Sep 2024 – Dec 2024
Built an ELT pipeline using SQL, dbt, and Apache Airflow to consolidate multi-source sales data into Snowflake, processing 2M+ records/month with 99.9% reliability (see the sketch below).
Automated Power BI reporting, cutting manual data prep by 40 hours/month and improving sales performance insights across 6 regional teams by 30%.
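The consolidation step above might use the Snowflake Python connector to bulk-load staged extracts, with dbt building the reporting layer downstream. A minimal sketch; the account, credentials, stage, and table names are all hypothetical.

```python
# Minimal sketch: COPY staged multi-source sales extracts into a raw table.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345",   # assumption
    user="etl_user",     # assumption
    password="***",
    warehouse="ETL_WH",
    database="SALES",
    schema="RAW",
)
try:
    cur = conn.cursor()
    # Abort the statement on any bad record so partial loads never reach
    # the dbt models that Airflow schedules afterwards.
    cur.execute("""
        COPY INTO raw_sales
        FROM @sales_stage/2024/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'ABORT_STATEMENT'
    """)
finally:
    conn.close()
```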