Hetesh Gopala Krishna
EDUCATION
Master's in Artificial Intelligence and Business Analytics, Muma College of Business, University of South Florida, Tampa, May 2026
Bachelor's degree, Koneru Lakshmaiah University, Nov 2019
SKILLS
Programming: SQL (advanced), PL/SQL, C, R, Python, Shell Script, Microsoft Excel, Tableau
Technologies/Environments: Pandas, NumPy, TensorFlow, Oracle SQL, Oracle Customer Care and Billing (CCB), Maven
Database tools: MySQL, Oracle SQL, MongoDB, BigQuery
Big data ecosystems: MapReduce, HDFS, Spark, Hive, Sqoop, Kafka
RELEVANT EXPERIENCE
Big Data Quality Analyst
ClearTax, Bangalore, India
Feb 2021 – Jan 2024
• Ensured data accuracy and reliability across large-scale tax and financial datasets using big data tools.
• Collaborated with data engineers to design scalable ETL processes and improve data ingestion accuracy.
• Conducted regular audits and root cause analysis to resolve anomalies and ensure compliance.
Data Analyst (Contract)
Volvo Truck Plant, Bangalore, India
Jan 2020 – Jan 2021
• Collected and analyzed production data to support operational efficiency and maintenance planning.
• Developed dashboards and visual reports for tracking manufacturing KPIs.
• Supported predictive analytics initiatives for truck maintenance using historical sensor data.
PROJECTS
Scalable Big Data Pipeline for GST Filing and Tax Analytics (Feb 2021 – Jan 2024)
• Developed a scalable big data pipeline using Apache Spark and Hadoop for real-time and batch processing of GST filings across India, supporting compliance checks and anomaly detection (batch side sketched below).
• Tuned Spark performance to enable ITC reconciliation and delivered insights via Hive and Presto to support financial audits and decision-making.
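A minimal PySpark sketch of the batch leg of this pipeline. The schema, paths, output table, and 5% mismatch threshold are illustrative assumptions, not the production values.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gst-batch-anomalies").getOrCreate()

# Hypothetical GST filing extract landed on HDFS by the ingestion layer.
filings = spark.read.parquet("hdfs:///data/gst/filings/")

# Flag filings where declared tax deviates sharply from the tax implied by
# the reported taxable value and rate (a simple anomaly heuristic).
flagged = (
    filings
    .withColumn("expected_tax", F.col("taxable_value") * F.col("tax_rate"))
    .withColumn(
        "anomaly",
        F.abs(F.col("declared_tax") - F.col("expected_tax"))
        > 0.05 * F.col("expected_tax"),
    )
)

# Persist flagged records for downstream review in Hive/Presto.
flagged.filter("anomaly").write.mode("overwrite").saveAsTable("gst.flagged_filings")
```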
Cassandra Astra Cloud Integration Project (Jan 2025 – May 2025)
• Developed a Python-based data processing pipeline that connects to a DataStax Astra Cassandra database using the secure connect bundle and the cassandra-driver package. Set up a local development environment in VS Code with Anaconda for dependency management.
• Created and managed keyspaces and tables on Astra's Cassandra cluster. Demonstrated data insertion and complex querying, including filtering on non-partition keys using CQL with ALLOW FILTERING (sketched below). Verified operations through both the Python client and the DataStax CQL Console, and recorded a full tutorial with voice-over walkthrough for deployment and training purposes.
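A minimal sketch of the Astra connection and the ALLOW FILTERING query pattern described above, using the cassandra-driver package. The bundle path, credentials, keyspace, and table are placeholders.

```python
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Astra secure connect bundle and token credentials (placeholders).
cloud_config = {"secure_connect_bundle": "/path/to/secure-connect-db.zip"}
auth_provider = PlainTextAuthProvider("CLIENT_ID", "CLIENT_SECRET")

cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect("demo_keyspace")  # hypothetical keyspace

session.execute(
    "INSERT INTO users (user_id, city, name) VALUES (%s, %s, %s)",
    (1, "Tampa", "Hetesh"),
)

# Filtering on a non-partition column requires ALLOW FILTERING, which
# scans the table and should be reserved for small datasets.
rows = session.execute("SELECT * FROM users WHERE city = 'Tampa' ALLOW FILTERING")
for row in rows:
    print(row.user_id, row.name)

cluster.shutdown()
```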
Big Data Project on Insurance Data (Jan 2025 – May 2025)
• Sourced insurance data from a BigQuery database and connected it to Databricks for processing and analysis.
• Cleaned the dataset by removing headers, footers, and blank rows, and applied feature engineering to enforce a consistent 10-column row format.
• Utilized Spark SQL for advanced data manipulation, including calculating average market coverage per state (sketched below).
• Implemented Isolation Forest to detect anomalies in the data, identifying potential outliers and fraudulent claims.
• Applied SparkML to train a Logistic Regression model, predicting Market Coverage based on features like IssuerId, StateCode, and DentalOnlyPlan.
• Optimized data processing pipelines in Databricks, improving performance for large datasets and complex operations.
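A minimal Spark SQL sketch of the per-state aggregation, as it might run in a Databricks notebook. The table and column names follow the project description but are assumptions, and MarketCoverage is assumed to be numerically encoded in the cleaned data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

# Hypothetical cleaned table produced by the earlier preparation steps.
df = spark.read.table("insurance.plans")
df.createOrReplaceTempView("plans")

avg_coverage = spark.sql("""
    SELECT StateCode,
           AVG(MarketCoverage) AS avg_market_coverage
    FROM plans
    GROUP BY StateCode
    ORDER BY avg_market_coverage DESC
""")
avg_coverage.show()
```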
Airline Delay Data Warehouse (Jan 2025 – May 2025)
• Designed and implemented a relational data warehouse in Oracle SQL with dimension and fact tables for airline delays.
• Performed advanced data munging: handled nulls, removed invalid entries, standardized formats, and ensured referential integrity.
• Utilized SQL for data cleaning, integrity checks, and transformations across flight, weather, and carrier datasets.
• Built queries for outlier detection, delay trend analysis, and cancellation insights using JOINs and aggregations (sketched below).
• Tools & Technologies: Oracle SQL, PL/SQL, Data Warehousing, Data Munging, Data Cleaning, ER Modeling, ETL Principles.
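A sketch of a delay-trend query run from Python with the python-oracledb driver. The connection details and the flight_fact/carrier_dim schema are illustrative, not the actual warehouse design.

```python
import oracledb

# Placeholder credentials and DSN for a local Oracle instance.
conn = oracledb.connect(user="analytics", password="***", dsn="localhost/XEPDB1")
cur = conn.cursor()

# Average departure delay per carrier per month, joining fact to dimension.
cur.execute("""
    SELECT c.carrier_name,
           TO_CHAR(f.flight_date, 'YYYY-MM') AS month,
           AVG(f.dep_delay) AS avg_dep_delay
    FROM flight_fact f
    JOIN carrier_dim c ON c.carrier_key = f.carrier_key
    WHERE f.dep_delay IS NOT NULL
    GROUP BY c.carrier_name, TO_CHAR(f.flight_date, 'YYYY-MM')
    ORDER BY month, avg_dep_delay DESC
""")
for carrier, month, delay in cur:
    print(carrier, month, round(delay, 1))

conn.close()
```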
Insurance Fraud Detection Using R (May 2024 – Dec 2024)
• Built a complete ML pipeline in R for fraud detection in insurance, achieving 93% accuracy with XGBoost and producing explainable insights via LIME and IML.
• Engineered fraud indicators, handled class imbalance with SMOTE (see the sketch below), and deployed a fraud-flagging system to enhance early detection and risk mitigation.
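The project itself was built in R; this Python analogue (imbalanced-learn + XGBoost) sketches the same pipeline shape, with synthetic data standing in for the real claims dataset.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

# Synthetic imbalanced "claims" data: roughly 5% fraud, as a stand-in.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Oversample the minority (fraud) class on the training split only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

model = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
model.fit(X_res, y_res)

print(classification_report(y_te, model.predict(X_te)))
```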
Real-Time Spam Mail Detection Using Machine Learning (May 2024 – Dec 2024)
• Developed and evaluated Logistic Regression, Decision Tree, and Random Forest models for classifying emails as spam or legitimate in real time (sketched below).
• Achieved 87% accuracy with the Random Forest model, maintaining a strong balance between precision and recall.
• Optimized email filtering by tuning hyperparameters and selecting key features to reduce false positives and enhance filtering efficiency.
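A minimal sketch of the spam classifier, assuming a scikit-learn TF-IDF + Random Forest stack; the toy corpus stands in for the real labeled email dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the real labeled email dataset.
emails = ["win a free prize now", "meeting at 3pm tomorrow",
          "claim your reward", "project report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = legitimate

# TF-IDF features (unigrams and bigrams) feeding a Random Forest.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=200, random_state=42),
)
pipeline.fit(emails, labels)

# Likely flagged as spam given the toy training set.
print(pipeline.predict(["free reward waiting"]))
```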
Risk Assessment and Mitigation of Workplace Mental Health Using Machine Learning in R (May 2024 – Dec 2024)
• Built a risk assessment framework using R with models like Random Forest and Logistic Regression to predict mental health risks from over 300,000 records.
• Achieved 96.3% accuracy with Random Forest, identifying key predictors such as stress, family history, and workplace conditions (predictor ranking sketched below).
• Used R libraries like caret, randomForest, and ggplot2 for preprocessing, model tuning, and visual profiling.
• Delivered insights to support HR compliance, improve employee well-being, and reduce mental health-related absenteeism.
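The project used R (caret, randomForest, ggplot2); this Python analogue sketches the predictor-ranking step, with synthetic data and illustrative feature names in place of the real survey records.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the ~300,000-record survey dataset; feature
# names are illustrative, not the actual survey columns.
feature_names = ["stress_level", "family_history", "remote_work",
                 "work_hours", "benefits_score"]
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Rank predictors by impurity-based importance.
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```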
Efficient Transactional Database Management for E-Commerce Platforms (May 2024 – Dec 2024)
• Designed and implemented a transactional database using Google BigQuery to manage large-scale e-commerce datasets.
• Enabled efficient CRUD operations for products and user data, ensuring ACID compliance.
• Optimized performance using partitioning, clustering, and indexing to support high-volume queries (sketched below).
• Generated actionable insights into customer behavior and product trends, supporting data-driven business strategies.
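A minimal sketch of a partitioned, clustered BigQuery table and a partition-pruning query, using the google-cloud-bigquery client. The project ID, dataset, and schema are placeholders, and the client assumes application default credentials.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

# Hypothetical orders table: partitioned by timestamp, clustered by
# the columns most common in lookups.
schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("product_id", "STRING"),
    bigquery.SchemaField("order_ts", "TIMESTAMP"),
    bigquery.SchemaField("amount", "NUMERIC"),
]
table = bigquery.Table("my-project.ecommerce.orders", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="order_ts")
table.clustering_fields = ["user_id", "product_id"]
client.create_table(table)

# High-volume query that prunes partitions via the timestamp filter.
query = """
    SELECT user_id, SUM(amount) AS total_spend
    FROM `my-project.ecommerce.orders`
    WHERE order_ts >= TIMESTAMP('2024-01-01')
    GROUP BY user_id
    ORDER BY total_spend DESC
    LIMIT 100
"""
for row in client.query(query).result():
    print(row.user_id, row.total_spend)
```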
Databricks-Based Data Lake for Credit/Debit Transaction Processing (Jan 2025 – Dec 2025)
• Built a scalable data lake on Databricks using Spark, Delta Lake, Kafka, and ElasticSearch to process credit/debit card adjustments.
• Developed real-time and batch ETL pipelines with NiFi and Kafka, enabling live fraud detection through Spark Structured Streaming (sketched below).
• Stored processed data in Delta Lake and ElasticSearch, with interactive dashboards in Kibana for real-time and historical insights.
• Optimized performance with auto-scaling clusters and ensured secure access via Unity Catalog and role-based controls.
• Improved transaction visibility and operational efficiency for finance and compliance teams.
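A minimal sketch of the streaming leg: Kafka in, Delta out, as it might look in a Databricks notebook. The broker, topic, schema, threshold rule, and paths are illustrative placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

# Hypothetical shape of a card-adjustment event.
txn_schema = StructType([
    StructField("card_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("merchant", StringType()),
])

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "card-adjustments")
    .load()
    .select(F.from_json(F.col("value").cast("string"), txn_schema).alias("t"))
    .select("t.*")
    # Simple live rule: surface unusually large adjustments for review.
    .withColumn("suspect", F.col("amount") > 10000)
)

# Continuously append to Delta Lake with checkpointing for recovery.
query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/chk/card-adjustments")
    .start("/mnt/delta/card_adjustments")
)
```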