Hetesh Gopala Krishna
EDUCATION
Master's in Artificial Intelligence and Business Analytics, Muma College of Business, University of South Florida, Tampa, May 2026
Bachelor's degree, Koneru Lakshmaiah University, Nov 2019
SKILLS
Programming: SQL (advanced), PL/SQL, C, R, Python, Shell Script, Microsoft Excel, Tableau
Technologies/Environments: Pandas, NumPy, TensorFlow, Oracle SQL, Oracle Customer Care and Billing (CCB), Maven
Database tools: MySQL, Oracle SQL, MongoDB, BigQuery
Big data ecosystems: MapReduce, HDFS, Spark, Hive, Sqoop, Kafka
RELEVANT EXPERIENCE
Big Data Quality Analyst
ClearTax, Bangalore, India
Feb 2021 – Jan 2024
• Ensured data accuracy and reliability across large-scale tax and financial datasets using big data tools.
• Collaborated with data engineers to design scalable ETL processes and improve data ingestion accuracy.
• Conducted regular audits and root cause analysis to resolve anomalies and ensure compliance.
Data Analyst (Contract)
Volvo Truck Plant, Bangalore, India
Jan 2020 – Jan 2021
• Collected and analyzed production data to support operational efficiency and maintenance planning.
• Developed dashboards and visual reports for tracking manufacturing KPIs.
• Supported predictive analytics initiatives for truck maintenance using historical sensor data.
PROJECTS
Scalable Big Data Pipeline for GST Filing and Tax Analytics (Feb 2021 – Jan 2024)
• Developed a scalable big data pipeline using Apache Spark and Hadoop for real-time and batch processing of GST filings across India, supporting compliance checks and anomaly detection (batch side sketched below).
• Tuned Spark performance to enable ITC reconciliation and delivered insights via Hive and Presto to support financial audits and decision-making.
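A minimal PySpark sketch of the batch leg of this pipeline. The schema, paths, output table, and 5% mismatch threshold are illustrative assumptions, not the production values.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gst-batch-anomalies").getOrCreate()

# Hypothetical GST filing extract landed on HDFS by the ingestion layer.
filings = spark.read.parquet("hdfs:///data/gst/filings/")

# Flag filings where declared tax deviates sharply from the tax implied by
# the reported taxable value and rate (a simple anomaly heuristic).
flagged = (
    filings
    .withColumn("expected_tax", F.col("taxable_value") * F.col("tax_rate"))
    .withColumn(
        "anomaly",
        F.abs(F.col("declared_tax") - F.col("expected_tax"))
        > 0.05 * F.col("expected_tax"),
    )
)

# Persist flagged records for downstream review in Hive/Presto.
flagged.filter("anomaly").write.mode("overwrite").saveAsTable("gst.flagged_filings")
```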
Cassandra Astra Cloud Integration Project (Jan 2025 – May 2025)
• Developed a Python-based data processing pipeline that connects to a DataStax Astra Cassandra database using the secure connect bundle and the cassandra-driver package. Set up a local development environment in VS Code with Anaconda for dependency management.
• Created and managed keyspaces and tables on Astra's Cassandra cluster. Demonstrated data insertion and complex querying, including filtering on non-partition keys using CQL with ALLOW FILTERING (sketched below). Verified operations through both the Python client and the DataStax CQL Console, and recorded a full tutorial with voice-over walkthrough for deployment and training purposes.
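A minimal sketch of the Astra connection and the ALLOW FILTERING query pattern described above, using the cassandra-driver package. The bundle path, credentials, keyspace, and table are placeholders.

```python
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Astra secure connect bundle and token credentials (placeholders).
cloud_config = {"secure_connect_bundle": "/path/to/secure-connect-db.zip"}
auth_provider = PlainTextAuthProvider("CLIENT_ID", "CLIENT_SECRET")

cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect("demo_keyspace")  # hypothetical keyspace

session.execute(
    "INSERT INTO users (user_id, city, name) VALUES (%s, %s, %s)",
    (1, "Tampa", "Hetesh"),
)

# Filtering on a non-partition column requires ALLOW FILTERING, which
# scans the table and should be reserved for small datasets.
rows = session.execute("SELECT * FROM users WHERE city = 'Tampa' ALLOW FILTERING")
for row in rows:
    print(row.user_id, row.name)

cluster.shutdown()
```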
Big Data Project on Insurance Data (Jan 2025 – May 2025)
• Sourced insurance data from a BigQuery database and connected it to Databricks for processing and analysis.
• Cleaned the dataset by removing headers, footers, and blank rows, and applied feature engineering to enforce a consistent 10-column row format.
• Utilized Spark SQL for advanced data manipulation, including calculating average market coverage per state (sketched below).
• Implemented Isolation Forest to detect anomalies in the data, identifying potential outliers and fraudulent claims.
• Applied SparkML to train a Logistic Regression model, predicting Market Coverage based on features like IssuerId, StateCode, and DentalOnlyPlan.
• Optimized data processing pipelines in Databricks, improving performance for large datasets and complex operations.
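A minimal Spark SQL sketch of the per-state aggregation, as it might run in a Databricks notebook. The table and column names follow the project description but are assumptions, and MarketCoverage is assumed to be numerically encoded in the cleaned data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

# Hypothetical cleaned table produced by the earlier preparation steps.
df = spark.read.table("insurance.plans")
df.createOrReplaceTempView("plans")

avg_coverage = spark.sql("""
    SELECT StateCode,
           AVG(MarketCoverage) AS avg_market_coverage
    FROM plans
    GROUP BY StateCode
    ORDER BY avg_market_coverage DESC
""")
avg_coverage.show()
```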
Airline Delay Data Warehouse (Jan 2025 – May 2025)
• Designed and implemented a relational data warehouse in Oracle SQL with dimension and fact tables for airline delays.
• Performed advanced data munging: handled nulls, removed invalid entries, standardized formats, and ensured referential integrity.
• Utilized SQL for data cleaning, integrity checks, and transformations across flight, weather, and carrier datasets.
• Built queries for outlier detection, delay trend analysis, and cancellation insights using JOINs and aggregations (sketched below).
• Tools & Technologies: Oracle SQL, PL/SQL, Data Warehousing, Data Munging, Data Cleaning, ER Modeling, ETL Principles.
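A sketch of a delay-trend query run from Python with the python-oracledb driver. The connection details and the flight_fact/carrier_dim schema are illustrative, not the actual warehouse design.

```python
import oracledb

# Placeholder credentials and DSN for a local Oracle instance.
conn = oracledb.connect(user="analytics", password="***", dsn="localhost/XEPDB1")
cur = conn.cursor()

# Average departure delay per carrier per month, joining fact to dimension.
cur.execute("""
    SELECT c.carrier_name,
           TO_CHAR(f.flight_date, 'YYYY-MM') AS month,
           AVG(f.dep_delay) AS avg_dep_delay
    FROM flight_fact f
    JOIN carrier_dim c ON c.carrier_key = f.carrier_key
    WHERE f.dep_delay IS NOT NULL
    GROUP BY c.carrier_name, TO_CHAR(f.flight_date, 'YYYY-MM')
    ORDER BY month, avg_dep_delay DESC
""")
for carrier, month, delay in cur:
    print(carrier, month, round(delay, 1))

conn.close()
```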
Insurance Fraud Detection Using R (May 2024 – Dec 2024)
• Built a complete ML pipeline in R for fraud detection in insurance, achieving 93% accuracy with XGBoost and producing explainable insights via LIME and IML.
• Engineered fraud indicators, handled class imbalance with SMOTE (see the sketch below), and deployed a fraud-flagging system to enhance early detection and risk mitigation.
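The project itself was built in R; this Python analogue (imbalanced-learn + XGBoost) sketches the same pipeline shape, with synthetic data standing in for the real claims dataset.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

# Synthetic imbalanced "claims" data: roughly 5% fraud, as a stand-in.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Oversample the minority (fraud) class on the training split only.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

model = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
model.fit(X_res, y_res)

print(classification_report(y_te, model.predict(X_te)))
```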
Real-Time Spam Mail Detection Using Machine Learning (May 2024 – Dec 2024)
• Developed and evaluated Logistic Regression, Decision Tree, and Random Forest models for classifying emails as spam or legitimate in real time (sketched below).
• Achieved 87% accuracy with the Random Forest model, maintaining a strong balance between precision and recall.
• Optimized email filtering by tuning hyperparameters and selecting key features to reduce false positives and enhance filtering efficiency.
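A minimal sketch of the spam classifier, assuming a scikit-learn TF-IDF + Random Forest stack; the toy corpus stands in for the real labeled email dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the real labeled email dataset.
emails = ["win a free prize now", "meeting at 3pm tomorrow",
          "claim your reward", "project report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = legitimate

# TF-IDF features (unigrams and bigrams) feeding a Random Forest.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=200, random_state=42),
)
pipeline.fit(emails, labels)

# Likely flagged as spam given the toy training set.
print(pipeline.predict(["free reward waiting"]))
```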
Risk Assessment and Mitigation of Workplace Mental Health Using Machine Learning in R (May 2024 – Dec 2024)
• Built a risk assessment framework using R with models like Random Forest and Logistic Regression to predict mental health risks from over 300,000 records.
• Achieved 96.3% accuracy with Random Forest, identifying key predictors such as stress, family history, and workplace conditions (predictor ranking sketched below).
• Used R libraries like caret, randomForest, and ggplot2 for preprocessing, model tuning, and visual profiling.
• Delivered insights to support HR compliance, improve employee well-being, and reduce mental health-related absenteeism.
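The project used R (caret, randomForest, ggplot2); this Python analogue sketches the predictor-ranking step, with synthetic data and illustrative feature names in place of the real survey records.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the ~300,000-record survey dataset; feature
# names are illustrative, not the actual survey columns.
feature_names = ["stress_level", "family_history", "remote_work",
                 "work_hours", "benefits_score"]
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Rank predictors by impurity-based importance.
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```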
Efficient Transactional Database Management for E-Commerce Platforms (May 2024 – Dec 2024)
• Designed and implemented a transactional database using Google BigQuery to manage large-scale e-commerce datasets.
• Enabled efficient CRUD operations for products and user data, ensuring ACID compliance.
• Optimized performance using partitioning, clustering, and indexing to support high-volume queries (sketched below).
• Generated actionable insights into customer behavior and product trends, supporting data-driven business strategies.
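A minimal sketch of a partitioned, clustered BigQuery table and a partition-pruning query, using the google-cloud-bigquery client. The project ID, dataset, and schema are placeholders, and the client assumes application default credentials.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

# Hypothetical orders table: partitioned by timestamp, clustered by
# the columns most common in lookups.
schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("product_id", "STRING"),
    bigquery.SchemaField("order_ts", "TIMESTAMP"),
    bigquery.SchemaField("amount", "NUMERIC"),
]
table = bigquery.Table("my-project.ecommerce.orders", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="order_ts")
table.clustering_fields = ["user_id", "product_id"]
client.create_table(table)

# High-volume query that prunes partitions via the timestamp filter.
query = """
    SELECT user_id, SUM(amount) AS total_spend
    FROM `my-project.ecommerce.orders`
    WHERE order_ts >= TIMESTAMP('2024-01-01')
    GROUP BY user_id
    ORDER BY total_spend DESC
    LIMIT 100
"""
for row in client.query(query).result():
    print(row.user_id, row.total_spend)
```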
Databricks-Based Data Lake for Credit/Debit Transaction Processing (Jan 2025 – Dec 2025)
• Built a scalable data lake on Databricks using Spark, Delta Lake, Kafka, and ElasticSearch to process credit/debit card adjustments.
• Developed real-time and batch ETL pipelines with NiFi and Kafka, enabling live fraud detection through Spark Structured Streaming (sketched below).
• Stored processed data in Delta Lake and ElasticSearch, with interactive dashboards in Kibana for real-time and historical insights.
• Optimized performance with auto-scaling clusters and ensured secure access via Unity Catalog and role-based controls.
• Improved transaction visibility and operational efficiency for finance and compliance teams.
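A minimal sketch of the streaming leg: Kafka in, Delta out, as it might look in a Databricks notebook. The broker, topic, schema, threshold rule, and paths are illustrative placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

# Hypothetical shape of a card-adjustment event.
txn_schema = StructType([
    StructField("card_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("merchant", StringType()),
])

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "card-adjustments")
    .load()
    .select(F.from_json(F.col("value").cast("string"), txn_schema).alias("t"))
    .select("t.*")
    # Simple live rule: surface unusually large adjustments for review.
    .withColumn("suspect", F.col("amount") > 10000)
)

# Continuously append to Delta Lake with checkpointing for recovery.
query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/chk/card-adjustments")
    .start("/mnt/delta/card_adjustments")
)
```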