Rakesh Thota
Charlotte, NC **************@*****.*** LinkedIn GitHub +1-716-***-****
PROFESSIONAL SUMMARY
Data Engineer with 4+ years of experience architecting scalable, real-time data pipelines across multi-cloud environments (GCP, AWS, Azure) that process terabytes of data daily with near-zero loss and sub-second latency. Implemented data solutions using PySpark, Databricks, BigQuery, Kafka, Airflow, and BI tools (Power BI, Tableau, Looker). Engineered pipelines that improved data processing speeds by 45% and reduced costs, delivering scalable, resilient, and secure data systems that enable organizations to make data-driven decisions with confidence.
EDUCATION
University at Buffalo – State University of New York Buffalo, New York
Master of Science in Computer Science (AI/ML) Jan 2024 – May 2025
EXPERIENCE
Egen, Hyderabad, India Jul 2022 – Dec 2023
Data Engineer
• Engineered and scaled end-to-end streaming ETL pipelines using Google Cloud Pub/Sub & Dataflow, ingesting 2M+ records daily with 40% lower latency, enabling near real-time analytics.
• Orchestrated ML workflows with Apache Airflow (Cloud Composer) and Vertex AI, deploying inference results into BigQuery, boosting pipeline efficiency by ~25% and ensuring end-to-end reliability.
• Optimized BigQuery performance through partitioning, clustering, and schema optimization, cutting query execution time by 60% and compute costs by 40% across analytical workloads.
• Engineered recovery mechanisms and secure integrations (Cloud Functions, Cloud Run, OAuth2, Secret Manager), ensuring near-zero data loss and 99.9% reliability for real-time client data delivery.
• Deployed CI/CD pipelines using Cloud Build, Git, Docker, and GKE (Kubernetes), and provisioned and managed infrastructure with Terraform, automating 55+ GCP resources, reducing manual effort by 80%, and delivering more stable releases.
• Led proof-of-concept (POC) initiatives for financial document extraction automation using Google Document AI, Camelot, OCR, and complex regex patterns, extracting key structured fields and eliminating manual data entry.
Persistent Systems, Hyderabad, India Jan 2021 – Jun 2022
Associate Data Engineer
• Built and maintained scalable batch and streaming data pipelines using Apache Kafka, Azure Data Factory, Databricks, PySpark, and Snowflake, processing 10TB+ of data daily and reducing transformation latency by 45%.
• Refined data models in Snowflake and Azure Synapse Analytics while supporting BigQuery integration for multi-cloud reporting, cutting compute costs by 20% and accelerating dashboard refreshes by 30%.
• Optimized Spark workloads in Databricks through improved partitioning, caching, and modeling techniques, reducing runtime by 35% and infrastructure costs by 15%.
• Conducted exploratory data analysis (EDA) and developed interactive Power BI dashboards with DAX calculations, enabling real-time KPI tracking and improving decision-making efficiency by 30-35% for business teams.
Novartis, Hyderabad, India Nov 2019 – Dec 2020
Junior Software Engineer
• Developed scalable data pipelines with AWS Glue, Airflow, and Redshift, processing 10TB+ of clinical trial data monthly and reducing reporting latency by 20%.
• Migrated legacy data systems to Amazon S3 and Redshift, enabling faster query execution and improving storage efficiency, achieving a 30% reduction in storage costs while maintaining compliance with industry standards.
• Built advanced Tableau dashboards providing researchers with real-time experiment insights, reducing manual reporting by 40% and enabling faster data-driven decisions.
• Collaborated with cross-functional teams (scientists, analysts, and engineers) to align pipelines with compliance requirements under HIPAA and corporate governance rules, improving delivery timelines by 20%.
TECHNICAL SKILLS
• Programming Languages & APIs: Python, SQL, Java, REST APIs (Flask, FastAPI)
• Big Data Technologies & Cloud: Apache Spark, Apache Kafka, Apache Airflow, Hadoop, Databricks, BigQuery, Snowflake, Redshift, Delta Lake, Iceberg, DBT, Parquet, ORC, Avro
• Cloud & DevOps: GCP (Dataflow, Composer, Dataproc, Cloud Functions, GCS), AWS (EC2, Kinesis, Lambda), Azure (ADLS Gen2, Event Hubs), Docker, Kubernetes, Git, GitHub Actions, Terraform, Jenkins, Cloud Monitoring & Logging
• AI/ML: Vertex AI, Transformers, TensorFlow, PyTorch, NLP, Dialogflow CX/ES, Prompt Engineering, RAG pipelines
• Databases: PostgreSQL, MySQL, Firebase, DynamoDB, Cassandra, Redis
• Analytics & Visualization: Pandas, Tableau, Power BI (DAX), Looker, Excel (Power Query, Pivoting)
• Other Skills: Predictive Modeling, A/B Testing, Data Warehousing, Data Modeling, Data Governance & Compliance (GDPR, HIPAA)
CERTIFICATIONS
• 5x Google Cloud Certified: Professional Cloud Architect, Data Engineer, Cloud Developer, Cloud Security Engineer, and Associate Cloud Engineer
• AWS Certified: Cloud Practitioner