THOYAJA PRIYA MOTAMARRI
Phone: 703-***-**** | Email: *********************@*****.*** | LinkedIn
Senior Data Engineer | Big Data Specialist | Cloud Data Platform Expert
Data Engineer with 4+ years of experience in delivering cloud-native data platforms, real-time pipelines, and enterprise data lakes. Proficient in Hadoop, Spark, Kafka, and AWS/Azure/GCP ecosystems. Skilled in SQL, NoSQL, Airflow, and Terraform, with a track record of improving performance, reducing costs, and enabling data-driven decisions through automated workflows and dashboards.
PROFILE SUMMARY
Skilled in designing, deploying, and optimizing Hadoop and Spark clusters (Cloudera, Hortonworks, MapR, AWS EMR), ensuring performance, scalability, and cost-efficiency.
Hands-on expertise with Spark SQL, Spark Streaming, PySpark, Scala APIs, Kafka, and real-time streaming pipelines for high-throughput data processing.
Strong experience in ETL pipeline development, data lake architecture, and large-scale ingestion using AWS Glue, Lambda, S3, Apache Flume, Sqoop, and Kafka.
Proficient in SQL and relational databases (Oracle, SQL Server, MySQL, Teradata, Postgres) and data warehouses (Snowflake, Redshift, BigQuery, Vertica).
Practical exposure to NoSQL databases including HBase and Cassandra, and experience with multiple data formats (Parquet, ORC, Avro, JSON, CSV).
Expertise in multi-cloud platforms:
o AWS (EC2, EMR, S3, Lambda, CloudFormation, Redshift, CloudFront)
o Azure (AKS, Data Factory, Storage, Functions)
o GCP (BigQuery, Pub/Sub, Composer, Cloud Storage)
Advanced in infrastructure automation using Terraform, Docker, Kubernetes for reproducible and scalable deployments.
Skilled in workflow orchestration & scheduling with Apache Airflow, Oozie, Unix scripting, cron jobs.
Strong knowledge of CI/CD and version control tools: Git, GitHub, GitLab, Bitbucket, Jenkins, Bamboo.
Experience in backup planning, monitoring, performance tuning, and auto-scaling for high availability in production environments.
Delivered real-time dashboards and reporting with Power BI, Tableau, and SQL-driven solutions to enable actionable business insights.
TECHNICAL PURVIEW
CATEGORY | TOOLS & TECHNOLOGIES
HADOOP ECOSYSTEM | Apache Hadoop, HDFS, MapReduce, Apache Hive, Flume, Sqoop, Apache Spark (PySpark, Spark Streaming), Apache Airflow, Kafka, Oozie, ZooKeeper, Hue, TDCH, Talend, Informatica
HADOOP DISTRIBUTIONS | AWS (Amazon EMR), Cloudera (CDP, CDH), Hortonworks (HDP)
PROGRAMMING LANGUAGES | Python, R, Java, Scala, C, C++, PL/SQL, Unix shell scripting, HiveQL, Pig Latin
SCRIPTING LANGUAGES | Shell scripting, JavaScript
NOSQL DATABASES | MongoDB, HBase, Cassandra
SQL DATABASES & WAREHOUSES | MySQL, DB2, Teradata, Snowflake, Vertica, Salesforce
CLOUD PROVIDERS | AWS (EC2, S3, RDS, EMR, Glue, Redshift), GCP (BigQuery, Dataflow), Azure (Data Lake, Data Factory)
BI TOOLS | Tableau, Power BI
VERSION CONTROL TOOLS | Bitbucket, Git, GitHub, GitLab
WEB TECHNOLOGIES | HTML, CSS
PROJECTS UNDERTAKEN
Project 1: Huntington National Bank | Financial Data Lake & Risk Analytics Platform
Role: Senior Data Engineer | Aug 2024 - Present
Domain: Finance & Banking | Cloud: AWS
Challenge:
The financial institution required a scalable data platform to unify transaction, compliance, and risk data across multiple systems. The goal was to enable faster fraud monitoring, real-time reporting, and regulatory compliance while ensuring data security.
Contributions:
Designed and implemented a cloud-native financial data lake on AWS S3 + Redshift, enabling storage and querying of billions of transactions.
Built ETL pipelines with AWS Glue, Lambda, and Kinesis, ingesting structured/unstructured data from payment systems, trading platforms, and compliance logs.
Developed Python and SQL-based transformation scripts to standardize schemas, cleanse data, and enforce consistency across multi-source datasets.
Automated workflow orchestration with Apache Airflow, ensuring reliable daily and real-time refreshes for compliance and risk analytics (see the sketch below).
Applied Terraform for infrastructure as code (IaC), enabling reproducible deployments of ETL pipelines, Redshift clusters, and security configurations.
Integrated Kafka streaming for near real-time ingestion of fraud alerts and payment anomalies.
Enforced data governance and compliance policies, including encryption, access controls, and PII redaction, to align with financial regulations.
Delivered risk and compliance dashboards in Tableau and Power BI, enabling executives to track fraud alerts, portfolio exposures, and audit trails in near real-time.
Conducted ad-hoc SQL analysis to support risk officers in identifying suspicious transaction patterns.
Impact:
Reduced fraud detection latency by 40% through near real-time streaming pipelines.
Improved regulatory reporting turnaround by 50%, ensuring audit readiness.
Cut infrastructure costs by 25% by automating provisioning and scaling with Terraform.
Enabled enterprise-wide risk visibility with a unified AWS-native data platform.
Tech Stack: Python, SQL, AWS (S3, Glue, Lambda, Kinesis, Redshift), Terraform, Kafka, Apache Airflow, Snowflake, Tableau, Power BI
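Illustrative sketch of the Airflow orchestration pattern behind the daily compliance refresh; the DAG, task, and callable names are hypothetical placeholders, not the production code:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def validate_schemas(**context):
        # Placeholder: enforce expected column names/types on the day's landed files.
        pass

    def load_to_redshift(**context):
        # Placeholder: issue a Redshift COPY from the cleansed S3 prefix.
        pass

    with DAG(
        dag_id="daily_compliance_refresh",   # hypothetical name
        start_date=datetime(2024, 8, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        validate = PythonOperator(task_id="validate_schemas", python_callable=validate_schemas)
        load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)
        validate >> load   # load runs only after validation succeeds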
Project 2: Johnson & Johnson | Healthcare Data Lake & Clinical Analytics Platform
Role: Data Engineer & Analyst | Aug 2022 - July 2023
Domain: Healthcare | Cloud: Azure
Challenge:
Hospitals needed a centralized data lake and analytics platform to integrate patient records, EHR systems, and operational data while ensuring regulatory compliance and enabling faster reporting for clinical teams.
Contributions:
Built end-to-end ETL pipelines using Azure Data Factory, Data Lake Storage, and SQL Database, consolidating structured and unstructured healthcare data.
Developed Python and SQL scripts for data validation, cleansing, and schema harmonization across EHR, insurance, and operational systems.
Deployed and scaled containerized ingestion workflows using Kubernetes (AKS) for parallel processing of large healthcare datasets.
Automated infrastructure provisioning with Terraform, ensuring consistent and cost-optimized Azure environments.
Designed streaming pipelines with Azure Event Hubs + Kafka to process real-time patient monitoring and device telemetry data (see the sketch below).
Applied data governance practices including HIPAA-compliant PII/PHI redaction, audit logging, and access control policies.
Created clinical performance dashboards in Power BI and Tableau, enabling administrators to track patient admission trends, ICU utilization, and treatment outcomes.
Conducted descriptive analysis to identify operational bottlenecks and seasonal patient admission patterns.
Impact:
Reduced data processing and integration time by 45% with automated pipelines.
Enabled real-time monitoring of critical patient data streams through Azure-native streaming pipelines.
Improved regulatory reporting efficiency by 30%, ensuring HIPAA compliance.
Delivered actionable dashboards that empowered clinical staff with faster decision support.
Tech Stack: Python, SQL, Azure (Data Factory, Data Lake Storage, SQL Database, Event Hubs, AKS), Terraform, Kafka, Tableau, Power BI
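Illustrative sketch of the Event Hubs consumer pattern behind the telemetry pipeline; the connection string and hub name are hypothetical, with real secrets injected via configuration:

    from azure.eventhub import EventHubConsumerClient

    def on_event(partition_context, event):
        # Placeholder: validate one telemetry reading and hand it to the downstream sink.
        print(event.body_as_str())
        partition_context.update_checkpoint(event)

    client = EventHubConsumerClient.from_connection_string(
        conn_str="<event-hubs-connection-string>",   # hypothetical
        consumer_group="$Default",
        eventhub_name="patient-telemetry",           # hypothetical
    )
    with client:
        client.receive(on_event=on_event, starting_position="-1")   # "-1" = from stream start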
Project 3: TATA AIG | Insurance Data Consolidation & Claims Reporting Platform
Role: Associate Data Engineer | Aug 2021 - July 2022
Domain: Insurance | Cloud: GCP (Google Cloud Platform)
Challenge:
The insurer struggled with siloed claims, policyholder, and actuarial data spread across multiple systems, slowing down reporting and fraud monitoring.
Contributions:
Built scalable ETL pipelines using GCP Dataflow, BigQuery, and Cloud Storage to unify policy, claims, and customer data from legacy and modern systems.
Developed Python and SQL scripts for data cleansing, schema standardization, and transformation, ensuring audit-ready datasets.
Designed data warehouse layers in BigQuery and Snowflake, enabling consolidated views across actuarial, underwriting, and claims operations.
Automated data ingestion workflows with Apache Airflow (managed via Cloud Composer), minimizing manual intervention in data refresh cycles.
Applied data governance practices, including PII redaction and role-based access controls, to comply with insurance regulations.
Optimized query performance and storage costs in BigQuery using partitioning, clustering, and compression strategies (see the sketch below).
Created self-service dashboards in Tableau and Power BI, giving underwriters and claims analysts real-time visibility into claims ratios, fraud indicators, and customer behavior.
Impact:
Reduced claims data processing time by 40% through automated pipelines.
Improved regulatory reporting accuracy with audit-ready datasets.
Enabled real-time insights for actuaries and claims teams, accelerating decision-making by 2x.
Tech Stack: Python, SQL, GCP (BigQuery, Dataflow, Cloud Storage, Composer), Snowflake, Apache Airflow, Tableau, Power BI
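Illustrative sketch of the BigQuery partitioning and clustering strategy; the dataset, table, and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Daily partitioning on claim_date plus clustering on policy_id prunes scans
    # for the common per-policy, per-period queries, cutting latency and cost.
    ddl = """
    CREATE TABLE IF NOT EXISTS insurance_dw.claims_fact (
      claim_id STRING,
      policy_id STRING,
      claim_date DATE,
      claim_amount NUMERIC
    )
    PARTITION BY claim_date
    CLUSTER BY policy_id
    """
    client.query(ddl).result()   # blocks until the DDL job completes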
Project 4: Flipkart | Retail Data Integration & Reporting Platform
Role: Python Developer / Associate Data Engineer | Jan 2021 - July 2021
Domain: Retail & E-commerce | Cloud: AWS
Challenge:
The retailer needed a unified data platform to consolidate sales, product, and customer data from multiple systems and enable reliable, automated reporting across regions.
Contributions:
Developed Python scripts for ingestion, data validation, and transformation of raw sales and transactional data (see the validation sketch below).
Built ETL pipelines using AWS Glue, Lambda, and S3, automating daily data movement from POS, e-commerce, and CRM systems into centralized storage.
Implemented SQL-based transformations to standardize schema across heterogeneous sources, ensuring consistency and accuracy.
Designed and maintained data warehouse layers in Snowflake, supporting downstream reporting and analytics.
Automated workflow orchestration with Apache Airflow, reducing manual intervention in data refresh processes.
Applied data governance and security controls, including PII redaction, role-based access, and audit logging.
Built lightweight dashboards and KPI reports in Tableau and Power BI, providing managers with insights on store-level performance and product sales.
Impact:
Reduced data integration effort by 60% through automated ETL pipelines.
Improved data reliability and consistency, strengthening trust in reporting outputs.
Enabled faster, near real-time reporting, supporting business teams in inventory and revenue tracking.
Tech Stack: Python, SQL, Snowflake, AWS (Glue, Lambda, S3), Apache Airflow, Tableau, Power BI
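Illustrative sketch of the row-level validation step in the ingestion pipeline; the event shape matches an S3 trigger, while the required columns are hypothetical:

    import csv
    import io
    import boto3

    s3 = boto3.client("s3")
    REQUIRED_COLUMNS = {"store_id", "sku", "quantity", "amount"}   # hypothetical schema

    def handler(event, context):
        # Lambda-style entry point: fires when a raw POS extract lands in S3.
        record = event["Records"][0]["s3"]
        obj = s3.get_object(Bucket=record["bucket"]["name"], Key=record["object"]["key"])
        reader = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"Raw extract missing columns: {sorted(missing)}")
        rejected = sum(1 for row in reader if not row["quantity"].isdigit())
        return {"rejected_rows": rejected}   # surfaced to monitoring before the warehouse load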
ACADEMIC CREDENTIALS
University of North Texas, Denton, USA | Aug 2023 - May 2025
Jawaharlal Nehru Technological University, India | July 2018 - June 2022
ACADEMIC PROJECTS
Machine Learning-Based Classification and Performance Analysis on Appalachia and Non-Appalachia Datasets: Developed predictive models on Appalachia and non-Appalachia datasets, analyzing the impact of demographic, clinical, and socioeconomic factors on health outcomes. Implemented Logistic Regression and XGBoost; XGBoost performed best, with accuracies of 67.18% and 67.03% on the two datasets. Conducted SHAP analysis to assess feature importance and model interpretability, and evaluated models using AUROC curves, confusion matrices, and statistical tests. Results provided insights for policy interventions and highlighted the importance of model tuning and feature selection.
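A minimal sketch of this project's train/evaluate/interpret loop, using synthetic stand-in data rather than the original health datasets:

    import shap
    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the demographic/clinical/socioeconomic feature matrix.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

    model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
    model.fit(X_train, y_train)
    print("AUROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

    explainer = shap.TreeExplainer(model)    # per-feature attribution for the fitted trees
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test)   # ranks features by mean |SHAP| value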
Log Archival & Data Validation System: Implemented a robust log archival and validation system to automate the ingestion, cleaning, and long-term storage of server logs. Developed Python and shell scripts to process over 5 GB of log data weekly, applying row-level validation using SQL. Integrated the pipeline into Jenkins for scheduled execution and CI/CD deployment. Validated logs were archived in AWS S3 with a structured folder layout and metadata for audit compliance. This solution improved system visibility and reduced manual log handling by over 70%.
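A simplified Python sketch of the validate-then-archive step (the production pipeline did row-level checks in SQL; the bucket name, expected column count, and key layout here are hypothetical placeholders):

    import gzip
    from datetime import date
    import boto3

    def archive_log(local_path: str, bucket: str = "log-archive") -> str:
        # Keep only rows with the expected tab-separated column count (row-level validation).
        valid_rows = []
        with open(local_path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if len(line.rstrip("\n").split("\t")) == 5:   # hypothetical schema width
                    valid_rows.append(line)
        # Date-partitioned keys keep the archive browsable for audits.
        key = f"logs/{date.today():%Y/%m/%d}/{local_path.rsplit('/', 1)[-1]}.gz"
        body = gzip.compress("".join(valid_rows).encode("utf-8"))
        boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
        return key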