JAHNAVI AWALA
Senior Data Engineer
Email: ***********@*****.***
Phone: +1-872-***-****
LinkedIn: https://www.linkedin.com/in/a-jahnavi2563/
PROFESSIONAL SUMMARY:
6+ years of experience as a Data Engineer, specializing in designing, developing, and optimizing databases, ETL pipelines, reporting solutions, and data models for mission-critical business operations.
Extensive expertise in architecting end-to-end data solutions on AWS, leveraging services like AWS Glue, EMR, S3, Kinesis, and Athena for scalable, reliable, and efficient data processing.
Proficient in data warehousing and ETL tools, including Informatica, Talend, and Snowflake, with strong skills in data profiling, migration, extraction, transformation, and loading.
Experienced in designing and implementing large-scale data pipelines using Python, PySpark, R, Scala, SQL, and shell scripting, optimizing performance for high-volume datasets.
Expertise in Hadoop ecosystem tools such as HDFS, Spark, MapReduce, Hive, Pig, Sqoop, Avro, Parquet, Presto, and YARN for building data-intensive applications.
Skilled in cloud-based analytics using Azure Data Lake Services (ADLS), Databricks, and Databricks Delta Lake for advanced storage and processing solutions.
Implemented RESTful APIs for seamless integration between frontend applications and backend services.
Experienced in orchestrating complex ETL workflows using Apache Airflow and automating job scheduling with Autosys.
Developed high-performance Spark applications for data ingestion, cleansing, transformation, validation, and aggregation using Spark SQL, Spark Streaming, and RDDs.
Configured Snowflake replication across AWS regions to ensure high availability and disaster recovery.
Optimized Python programs using multithreading and multiprocessing to handle computationally intensive tasks.
Designed and implemented real-time data streaming solutions using Apache Kafka, Kinesis, and event-driven architectures.
Integrated various data storage systems, including RDS, DynamoDB, Teradata, Oracle, and Cassandra, for relational and NoSQL data processing.
Implemented CI/CD pipelines using Jenkins, GitLab CI, and AWS CodePipeline to automate build, test, and deployment processes.
Experienced in Unix/Linux shell scripting and Bash for automating system tasks, file management, and workflow automation.
Leveraged Parquet and optimized storage formats in PySpark pipelines to improve processing efficiency and storage utilization.
Developed and automated robust ETL/ELT pipelines for data ingestion from multiple sources into data lakes and warehouses.
Seamlessly integrated PySpark with cloud platforms like Databricks for scalable and efficient processing.
Hands-on experience with AWS EMR and Glue for large-scale data workflow automation and transformation.
Adept at troubleshooting, monitoring, and performance tuning of large-scale data pipelines to ensure accuracy, efficiency, and reliability.
CERTIFICATION:
AWS Certified Solutions Architect
AWS Certified Cloud Practitioner
TECHNICAL SKILLS:
Cloud Platforms: AWS (Glue, S3, Lambda, EMR, Step Functions, Redshift, DynamoDB, RDS, Athena, IAM, SNS, CodePipeline, CloudFormation, SQS, Fargate), Azure.
Data Warehousing: Snowflake, AWS Redshift, Teradata, Oracle, PostgreSQL, MySQL
Big Data Technologies: Apache Hadoop, HDFS, Spark, PySpark, MapReduce, Hive, Sqoop, Pig, HBase, Zookeeper, Presto, Cassandra
ETL & Data Integration: Informatica, Talend, AWS Glue, Apache NiFi, Snowpipe, Alation
Programming & Scripting: Python, PySpark, SQL, Scala, R, Shell Scripting (Bash), Java
Data Engineering: ETL Development, Data Pipelines, Data Modeling, Data Profiling, Data Migration, Schema Design, Data Governance
Streaming & Orchestration: Apache Kafka, Apache Airflow, AWS Kinesis, Spark Streaming
Containerization & CI/CD: Docker, Kubernetes, Terraform, Jenkins, GitLab CI, AWS CodePipeline
Monitoring & Logging: AWS CloudWatch, Datadog, ELK Stack (Elasticsearch, Logstash, Kibana)
Security & Compliance: Snowflake RBAC, Data Encryption, Access Control Policies
Workflow Automation: Apache Airflow, AWS Step Functions, Autosys
Business Intelligence & Reporting: Power BI, Tableau, Excel
Real-time & Batch Processing: Spark SQL, Spark Streaming, PySpark, Kafka Streams, AWS Lambda
Version Control & DevOps: Git, Bitbucket, Maven, Jenkins, SonarQube
PROFESSIONAL EXPERIENCE:
Truist Financial Corporation – Charlotte, NC August 2023 - Present
Role: Senior Data Engineer
Responsibilities:
Designed, developed, and maintained scalable data pipelines to ingest structured and semi-structured data into Snowflake from sources such as AWS S3, Kafka, and APIs.
Skilled in setting up and managing Kafka clusters, creating topics, and optimizing producer–consumer workflows for reliable, scalable integrations.
Leveraged Parquet columnar storage to reduce data size and accelerate query performance for large-scale analytics on AWS S3 and Snowflake.
Proficient in Snowflake SQL for writing and optimizing complex queries using clustering, materialized views, and caching to improve performance and minimize costs.
Automated ETL pipelines with Snowpipe and SQL for real-time data ingestion and transformation.
Integrated Snowflake with Apache Airflow and AWS services to orchestrate automated workflows and enable real-time data processing.
Conducted advanced statistical analysis using SQL, Python, R, and Excel to identify patterns, reduce inconsistencies, and improve efficiency.
Built and deployed end-to-end ETL pipelines using Python, supporting multiple source systems and transformation rules.
Utilized Parquet’s schema evolution to manage dynamic data models seamlessly, streamlining ETL processes.
Enabled secure and governed data sharing with Snowflake features integrated with AWS Glue.
Optimized data storage and compute costs using Snowflake’s pay-per-use model, with performance improvements via clustering and caching.
Monitored and optimized ETL workflows using AWS CloudWatch, Datadog, and Informatica Monitor.
Managed Talend code artifacts in Git/SVN, ensuring streamlined deployment and migration across environments.
Designed and scaled distributed pipelines with PySpark for batch and real-time analytics.
Leveraged Talend connectors to integrate with cloud services, relational databases, and big data platforms such as AWS, Hadoop, and Spark.
Built serverless pipelines with AWS Step Functions and Lambda for orchestration and transformation.
Developed custom operators and tasks in Airflow with Python to manage dependencies, workflows, and execution monitoring.
Created PySpark Streaming apps for real-time ingestion from Kafka with low-latency and fault-tolerant pipelines.
Defined CI/CD pipelines using AWS CodePipeline and CloudFormation, ensuring repeatable and consistent deployments.
Imported/exported relational data between MySQL, Oracle, and Hadoop ecosystems with Sqoop.
Migrated legacy batch processes to PySpark, achieving significant scalability and performance improvements.
Applied Docker networking for secure, isolated environments and inter-service communication.
Designed Informatica mappings with Lookup, Filter, Aggregator, Router, and Joiner transformations for complex workflows.
Built and optimized MongoDB schemas for scalable, high-performance data storage and retrieval.
Integrated Alation with BI and analytics tools to enhance metadata management and data discovery.
Implemented Kafka ingestion pipelines, activity tracking, and Git-based commit logs for distributed systems.
Deployed and managed Kubernetes clusters, overseeing container orchestration, scaling, and maintenance.
Provisioned and managed DynamoDB, S3, and Redshift clusters using Terraform for big data pipelines.
Designed ETL solutions across diverse sources using SQL, Python, Talend, Shell scripting, and scheduling tools.
Developed automated unit tests in JUnit/pytest to ensure reliability of BI reporting solutions managing terabytes of data.
Built optimized custom Docker images to improve scalability and performance.
Configured and utilized Hadoop ecosystem components for distributed data processing.
Led efforts to store and manage terabytes of customer data on AWS for BI and reporting solutions.
Geico – Chevy Chase, MD June 2021 - July 2023
Role: AWS Data Engineer/ETL Developer
Responsibilities:
Developed and optimized big data applications using Python and PySpark, enhancing data processing performance and scalability across various data sets.
Built and optimized ETL workflows using tools like Informatica, AWS Glue, Apache NiFi, and Talend, ensuring scalable and reliable data processing.
Utilized Snowflake for data warehousing and analytics, optimizing SQL queries to improve performance and data retrieval times. Collaborated with analysts to refine data models and ensure alignment with business reporting needs.
Integrated SQS queues with AWS Lambda functions and other AWS services to automate message-driven workflows and event processing.
Leveraged AWS CloudFormation to define reusable infrastructure templates for EC2 instances, RDS databases, and S3 storage, simplifying deployments and enhancing infrastructure consistency.
Integrated CloudFormation with AWS CloudWatch to automate the creation of monitoring and alerting resources, ensuring proactive identification and resolution of issues in deployed applications.
Employed Apache Kafka for real-time data streaming, facilitating the integration of diverse data sources. Developed producers and consumers to enable near-instantaneous data processing and enhance overall system responsiveness.
Designed and implemented MapReduce jobs to process large-scale data sets, performing operations like sorting, filtering, and aggregations.
Integrated Apache Airflow with AWS (S3, Redshift, EMR), Snowflake, BigQuery, and other cloud-based platforms for seamless workflow execution.
Used Terraform's remote backends with encryption to securely store sensitive configurations and avoid state corruption in multi-user environments.
Implemented data processing pipelines in Scala, utilizing Apache Spark and Akka Streams for real-time and batch processing.
Utilized PySpark's Graph Frames for processing and analyzing large-scale graph datasets for relationship modeling.
Developed testing frameworks to identify and resolve data inconsistencies proactively.
Configured Kafka Streams and Kafka Connect to enable streaming ETL pipelines and integrate with systems like HDFS, Elasticsearch, and Snowflake.
Analyzed and visualized large datasets using Python libraries, presenting insights and trends to stakeholders for informed decision-making. Created dashboards and reports to communicate findings effectively.
Documented technical specifications and best practices for big data development processes, contributing to knowledge sharing and team efficiency.
Designed and maintained NoSQL databases to handle large volumes of unstructured data, focusing on efficient data storage, indexing, and retrieval.
Developed pipelines using Terraform Cloud APIs to automate approval workflows for infrastructure changes.
Enhanced data accessibility and query performance through strategic database design.
Configured CI/CD pipelines to integrate with containerization platforms like Docker and Kubernetes for consistent application delivery.
Managed Git repositories for version control, enabling effective collaboration and tracking of code changes across development teams. Implemented branching strategies to optimize development workflows.
Amerigroup – Virginia Beach, VA August 2019 - May 2021
Role: Data Engineer
Responsibilities:
Gathered data and business requirements from end users and management, designing and implementing data solutions to migrate existing source data from the Data Warehouse to Atlas Data Lake (Big Data).
Developed and managed robust data pipeline solutions for a large-scale data analytics project, aimed at integrating and processing data from multiple sources to enable actionable business insights.
Managed data integration projects with tools like Apache NiFi and Informatica to consolidate information from various sources into a unified database.
Designed and implemented ETL processes to extract data from various sources, transform it into a suitable format, and load it into data warehouses.
Implemented Apache Airflow to orchestrate and automate complex data workflows, ensuring smooth and reliable execution of ETL processes and data integration tasks.
Designed and deployed cloud-based solutions utilizing a wide array of Azure services, focusing on high availability, fault tolerance, and disaster recovery.
Developed PySpark pipelines on Azure Databricks, leveraging its auto-scaling and collaborative features for high-performance data processing.
Utilized a comprehensive suite of big data processing tools, including Apache Spark, Apache Hadoop, Apache Flink, and Apache Kafka, to handle extensive data processing and analytics tasks.
Managed and optimized the Hadoop ecosystem, including HDFS for scalable storage and YARN for resource management for healthcare-based data.
Configured and tuned MapReduce jobs for efficient large-scale data processing and utilized Oozie for workflow orchestration and Zookeeper for coordination in distributed systems.
Employed Apache Spark for advanced data processing, both in batch and real-time contexts.
Leveraged Spark’s in-memory computing capabilities to significantly accelerate data processing tasks.
Utilized Azure Synapse Analytics for data warehousing solutions, enabling real-time analytics on large datasets and ensuring efficient querying with optimized performance.
Developed and executed data migration strategies using SQL Server Integration Services (SSIS), Talend, and Pentaho Data Integration to transfer data between systems.
Extracted data from various sources such as SQL databases, CRM systems, web APIs, and flat files (e.g., CSV, Excel), implementing automated data validation rules to check for inconsistencies and errors.
Converted raw data into a structured format suitable for analysis, including normalization and aggregation.
Assessed relationships between variables using correlation coefficients and regression analysis.
Used Azure Monitor and Azure Log Analytics to monitor and troubleshoot application performance, ensuring effective alerting and diagnostics.
Created ad-hoc reports and visualizations to support business decision-making.
Developed custom analytical queries and used tools like Tableau and Power BI to build dashboards that displayed critical business metrics and trends.
EDUCATION DETAILS:
Master of Science in Data Analytics Management, Indiana Wesleyan University - May 2025
Bachelor of Commerce (General), Vivekananda Degree College, Kukatpally, India - May 2019