Data Engineer
Ashwitha Vallapureddy
+1-469-***-**** ********************@*****.*** LinkedIn
PROFESSIONAL SUMMARY:
•Overall 5+ years of professional experience as a Data Engineer in the design, development, deployment, and support of large-scale distributed systems.
•Specialized in the Big Data ecosystem: data acquisition, ingestion, modeling, analysis, integration, and data processing.
•Designed, implemented, deployed, and maintained complete end-to-end Hadoop-based data analytics solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, and HBase.
•Profound experience with AWS services such as Amazon EC2, S3, EMR, Amazon RDS, VPC, Elastic Load Balancing, IAM, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and Lambda to trigger resources.
•Experience in building data pipelines using Azure Data Factory and Azure Databricks, loading data to Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
•Profound experience with Azure services like HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, Storage Explorer.
•Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
•Strong Hadoop and platform support experience with the entire suite of tools and services in the major Hadoop distributions: Cloudera, Amazon EMR, Azure HDInsight, and Hortonworks.
•Proficient in handling and ingesting terabytes of streaming data (Kafka, Spark Streaming, Storm) and batch data, with automation and scheduling via Oozie and Airflow.
•Expertise in writing Hadoop Jobs using MapReduce, Apache Crunch, Hive, Pig, and Splunk.
•Profound knowledge in developing production-ready Spark applications using Spark Components like Spark SQL, MLlib, GraphX, DataFrames, Datasets, Spark-ML and Spark Streaming.
•Expertise in developing multiple Confluent Kafka producers and consumers to meet business requirements, storing the streamed data in HDFS and processing it with Spark.
•Strong working experience with SQL and NoSQL databases (Cosmos DB, MongoDB, HBase, Cassandra), data modeling, tuning, disaster recovery, backup and creating data pipelines.
•Experienced in scripting with Python (PySpark), Java, Scala, and Spark SQL for development and for aggregating data from various file formats such as XML, JSON, CSV, Avro, Parquet, and ORC (see the illustrative PySpark sketch at the end of this summary).
•Strong experience in data analysis using HiveQL, Hive ACID tables, Pig Latin queries, and custom MapReduce programs, achieving improved performance.
•Strong experience working with Amazon cloud web services such as EMR, Redshift, DynamoDB, Lambda, Athena, S3, RDS, and CloudWatch for efficient processing of Big Data.
•Strong skills in visualization tools: Power BI, Tableau, and Excel (formulas, Pivot Tables, Charts) and DAX commands.
•Expertise in various phases of project life cycles (Design, Analysis, Implementation, and testing).
•Experience in Database Design and development with Business Intelligence using SQL Server, Integration Services (SSIS), DTS Packages, SQL Server Analysis Services (SSAS), DAX, OLAP Cubes, Star Schema and Snowflake Schema.
•In-depth knowledge of data warehouse concepts, Star Schema and Snowflake Schema designs, and dimensional modeling.
•Expertise in Data Warehouse Testing Methodologies. Extensive experience in developing UAT Test plans, Test Cases, creating and editing test datasets, and generating/executing SQL Test Scripts and Test results.
•Experience in writing Hive UDFs and generic UDFs to incorporate complex business logic into Hive queries.
•Good working experience using Sqoop to import data from RDBMS to HDFS and vice versa.
•Strong experience in collecting and storing stream data like log data in HDFS using Apache Flume.
•Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
•Expertise in automating Jenkins builds for code in Ruby, YAML, Python, Shell, PowerShell, JSON, PHP, and Perl, triggered from GitHub, to run web applications on AWS Elastic Beanstalk, including an EC2 build server for continuous delivery with less failover.
•Experience in setting up infrastructure using AWS services including ELB, EC2, Elastic Container Service (ECS), Auto Scaling, S3, IAM, VPC, Redshift, DynamoDB, CloudTrail, CloudWatch, ElastiCache, Lambda, SNS, Glacier, CloudFormation, SQS, EFS, and Storage Gateway.
•Experience in using Microsoft BI studio products like SSIS, SSRS, and Big Data tools like Apache Hadoop for implementation of ETL methodologies like data extraction, transformation, and loading.
•Proficient with Python including NumPy, SciPy, Pandas, Scikit-learn, Matplotlib, and TensorFlow.
• Extensive experience with Scrum, Agile and Waterfall software development methodologies.
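A minimal PySpark sketch of the multi-format aggregation pattern referenced above; the bucket paths and column names are hypothetical placeholders.

```python
# Minimal PySpark sketch: aggregating data from mixed file formats (JSON + Parquet).
# All paths and column names (order_id, order_date, amount) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("format-aggregation-sketch").getOrCreate()

# Read the same logical dataset from two different storage formats.
orders_json = spark.read.json("s3://example-bucket/raw/orders/*.json")
orders_parquet = spark.read.parquet("s3://example-bucket/curated/orders/")

# Union on a shared schema and compute daily totals.
daily_totals = (
    orders_json.select("order_id", "order_date", "amount")
    .unionByName(orders_parquet.select("order_id", "order_date", "amount"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"),
         F.count("order_id").alias("order_count"))
)

# Persist the aggregate for downstream reporting.
daily_totals.write.mode("overwrite").parquet("s3://example-bucket/aggregates/daily_orders/")
```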
TECHNICAL SKILLS:
Hadoop/Big Data Technologies: HDFS, Apache NiFi, MapReduce, Sqoop, Flume, Pig, Hive, Oozie, Impala, Zookeeper, Ambari, Spark, Kafka
NoSQL Databases: HBase, Cassandra, MongoDB
Monitoring and Reporting: Tableau, custom shell scripts
Hadoop Distributions: Apache Hadoop, Hortonworks, Cloudera, MapR, Amazon EMR (EC2, EBS, RDS, S3, Glue, Elasticsearch, Lambda, Kinesis, SQS, DynamoDB, Redshift, ECS), Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB, Azure DevOps, Active Directory)
Build and Deployment Tools: Maven, sbt, Git, SVN, Jenkins
Programming and Scripting: Scala, SQL, UNIX shell scripting, Python, Pig Latin, HiveQL
Databases: Oracle, MySQL, MS SQL Server, Teradata
Analytics Tools: Tableau, Microsoft SSIS, SSAS, SSRS
Web Dev. Technologies: HTML, XML, JSON, CSS, jQuery, JavaScript
IDE Dev. Tools: PyCharm, Vi/Vim, Sublime Text, Jupyter Notebook
Operating Systems: Linux, Unix, Windows 8, Windows 7, Windows Server 2008/2003, Mac OS
AWS Services: EC2, EMR, S3, Redshift, Lambda, Glue, Data Pipeline, Athena
Microsoft Azure: Databricks, Data Lake, Blob Storage, Azure Data Factory, SQL Database, SQL Data Warehouse, Cosmos DB, Azure Active Directory
Network Protocols: TCP/IP, UDP, HTTP, DNS, DHCP
Methodologies: Agile/Scrum, Waterfall
Others: Machine Learning, NLP, StreamSets, Terraform, Docker, Chef, Ansible, GCP, Jira, Spring Boot
WORK EXPERIENCE:
Client: Amex, Plano, TX. Jun 2024 - Present
Role: Azure Data Engineer
Responsibilities:
Engineered and managed 40+ enterprise-grade ETL pipelines using Azure Data Factory (ADF) and Apache Airflow to integrate transaction, settlement, and reference data from Azure Blob Storage, SQL Server, and external APIs. These pipelines supported liquidity risk monitoring, credit exposure tracking, and regulatory reporting.
Implemented and secured Azure Integration Runtime (IR) for hybrid on-premises/cloud ingestion of core banking and treasury data, improving data availability for Basel III liquidity dashboards and credit risk analytics by 40%.
Developed a metadata-driven ingestion framework with reusable templates, dynamic parameters, and linked services in ADF. This streamlined onboarding of new financial instruments, portfolios, and counterparties, cutting integration timelines by 70%.
Designed real-time streaming pipelines with Apache Kafka and Azure Event Hub to ingest trade, payments, and FX transaction flows into Delta Lake (see the illustrative streaming sketch at the end of this section). This improved data freshness by 80%, enabling near real-time monitoring of intraday liquidity and market risk.
Consolidated Delta Lake with Microsoft Fabric layers to replicate curated datasets into OneLake, achieving a 60% improvement in dashboard refresh times and enabling timely financial reporting across capital, credit, and liquidity domains.
Architected a dual analytics platform using Apache Iceberg, Delta Lake, and DirectLake. This allowed separation of regulatory reporting workloads, risk simulations, and executive BI, while ensuring governance, scalability, and auditability.
Integrated Azure ML models into PySpark workflows to detect fraudulent transactions, unusual credit exposures, and payment anomalies. This reduced false positives by 25%, enhancing the bank’s surveillance and compliance posture.
Refactored legacy ETL workloads in SQL Server and Oracle core banking systems by leveraging materialized views, optimized CTEs, and partitioning. This reduced processing times by 70%, accelerating regulatory and financial close cycles.
Designed and optimized SQL queries and transformations in Snowflake for regulatory and financial reporting workloads.
Automated FX exposure and Value-at-Risk (VaR) calculations with modular PySpark transformations and dbt assertions, reducing manual validation by 70% and ensuring SLA adherence for daily risk reporting.
Modeled Data Vault and Star Schema warehouses in Azure Synapse to support both executive-level P&L and balance sheet aggregation as well as granular drilldowns into loans, deposits, and derivatives portfolios.
Enforced data governance and compliance by implementing Azure Purview with tag-based classifications, sensitivity labels, and row-level security. This safeguarded 10+ PII- and PCI-sensitive datasets and ensured adherence to SOX, GDPR, and OCC standards.
Applied fine-grained RLS and Purview-driven governance in Microsoft Fabric DirectLake datasets to secure regulatory submissions, liquidity ratios, and credit risk dashboards, ensuring auditable and defensible reporting.
Delivered observability and diagnostics dashboards in Splunk and Azure Monitor, reducing MTTR by 90% for incidents impacting critical banking pipelines such as settlements, reconciliations, and capital reporting.
Implemented Fabric Real-Time Analytics pipelines to process high-frequency FX trades and payment events from Event Hub directly into DirectLake-backed datasets, enabling immediate availability for Treasury and Risk BI consumption.
Defined and enforced data contracts and SLAs across five financial reporting domains (credit, market, liquidity, capital, compliance), ensuring alignment with data mesh principles and regulatory timelines.
Implemented CI/CD pipelines in Azure DevOps/Jenkins with YAML templates and REST APIs for data pipeline deployments, ensuring reproducibility, version-controlled rollouts, and faster releases across banking and regulatory systems.
Configured Microsoft Fabric Data Activator to monitor capital flows, settlement breaks, and liquidity thresholds, automatically triggering alerts in Power BI dashboards and ServiceNow queues to streamline financial risk escalation.
Provided mentorship and leadership by training analysts and junior engineers on banking-focused use cases for Microsoft Fabric, PySpark design patterns, and governance frameworks, ensuring best practices for regulatory reporting and risk data management.
Environment: Azure Data Factory (ADF), Apache Airflow, Azure Synapse Analytics, Delta Lake, Apache Iceberg, Microsoft Fabric (DirectLake, Dataflows Gen2, Data Activator), Azure Blob Storage, Azure Event Hub, Apache Kafka, PySpark, dbt, Azure Machine Learning, SQL Server, Oracle, Splunk, Azure Monitor, Azure Purview, Power BI, ServiceNow, Azure DevOps (YAML Pipelines, REST APIs), Excel (advanced functions, VBA).
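A minimal PySpark Structured Streaming sketch of the Kafka/Event Hub-to-Delta Lake ingestion pattern described in this section; the broker address, topic name, schema fields, and storage paths are hypothetical placeholders.

```python
# Illustrative sketch: stream trade events from a Kafka-compatible endpoint into Delta Lake.
# Broker, topic, schema fields, and paths below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fx-trade-stream-sketch").getOrCreate()

trade_schema = StructType([
    StructField("trade_id", StringType()),
    StructField("currency_pair", StringType()),
    StructField("notional", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Event Hub exposes a Kafka-compatible endpoint, so the Kafka source can read from it.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "fx-trades")                   # hypothetical topic
    .load()
)

# Parse the JSON payload and append it to a Delta table for downstream risk reporting.
parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), trade_schema).alias("t"))
    .select("t.*")
)

query = (
    parsed.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/datalake/checkpoints/fx_trades")  # hypothetical path
    .outputMode("append")
    .start("/mnt/datalake/delta/fx_trades")
)
query.awaitTermination()
```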
Client: PepsiCo, Plano, TX. Sep 2022 – May 2024
Role: Data Engineer
Responsibilities:
Hands-on experience working with the Cloudera Hadoop distributed platform and Microsoft Azure.
Experience in working with Azure Cloud platform (HDInsight, Databricks, Data Lake, Blob, Data Factory, Synapse, SQL DB, SQL DWH).
Designed and deployed data pipelines using Data Lake, Databricks, and Apache Airflow.
Developed and automated shell scripts to schedule, monitor, and optimize ETL pipelines, reducing manual intervention by 40%.
Migrated complex legacy scheduling workflows from Oozie to Airflow (see the illustrative DAG sketch at the end of this section); adaptable to similar migrations such as Tidal to Airflow.
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Implemented containerized applications on Azure Kubernetes Service (AKS), using Kubernetes clusters for cluster management, a Virtual Network to deploy agent nodes, an Ingress API Gateway, MySQL databases, and Cosmos DB for stateless storage of external data, and set up an Nginx reverse proxy in the cluster.
Managed Azure infrastructure: Azure Web Roles, Worker Roles, SQL Azure, Azure Storage, and Azure AD licenses; performed virtual machine backup and recovery from a Recovery Services Vault using Azure PowerShell and the Azure Portal.
Used the Serverless Framework to deploy multiple Lambda functions, IAM roles, and API Gateway, containerizing the functions so that they remain platform independent.
Deployed and optimized two-tier Java and Python web applications through Azure DevOps CI/CD to focus on development.
Designed, installed, administered, and optimized hybrid cloud components (Azure AD, ADFS, SSO, and VPN gateways) to ensure business continuity.
Architected, planned, and migrated servers, relational databases (SQL), and websites to the Azure cloud.
Deployed Azure IaaS virtual machines (VMs) and Cloud services (PaaS role instances) into secure VNets and subnets.
Script, debug and automate PowerShell scripts to reduce manual administration tasks and cloud deployments.
Built reports using Power BI and displayed them in an Angular application with features such as sharing, printing, and personalized reports.
Worked with Azure Data Factory (v1 & v2), Azure storage accounts, Data Lake Store, Data Lake Analytics, Logic Apps, Azure Automation accounts, Azure services, Azure Databricks, SQL, Oracle, PostgreSQL, Cassandra, CouchDB, and MongoDB.
Used Azure Key Vault to store secrets and configured ADF pipelines to retrieve connection string secrets from Key Vault at run time.
Wrote U-SQL scripts for delta loads from the ADLA (Azure Data Lake Analytics) catalog; created end-to-end workflows using Azure Logic Apps, including a Logic App that sends notification emails based on an HTTP POST trigger.
Built a program with Python and Apache Beam and executed it in Cloud Dataflow to validate data between raw source files and BigQuery tables.
Processed schema-oriented and non-schema-oriented data using Scala and Spark.
Provided architecture and design as product is migrated to Scala, Play framework and Sencha UI.
Implemented Docker Swarm to deploy, load balance, scale, and manage Docker containers with multiple namespaced versions, and integrated cluster management with Docker Engine using Docker Swarm.
Implemented applications with Scala along with Akka and Play framework.
Developed and refined analytical solutions, data-intensive systems, and workflows.
Created Partitions, Buckets based on State to further process using Bucket based Hive joins.
Created Hive Generic UDF's to process business logic that varies based on policy.
Experienced with different kinds of compression techniques such as LZO, GZip, and Snappy.
Gained experience in managing and reviewing Hadoop log files.
Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
Environment: Azure HDInsight, Databricks, Data Lake (Gen2), Data Factory (v2), Azure DevOps, MS SQL, Cosmos DB, MongoDB, Teradata, Ambari, Flume, HDFS, MapReduce, YARN, Spark, Hive, Sqoop, Python, Scala, GIT, JIRA, Jenkins, Apache Beam, Apache Airflow, SVN, Kubernetes, Docker, Ansible, Terraform, UDF, Snappy, LZO, ADF Pipeline, Power BI, NoSQL, Oracle, PostgreSQL, Cassandra.
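A minimal Apache Airflow DAG sketch of the Oozie-to-Airflow migration pattern described in this section; the DAG id, schedule, and task commands are hypothetical placeholders.

```python
# Illustrative Airflow DAG replacing an Oozie coordinator; names and commands are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_sales_ingest",          # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 2 * * *",        # nightly run, replacing an Oozie coordinator
    catchup=False,
    default_args=default_args,
) as dag:
    # Each task mirrors an action node from the original Oozie workflow.
    extract = BashOperator(task_id="extract", bash_command="python /opt/jobs/extract.py")
    transform = BashOperator(task_id="transform", bash_command="spark-submit /opt/jobs/transform.py")
    load = BashOperator(task_id="load", bash_command="python /opt/jobs/load_to_synapse.py")

    extract >> transform >> load
```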
Client: Novartis, East Hanover, NJ. Jan 2021- July 2022
Role: AWS Data Engineer
Responsibilities:
Designed, built, and maintained data integration programs in a Hadoop and RDBMS environment, working with both traditional and non-traditional source systems, as well as RDBMS and NoSQL data stores, for data access and analysis. Extensive experience with Python, including the creation of a custom ingest framework.
Created and sustained an optimal data pipeline architecture; imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for datasets processing and storage. Experience in maintaining the Hadoop cluster on AWS EMR.
Designed and developed ETL processes on AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift (see the illustrative Glue sketch at the end of this section).
Designed and developed functionality to get JSON document from MongoDB document store and send it to the client using RESTful web service. Implemented a Data interface to get information of customers using REST API and pre-process data using MapReduce and store it into HDFS.
Installed and designed solutions with Apache Hadoop big data components such as HDFS, MapReduce, YARN, Hive, HBase, Sqoop, Pig, and NiFi.
Involved in loading data into S3 buckets using AWS Glue and PySpark. Involved in filtering data stored in S3 buckets using Elastic Search and loaded data into Hive External tables.
Created the infrastructure needed for optimal data extraction, transformation, and loading from a wide range of data sources.
Used Spark Streaming to stream data from external sources via the Kafka service into the Hadoop cluster maintained on AWS EMR, with EC2 for processing datasets and S3 for storing small datasets.
Responsible for loading data from the internal server and the Snowflake data warehouse into S3 buckets.
Used AWS Athena extensively to ingest structured data from S3 into various systems such as Redshift and to generate reports.
Created Pods and controlled them using Kubernetes, using Jenkins pipelines to push all micro service builds to the Docker registry and then deploy to Kubernetes.
Constructed and maintained Kubernetes container clusters on Google Cloud Platform (GCP) using Linux, Bash, Git, and Docker; the CI/CD framework used Kubernetes and Docker as the runtime environment for developing, evaluating, and deploying services.
Used the Spark API over Hadoop YARN on the EMR cluster to analyze Hive data.
Designed AWS CloudFormation templates to create VPCs, subnets, and NAT gateways to ensure the successful deployment of web applications and database templates.
Proficient in NiFi data handling, NiFi Registry and Version Controlling.
Implemented Continuous Integration using Jenkins, which tracks the source code changes.
Experience in installing, configuring, supporting, and managing Hadoop clusters using HDP and other distributions.
Expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached & Redis).
Implemented end-to-end data pipeline using FTP Adaptor, Spark, Hive, and Impala.
Experience in dealing with both Agile and waterfall methods in a fast-paced manner.
Created external tables with partitions using AWS Athena and Redshift.
Experience working with RDBMS including Oracle/ DB2, SQL Server, PostgreSQL 9.x, MS Access, and Teradata for faster access to data on HDFS.
Environment: AWS, Apache NiFi, NiFi REST API, NiFi Registry, MS SQL Server, JavaScript, PowerShell, Jenkins, JIRA, LINUX/UNIX, Hadoop YARN, Spark, Spark Streaming, Spark SQL, Scala, Python, Hive, Sqoop, Cassandra, S3, EMR, EC2, Kibana, RDS, Redshift, DynamoDB, Docker, Kubernetes, Ansible, Terraform, PostgreSQL.
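A minimal AWS Glue (PySpark) sketch of the S3-to-Redshift ETL pattern described in this section; the catalog database, table, connection name, and staging path are hypothetical placeholders.

```python
# Illustrative AWS Glue job: read cataloged S3 campaign data and write it to Redshift.
# Database, table, connection, and staging-path names are hypothetical.
import sys
from awsglue.transforms import DropNullFields
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read campaign files cataloged from S3 (Parquet/ORC/text sources).
campaigns = glue_context.create_dynamic_frame.from_catalog(
    database="marketing_raw",          # hypothetical Glue catalog database
    table_name="campaign_events",      # hypothetical catalog table
)

# Drop empty fields and write the cleaned set to Redshift via a Glue connection.
cleaned = DropNullFields.apply(frame=campaigns)
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=cleaned,
    catalog_connection="redshift-conn",            # hypothetical Glue connection
    connection_options={"dbtable": "analytics.campaign_events", "database": "dev"},
    redshift_tmp_dir="s3://example-bucket/tmp/",   # staging location required by the Redshift writer
)
job.commit()
```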
Client: Country Financial Jan 2020 - Dec 2020
Role: Big Data Engineer
Responsibilities:
Imported data using Sqoop to load data from MySQL into HDFS on a regular basis.
Performing aggregations on large amounts of data using Apache Spark, Scala, and landing data in Hive warehouse for further analysis.
Worked with Data Lakes and big data ecosystems (Hadoop, Spark, Hortonworks, Cloudera)
Load and transform large sets of structured, semi structured, and unstructured data.
Wrote Hive queries for data analysis to meet business requirements.
Built HBase tables by leveraging HBase integration with Hive on the Analytics Zone.
Worked with Zookeeper, Oozie, and Data Pipeline Operational Services for coordinating the cluster and scheduling workflows.
Developed data pipeline using Flume, Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
Hands on experience in using Kafka, Spark streaming, to process the streaming data in specific use cases.
Worked on analyzing Hadoop cluster using different big data analytic tools including Pig, Hive, and MapReduce.
Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.
Wrote Hive queries for data analysis to meet the specified business requirements by creating Hive tables and working on them using Hive QL to simulate MapReduce functionalities.
Migrated the existing data to Hadoop from RDBMS (Oracle) using Sqoop for processing the data.
Hands on experience on AWS platform with EC2, S3 & EMR.
Implemented CI/CD pipelines to build and deploy the projects in the Hadoop environment.
Implemented Installation and configuration of multi-node cluster on Cloud using AWS on EC2.
Using JIRA to manage the issues/project workflow.
Worked on Spark using Python (PySpark) and Spark SQL for faster testing and processing of data.
Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
Used Ambari web UI to load, manage and review terabytes of log files and used Zookeeper to coordinate, synchronize and serialize the servers within the clusters. Worked on Oozie workflow engine for job scheduling.
Used Git as version control tools to maintain the code repository.
Working experience with RDDs and DataFrames (Spark SQL) using PySpark for analyzing and processing data (see the illustrative sketch at the end of this section).
Designed and implemented ETL pipelines from various relational databases to the data warehouse using Apache Airflow.
Environment: Sqoop, MySQL, HDFS, Apache Spark, Scala, Hive, Hadoop, Cloudera, HBase, Kafka, Pig, MapReduce, Zookeeper, Oozie, Data Pipelines, RDBMS, AWS, EC2, Python, PySpark, Ambari, JIRA.
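A minimal PySpark sketch of the DataFrame/Spark SQL analysis over Hive tables described in this section; the database, table, and column names are hypothetical placeholders.

```python
# Illustrative PySpark analysis over a Hive table; all names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("purchase-history-analysis")
    .enableHiveSupport()          # lets Spark SQL read tables from the Hive warehouse
    .getOrCreate()
)

# Load customer purchase history previously ingested into Hive via Sqoop/Flume.
purchases = spark.table("analytics.purchase_history")

# Aggregate spend per customer and keep the higher-value segments for reporting.
customer_spend = (
    purchases.groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spend"),
         F.countDistinct("order_id").alias("orders"))
    .filter(F.col("total_spend") > 1000)
)

# Persist the results back to the Hive warehouse for downstream BI queries.
customer_spend.write.mode("overwrite").saveAsTable("analytics.customer_spend_summary")
```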