PUJITHA MALLADI
Data Engineer
Phone: +1-572-***-**** | Email: ***********@*****.***
Professional Summary:
● Proficient IT professional with 7+ years of industry experience in data modeling, data security, and data architecture, tackling challenging architectural and scalability problems.
● Experienced in designing data pipelines on premises and in the cloud, with hands-on experience in Amazon EC2, S3, RDS, Elastic Load Balancing, Auto Scaling, and EMR in the AWS family, and Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse in Azure.
● Experience in ETL, data migration, and data processing using AWS services such as EMR, EC2, Athena, Glue, Lambda, S3, Relational Database Service (RDS), and other services in the AWS family.
● Good hands-on knowledge of the Amazon Web Services (AWS) cloud platform, including services like EC2, S3, DynamoDB, Auto Scaling, Security Groups, EC2 Container Service (ECS), and Redshift.
● Knowledge of storage and supporting services in AWS, including Elastic Block Store, S3, Glacier, CloudFormation, CloudTrail, OpsWorks, Kinesis, IAM, SQS, SNS, and SES.
● Expertise working with EC2 instances to compute and process large volumes of data across a wide range of applications.
● Skilled in transferring data from source systems to Amazon Redshift and S3 using AWS data pipelines. Worked on Spark Databricks clusters on the AWS cloud, including cluster sizing, monitoring, and troubleshooting.
● Used Terraform to automate provisioning and configuration of resources within the AWS Virtual Private Cloud.
● Composed data in real time using Spark Streaming from S3 buckets, performing the necessary transformations and aggregations on the fly to shape the common learner data model and persist the data in HDFS.
● Expertise in implementing and maintaining data pipelines using Apache Airflow.
● Experience with Azure Cloud Services, SQL Azure, Azure Analysis Services, Azure Monitoring, and Azure Data Factory. Worked closely with the Azure platform, with hands-on experience in Azure Data Lake, Blob Storage, Synapse, Data Storage Explorer, SQL, SQL DB, and Data Warehouse.
● Proficient in building data pipelines and loading data using Azure Data Factory, Azure Databricks, and Azure Data Warehouse while controlling access to the database.
● Well versed in developing data processing jobs to analyze data using MapReduce, Spark, and Hive.
● Excellent understanding of Spark architecture, including Spark Core, Spark SQL, Spark Context, Spark SQL DataFrame APIs, driver node, pair RDDs, worker nodes, stages, executors, and tasks.
● Strong understanding of and experience in Hadoop-based development of enterprise-level solutions utilizing components such as Apache Spark, Sqoop, Hive, HBase, Oozie, Kafka, and ZooKeeper.
● Developed Spark applications using Spark SQL, PySpark and Delta Lake in Databricks to extract, transform and aggregate multiple file formats for analyzing and transforming the data.
● Expertise in building Spark and PySpark applications for interactive analysis, batch processing, and real-time processing.
● Extensively used the Cloudera platform for Hive data and used Spark DataFrame operations for validating the data.
● Worked on importing continuously updated data using Sqoop in last-modified and last-updated modes.
● Hands-on Programming knowledge on Python (Pandas, NumPy), PL/SQL, Scala, PySpark.
● Good knowledge on converting Hive/SQL Queries into Spark Transformations with Datasets and Data frames.
● In-depth knowledge of Hadoop architecture and its components such as YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, TaskTracker, and the MapReduce programming paradigm.
● Very strong data development skills with ETL, Oracle SQL, Linux/UNIX, data warehousing and data modeling.
● Worked to design complex data models both in real-time and offline analytic processing and provided support for data profiling and data quality functions.
● Experienced with Git for version control, CI/CD and automation tools such as Jenkins and Ansible, and test-driven code development.
● Hands-on experience in building statistical models using Python, PySpark, and SQL, with strong programming proficiency in Python, R, and Linux shell scripting.
● Experienced in data management and data analysis in a relational database, Hadoop and Spark.
● Expertise in relational database administration including configuration, implementation, data modeling, maintenance, redundancy/HA, security, troubleshooting/performance tuning, upgrades, database, data and server migrations, SQL.
● Experience working with Agile software development methodologies.
Technical Skills:
AWS Big Data Technologies: Amazon AWS (EMR, EC2, RDS, EBS, S3, Athena, Elasticsearch, Lambda)
Azure Big Data Technologies: Azure Data Lake, Azure Databricks, Blob Storage, Synapse, Data Storage Explorer, Azure Active Directory, HDInsight, Cosmos DB
Hadoop Big Data Technologies: Airflow, Hadoop, Spark, Sqoop, Hive, HBase, Oozie, Flume, Kafka, ZooKeeper, Cloudera Manager
ETL Tools: AWS Glue, Azure Data Factory
Databases: MySQL, Teradata, Oracle, MS SQL Server, PostgreSQL, DB2
Version Control: Git
Database Modelling: Dimensional Modelling, ER Modelling, Star Schema Modelling, Snowflake Modelling
Monitoring and Reporting: Tableau, Power BI, Datadog
Programming Languages: Python, PySpark, Scala, PowerShell, HiveQL
IDE Tools: Eclipse, Jupyter, Anaconda, PyCharm
Others: ADO, Terraform, Docker, Kubernetes, Jenkins, Jira
Professional Experience:
TriHealth, OH. Jan/2023 – Present.
Role: Data Engineer
● Designed and developed an enterprise data lake that accepts various types of data from multiple data sources.
● Worked closely with Data and Analytics team to architect and build Data Lake using various AWS services like EMR, S3, Athena, Glue, Apache Airflow and HIVE.
● Designed and implemented a real-time data pipeline to work with semi-structured data, ingesting large volumes of raw records from multiple data sources using Kinesis Data Streams and Kinesis Data Firehose.
● Created an AWS Lambda function to read from the producer and write records to an Amazon DynamoDB table as they arrive (see the sketch at the end of this section).
● Worked with EMR to transform and move big data into AWS data stores and databases in S3 and DynamoDB.
● Created and launched AWS EC2 instances to execute jobs on EMR to store the results in S3.
● Designed and set up an enterprise data lake to provide support in multiple areas such as analytics, processing, storing, and reporting of big and rapidly changing data.
● Worked with the Hadoop Distributed File System (HDFS), S3 storage, and big data formats such as Parquet and JSON.
● Configured AWS Redshift clusters, Redshift Spectrum for querying, and Redshift data sharing for transferring data among clusters.
● Automated the ETL processes using Apache Airflow and S3 for storing the data in batches.
● Developed Spark Applications by using Python and Implemented Apache Spark data processing Project to handle data from various RDBMS and Streaming sources.
● Developed Python code to satisfy requirements and perform data processing and analytics using built-in libraries.
● Used Apache Spark with Python to develop and execute Big Data Analytics applications.
● Created Lambda functions with Boto3 to clean up unused AMIs in all application regions to reduce EC2 resource costs.
● Worked with Spark to improve performance and optimization of the existing algorithms.
● Deployed Applications on AWS EC2 instances and configured the storage on S3 buckets.
● Configured S3 buckets with lifecycle policies to automatically archive infrequently accessed data to lower-cost storage classes.
● Analyzed raw files in the S3 data lake using AWS Athena and Glue without loading the data into a database.
● Implemented IAM roles for resources like EC2, S3, and RDS to communicate with each other.
● Analyzed complex data and identified anomalies, trends, and risks to provide insights that improved internal controls.
● Configured computing, security, and networking systems within the cloud environment, implemented cloud policies, and maintained service availability.
Environment: AWS, Hadoop, Spark, AWS Kinesis, Parquet, JSON, SNS, Redshift, Apache Airflow, MySQL, EC2, S3, AWS Athena, Glue.
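Illustrative sketch (an assumption, not project code): a minimal Kinesis-triggered Lambda handler that decodes each incoming record and writes it to DynamoDB with Boto3. The table name and item fields are hypothetical.
```python
# Minimal sketch: Kinesis-triggered Lambda writing records to DynamoDB.
# Table name and item fields are illustrative assumptions.
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("learner_events")  # hypothetical table name


def lambda_handler(event, context):
    """Decode each Kinesis record and persist it to DynamoDB as it arrives."""
    with table.batch_writer() as writer:
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            writer.put_item(
                Item={
                    "event_id": record["kinesis"]["sequenceNumber"],
                    "payload": payload,
                }
            )
    return {"written": len(event["Records"])}
```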
Genworth Financial Inc, VA. Dec/2022 – Nov/2023
Role: Data Engineer
● Worked on the development and delivery of data solutions and capabilities for Commercial Analytics and Data platforms, with a focus on costing deliverables and Apparel CBD deliverables.
● Designed and Deployed Pipeline with EMR to transform and move big data into AWS data stores and databases in S3.
● Worked to clean, prepare and optimize data for ingestion and consumption.
● Worked closely with Data and Analytics team to architect and build Data Lake using various AWS services like EMR, S3, Apache Airflow, Spark, and HIVE.
● Developed a Spark operator by converting Hive Query Language (HQL) scripts to Spark DataFrame code and modified the Airflow script to enhance performance.
● Worked to troubleshoot data issues and performed RCA to resolve product and operational issues.
● Worked with EMR to transform and move big data into AWS data stores and databases in S3.
● Optimized the ETL process by transforming the Spark operator to process and store high-volume data sets in the most efficient manner.
● Worked collaboratively to review design, code, test and implement the task in support of maintaining data engineering workflow.
● Created a common function for pipelines to perform EMR-independent file checks to reduce cost.
● Supported team to implement automated workflows and routines using workflow scheduling tools.
● Ran Airflow DAGs and maintained and fixed DAGs and bugs as needed (a minimal DAG sketch follows this section).
● Worked on data ingestion to Snowflake.
Environment: AWS Cloud (EMR, S3, EC2), Python, SQL, PySpark, Hive, Airflow, Snowflake, Tableau.
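Illustrative sketch (an assumption, not project code): a minimal Airflow 2.x DAG that schedules a daily spark-submit step for a PySpark transformation. The DAG id, schedule, and script path are hypothetical.
```python
# Minimal sketch: daily Airflow DAG submitting a PySpark transformation job.
# DAG id, schedule, and the spark-submit path are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="costing_spark_pipeline",      # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_spark_transform = BashOperator(
        task_id="run_spark_transform",
        # Submit the PySpark script that replaces the original HQL logic.
        bash_command="spark-submit s3://example-bucket/jobs/costing_transform.py",  # hypothetical path
    )
```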
LBrands International, OH. Oct/2021 – Nov/2022
Role: Azure Data Engineer
● Worked on Azure Data Factory to integrate data from both on-premises (MySQL, Cassandra) and cloud (Azure SQL DB, Blob Storage) sources and applied transformations to load the data into Azure Synapse.
● Migrated on-premises Hadoop cluster to Azure storage using Azure Data Factory.
● Built and scheduled pipelines using triggers in Azure Data Factory.
● Built PySpark pipelines to validate the table data from Hive and Oracle.
● Implemented Azure Log Analytics for monitoring the resources and their tasks for better throughput.
● Developed Apache Spark jobs for Data pre-processing and cleansing activities.
● Automated deployment from ACR using ADO YAML pipelines.
● Designed and developed a common architecture to store enterprise data and built a data lake in the Azure cloud.
● Developed Spark applications for data extraction, transformation, and aggregation from multiple systems and stored the data using Azure Databricks notebooks on Azure Data Lake Storage (a minimal PySpark sketch follows this section).
● Built data pipelines using Airflow to interact with services such as Azure Databricks, Azure Data Lake, Azure Data Factory, and Azure Synapse Analytics.
● Created ADF pipelines using linked services, datasets, and pipelines to extract, transform, and load data from multiple sources such as SQL Server, Blob Storage, Azure SQL, and Azure Synapse Analytics.
● Deployed Python libraries on Databricks by setting up new jobs to install the required libraries and their dependencies.
● Explored and transformed huge volumes of data by creating Python Databricks notebooks.
● Developed Python code for transferring and extracting data from on-premises systems to Azure Data Lake.
● Managed resources and scheduling across the cluster using Azure Kubernetes Service.
● Collaborated with clients and stakeholders to execute the design flow of data migration to Azure including disaster recovery and testing performance.
● Used Terraform in managing resources scheduling, disposable environments, and multitier applications.
● Automated Data migration and exploration using Azure Analytics and HDInsight to deliver the insights.
● Developed various Python scripts to find vulnerabilities with SQL Queries by performing SQL injection, permission checks and performance analysis and developed scripts to migrate data from database to PostgreSQL.
● Designed SQL, SSIS, and Python based batch and real-time ETL pipelines to extract data from transactional and operational databases and load the data into target databases/data warehouses.
● Responsible for collecting, scrubbing, and extracting data; generated compliance reports using SSRS; analyzed and identified market trends to improve product sales.
Environment: Azure Cloud, Azure Data Factory, Apache Spark, YAML, Azure Databricks, Azure Analytics, Synapse, SQL, SSIS, AWS VPC, Kubernetes, Python Databricks notebooks, HDInsight.
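Illustrative sketch (an assumption, not project code): a minimal Databricks-style PySpark job that reads raw JSON from ADLS Gen2, applies a simple cleansing step, and persists the result as Delta. The storage account, container, and column names are hypothetical, and storage credentials are assumed to be configured on the cluster.
```python
# Minimal sketch: read raw files from ADLS Gen2, cleanse, and write Delta.
# Storage account, container, and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adls_cleanse").getOrCreate()

raw = spark.read.json(
    "abfss://raw@examplestorage.dfs.core.windows.net/orders/"  # hypothetical path
)

cleansed = (
    raw.dropDuplicates(["order_id"])                      # hypothetical key column
       .withColumn("order_date", F.to_date("order_date"))
       .filter(F.col("order_id").isNotNull())
)

cleansed.write.format("delta").mode("overwrite").save(
    "abfss://curated@examplestorage.dfs.core.windows.net/orders/"  # hypothetical path
)
```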
Mutual Bank, MN. Feb/2020 – Aug/2021
Role: Data Engineer
● Organized, configured, and scheduled different resources across the cluster using Azure Kubernetes Service and monitored Spark cluster using Log Analytics and Ambari Web UI.
● Worked on transitioning log storage from Cassandra to Azure SQL Data Warehouse and improved performance.
● Engaged deeply in the development of data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL.
● Developed Spark Streaming programs to process real-time data from Kafka and handle data transformations (a minimal streaming sketch follows this section).
● Wrote PowerShell scripts to schedule Hive and Spark jobs from Control-M.
● Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks.
● Developed business aggregations using Spark and DynamoDB for storing aggregated data in JSON format.
● Integrated Kerberos authentication with Hadoop infrastructure for Authentication and authorization management.
● Worked with HBase by using Hive-HBase integration and computed various metrics for reporting on the dashboards.
● Used Spark Scala APIs, HIVE data aggregation and formatting data (JSON) to develop data pipelines.
● Applied partitioning and bucketing in Hive and designed Hive tables to optimize performance.
● Performed Data Cleaning, features scaling, features engineering using pandas and NumPy packages in Python.
● Implemented User Defined Functions in PySpark for data loading and transformations.
● Implemented star schema design for data warehouses, with fact tables referencing dimension tables.
● Extensively used SQL Queries to perform data validations, extractions, transformations, and data Loading.
● Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
● Implemented HBase Row key for inserting data into HBase tables with lookup and staging tables concepts.
● Implemented Spark jobs in Scala to access the data from Kafka and transformed the data to fit into HBase database.
● Developed skills in cluster installation, commissioning and decommissioning of data nodes, name node recovery, site configuration, and capacity planning.
Environment: Azure Kubernetes, Ambari Web UI, Kafka, Spark Streaming, Hive, DynamoDB, Scala, SQL, MapReduce, HDFS, HBase.
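Illustrative sketch (an assumption, not project code): the streaming work above used Spark Streaming with RDD mini-batches; shown here instead is the equivalent pattern in Spark Structured Streaming, consuming a Kafka topic, parsing JSON values, and writing Parquet. The broker, topic, schema, and paths are hypothetical, and the spark-sql-kafka package is assumed to be available on the cluster.
```python
# Minimal sketch: Structured Streaming from Kafka to Parquet.
# Broker, topic, schema, and output paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka_stream").getOrCreate()

schema = StructType([
    StructField("account_id", StringType()),   # hypothetical fields
    StructField("txn_type", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")      # hypothetical broker
    .option("subscribe", "transactions")                   # hypothetical topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "/mnt/curated/transactions")           # hypothetical output path
    .option("checkpointLocation", "/mnt/checkpoints/transactions")
    .start()
)
query.awaitTermination()
```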
Analogics Tech, India. July/2015 – Dec/2017
Role: Software Engineering Analyst
● Engaged with different teams to gather requirements and design the ETL migration process from the existing RDBMS to the Hadoop cluster using Sqoop.
● Developed Hive queries for data transformation and data analysis.
● Developed, implemented, and tested Python-based web applications interacting with MySQL.
● Loaded data from Hive tables to MySQL and DB2 using Python scripts.
● Involved in web services and Hibernate in a fast-paced development environment.
● Performed ETL operations using Informatica, PL/SQL, and UNIX shell scripts, and worked with SSIS.
● Developed strong skills in ETL design and development and understood the tradeoffs of various design options on multiple platforms using multiple protocols.
● Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake (a minimal load sketch follows this section).
● Worked with SQL Server Reporting Services (SSRS) and created various types of reports like Parameterized, Cascading, Conditional, Matrix, Table, Chart and Sub Reports.
● Defined static and dynamic repository variables to modify metadata content to adjust to a changing data environment.
● Researched Lean Six Sigma principles and implemented an online portal that replaced Excel sheets and optimized data storage.
● Gained hands-on experience with the Agile software development process, including analysis and design, and worked with Scrum, Jira, and Confluence.
● Gained experience with the DataStax Spark Cassandra Connector to store data into and retrieve data from the Cassandra database.
Environment: DataStax Spark Connector, Cassandra, Python, Snowflake, Agile, Hive, PL/SQL, UNIX.
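Illustrative sketch (an assumption, not project code): a minimal Python-to-Snowflake load step that stages a local extract file and copies it into a target table. The account, credentials, file, and table names are hypothetical.
```python
# Minimal sketch: stage a local CSV and load it into a Snowflake table.
# Account, credentials, file, and table names are illustrative assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",   # hypothetical connection details
    user="etl_user",
    password="***",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()
    # Upload the extracted file to the table's internal stage, then load it.
    cur.execute("PUT file:///tmp/daily_extract.csv @%DAILY_SALES")  # hypothetical file/table
    cur.execute(
        "COPY INTO DAILY_SALES FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
    )
finally:
    conn.close()
```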