Priyanka R
Data Engineer
***********@*****.***
Professional Summary:
8+ years of professional experience in Big Data development, primarily using the Spark and Hadoop ecosystems.
Hands-on experience with AWS (Elastic MapReduce (EMR), S3, EC2, Glue, Redshift), Azure (ADB and ADF), and data warehousing.
Experience working with SDLC methodologies such as Agile Scrum to develop and deliver applications.
Experience in the design, development, and implementation of Big Data applications using Hadoop ecosystem frameworks and tools such as Sqoop, Spark, Storm, HBase, and Kafka.
Good at converting SQL queries into Spark transformations using Spark RDD and PySpark concepts.
Experience in development of ETL processes and frameworks for large-scale, complex datasets.
Experience with application development on Linux, Python, RDBMS, NoSQL, and ETL solutions.
Good experience in designing and operating large Data Warehouses.
Proficient in designing and building data processing pipelines using tools and frameworks.
Good understanding of data transformation and translation requirements and of which tools to leverage to get the job done.
Experience with data pipelines and modern ways of automating data pipelines using cloud-based techniques.
Familiar with new advances in the data engineering space such as EMR and NoSQL technologies like MongoDB.
Hands-on experience with Continuous Integration and Deployment (CI/CD).
Experience using version control systems such as Git.
Good interpersonal, analytical, and presentation skills, with the ability to work in self-managed and team environments.
Big Data : PySpark, Spark SQL, Hive SQL, HDFS, Kafka
Web Development : JavaScript, Node.js, HTML, CSS, React, Postman, Tomcat.
Operating systems : Linux (Ubuntu), Windows (XP/7/8/10)
Languages : Python, R, Java, JS, C, Shell Script
Databases : MySQL, MongoDB, Oracle, PostgreSQL, SQL Server, Teradata
Others : NLP, Spring Boot, Jupyter Notebook, Git, Apache, Jenkins, CI/CD, Power BI.
2022-01 - Current Data Engineer
IPC-Miami, FL
Used Azure Data Factory extensively for ingesting data from disparate source systems.
Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems.
Worked extensively with ADLS Gen2 and ADB notebooks using Python and SQL.
Involved in Azure DevOps, CI/CD integration, and Azure Analysis Services.
Automated pipelines in ADF using scheduled triggers.
Designed and developed stored procedures in ADF.
Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and the write-back tool, and in the reverse direction.
Knowledge of Logic Apps.
Developed Spark applications using PySpark and Spark SQL with DataFrames for data extraction, transformation, and aggregation from multiple file formats, analyzing the data to uncover insights into customer usage patterns.
Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
Analyzed the data flow from different sources to targets to provide the corresponding design architecture in the Azure environment.
Created High-level technical design documents and Application design documents as per the requirements and delivered clear, well-communicated, and complete design documents.
Created build and release definitions for Continuous Integration and Continuous Deployment.
Created the Application Interface Document for downstream systems to create a new interface for transferring and receiving files through Azure Data Share.
Created pipelines and performed complex data transformations and manipulations using ADF and PySpark with Databricks.
2020-03 - 2022-01 Data Engineer
UGI-Denver, PA
Implemented Spark using Python and Spark SQL for faster testing and processing of data.
Developed Spark programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW.
Imported data from sources such as SQL Server into Spark RDDs and developed a data pipeline using Kafka and Spark to store data in HDFS. Used Spark Streaming with Python to receive real-time data from Kafka and store the stream data in HDFS and in NoSQL databases such as MongoDB.
Configured multiple AWS services like EMR and EC2 to maintain compliance with organization standards.
Used Spark SQL to handle JSON data, create RDDs, and load them into Hive tables; handled structured data using Spark SQL and a snowflake schema.
Worked extensively on fine-tuning Spark applications to improve performance and on troubleshooting failures in Spark applications.
Implemented various Hive queries for analytics. Created external tables, optimized Hive queries, and improved cluster performance by 30%. Used broadcast joins in Spark to join smaller datasets to large datasets without shuffling data across nodes.
Developed a Spark application that loads CSV file data and applies business validation to separate valid and invalid records into DataFrames. Wrote the valid DataFrame to the Hive partitioned table and the invalid DataFrame to an error table, partitioned by load date and load type.
Used Teradata extensively for DBA maintenance functions, and used AWS Redshift, S3, Spectrum, and Athena to query large amounts of data stored on S3 to create a virtual data lake.
Worked extensively with Sqoop to export data from HDFS to relational database systems and to import data from them into HDFS.
Worked on data integration and analytics based on Spark, Kafka, and webMethods technologies.
Implemented the Big Data solution using Hadoop and Hive to pull and load data into HDFS.
2019-07 - 2020-03 Data Engineer
Ecolab Inc-Naperville, IL
Responsible for running Spark jobs, including optimization, data validation, and automation.
Proficient in converting SQL queries into Spark transformations using Spark RDD and Python.
Configured EMR to process data for millions of customers using Spark applications in less time.
Worked on MongoDB (NoSQL database) to store unstructured data before processing with HiveQL.
Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing the data to uncover insights into customer usage patterns.
Developed Spark jobs on Databricks to perform tasks like data cleansing, data validation, standardization, and then applied transformations as per the use cases.
Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC, Parquet, and text files) into AWS Redshift.
Customized Hive UDFs to derive structured data from unstructured customer data, and loaded data from the database into HBase using Sqoop.
Loaded data into S3 buckets, filtered the data stored in S3 using Elasticsearch, and loaded it into Hive external tables. Utilized Spark's in-memory capabilities to handle large datasets on the S3 data lake.
Used Spark Streaming to consume event-based data from Kafka and joined it with existing Hive table data to generate performance indicators for an application.
Used AWS Lambda to run scripts and code snippets in response to events occurring in CloudWatch.
Integrated applications using Apache Tomcat servers on EC2 instances and automated data pipelines into AWS using Jenkins and Git.
2018-09 - 2019-07 Data Engineer
CLIN Force Inc-Durham, NC
Gathered data and performed analytics using AWS stack (EMR, EC2, S3, RDS, Lambda, Redshift).
Extensive experience writing MapReduce jobs and UDFs to gather, analyze, transform, and deliver data as per business requirements.
Extensive experience in development of SQL, Oracle PL/SQL Scripts, Stored Procedures and Triggers for business logic implementation.
Researched, evaluated, architected, and deployed new tools, frameworks, and patterns to build sustainable Big Data platforms.
Designed and developed the architecture for a data services ecosystem spanning relational, NoSQL, and Big Data technologies.
Responsible for data architecture design and delivery, data model development, review, and approval, and data warehouse implementation. Designed and developed conceptual, logical, and physical data models to meet reporting needs.
Generated ad-hoc SQL queries using joins, database connections, and transformation rules to fetch data from MySQL Workbench and SQL Server database systems.
Managed the metadata for the Subject Area models for the Data Warehouse environment.
Generated DDL and created the tables and views in the corresponding architectural layers.
Worked on MongoDB (NoSQL database) to store unstructured data before processing with HiveQL.
Installed, configured, and managed RDBMS and NoSQL databases (SQL Server, MySQL, DB2, PostgreSQL, MongoDB, Cassandra); also created Tableau dashboards for BI reports.
2015-10 - 2017-02 Data Engineer
Rock metric-HYD, India
Worked extensively with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, AAS, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Created pipelines in Azure Data Factory (ADF) using Linked Services, Datasets, and Pipelines to extract, transform, and load data from sources such as Azure SQL, Blob Storage, Azure SQL DW, and the write-back tool, and in the reverse direction.
Created the Application Interface Document for downstream systems to create a new interface for transferring and receiving files through Azure Data Share.
Designed and configured Azure Cloud relational servers and databases, analyzing current and future business requirements.
Wrote PowerShell scripts to automate the creation of Azure resources: resource groups, web applications, Azure Storage blobs and tables, and firewall rules.
Worked on migrating data from on-premises SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
Configured Input & Output bindings of Azure Function with Azure Cosmos DB collection to read and write data from the container whenever the function executes.
Designed and deployed data pipelines using Data Lake, Databricks, and Apache Airflow.
Developed Elastic pool databases and scheduled Elastic jobs to execute T-SQL procedures.
Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing the data to uncover insights into customer usage patterns.
Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks.
Created and provisioned different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.
Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
Created data pipelines to load different events from Azure Blob Storage into Hive external tables. Used Hive optimization techniques such as partitioning, bucketing, and map joins.
Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest data from Snowflake, MS SQL, and MongoDB into HDFS for analysis.
Developed automated job flows, which run MapReduce jobs internally, and ran them through Oozie daily and on demand.
Extracted tables and exported data from Teradata through Sqoop, and loaded it into Cassandra.
2013-08 - 2015-10 Data Analyst
Voltuswave Technologies Pvt. Ltd, HYD, India
Worked on wrangling data and creating data pipelines using fast, efficient Python code.
Implemented the Big Data solution using Hadoop, Hive, and Informatica to pull and load data into HDFS.
Pulled data from the data lake and massaged it with various RDD transformations.
Developed Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation, queries, and writing data back into RDBMS through Sqoop.
Worked on configuring and managing disaster recovery and backup for Cassandra data.
Wrote CloudFormation Templates in JSON for Network & Content Delivery of the AWS Cloud Environment.
Created user-oriented complex queries using the Structured Query Language (SQL).
Normalized and denormalized relational databases to optimize performance.
Handled business logic through backend Python programming to achieve optimal results.
Worked on DBMS table design, loading and data modeling in SQL.
Identified problematic SQL queries and optimized statistics, skew, indexes, and joins to save system resources.
Implemented data models, database designs, data access, table maintenance, and code changes together with the development team.
Developed and ran serverless Spark-based applications using AWS Lambda and PySpark to compute metrics for various business requirements.
Bachelor's in Engineering-TPT, India 2013