Priyanka Thikkisetty
Email Id - ***************@*****.***
Contact No : +1-281-***-****
LinkedIn - www.linkedin.com/in/t-priyanka-b02783315
Summary
Experience in the complete Software Development Life Cycle (SDLC) following both Waterfall and Agile methodologies, enabling me to adapt to diverse project requirements and deliver high-quality solutions efficiently.
Skilled in constructing robust data pipelines and data marts using the Hadoop stack, with hands-on experience in Apache Spark RDD and DataFrame operations.
Expertise in designing and optimizing data solutions on Microsoft Azure, utilizing services such as Azure Data Lake Storage, Azure SQL Database, and Azure Databricks to build scalable data pipelines and enable actionable insights.
Experienced in database design, development, and support using MS SQL Server, Azure Data Factory (ADF), Databricks, MSBI (production and development), and Snowflake.
Extensive knowledge and experience in various Software Development Life Cycle (SDLC) methodologies, including Agile-Scrum, Waterfall, and Scrum-Waterfall hybrid methodologies.
Possess a strong understanding of Data Warehouse Architecture and Data Warehousing fundamentals, encompassing Star Schema, Snowflake Schema, and OLTP & OLAP systems.
Demonstrated hands-on experience with Azure cloud services, including Data Factory, Blob Storage, Databricks, Monitor, and Azure Active Directory (AAD).
Expertise in Extraction, Transformation, and Loading data from SQL Server, CSV, Flat Files, and JSON using Azure Data Factory and Big Data tools (Apache Spark, Azure Databricks, Hive).
Proficient in designing Linked Services, Pipelines, and Datasets to facilitate the loading of data from source to target within ADF.
Proficient with Copy, Lookup, Get Metadata, ForEach, and other activities used in ADF pipelines.
Implemented Integration Runtimes, Azure Key Vault integration, and triggers within Azure Data Factory.
Experienced in loading files from AWS S3, Amazon Redshift, Google Cloud Storage, MySQL, PostgreSQL, Snowflake, Blob Storage, Hive, Azure Data Lake Storage Gen2 (ADLS Gen2), and Delta Lake as sources into destination databases.
Skilled in implementing CI/CD pipelines using GitHub and Azure DevOps to automate the development and deployment process.
Hands-on experience with Unified Data Analytics on Databricks: the Databricks workspace user interface, managing Databricks notebooks, Delta Lake with Python, and Delta Lake with Spark SQL.
Good understanding of Spark architecture with Databricks and of processing structured, semi-structured, and unstructured data, including streaming; experienced in setting up Microsoft Azure with Databricks, configuring Databricks workspaces for business analytics, and managing Databricks clusters.
Developed robust data processing pipelines in Java to ingest, transform, and load large volumes of data into distributed systems such as Hadoop and Spark; built data pipelines using Kafka and Spark Streaming for real-time analytics on HDFS (see the sketch following this summary).
Experience in working with product teams to create various store level metrics and supporting data pipelines written in GCP’s big data stack.
Advanced expertise in programming and scripting with Python, Scala, and Java, developing custom data processing and analytics applications tailored to specific business needs.
Deep understanding of database management and operations with MySQL, PostgreSQL, SQL Server, and Oracle 12c, ensuring data accuracy and availability.
Demonstrated ability in data visualization and reporting with tools like Tableau, Power BI, and Informatica, enabling stakeholders to grasp complex data insights easily.
Proficient in cloud infrastructure management with VM Instances, EC2, RDS, and DynamoDB, optimizing computing resources for cost-effective data operations.
Expertise in big data technologies and the Hadoop ecosystem, including PySpark, Spark with Scala, HDFS, GPFS, Hive, Sqoop, Pig, Spark SQL, Kafka, Hue, YARN, Trifacta, and EPIC data sources.
Experience in Agile environments with adherence to release management and best practices, including proficiency in version control tools like GIT and deployment tools like Urban Code Deployment (UCD).
Integrated Snowflake with Apache Airflow for automated data workflows.
Leveraged Snowflake with dbt for version-controlled data transformation pipelines.
Tuned Snowflake warehouse configurations to reduce compute costs.
Handled Snowflake performance tuning and automated data loading using Snowpipe, Streams, and Tasks.
Optimized data storage (clustering, partitioning) and managed staging and production schemas.
Built and maintained dbt models, created reusable macros and Jinja templates, and managed sources, models, and tests in dbt.
Wrote unit tests and used dbt for data validation and quality checks.
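A minimal PySpark sketch of the streaming ingestion pattern referenced above (Kafka into a Delta table), assuming a Databricks/Spark environment with the Kafka connector and Delta Lake available; the broker address, topic name, schema fields, and storage paths are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

# Hypothetical event schema; field names are placeholders.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
])

# Read a Kafka topic as a stream (broker and topic are placeholders).
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# Kafka values arrive as bytes; cast to string and parse the JSON payload.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

# Append parsed events to a Delta table (checkpoint and table paths are placeholders).
(events.writeStream.format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/events")
 .start("/mnt/delta/events"))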
Skills
Operating Systems: Unix, Linux, Windows
Programming Languages: Java, Python 3, Scala 2.12.8, Spark, SQL, PL/SQL, UNIX, Pig, HiveQL, Shell Scripting
Hadoop Ecosystem: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase, Palantir
Relational Databases: Oracle 12c/11g/10g, MySQL, MS SQL Server, DB2, PostgreSQL, Snowflake
NoSQL Databases: MongoDB, Cassandra, HBase, DynamoDB
Workflow mgmt. tools: Oozie, Apache Airflow
Visualization & ETL tools: Tableau, Power BI, D3.js, Informatica, Talend, SSIS
Cloud Technologies: Azure, AWS, GCP
Cloud Services: AWS - EC2, S3, EMR, RDS, Glue, Presto, Lambda, Redshift; Azure - Data Lake Storage, Blob Storage, Data Factory, Synapse Analytics, Databricks
IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ
Version Control Systems: Git, SVN, CVS
Education
JNTUK, India – May 2016
Bachelor's, Electronics and Communication Engineering
Lamar University, USA – Dec 2023
Master's, Computer Science
Work Experience
GM Financial, Fort Worth, TX - Contract
Sr. Data Engineer Feb 2024 – Present
Involved in full Software Development Life Cycle (SDLC) - Business Requirements Analysis, preparation of Technical Design documents, Data Analysis, Logical and Physical database design, Coding, Testing, Implementing, and deploying to business users.
Designed Linked Services, Pipelines, and Datasets to facilitate the seamless loading of data from source to target in ADF (Azure Data Factory).
Developed and maintained Azure Data Factory pipelines and created and managed Databricks notebooks, ensuring seamless data integration and processing.
Ingested data from diverse sources, including SQL Server, APIs, and file systems, into their respective target layers, leveraging Data Factory and Databricks.
Developed linked services and datasets to facilitate smooth data ingestion and processing operations.
Analyzed business and functional requirements and provided design solutions for software components to meet them.
Used SparkSession in notebooks to load data from sources such as files (CSV, Parquet, JSON), databases, and distributed storage systems (Hive, HDFS).
Monitored batch and real-time processing jobs and logged relevant information for debugging and auditing purposes.
Skilled in data ingestion, transformation, and modeling using Databricks Delta Lake, ensuring data reliability, consistency, and ACID transactions.
Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation across multiple file formats, uncovering insights into customer usage patterns (see the sketch at the end of this role).
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark, and U-SQL (Azure Data Lake Analytics); ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL DW) and processed it in Azure Databricks.
Experience with Unified Data Analytics on Databricks: the Databricks workspace user interface, managing Databricks notebooks, Delta Lake with Python, and Delta Lake with Spark SQL.
Initiated the migration of on-premises data warehouses to AWS Redshift, enabling enhanced data analysis and reporting. Utilized Amazon S3 as an object storage service for storing and retrieving data, ensuring durability, availability, and scalability.
Integrated DynamoDB as a fully managed NoSQL database, enabling low-latency access and high scalability for key-value and document data storage and retrieval.
Utilized Snowflake's auto-scaling capabilities to dynamically adjust compute resources based on workload demands, decreasing query latency.
Developed and optimized resilient ETL pipelines using Informatica, resulting in a reduction in data integration time and increase in data quality.
Designed and developed a system to collect data from multiple portals using Kafka and process it using Spark.
Implemented Parquet columnar storage format for efficient data storage and retrieval, resulting in a reduction in storage costs and improvement in query performance.
Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
Implemented Spark SQL for distributed data processing, querying, and analyzing large datasets, achieving an increase in processing speed and reducing data transformation time.
Incorporated Amazon Athena for querying data directly from Amazon S3 using SQL, enabling ad-hoc data analysis without a separate data warehouse.
Developed ELT processes for files from Ab Initio and Google Sheets in GCP, with Dataprep, Dataproc (PySpark), and BigQuery as the compute layer.
Employed Amazon QuickSight for creating interactive dashboards and visualizations, enabling business users to gain insights from their data.
Designed, developed, deployed, and maintained large-scale Tableau dashboards for Product Insight, Devices and Networking, and Cox Premise Equipment.
Utilized Python libraries and frameworks such as pandas, NumPy, PySpark, and Apache Beam for efficient data manipulation, analysis, and parallel processing.
Developed custom Python scripts and modules for data ingestion, data quality checks, and data transformation, ensuring data integrity and reliability.
Developed and deployed the solution using Spark and Scala code on a Hadoop cluster running on GCP.
Integrated Apache Kafka to build real-time data pipelines, facilitating streaming data ingestion and reliable message delivery between various components of the data platform.
Developed scripts using PySpark to push data from GCP to third-party vendors through their API framework, resulting in cost savings compared to traditional server-based architectures.
Provisioned Amazon EC2 instances for running virtual servers and hosting applications, ensuring scalability and high availability.
Involved in reviewing components and test cases, and performed gap analysis between STTM documents and requirements.
Deployed Hadoop ecosystem components like HDFS (Hadoop Distributed File System) for batch processing and storage of large datasets, improving data processing efficiency while ensuring fault-tolerance and high availability.
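A minimal PySpark sketch, assuming hypothetical paths and column names, of the batch pattern described in this role: loading files through the SparkSession, aggregating with Spark SQL, and writing the result as Parquet:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Load source files through the SparkSession (paths and columns are placeholders).
orders = spark.read.option("header", True).csv("/mnt/raw/orders/")
customers = spark.read.parquet("/mnt/curated/customers/")

# Register temporary views so the aggregation can be expressed in Spark SQL.
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

usage = spark.sql("""
    SELECT c.customer_id,
           COUNT(*)                      AS order_count,
           SUM(CAST(o.amount AS DOUBLE)) AS total_amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
""")

# Persist the aggregate in Parquet for efficient downstream querying.
usage.write.mode("overwrite").parquet("/mnt/curated/usage_summary/")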
Cyient, India
Data Engineer Nov 2019 - Jul 2022
Collaborated with stakeholders to gather and define project requirements following Agile methodology, ensuring alignment with business objectives.
Designed and developed data models for OLAP and OLTP systems to support various analytical and transactional workloads, enhancing data organization and retrieval.
Used Datastream to capture and stream data changes from Oracle databases to BigQuery for near real-time analytics, ensuring data consistency and timeliness.
Developed ETL pipelines using Dataflow (Apache Beam) to process and transform streaming data into clean, structured formats (Avro, Parquet, ORC), improving data quality and usability.
Employed Cloud Composer (Airflow) to orchestrate complex workflows, ensuring seamless execution and monitoring of ETL processes and enhancing automation and error handling (see the DAG sketch at the end of this role).
Deployed Cloud Functions for serverless event-driven computing, enabling the execution of lightweight functions in response to specific triggers.
Conducted data analysis using SQL in BigQuery, deriving actionable insights from large datasets efficiently.
Developed dashboards and visualizations using Looker and Power BI, providing clear and interactive views of business metrics.
Employed Python and Pyspark for data mining and cleaning, leveraging their powerful libraries for data manipulation and analysis.
Utilized Kafka for real-time data streaming and event management, ensuring efficient data pipeline handling.
Leveraged DBT (Data Build Tool) for transforming data within the warehouse, promoting modular and maintainable data transformation workflows.
Collaborated with stakeholders to gather and analyze business requirements and conducted detailed data analysis to identify data sources, data types, and data flow requirements.
Designed visual dashboards in Tableau using data extracted from MES Oi, visually presenting 153 KPIs.
Designed the overall architecture for the data lake using Azure Data Lake Storage for scalable and secure data storage, and utilized Azure Blob Storage for storing raw and semi-structured data.
Created databases in InfluxDB, worked on the interface created for Kafka, and validated the measurements in the databases.
Developed data ingestion pipelines using Azure Data Factory (ADF) to orchestrate and automate data movement from various sources such as on-premises databases (SQL Server, Oracle, Teradata) and third-party APIs.
Implemented ETL processes using ADF, Azure Databricks, and Python to transform and load data into Azure Synapse Analytics for further processing and analysis.
Designed and deployed a cloud data warehouse solution in Google BigQuery, integrating data from multiple GAP product sources and enabling self-service analytics for the product team.
Designed and implemented data models in Azure Synapse Analytics to support OLAP queries and reporting, and managed data storage in Cosmos DB for high-throughput and low-latency access to transactional data.
Implemented unit tests for ETL processes using Python and PySpark to ensure data integrity and reliability, and conducted data validation and reconciliation to verify data accuracy.
Configured IAM policies and Azure Resource Manager for secure access control and resource management, and utilized Azure Data Catalog for data discovery and metadata management.
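A minimal Cloud Composer (Airflow 2.x) DAG sketch of the orchestration pattern described in this role; the DAG name, schedule, and task bodies are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull data from a source system.
    pass


def transform():
    # Placeholder: clean and reshape the extracted data.
    pass


def load():
    # Placeholder: write the transformed data into the warehouse.
    pass


with DAG(
    dag_id="daily_etl",              # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task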
Pegasystems, India
Business Data Analyst Jul 2016 - Oct 2019
Developed and maintained data processing applications using Cloudera Hadoop distribution for efficient handling of large-scale datasets.
Designed and optimized Hive tables and queries for structured data storage and retrieval within Hadoop environments (see the sketch at the end of this role).
Implemented Impala for real-time querying and analysis of data stored in Hadoop clusters.
Collaborated with business stakeholders to gather requirements and translate them into technical solutions.
Developed data integration jobs using IBM DataStage to orchestrate data flows between various systems and platforms.
Utilized Shell scripting for automation of data processing tasks and system management.
Leveraged SQL and PL/SQL for querying and manipulating data in relational databases.
Scheduled and managed data processing tasks using Autosys job scheduler to ensure timely execution.
Conducted performance tuning and optimization of Hadoop clusters and data processing jobs for improved efficiency.
Implemented data security measures and access controls to protect sensitive information and ensure compliance with regulatory requirements.
Created dashboards in Tableau to present quarterly metrics such as profit generated, customer sentiment, call volumes, and case volumes.
Documented data engineering processes, workflows, and best practices for knowledge sharing and team collaboration.
Conducted data analysis and profiling to identify data quality issues and recommend improvements.
Worked closely with data scientists and analysts to understand business requirements and provide technical expertise in data engineering solutions.
Participated in code reviews and implemented coding standards to ensure code quality and maintainability.
Collaborated with infrastructure teams to provision and optimize resources for data processing and storage.
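A minimal PySpark sketch, with placeholder database, table, and column names, of querying Hive tables from Python as described in this role:

from pyspark.sql import SparkSession

# Enable Hive support so Spark can read tables registered in the Hive metastore.
spark = (SparkSession.builder
         .appName("hive-reporting")
         .enableHiveSupport()
         .getOrCreate())

# Query a Hive table (database, table, and column names are placeholders).
daily_cases = spark.sql("""
    SELECT case_date, COUNT(*) AS case_count
    FROM support.cases
    WHERE case_date >= date_sub(current_date(), 90)
    GROUP BY case_date
    ORDER BY case_date
""")

daily_cases.show()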
Certifications:
Certified Google Cloud Platform Associate Data Practitioner (Feb 2025 – Mar 2027).
Certified Azure Data Engineer Associate (DP-203) (Mar 2025 – Mar 2026).