
Data engineer

Location:
Irving, TX
Salary:
65
Posted:
May 23, 2024


Resume:

DIVYA SREE VADLAKONDA

Sr. Data Engineer

Phone: +1-469-***-****

Email: ad5wuu@r.postjobfree.com

LinkedIn: https://www.linkedin.com/in/divyasree-vadlakonda-dataengineer/

PROFESSIONAL SUMMARY

10+ years of expertise across Cloud Data Engineering, Big Data Engineering, Data Warehousing, Data Marts, Data Visualization, and Data Quality.

Proficient in building data ingestion procedures with AWS Data Pipeline, extracting data from diverse source systems, and loading it into data lakes and warehouses.

Demonstrated experience developing scalable and reliable data solutions using Amazon Web Services.

Capable of using programming languages like Scala and Python, as well as Big Data technologies like Hadoop and Hive.

Experienced with normalization (1NF, 2NF, 3NF, and BCNF) and de-normalization techniques for optimal performance in OLTP and OLAP systems.

Possess a solid understanding of RDBMS concepts, logical and physical data modelling up to 3NF, and multidimensional data modelling schemas (Star schema, Snowflake modelling, facts, and dimensions).

Worked with and extracted data from several database sources, including Oracle, DB2, and SQL Server.

Practical knowledge of machine learning, big data, data visualization, R and Python development, Linux, SQL, and Git/GitHub.

Possesses a thorough understanding of Hadoop ecosystem components, as well as hands-on experience processing and analysing large-scale datasets utilizing Spark, Hive, and YARN.

Implemented machine learning techniques on massive datasets to discover hidden patterns and insights.

Proficient in using Python for various Data Engineering tasks such as data manipulation, transformation and analysis utilizing packages like NumPy, Pandas and Scikit-learn to enhance data processing capabilities.

Developed Spark applications with a high degree of optimization to carry out different tasks like data transformation, validation, cleaning, and summarization as needed.

Competent in managing non-relational data storage and retrieval, with experience working with NoSQL databases like Cassandra and DynamoDB.

Extensive background in data analysis, design, development, implementation, and testing using SQL and ETL procedures in a range of relational and non-relational databases.

Capable of using Google Cloud Platform (GCP) and BigQuery to create scalable data solutions that provide actionable insights and business value.

Built visually compelling dashboards with Tableau and Power BI to support forecasting and visualization of large, aggregated datasets.

Adept at creating end-to-end data solutions by utilizing Azure data services including Azure Synapse Analytics, Azure SQL Database, Azure Data Lake Storage, Azure Data Factory, and Azure Databricks.

Proficient in using Informatica PowerCenter as an ETL tool to improve data extraction, transformation, and loading operations.

Proficient in designing complex SQL queries, stored procedures, triggers, and packages, with a solid foundation in relational database design.

EDUCATION

Bachelor’s in Computer Science Engineering at Kakatiya Institute of Technology and Science Warangal (An Autonomous Institute under Kakatiya University), Telangana, India.

Master’s in Applied Computer Science at Northwest Missouri State University, Maryville, Missouri, United States.

CERTIFICATION

Microsoft Certified: Azure Data Engineer Associate, Certification ID - 891424C50A622B6D

TECHNICAL SKILLS

Big Data Technologies

Hadoop, Apache Spark, Apache Kafka, PySpark, Hive, MapReduce, Scala

Cloud Platforms

Azure, AWS, GCP

Database Systems

Relational Databases (MySQL, PostgreSQL, SQL Server), NoSQL Databases (Cassandra, DynamoDB, MongoDB)

Data Modelling

Relational Data Modelling, Dimensional Modelling (Star Schema, Snowflake Schema)

Machine Learning

Python Libraries (NumPy, Pandas, Scikit-learn, TensorFlow, PyTorch), Predictive Modelling.

Scripting

Shell scripting, Python Scripting

Operating Systems

Microsoft Windows, Unix, Linux

Data Integration Tools

Apache Airflow, Apache NiFi

Reporting Tools

MS Excel, Tableau, Tableau Server, Tableau Reader, Power BI

Version Control

Git, GitHub

Other tools and technologies

Microsoft Visual Studio, Java, Jupyter Notebook, Agile Methodologies, Anaconda, Jira, Docker, CI/CD.

WORK EXPERIENCE

Client: CVS Health, Richardson, TX

Role: Sr. Data Engineer

April 2023 to Present

Responsibilities:

Used big data technologies, such as Hadoop ecosystem components (Spark and Hive) along with Kafka and Scala to efficiently handle and analyse enormous healthcare datasets, integrating RESTful APIs for seamless data access and processing.

Generated detailed studies on potential third-party data handling solutions, verifying compliance with internal needs and stakeholder requirements.

Collaborated on ETL (Extract, Transform, Load) tasks, maintaining data integrity, and verifying pipeline stability.

Performed large-scale data conversions for integration into HDInsight, employing Apache Spark for high-performance data processing.

Designed and implemented effective database solutions (Azure blob storage) to store and retrieve data, utilizing Apache Spark for data ingestion and manipulation.

Designed advanced analytics ranging from descriptive to predictive models to machine learning techniques.

Monitored incoming data analytics requests and distributed results to support IoT hub and streaming analytics.

Integrated Python scripts with Azure services like Azure Data Factory and Azure Databricks to automate data pipeline creation, management, and execution, using REST APIs to ensure smooth connectivity with external systems.

Prepared documentation and analytic reports, delivering summarized results, analysis, and conclusions to stakeholders.

Generated comprehensive studies on third-party data handling solutions, ensuring alignment with internal needs and stakeholder requirements, while maintaining Delta Live tables for up-to-date analytics.

Used Apache Airflow for workflow orchestration, scheduling, and monitoring to ensure dependable and effective data pipeline execution.
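
Illustrative sketch only (not code from the CVS project): a minimal Airflow DAG with two chained PythonOperator tasks; the dag_id, task names, and callables are hypothetical placeholders.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_claims():
    # placeholder extract step
    print("pulling source files")

def validate_claims():
    # placeholder data-quality step
    print("running row-count and null checks")

with DAG(
    dag_id="claims_pipeline",            # hypothetical name
    start_date=datetime(2023, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_claims", python_callable=ingest_claims)
    validate = PythonOperator(task_id="validate_claims", python_callable=validate_claims)
    ingest >> validate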

Employed Python to write custom functions and logic for handling abnormal data in streaming analytics services using Azure service bus topics and Azure functions.

Guided the development, optimization, and management of data pipelines using Azure Data Factory and Databricks, ensuring seamless data integration and processing within the Azure ecosystem.

Employed data cleansing methods that significantly enhanced data quality.

Designed and deployed efficient database solutions using Snowflake on Azure for secure storage and retrieval of healthcare data.

Wrote Azure Service Bus topics and Azure Functions to handle abnormal data detected by the streaming analytics service.
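
A minimal sketch of the kind of Service Bus-triggered Azure Function described above, assuming the Python v1 programming model; the queue binding would live in function.json, and the anomaly rule and field names are hypothetical.

import json
import logging
import azure.functions as func

def main(msg: func.ServiceBusMessage) -> None:
    # decode the Service Bus message payload
    record = json.loads(msg.get_body().decode("utf-8"))
    # flag readings outside an expected range (placeholder rule)
    if record.get("heart_rate", 0) > 220:
        logging.warning("abnormal reading: %s", record)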

Used Kafka for real-time data streaming and event collecting in healthcare, integrating with Scala for scalable data processing and analytics.

Utilized Python testing frameworks like Pytest, Doctest and PyUnit for writing and executing automated tests to validate data processing pipelines and ETL workflows.
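
A minimal pytest-style sketch of such a validation check; clean_members and the column names are hypothetical placeholders, not code from the project.

import pandas as pd

def clean_members(df: pd.DataFrame) -> pd.DataFrame:
    # drop rows with missing member_id and normalize state codes
    out = df.dropna(subset=["member_id"]).copy()
    out["state"] = out["state"].str.upper()
    return out

def test_clean_members_removes_null_ids_and_uppercases_state():
    raw = pd.DataFrame({"member_id": [1, None], "state": ["tx", "mo"]})
    cleaned = clean_members(raw)
    assert cleaned["member_id"].notna().all()
    assert (cleaned["state"] == cleaned["state"].str.upper()).all()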

Used Azure DocumentDB to store the latest status of the targeted customer information.

Deployed Azure Data Factory to create data pipelines that orchestrate data into the SQL database.

Employed data cleansing methods within Snowflake to enhance data quality and ensure accuracy of healthcare data.

Used Python libraries like Matplotlib, NumPy and Pandas for visualizing the data.
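
For illustration, a simple Pandas/Matplotlib plot of the kind used for such visualizations; the metric, sample values, and output file name are made up.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# synthetic sample data standing in for the real metric
df = pd.DataFrame({"day": pd.date_range("2023-04-01", periods=30),
                   "claims": np.random.randint(100, 200, 30)})
df.plot(x="day", y="claims", kind="line", title="Daily claim volume (sample data)")
plt.tight_layout()
plt.savefig("claims_trend.png")  # hypothetical output file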

Worked with data integration and storage technologies including Jupyter Notebook and MySQL.

Implemented Azure Data Factory for data pipeline creation and orchestration, establishing automated CI/CD pipelines to ensure efficient deployment and maintenance of streamlined data workflows in Azure Synapse Analytics.

Client: Populus Financial Group, Dallas TX

Role: Sr. Data Engineer

June 2021 to March 2023

Responsibilities:

Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL, utilizing REST APIs for data retrieval and processing.

Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure Synapse Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).

Designed and implemented migration strategies for traditional systems on Azure (lift and shift, Azure Migrate, and other third-party tools).

Engaged with business users to gather requirements, design visualizations, and provide training on self-service BI tools, integrating Apache Spark for advanced analytics and machine learning.

Pulled data into Power BI from various sources such as SQL Server, Excel, Oracle, and Azure SQL.

Proposed architectures considering cost and spend in Azure and developed recommendations to right-size data infrastructure.

Utilized Python to interact with OpenStack APIs, enabling seamless integration with cloud resources and enhancing automation capabilities for infrastructure management, implementing REST APIs for resource provisioning and management.

Technically guided projects through to completion within target timeframes; collaborated with application architects and DevOps teams to identify and implement best practices, tools, and standards.

Designed, set up, maintained, and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.

Built complex distributed systems using Snowflake's scalability and performance features to handle massive data volumes, gather metrics, construct data pipelines, and run analytics in a streamlined manner.

Utilized Python within CI/CD pipelines to automate the deployment of database solutions and BI applications on Azure, ensuring continuous integration, testing, and efficient delivery of changes to Azure Data Platform services.

Built complex distributed systems involving large-scale data handling, metrics collection, data pipeline construction, and analytics.

Utilized GCP services such as BigQuery and Cloud Storage to improve data storage and processing for large financial datasets, integrating Kafka, Scala, and Apache Spark for data ingestion, processing, and analytics.

Created and implemented data access controls and authorization methods to enforce least privilege principles and prevent unauthorized access to financial data stored in GCP.

Implemented data validation and cleansing methods to ensure the integrity and accuracy of the data stored in GCP.

Carried out extensive data validation between source files and BigQuery tables using Python and Cloud Dataflow to ensure data integrity and dependability.
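
A simplified row-count reconciliation sketch using the google-cloud-bigquery client (the production validation also relied on Cloud Dataflow); the project, dataset, table, and source file path are hypothetical.

import pandas as pd
from google.cloud import bigquery

# count rows in the exported source file (placeholder path)
source_count = len(pd.read_csv("exports/transactions.csv"))

client = bigquery.Client(project="finance-project")               # placeholder project
query = "SELECT COUNT(*) AS n FROM `finance-project.finance.transactions`"
bq_count = next(iter(client.query(query).result())).n

assert source_count == bq_count, f"count mismatch: {source_count} vs {bq_count}"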

Used the GCP Cloud Shell SDK to simplify the configuration of services such as Dataproc, Cloud Storage, and BigQuery, ensuring project readiness.

Designed and executed complex data processing pipelines with Apache Flink, supporting elaborate data transformations and analytics.

Created Python Airflow scripts to automate workflow and facilitate the seamless execution of project tasks and procedures.

Processed data from a variety of sources using IBM Streams, a range of processing operators, and sophisticated analytics methods.

Client: Deutsche Bank, Kansas City MO

Role: Data Engineer

August 2019 to May 2021

Responsibilities:

Developed PySpark and Scala based Spark applications for performing data cleaning, event enrichment, data aggregation, de-normalization and data preparation needed for machine learning and reporting teams to consume.
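
A condensed PySpark sketch of this cleaning-and-aggregation pattern; the paths and column names are hypothetical placeholders.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("trade-prep").getOrCreate()

trades = spark.read.parquet("s3://bucket/raw/trades/")            # placeholder path
cleaned = (trades
           .dropDuplicates(["trade_id"])                          # de-duplicate events
           .filter(F.col("amount").isNotNull())                   # basic cleaning
           .withColumn("trade_date", F.to_date("trade_ts")))      # enrichment
daily = cleaned.groupBy("trade_date", "desk").agg(F.sum("amount").alias("total_amount"))
daily.write.mode("overwrite").parquet("s3://bucket/curated/daily_trades/")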

Transformed data using Snowflake from AWS to S3.

Involved in the design and development of Snowflake database components such as stages, Snowpipe, streams, and tasks, ensuring seamless integration with Delta Lake and Delta Live Tables.

Experienced working with EMR clusters in the AWS cloud and with S3, Redshift, and Snowflake.

Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase and MongoDB.
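
A Structured Streaming sketch of the Kafka-consumer side; the broker, topic, and paths are placeholders, and the actual jobs wrote to HBase and MongoDB through their Spark connectors rather than the simplified file sink shown here.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-consumer").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")       # placeholder broker
          .option("subscribe", "payments")                        # placeholder topic
          .load()
          .select(F.col("value").cast("string").alias("payload")))

(stream.writeStream
 .format("parquet")                                               # simplified sink
 .option("path", "s3://bucket/stream/payments/")
 .option("checkpointLocation", "s3://bucket/chk/payments/")
 .start())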

Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
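
A sketch of such a producer using kafka-python and requests; the endpoint URL, topic, and broker address are hypothetical.

import json
import time
import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",                              # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # poll the external REST API and publish each record to Kafka
    resp = requests.get("https://api.example.com/rates", timeout=10)  # placeholder URL
    for record in resp.json():
        producer.send("fx_rates", record)                         # placeholder topic
    producer.flush()
    time.sleep(60)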

Developed migration strategy for conventional systems to Snowflake on Azure, incorporating best practices for lift and shift and third-party solutions for streamlined migration processes.

Experienced in handling large datasets using Spark in-memory capabilities, broadcast variables, and effective and efficient joins, transformations, and other operations.

Implemented automated testing methodologies within Azure CI/CD pipelines to validate data pipeline functionality and ensure data integrity and accuracy.

Utilized Python for scripting and automation tasks within Kubernetes environments, including the creation of private clouds supporting DEV, TEST, and PROD environments.

Wrote HBase bulk-load jobs to load processed data into HBase tables by converting it to HFiles.

Wrote Glue jobs to migrate data from HDFS to S3 data lake.

Generated Python Django forms for online users.

Developed Spark applications using Spark-SQL in Azure Databricks, leveraging the capabilities of Databricks for efficient data extraction and processing within the Azure cloud environment.

Implemented partitioning, dynamic partitions, and bucketing in Hive.
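
As an illustration, writing a partitioned and bucketed Hive table from PySpark; the database, table, paths, and column names are hypothetical.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-load")
         .enableHiveSupport()
         .getOrCreate())

df = spark.read.parquet("s3://bucket/curated/daily_trades/")      # placeholder path
(df.write
 .partitionBy("trade_date")         # partition column (supports dynamic partitions)
 .bucketBy(16, "desk")              # 16 buckets on the join key
 .sortBy("desk")
 .mode("overwrite")
 .saveAsTable("finance.daily_trades_bucketed"))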

Built and managed cloud-based, resilient data pipelines with AWS services like Glue, Aurora Postgres, EKS, Redshift, PySpark, Lambda, and Snowflake.

Experienced in developing Spark applications using Spark SQL in Databricks for data extraction.

Client: Adaptec Solutions Private Limited, India

Role: ETL Developer

Sep 2016 to July 2018

Responsibilities:

Designed and developed the ETL methodology supporting data transformations, standardization, and processing using PySpark and Python.

Worked on the development of data ingestion pipelines using the Talend ETL tool and Bash scripting with big data technologies including, but not limited to, Hive, Impala, Spark, and Kafka.

Imported data from MS SQL Server and Teradata into HDFS using Sqoop.

Working knowledge of cluster security components like Kerberos, Sentry, SSL/TLS etc.

Utilized Hive partitioning, bucketing and performed various kinds of joins on Hive tables.

Tuned and optimized big data technologies to improve overall system efficiency and reduce processing times.

Contributed to the development and maintenance of documentation for ETL procedures, data models, and system configurations, resulting in clear and comprehensive records for future reference.

Implemented data quality checks and validation methods in ETL workflows to ensure data accuracy and dependability throughout its lifecycle.

Led the design and implementation of data pipelines in Python, applying industry best practices and ensuring scalability, reliability, and efficiency when processing large volumes of data.

Worked on real-time data streaming with Apache Kafka, allowing data to be ingested and processed in near real time for quick insights and decisions.

Contributed to the evaluation and implementation of new technologies and techniques to improve the capabilities of the data processing infrastructure.

Client: Acuity Software Technologies, India

Role: ETL Developer

July 2013 to Aug 2016

Responsibilities:

Designed and developed the ETL methodology supporting data transformations, standardization, and processing using PySpark and Python.

Performed data pre-processing and analysis using Python and Unix scripting.

Loaded and transformed large sets of structured, semi-structured, and unstructured data and managed data from diverse sources using major Hadoop ecosystem components: Hive, HDFS, and Spark.

Used PySpark and Python for feature engineering activities, extracting useful features from raw data to improve the performance of machine learning models.

Worked on the analysis, specification, design, implementation, and testing phases of the Software Development Life Cycle (SDLC).

Improved query efficiency and decreased data scan times by implementing data partitioning and bucketing techniques in Hive.

Designed and optimized data pipelines to handle complicated data structures, simplifying the extraction and transformation of data from several sources into a cohesive and usable format.

Applied expressive and effective data manipulation using PySpark's DataFrame API, allowing succinct yet powerful transformations on large datasets.


