Data Warehousing Big

Location:

Pflugerville, TX

Posted:

September 26, 2024

Resume:

Sanjana VS

Mobile number: +1-737-***-****

Email Address: ************@*****.***

LinkedIn: https://www.linkedin.com/in/sanjana-vs-785247125/

PROFESSIONAL SUMMARY

With over 7 years of comprehensive experience in data engineering, I bring extensive expertise in ETL methodology, cloud platforms, big data tools, and data warehousing. My technical acumen spans multiple tools and technologies, including Apache Spark, Hive, Microsoft Azure, AWS, and Snowflake. I worked as a Lead Big Data and Cloud Engineer at Deloitte and E&Y, where I spearheaded complex data processing projects, drove automation, and delivered scalable solutions that optimized performance and reduced costs.

I am well-versed in the entire software development lifecycle (SDLC). My work is characterized by a deep understanding of data warehousing concepts, proficiency in SQL and scripting languages, and an ability to solve complex technical challenges. My leadership skills are demonstrated through my experience in leading cross-functional teams, managing stakeholder expectations, and ensuring the timely delivery of high-quality products. I am a collaborative team player known for strong interpersonal and analytical skills.

Extensive Industry Experience: Solid experience in data engineering, with a deep specialization in ETL methodology, SSIS, MS SQL, Microsoft Azure, Azure Data Factory, UNIX, Spark, Hive, Spark-SQL, Scala, Oozie, AWS Athena, Amazon S3, Hadoop, Python, Azure Databricks, Snowflake and Data warehousing. I have established myself as a trusted Data Expert proficient in data cleaning, transformations, profiling, modeling, visualization, engineering, and warehousing.

Automated Workflow Development: Successfully developed and deployed 20+ time-driven automated workflows using Oozie, leading to a 60% reduction in manual reporting efforts and a 30% increase in overall operational efficiency.

Data Analysis and Querying: Expertise in Athena, Scala, Spark-SQL, and Hive, with a proven track record of writing complex queries for data migration, manipulation, and reporting, particularly in the context of large-scale projects.

Migration Planning: Planned and developed roadmaps and deliverables for migrating existing on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF V1/V2).

Data Warehousing Proficiency: Highly skilled in developing and maintaining Data Warehousing models, both relational and dimensional, ensuring data accuracy, consistency, and scalability across various projects.

Proficient in using Cloudera Manager, an end-to-end tool for managing Hadoop operations in a Cloudera cluster. Worked with various streaming ingest services for batch and real-time processing using Spark Streaming, Kafka, Flume, and Sqoop.

Extensive working experience with big data ecosystem - Hadoop (HDFS, MapReduce, Yarn), Spark, Kafka, Hive, Impala, HBase, Sqoop, Pig, Oozie, Zookeeper, Ambari.

Comprehensive ETL Workflow Management: Extensive experience in the complete lifecycle of ETL workflows, from Analysis, Design, Development, Testing, and Deployment to Maintenance and Production Support, using both Agile and traditional methodologies.

Successfully utilized Apache Spark for distributed data processing, ensuring high efficiency and scalability in handling large datasets. Leveraged Azure Data Factory, Scala, Python, SQL, and AWS Athena to deliver high-performance solutions. Successfully managed all phases of development, from initial design to final production deployment. Ensured robust data processing, seamless ETL workflows, and efficient data analysis across platforms.

Hands-on experience with Azure services: Azure Portal, Azure Cosmos DB, Azure Synapse, Azure Analytics, Azure Data Lake Storage, Azure Data Factory, Azure Stream Analytics, Azure Databricks, Azure Log Analytics, and Azure Blob Storage. Responsible for loading and transforming large sets of structured, semi-structured, and unstructured data while addressing all the Vs of big data.

Strong Hadoop and platform support experience: Worked with the entire suite of tools and services in the major Hadoop distributions: Cloudera, Hortonworks, Amazon EMR, and Azure HDInsight.

Experience developing Pig Latin and HiveQL scripts for data analysis and ETL, and extending their default functionality by writing user-defined functions, including aggregate functions (UDAFs), for data-specific processing.

In-depth knowledge of developing production-ready Spark applications using Spark Core, Spark Streaming, Spark SQL, DataFrames, and Datasets.

Strong working experience with SQL and NoSQL databases, data modeling and data pipelines. Involved in end-to-end development and automation of ETL pipelines using SQL and Python.

Proficient in NoSQL databases including HBase, Cassandra, and MongoDB, and in their integration with Hadoop clusters. Experience using Sqoop to import data from relational database systems into HDFS and to export data from HDFS to relational database systems.

Extensive experience with the architecture and components of Spark, and proficient in working with Hadoop Core, Spark SQL, and Spark Streaming. Expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing. Experience configuring Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS, and in using Spark SQL with data sources such as JSON and Hive.

Cluster Management: Monitored workloads and job performance and carried out capacity planning using Cloudera Manager on the CDH distribution. Created and maintained technical documentation for launching Cloudera Hadoop clusters and for executing Hive queries and Pig scripts. Converted Oracle tables to Teradata table components in Ab Initio graphs.

Leveraged SQL for querying and managing relational databases, ensuring accurate data retrieval and manipulation. Integrated AWS Athena to query data stored in Amazon S3 directly, allowing for fast, serverless data analysis.

Python Utilities: Experience in creating Python scripts and utilities to automate repetitive tasks, enhance the functionality of the framework, and support data validation and error handling.

Cost Reduction: Analyzed the existing infrastructure and processes to identify cost-saving opportunities. Implemented changes that led to a 30% reduction in operational costs, primarily through better resource utilization and scaling down unnecessary processes.

Team Management and Resource Allocation: Experience in leading a team of developers and testers, overseeing their work, providing guidance, and ensuring that project milestones were met on time. Managed the allocation of resources, ensuring that the team had the necessary tools and support to complete their tasks effectively.

Stakeholder Communication: Interacting with stakeholders to understand their requirements, provide project updates, and gather feedback. This ensured alignment with client expectations and successful project outcomes.

Interview and Onboarding: Involved in conducting technical interviews and onboarding process for new team members, ensuring that they were familiar with the project, tools, and processes. Provided training and support to help them integrate into the team quickly.

TECHNICAL SKILLS

Programming Languages

Python, Scala, SQL

Data Warehousing

Snowflake, SQL Data Warehouse

ETL Tools

Apache NiFi, Apache Airflow, Informatica, SSIS

Data Processing Frameworks

Apache Spark, Apache Flink, Apache Storm

Database Systems

PostgreSQL, MySQL, Oracle, Microsoft SQL Server

Big Data Technologies

Hadoop, Hive, HBase, Kafka, Oozie

Cloud Platforms

AWS (Amazon Web Services), Azure

Data Visualization

Power BI

Version Control

Git, GitHub, Bitbucket

Data Modeling

ERwin Data Modeler, dbt (data build tool)

Monitoring & Logging

Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana)

Scripting Languages

UNIX Shell script, Python, R

Data-Modeling Methodologies

Object Relational Modeling, ER modeling, Dimensional Modeling

Cloud technologies

Data Lake Analytics, Databricks, Data Factory, AWS Glue, Blob Storage

Development Methodologies

Agile, Scrum, Iterative Development, Waterfall Model, UML, Design Patterns

Big Data Distributions

Cloudera, Hortonworks, Amazon EMR

WORK EXPERIENCE

Deloitte Aug 2021 – July 2024

Lead Big Data Engineer

Client: Liberty Mutual insurance company

Description: This project involves generating monthly and quarterly financial reports for clients related to both commercial and personal lines of insurance across various U.S. states. The process is automated, utilizing an Oozie workflow, Spark with Scala for data processing, Amazon Web Services for cluster management, and S3 for storage.

Responsibilities:

Oozie Workflows: Designed and orchestrated multiple Oozie workflows to automate complex data processing tasks. These workflows were responsible for executing various jobs for data ingestion, transformation, and reporting.

Scala Code Development: Wrote efficient and scalable Scala code to handle large volumes of data. The code was integrated within the Spark environment to perform complex transformations and computations. Developed and optimized Spark SQL queries to extract, process, and analyse data from large datasets. These queries were essential for generating accurate and timely reports.

Shell Scripting: Utilized shell scripts for task automation, including job execution, monitoring, and error handling within the UNIX/Linux environment.

Data Processing Time Improvement: Implemented various strategies to optimize the data processing pipeline, reducing the overall processing time by 60%. This included optimizing Spark jobs, efficient resource management, and reducing data shuffling.

Failure Handling: Proactively identified potential points of failure within the workflow and implemented measures to handle them gracefully. This included writing fail-safe mechanisms, alerting systems, and automated retries.

Issue Analysis and Resolution: Analyzed and debugged issues as they arose, including data discrepancies, job failures, and performance bottlenecks. Developed and implemented fixes to ensure smooth operations.

Hadoop Environment: Worked with Hadoop ecosystem components such as HBase, Sqoop, Zookeeper, Oozie, Hive, and Pig on the Cloudera Hadoop distribution. Installed and configured EC2 instances on AWS (Amazon Web Services) to establish clusters in the cloud.

Hive and MapReduce Migration to AWS Cloud: Migrated Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR and Qubole. Parsed data from S3 via Python API calls through Amazon API Gateway to generate batch sources for processing. Configured Azure Active Directory and managed users and groups.

Data Processing and Migration: Moved data from traditional databases such as MS SQL Server, MySQL, and Oracle into Hadoop (HDFS). Developed Oozie workflows to automate loading data into HDFS and preprocessing it with Pig, and used AWS Glue for data transformation and cleansing.

Implementation and Big Data Architecture Setup: Worked on a CI/CD solution using Git, Jenkins, and Docker to set up and configure big data architecture on the AWS cloud platform. Gained exposure to all aspects of the software development life cycle (SDLC), including analysis, planning, development, testing, implementation, and post-production analysis, under both Waterfall and Scrum/Agile methodologies.

Quality Assurance: Ensured that the developed solutions were thoroughly tested and met the required quality standards before deployment. Played a key role in the deployment of the project to the production environment, ensuring a smooth transition with minimal downtime.

YAML Configuration: Designed YAML configuration files that were used to define various parameters and settings for the workflows, ensuring that they met the specific needs of the client. Wrote and optimized SQL queries to handle complex data retrieval and manipulation tasks, ensuring that they were efficient and aligned with client requirements.

Process Automation: Automated key workflow tasks, achieving a 60% reduction in manual effort. This encompassed critical areas such as data processing, streamlining report generation, and enhancing job monitoring processes.

Environments: Oozie, Apache Spark, Scala, Python, Shell Scripting, AWS (Amazon S3, Amazon EMR), Spark SQL, YAML, SQL, UNIX/Linux, Docker, Kafka, AWS Glue, Pig, HDFS, MapReduce, HBase, Sqoop, Zookeeper.

E&Y Feb 2021 - Aug 2021

Data Engineer

Client: Nationwide Insurance Company

Description: The project was carried out for one of the largest insurance providers in the U.S., focusing on designing and developing a data warehouse to facilitate downstream data processes, generate reports using Power BI, and create subsequent data marts. In parallel, the project involved migrating data to Azure, utilizing Azure Data Factory (ADF) for data integration and transformation.

Responsibilities:

Developed jobs and scripts to efficiently load data from source systems into the data warehouse tables. This involved ensuring data integrity and consistency during the loading process.

Worked with COBOL copybooks to parse and load legacy data into the data warehouse, facilitating the migration of older data formats.

Extracted data from sources such as Oracle, Teradata, and SQL Server and loaded it into HDFS using Sqoop import. Created Hive staging tables and external tables, joined them as required, and implemented dynamic partitioning, static partitioning, and bucketing.

Wrote Hive queries in HQL to analyze data in the Hive warehouse, and created complex dynamic-partition tables in Hive for better performance and faster querying.

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and Azure Data Lake Analytics.

Developed Unix/Bash shell scripts for pre- and post-validation of the master and slave nodes, run before and after configuration of the name node and data nodes. Developed job workflows in Oozie to automate loading data into HDFS.

Developed a data set process for data modeling and recommended ways to improve data quality, efficiency, and reliability. Worked with various compression and file formats such as Avro, Parquet, and text, and migrated MapReduce jobs to Spark jobs for better performance.

Estimated the development effort required for various tasks, contributing to project planning and resource allocation. Performed performance tuning on data mappings to optimize the data loading process, reducing execution time and improving efficiency. Analyzed logical data models to ensure that the data warehouse design met the requirements for reporting and downstream processes.

Actively participated in defect fixing during the System Integration Testing (SIT) and User Acceptance Testing (UAT) phases, ensuring that any issues were resolved promptly to meet project deadlines.

Managed Azure infrastructure, including Azure Web Roles, Worker Roles, VM Roles, Azure SQL, Azure Storage, and Azure AD licenses. Performed virtual machine backup and recovery from a Recovery Services vault using Azure PowerShell and the Azure Portal, and migrated data from traditional database systems to Azure databases.

Built data pipelines in Azure Data Factory (ADF) to migrate data from Azure Blob Storage to Azure SQL Database. These pipelines were crucial for the successful migration and transformation of data within the Azure environment.

Applied various data transformations in ADF, including building complex expressions and working with different Azure components to meet data migration requirements.

Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks. Responsible for estimating cluster size and for monitoring and troubleshooting Spark Databricks clusters. Used Azure Databricks to organize data into notebooks and visualize it with dashboards.

Involved in designing and developing a framework for file processing and job execution related to different financial data. This framework was crucial for standardizing and automating data processing tasks.

Environments: Azure Data Factory (ADF), Azure Blob Storage, Azure SQL Database, Power BI, SQL, Data Modelling, Azure Data Lake, Azure Storage, Azure SQL, Azure DW, Azure Databricks, Azure Data Lake Analytics, Azure PowerShell.

E&Y June 2019 - Jan 2021

Data Engineer

Client: American Family Insurance

Description: This project was executed for a US insurance company, focusing on developing an automated Oozie workflow to generate monthly and yearly financial reports based on client data stored in HDFS. The generated reports were encrypted and securely placed in an FTP folder, with automated email notifications sent to stakeholders indicating the success or failure of the report generation process.

Responsibilities:

Wrote complex Hive and Impala queries for data transformation, ensuring the accurate creation of tables required for financial reporting.

Worked on optimizing these queries to enhance performance, ensuring quick and efficient data retrieval from HDFS.

Developed shell scripts for various tasks including file encryption, creation of control files, and automated email triggers. These scripts were integral to the secure handling and distribution of the generated reports.

Automated the encryption process to ensure that all reports were securely stored and transmitted to the FTP folder.

Worked on SQL optimizations and automation of manual processes to build a data warehouse from multiple data sources.

Worked on MySQL and Pentaho configuration setup and on migrating ETL jobs from on-premises to cloud-based servers.

Performed data quality checks comparing EDW data against source systems to identify variances. Analyzed data from source systems such as Retail and Salesforce and developed dashboards to create story points.

Created and managed Oozie workflow properties files to schedule and execute jobs. This ensured that the financial reports were generated and distributed according to the defined timelines.

Experience in Extraction, Transformation and Loading (ETL) of data from multiple sources like XML files and Databases.

Experience in writing Unix scripts to automate the sanitization process.

Validated the target tables loaded into the cluster and reported any discrepancies.

Experience using Sqoop to import data from RDBMS into HDFS and to export data from HDFS to RDBMS.

Continuously monitored the workflows to identify and implement performance improvements, ensuring the reliability and efficiency of the automated processes. Verified the accuracy of the generated reports by cross-checking the data against client-provided samples. This was crucial in maintaining the integrity and trustworthiness of financial reports.

Involved in the project from its inception, contributing to the initial design and development phases, and seeing it through to production deployment. Provided knowledge transfer sessions to team members and stakeholders, ensuring a smooth transition of responsibilities and understanding of the system.

Prepared Weekly Status Reports (WSR) to update stakeholders on the status of work, ensuring transparency and clear communication throughout the project lifecycle.

Prepared use cases for big data work in technical workshops.

Organized daily Scrum meetings with the team, prioritized product backlog items, and was responsible for the timely delivery and deployment of product releases.

Environments: Oozie, Apache Hive, Apache Impala, HDFS, Shell Scripting, FTP, SQL, Hadoop, ETL

E&Y May 2018 - June 2019

Data Engineer

Client: Lloyd's of London

Description: This project involved migrating insurance claims data from legacy systems to Guidewire ClaimCenter 9 for a major UK insurance provider specializing in Marine and Auto claims. The migration was aimed at modernizing the client’s claims management system and improving operational efficiency.

Responsibilities:

Data Transformation and Migration: Designed and developed SSIS (SQL Server Integration Services) packages to handle the complex data transformation and migration processes required to move data from legacy systems to the Guidewire ClaimCenter 9 platform.

Customization and Optimization: Customized SSIS packages to ensure they met specific project requirements and optimized them for performance to handle large volumes of claims data.

Data Mapping Documentation: Conducted a thorough analysis of source systems to understand data structures and relationships. Created detailed data mapping documentation to guide the migration process, ensuring that all necessary data fields were correctly mapped to the new system.

Validation and Testing: Ensured that the data mapping was accurate through rigorous validation and testing, reducing the risk of data integrity issues during migration.

SQL Server Stored Procedures: Developed a reconciliation framework using SQL Server stored procedures to verify the accuracy and completeness of data migrated from the legacy systems to the new platform.

Data Accuracy Verification: The framework was critical in ensuring that the data in the target system (Guidewire ClaimCenter 9) matched the data in the source systems, thereby maintaining data integrity post-migration.

Change Request Implementation: Provided ongoing production support after the migration, including implementing change requests and tracking their impact on the system. Monitored the system for any post-implementation issues, addressing and resolving defects promptly to ensure the system’s stability.

Procedure Creation: Created and tested stored procedures to manage data operations within SQL Server, ensuring they were robust, efficient, and met project requirements. Loaded data into the new system, identified defects during testing, and worked on fixing them to ensure a smooth transition to production.

In-Depth Knowledge: Leveraged in-depth knowledge of the London insurance market, including lines of business, roles, systems, messaging, and processes, to ensure that the migration aligned with industry standards and best practices. Ensured that the migration process addressed the specific needs of Marine and Auto claims, considering the unique aspects of these lines of business.

Environments: SSIS (SQL Server Integration Services), SQL Server, Guidewire ClaimCenter 9, Stored Procedures

E&Y Oct 2017 - May 2018

Data Engineer

Client: Allstate Insurance

Description: This project was undertaken for a US insurance company, focusing on migrating millions of claims from legacy systems, flat files, and Excel to Guidewire ClaimCenter. The migration involved creating tables, views, and stored procedures, transforming data according to ClaimCenter standards, and loading the data from various sources into staging and final Guidewire tables after thorough validation checks.

Responsibilities:

Requirement Gathering: Analyzed business requirements and functional specifications to understand the scope of the migration and the specific needs of the client.

Documentation: Collaborated with the business team to prepare detailed mapping documents, outlining how data from various sources would be transformed and loaded into the Guidewire system.

Table, View, and Procedure Creation: Designed and developed database tables, views, and stored procedures based on the mapping documents, ensuring they aligned with ClaimCenter standards and facilitated smooth migration of data.

Data Transformation: Transformed data from legacy systems, flat files, and Excel into the required format for loading into Guidewire, applying necessary business logic and validations.

SIT and UAT Support: Actively supported the testing team during System Integration Testing (SIT) and User Acceptance Testing (UAT), addressing any defects that were identified and ensuring that the migrated data met the client’s quality standards. Participated in manual, regression, and smoke testing as needed to verify the accuracy and reliability of the migrated data.

Issue Resolution: Communicated any challenges or risks to stakeholders promptly and worked on resolving them to maintain project timelines.

Monitoring and Error Fixing: Monitored the execution of stored procedures to ensure they ran smoothly and efficiently during the migration process. Identified and fixed errors that occurred during the data loading process, ensuring that the final data in the Guidewire tables was accurate and complete.

Test Case Development: Developed unit test cases to validate the functionality of the created tables, views, and stored procedures, ensuring they performed as expected before integration into the larger system.

Environments: SQL Server, Guidewire ClaimCenter, Stored Procedures, Excel, Flat Files, Data Mapping Documentation.
