
Data Engineering Governance

Location:
Charlotte, NC
Posted:
January 09, 2024


SAI GANESH

ad2lda@r.postjobfree.com

+1-614-***-****

Professional Summary:

Over 10 years of IT industry experience with hands-on work in Data Engineering.

Good knowledge of Data Quality and Data Governance practices and processes.

Well versed in Agile (Scrum), Waterfall, and Test-Driven Development (TDD) methodologies.

Proficient in SQLite, MySQL, and SQL databases, with exposure to Kafka, Terraform/Ansible, and the Apache ecosystem.

Practical understanding of data modeling (dimensional and relational) concepts such as star schema modeling, snowflake schema modeling, and fact and dimension tables.

Experience handling Python and Spark contexts when writing PySpark programs for ETL.

Strong knowledge in data visualization using Power BI and Tableau.

Hands-on experience with Snowflake and NoSQL databases such as HBase, Cassandra, and MongoDB.

Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in AWS, and in coordinating tasks among the team.

Experienced with statistical methodologies, regression-based models, and factor analysis.

Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.

Experienced with Databricks, a unified platform that combines data engineering, data science, and machine learning within a collaborative environment.

Real-time data processing: to ingest and analyze real-time or near-real-time data in HDFS (Hadoop Distributed File System), I utilized Java in conjunction with technologies like Kafka and Spark Streaming.
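
The bullet above describes a Java implementation; as a rough, hedged illustration of the same Kafka-to-HDFS pattern, here is a minimal PySpark Structured Streaming sketch. Broker, topic, and path names are placeholders, not details from the original project:

    # Minimal Kafka -> HDFS streaming sketch (requires the spark-sql-kafka
    # connector package on the classpath). All names below are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")  # assumed broker
              .option("subscribe", "events-topic")                # assumed topic
              .load()
              .selectExpr("CAST(value AS STRING) AS payload", "timestamp"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/raw/events")            # assumed HDFS path
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .start())
    query.awaitTermination()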

Involved in the entire data science project life cycle, actively participating in all phases including data extraction, data cleaning, statistical modeling, and data visualization with large data sets of structured and unstructured data.

We employed Snowflake's diverse data loading methods, such as bulk loading and Snowpipe, for efficient ingestion of data into the platform.
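
As a hedged illustration of the bulk-loading approach mentioned above, the sketch below issues a COPY INTO statement through the Snowflake Python connector; account, stage, table, and file-format details are assumptions, and Snowpipe itself is configured on the Snowflake side rather than in client code:

    # Illustrative Snowflake bulk load via snowflake-connector-python.
    # Credentials and object names are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",      # assumed account identifier
        user="etl_user",           # assumed user
        password="***",
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="RAW",
    )
    cur = conn.cursor()
    # Bulk-load staged files into a target table; CSV format is assumed.
    cur.execute("""
        COPY INTO RAW.CLAIMS
        FROM @RAW.CLAIMS_STAGE
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
    cur.close()
    conn.close()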

Skilled in advanced regression modeling, multivariate analysis, and model building.

Actively involved in gathering, detailing, documenting, and managing requirements from project inception through production migration.

Enforced code review practices within DBT projects through hands-on participation, maintaining code quality, identifying optimization opportunities, and sharing practical insights.

Hands-on experience in establishing robust data governance practices within a DataOps framework.

My practical experience in a GxP/SOX regulated quality function within biopharmaceutical companies and CROs demonstrates a thorough understanding of the regulatory landscape.

I have hands-on experience with Apache Druid, which is a columnar storage database designed for real-time analytics. I've worked on projects involving the setup of Druid clusters, query optimization, and real-time data ingestion and processing.
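
As a minimal, hedged sketch of querying Druid for real-time analytics, the snippet below posts a SQL query to Druid's SQL endpoint; the host, datasource, and column names are assumptions, not taken from the projects described:

    # Query Druid's SQL API with the standard requests library.
    import requests

    resp = requests.post(
        "http://druid-router:8888/druid/v2/sql/",     # assumed router address
        json={"query": "SELECT channel, COUNT(*) AS events "
                       "FROM clickstream "            # assumed datasource
                       "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR "
                       "GROUP BY channel"},
        timeout=30,
    )
    resp.raise_for_status()
    for row in resp.json():   # Druid returns a JSON array of result rows
        print(row)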

Directly managed databases on the DBX-UC platform, taking a hands-on approach to performance tuning, indexing, and query optimization. Conducted hands-on database health checks, actively identifying and resolving performance bottlenecks and issues.

On Google Cloud, Java made it easier to create scripts in BigQuery, and I connected them to reporting tools to extract useful information from the processed data.

Ability to prioritize and multitask while working on time sensitive deliverables, provide leadership to accomplish the work and meet deadlines.

Data documentation: the framework can also be used as a form of data description, providing information about the desired structure and quality of datasets.

Excellent communication, analytical and problem-solving abilities with clear understanding of system development life cycle and proven project management skills.

Experience with AWS, as well as GCP services such as Cloud Functions, Bigtable, and BigQuery.

Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs.

Experienced in data manipulation using Python.

Hands-on experience working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.

I also investigated the tool's available data sources and integration possibilities, evaluating simplicity of use, scalability, and performance factors, as well as any known restrictions and issues faced by our team.

I was able to discover data quality trends, spot abnormalities, and resolve any data concerns by analyzing the results of these Great Expectations tests. Great Expectations' integration with Databricks proved to be a great addition to our data quality management toolkit, assuring data accuracy, consistency, and reliability in our big data analytics and data engineering initiatives.

I started by installing Great Expectations on a server accessible from within my Databricks environment and created a dedicated project for the proof of concept. I then configured data sources within the Databricks environment, defining which data needed validation.
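
A minimal sketch of the kind of expectations used in such a proof of concept, written against the legacy Great Expectations SparkDFDataset API (newer releases use a different, context-based API); the table and column names are placeholders:

    # Hedged example: validate a Spark DataFrame with legacy Great Expectations.
    from great_expectations.dataset import SparkDFDataset
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.table("raw.members")                      # assumed Databricks table

    ge_df = SparkDFDataset(df)
    ge_df.expect_column_values_to_not_be_null("member_id")
    ge_df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
    ge_df.expect_column_values_to_be_in_set("state", ["WA", "CA", "NY"])

    results = ge_df.validate()      # overall pass/fail plus per-expectation detail
    print(results.success)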

My practical machine learning experience in the healthcare industry demonstrates my dedication to better clinical decision-making, enhancing patient care, and advancing medical research. Recognizing the significant influence ML approaches may have on healthcare outcomes, I have continuously worked to apply them properly and ethically.

Used Databricks to process and analyze large datasets with its distributed computing capabilities, integrating with other Azure services for seamless data integration and analysis.

Proficient in installing, configuring and using Apache Hadoop ecosystems such as MapReduce, Hive, Pig, Flume, Yarn, HBase, Sqoop, Spark, Storm, Kafka, Oozie, and Zookeeper.

Strong experience in designing big data pipelines covering data ingestion, data processing (transformations, enrichment, and aggregations), and reporting.

Experience in integrating Kafka with Spark streaming for high-speed data processing.

Experience in implementing Azure data solutions with PowerShell, provisioning storage accounts, Azure Data Factory, Azure Databricks, Azure Blob Storage, Azure Synapse, and Azure Cosmos DB.

Technical Skills

Hadoop/Spark Ecosystem: Hadoop, MapReduce, Hive/Impala, YARN, Kafka, Flume, Sqoop, Oozie, Zookeeper, Spark, NiFi, DB2

Databases: Oracle, MySQL, SQL Server, PostgreSQL, HBase, Snowflake, Cassandra, MongoDB

Cloud Computing: Amazon Web Services (AWS), Amazon Redshift, MS Azure, Azure Blob Storage, Azure Data Factory, Azure Synapse, Google Cloud Platform (BigQuery, Bigtable, Dataproc), MuleSoft

BI Tools: Business Objects XI, Tableau 9.1, Power BI

Query Languages: SQL, PL/SQL, T-SQL

Scripting Languages: Unix, Python

Operating Systems: Linux, Windows, Ubuntu, Unix

SDLC Methodology: Agile, Scrum, Waterfall, UML

Professional Experience

Client: Molina Healthcare, Bothell, WA March 2021 to Present

Data Engineer

Responsibilities:

Involved in creating data ingestion pipelines for collecting healthcare and provider data from various external sources like FTP servers and S3 buckets.

Transformed and analyzed data using PySpark and Hive based on ETL mappings.

Involved in migrating the existing Teradata data warehouse to AWS S3-based data lakes.

Involved in migrating existing traditional ETL jobs to Spark and Hive Jobs on new cloud Data Lake.

Wrote complex Spark applications to perform various de-normalizations of the datasets and create a unified data analytics layer for downstream teams.

Developed Spark scripts by using Scala Shell commands as per the requirement.

Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API.
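
The bullet above refers to the Scala API; the following is a hedged PySpark equivalent of reading Parquet data and registering it as a Hive table, with the path and table names assumed for illustration:

    # Read Parquet from the data lake and persist it as a Hive table.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("parquet-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    providers = spark.read.parquet("s3://datalake/curated/providers/")  # assumed path
    (providers
     .write
     .mode("overwrite")
     .format("parquet")
     .saveAsTable("analytics.providers"))    # assumed Hive database.table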

Involved in writing Spark applications using Scala to perform various data cleansing, validation, transformation, and summarization activities according to the requirement.

Worked on GCP service – Compute Engine for virtual machine hosting.

Actively used DBT's version control features to track changes, collaborating with team members for effective model modifications.

Created new database objects like tables, procedures, functions, triggers, and views using T-SQL.

Created and modified PL/SQL procedures, functions, and triggers according to business requirements.

For our analytical environment, Snowflake's capability for parallel processing was essential. Several people could execute complex SQL queries simultaneously, and the platform efficiently managed the concurrency without compromising performance.

We utilized Snowflake as a core data warehousing solution in our data management tasks. Its architecture separates compute and storage resources, which gave us the freedom to scale our data warehouse in response to shifting demands.

Developed logical and efficient data models using DBT, ensuring a structured representation of the underlying data.

We made use of Snowflake's data sharing features to make it easier for various teams in our company to collaborate with one another; this capability let us safely exchange datasets and insights without the need for significant data migration.

Our use of Snowflake benefited greatly from its scalability: we could easily adjust compute resources (virtual warehouses) to match the volume and complexity of our analytical queries, and this elasticity guaranteed optimal performance even at peak times.

My work has included managing data within data lake environments, including ADLS Gen2 and AWS S3.

I have hands-on experience in creating logical data models for visualization tools such as Tableau, Power BI, and Sigma.

Actively engaged in hands-on optimization of data processes, identifying bottlenecks, and implementing improvements iteratively. Ensured hands-on involvement in feedback loops, allowing for continuous improvement in DataOps processes.

Employed dimensional modeling techniques to optimize analytics performance. Ensured scalability and maintainability of DBT transformation pipelines through hands-on experience.

I developed test scripts and conducted automated testing to ensure the reliability and performance of the project.

Applied query optimization techniques within DBT, drawing on hands-on experience to enhance the performance of analytics queries.

Python helped me modify and preprocess the data: I used the Pandas and NumPy libraries to organize, clean, and analyze the project's data.

Python allowed me to integrate the project with external data sources and APIs. This facilitated real-time data updates and expanded the project's data resources.

SOX compliance entails having a strong internal control system to prevent financial misstatements and fraud; I helped with the documentation and testing of internal controls, as well as liaising with stakeholders.

Primarily responsible for fine-tuning long-running Spark applications, writing custom Spark UDFs, troubleshooting failures, etc.

As part of my responsibilities, I was personally responsible for ensuring that all processes and systems linked to drug research, manufacturing, and clinical trials adhered to GxP standards.

Automated evaluation: the framework allows data pipelines and ETL procedures to be tested automatically, which helps detect data issues early in the pipeline and prevents faulty data from spreading.

Data quality checks: Great Expectations can be used to perform various data quality checks, such as tests for missing values, data types, range limits, and more.

I've overseen change management processes and document control systems to guarantee that any changes to systems, processes, or procedures are thoroughly examined and do not jeopardize product quality or data integrity.

Part of my responsibilities have included creating and delivering training programs to ensure that all staff are well-versed in GxP and SOX rules, quality standards, and compliance needs.

Involved in building a real-time pipeline using Kafka and Spark Streaming for delivering event messages to the downstream application team from an external REST-based application.

Involved in creating Hive scripts for performing ad-hoc data analysis required by the business teams.

Worked extensively on migrating on-premises workloads to AWS Cloud.

Worked on utilizing AWS cloud services like S3, EMR, Redshift, Athena and Glue Metastore.

Managed and optimized Amazon Redshift, utilizing it as the primary data warehousing solution for high-performance analytics queries.

Wrote custom support modules for upgrade implementation using PL/SQL and Unix shell scripts.

Worked on the GCP service BigQuery to implement the logic framework.

Executed seamless integration between DBT and Redshift, applying hands-on experience to leverage the strengths of both tools for efficient data processing and analytics.

Involved in continuous Integration of application using Jenkins.

Worked in an Agile environment with Jira, GitHub version control, and TeamCity for continuous builds.

Created Lambda functions to run the AWS Glue job based on the AWS S3 events.
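
A hedged sketch of such an S3-triggered Lambda handler starting a Glue job via boto3; the Glue job name and argument key are placeholders:

    # Lambda handler fired by an S3 event notification; kicks off a Glue job run.
    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            glue.start_job_run(
                JobName="ingest-healthcare-files",          # assumed Glue job name
                Arguments={"--source_path": f"s3://{bucket}/{key}"},
            )
        return {"status": "started"}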

Environment: Hadoop, AWS, BigQuery, Bigtable, Spark, Sqoop, ETL, HDFS, Snowflake DW, Oracle SQL, MapReduce, Kafka, Agile

Client: First Republic Bank, New York, NY January 2019 to February 2021

Azure Data Engineer

Responsibilities:

Implemented Azure Data Factory pipelines to extract data incrementally from on-premises Oracle to Azure Data Lake Gen2.
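
Azure Data Factory itself is configured through pipelines and linked services rather than code; purely as an illustration of the incremental (watermark) pattern such a pipeline implements, here is a hedged PySpark sketch using a JDBC read, with connection, table, and column names assumed:

    # Watermark-based incremental extract: pull only rows modified since the
    # last successful load, then append them to the data lake.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    last_watermark = "2020-12-31 00:00:00"   # placeholder; normally read from a control table

    incremental = (spark.read
        .format("jdbc")
        .option("url", "jdbc:oracle:thin:@//onprem-host:1521/ORCL")   # assumed connection
        .option("dbtable",
                f"(SELECT * FROM SALES.ORDERS "
                f"WHERE LAST_MODIFIED > TO_TIMESTAMP('{last_watermark}', "
                f"'YYYY-MM-DD HH24:MI:SS')) q")                        # assumed table/column
        .option("user", "etl_user")
        .option("password", "***")
        .load())

    incremental.write.mode("append").parquet(
        "abfss://raw@datalake.dfs.core.windows.net/orders/")           # assumed ADLS Gen2 path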

Implemented a data vault model by analyzing source tables to be processed into Synapse.

Created a data mapping sheet for a better understanding of the data, which was utilized while implementing code for data modeling.

Wrote complex SQL scripts, views, and stored procedures for transforming data based on business requirements.

Hands-on experience with different types of activities in Azure Data Factory, such as Lookup, Filter, Get Metadata, Execute Pipeline, Switch, and Notebook activities.

Hands-on experience on the Azure Databricks platform, implementing code in Spark SQL and PySpark.

Worked on Synapse, implementing different types of complex SQL views after transforming data from Parquet files in Databricks.

Designed and developed ETL integration patterns using Python on Spark.

Analyzed the SQL scripts and designed the solution to implement using PySpark.

Developed highly optimized Spark applications to perform various data cleansing, validation, transformation, and summarization activities according to the requirement

Created datasets and linked services for both cloud and on-premises resources like Oracle, ADLS Gen2, and SQL Server.

Optimized Spark SQL code by reducing join conditions and subqueries to improve execution performance.
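
One common technique behind this kind of tuning is replacing a shuffle-heavy join with a broadcast join for small dimension tables and filtering early; a hedged sketch with assumed table names:

    # Broadcast the small dimension table to avoid shuffling the large fact table.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    facts = spark.table("stage.policy_transactions")   # assumed large fact table
    dims = spark.table("stage.policy_types")           # assumed small dimension

    joined = (facts
              .filter("transaction_date >= '2020-01-01'")   # filter before the join
              .join(broadcast(dims), "policy_type_id"))
    joined.write.mode("overwrite").saveAsTable("curated.policy_enriched")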

Created a Kimball dimensional model for data in Synapse by implementing code according to table data transformations.

Migrated implemented SSIS packages to the cloud platform using Azure Data Factory and Azure Databricks.

Created different types of Dimensional and Fact Tables by following business rules and verifying legacy data warehousing.

Used different types of window functions and CTEs in SQL script development.

Used GitHub configuration for deploying ADF code and Databricks notebooks from one environment to another with the help of CI/CD pipelines.

Worked with business process managers and served as a subject matter expert for transforming vast amounts of data and creating business intelligence reports in Power BI.

After data processing, created calculated columns and measures to display the results as visuals for end users.

Configured data connections (Import and Direct query) for all types of sources which are required for report development.

Created incremental refresh for required fact tables in Power BI Desktop by passing the end date as a parameter value.

Published reports in Power BI Service.

Configured scheduled refresh for the reports.

Transformed data using merge and append queries in the Edit Queries section of Power BI Desktop.

Created report bookmarks to capture and return to specific views.

Implemented gateway connections in Power BI Service to allow data refresh from on-premises sources.

Implemented DAX expressions for MTD and YTD based on slicer selection.

Used bookmarks and the Selection pane to hide and display visuals based on selection.

Configuration of RLS (Row Level Security) for specified users with the help of specific user selection in Power BI service.

Worked on different DAX functions like EOMONTH, DATEDIFF, CALCULATE, COUNT, SUM, and FILTER.

Implemented Data sets using data modeling by connecting all the tables to satisfy all required conditions.

Worked on all types of filters like Page level, Visual level, report level and drill through filters.

Used all the functionalities of visual in Power BI like drill up, drill down, conditional formatting, tool tips

Displaying images in all visuals using web URL from source tables.

Implementation of sorting techniques in visual display of reports.

Configuration of Power BI service account after publishing reports

Published developed reports to Power BI service and implemented gateways to allow data refresh based on schedule timings.

I am skilled at using Databricks for data analysis, which allows me to extract insights and develop reports or visualizations from structured and unstructured data.

My background includes model creation, training, and integration into data pipelines.

I know how to optimize Databricks clusters for performance and cost-efficiency, which is essential for large-scale data processing.

Worked with testing team for data quality

Followed sprint-wise tasks using JIRA as the tracking tool.

Environment: Spark, Kafka, Apache Airflow, Azure SQL DB, Azure DW, Azure Data Lake, Azure Data Factory, Python, XML, Azure Databricks, T-SQL, Agile

Client: Homesite Insurance, Boston, MA November 2016 to December 2018

Data Engineer

Responsibilities:

Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing.

Used Kafka functionalities like distribution, partitioning, and the replicated commit log service for messaging systems by maintaining feeds.

Involved in Requirement gathering, Business Analysis and translated business requirements into Technical design in Hadoop and Big Data

Involved in Sqoop implementation, which helps load data from various RDBMS sources to Hadoop systems and vice versa.

Provided support in underwriting processes for employer stop-loss insurance policies, contributing to risk assessment and pricing strategies.

Worked on claims processing within the insurance domain, ensuring accurate and timely handling of claims related to employer stop-loss coverage

Involved in analyzing system failures, identifying root causes, and recommending courses of action; documented system processes and procedures for future reference.

Involved in Configuring Hadoop cluster and load balancing across the nodes.

Involved in Hadoop installation, commissioning, decommissioning, balancing, troubleshooting, monitoring, and debugging, and in the configuration of multiple nodes using the Hortonworks platform.

Configured Spark Streaming to get ongoing information from Kafka and store the stream data in HDFS.

Loaded DStream data into Spark RDDs and performed in-memory computation to generate output responses.

Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.

Wrote script for Location Analytic project deployment on a Linux cluster/farm & AWS Cloud deployment using Python.

Worked extensively on Informatica Partitioning when dealing with huge volumes of data.

Used Teradata external loaders like MultiLoad, TPump, and FastLoad in Informatica to load data into the Teradata database.

Implemented Spark Scripts using Scala, Spark SQL to access hive tables into Spark for faster processing of data.

Created several types of data visualizations using Python and Tableau. Extracted large volumes of data from AWS using SQL queries to create reports.

Experience in using Avro, Parquet, RCFile, and JSON file formats; developed UDFs in Hive and Pig.

Involved in loading data from rest endpoints to Kafka Producers and transferring the data to Kafka Brokers.

Developed a preprocessing job using Spark DataFrames to flatten JSON documents to flat files.
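
A hedged sketch of flattening nested JSON with Spark DataFrames as described above; the input path and field names are placeholders:

    # Explode a nested array and project nested fields into a flat schema.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.getOrCreate()

    raw = spark.read.json("s3://claims-raw/json/")           # assumed input path

    flat = (raw
            .withColumn("item", explode(col("line_items")))  # assumed nested array
            .select(
                col("claim_id"),
                col("policy.holder_name").alias("holder_name"),
                col("item.code").alias("item_code"),
                col("item.amount").alias("item_amount"),
            ))

    flat.write.mode("overwrite").option("header", True).csv("s3://claims-flat/csv/")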

Continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.

Environment: Hadoop, Spark, Hive, Sqoop, AWS, HBase, Kafka, Python, HDFS, Elastic Search & Agile Methodology.

Client: Ceequence Technologies, Hyderabad, India May 2015 to August 2016

Data Engineer

Responsibilities:

Designed and developed ETL processes with PySpark in AWS Glue to migrate data from S3 and generate reports.

Involved in writing and scheduling Databricks jobs using Airflow.

Developed PySpark code for AWS Glue jobs and for EMR.

Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.

Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.

Implemented Spark using Python and Spark SQL for faster testing and processing of data.

Developed Spark scripts using Python on Azure HDInsight for Data Aggregation, Validation and verified its performance over MR jobs.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala

Used IAM to create new accounts, roles, groups, and policies, and developed critical modules like generating Amazon Resource Names (ARNs) and integration points with S3, DynamoDB, RDS, Lambda, and SQS queues.

Reviewed explain plans for SQL queries in Snowflake.

Wrote various data normalization jobs for new data ingested into S3.

Created Airflow DAGs to run jobs on daily, weekly, and monthly schedules.
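
A minimal, hedged sketch of one such Airflow DAG on a daily schedule; the DAG, task, and script names are placeholders, and the weekly and monthly DAGs would differ only in their schedule:

    # Daily DAG that submits a normalization job via spark-submit.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_s3_normalization",
        start_date=datetime(2016, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        normalize = BashOperator(
            task_id="run_normalization_job",
            bash_command="spark-submit /opt/jobs/normalize_s3.py",   # assumed script
        )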

Designed and developed ETL processes with PySpark in AWS Glue to migrate data from external sources and S3 files into AWS Redshift.
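
A hedged skeleton of a Glue PySpark job of this kind, loading S3 data into Redshift with the awsglue libraries available inside Glue; the connection, path, and table names are assumptions:

    # Glue job: read Parquet from S3 and write to Redshift via a Glue connection.
    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    source = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://landing/transactions/"]},   # assumed path
        format="parquet",
    )

    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=source,
        catalog_connection="redshift-conn",                  # assumed Glue connection
        connection_options={"dbtable": "public.transactions", "database": "dw"},
        redshift_tmp_dir="s3://landing/tmp/",
    )
    job.commit()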

Involved in writing and scheduling Glue jobs, building the data catalog, and mapping from S3 to Redshift.

Created AWS Lambda functions and assigned IAM roles to schedule Python scripts using CloudWatch triggers to support infrastructure needs that required extraction of XML tags.

Involved in connecting Redshift to Tableau to create dynamic dashboards for the analytics team.

Authored Spark jobs for data filtering and transformation using PySpark DataFrames.

Environment: AWS EMR, EC2, S3, Oozie, Kafka, Spark, Spark SQL, PostgreSQL, Shell Script, Sqoop, Scala.

Client: Cybage Software Private Limited, Hyderabad, India March 2014 to April 2015

Data Engineer

Responsibilities:

Wrote Spark applications using Scala to interact with the PostgreSQL database through SQLContext and accessed Hive tables using HiveContext.

Extensively used ETL Components to Extract Data from Different Sources.

Design and development of SSIS packages to fetch data from different data sources.

Created reports using prompts, combined queries, and report-level filters.

Used SageMaker as a dev endpoint for Glue development.

Authored Spark jobs for data filtering and transformation using PySpark DataFrames in both AWS Glue and Databricks.

Imported data using Sqoop to load data from MySQL to HDFS on a regular basis.

Worked on test cases of SSIS packages and data cleansing.

Developed C#, VB.Net scripts as part of ETL coding.

Developed unit test cases and prepared technical specifications for SSIS packages.

Creation of source Queries at Analysis and Designing phase.

Preparation of mapping documents and technical design documents based on client requirement specifications.

Creating merged dimensions at report level by using multiple data providers.

Testing of reports for data validation at different levels and preparation of Test results document.

Migrated universes and reports from Dev to UAT to Prod.

Worked in performance tuning and deployment phase.

Scheduled and managed daily/weekly/monthly sales and operational reports based on the business requirements.

Developed different types of reports using SSRS and SQL Server.

Environment: SQL Server, MS BI (SSIS, SSAS & SSRS), .NET, Intelex, Talend, C#, ETL, Sqoop, PySpark


