Quality Control Azure Data

Location:
Louisville, KY
Posted:
February 21, 2025

Resume:

Sirish Yenugu

PROFESSIONAL SUMMARY

Over *+ years of experience in Data Engineering, Data Mining with large data sets of Structured and Unstructured Data, Data Acquisition, Data Validation, Predictive Modeling, Data Visualization, Web Crawling, Web Scraping, Statistical Modeling, and Natural Language Processing (NLP)

Used Python to extract financial data from various sources and integrated it into a single data set

Hands-on experience with pandas, SciPy, and NumPy packages in Python for data analytics

Manipulation of large data sets, data cleaning, quality control, and data management (Python, R, SQL)

Adept in statistical programming languages like Python and R, as well as Big Data technologies like Hadoop and Hive.

Migration of on-premises SQL Server data to Azure Data Lake Store (ADLS) using Azure Data Factory.

Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB; deployed AWS Lambda code from Amazon S3 buckets and created a Lambda deployment function configured to receive events from an S3 bucket.
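
As an illustration of this pattern (not the exact project code), the following is a minimal boto3 sketch of deploying a Lambda function from a package stored in S3 and wiring an S3 event trigger; the bucket names, function name, and role ARN are hypothetical.

    import boto3

    lambda_client = boto3.client("lambda")
    s3_client = boto3.client("s3")

    # Create the function from a deployment package already uploaded to S3.
    fn = lambda_client.create_function(
        FunctionName="process-uploads",                    # hypothetical name
        Runtime="python3.9",
        Role="arn:aws:iam::123456789012:role/lambda-exec", # hypothetical role ARN
        Handler="handler.lambda_handler",
        Code={"S3Bucket": "my-deploy-bucket", "S3Key": "lambda/process-uploads.zip"},
    )

    # Allow S3 to invoke the function, then point the data bucket's
    # object-created events at it.
    lambda_client.add_permission(
        FunctionName="process-uploads",
        StatementId="s3-invoke",
        Action="lambda:InvokeFunction",
        Principal="s3.amazonaws.com",
        SourceArn="arn:aws:s3:::my-data-bucket",
    )
    s3_client.put_bucket_notification_configuration(
        Bucket="my-data-bucket",
        NotificationConfiguration={
            "LambdaFunctionConfigurations": [
                {"LambdaFunctionArn": fn["FunctionArn"], "Events": ["s3:ObjectCreated:*"]}
            ]
        },
    )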

Designed data models for data-intensive AWS Lambda applications that perform complex analysis and generate analytical reports for end-to-end traceability, lineage, and definition of key business elements from Aurora.

Extensive experience in developing and optimizing complex SQL queries, stored procedures, and ETL workflows to support data analysis, reporting, and decision-making across diverse domains, including finance, healthcare, and telecommunications.

Proficient in leveraging PySpark for large-scale data processing, transformation, and analysis, utilizing Databricks to build scalable data pipelines that enhance data quality and reduce processing time by up to 50%.

Skilled in configuring and managing Databricks environments, optimizing cluster performance, and developing automated data workflows that improve data availability and integrity while reducing operational costs.

Adept at integrating data from multiple sources into unified data lakes and warehouses, using Databricks and PySpark to support real-time and batch processing needs for business intelligence and advanced analytics.

Expertise in tuning SQL queries and PySpark jobs to achieve optimal performance, reducing data processing times and improving overall system efficiency for large-scale data environments.

Proficient in managing the entire data science project life cycle and actively involved in all phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling (Decision Trees, Regression Models, Neural Networks, Support Vector Machines (SVM), Clustering), dimensionality reduction using Principal Component Analysis and Factor Analysis, testing and validation using ROC plots and K-fold cross validation, and Data Visualization

Deep understanding of Statistical Modeling, Multivariate Analysis, model testing, problem analysis, model comparison, and validation.

Involved in designing cubes in SSAS environment using Snowflake and Star Schemas.

Proficient in designing Snowflake schemas, star schemas, fact and dimension tables, and logical and physical modeling

Experience in using various packages in Python and R like ggplot2, caret, dplyr, Rweka, gmodels, RCurl, tm, C50, twitteR, NLP, Reshape2, rjson, plyr, pandas, numpy, seaborn, scipy, matplotlib, scikit-learn, Beautiful Soup, Rpy2

Launching Amazon EC2 cloud instances using Amazon Machine Images (Linux/Ubuntu) and configuring launched instances for specific applications.

Experienced in Data Modeling, Multi-Dimensional Modeling and involved in creation of Star Schema and Snowflake dimensional schema

Working with AWS services such as EC2, VPC, RDS, CloudWatch, CloudFront, Route53 etc.

Focus on continuous integration and deployment, promoting Enterprise Solutions to target environments.

Configuring and Networking of Virtual Private Cloud (VPC).

Wrote CloudFormation templates and deployed AWS resources using them.

Created S3 buckets, managed S3 bucket policies, and utilized S3 and Glacier for storage and backup on AWS

Hands-on experience implementing LDA and Naive Bayes; skilled in Random Forests and Decision Trees

Manipulation of large data sets, data blending, data cleaning, and data management using tools like SQL, Python, and R

Worked with various datasets, analyzed data using python/R, and implemented various supervised/unsupervised predictive models.

Experience in writing Sub Queries, Stored Procedures, Triggers, Cursors, and Functions on MySQL databases.

Experience with statistical analysis and testing (hypothesis testing, A/B testing), using statistical packages mainly in R and Python

Integrated Python with Tableau using TabPy to run Python scripts in Tableau calculated fields.
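
For illustration, a minimal TabPy sketch of this kind of integration, assuming a TabPy server at the default endpoint; the function name and scoring logic are hypothetical.

    import tabpy_client

    client = tabpy_client.Client("http://localhost:9004/")

    def fraud_score(amounts, counts):
        # Toy per-row scoring logic, for illustration only.
        return [a / (c + 1) for a, c in zip(amounts, counts)]

    # Publish the function so Tableau calculated fields can call it.
    client.deploy("fraud_score", fraud_score,
                  "Simple per-row score from amount and count inputs",
                  override=True)

A Tableau calculated field would then typically call the deployed endpoint through SCRIPT_REAL, for example SCRIPT_REAL("return tabpy.query('fraud_score', _arg1, _arg2)['response']", SUM([Amount]), SUM([Count])).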

Deep experience with design and development of complex Tableau visualization solutions

Created a Python/Django-based web application using Python scripting for data processing, MySQL for the database, and HTML/CSS/jQuery and Tableau for data visualization of the served pages.

Strong experience/knowledge in developing REST APIs.

Experienced in developing Web Services with Python programming language.

Experience in Shell Scripting, SQL Server, Linux, and Git

Scheduled and managed daily/weekly/monthly sales and operational reports based on business requirements.

Excellent knowledge of Tableau visual analytics and complex KPI scorecards

TECHNICAL SKILLS:

Analytical Tools: Tableau 2019.x/2018.x/10.x/9.x (Desktop, Public, Server, Reader, Online, Prep), Google Analytics (Adobe Site Catalyst), Power BI

Programming: Python, R, SQL, HTML, SAS

RDBMS: MySQL, MS SQL Server, NoSQL (MongoDB and InfluxDB), Teradata, Oracle, DB2, Access

SQL Tools: Teradata SQL Assistant, PL/SQL Developer, Toad, SQL Server Management Studio

Statistical Software: R, Python, SAS

Statistical Methods: Time Series Analysis, Factor Analysis, Regression Models, ANOVA Testing (one-way, two-way), Confidence Intervals, Sample Size Calculation, Stepwise Regression, Hypothesis Testing, Principal Component Analysis

Machine Learning Models: Linear Regression, Logistic Regression, Regularization, Support Vector Machines, Neural Networks, Decision Trees, Ensemble Methods (Random Forests, Gradient Boost, AdaBoost), Deep Learning, LightGBM, XGBoost

Scripting Languages: R, Python, HTML, JavaScript, SQL, PL/SQL, XML

Operating Systems: UNIX, Linux, Windows, Mac

PROFESSIONAL EXPERIENCE

AT&T, Dallas, TX Jan 2021 – Present

Sr Data Engineer

Responsibilities:

Worked on the fraud detection team to identify fraudulent customer accounts

Worked on MSSQL database queries and wrote Stored Procedures for normalization and denormalization

Utilized SQL to develop and optimize complex queries and write Stored Procedures for data normalization and denormalization, enhancing data integrity and retrieval efficiency in fraud detection processes.

Maintained and optimized data pipelines, data flows, and complex data transformations using PySpark with Databricks to support fraud detection efforts.

Developed complex SQL queries to identify patterns and anomalies in customer data, aiding in the detection and prevention of fraud.

Built dynamic SQL-based reporting tools to provide real-time analytics and insights into fraud trends, allowing for proactive measures and strategy adjustments.

Created and optimized PySpark jobs within Databricks for large-scale data transformation tasks, improving data processing speed by 30%.
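
As a minimal sketch of this kind of Databricks PySpark transformation job (paths and column names are hypothetical, not the actual pipeline):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("txn-transform").getOrCreate()

    # Read raw transactions, derive simple features, and write a partitioned Delta table.
    raw = spark.read.parquet("/mnt/raw/transactions")   # hypothetical mount point
    cleaned = (
        raw.dropDuplicates(["txn_id"])
           .filter(F.col("amount").isNotNull())
           .withColumn("txn_date", F.to_date("txn_ts"))
           .withColumn("is_high_value", (F.col("amount") > 10000).cast("int"))
    )

    (cleaned.write.format("delta")
            .mode("overwrite")
            .partitionBy("txn_date")
            .save("/mnt/curated/transactions"))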

Integrated Databricks with Azure Data Lake to manage and scale data storage solutions effectively, ensuring quick access to large datasets.

Developed monitoring and alerting solutions using Databricks to track data pipeline performance and ensure data quality and consistency.

Automated data workflows and pipeline development using Apache Airflow, integrating PySpark jobs to streamline data processing and analysis in Databricks.
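
A minimal Airflow DAG sketch of this orchestration, assuming the Databricks provider is installed; the connection id, cluster spec, and script path are hypothetical.

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

    with DAG(
        dag_id="daily_fraud_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Submit the PySpark transformation as a one-off Databricks run each day.
        transform = DatabricksSubmitRunOperator(
            task_id="transform_transactions",
            databricks_conn_id="databricks_default",
            new_cluster={
                "spark_version": "9.1.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
            spark_python_task={"python_file": "dbfs:/jobs/transform_transactions.py"},
        )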

Worked specifically on random forest experiments to estimate the number of accounts associated with fraud.

Maintained and supported optimal pipelines, data flows, and complex data transformations and manipulations using PySpark with Databricks.

Automated workflows and developed pipelines using Airflow.

Generated CSV files from Python dictionaries and exported them directly to Excel for data visualization

Transformed non-stationary datasets into stationary datasets for forecasting
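
For illustration, a minimal sketch of such a stationarity transformation, assuming a daily series loaded with pandas; the file and column names are hypothetical.

    import pandas as pd
    from statsmodels.tsa.stattools import adfuller

    series = pd.read_csv("daily_fraud_counts.csv", parse_dates=["date"],
                         index_col="date")["count"]   # hypothetical input

    # Augmented Dickey-Fuller test: a large p-value suggests a unit root,
    # so difference the series once before forecasting.
    p_value = adfuller(series.dropna())[1]
    stationary = series.diff().dropna() if p_value > 0.05 else series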

Used ensemble models to integrate various algorithms in Python

Walgreens Boots Alliance, Chicago, IL Aug 2020 – Present

Data Scientist/Engineer

Worked with millions of healthcare records using Python to find patterns in the data and derive actionable insights

Worked on MSSQL database queries and wrote Stored Procedures for normalization and denormalization

Implemented Azure Data Factory (ADF) extensively for ingesting data from different source systems like relational and unstructured data to meet business functional requirements.

Designed and developed batch processing and real-time processing solutions using ADF, Databricks clusters, and Stream Analytics.

Created and maintained SQL queries and views to support data analysis and reporting for supply chain optimization, reducing product delays by 15%.

Implemented PySpark scripts in Databricks to automate data cleansing and transformation processes, ensuring high-quality data for downstream analytics.

Designed data models in Databricks to support healthcare data analysis, utilizing PySpark for real-time processing of streaming data from multiple sources.

Developed custom SQL-based ETL solutions to integrate various data sources into a unified data warehouse, optimizing data flow and reducing latency.

Leveraged Databricks Delta Lake features for data versioning and rollback, ensuring data integrity and compliance with healthcare regulations.

Conducted performance tuning of PySpark jobs in Databricks to optimize cluster usage, reducing costs by 20% while maintaining processing speed.

Developed SQL queries and wrote Stored Procedures to normalize and renormalize data, enhancing database performance and reducing supply chain delays.

Implemented data ingestion and batch processing solutions in Azure Databricks, leveraging PySpark for real-time and batch data processing from multiple sources.

Designed and managed complex data transformations and manipulations using PySpark in Databricks to support data analytics and visualization for healthcare datasets.

Involved in migrating objects from Teradata to Snowflake

Created numerous pipelines in Azure using Azure Data Factory v2 to get data from disparate source systems, using Azure activities like Move & Transform, Copy, Filter, ForEach, Databricks, etc.

Used Docker and Kubernetes to manage microservices in support of continuous integration and continuous delivery.

Developed a data warehouse model in Snowflake for over 100 data sets using WhereScape

Maintained and supported optimal pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.

Automated workflows and developed pipelines using Airflow.

Used the PyCharm IDE and installed various libraries to connect with SQL and other tools through Python

Heavily involved in testing Snowflake to determine the best possible way to use the cloud resource

Generated CSV files from Python dictionaries and exported them directly to Excel for data visualization

Transformed non-stationary datasets into stationary datasets for forecasting

Used ensemble models to integrate various algorithms in Python
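
A minimal scikit-learn sketch of combining several algorithms into one ensemble, in the spirit described above; X and y are placeholders for the prepared features and labels.

    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    # Soft voting averages the predicted probabilities of the base models.
    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
            ("dt", DecisionTreeClassifier(max_depth=6, random_state=42)),
        ],
        voting="soft",
    )
    # scores = cross_val_score(ensemble, X, y, cv=5, scoring="f1")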

Used Tableau to visualize data from a healthcare dataset; used R for statistical modeling and data transformation before visualizing the data in Tableau; wrote R scripts and connected them to Tableau using an external ODBC connection

Developed Snowflake SQL queries to create required views to be fed to the Tableau dashboards
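
For illustration, a minimal sketch using the Snowflake Python connector to create such a reporting view; the account, credentials, and object names are hypothetical.

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="xy12345", user="etl_user", password="***",
        warehouse="REPORTING_WH", database="ANALYTICS", schema="PUBLIC",
    )
    cur = conn.cursor()
    # Aggregate view that a Tableau dashboard can query directly.
    cur.execute("""
        CREATE OR REPLACE VIEW CLAIMS_SUMMARY_V AS
        SELECT claim_date, region, COUNT(*) AS claim_count, SUM(amount) AS total_amount
        FROM RAW.CLAIMS
        GROUP BY claim_date, region
    """)
    cur.close()
    conn.close()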

Analyzed large data sets using R, used regression models and R forecasting to predict future data, and visualized the results in Tableau

Microsoft, Seattle, WA April 2020 – Aug 2020

Sr Data Engineer

Responsibilities:

Analyzed data using data mining and machine-learning packages (e.g., Apache Spark, various Python and R packages, collaborative filtering techniques, propensity modeling) to identify good predictive features and build classification models to predict student course success at CSUSM.

Built a knowledge-based recommender system in Python to help students focus on the subjects in which they are weak.

Developed an adaptive question-answering platform using the BERT model in NLP, where questions were divided into levels and appeared on the screen based on previously answered questions.
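
A minimal sketch of extractive question answering with a BERT-family model via Hugging Face transformers, illustrating the kind of model involved; the model name and texts are illustrative, not the project's.

    from transformers import pipeline

    qa = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

    context = ("Photosynthesis converts light energy into chemical energy, "
               "producing glucose and oxygen in plant cells.")
    result = qa(question="What does photosynthesis produce?", context=context)
    print(result["answer"], result["score"])   # answer span plus confidence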

Designed a plagiarism detection program using Python and NLP to inspect assignment quality, evaluated via the Jaccard similarity coefficient.
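
A minimal sketch of the Jaccard-similarity check, with a hypothetical flagging threshold:

    def jaccard_similarity(text_a: str, text_b: str) -> float:
        # Token-set intersection divided by token-set union.
        tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
        if not tokens_a and not tokens_b:
            return 0.0
        return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

    def looks_plagiarised(text_a: str, text_b: str, threshold: float = 0.6) -> bool:
        return jaccard_similarity(text_a, text_b) >= threshold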

Architected and implemented ETL and data movement solutions using Azure Data Factory.

Implemented Copy activity and custom Azure Data Factory pipeline activities for on-cloud ETL processing.

Analyzed and prepared data and identified patterns in the dataset by applying historical models, collaborating with senior data scientists to understand the data

Performed data manipulation, data preparation, normalization, and predictive modeling; improved efficiency and accuracy by evaluating models in Python and R

Scraped data from various study participants using the AWARE framework and Fitbit APIs with Jupyter Notebook (Python) and MySQL

Performed exploratory data analysis in R, including calculation of descriptive statistics, detection of outliers, assumption testing, and factor analysis

Performed data profiling to assess data quality using SQL on a complex internal database

Designed data profiles for processing, including running SQL and PL/SQL queries and using Python and R for data acquisition and data integrity checks, consisting of dataset comparisons and dataset schema checks.

Performed configuration, deployment and support of cloud services including Amazon Web Services (AWS).

Created stored procedures and wrote queries in MySQL for efficient reporting results

Implemented various classifiers such as LightGBM, XGBoost, SVC, Logistic Regression, Random Forest, and Decision Trees.

Created a pipeline for tuning classifier hyperparameters, consisting of minority-class oversampling on training data, feature scaling, and grid-search cross-validation.
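
A minimal sketch of such a tuning pipeline, assuming imbalanced-learn and LightGBM are available; the parameter grid is illustrative. Using imbalanced-learn's Pipeline keeps oversampling inside each training fold.

    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import RandomOverSampler
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import GridSearchCV
    from lightgbm import LGBMClassifier

    pipe = Pipeline([
        ("oversample", RandomOverSampler(random_state=42)),  # minority-class oversampling
        ("scale", StandardScaler()),                          # feature scaling
        ("clf", LGBMClassifier(random_state=42)),
    ])

    param_grid = {
        "clf__n_estimators": [200, 400],
        "clf__num_leaves": [31, 63],
        "clf__learning_rate": [0.05, 0.1],
    }

    search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5, n_jobs=-1)
    # search.fit(X_train, y_train)   # X_train/y_train are placeholders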

Accurately predicted sleep targets, achieving a high F1 score of 0.83 (LightGBM) in a 30-minute time window

Validated low model overfitting by comparing Training and Test set AUC scores.

Adjusted predictions using the threshold value that maximizes the difference between sensitivity and false positive rate, to better classify sleep classes and create sleep profiles for each participant.
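
For illustration, selecting the threshold that maximizes sensitivity minus false positive rate (Youden's J) can be sketched from the ROC curve; y_true and y_prob are placeholders for validation labels and predicted probabilities.

    import numpy as np
    from sklearn.metrics import roc_curve

    def best_threshold(y_true, y_prob):
        # Point on the ROC curve with the largest TPR - FPR gap.
        fpr, tpr, thresholds = roc_curve(y_true, y_prob)
        return thresholds[np.argmax(tpr - fpr)]

    # y_pred = (y_prob >= best_threshold(y_true, y_prob)).astype(int)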

Accurately measured Sleep Onset, Sleep Offset, and Sleep Duration among participants within a 30-min error margin.

Managed DevOps and infrastructure teams supporting tools and infrastructure for developers.

Leveraged AWS cloud services such as EC2, auto-scaling and VPC to build secure, highly scalable and flexible systems that handled expected and unexpected load bursts.

Launching Amazon EC2 cloud instances using Amazon Machine Images (Linux/Ubuntu) and configuring launched instances for specific applications.

Working with AWS services such as EC2, VPC, RDS, CloudWatch, CloudFront, Route53 etc

Configuring and Networking of Virtual Private Cloud (VPC).

Visualized the predictions using Tableau to present sleep profiles of a study participant which showed their sleep onset time, sleep offset time, total sleep duration, and sleep disturbance.

Created Tableau dashboards using various parameters, calculated fields, and other functionality, showing prediction errors for future tuning across all participants.

Presented findings through Tableau Stories and dashboard analyses showing how the study participants were sleep deprived, how the model overestimated total sleep duration, future work, and discussion.

Environment: SQL, Python (Jupyter), Apache Spark, AWS EC2, RStudio, Tableau, Git, T-SQL, ETL, XML, MS Office

Apple Inc, Cupertino, CA August 2018 to April 2020

Data Engineer

Responsibilities:

Data gathering, data cleansing and data wrangling performed using R.

Responsible for design and development of advanced R/Python programs to prepare, transform, and harmonize data sets in preparation for modeling.

Worked closely with the Kafka admin team to set up Kafka clusters in the QA and Production environments.

Knowledgeable in Kibana and Elasticsearch for identifying Kafka message failure scenarios.

Implemented reprocessing of failed messages in Kafka using offset IDs.
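
A minimal kafka-python sketch of this reprocessing pattern, seeking to a recorded offset; the topic, partition, offset, and handler are hypothetical.

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(
        bootstrap_servers="localhost:9092",
        enable_auto_commit=False,
        group_id="reprocess-failures",
    )

    tp = TopicPartition("payments", 0)   # topic/partition of the failed message
    consumer.assign([tp])
    consumer.seek(tp, 12345)             # offset id captured when the failure was logged

    for message in consumer:
        handle(message.value)            # hypothetical reprocessing handler
        break                            # stop after the single failed message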

Performed Data Profiling to learn about user behaviour and merged data from multiple data sources.

Performed exploratory data analysis (univariate, bivariate, multivariate) with R and Python.

Experienced in utilizing statistical techniques including Correlation, Hypothesis modeling, Multivariate analysis, and Dimension reduction (Principal Component Analysis, Greedy Stepwise Regression).

Applied data mining and modeling techniques including Linear and Logistic Regression, Multinomial Logistic Regression, SVM, Naïve Bayes, LMT, Decision Trees, Bagging, KNN, QDA, LDA, Neural Networks, Random Forests, and k-means clustering (R, Python)

Responsible for data aggregation, data pre-processing, missing value imputation and descriptive and inferential analysis with R, Python.

Worked with several R and Python packages like ggplot2, dplyr, plyr, e1071, rpart, pandas, Scikit-Learn etc.

Created multiple data product dashboards with Tableau Desktop.

Interacted with Business Analysts and Data Modellers and defined Mapping documents and Design process for various Sources and Targets.

Connected to SQL Server and created the dimensions, hierarchies, and levels in Tableau Desktop.

Worked on building queries to retrieve data into Tableau from SQL Server 2012 and Oracle SQL Developer.

Used Tableau to provide standard user interfaces for visualizing and filtering date/time data, with Hive constructing the MapReduce jobs necessary to parse the string data as date/time to satisfy the queries Tableau generates.

Used Hive and Hadoop configuration variables for a given connection from Tableau to tune performance characteristics.

Environment: Windows, Oracle, Anaconda, Python, RStudio, Git, Tableau

Deutsche Bank, Washington DC May 2017 – August 2018

Data Engineer

Responsibilities:

Data gathering, data cleaning and data wrangling performed using Python.

Used Random Forests, K-means, and KNN for data analysis.

Utilized statistical techniques to understand the data, perform descriptive statistics (mean, median, mode, density distributions, box plots etc.), inferential statistics (t-test, ANOVA, Chi square etc.) and hypothesis testing with Python.

Experienced in designing and developing Tableau visualization solutions; created business requirement documents and plans for creating dashboards.

Built customized interactive investment performance dashboards for various users indicating return on investment.

Used the Spark API to generate PairRDDs using Java.

Created dashboards for analysing the various stock trends and Histograms showing how returns are distributed into different bins to see variability.

Built metrics to track the value of an investment over time and derive ROI.

Worked extensively with advanced analysis actions, calculations, parameters, background images, and maps.

Used Level of Detail (LOD) functions, trend lines, statistics, and log axes; used groups, hierarchies, and sets to create detail-level summary reports and dashboards using KPIs.

Administered documentation of infrastructure, architecture, and networks; conducted audits of events, troubleshot issues, and educated users

Created user filters to enable row-level security in Tableau within the client's premises.

Defined best practices for Tableau report development and effectively used data blending and multiple data sources joins in Tableau 10.1.

Created, customized, and shared interactive web dashboards in minutes with a simple drag-and-drop method, with dashboards accessible from any browser or tablet.

Analysed metrics for service requests and incidents to identify problem trends and adjusted training/support of technical staff accordingly, meeting 100% of service level agreement (SLA) compliance.

Implemented report generation framework.

Created stacked bars, bar graphs, heat maps, Gantt charts, bullet charts, and sparklines with alert mechanisms; drew various KPI graphs; followed QlikView best practices.

Gathered user requirements, analysed and designed software solution based on the requirements.

Upgraded Tableau platforms from 9.1 to Tableau 10 in a clustered environment and performed content migration

Experienced in maintaining users, groups, and sites.

Extensively worked with JavaScript, AJAX, jQuery for various front-end validations.

Extensive experience developing Stored Procedures, Functions, Views, Triggers, and complex SQL queries using SQL Server, T-SQL, and Oracle PL/SQL.

Connected to Oracle directly and created the dimensions, hierarchies, and levels in Tableau Desktop.

Built reliable infrastructure services in AWS from the ground up to deliver highly scalable services.

Worked on building queries to retrieve data into Tableau from SQL Server 2008 and Oracle SQL Developer, and developed T-SQL statements for loading data into Target Database.

Environment: SQL Server 2016, RStudio, Hadoop, Python, SAS, Tableau.

T-Mobile, Seattle, WA Dec 2016 - May 2017

Role: Data Engineer

Responsibilities:

●Analyzed and prepared data and identified patterns in the dataset by applying historical models while collaborating with senior data scientists to understand the data.

●Performed data manipulation, data preparation, normalization, and predictive modelling. Improved efficiency and accuracy by evaluating models in Python and R

●Used Python and R programming to improve models and optimize services.

●Developed a pricing model for various bundled product and service offerings to predict and optimize gross margin.

●Built a price elasticity model for various bundled product and service offerings.

●Under the supervision of a Sr. Data Scientist, performed data transformation through rescaling and normalizing variables

●Developed predictive causal model using annual failure rate and standard cost basis for the new bundled service offering.

●Worked with the partner sales and marketing teams and collaborated with a cross-functional team to frame and answer important data questions, prototyping and experimenting with ML/DL algorithms and integrating them into production systems for different business needs.

●Worked on multiple datasets containing two billion values of structured and unstructured data about web application usage and online customer surveys.

●Hands-on experience with the Amazon Redshift platform.

●Performed data cleaning, applying backward and forward filling methods on the dataset to handle missing values.
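
A minimal pandas sketch of that filling strategy (file and column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("usage_survey.csv", parse_dates=["event_date"])
    df = df.sort_values("event_date")

    # Forward-fill first, then backward-fill anything still missing at the start.
    df["daily_usage"] = df["daily_usage"].ffill().bfill()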

Environment: MS SQL Server, R/RStudio, SQL Enterprise Manager, Python, Redshift, MS Excel, Power BI, Tableau, T-SQL, ETL, MS Access, XML, MS Office, Outlook, SAS E-Miner, Apache Airflow

Telstra, Melbourne, Australia Aug 2015 – Dec 2016

Data Analyst

Responsibilities:

Participated in all phases of data mining, including data cleaning, data collection, model development, validation, and visualization, and performed gap analysis.

Built analytics capabilities, proposed implementable solutions based on insights derived from data, and was involved in extracting, analyzing, and representing data for a retail client's loyalty rewards program

Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, Databricks, SQL Database, and SQL Data Warehouse environment.

Experience managing Azure Data Lakes (ADLs) and Data Lake Analytics and an understanding of how to integrate with other Azure Services.

Migrated on-premises data (Oracle/Teradata) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF v1/v2).

Designed and created optimal pipeline architecture on the Azure platform.

Created pipelines in Azure using ADF to get data from different source systems and transform it using various activities.

Designed and developed visualization solutions and created business requirement documents and plans for Tableau dashboards.

Developed triggers, stored procedures, functions, and packages using cursor and ref cursor concepts associated with the project in PL/SQL

Involved in writing QA test cases for the Tableau dashboards.

Involved in ETL applications, data feeding, cleansing, archiving, and re-modeling.

Designed and Optimized Data Connections, Data Extracts, Schedules for Background Tasks, and Incremental Refresh for the weekly and monthly dashboard reports on Tableau Server.

Analyzed the data using statistical features in Tableau to develop Trend Analysis reports.

EDUCATION

James Cook University, Brisbane, Australia

Master’s in Information Technology


