
DATA ENGINEER

Name: Sathvik Ganta

Email: adzzvg@r.postjobfree.com

Phone: 469-***-****

PROFESSIONAL SUMMARY:

IT professional with 8+ years of experience, specializing in the Big Data ecosystem: data acquisition, ingestion, modeling, storage, analysis, integration, processing, and database management.

Experience in designing interactive dashboards and reports, and performing ad-hoc analysis and visualization using Tableau, Power BI, Arcadia, and Matplotlib.

Experience in application development, implementation, deployment, and maintenance using Hadoop and Spark-based technologies like Cloudera, Hortonworks, Amazon EMR, and Azure HDInsight.

Experienced in building data pipelines using Azure Data Factory, Azure Databricks, and loading data to Azure Data Lake, Azure SQL Database, Azure SQL Data Warehouse, and controlling database access.

Extensive experience with Azure services such as HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, and Storage Explorer. Experienced in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, controlling and granting database access, and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Hands-on experience architecting legacy data migration projects, such as Teradata to AWS Redshift and on-premises to the AWS Cloud.

Used Apache OpenNLP and Stanford CoreNLP for natural language processing and sentiment analysis.

Well versed in big data on AWS cloud services, including EC2, S3, Glue, Athena, DynamoDB, and Redshift.

Performed the migration of Hive and MapReduce jobs from an on-premises MapR cluster to the AWS cloud using EMR.

Set up Google Cloud Platform (GCP) firewall rules to allow or deny traffic to and from Virtual Machine instances based on specified configurations, and used GCP Cloud CDN (Content Delivery Network) to deliver content from GCP cache locations, drastically improving user experience and latency.

Experience with the Apache Spark ecosystem: Spark Core, Spark SQL, DataFrames, RDDs, and Spark MLlib.

Strong experience in Business and Data Analysis, Data Profiling, Data Migration, Data Integration, Data governance and Metadata Management, Master Data Management and Configuration Management.

Experience in implementing various Big Data Engineering, Cloud Data Engineering, Data Warehouse, Data Mart, Data Visualization, Reporting, Data Quality, and Data Virtualization solutions.

Experience in analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, and Spark SQL for Data Mining, Data Cleansing, and Machine Learning.

Hands - on experience in Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.

Extensive knowledge of developing Spark Streaming jobs with RDDs (Resilient Distributed Datasets) using Scala, PySpark, and the Spark shell.

Good knowledge of security requirements and their implementation using Azure Active Directory, Sentry, Ranger, and Kerberos for authentication and resource authorization.

Good knowledge of Hadoop cluster architecture and its key concepts - Distributed file systems, Parallel processing, High availability, fault tolerance, and Scalability.

Complete knowledge of Hadoop architecture and Hadoop cluster daemons, including NameNode, DataNode, ResourceManager, NodeManager, and JobHistory Server.

Expertise in developing Spark applications for interactive analysis, batch processing, and stream processing using PySpark and Scala.

Advanced knowledge of the Hadoop-based data warehouse (Hive) and database connectivity (Sqoop).

Ample experience using Sqoop to ingest data from RDBMS - Oracle, MS SQL Server, Teradata, PostgreSQL, and MySQL.

Experience in working with various streaming ingest services with Batch and Real-time processing using Spark Streaming, Kafka, Confluent, Storm, Flume, and Sqoop.

Proficient in using Spark API for streaming real-time data, staging, cleaning, applying transformations, and preparing data for machine learning needs.
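Illustrative sketch of this kind of streaming ingestion and cleaning job, assuming placeholder broker, topic, schema, and path names (PySpark Structured Streaming from Kafka to a Parquet staging area):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("stream-staging").getOrCreate()

    # Placeholder schema for the incoming JSON events.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
           .option("subscribe", "events")                     # placeholder topic
           .load())

    # Parse the Kafka value bytes as JSON, drop malformed/empty records, stage as Parquet.
    cleaned = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
                  .select("e.*")
                  .filter(col("amount").isNotNull()))

    (cleaned.writeStream
            .format("parquet")
            .option("path", "/staging/events")
            .option("checkpointLocation", "/checkpoints/events")
            .start())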

Experience in developing end-to-end ETL pipelines using Snowflake, Alteryx, and Apache NiFi for both relational and non-relational databases (SQL and NoSQL).

Strong working experience on NoSQL databases and their integration with the Hadoop cluster - HBase, Cassandra, MongoDB, DynamoDB, and Cosmos DB.

Experience with AWS cloud services to develop cloud-based pipelines and Spark applications using EMR, Lambda, and Redshift.

Experience in all phases of Data Warehouse development like requirements gathering, design, development, implementation, testing, and documentation.

Extensive knowledge of Dimensional Data Modeling with Star Schema and Snowflake for FACT and Dimensions Tables using Analysis Services.

Experienced in Automating, Configuring, and deploying instances on AWS, Azure and GCP cloud environments and Data centers.

Good experience in the development of Bash scripting, T-SQL, and PL/SQL Scripts.

Sound knowledge in developing highly scalable and resilient Restful APIs, ETL solutions, and third-party platform integrations as part of Enterprise Site platform.

Experience in implementing pipelines using ELK (Elasticsearch, Logstash, Kibana) and developing stream processes using Apache Kafka.

TECHNICAL SKILLS

Big Data Ecosystem

HDFS, Yarn, MapReduce, Spark, Kafka, Kafka Connect, Hive, Airflow, StreamSets, Sqoop, HBase, Flume, Pig, Ambari, Oozie, Zookeeper, Nifi, Sentry

Hadoop Distributions

Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP

Cloud Environment

Amazon Web Services (AWS), Microsoft Azure, GCP

Databases

MySQL, Oracle, Teradata, MS SQL SERVER, PostgreSQL, DB2

NoSQL Database

DynamoDB, HBase

AWS

EC2, EMR, S3, Redshift, Lambda, Kinesis, Glue, Data Pipeline

Microsoft Azure

Databricks, Data Lake, Blob Storage, Azure Data Factory, SQL Database, SQL Data Warehouse, Cosmos DB, Azure Active Directory

Operating systems

Linux, Unix, Windows 10, Windows 8, Windows 7, Windows Server 2008/2003, Mac OS

Software/Tools

Microsoft Excel, Statgraphics, Eclipse, Shell Scripting, ArcGIS, Linux, Jupyter Notebook, PyCharm, Vi / Vim, Sublime Text, Visual Studio, Postman

Reporting Tools/ETL Tools

Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI, Arcadia, Data stage, Pentaho

Programming Languages

Python (Pandas, SciPy, NumPy, Scikit-Learn, Stats Models, Matplotlib, Plotly, Seaborn, Keras, TensorFlow, PyTorch), PySpark, T-SQL/SQL, PL/SQL, HiveQL, Scala, UNIX Shell Scripting

Version Control

Git, SVN, Bitbucket

Development Tools

Eclipse, NetBeans, IntelliJ, Hue, Microsoft Office

Professional Experience

Client: Volkswagen, Herndon, Virginia Jan 2022 to Present

Senior Data Engineer

Responsibilities:

Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.

Analyzed, designed, and built modern data solutions using Azure cloud services to support data visualization.

Understood the current production state of the application and determined the impact of the new implementation on existing business processes.

Extracted, transformed, and loaded data from different source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, Python, and Azure Data Lake Analytics.

Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
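A minimal Databricks-style sketch of this pattern; the storage account, container, and table names are placeholders, and ADLS authentication is assumed to already be configured on the cluster:

    from pyspark.sql import SparkSession, functions as F

    # Outside Databricks, build a session explicitly; in a Databricks notebook `spark` is predefined.
    spark = SparkSession.builder.appName("adls-curation").getOrCreate()

    raw = spark.read.json("abfss://raw@examplestorageacct.dfs.core.windows.net/sales/")

    curated = (raw.withColumn("order_date", F.to_date("order_ts"))
                  .dropDuplicates(["order_id"]))

    # Persist the cleaned data as a curated table for downstream reporting.
    curated.write.mode("overwrite").saveAsTable("curated.sales_orders")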

Created pipelines in ADF using linked services, datasets, and pipeline activities to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.

Responsible for monitoring Azure Data Factory jobs, troubleshooting failures, and providing resolutions for failed ADF jobs.

Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
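A simplified sketch of such an aggregation job; paths and column names are hypothetical, not the client's actual schema:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

    # Read two differently formatted sources (Parquet events, CSV account reference data).
    events = spark.read.parquet("/data/raw/usage_events")
    accounts = spark.read.option("header", True).csv("/data/raw/accounts.csv")

    # Join the sources and roll usage up by customer segment and day.
    usage_by_segment = (
        events.join(accounts, "account_id")
              .groupBy("segment", F.to_date("event_ts").alias("event_date"))
              .agg(F.count("*").alias("events"),
                   F.countDistinct("account_id").alias("active_accounts"))
    )

    usage_by_segment.write.mode("overwrite").parquet("/data/curated/usage_by_segment")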

Worked with the Parquet file format and various other file types.

Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.

Worked on a migration project to move data from different sources (Teradata, Hadoop, DB2) to Google Cloud Platform (GCP) using the UDP framework, transforming the data with Spark Scala scripts.

Worked on creating data ingestion processes to maintain a global data lake on GCP and BigQuery.
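As an illustration, an ingestion step of this kind might load curated Parquet files from a data lake bucket into BigQuery; the project, bucket, dataset, and table names below are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client(project="example-gcp-project")

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    load_job = client.load_table_from_uri(
        "gs://example-datalake/curated/usage/*.parquet",  # curated files in the lake bucket
        "analytics.usage_by_segment",                     # destination dataset.table
        job_config=job_config,
    )
    load_job.result()  # block until the load job completes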

Experienced in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and tuning memory.
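Representative tuning knobs for such a job are sketched below; the values are illustrative only and depend on data volume and cluster size:

    from pyspark.sql import SparkSession

    # Executor memory/cores are often supplied via spark-submit or cluster config instead.
    spark = (SparkSession.builder
             .appName("tuned-batch-job")
             .config("spark.sql.shuffle.partitions", "400")  # parallelism for shuffle-heavy stages
             .config("spark.executor.memory", "8g")
             .config("spark.executor.cores", "4")
             .config("spark.memory.fraction", "0.7")         # heap share for execution/storage
             .getOrCreate())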

Worked on production bugs, especially in Azure Databricks notebooks, and provided new PySpark and Spark SQL logic to eliminate them.

Experience with different kinds of data platforms (cubes, data marts).

Developed notebooks and ETL pipelines in Azure Data Factory (ADF) that process data according to the job trigger.

Hands-on experience developing SQL scripts for automation purposes.

Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).

Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.

Strong understanding of Data Modeling in data warehouse environment such as star schema and snowflake schema.

Experience in migrating legacy applications to the GCP platform and managing GCP services such as Compute Engine, Cloud Storage, BigQuery, VPC, Stackdriver, Load Balancing, and IAM.

Worked on Spark Architecture including Spark Core, Spark SQL, Data Frames, Spark Streaming, Driver Node, Worker Node, Stages, Executors and Tasks.

Involved in Database Design and development with Business Intelligence using SQL Server 2014/2016, Integration Services (SSIS), DTS Packages, SQL Server Analysis Services (SSAS), DAX, OLAP Cubes, Star Schema and Snowflake Schema.

Extensively utilized SSIS packages to create a complete ETL process and load data into the database used by Reporting Services.

Identified the dimension, fact tables and designed the data warehouse using star schema. Developed Multi-Dimensional Objects (Cubes, Dimensions) using SQL Server Analysis Services (SSAS).

Extensively used Azure DevOps for code check-in and checkouts for version control.

Environment: Hadoop, Azure Cloud Services (Azure Data Factory, Azure Databricks, Azure Data Lake), MS Visual Studio, GitHub, PySpark, Scala, GCP, SQL Server, SQL, MS Power BI.

Client: AT&T, Dallas Texas Feb 2021 to Dec 2021

Senior Data Engineer

Responsibilities:

Studied in-house requirements for the data warehouse to be developed and conducted one-on-one sessions with business users to gather data warehouse requirements.

Analyzed database requirements in detail with project stakeholders by conducting Joint Requirements Development (JRD) sessions.

Developed a conceptual model using Erwin based on the requirements analysis.

Developed normalized Logical and Physical database models to design OLTP system for insurance applications.

Created dimensional model for the reporting system by identifying required dimensions and facts using Erwin r7.1.

Used forward engineering to create a Physical Data Model with DDL that best suits the requirements from the Logical Data Model.

Provided on-call support to monitor and fix production failures, handling the various kinds of issues that come up in Ab Initio, UNIX, and Autosys jobs.

Designed and customized data models for a data warehouse supporting data from multiple sources in real time, and created mapping documents to outline data flow from sources to targets.

Involved in building the ETL architecture and Source to Target mapping to load data into Data warehouse.

Involved in Dimensional modeling (Star Schema) of the Data warehouse and used Erwin to design the business process, dimensions and measured facts.

Extracted data from flat files and other RDBMS databases into a staging area and populated it into the data warehouse.

Analyzed, designed, and built modern data solutions using Azure PaaS services to support data visualization.

Understood the current production state of the application and determined the impact of the new implementation on existing business processes.

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

Created pipelines in ADF using linked services, datasets, and pipeline activities to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.

Installed and configured Apache Airflow for workflow management and created workflows in Python.
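A minimal example of such a workflow definition; the DAG ID, schedule, and script paths are hypothetical:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Hypothetical DAG: a daily ingest step followed by a transform step.
    with DAG(
        dag_id="daily_ingest",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        ingest = BashOperator(task_id="ingest", bash_command="python /opt/jobs/ingest.py")
        transform = BashOperator(task_id="transform", bash_command="python /opt/jobs/transform.py")

        ingest >> transform  # transform runs only after ingest succeeds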

Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.

Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.

Experienced in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and tuning memory.

Wrote UDFs in Scala and PySpark to meet specific business requirements.
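A small PySpark sketch of the UDF pattern; the masking rule shown is purely illustrative, not the actual business logic:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

    # Illustrative rule: mask all but the last four characters of an account id.
    @udf(returnType=StringType())
    def mask_account(account_id):
        return "****" + account_id[-4:] if account_id else None

    df = spark.createDataFrame([("1234567890",), (None,)], ["account_id"])
    df.withColumn("masked_id", mask_account(col("account_id"))).show()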

Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.

Hands-on experience developing SQL scripts for automation purposes.

Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).

Well versed in SQL Server and T-SQL (DDL and DML) for constructing tables and applying normalization/denormalization techniques to database tables.

Involved in creating and updating clustered and non-clustered indexes to maintain SQL Server performance.

Worked on creating indexes and managing index fragmentation to achieve better query performance. Experience in performance tuning and query optimization.

Environment: Python, PySpark, Data Virtualization, Data Warehouse, AWS, Hive, HBase, Impala, Airflow, Azure, ADF, Azure Data Lake (ADL), SQL Server, Sqoop, GCP, MapReduce, NoSQL, UNIX, HDFS, Oozie, SSIS.

Client: Auto Nation, Phoenix, AZ Jan 2020 to Jan 2021

Data Engineer

Role & Contribution:

Worked closely with the project manager to develop work plans for data warehouse projects and kept the manager aware of any issues.

Supported the development group by providing APIs and performing middle-tier application development.

Provided analytical network support to improve quality and standard work results.

Provided accurate estimates for project development and implementation and worked with management to meet those estimates.

Developed frameworks, metrics, and reporting to ensure progress can be measured, evaluated, and continually improved.

Performed in-depth analysis of research information to identify opportunities and develop proposals and recommendations for use by management.

Supported the development of performance dashboards encompassing key metrics reviewed with senior leadership and sales management.

Well versed in designing, building, and implementing cloud systems.

Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.

Deployed applications to GCP using Spinnaker (RPM based) and worked on designing star schemas in BigQuery.

Launched a multi-node Kubernetes cluster in Google Kubernetes Engine (GKE) and migrated the Dockerized application from AWS to GCP.

Environment: Cassandra, HDFS, MongoDB, Zookeeper, Oozie, Pig, Google Cloud Platform (Scala), Kubernetes, GitHub, Jenkins, BigQuery, Docker, JIRA, Unix/Linux CentOS 7, Nexus V3, Bash shell scripting, Python, Node.js, Apache Tomcat, SQL.

Client: GTE Financial, Tampa, FL Feb 2019 to Dec 2019

Data Engineer

Role & Contribution:

Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications, executing machine learning use cases with Spark ML and MLlib.
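A condensed example of a Spark ML pipeline of this kind; the input path and the feature/label column names are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()

    train = spark.read.parquet("/data/features/train")

    # Assemble raw columns into a feature vector, then fit a classifier on it.
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    model = Pipeline(stages=[assembler, lr]).fit(train)
    predictions = model.transform(train)
    predictions.select("label", "prediction", "probability").show(5)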

Identified areas of improvement in the existing business by unearthing insights from vast amounts of data using machine learning techniques.

Interpreted and solved business problems using data analysis, data mining, optimization tools, machine learning techniques, and statistics.

Designed and developed NLP models for sentiment analysis.

Led discussions with users to gather business processes requirements and data requirements to develop a variety of Conceptual, Logical and Physical Data Models.

Expert in Business Intelligence and Data Visualization tools: Tableau, MicroStrategy.

Worked on machine learning on large-scale data using Spark and MapReduce.

Led the implementation of new statistical algorithms and operators on Hadoop and SQL platforms, and utilized optimization techniques, linear regression, K-means clustering, Naive Bayes, and other approaches.

Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.

Extracted, transformed, and loaded data sources to generate CSV data files using Python programming and SQL queries.

Stored and retrieved data from data-warehouses using Amazon Redshift.

Worked on Teradata SQL queries, Teradata indexes, and utilities such as MultiLoad, TPump, FastLoad, and FastExport.

Used Data Warehousing Concepts like Ralph Kimball Methodology, Bill Inmon Methodology, OLAP, OLTP, Star Schema, Snowflake Schema, Fact Table and Dimension Table.

Refined time-series data and validated mathematical models using analytical tools like R and SPSS to reduce forecasting errors.

Worked on data pre-processing and cleaning to perform feature engineering, and applied data imputation techniques for missing values in the dataset using Python.
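A brief sketch of such an imputation step, assuming a hypothetical CSV input and using median imputation for numeric columns:

    import pandas as pd
    from sklearn.impute import SimpleImputer

    # Hypothetical input file and columns; the real dataset and rules differed.
    df = pd.read_csv("transactions.csv")

    # Median-impute numeric columns; fill categorical gaps with an explicit "unknown" marker.
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

    cat_cols = df.select_dtypes(include="object").columns
    df[cat_cols] = df[cat_cols].fillna("unknown")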

Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data.
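One way to express such a check, shown here through Spark SQL against a Hive table; the table and column names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dq-checks").enableHiveSupport().getOrCreate()

    checks = spark.sql("""
        SELECT COUNT(*)                                             AS row_count,
               SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_ids
        FROM   warehouse.daily_sales
        WHERE  load_date = current_date()
    """).collect()[0]

    # Fail the job loudly if today's load is empty or contains null keys.
    assert checks["row_count"] > 0, "load produced no rows for today"
    assert checks["null_customer_ids"] == 0, "null customer_id values found"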

Created various types of data visualizations using Python and Tableau.

Consulted on broad areas including data science, spatial econometrics, machine learning, information technology and systems, and economic policy using R.

Performed Data mapping between source systems to Target systems, logical data modeling, created class diagrams and ER diagrams and used SQL queries to filter data.

Enabled speedy reviews and first mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and PIG to pre-process the data.

Used various R data structures to get the data into the right format for analysis, which was later used by other internal applications to calculate thresholds.

Maintained conceptual, logical, and physical data models along with corresponding metadata.

Worked on data migration from an RDBMS to a NoSQL database, providing a complete picture of data deployed across various data systems.

Environment: Python, Hadoop, MapReduce, Spark, Spark MLlib, Tableau, SQL, Excel, VBA, SAS, MATLAB, AWS, SPSS, Cassandra, Oracle, MongoDB, SQL Server 2012, DB2, T-SQL, PL/SQL, XML.

Client: Couth InfoTech Pvt. Ltd, Hyderabad, India May 2013 to Dec 2017

Data Engineer

Role & Contribution:

Wrote shell scripts to extract data from Unix servers into Hadoop HDFS for long-term storage.

Implemented a microservices architecture using the Spring Boot framework.

Created messaging queues using RabbitMQ to read data from HDFS for processing. Wrote a Spark application to implement slowly changing dimensions (SCD Type I).
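A simplified sketch of SCD Type I in Spark, where incoming records simply overwrite matching dimension rows; the paths and key column are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("scd-type1").getOrCreate()

    # SCD Type I keeps no history: changed rows are replaced, new rows are appended.
    dim = spark.read.parquet("/warehouse/dim_product")
    updates = spark.read.parquet("/staging/product_updates")

    unchanged = dim.join(updates, "product_id", "left_anti")  # rows with no incoming change
    merged = unchanged.unionByName(updates)                   # replaced + new rows

    # Write to a new location first; overwriting the path being read in the same job is unsafe.
    merged.write.mode("overwrite").parquet("/warehouse/dim_product_v2")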

Created an Oozie workflow to automate the Spark application. Wrote a Pig script to load processed data from HDFS into MongoDB.

Used MongoDB to store processed products and commodities data, which can be consumed downstream by web applications (Green Box/Zoltar). Deployed the Spark application and Java web services in Pivotal Cloud Foundry.

Followed Agile methodology, including test-driven development and pair programming. Strong communication and analytical skills and a demonstrated ability to handle multiple tasks as well as work independently or in a team.

Gathered business requirements from business partners and subject matter experts. Installed and configured a Hadoop cluster on Amazon Web Services (AWS) for POC purposes.

Involved in implementing a nine-node CDH4 Hadoop cluster on Red Hat Linux. Imported data from RDBMSs to HDFS and Hive using Sqoop on a regular basis.

Created Hive tables and worked on them using HiveQL, which automatically invokes and runs MapReduce jobs in the backend. Responsible for developing Pig Latin scripts.

Developed custom MapReduce programs for data analysis and data cleaning using Pig Latin scripts.

Managing and scheduling batch Jobs on a Hadoop Cluster using Oozie.

Experience in managing and reviewing Hadoop Log files.

Created integration tests to check the validity of the data being crawled. Worked with other teams to determine and understand business requirements.

Created Hive-based reports to support application metrics (used by UI teams for reporting).

Implemented new drill-to-detail dimensions in the data pipeline based on business requirements.

Hands-on experience with Apache Solr for indexing HBase tables and querying the indexes. Worked with multiple teams to resolve production issues.

Automated deployments via Chef. Handled production deployments and provided production support to fix defects.

Environment: Hadoop, MapReduce, HDFS, Hive, Pig, Java, Cloudera Hadoop distribution, AWS, Linux, XML, Eclipse, Oracle 10g, PL/SQL.

Education:

Bachelor's in Computer Science Engineering from Gitam University

Master's in Information Systems and Operations Management from the University of FL.


