Big Data Azure

Location:
Overland Park, KS
Posted:
March 26, 2024

Pranav Saggam

ad4ldl@r.postjobfree.com

913-***-****

Senior Big Data Engineer

LinkedIn

PROFESSIONAL SUMMARY

Around 8 years of experience developing, deploying, and managing big data applications, ETL architecture, and cloud computing platforms such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform, delivering results at a global level with extensive knowledge of the Retail, Healthcare, and Finance domains.

Extensively worked on AWS Cloud services such as EC2, VPC, IAM, RDS, ELB, EMR, EKS, Auto Scaling, S3, CloudFront, Glacier, Elastic Beanstalk, Lambda, ElastiCache, Route 53, OpsWorks, CloudWatch, CloudFormation, Redshift, DynamoDB, SNS, SQS, SES, Kinesis, Firehose, and Cognito.

Hands-on experience with Azure analytics services: Azure Data Lake Store (ADLS), Azure Data Lake Analytics (ADLA), Azure SQL DW, Azure SQL DB, Azure Data Factory (ADF), Azure Databricks (ADB), Azure DevOps, Azure Synapse Analytics, Azure Cosmos DB (NoSQL), etc., and built Azure environments by deploying Azure IaaS virtual machines (VMs) and cloud services (PaaS).

Hands-on experience with Snowflake utilities, SnowSQL, and Snowpipe.

Hands-on experience with Google Cloud Platform (GCP) big data products: BigQuery, Cloud Dataproc, Cloud Dataflow, Google Cloud Storage, Cloud Composer (Airflow as a service), Dataprep, Data Fusion, Data Catalog, Cloud Pub/Sub, and Cloud Functions, plus cloud provisioning tools such as Terraform and CloudFormation.

Proven experience delivering software solutions for a wide range of high-end clients, including big data processing, ingestion, analytics, and cloud migration from on-premises to AWS.

Expertise in Azure infrastructure management (Azure Web Roles, Worker Roles, SQL Azure, Azure Storage).

Strong experience working with Informatica ETL, including Informatica PowerCenter Designer, Workflow Manager, Workflow Monitor, Informatica Server, and Repository Manager.

Implemented Spark applications using the Spark API, DataFrames, Datasets, and RDDs, applying concepts such as cache, persist, repartition, and coalesce, along with other optimization techniques for Spark applications.
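
A minimal PySpark sketch of the caching and repartitioning techniques mentioned above; the paths, column names, and partition counts are illustrative assumptions, not the original application.

    # Illustrative PySpark snippet (paths and column names are placeholders).
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("optimization-sketch").getOrCreate()

    # Read a sample dataset and persist it because it is reused below.
    df = spark.read.parquet("s3://example-bucket/input/")  # placeholder path
    df.persist(StorageLevel.MEMORY_AND_DISK)

    # Repartition before a wide aggregation, then coalesce before writing
    # to avoid producing many small output files.
    agg = (df.repartition(200, "customer_id")
             .groupBy("customer_id")
             .count())

    agg.coalesce(10).write.mode("overwrite").parquet("s3://example-bucket/output/")
    df.unpersist()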

Implemented data warehouse solutions consisting of ETL and on-premises-to-cloud migration, with strong expertise in building and deploying batch and streaming data pipelines in cloud environments.

Worked on Airflow 1.8 (Python 2) and Airflow 1.9 (Python 3) for orchestration; familiar with building custom Airflow operators and orchestrating workflows with dependencies spanning multiple clouds.

Good understanding of Spark architecture with Databricks and Structured Streaming; set up AWS and Microsoft Azure with Databricks, Databricks workspaces for business analytics, managed clusters in Databricks, and managed the machine learning lifecycle.

Experience in Hadoop development and administration; proficient with Hadoop and its ecosystem components, including Hive, HDFS, Pig, Sqoop, HBase, Python, and Spark.

Experience in developing custom UDFs for Pig and Hive.

Demonstrated understanding of the fact/dimension data warehouse design model, including star and snowflake design methods.

Experience in working with Waterfall and Agile development methodologies.

Experience identifying data anomalies using statistical analysis and data mining techniques.

Experienced in building Snowpipe pipelines, with in-depth knowledge of data sharing in Snowflake and of Snowflake database, schema, and table structures.

Designed and developed logical and physical data models that utilize concepts such as Star Schema, Snowflake Schema and Slowly Changing Dimensions.

Expertise in using Airflow and Oozie to create, debug, schedule, and monitor ETL jobs.

Experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.

Experience with different file formats such as Avro, Parquet, ORC, JSON, and XML, and compression codecs such as Snappy and Bzip2.

Experienced in configuring and administering the Hadoop Cluster using major Hadoop Distributions like Apache Hadoop and Cloudera.

Hands-on experience handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, and SQL Server; created Java applications to handle data in MongoDB and HBase.

Work Experience

DISCOVER FINANCIAL Chicago, IL

AWS Data Engineer Feb 2021 to Present

Implemented solutions utilizing advanced AWS components (EMR, EC2, Redshift, S3, Athena, Glue, Lambda, Kinesis, etc.) integrated with big data/Hadoop distribution frameworks: Hadoop YARN, MapReduce, Spark, Hive, etc.

Used AWS Athena extensively to ingest structured data from S3 into multiple systems, including Redshift, to generate reports.

Created on-demand tables on top of S3 files with Lambda functions and AWS Glue, using Python and PySpark.
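
A hypothetical sketch of such a Lambda handler registering an S3 dataset as a Glue catalog table so it can be queried on demand; the database, table, columns, and S3 location are placeholders, not the original code.

    # Hypothetical Lambda handler that registers an S3 dataset as a Glue table.
    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        glue.create_table(
            DatabaseName="vendor_db",                  # placeholder database
            TableInput={
                "Name": "vendor_orders",               # placeholder table
                "TableType": "EXTERNAL_TABLE",
                "StorageDescriptor": {
                    "Columns": [
                        {"Name": "order_id", "Type": "string"},
                        {"Name": "amount", "Type": "double"},
                    ],
                    "Location": "s3://example-bucket/vendor/orders/",
                    "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                    "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                    "SerdeInfo": {
                        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                    },
                },
            },
        )
        return {"status": "table created"}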

Performed end-to-end Architecture and implementation assessment of various AWS services like Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis.

Created AWS RDS (Relational Database Service) instances to serve as the Hive metastore, consolidating EMR cluster metadata into a single RDS database so that metadata is not lost when an EMR cluster is terminated.

Involved in code migration of a quality-monitoring tool from AWS EC2 to AWS Lambda and built logical datasets to administer quality monitoring on Snowflake warehouses.

Installed and configured Apache Airflow against S3 buckets and the Snowflake data warehouse, and created DAGs to run the workflows.
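
A minimal sketch of such a DAG, assuming Airflow 2.x with the Amazon and Snowflake provider packages installed; the connection IDs, bucket, stage, and table names are assumptions.

    # Sketch: wait for a file in S3, then load it into Snowflake.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
    from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

    with DAG(
        dag_id="s3_to_snowflake_example",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        wait_for_file = S3KeySensor(
            task_id="wait_for_file",
            bucket_name="example-bucket",            # placeholder
            bucket_key="exports/{{ ds }}/data.csv",  # placeholder key pattern
            aws_conn_id="aws_default",
        )

        load_to_snowflake = SnowflakeOperator(
            task_id="load_to_snowflake",
            snowflake_conn_id="snowflake_default",
            sql="COPY INTO analytics.orders FROM @my_s3_stage/exports/{{ ds }}/",
        )

        wait_for_file >> load_to_snowflake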

Loaded the data into Spark RDD and performed in-memory data computation to generate the output response.

Created ETL jobs in AWS Glue to load vendor data from different sources, with transformations involving data cleaning, data imputation, and data mapping, and stored the results in S3 buckets; the stored data was later queried using AWS Athena.
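
A representative Glue PySpark job of this shape, shown only as a sketch; the catalog database, table, columns, and output path are assumptions.

    # Sketch of a Glue job: read from the Glue catalog, clean, and write to S3.
    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read vendor data registered in the Glue Data Catalog (placeholder names).
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="vendor_db", table_name="raw_orders")
    df = dyf.toDF()

    # Basic cleaning and imputation: drop duplicates, fill missing amounts,
    # and derive a canonical vendor name.
    cleaned = (df.dropDuplicates(["order_id"])
                 .fillna({"amount": 0.0})
                 .withColumn("vendor_name", F.upper(F.col("vendor_code"))))

    # Write the result to S3 as Parquet so it can be queried with Athena.
    cleaned.write.mode("overwrite").parquet("s3://example-bucket/curated/orders/")

    job.commit()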

Designed and developed ETL processes using Informatica 10.4 to load data from a wide range of sources such as Oracle, flat files, Salesforce, and the AWS cloud.

Extracted and uploaded data into AWS S3 buckets using the Informatica AWS plugin.

Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.

Queried both Managed and External tables created by Hive using Impala.

Monitored and controlled Local disk storage and Log files using Amazon CloudWatch.

Played a key role in dynamic partitioning and bucketing of the data stored in the Hive metastore.

Involved in extracting large volumes of data and analyzing complex business logic to derive business-oriented insights, and recommended/proposed new solutions to the business in Excel reports.

Experienced in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and tuning memory.

Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark.
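
A small illustration of decoding and re-encoding JSON with PySpark; the schema and column names are made up for the example.

    # Decode a JSON string column into fields, modify them, and encode back.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("json-example").getOrCreate()

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    raw = spark.createDataFrame([('{"event_id": "e1", "amount": 12.5}',)], ["payload"])

    decoded = raw.withColumn("data", F.from_json("payload", schema)).select("data.*")
    modified = decoded.withColumn("amount", F.col("amount") * 1.1)
    encoded = modified.withColumn("payload", F.to_json(F.struct("event_id", "amount")))

    encoded.show(truncate=False)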

Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).

Developed ETL jobs to automate real-time data retrieval from Salesforce.com and suggested best methods for data replication from Salesforce.com.

Used AWS Data Pipeline for data extraction, transformation, and loading from homogeneous or heterogeneous data sources, and built various graphs for business decision-making using Python's Matplotlib library.

Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
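
A short illustration of this kind of cleaning and scaling with pandas and NumPy; the DataFrame contents are invented for the example.

    # Impute missing values, min-max scale, and add a simple engineered feature.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "age": [25, 32, np.nan, 51],
        "income": [40000, 52000, 61000, np.nan],
    })

    # Impute missing values with the column median.
    df = df.fillna(df.median(numeric_only=True))

    # Min-max scale numeric features to the [0, 1] range.
    scaled = (df - df.min()) / (df.max() - df.min())

    # Simple engineered feature: income per year of age.
    scaled["income_per_age"] = df["income"] / df["age"]
    print(scaled)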

Developed PySpark and Spark SQL code to process the data in Apache Spark on Amazon EMR, performing the necessary transformations based on the STMs developed.

Designed, developed, and managed Power BI, Tableau, QlikView, Qlik Sense Apps including Dashboard, Reports, Storytelling.

Created a new 13-page Power BI report dashboard to the design spec in two weeks, beating a tight timeline. Deployed an automation to production that updates the company holiday schedule based on the company's holiday policy, which needs to be updated yearly.

Used Informatica PowerCenter for extraction, transformation, and loading (ETL) of data into the data warehouse, and loaded data into Snowflake tables from an internal stage using SnowSQL.

Prepared data warehouse using Star/Snowflake schema concepts in Snowflake using SnowSQL.

Prepared Tableau reports and dashboards with calculated fields, parameters, sets, groups, and bins, and published them to the server.

eBay, California

Sr. AWS Data Engineer Dec 2018 to Jan 2021

Involved in the project life cycle including the design, development, and implementation of verifying data received in the data lake.

Designed and developed ETL processes in AWS Glue to migrate accident data from external sources such as S3 and text files into AWS Redshift.

Analyzed the impact of changes on existing ETL/ELT processes to ensure timely completion and availability of data in the data warehouse for reporting. For data processing, several databases such as Snowflake, Netezza, UDB, and MySQL were queried.

Used a combination of Python and Snowflake SnowSQL to create ETL pipelines into and out of the data warehouse, and wrote SQL queries in Snowflake.
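
A hedged sketch of such a pipeline using the snowflake-connector-python package; the credentials, stage, and table names are placeholders and would normally come from a secrets store.

    # Load a staged file into Snowflake and build a simple derived table.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",     # placeholder
        user="etl_user",          # placeholder
        password="***",           # placeholder
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="PUBLIC",
    )

    try:
        cur = conn.cursor()
        cur.execute(
            "COPY INTO raw_orders FROM @etl_stage/orders/ "
            "FILE_FORMAT=(TYPE=CSV SKIP_HEADER=1)"
        )
        cur.execute("""
            CREATE OR REPLACE TABLE daily_order_totals AS
            SELECT order_date, SUM(amount) AS total_amount
            FROM raw_orders
            GROUP BY order_date
        """)
    finally:
        conn.close()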

Involved in ingestion, transformation, manipulation, and computation of data using StreamSets, Kafka, MemSQL, and Spark.

Developed, tested, and tuned performance of complex mappings, transforms, aggregations, joins, enrichment, and validations for target data underpinnings.

Performed a POC comparing the time taken for Change Data Capture (CDC) of Oracle data across Striim, StreamSets, and DBVisit.

Used ranking functions (rank, dense_rank, percent_rank, row_number) and aggregation functions (sum, min, max) in Spark.
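
An example of these functions over a Spark window; the data and column names are illustrative only.

    # Ranking and aggregation over a window partitioned by region.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("window-example").getOrCreate()

    sales = spark.createDataFrame(
        [("east", "a", 100), ("east", "b", 200), ("west", "c", 150)],
        ["region", "product", "revenue"],
    )

    w = Window.partitionBy("region").orderBy(F.desc("revenue"))

    ranked = (sales
              .withColumn("rank", F.rank().over(w))
              .withColumn("dense_rank", F.dense_rank().over(w))
              .withColumn("percent_rank", F.percent_rank().over(w))
              .withColumn("row_number", F.row_number().over(w))
              .withColumn("region_total",
                          F.sum("revenue").over(Window.partitionBy("region"))))

    ranked.show()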

Automated Netezza database management and monitoring using crontab on Linux.

Built a real-time streaming pipeline utilizing Kafka, Spark Streaming, and Redshift.

Used Elasticsearch for analyzing large volumes of data quickly and in near real time.

Developed logical and physical data flow models for Informatica ETL applications.

Added support for AWS S3 and RDS to host static/media files and the database in the Amazon cloud.

Implemented installation and configuration of a multi-node cluster in the cloud using Amazon Web Services (AWS) EC2.

Handled AWS management tools such as CloudWatch and CloudTrail, and stored the log files in AWS S3.

Used versioning in S3 buckets where the highly sensitive information is stored.

Integrated AWS DynamoDB with AWS Lambda to store item values and back up the DynamoDB streams.

Automated regular AWS tasks such as snapshot creation using Python scripts.
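
A sketch of one such script using boto3 to snapshot tagged EBS volumes; the region and tag filter are assumptions.

    # Create a snapshot for every EBS volume tagged Backup=backup.
    import datetime
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

    def snapshot_tagged_volumes(tag_value="backup"):
        volumes = ec2.describe_volumes(
            Filters=[{"Name": "tag:Backup", "Values": [tag_value]}]
        )["Volumes"]

        for vol in volumes:
            description = (
                f"Automated snapshot of {vol['VolumeId']} on {datetime.date.today()}"
            )
            snap = ec2.create_snapshot(VolumeId=vol["VolumeId"],
                                       Description=description)
            print("Created", snap["SnapshotId"])

    if __name__ == "__main__":
        snapshot_tagged_volumes()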

Designed data warehouses on platforms such as AWS Redshift, Azure SQL Data Warehouse, and other high-performance platforms.

Installed and configured Apache Airflow for AWS S3 buckets and created DAGs to run the workflows.

Prepared scripts to automate the ingestion process using PySpark and Scala as needed, from sources such as APIs, AWS S3, Teradata, and Redshift.

Created multiple scripts to automate ETL/ELT processes using PySpark from multiple sources.

Developed PySpark scripts utilizing SQL and RDDs in Spark for data analysis, storing the results back into S3.

Developed PySpark code to load data from the staging layer to the hub layer, implementing the business logic.

Developed code in Spark SQL to implement business logic, with Python as the programming language.

Designed, developed, and delivered the jobs and transformations over the data to enrich it and progressively elevate it for consumption in the Pub layer of the data lake.

Worked on SequenceFiles, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement.

Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.

Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.

Maintained Kubernetes patches and upgrades. Managed multiple Kubernetes clusters in a production environment.

Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation and used Spark engine, Spark SQL for data analysis and provided it to the data scientists for further analysis.

Developed various UDFs in MapReduce and Python for Pig and Hive.

Handled data integrity checks using Hive queries, Hadoop, and Spark, and performed transformations and actions on RDDs and Spark Streaming data with Scala.

Implemented the Machine learning algorithms using Spark with Python.

Profiled structured, unstructured, and semi-structured data across various sources to identify patterns, and implemented data quality metrics using the necessary queries or Python scripts depending on the source.

Designed and implemented Scala programs using Spark DataFrames and RDDs for transformations and actions on input data.

Improved Hive query performance by implementing partitioning and clustering and by using optimized file formats (ORC).

Worked on the creation of custom Docker container images and on tagging and pushing the images.

Implemented and analyzed SQL query performance issues in databases.

Responsible for the design and development of Spark SQL scripts based on functional requirements and specifications.

Experience working with Vagrant boxes to set up a local Kafka and StreamSets pipeline.

Worked on designing and developing a real-time tax computation engine using Oracle, StreamSets, Kafka, Spark Structured Streaming, and MemSQL.

Implemented spark programs/applications in Scala using Spark APIs for Data Extraction, Transformation, and Aggregation.

Used Kafka as a message cluster to pull/push messages into Spark for data ingestion and processing, storing the resultant data in AWS S3 buckets.
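
A minimal sketch of a Kafka-to-S3 flow with Spark Structured Streaming; broker addresses, topic, and paths are placeholders, and the spark-sql-kafka connector package is assumed to be available.

    # Read from Kafka, cast the payload to string, and append to S3 as Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
              .option("subscribe", "orders")                       # placeholder topic
              .option("startingOffsets", "latest")
              .load())

    # Kafka delivers key/value as binary; cast the value for downstream parsing.
    parsed = stream.select(F.col("value").cast("string").alias("raw_event"))

    query = (parsed.writeStream
             .format("parquet")
             .option("path", "s3a://example-bucket/streaming/orders/")
             .option("checkpointLocation", "s3a://example-bucket/checkpoints/orders/")
             .outputMode("append")
             .start())

    query.awaitTermination()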

Worked on a Kafka publisher integrated into the Spark job to capture errors from the Spark application and push them into a Postgres table.

Intensively used Python and JSON scripting to deploy the StreamSets pipelines to the server.

Built pipeline solutions to integrate data from multiple heterogeneous systems using StreamSets data collectors.

Responsible for maintaining quality reference data in Oracle by performing operations such as cleaning, transformation, and ensuring Integrity in a relational environment.

Wrote shell scripts to extract data from Unix servers into Hadoop HDFS for long-term storage.

Environment: Hadoop, Spark, Hive, Teradata, Tableau, Linux, Python, Kafka, AWS S3 buckets, AWS Glue, StreamSets, Postgres, AWS EC2, Oracle PL/SQL, development toolkit (JIRA, Bitbucket/Git, ServiceNow, etc.)

Kaiser Permanente, Atlanta, GA

Azure Data Engineer Aug 2016 - Dec 2018

Developed upgrade and downgrade scripts in SQL that filter corrupted records with missing values along with identifying unique records based on different criteria.

Implemented Azure Storage (storage accounts and Blob storage) and Azure SQL Server, and explored Azure storage account options such as Blob storage.

Experience in building, deploying, troubleshooting data extraction for a huge number of records using Azure Data Factory (ADF).

Design and implement database solutions in Azure SQL Data Warehouse, Azure SQL.

Migrate data from traditional database systems to Azure databases.

Design and implement migration strategies for traditional systems on Azure (lift and shift, Azure Migrate, and other third-party tools).

Experience in DWH/BI project implementation using Azure Data Factory.

Interact with business analysts, users, and SMEs to elaborate requirements.

Design and implement end-to-end data solutions (storage, integration, processing, visualization) in Azure.

Implement data movement from on-premises to the cloud in Azure.

Develop batch processing solutions by using Data Factory and Azure Databricks.

Implement Azure Databricks clusters, notebooks, jobs, and autoscaling.

Design for data auditing, data masking and data encryption for data at rest and in transit.

Propose architectures considering cost/spend in Azure and develop recommendations to right-size data infrastructure.

Set up and maintain Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.

Develop conceptual solutions & create proofs-of-concept to demonstrate viability of solutions.

Implement Copy activities and custom Azure Data Factory pipeline activities.

Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.

Create C# applications to load data from Azure Blob storage to Azure SQL and from web APIs to Azure SQL, and schedule WebJobs for daily loads.

Recreate existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment; experience in DWH/BI project implementation using Azure Data Factory and Databricks. Architect, design, and validate the Azure Infrastructure-as-a-Service (IaaS) environment.

Develop dashboards and visualizations to help business users analyze data and to provide data insights to upper management, with a focus on Microsoft products such as SQL Server Reporting Services (SSRS) and Power BI.

Responsible for creating Requirements Documentation for various projects.

Developed Python scripts to perform file validations in Databricks and automated the process using ADF.
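
A minimal sketch of such a validation step as it might appear in a Databricks notebook cell; the input path and expected columns are assumptions.

    # Validate an incoming file: required columns present and at least one row.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    path = "/mnt/raw/daily_feed/"                      # placeholder mount path
    expected_columns = {"member_id", "claim_date", "amount"}  # placeholder schema

    df = spark.read.option("header", "true").csv(path)

    missing = expected_columns - set(df.columns)
    if missing:
        raise ValueError(f"Validation failed: missing columns {missing}")

    row_count = df.count()
    if row_count == 0:
        raise ValueError("Validation failed: input file is empty")

    print(f"Validation passed: {row_count} rows")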

Developed an automated process in the Azure cloud that ingests data daily from a web service and loads it into Azure SQL DB. Developed streaming pipelines using Azure Event Hubs and Stream Analytics to analyze dealer efficiency and open-table counts from data coming in from IoT-enabled poker and other pit tables.

Analyzed data where it lives by mounting Azure Data Lake and Blob storage to Databricks.

Used Logic App to take decisional actions based on the workflow.

Developed custom alerts using Azure Data Factory, SQLDB and Logic App.

Developed Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL, and Python scripting.

Developed complex SQL queries using stored procedures, common table expressions (CTEs), and temporary tables to support Power BI reports.

Implemented complex business logic through T-SQL stored procedures, functions, views, and advanced query concepts, and worked with the enterprise data modelling team on the creation of logical models.

Development-level experience in Microsoft Azure, providing data movement and scheduling functionality to cloud-based technologies such as Azure Blob Storage and Azure SQL Database.

Independently manage ETL processes, from development to delivery.

Environment: Azure SQL, Azure Storage Explorer, Azure Storage, Azure Blob Storage, Azure Backup, Azure Files, Azure Data Lake Storage, SQL Server Management Studio 2016, Visual Studio 2015, VSTS, Azure Blob, Power BI, PowerShell, C# .NET, SSIS, DataGrid, ETL (Extract, Transform, Load), Business Intelligence (BI).

Brio Technologies Private Limited, Hyderabad, India

Data Analyst Jan 2015 to July 2016

Involved in designing physical and logical data models using the ERwin data modeling tool.

Designed the relational data model for the operational data store and staging areas.

Designed Dimension & Fact tables for data marts.

Extensively used ERwin Data Modeler to design logical/physical data models and relational database designs.

Created Stored Procedures, Database Triggers, Functions and Packages to manipulate the database and to apply the business logic according to the user's specifications.

Created Triggers, Views, Synonyms and Roles to maintain integrity plan and database security.

Creation of database links to connect to the other server and access the required info.

Integrity constraints, database triggers and indexes were planned and created to maintain data integrity and to facilitate better performance.

Used Advanced Queuing for exchanging messages and communicating between different modules.

Performed system analysis and design for enhancements, and tested forms, reports, and user interaction.


