Prasad N
Sr. Data Engineer
Email: ***************@*****.***
Phone: 689-***-****
PROFESSIONAL SUMMARY:
Big Data professional with 10+ years of combined experience in data applications and Big Data implementations such as Spark, Hive, Kafka and HDFS, using programming languages including Python, Scala and Java.
Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
Strong exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and Hadoop infrastructure.
Expertise in writing end-to-end data processing jobs to analyse data using MapReduce, Spark and Hive.
Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks and Azure SQL Data Warehouse, controlling and granting database access, and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
Hands-on experience architecting legacy data migration projects such as Teradata to AWS Redshift and on-premises to AWS Cloud.
Well versed with Big Data on AWS cloud services such as EC2, S3, Glue, Athena, DynamoDB and Redshift.
Performed the migration of Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR and Qubole.
Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames and RDDs, with knowledge of Spark MLlib.
Strong experience in Business and Data Analysis, Data Profiling, Data Migration, Data Integration, Data governance and Metadata Management, Master Data Management and Configuration Management.
Experience in implementing Big Data Engineering, Cloud Data Engineering, Data Warehouse, Data Mart, Data Visualization, Reporting, Data Quality and Data Virtualization solutions.
Experience in analysing data using Python, R, SQL, Microsoft Excel, Hive, PySpark and Spark SQL for data mining, data cleansing and machine learning.
Hands-on experience in Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault and Azure Data Lake.
Extensive knowledge of developing Spark Streaming jobs using RDDs (Resilient Distributed Datasets) with Scala, PySpark and the Spark shell.
Experienced in data manipulation using Python for loading and extraction, as well as Python libraries such as NumPy and Pandas for data analysis and numerical computations.
Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow and Oozie.
Good experience in developing web applications implementing Model View Controller (MVC) architecture using the Django, Flask and Pyramid Python web application frameworks.
Experienced in using Pig scripts to do transformations, event joins, filters and pre-aggregations before storing the data into HDFS.
Developed AWS CloudFormation templates and set up Auto Scaling for EC2 instances.
Extensive experience in creating pipeline jobs and scheduling triggers using Azure Data Factory.
Expertise in writing MapReduce jobs in Python for processing large sets of structured, semi-structured and unstructured data and storing them in HDFS.
Good understanding of data modelling (Dimensional & Relational) concepts like Star-Schema Modelling, Snowflake Schema Modelling, Fact and Dimension tables.
Hands-on experience in SQL and NoSQL databases such as Snowflake, HBase, Cassandra and MongoDB.
Experience working on Azure services such as Data Lake, Data Lake Analytics, SQL Database, Synapse, Databricks, Data Factory, Logic Apps and SQL Data Warehouse, and GCP services such as BigQuery, Dataproc and Pub/Sub.
Hands on experience in setting up workflow using Apache Airflow and Oozie workflow engine for managing and scheduling Hadoop jobs.
Extensive working experience with the Big Data ecosystem: Hadoop (HDFS, MapReduce, YARN), Spark, Kafka, Hive, Impala, HBase, Sqoop, Pig, Airflow, Oozie, Zookeeper, Ambari, Flume and NiFi.
Experience with Google Cloud components, Google container builders, GCP client libraries and the Cloud SDK.
Strong experience in working with UNIX/LINUX environments, writing shell scripts.
Worked with various formats of files like delimited text files, clickstream log files, Apache log files, Avro files, JSON files, XML Files.
Experienced in working in SDLC, Agile and Waterfall Methodologies.
Strong analytical, presentation, communication and problem-solving skills, with the ability to work independently or in a team and to follow the best practices and principles defined for the team.
TECHNICAL SKILLS:
Hadoop/Spark Ecosystem
Hadoop, MapReduce, Pig, Hive/impala, YARN, Flume, Sqoop, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.
Azure Cloud Platform
ADLS, Azure SQL DB, BLOB Storage, Azure Analysis Services, Azure Data Lake (Gen1/Gen2), Azure Cosmos DB, Azure Stream Analytics, Azure DevOps, ARM Templates, Logic Apps, App Services, Databricks, Mapping Data Flows.
Hadoop Distribution
Cloudera and Hortonworks distributions
Programming Languages
Scala, Java, C, C++
Script Languages
Python, Shell Script (Bash, sh)
Databases
Azure SQL Warehouse, Azure SQL DB, Oracle, MySQL, SQL Server, PostgreSQL, HBase, Snowflake, MongoDB, Teradata
Operating Systems
Linux, Windows, Ubuntu, Unix
Web/Application server
Apache Tomcat, NetBeans
IDE
IntelliJ, Eclipse and NetBeans
Version controls and Tools
Git, Maven, SBT, CBT
PROFESSIONAL EXPERIENCE:
AT&T, Plano, TX Mar 2022 – Present
Sr. Data Engineer
Responsibilities:
Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
Used Azure Data Factory extensively for ingesting data from disparate source systems.
Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems.
Created pipelines, data flows and complex data transformations and manipulations using ADF and PySpark with Databricks.
Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to drive streaming analytics in Databricks (see the Spark Streaming sketch following this section).
Worked on migration of data from on-prem SQL server to cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB)
Spearheaded the development and maintenance of robust data pipelines using Apache Airflow on the Astronomer platform.
Designed and implemented scalable data processing solutions on cloud platforms such as AWS/GCP/Azure.
Implemented data quality checks and monitoring to ensure the reliability and accuracy of data flowing through pipelines.
Developed and maintained ETL processes to extract, transform, and load data from various sources into a centralized data warehouse.
Worked on creating correlated and non-correlated sub-queries to resolve complex business queries involving multiple tables from different databases.
Extensively used Azure services such as Azure Data Factory and Logic Apps for ETL, moving data between the database, Blob storage and HDInsight (HDFS, Hive tables).
Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
Implemented End to End solution for hosting the web application on Azure cloud with integration to ADLS.
Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
Wrote Spark applications for data validation, cleansing, transformation and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing results to data scientists for further analysis.
Developed REST APIs in Python with the Flask and Django frameworks and integrated various data sources including Java, JDBC, RDBMS, shell scripting, spreadsheets and text files.
Analysed, designed and built modern, scalable, distributed data solutions using Hadoop and Azure cloud services.
Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
Implemented schema extraction for Parquet and Avro file formats in Hive.
Implemented Spark RDD transformations to map business analysis logic and applied actions on top of those transformations.
Worked on data migration to Hadoop and hive query optimization.
Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production (see the Airflow sketch following this section).
Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS and PowerShell.
Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
Developed data pipelines using Spark, Hive, Pig, Python, Impala and HBase to ingest customer profile data (structured, unstructured and semi-structured) from various sources, identify patterns in the data and implement data quality metrics using queries or Python scripts based on the source.
Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
Created a new data model that embeds NoSQL sub models within a relational data model by applying Hybrid data modelling concepts.
Involved in using Sqoop for importing and exporting data between RDBMS and HDFS.
Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage and BigQuery.
Used MongoDB to store data in JSON format and developed and tested many features of the dashboard using Python, Bootstrap, CSS and JavaScript.
Deployed code to EMR via CI/CD using Jenkins.
Extensively used Code Cloud for code check-ins and checkouts for version control.
Environment: Apache Spark, Python, Azure, Airflow, NoSQL, AWS, Impala, Avro, Parquet, HDFS, UNIX, Scala, RDBMS, Arcadia, Hive, CI/CD, Sqoop, PowerShell.
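
A minimal sketch of the Spark Streaming mini-batch pattern referenced above, assuming a simple socket source; the host, port, record layout and 10-second batch interval are illustrative placeholders rather than the actual production feed.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="mini-batch-analytics")
ssc = StreamingContext(sc, batchDuration=10)          # 10-second mini-batches

lines = ssc.socketTextStream("localhost", 9999)       # hypothetical streaming source
records = lines.map(lambda line: line.split(","))     # RDD transformation per mini-batch
valid = records.filter(lambda rec: len(rec) == 3)     # drop malformed records
counts = valid.map(lambda rec: (rec[0], 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()                                       # emit per-batch aggregates

ssc.start()
ssc.awaitTermination()

On Databricks the same pattern is more commonly expressed through Structured Streaming, but the per-mini-batch transformation logic is the same idea.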
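
A minimal sketch of the Airflow scheduling pattern used for daily production execution; the DAG id, schedule and wrapped shell-script paths are hypothetical placeholders, not the actual pipeline definitions.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_ingest_pipeline",                   # hypothetical DAG name
    start_date=datetime(2022, 3, 1),
    schedule_interval="@daily",                       # ensures daily execution
    default_args=default_args,
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="run_extract_script",
        bash_command="sh /opt/jobs/extract.sh ",      # hypothetical script path
    )
    load = BashOperator(
        task_id="run_load_script",
        bash_command="sh /opt/jobs/load.sh ",
    )
    extract >> load                                   # run load only after extract succeeds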
CVS, Hartford, CT Jan 2019- Feb 2022
Sr. Data Engineer
Responsibilities:
Analysed and cleansed raw data using HiveQL.
Analysed, designed and built modern data solutions using Azure PaaS services to support visualization of data. Understood the current production state of the application and determined the impact of the new implementation on existing business processes.
Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics).
Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Created pipelines in ADF using linked services, datasets and pipelines to extract, transform and load data from different sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse and the write-back tool, and in the reverse direction.
Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation to uncover insights into customer usage patterns (see the PySpark sketch following this section).
Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
Tuned Spark applications for performance by setting the right batch interval, the correct level of parallelism and appropriate memory settings.
Wrote UDFs in Scala and PySpark to meet specific business requirements.
Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that process the Data using the SQL Activity.
Implemented OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse.
Worked with the Requests, Pysftp, NumPy, SciPy, Matplotlib, Beautiful Soup and Pandas Python libraries during the development lifecycle.
Created Azure SQL database, performed monitoring and restoring of Azure SQL database. Performed migration of Microsoft SQL server to Azure SQL database.
Environment: Python, Data Virtualization, Data Warehouse, Airflow, Azure, SQL Server, Sqoop, ADF, ADLS Gen1/Gen2, NoSQL, UNIX, HDFS, Oozie, SSIS.
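
A minimal PySpark sketch of the extraction, aggregation and UDF work described above; the table names, columns and masking rule are illustrative assumptions, not the actual schema.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

usage = spark.table("analytics.customer_usage")       # hypothetical source table

@F.udf(returnType=StringType())
def mask_id(customer_id):
    # Simple business-rule UDF: mask the identifier before exposing it downstream
    return customer_id[:4] + "****" if customer_id else None

daily_usage = (
    usage.withColumn("masked_id", mask_id(F.col("customer_id")))
         .groupBy("masked_id", F.to_date("event_ts").alias("event_date"))
         .agg(F.count("*").alias("events"),
              F.sum("duration_sec").alias("total_duration_sec"))
)

daily_usage.write.mode("overwrite").saveAsTable("analytics.daily_usage_summary")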
Humana, Louisville, KY July 2015 - Dec 2018
Data Engineer
Responsibilities:
Analysed, designed and built modern data solutions using Azure PaaS services to support visualization of data. Understood the current production state of the application and determined the impact of the new implementation on existing business processes.
Designed and set up an Enterprise Data Lake to support various use cases including analytics, processing, storing and reporting of voluminous, rapidly changing data.
Responsible for maintaining quality reference data in source by performing operations such as cleaning, transformation and ensuring Integrity in a relational environment by working closely with the stakeholders and solution architect.
Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster and tested HDFS, Hive, Pig and MapReduce access for new users.
Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift and S3.
Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item, so suggestions can be made automatically, using Kinesis Firehose and an S3 data lake (see the model sketch following this section).
Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes into schema RDDs.
Imported data from different sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output response.
Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce EC2 costs (see the Lambda sketch following this section).
Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS packages).
Coded Teradata BTEQ scripts to load and transform data and to fix defects such as SCD Type 2 date chaining and duplicate cleanup.
Developed reusable framework to be leveraged for future migrations that automates ETL from RDBMS systems to the Data Lake utilizing Spark Data sources and Hive data objects.
Conducted Data blending, Data preparation using Alteryx and SQL for Tableau consumption and publishing data sources to Tableau server.
Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model and deploying it for prediction.
Environment: AWS EMR, S3, RDS, Redshift, Lambda, DynamoDB, Apache Spark, HBase, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python, Tableau, SSRS.
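
A minimal sketch of the order-quantity prediction model referenced above; the feature names, file path and model choice are illustrative assumptions rather than the production pipeline.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical extract of order history pulled from the S3 data lake
orders = pd.read_csv("order_history.csv")
features = orders[["user_prior_orders", "item_prior_qty", "days_since_last_order"]]
target = orders["next_order_qty"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))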
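
A minimal sketch of the Boto3 Lambda AMI clean-up described above, for a single region; the "owned by self" filter and 90-day retention window are assumptions, not the exact production rules.

from datetime import datetime, timedelta, timezone

import boto3

CUTOFF = datetime.now(timezone.utc) - timedelta(days=90)   # assumed retention window
ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    images = ec2.describe_images(Owners=["self"])["Images"]
    deregistered = 0
    for image in images:
        created = datetime.strptime(image["CreationDate"], "%Y-%m-%dT%H:%M:%S.%fZ")
        created = created.replace(tzinfo=timezone.utc)
        if created < CUTOFF:
            # Deregister the stale AMI; associated snapshots could be cleaned up similarly
            ec2.deregister_image(ImageId=image["ImageId"])
            deregistered += 1
    return {"checked": len(images), "deregistered": deregistered}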
IP Soft, Bangalore, India May 2013 – Aug 2014
Data Analyst
Responsibilities:
Captured functional requirements from business clients together with IT requirement analysts by asking suitable questions, and analysed the requirements in collaboration with the team and system architects, following standard templates.
Extensive Tableau Experience in Enterprise Environment and Tableau Administration.
Successfully upgraded Tableau platforms in clustered environment and performed content upgrades.
Experienced in popular Python frameworks such as Django, Flask and Pyramid.
Knowledge of object-relational mapping (ORM)
Performed quality review and validation of SAS programs generated by other SAS programmers.
Followed good programming practices and adequately documented programs.
Produced quality customized reports using PROC REPORT, PROC TABULATE and PROC SUMMARY, as well as descriptive statistics using PROC MEANS, PROC FREQ and PROC UNIVARIATE.
Provided production support to client benefit and product service users by writing custom SQL and SAS programs.
Experience with T-SQL, DDL and DML scripts; established relationships between tables using primary keys and foreign keys.
Extensive knowledge of creating joins and sub-queries for complex queries involving multiple tables.
Hands-on experience using DDL and DML for writing triggers, stored procedures and data manipulation.
Worked with star schema and snowflake schema dimensions and SSRS to support large reporting needs.
Experience using Profiler and Windows Performance Monitor to resolve deadlocks, long-running queries and slow-running servers.