Naveen Raju
Data Engineer
*************@*****.***
Professional Summary:
Dynamic and motivated IT professional with around 8 years of experience as a Big Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, cloud data engineering, data warehouse/data mart, data visualization, reporting, and data quality solutions.
In-depth knowledge of Hadoop architecture and its components such as YARN, HDFS, Name Node, Data Node, Job Tracker, Application Master, Resource Manager, Task Tracker, and the MapReduce programming paradigm.
Extensive experience in Hadoop-based development of enterprise-level solutions utilizing Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.
Profound experience in performing data ingestion and data processing (transformations, enrichment, and aggregations). Strong knowledge of the architecture of distributed systems and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
Experienced in improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs; worked extensively with PySpark and Scala.
Handled ingestion of data from different data sources into HDFS using Sqoop and Flume, performed transformations using Hive and MapReduce, and loaded the processed data back into HDFS. Managed Sqoop jobs with incremental loads to populate Hive external tables.
Good working experience in using Apache Hadoop ecosystem components like MapReduce, HDFS, Hive, Sqoop, Pig, Oozie, Flume, HBase, and Zookeeper.
Extensive experience working on Spark, performing ETL using Spark Core and Spark SQL and real-time data processing using Spark Streaming.
Extensively worked with Kafka as middleware for real-time data pipelines.
Wrote UDFs in Java and integrated them with Hive and Pig.
Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie. Experienced with the most common Airflow operators, including PythonOperator, BashOperator, GoogleCloudStorageDownloadOperator, GoogleCloudStorageObjectSensor, and GoogleCloudStorageToS3Operator.
Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Created Java apps to handle data in MongoDB and HBase. Used Apache Phoenix to create a SQL layer on HBase.
Experience in designing and creating RDBMS Tables, Views, User Created Data Types, Indexes, Stored Procedures, Cursors, Triggers and Transactions.
Expert in designing ETL data flows by creating mappings/workflows to extract data from SQL Server, and in data migration and transformation from Oracle, Access, and Excel sheets using SQL Server SSIS.
Expert in designing Parallel jobs using various stages like Join, Merge, Lookup, remove duplicates, Filter, Dataset, Lookup file set, Complex flat file, Modify, Aggregator, XML.
Hands-on experience with Amazon EC2, Amazon S3, AWS Glue, Route 53, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, and other services of the AWS family.
Experienced in dimensional modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).
Experienced with JSON-based RESTful web services and XML-based SOAP web services; worked on various applications using Python IDEs such as Sublime Text and PyCharm.
Building and productionizing predictive models on large datasets by utilizing advanced statistical modeling, machine learning, or other data mining techniques.
Developed intricate algorithms based on deep-dive statistical analysis and predictive data modeling that were used to deepen relationships, strengthen longevity and personalize interactions with customers.
Technical Skills:
Hadoop/Big Data
HDFS, MapReduce, YARN, HBase, Pig, Hive, Sqoop, Flume, Oozie, Zookeeper, Splunk, Hortonworks, Cloudera
Programming languages
SQL, Python, PySpark, R, Scala, Spark, Linux shell scripts
Databases
RDBMS (MySQL, DB2, MS SQL Server, Teradata, PostgreSQL), NoSQL (MongoDB, HBase, Cassandra), Snowflake virtual warehouse.
OLAP & ETL Tools
Tableau, Tableau Server, Power BI, Spyder, Spark, SSIS, Informatica Power Center, Pentaho, Talend
Data Modelling Tools
Microsoft Visio, ER Studio, Erwin
Python and R libraries
R: tidyr, tidyverse, dplyr, reshape, lubridate; Python: Beautiful Soup, NumPy, SciPy, matplotlib, python-twitter, pandas, scikit-learn, Keras, boto3.
Machine Learning
Regression, Clustering, MLlib, Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, KNN, K-Means, Random Forest, Gradient Boost & AdaBoost, Neural Networks, and Time Series Analysis.
Data analysis Tools
Machine Learning, Deep Learning, Data Warehousing, Data Mining, Data Analysis, Big Data, Data Visualization, Data Munging, Data Modelling
Cloud Computing Tools
Snowflake, SnowSQL, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)
Amazon Web Services
EMR, EC2, S3, RDS, Cloud Search, Redshift, Glue, Data Pipeline, Lambda.
Reporting Tools
JIRA, MS Excel, Tableau, Power BI, QlikView, Qlik Sense, D3, SSRS, SSIS
IDE’s
PyCharm, Anaconda, Jupyter Notebook
Development Methodologies
Agile, Scrum, Waterfall
Professional Experience:
Vanguard, Malvern, PA                                                                                  Jul 2021 – Present
Role: Big Data/Hadoop Engineer
Responsibilities:
Prepared the ETL design document, covering the database structure, change data capture, error handling, and restart and refresh strategies.
Worked on developing ETL pipelines on S3 Parquet files in the data lake using AWS Glue.
Programmed ETL functions between Oracle and Amazon Redshift.
Performed data analytics on the data lake using PySpark on the Databricks platform.
Responsible for assessing and improving the quality of customer data.
Worked on AWS cloud services like EC2, S3, EMR, RDS, Athena, and Glue.
Analyzed data quality issues through exploratory data analysis (EDA) using SQL, Python, and Pandas.
Worked on creating automation scripts leveraging various Python libraries to perform accuracy checks from various sources to target databases.
Worked on building Python scripts to generate heatmaps to perform issue and root cause analysis for data quality report failures.
Performed data analysis and predictive data modeling.
Interacted with stakeholders to deliver regulatory reports and recommend remediation strategies, ensuring the quality of high-priority usage data elements by building analytical dashboards using Excel and Python plotting libraries.
Designed and implemented REST APIs to access the Snowflake DB platform.
Worked with and maintained data warehouses built on snowflake and star schemas.
Involved in the code migration of a quality monitoring tool from AWS EC2 to AWS Lambda and built logical datasets.
Worked with different data feeds such as JSON, CSV, XML, and DAT and implemented the Data Lake concept.
Environment: Python, Spark SQL, PySpark, Pandas, NumPy, Excel, Power BI, AWS EC2, AWS S3, AWS Lambda, Athena, Glue, Linux Shell Scripting, Snowflake DB, Git, DynamoDB, Redshift.
City Bank, NYC, NY Apr 2020 – Jun 2021
Role: Big Data/Hadoop Engineer
Responsibilities:
Extracted and analyzed over 800k records using SQL queries from Azure Snowflake and Azure SQL DB.
Designed exploratory data analysis in Python using Seaborn and Matplotlib to evaluate data quality.
Developed Spark applications using PySpark and Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats.
Worked on migration of data from on-premises SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
Constructed connections with Azure Blob Storage, built end-to-end data ingestion to process raw files in Databricks and Azure Data Factory.
Wrote Pig scripts in Hadoop to generate MapReduce jobs and extracted data from the Hadoop Distributed File System (HDFS).
Scheduled data pipelines using Apache Oozie, involved in resolving the issues and troubleshooting related to the performance of Hadoop clusters.
Implemented data cleaning using Pandas, Numpy in Jupyter Notebook by dropping or replacing missing values, advancing data preparation.
Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (Snowflake, MS SQL) into HDFS.
Implemented advanced procedures like text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
Implemented Spark using Python and Spark SQL for faster testing and processing of data; responsible for managing data from different sources.
Implemented data ingestion and handled clusters for real-time processing using Kafka.
Extracted tables and exported data from Teradata through Sqoop and placed it in Cassandra.
Designed data warehouse schemas and concepts (star schema and snowflake schema), fact tables, cubes, dimensions, and measures using SSAS.
Designed an NLP grammar checker and search engine query auto-corrector using an N-gram model; improved data quality and solidified compliance write-ups.
Environment: Azure Databricks, Data Lake, MySQL, Azure SQL, Snowflake, Teradata, Git, Blob Storage, Data Factory, Python, Pandas, NumPy, Sqoop, Hadoop, Hive, HQL, Spark, PySpark, Airflow, HBase
HighMark, Pittsburgh, PA                                                                               Feb 2019 – Mar 2020
Role: Big Data/Hadoop Engineer
Responsibilities:
Worked on converting SQL processes into the Hadoop framework using PySpark, Python, Hive, Sqoop, and Shell.
Analyzed existing campaign SQL scripts and implemented automated solutions using Shell scripts and PySpark jobs.
Deployed data from various sources like Oracle and Teradata into HDFS and built reports using Tableau and Power BI.
Worked on improving the performance and optimization of the existing PySpark and Hive code.
Optimized Hive tables using optimization techniques like partitioning and bucketing to provide better performance with HiveQL queries.
Developed data pipelines using Pig, Sqoop, Hive, Python, and PySpark, and stored data in HDFS.
Developed a TV package recommendation engine using PySpark and HQL.
Developed data cleaning and data transformation jobs in Spark using Python, PySpark, and Spark SQL.
Worked on implementing Spark Streaming pipelines for ingesting PPV data using stream analytics in Java.
Developed a predictive model to detect STB failures using PySpark and Spark MLlib.
Performed video customer segmentation using K-Means clustering.
Worked in operations to monitor and debug job failures and data discrepancy issues.
Worked on the operations side, monitoring and fixing production failures.
Environment: Amazon EC2, EMR, Amazon S3, Amazon RDS, Redshift, HDFS, Spark, Spark Streaming, Hive, Sqoop, HBase, Git, Scala, Oozie, MySQL, Tableau, PyCharm, Shell Scripts, Windows, Linux, Elasticsearch, Kafka, Teradata, SQL Server.
Fidelity, Boston, Massachusetts                                                                        Mar 2018 – Jan 2019
Role: Hadoop/Spark Developer
Responsibilities:
Responsible for installation and configuration of Hive, Pig, HBase, and Sqoop on the Hadoop cluster and created Hive tables to store the processed results in a tabular format.
Configured Spark Streaming to receive real-time data from Apache Kafka and stored the streamed data to HDFS using Scala.
Developed Sqoop scripts to handle the interaction between Hive and the Vertica database.
Processed data into HDFS by developing solutions and analyzed the data using MapReduce, Pig, and Hive to produce summary results from Hadoop for downstream systems.
Built servers using AWS: importing volumes, launching EC2 instances, creating security groups, auto scaling, load balancers, Route 53, SES, and SNS in the defined virtual private cloud.
Designed and built multi-terabyte, full end-to-end data warehouse infrastructure from the ground up on Amazon Redshift for large-scale data, handling millions of records every day.
Wrote MapReduce code to process and parse data from various sources and stored the parsed data in HBase and Hive using HBase-Hive integration.
Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for dataset processing and storage. Built and configured a virtual data center in the Amazon Web Services cloud to support Enterprise Data Warehouse hosting, including Virtual Private Cloud, Security Groups, and Elastic Load Balancer.
Streamed AWS log groups into a Lambda function to create ServiceNow incidents.
Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries and Pig scripts.
Created Managed tables and External tables in Hive and loaded data from HDFS.
Developed Spark code using Scala and Spark SQL for faster processing and testing and performed complex HiveQL queries on Hive tables.
Scheduled several time-based Oozie workflows by developing Python scripts.
Developed Pig Latin scripts using operators such as LOAD, STORE, DUMP, FILTER, DISTINCT, FOREACH, GENERATE, GROUP, COGROUP, ORDER, LIMIT, UNION, and SPLIT to extract data from data files and load it into HDFS.
Exported data to RDBMS servers using Sqoop and processed that data for ETL operations.
Worked on S3 buckets on AWS to store CloudFormation templates and worked on AWS to create EC2 instances.
Designed ETL data pipeline flows to ingest data from RDBMS sources into Hadoop using shell scripts, Sqoop packages, and MySQL.
Handled end-to-end architecture and implementation of client-server systems using Scala, Akka, Java, and related technologies on Linux.
Optimized the Hive tables using optimization techniques like partitions and bucketing to provide better performance with HiveQL queries.
Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
Implemented Hadoop with the AWS EC2 system using a few instances for gathering and analyzing data log files.
Involved in Spark and Spark Streaming, creating RDDs and applying operations such as transformations and actions.
Created partitioned tables and loaded data using both static partition and dynamic partition methods.
Developed custom Spark programs in Scala to analyze and transform unstructured data.
Managed importing of data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
Extracted the data from Oracle into HDFS using Sqoop.
Used Kafka for publish-subscribe messaging as a distributed commit log; experienced with its speed, scalability, and durability.
Followed the Test-Driven Development (TDD) process, with experience in Agile and Scrum programming methodologies.
Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala.
Scheduled MapReduce jobs in the production environment using the Oozie scheduler.
Involved in cluster maintenance, cluster monitoring, troubleshooting, managing, and reviewing data backups and log files.
Designed and implemented MapReduce jobs to support distributed processing using Java, Hive, and Apache Pig.
Analyzed Hadoop clusters and different big data analytic tools including Pig, Hive, HBase, and Sqoop.
Improved performance by tuning Hive and MapReduce jobs.
Researched, evaluated, and utilized modern technologies, tools, and frameworks around the Hadoop ecosystem.
Environment: HDFS, MapReduce, Hive, Sqoop, Pig, Flume, Vertica, Oozie Scheduler, Scala, Java, Shell Script, Teradata, Oracle, HBase, MongoDB, Cassandra, Cloudera, AWS, JavaScript, JSP, Kafka, Spark, Spark Streaming, ETL, Python, RDBMS.
Unisys Global Services Pvt Ltd, India Jan 2016 – Jul 2017
Role: Hadoop Engineer
Responsibilities:
Developed highly optimized Spark applications to perform data cleansing, validation, transformation, and summarization activities.
Built data pipelines consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.
Created Spark jobs and Hive Jobs to summarize and transform data.
Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
Converted Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
Used different tools for data integration with different databases and Hadoop.
Built real-time data pipelines by developing Kafka producers and Spark Streaming applications for consumption.
Ingested syslog messages, parsed them, and streamed the data to Kafka.
Handled importing data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and loaded the processed data back into HDFS.
Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
Analyzed the data by performing Hive queries (Hive QL) to study customer behavior.
Helped DevOps engineers deploy code and debug issues.
Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
Developed Hive scripts in Hive QL to de-normalize and aggregate the data.
Scheduled and executed workflows in Oozie to run various jobs.
Implemented business logic in Hive and wrote UDFs to process the data for analysis.
Addressed issues occurring due to the huge volume of data and transitions.
Documented operational problems by following standards and procedures, using JIRA.
Environment: Spark, Scala, Hive, Apache NiFi, Kafka, HDFS, Oracle, HBase, MapReduce, Oozie, Sqoop
Smart Software Tech. Dev. Pvt. Ltd., India Apr 2013 – Dec 2015
Role: Hadoop Engineer
Responsibilities:
Interacted with the business and design teams on requirements and prepared the low-level and high-level design documents.
Provided in-depth technical and business knowledge to ensure efficient design, programming, implementation, and ongoing support for the application.
Involved in identifying possible ways to improve the efficiency of the system.
Logical implementation and interaction with HBase.
Efficiently put and fetched data to/from HBase by writing MapReduce jobs.
Developed MapReduce jobs to automate the transfer of data to and from HBase.
Assisted with the addition of Hadoop processing to the IT infrastructure.
Implemented and executed MapReduce jobs to process log data from the ad servers.
Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
Worked on MongoDB, and Cassandra.
Prepared multi-cluster test harness to exercise the system for better performance.
Environment: Hadoop, HDFS, MapReduce, HBase, Hive, Cassandra, Hadoop distributions from Hortonworks and Cloudera, SQL*Plus, and Oracle 10g.
Education:
Master of Computer Science from New York Institute of Technology, New York, NY
Bachelor's in Computer Science from JNTU H, India