
Big Data Engineer

Location:
Jersey City, NJ
Salary:
65
Posted:
March 08, 2024

Resume:

Naveen Raju

Data Engineer

469-***-****

ad37gf@r.postjobfree.com

Professional Summary:

Dynamic and motivated IT professional with 9+ years of experience as a Big Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, cloud data engineering, data warehouse/data mart, data visualization, reporting, and data quality solutions.

In-depth knowledge of Hadoop architecture and its components such as YARN, HDFS, NameNode, DataNode, Job Tracker, Application Master, Resource Manager, Task Tracker, and the MapReduce programming paradigm.

Extensive experience in Hadoop-based development of enterprise-level solutions utilizing components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.

Profound experience in performing data ingestion and data processing (transformations, enrichment, and aggregations). Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.

Experienced in improving the performance and optimization of existing algorithms in Hadoop using Spark, working with Spark Context, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and Pair RDDs; worked extensively with PySpark and Scala.
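
For illustration, a minimal PySpark sketch of this kind of DataFrame API / Spark SQL work; the path, table, and column names are hypothetical, not from an actual engagement:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Build a SparkSession (the entry point that wraps the SparkContext)
spark = SparkSession.builder.appName("orders-aggregation").getOrCreate()

# Hypothetical input: order data stored as Parquet on HDFS
orders = spark.read.parquet("hdfs:///data/orders")

# DataFrame API: total completed-order amount per customer
totals = (orders
          .filter(F.col("status") == "COMPLETE")
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_amount")))

# Equivalent Spark SQL over a temporary view
orders.createOrReplaceTempView("orders")
totals_sql = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders
    WHERE status = 'COMPLETE'
    GROUP BY customer_id
""")

totals.show(5)
```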

Handled ingestion of data from different data sources into HDFS using Sqoop and Flume, and performed transformations using Hive and MapReduce before loading the results back into HDFS. Managed Sqoop jobs with incremental loads to populate Hive external tables.

Good working experience using Apache Hadoop ecosystem components such as MapReduce, HDFS, Hive, Sqoop, Pig, Oozie, Flume, HBase, and Zookeeper.

Extensive experience working on Spark, performing ETL using Spark Core and Spark SQL and real-time data processing using Spark Streaming.

Extensively worked with Kafka as middleware for real-time data pipelines.

Wrote UDFs in Java and integrated them with Hive and Pig.

Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie. Experienced with the most common Airflow operators: PythonOperator, BashOperator, GoogleCloudStorageDownloadOperator, GoogleCloudStorageObjectSensor, and GoogleCloudStorageToS3Operator.
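
A minimal Airflow DAG sketch illustrating the PythonOperator/BashOperator usage mentioned above; the Airflow 2.x import paths, DAG id, schedule, and commands are assumptions for illustration:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

def extract():
    # Placeholder callable standing in for a real extraction step
    print("extracting...")

with DAG(
    dag_id="daily_ingest",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    pull = PythonOperator(task_id="extract", python_callable=extract)
    load = BashOperator(task_id="load", bash_command="echo 'load step'")
    pull >> load                      # run extract before load
```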

Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Created Java apps to handle data in MongoDB and HBase. Used Phoenix to create a SQL layer on HBase.

Experience in designing and creating RDBMS tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.

Expert in designing ETL data flows, creating mappings/workflows to extract data from SQL Server, and performing data migration and transformation from Oracle, Access, and Excel sheets using SQL Server SSIS.

Expert in designing Parallel jobs using various stages like Join, Merge, Lookup, remove duplicates, Filter, Dataset, Lookup file set, Complex flat file, Modify, Aggregator, XML.

Hands-on experience with Amazon EC2, Amazon S3, AWS Glue, Route 53, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, and other services of the AWS family.

Experienced in fact and dimension modeling (star schema, snowflake schema), transactional modeling, and SCDs (slowly changing dimensions).

Experienced with JSON-based RESTful web services and XML-based SOAP web services; also worked on various applications using Python IDEs such as Sublime Text and PyCharm.

Building and productionizing predictive models on large datasets by utilizing advanced statistical modeling, machine learning, or other data mining techniques.

Developed intricate algorithms based on deep-dive statistical analysis and predictive data modeling that were used to deepen relationships, strengthen longevity and personalize interactions with customers.

Technical Skills:

Hadoop/Big Data

HDFS, MapReduce, Yarn, HBase, Pig, Hive, Sqoop, Flume, Oozie, Zookeeper, Splunk, Hortonworks, Cloudera

Programming languages

SQL, Python, PySpark, R, Scala, Spark, Linux shell scripts

Databases

RDBMS (MySQL, DB2, MS SQL Server, Teradata, PostgreSQL), NoSQL (MongoDB, HBase, Cassandra), Snowflake virtual warehouse.

OLAP & ETL Tools

Tableau, Tableau Server, Power BI, Spyder, Spark, SSIS, Informatica PowerCenter, Pentaho, Talend

Data Modelling Tools

Microsoft Visio, ER Studio, Erwin

Python and R libraries

R: tidyr, tidyverse, dplyr, reshape, lubridate; Python: Beautiful Soup, NumPy, SciPy, matplotlib, python-twitter, pandas, scikit-learn, Keras, boto3.

Machine Learning

Regression, Clustering, MLlib, Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, KNN, K-Means, Random Forest, Gradient Boost & AdaBoost, Neural Networks, and Time Series Analysis.

Data analysis Tools

Machine Learning, Deep Learning, Data Warehousing, Data Mining, Data Analysis, Big Data, Data Visualization, Data Munging, Data Modelling

Cloud Computing Tools

Snowflake, SnowSQL, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)

Amazon Web Services

EMR, EC2, S3, RDS, CloudSearch, Redshift, Glue, Data Pipeline, Lambda.

Reporting Tools

JIRA, MS Excel, Tableau, Power BI, QlikView, Qlik Sense, D3, SSRS, SSIS

IDE’s

PyCharm, Anaconda, Jupyter Notebook, IntelliJ

Development Methodologies

Agile, Scrum, Waterfall

Professional Experience:

CIGNA, Morristown, NJ Sep 2022 - Present

Role: Big Data/Hadoop Engineer

Responsibilities:

Designing and implementing product features in collaboration with business and IT stakeholders.

Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; executed machine learning use cases with Spark ML and MLlib.
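
A minimal Spark ML sketch of this kind of machine learning use case; the feature names and toy data are hypothetical stand-ins, not project data:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("claims-model").getOrCreate()

# Hypothetical training data with two numeric features and a binary label
df = spark.createDataFrame(
    [(0.5, 1.2, 0), (2.3, 0.1, 1), (1.7, 2.2, 1), (0.2, 0.4, 0)],
    ["feature_a", "feature_b", "label"],
)

# Assemble raw columns into the single feature vector Spark ML expects
assembler = VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("label", "prediction").show()
```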

Involved in the complete Big Data flow of the application, starting from data ingestion from upstream into HDFS, then processing and analyzing the data in HDFS.

Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back.

Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the Cosmos activity.

Created HDInsight clusters and Storage Accounts with an end-to-end environment for running the jobs.

Followed Agile & Scrum principles in developing the project.

Worked with Azure Data Factory transformation activities such as Stored Procedure and Databricks.

Developed a Spark API to import data into HDFS from DB2 and created Hive tables.
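
A hedged sketch of pulling a DB2 table into HDFS/Hive with the Spark JDBC reader; the host, database, table, and credentials are hypothetical, and it assumes the DB2 JDBC driver is on the Spark classpath and Hive support is enabled:

```python
# Assumes an existing SparkSession named `spark` built with Hive support
df = (spark.read.format("jdbc")
      .option("url", "jdbc:db2://db2-host:50000/SAMPLE")   # hypothetical host/database
      .option("dbtable", "CLAIMS.MEMBER_EVENTS")           # hypothetical source table
      .option("user", "db2_user")
      .option("password", "db2_password")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .load())

# Persist to HDFS as Parquet and expose the result as a Hive table
df.write.mode("overwrite").format("parquet").saveAsTable("member_events_raw")
```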

Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded the data into HDFS, and extracted data from SQL Server into HDFS using Sqoop.

Knowledgeable about partitioning Kafka messages and setting up replication factors in the Kafka cluster.

Identified areas of improvement in the existing business by unearthing insights from vast amounts of data using machine learning techniques.

Interpreted problems and provided solutions to business problems using data analysis, data mining, optimization tools, machine learning techniques, and statistics.

Designed and developed NLP models for sentiment analysis.

Led discussions with users to gather business process and data requirements to develop a variety of conceptual, logical, and physical data models. Expert in Business Intelligence and data visualization tools: Tableau, MicroStrategy.

Worked on machine learning over large-scale data using Spark and MapReduce.

Led the implementation of new statistical algorithms and operators on Hadoop and SQL platforms, utilizing optimization techniques, linear regression, K-means clustering, Naive Bayes, and other approaches.

Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.

Extracted, transformed, and loaded data sources to generate CSV data files using Python programming and SQL queries.

Worked on data pre-processing and cleaning to perform feature engineering, and applied data imputation techniques for missing values in the dataset using Python.
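
A small pandas sketch of the missing-value imputation described above; the toy columns and values are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing numeric and categorical values
df = pd.DataFrame({
    "age": [34, np.nan, 52, 41],
    "plan": ["HMO", "PPO", None, "PPO"],
    "monthly_cost": [120.0, 95.5, np.nan, 110.0],
})

# Numeric columns: impute with the median; categorical columns: impute with the mode
df["age"] = df["age"].fillna(df["age"].median())
df["monthly_cost"] = df["monthly_cost"].fillna(df["monthly_cost"].median())
df["plan"] = df["plan"].fillna(df["plan"].mode()[0])

print(df.isna().sum())  # verify no missing values remain
```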

Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data.

Vanguard, Malvern, PA Jul 2021 - Aug 2022

Role: Big Data/Hadoop Engineer

Responsibilities:

Prepared ETL design document which consists of the database structure, change data capture, Error handling, restart and refresh strategies

Worked on developing ETL pipelines over S3 Parquet files in the data lake using AWS Glue
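
A hedged sketch of a Glue PySpark job reading Parquet from the S3 data lake, cleansing it, and writing curated Parquet back; the bucket paths, column name, and filter are hypothetical assumptions:

```python
import sys
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw Parquet files from the data lake (hypothetical bucket/prefix)
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-data-lake/raw/orders/"]},
    format="parquet",
)

# Basic cleansing with the Spark DataFrame API, then back to a DynamicFrame
cleaned = raw.toDF().dropDuplicates().filter("amount IS NOT NULL")
out = DynamicFrame.fromDF(cleaned, glue_context, "cleaned_orders")

# Write curated Parquet back to S3
glue_context.write_dynamic_frame.from_options(
    frame=out,
    connection_type="s3",
    connection_options={"path": "s3://example-data-lake/curated/orders/"},
    format="parquet",
)
job.commit()
```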

Programmed ETL functions between Oracle and Amazon Redshift

Performed data analytics on the data lake using PySpark on the Databricks platform

Responsible for assessing and improving the quality of customer data

Worked on AWS cloud services like EC2, S3, EMR, RDS, Athena and Glue

Analyzed data quality issues through exploratory data analysis (EDA) using SQL, Python and Pandas

Worked on creating automation scripts leveraging various python libraries to perform accuracy checks from various sources to target databases.

Worked on building Python scripts to generate heatmaps to perform issue and root cause analysis for data quality report failures.

Performed data analysis and predictive data modeling.

Interacted with stakeholders to deliver regulatory reports and recommend remediation strategies, ensuring pristine quality of high-priority usage data elements by building analytical dashboards with Excel and Python plotting libraries.

Designed and implemented a REST API to access the Snowflake DB platform.

Worked with and maintained data warehouses in Snowflake using star schemas

Involved in the code migration of quality monitoring tool from AWS EC2 to AWS Lambda and built logical datasets.

Worked with different data feeds such as JSON, CSV, XML, and DAT, and implemented the Data Lake concept

Environment: Python, Spark SQL, PySpark, Pandas, NumPy, Excel, Power BI, AWS EC2, AWS S3, AWS Lambda, Athena, Glue, Linux Shell Scripting, Snowflake DB, Git, DynamoDB, Redshift.

Citi Bank, New York, NY Apr 2020 – Jun 2021

Role: Big Data/Hadoop Engineer

Responsibilities:

Extracted and analyzed 800k+ records using SQL queries from Azure Snowflake and Azure SQL DB.

Designed the Exploratory Data Analysis in Python using Seaborn and Matplotlib to evaluate data qualities.

Developed Spark applications using PySpark and Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats.
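
A minimal Databricks-style PySpark sketch of combining multiple file formats; it assumes the notebook-provided `spark` session, and the mount paths and column names are hypothetical:

```python
from pyspark.sql import functions as F

# In a Databricks notebook a SparkSession is available as `spark`; paths are hypothetical
csv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/mnt/raw/transactions_csv/"))

json_df = spark.read.json("/mnt/raw/transactions_json/")
parquet_df = spark.read.parquet("/mnt/raw/transactions_parquet/")

# Align the three sources on a common set of columns, then aggregate
common_cols = ["account_id", "txn_date", "amount"]
combined = (csv_df.select(common_cols)
            .unionByName(json_df.select(common_cols))
            .unionByName(parquet_df.select(common_cols)))

daily_totals = combined.groupBy("account_id", "txn_date").agg(F.sum("amount").alias("daily_total"))
daily_totals.write.mode("overwrite").parquet("/mnt/curated/daily_totals/")
```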

Worked on migration of data from On-prem SQL server to Cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB)

Constructed connections with Azure Blob Storage, built end-to-end data ingestion to process raw files in Databricks and Azure Data Factory.

Wrote Pig scripts in Hadoop to generate MapReduce jobs and extracted data from the Hadoop Distributed File System (HDFS).

Scheduled data pipelines using Apache Oozie, involved in resolving the issues and troubleshooting related to the performance of Hadoop clusters.

Implemented data cleaning using Pandas and NumPy in Jupyter Notebook by dropping or replacing missing values, advancing data preparation.

Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (Snowflake, MS SQL) into HDFS.

Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.

Implemented Spark using Python and Spark SQL for faster testing and processing of data; responsible for managing data from different sources.

Implemented data ingestion and handling clusters in real time processing using Kafka.

Extracted tables and exported data from Teradata through Sqoop and placed it in Cassandra

Designed data warehouse schemas and concepts (star schema and snowflake schema), fact tables, cubes, dimensions, and measures using SSAS.

Designed an NLP grammar checker and search engine query auto-corrector using an N-gram model; improved data quality and solidified compliance write-ups.
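
A toy Python sketch of bigram scoring in the spirit of the N-gram auto-corrector above; the corpus and candidate words are made up, and the production model was more elaborate:

```python
from collections import Counter

# Toy corpus standing in for real query logs (hypothetical)
corpus = "open new savings account online banking account balance check".split()

# Build unigram and bigram counts for a simple bigram language model
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing over the toy vocabulary."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

def best_next(prev, candidates):
    """Pick the candidate most likely to follow `prev` under the bigram model."""
    return max(candidates, key=lambda w: bigram_prob(prev, w))

# e.g. auto-correct the token after "savings" to the likeliest candidate
print(best_next("savings", ["account", "accont", "online"]))  # -> "account"
```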

Environment: Azure Databricks, Data Lake, MySQL, Azure SQL, Snowflake, Teradata, Git, Blob Storage, Data Factory, Python, Pandas, NumPy, Sqoop, Hadoop, Hive, HQL, Spark, PySpark, Airflow, HBase

Fidelity, Boston, Massachusetts Nov 2018 – Mar 2020

Role: Hadoop Developer

Responsibilities:

Responsible for installation and configuration of Hive, Pig, HBase, and Sqoop on the Hadoop cluster

Created Hive tables to store the processed results in a tabular format. Configured Spark Streaming to receive real-time data from Apache Kafka and store the streaming data in HDFS using Scala.

Developed Sqoop scripts to enable interaction between Hive and the Vertica database.

Processed data into HDFS by developing solutions, and analyzed the data using MapReduce, Pig, and Hive to produce summary results from Hadoop for downstream systems.

Built servers using AWS: importing volumes, launching EC2 instances, creating security groups, auto-scaling, load balancers, Route 53, SES, and SNS within the defined virtual private cloud.

Designed and built a multi-terabyte, full end-to-end data warehouse infrastructure from the ground up on Amazon Redshift for large-scale data, handling millions of records every day.

Wrote MapReduce code to process and parse data from various sources and stored the parsed data in HBase and Hive using HBase-Hive integration.

Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for data sets processing and storage.

Built and configured a virtual data center in the Amazon Web Services cloud to support Enterprise Data Warehouse hosting including Virtual Private Cloud, Security Groups, and Elastic Load Balancer.

Streamed AWS log groups into a Lambda function to create ServiceNow incidents.
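
A hedged sketch of a Lambda handler that receives a CloudWatch Logs subscription payload and opens a ServiceNow incident via the Table API; the instance URL, credentials (pre-encoded Basic auth via environment variables), and field mapping are assumptions for illustration:

```python
import base64
import gzip
import json
import os
import urllib.request

def handler(event, context):
    # CloudWatch Logs delivers a base64-encoded, gzip-compressed payload
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
    for log_event in payload.get("logEvents", []):
        body = json.dumps({
            "short_description": "Log alert from " + payload.get("logGroup", "unknown"),
            "description": log_event.get("message", ""),
        }).encode("utf-8")
        # Hypothetical ServiceNow instance URL and base64("user:password") in env vars
        req = urllib.request.Request(
            os.environ["SNOW_URL"] + "/api/now/table/incident",
            data=body,
            headers={
                "Content-Type": "application/json",
                "Authorization": "Basic " + os.environ["SNOW_BASIC_AUTH"],
            },
            method="POST",
        )
        urllib.request.urlopen(req)  # create one incident per log event
```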

Involved in loading and transforming large sets of Structured, Semi-Structured, and Unstructured data and analyzed them by running Hive queries and Pig scripts.

Created Managed tables and External tables in Hive and loaded data from HDFS.

Developed Spark code using Scala and Spark SQL for faster processing and testing, and performed complex HiveQL queries on Hive tables.

Scheduled several time-based Oozie workflows by developing Python scripts.

Exported data to RDBMS servers using Sqoop and processed it for ETL operations.

Worked on S3 buckets on AWS to store CloudFormation templates and worked on AWS to create EC2 instances.

Designed ETL Data Pipeline flow to ingest the data from RDBMS source to Hadoop using shell script, Sqoop, package, and MySQL.

End-to-end architecture and implementation of client-server systems using Scala, Java, and related technologies on Linux.

Optimized Hive tables using techniques like partitioning and bucketing to provide better query performance.

Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.

Implemented Hadoop with AWS EC2, using a few instances to gather and analyze data log files.

Involved in Spark and Spark Streaming, creating RDDs and applying operations (transformations and actions).

Created partitioned tables and loaded data using both static partition and dynamic partition methods.

Developed custom Spark programs in Scala to analyze and transform unstructured data.

Managed importing of data from various data sources, performed transformations using Hive and MapReduce, and loaded the data into HDFS.

Extracted the data from Oracle into HDFS using Sqoop.

Used Kafka for publish-subscribe messaging as a distributed commit log; experienced with its speed, scalability, and durability.

Followed the Test-Driven Development (TDD) process, with experience in Agile and Scrum programming methodologies.

Implemented POC to migrate Map Reduce jobs into Spark RDD transformations using Scala.

Scheduled MapReduce jobs in the production environment using the Oozie scheduler.

Involved in Cluster maintenance, Cluster Monitoring, Troubleshooting, Managing, and reviewing data backups and log files.

Designed and implemented MapReduce jobs to support distributed processing using Java, Hive, and Apache Pig.

Analyzed Hadoop clusters and different Big Data analytic tools including Pig, Hive, HBase, and Sqoop.

Improved the Performance by tuning Hive and MapReduce.

Researched, evaluated, and utilized modern technologies/tools/frameworks around the Hadoop ecosystem.

Environment: HDFS, MapReduce, Hive, Sqoop, Pig, Flume, Vertica, Oozie Scheduler, Scala, Java, Shell Script, Teradata, Oracle, HBase, MongoDB, Cassandra, Cloudera, AWS, JavaScript, JSP, Kafka, Spark, Spark Streaming, ETL, Python, RDBMS.

Unisys Global Services Pvt Ltd, India Jan 2016 – Jul 2017

Role: Hadoop Engineer

Responsibilities:

Developed highly optimized Spark applications to perform data cleansing, validation, transformation and summarization activities

Built a data pipeline consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.

Created Spark jobs and Hive Jobs to summarize and transform data.

Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.

Converted Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.

Used different tools for data integration with different databases and Hadoop.

Built real-time data pipelines by developing Kafka producers and Spark Streaming applications for consuming the data.

Ingested syslog messages, parsed them, and streamed the data to Kafka.
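
A minimal producer sketch using the kafka-python client in the spirit of the syslog-to-Kafka pipeline above; the broker address, topic name, and event fields are hypothetical:

```python
import json
from kafka import KafkaProducer

# Hypothetical broker address and JSON serialization of event payloads
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish one parsed syslog-style event; a Spark Streaming job consumes this topic
event = {"host": "app01", "severity": "INFO", "message": "user login"}
producer.send("syslog-events", value=event)
producer.flush()  # block until the message is actually delivered
```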

Handled importing data from different data sources into HDFS using Sqoop, and performed transformations using Hive and MapReduce before loading the results back into HDFS.

Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.

Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis

Analyzed the data by performing Hive queries (HiveQL) to study customer behavior.

Helped DevOps engineers deploy code and debug issues.

Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.

Developed Hive scripts in HiveQL to de-normalize and aggregate the data.

Scheduled and executed workflows in Oozie to run various jobs.

Implemented business logic in Hive and wrote UDFs to process the data for analysis.

Addressed issues occurring due to the huge volume of data and transitions.

Documented operational problems by following standards and procedures using JIRA.

Environment: Spark, Scala, Hive, Apache NiFi, Kafka, HDFS, Oracle, HBase, MapReduce, Oozie, Sqoop

Smart Software Tech. Dev. Pvt. Ltd., India Apr 2014 – Dec 2015

Role: Hadoop Engineer

Responsibilities:

Interacted with the business requirements and design teams and prepared the low-level and high-level design documents.

Provided in-depth technical and business knowledge to ensure efficient design, programming, implementation, and ongoing support for the application.

Involved in identifying possible ways to improve the efficiency of the system.

Logical implementation and interaction with HBase.

Efficiently put and fetched data to/from HBase by writing MapReduce jobs.

Developed MapReduce jobs to automate the transfer of data to/from HBase.

Assisted with the addition of Hadoop processing to the IT infrastructure.

Implemented and executed MapReduce jobs to process the log data from the ad servers.

Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.

Worked on MongoDB and Cassandra.

Prepared multi-cluster test harness to exercise the system for better performance.

Environment: Hadoop, HDFS, MapReduce, HBase, Hive, Cassandra, Hadoop distributions of Hortonworks and Cloudera, SQL*Plus, and Oracle 10g.

Education:

Master of Computer Science from New York Institute of Technology, New York, NY

Bachelor's in Computer Science from JNTUH, India


