
Hadoop Developer: Spark, Hive, Kafka, Java, Python, HBase, Sqoop

Location: Groton, CT
Posted: February 24, 2018


Srinivasarao Padala

Groton, Connecticut - USA

T: +1-248-***-****

E: ac4li1@r.postjobfree.com

Sr. Big Data Engineer

Summary

● 5 years of experience as a professional Hadoop developer in batch and real-time data processing using various Hadoop components: Spark, Solr, Kafka, HBase, Hive, NiFi, Sqoop, Storm and Java

● Experience in building a Hortonworks Hadoop cluster (HDP 2.5)

● Working experience in PySpark with AWS cloud components such as S3 and Redshift

● Deputed to TCS Singapore for a period of 6 months to work closely with DBS on building a business layer over various transactional sources using Hadoop components

● Experience working with PySpark, Spark, Spark Streaming and Spark SQL, and in integrating Spark with various components: Solr, Kafka, HDFS, HBase, Hive and Amazon Kinesis

● Extensive experience in building SolrCloud clusters, Solr indexing and Banana dashboards, and in preparing custom Solr schemas and configurations

● Utilized Storm, Kafka and Amazon Kinesis for processing large volumes of data

● Experience in importing and exporting data using Sqoop from HDFS/Hive/HBase to RDBMS and vice-versa

● Experience working with MapReduce, Pig scripts, the Hive query language and HCatalog, and in extending Hive and Pig functionality by writing custom UDFs

● Experience in analyzing data using HiveQL, Pig, Spark and custom MapReduce programs in Java

● Hands-on experience with NiFi (HDF) in building data routing and transformation dataflows, integrated with various components (HDFS, Hive, HBase, Solr, MS SQL and Kafka) as sources/targets

● Experience working with multiple data formats: Avro, Parquet, JSON, XML, CSV

● Utilized Oozie workflows to schedule Sqoop, Java, Hive, Hive2, Pig, MapReduce and shell script actions in a Kerberized HDP cluster

● Experienced in HBase and Phoenix

● Hands-on experience with Azure and Amazon AWS cloud services: EC2, S3, Data Pipeline and EMR

● In-depth understanding of Hadoop architecture and various components such as HDFS, YARN, ZooKeeper and MapReduce

● Work experience with various Hadoop distributions (Cloudera, Hortonworks) and cloud platforms (AWS, Microsoft Azure)

Education

Bachelor of Computer Science and Engineering (CSE) with Distinction, Bapatla Engineering College, Acharya Nagarjuna University, India

Technical Skills

Hadoop Distributions: Cloudera, Hortonworks

Cloud Platforms: Amazon Cloud and Microsoft Azure

Data Movement and Integration: NiFi (HDF), Sqoop, Kafka, Amazon Kinesis

Search Engine: Solr

Processing/Computing Frameworks: PySpark, Spark, Spark Streaming, MapReduce, Storm

Query Languages: HiveQL, Spark SQL, SQL, Impala

Security: Kerberos, Ranger

File Formats: Avro, Parquet, XML, JSON, CSV, XLSX

Workflow Schedulers: Oozie, Unix cron

Other Big Data Components: YARN, ZooKeeper, Ambari, Hue, Tez, Pig

Cluster Installation: Hortonworks HDP 2.5 using Ambari 2.4

Databases: HBase, Oracle, MS SQL, Redshift

Languages: Java, Python, D3

Development/Build Tools: Eclipse, Maven, SVN

Java Frameworks: Hibernate, JBoss Drools Engine

Operating Systems: Linux, Windows

Experience

Pfizer - CT, USA

Sr. Hadoop & Cloud Developer, Tata Consultancy Services (TCS)
Client - Pfizer Inc

Groton, Connecticut-USA June 2017 -- Present

SDC Data Lake & Analytics - The main purpose of this project is to build a Scientific Data Cloud (SDC, a data lake) to store various scientific laboratory and manufacturing datasets using Hadoop and cloud technologies, and to build analytics dashboards on top of the SDC. The SDC is built on the MapR and AWS cloud platforms.

Responsibilities:

a. Involved in discussions with business users to gather the required knowledge
b. Analysed the requirements to develop the framework
c. Developed Java Spark Streaming scripts to load raw files and the corresponding processed metadata files into AWS S3 and an Elasticsearch cluster
d. Developed Python scripts to get the most recent S3 keys from Elasticsearch
e. Developed Python scripts to fetch S3 files using the Boto3 module
f. Implemented PySpark logic to transform and process various formats of data such as XLSX, XLS, JSON and TXT

g. Built scripts to load PySpark-processed files into Redshift (a sketch of the end-to-end flow follows the tools list below)
h. Developed scripts to monitor and capture the state of each file as it passes through the PySpark logic

i. Implemented shell scripts to automate the whole process

Tools/Components: AWS S3, Java 1.8, Maven, Python 2.7, Spark 1.6.1, Kafka, Elasticsearch 5.3, MapR cluster, Amazon Redshift, shell script
Python Modules: boto3, pandas, elasticsearch, certifi, pyspark, psycopg2, json, io
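The sketch below illustrates the flow described above: look up the most recent S3 keys in Elasticsearch, pull the files with Boto3, process them with PySpark and load the result into Redshift over JDBC. It is a minimal, hypothetical example; the index, bucket, table and connection names are placeholders rather than the project's actual values, and it uses the modern SparkSession API instead of the Spark 1.6 API used on the project.

```python
# Hypothetical sketch of the flow described above; the Elasticsearch index,
# field names, S3 bucket, Redshift table and connection details are all
# placeholders, not the project's actual values.
import boto3
from elasticsearch import Elasticsearch
from pyspark.sql import SparkSession

es = Elasticsearch(["http://es-host:9200"])            # assumed ES endpoint
s3 = boto3.client("s3")
spark = SparkSession.builder.appName("sdc-loader").getOrCreate()

# 1. Look up the most recently registered S3 keys from Elasticsearch.
resp = es.search(
    index="sdc-file-metadata",                         # hypothetical index
    body={"query": {"match_all": {}},
          "sort": [{"ingest_ts": {"order": "desc"}}],  # hypothetical field
          "size": 100},
)
keys = [hit["_source"]["s3_key"] for hit in resp["hits"]["hits"]]

# 2. Pull the raw files from S3 with boto3.
for key in keys:
    s3.download_file("sdc-raw-bucket", key, "/tmp/" + key.split("/")[-1])

# 3. Transform with PySpark and load the result into Redshift over JDBC
#    (assumes the Redshift JDBC driver is on the Spark classpath).
df = spark.read.json("file:///tmp/*.json")             # example: JSON inputs
cleaned = df.dropna()
(cleaned.write
    .format("jdbc")
    .option("url", "jdbc:redshift://redshift-host:5439/sdc")
    .option("dbtable", "public.sdc_files")
    .option("user", "loader")
    .option("password", "********")
    .mode("append")
    .save())
```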

Delphi - MI, USA

Sr. Hadoop Developer, Tata Consultancy Services (TCS)
Client - Delphi Automotive

Hyderabad, India March 2017 -- May 2017

Troy, Michigan-USA August 2016 -- Feb 2017

Hyderabad, India March 2016 -- July 2016

Plant Visibility Using Big Data - The project is a package of three use cases in which various Hadoop components were used to build a Hadoop data lake and perform analytics across several areas.

1. Production Status Display: This use case focuses on displaying real-time production counts of parts manufactured in various plant locations and calculating yield information of GOOD vs BAD parts.

Responsibilities:

a. Analysed the requirements to develop the framework
b. Imported plant event data from various plants using Apache NiFi
c. Implemented transformation logic on plant events using Apache NiFi
d. Built an HBase data lake and created secondary indexes using Phoenix (see the sketch after the tools list below)
e. Managed and tuned HBase to improve performance

Tools/Components: Apache NiFi 1.0, JSON/Expression Language, HBase 1.1.2, Phoenix 4.7, Microsoft Azure HDP 2.5 cluster
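As a rough illustration of item d, the following hypothetical Python snippet uses the phoenixdb client (via the Phoenix Query Server) to define a Phoenix table over the HBase plant-event data and add a secondary index; all table, column and host names are invented for the example.

```python
# Hypothetical sketch only: table, column and host names are invented.
import phoenixdb

# Connect through the Phoenix Query Server (assumed to listen on port 8765).
conn = phoenixdb.connect("http://phoenix-qs-host:8765/", autocommit=True)
cur = conn.cursor()

# Phoenix table over the HBase plant-events data.
cur.execute("""
    CREATE TABLE IF NOT EXISTS PLANT_EVENTS (
        EVENT_ID   VARCHAR PRIMARY KEY,
        PLANT_CODE VARCHAR,
        PART_NO    VARCHAR,
        STATUS     VARCHAR,
        EVENT_TS   TIMESTAMP
    )
""")

# Secondary index so lookups by plant and time avoid full-table scans.
cur.execute("""
    CREATE INDEX IF NOT EXISTS IDX_PLANT_TS
    ON PLANT_EVENTS (PLANT_CODE, EVENT_TS)
    INCLUDE (PART_NO, STATUS)
""")

# Example query the production-status display could run for one plant.
cur.execute(
    "SELECT PART_NO, STATUS, EVENT_TS FROM PLANT_EVENTS "
    "WHERE PLANT_CODE = ? ORDER BY EVENT_TS DESC LIMIT 20",
    ["PLANT_01"],
)
print(cur.fetchall())
conn.close()
```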

2. Solr indexing on HDFS and HBase data: This use case builds various real-time Banana dashboards on Solr-indexed HDFS data (using Solr + Spark Streaming integration) and on Solr-indexed HBase data (using the HBase indexer).

Responsibilities:

a. Built a SolrCloud cluster with an external ZooKeeper quorum
b. Indexed real-time HDFS plant events using Solr and Spark Streaming (see the sketch after the tools list below)
c. Indexed HBase cycle-time events using the Lucidworks HBase indexer
d. Built Banana dashboards

e. Made configuration changes in Banana to make the dashboards available to end users

Tools/Components: Lucidworks Solr 5.5.2, Apache Spark 1.6, HBase 1.1.2, Banana 1.6 dashboard, Java 1.8, shell script, Microsoft Azure HDP 2.5 cluster
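A minimal sketch of item b, indexing plant-event files that land on HDFS into Solr from a Spark Streaming job. It is written in Python with the pysolr client purely for illustration (the project used Java and the Lucidworks integration); the HDFS path, Solr URL and collection name are assumptions.

```python
# Hypothetical sketch: the HDFS path, Solr URL, collection name and record
# format are assumptions; the project used Java with the Lucidworks tooling.
import json
import pysolr
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="hdfs-events-to-solr")
ssc = StreamingContext(sc, batchDuration=10)    # 10-second micro-batches

# Watch an HDFS landing directory for new plant-event files (one JSON per line).
events = ssc.textFileStream("hdfs:///data/plants/events/incoming")

def index_partition(records):
    # One Solr client per partition; relies on Solr's autoCommit settings.
    solr = pysolr.Solr("http://solr-host:8983/solr/plant_events", timeout=10)
    docs = [json.loads(line) for line in records]
    if docs:
        solr.add(docs, commit=False)

events.foreachRDD(lambda rdd: rdd.foreachPartition(index_partition))

ssc.start()
ssc.awaitTermination()
```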

3. Scrap Analytics: The main purpose of this use case is to build SAP-BO reports on Hive data, which is imported frequently from MS SQL servers using Sqoop and the Oozie workflow scheduler.

Responsibilities:

a. Imported data using Sqoop from MS SQL Server into HDFS
b. Built Hive scripts to perform queries and transformations (see the sketch after the tools list below)
c. Built Oozie coordinator workflows of Sqoop and Hive actions to schedule daily and incremental jobs

d. Helped the team build SAP-BO reports on Hive using the ODBC driver

Tools/Components: Sqoop 1.4, Hive 1.2, Oozie 4.2, shell script
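As a hypothetical illustration of items b and c, the snippet below runs a daily incremental load into a Hive reporting table from Python via PyHive; on the project this logic lived in Hive scripts driven by an Oozie coordinator, and the database, table and host names here are placeholders.

```python
# Hypothetical sketch only; database, table and host names are placeholders.
from pyhive import hive

conn = hive.Connection(host="hive-server-host", port=10000, username="etl_user")
cur = conn.cursor()

# Sqoop lands each day's delta from MS SQL into the staging table; this folds
# it into the partitioned reporting table used by the SAP-BO reports.
cur.execute("""
    INSERT INTO TABLE scrap_analytics.scrap_events
    PARTITION (load_date = '2018-02-24')
    SELECT plant_code, part_no, scrap_qty, reason_code, event_ts
    FROM scrap_analytics.scrap_events_staging
""")
conn.close()
```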

DBS - Singapore

Hadoop Developer, Tata Consultancy Services(TCS)

Client - Development Bank Of Singapore (DBS)

Singapore August 2015 -- February 2016

Group Finance Architecture Program (Pre-General Ledger implementation):

The main purpose of this project is to implement a Pre-General Ledger (Pre-GL) layer using Hadoop infrastructure (Cloudera distribution). Pre-GL is an intermediate mapping and data enrichment layer that resides between the transaction processing systems (e.g. current account and savings systems) and the finance systems (e.g. PSGL, Operational Data Store (ODS)).

Key benefits of implementing the Pre-GL layer are:

● Align the accounting entries from source systems to populate a standard set of PSGL chartfields (Ledger, Business Unit, Account, Product, PC Code, Chartfield3, Original CCY, Base CCY).

● Summarised Accounting entries to be sent to PSGL in order to maintain the performance/EOD processing of PSGL.

● Detailed (non-summarised) accounting entries to be sent to ODS and any other downstream systems that require such information.

● Reduce Manual journal entries (MJE) posted across operations.

● Decommission legacy mainframe systems

● Faster Book closing

Responsibilities:

j. Involved in discussions with business users to gather the required knowledge
k. Analysed the requirements to develop the framework
l. Imported data using Sqoop

m. Data ingestion into Hive

n. Processed Hive data using Spark and Spark SQL (see the sketch after the tools list below)

o. Integrated JBoss Drools with Spark transformations
p. Sent files to PSGL and ODS

Tools/Components: Cloudera 5.4.3 cluster, Apache Spark 1.3, JBoss Drools, Java, Maven, Sqoop, HDFS
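The following is a minimal PySpark sketch of the Pre-GL split described above: detailed entries go to an ODS-facing table, while summarised entries, aggregated on the PSGL chartfields, go to PSGL. Table and column names are invented for the example, and the project itself used Spark 1.3 with Java and JBoss Drools for the mapping rules.

```python
# Hypothetical sketch; table and column names are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("pre-gl")
         .enableHiveSupport()
         .getOrCreate())

# Accounting entries already ingested into Hive from the transaction systems.
entries = spark.table("pre_gl.source_accounting_entries")

# Detailed (non-summarised) entries for ODS and other downstream systems.
entries.write.mode("overwrite").saveAsTable("pre_gl.ods_detailed_entries")

# Summarised entries, aggregated on the PSGL chartfields, for PSGL.
summary = (entries
           .groupBy("ledger", "business_unit", "account", "product",
                    "pc_code", "chartfield3", "original_ccy", "base_ccy")
           .agg(F.sum("amount").alias("amount")))
summary.write.mode("overwrite").saveAsTable("pre_gl.psgl_summary_entries")
```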

AMAT - India

Hadoop Developer, Tata Consultancy Services(TCS)

Client - Applied Materials (AMAT)

Hyderabad, India November 2014 - July 2015

In this engagement, I was involved in two use cases.

Bill of Materials (BOM):

This use case was developed to understand the relations among the various raw materials (up to 18 levels) that are used to build a final product.

Responsibilities:

a. Analysed the requirements to develop the framework
b. Ingested data into HDFS and integrated it into Hive
c. Developed the script to integrate HBase with Hive data
d. Built the scripts to index data using the HBase Lily indexer + Solr
e. Developed Solr Java code to bring up the relations among materials (see the sketch after the tools list below)
f. Built the logic to fetch the hierarchical data and provide a search by component number using Java and JSP integration
g. Visualized the results using D3 JavaScript dashboards

Tools/Components: Cloudera CDH 5.2, Solr, Java, Sqoop, Hive, HBase, D3.js, JSP
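To illustrate items e and f, here is a hypothetical Python/pysolr version of the hierarchy walk (the project implemented it in Java against Solr): starting from a component number, it recursively pulls child components up to 18 levels deep. The collection and field names are assumptions.

```python
# Hypothetical sketch; the Solr collection and field names are assumptions.
import pysolr

solr = pysolr.Solr("http://solr-host:8983/solr/bom_components", timeout=10)

def fetch_children(component_no, level=0, max_levels=18):
    """Recursively fetch child components of a part, up to 18 levels deep."""
    if level >= max_levels:
        return []
    results = solr.search("parent_component_no:%s" % component_no, rows=1000)
    tree = []
    for doc in results:
        child = doc["component_no"]
        tree.append({
            "component_no": child,
            "level": level + 1,
            "children": fetch_children(child, level + 1, max_levels),
        })
    return tree

if __name__ == "__main__":
    print(fetch_children("FINAL-PRODUCT-001"))
```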

Change Data Capture Mechanism (CDC):

The main purpose of CDC is to build a system that automatically identifies and captures data changes in Oracle/DB2 data sources and stores them in the big data warehouse, maintaining an up-to-date replica of the data.

Responsibilities:

a. Analysed the requirements to develop the framework
b. Developed Sqoop scripts and Data Services jobs to pull delta data and store it in HDFS

c. Developed Hive scripts to merge delta data with the existing Hive data
d. Worked on Oozie scripts to schedule the above process every 30 minutes
e. Developed a reconciliation framework in Java for record-level comparison (see the sketch after the tools list below)

Tools/Components: Cloudera CDH 5.2, Java, Sqoop, Hive
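As a rough sketch of the reconciliation idea in item e (the project's framework was written in Java), the PySpark snippet below compares a Sqoop delta extract against the merged Hive table and reports rows that did not make it across; the table names and the key column are placeholders.

```python
# Hypothetical sketch; table names and the join key are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cdc-reconciliation")
         .enableHiveSupport()
         .getOrCreate())

source = spark.table("staging.orders_delta")     # today's Sqoop delta pull
target = spark.table("warehouse.orders")         # merged Hive table

# Records present in the delta extract but missing from the merged copy.
missing = source.join(target, on="order_id", how="left_anti")

print("delta rows:   ", source.count())
print("target rows:  ", target.count())
print("missing rows: ", missing.count())
```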

BT - India

Hadoop Developer, Tata Consultancy Services(TCS)

Client - British Telecom (BT)

Hyderabad, India November 2013 - August 2014

TV Analytics:

The main purpose of this project is to speed up data processing using big data technologies, compared to the existing Oracle systems, and to build ad hoc reports on the processed Hive data.

We developed Tableau reports by processing various raw feeds such as UserProfile, physicalDevice, ProductDevice, Media, Release, PurchaseEvents, IntentToPlay, Products and Assets data.

Responsibilities:

a. Analysed the requirements to develop the framework
b. Developed Sqoop scripts to import data from Oracle to HDFS
c. Developed MapReduce programs for cleansing and validating the imported HDFS data (see the sketch after the tools list below)

d. Implemented custom key and partitioning techniques in MapReduce programming

e. Developed Hive table structures to load the cleansed data
f. Made configuration changes in Hive and the MapReduce programs as part of performance tuning

g. Executed queries in Impala to compare query performance
h. Built a workflow of Sqoop, MapReduce and Hive scripts and scheduled them in Oozie via Hue

i. Helped the Tableau team build reports by connecting to Impala

Tools/Components: Cloudera 4, Java, Sqoop, MapReduce, Hive, Impala, Oozie, Hue
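The snippet below sketches the cleansing and key-based partitioning idea from items c and d in PySpark, purely for illustration; the project implemented it as Java MapReduce with a custom key and partitioner, and the HDFS paths and field layout here are assumptions.

```python
# Hypothetical PySpark stand-in for the Java MapReduce job; the HDFS paths
# and field layout are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tv-analytics-cleanse").getOrCreate()
raw = spark.sparkContext.textFile("hdfs:///data/bt/purchase_events/raw")

def parse(line):
    # Expect: user_id|product_id|event_ts|amount ; drop malformed rows.
    parts = line.split("|")
    if len(parts) != 4 or not parts[3].replace(".", "", 1).isdigit():
        return None
    return ((parts[0], parts[1]), (parts[2], float(parts[3])))

cleansed = raw.map(parse).filter(lambda kv: kv is not None)

# Composite (user_id, product_id) key, but partitioned on user_id alone so all
# of a user's events land together -- the same idea as the custom partitioner.
partitioned = cleansed.partitionBy(48, lambda key: hash(key[0]))
partitioned.saveAsTextFile("hdfs:///data/bt/purchase_events/cleansed")
```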

CV - India

Hadoop Developer, Tata Consultancy Services(TCS)

Client - Cable Vision (CV)

Hyderabad, India January 2013 - October 2013

Channel tuning:

The main purpose of this project is to capture users' daily interest in and behaviour on TV programs. To implement this, we used Kafka and Storm to collect, transform and process the daily channel-tuning events and save them in Amazon Redshift via S3 to build Tableau reports.

Responsibilities:

h. Analysed the requirements to develop the framework
j. Worked on Kafka and Storm to capture, transform and process real-time transactions and saved them in an S3 bucket

k. Ingested/copied data into Redshift from S3

l. Built Redshift queries to preprocess the data and store it back into the S3 bucket
m. Applied Pig transformations to merge consecutive records whose gap is under 3 seconds, storing the result back into S3 (see the sketch after the tools list below)
n. Built Redshift queries to split the merged records back into individual programs using the program schedule table

o. Scheduled the job using cron to run every hour

Tools/Components: Amazon Cloud (AWS), Hortonworks HDP cluster, EMR, S3, Redshift, Kafka, Storm, Java, Pig
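As an illustration of the merge logic in item m (done in Pig on the project), the hypothetical PySpark snippet below collapses consecutive tuning records for the same device and channel when the gap between them is under 3 seconds; the column names are assumptions and timestamps are taken to be epoch seconds.

```python
# Hypothetical PySpark version of the Pig merge; column names are assumptions
# and start_ts/end_ts are taken to be epoch seconds.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("channel-tuning-merge").getOrCreate()
events = spark.read.csv("s3://tuning-bucket/daily/", header=True,
                        inferSchema=True)

w = Window.partitionBy("device_id", "channel").orderBy("start_ts")

merged = (events
    .withColumn("prev_end", F.lag("end_ts").over(w))
    # A new session starts when the gap to the previous record is >= 3 seconds.
    .withColumn("new_session",
                (F.col("prev_end").isNull() |
                 ((F.col("start_ts") - F.col("prev_end")) >= 3)).cast("int"))
    .withColumn("session_id",
                F.sum("new_session").over(
                    w.rowsBetween(Window.unboundedPreceding,
                                  Window.currentRow)))
    .groupBy("device_id", "channel", "session_id")
    .agg(F.min("start_ts").alias("start_ts"),
         F.max("end_ts").alias("end_ts")))

merged.write.mode("overwrite").parquet("s3://tuning-bucket/merged/")
```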


