
Data Engineer

Newark, NJ
January 09, 2020




*** *-* **, ********, NJ 1-201-***-****

LinkedIn : GitHub :

EDUCATION

MS in Computer Science, New Jersey Institute of Technology, NJ, Aug’2018 - Dec’2019, GPA - 3.65/4.0

Coursework: Applied Statistics, Big Data, Data Analytics using R, Database Management Systems, Machine Learning

BE in Information Technology, Biju Patnaik University of Technology, India, Jul’2008 - Jul’2012, CGPA - 3.70/4.0

SKILLS

Programming: Python 3, Scala, PL/SQL, PowerShell scripting

Data Technologies: SQL, Apache Spark, HDFS, Sqoop, Flume, Kafka, MongoDB

Machine Learning: Supervised/Unsupervised Learning, Feature Engineering, Text Analysis

Tools & IDEs: PySpark, MySQL, Talend Studio, AWS (Redshift, S3, EMR, EC2)

BI tools: Tableau


• Cloudera Certification (CCA 175)

• AWS Certified Big Data - Specialty


Tata Consultancy Services – Data Engineer - Hyderabad, India Oct’2012 – July’2018

Technologies: Big Data, MySQL

Clients: UNUM, Davita

• Ingested audit data from an FTP location into HDFS using Flume, processed it with MapReduce and Pig jobs, and wrote Pig transformation scripts that combine data from several sources into baseline data

• Created and ran Sqoop jobs with incremental loads from MySQL to populate Hive external tables, using Hive partitions and buckets.

• Built interactive Tableau dashboards and stories for presentations

• Worked on a Spark SQL task that fetches the non-null data from two different tables and loads it into a lookup table. The lookup table is loaded with the daily data incrementally and checked for duplicates.

• Created S3 buckets, managed their bucket policies, and used S3 and Glacier for storage and backup on AWS.

• Performed k-means clustering to understand customer backgrounds and segment customers by transaction behavior

• Developed and maintained an Oracle database using PL/SQL for an infra-net management system

• Followed ETL best practices when creating Talend ETL jobs. Created implicit, local, and global context variables in the jobs
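The incremental lookup load mentioned in the Spark SQL bullet above can be sketched roughly as follows. This is only an illustration of the logic (keep non-null rows from two sources, insert only rows not already in the lookup table), not the original job: it uses Python's built-in SQLite in place of Spark SQL, and the table and column names (`source_a`, `source_b`, `lookup`, `record_id`, `value`) are hypothetical.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical tables: two daily sources and one lookup target.
    conn.executescript(
        """
        CREATE TABLE source_a (record_id INTEGER, value TEXT);
        CREATE TABLE source_b (record_id INTEGER, value TEXT);
        CREATE TABLE lookup   (record_id INTEGER, value TEXT);
        """
    )


def load_lookup_incrementally(conn: sqlite3.Connection) -> int:
    """Load new non-null rows from both sources; return how many were added."""
    before = conn.execute("SELECT COUNT(*) FROM lookup").fetchone()[0]
    conn.execute(
        """
        INSERT INTO lookup (record_id, value)
        SELECT s.record_id, s.value
        FROM (
            -- UNION (not UNION ALL) removes duplicates across the two sources
            SELECT record_id, value FROM source_a
            WHERE record_id IS NOT NULL AND value IS NOT NULL
            UNION
            SELECT record_id, value FROM source_b
            WHERE record_id IS NOT NULL AND value IS NOT NULL
        ) AS s
        -- Skip rows already loaded on a previous day
        WHERE NOT EXISTS (
            SELECT 1 FROM lookup AS l
            WHERE l.record_id = s.record_id AND l.value = s.value
        )
        """
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM lookup").fetchone()[0]
    return after - before
```

Running the load a second time without new source rows should add nothing, which is the duplicate check the bullet refers to.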


Personalized cancer diagnosis:

Problem statement: Classify the given genetic variations/mutations based on evidence from text-based clinical literature

• Combined the text and training data to form the final reference text columns

• Applied text vectorization methods such as bag-of-words, TF-IDF, and Word2Vec with classification algorithms to predict the class label

• Finally applied an LSTM, which gave good accuracy. Evaluated the model using a confusion matrix and accuracy score.

Tableau visualization of US agriculture Data:

• Created a dashboard story using factors such as crop and age of farmers from the 2017 USAA dataset.

• Link to the dataset: Tableau visualization of US agriculture Data

Healthcare Dataset with Spark:

• Predicted the probability of an observation belonging to a category, where the category is whether the patient has a stroke.

• Applied libraries such as Spark SQL, SQLContext, and Spark ML (Decision Tree)

Twitter trending topics - (Apache Spark, Spark Streaming, Python):

• Developed a Spark Streaming application that finds trending topics on Twitter in real time

• Fetched tweets using a Twitter HTTP client app and sent them to Spark over a TCP connection.

• Applied transformation and aggregation logic to find the frequency of hashtags and created a dashboard to present the results.

Contosa Retail Data Warehouse & Business Intelligence:

Technologies: Talend, Tableau

• Configured ETL activities for a retail data warehouse in Talend and improved the performance of these jobs by 80%

• Implemented slowly changing dimensions to track changing product pricing and costing in a retail warehouse, which made more data available for analysis
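As a rough illustration of the slowly changing dimension idea in the last bullet, the sketch below keeps a price history per product. It assumes a Type 2 dimension (the resume does not say which type was used) and uses plain Python rather than Talend; the `PriceRow` class, its fields, and `apply_price_change` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class PriceRow:
    product_id: str
    price: float
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current row


def apply_price_change(
    history: List[PriceRow], product_id: str, new_price: float, effective: date
) -> List[PriceRow]:
    """Close the current row and append a new one when the price changes."""
    current = next(
        (r for r in history if r.product_id == product_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.price == new_price:
            return history  # unchanged price: keep the current row open
        current.valid_to = effective  # expire the old version
    history.append(PriceRow(product_id, new_price, effective))
    return history
```

Because old rows are closed rather than overwritten, earlier prices stay available for analysis, which matches the benefit the bullet describes.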
