AKHILA K CHOUDHURY
LinkedIn: https://www.linkedin.com/in/akhil-choudhury/ | GitHub: https://github.com/Akhilcet307
EDUCATION
MS in Computer Science – New Jersey Institute of Technology, NJ, Aug 2018 – Dec 2019, GPA: 3.65/4.0
Coursework: Applied Statistics, Big Data, Data Analytics Using R, Database Management Systems, Machine Learning
BE in Information Technology – Biju Patnaik University of Technology, India, Jul 2008 – Jul 2012, CGPA: 3.70/4.0
SKILLS
Programming: Python 3, Scala, PL/SQL, PowerShell scripting
Data Technologies: SQL, Apache Spark, HDFS, Sqoop, Flume, Kafka, MongoDB
Machine Learning: Supervised/Unsupervised Learning, Feature Engineering, Text Analysis
Tools & IDEs: PySpark, MySQL, Talend Studio, AWS (Redshift, S3, EMR, EC2)
BI Tools: Tableau
CERTIFICATIONS
• Cloudera Certification (CCA 175)
• AWS Certified Big Data - Specialty
EXPERIENCE
Tata Consultancy Services – Data Engineer, Hyderabad, India, Oct 2012 – Jul 2018
• Improved performance of existing Hadoop algorithms by migrating them to Spark, working with SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
• Configured Spark Streaming to consume real-time data from Kafka and persist the streams to HDFS.
• Imported data from various sources, performed transformations using Hive and MapReduce, loaded the results into HDFS, and extracted data from MySQL into HDFS using Sqoop.
• Applied Spark transformations and actions to cleanse input data.
• Worked with different file formats (ORC, text) and compression codecs (gzip, Snappy, LZO).
• Processed audit data from an FTP location by ingesting it into HDFS with Flume and processing it with MapReduce and Pig jobs; wrote extensive Pig scripts to transform data from several sources into baseline datasets.
• Created Sqoop jobs with incremental loads from MySQL to populate Hive external tables, using Hive partitions and buckets.
• Created S3 buckets and managed their access policies, using S3 for storage and backup on AWS.
• Worked with EC2 instances.
• Developed and maintained an Oracle database using SQL for an infra-net management system.
• Worked with the data-analyst team using Python DataFrames (NumPy, Pandas, Matplotlib/Seaborn) to describe datasets and produce reports.
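A minimal sketch of the Pandas-based dataset profiling described above; the column names and values are hypothetical stand-ins for a real dataset:

```python
import pandas as pd

# Hypothetical sample data standing in for an actual dataset.
df = pd.DataFrame({
    "age": [34, 45, 29, 52],
    "salary": [72000, 85000, 64000, 91000],
})

# describe() summarizes count, mean, std, min/max, and quartiles per column,
# which is the usual starting point for a dataset report.
summary = df.describe()
print(summary)

# Individual statistics can also be pulled out directly for the report.
mean_age = df["age"].mean()
```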
• Built interactive Tableau dashboards and stories for presentations.
• Wrote Hadoop MapReduce jobs to process tasks and produce final outputs.
COURSEWORK PROJECTS
Personalized cancer diagnosis:
Problem statement: Classify the given genetic variations/mutations based on evidence from text-based clinical literature
• Combined the text data with the training data to form the final text-feature columns.
• Applied text-vectorization techniques such as Bag of Words, TF-IDF, and Word2Vec with classification algorithms to predict the class label.
• Finally applied an LSTM, which gave good accuracy; the model was evaluated with a confusion matrix and accuracy score.
Tableau visualization of US agriculture Data:
• Created a dashboard story using factors such as crop type and farmer age from the 2017 USDA dataset.
• Dataset link: Tableau visualization of US agriculture Data
Healthcare Dataset with Spark:
• Predicted the probability of an observation belonging to a category: here, the likelihood of a patient having a stroke.
• Used Spark SQL (SQLContext) and Spark ML (decision trees).
Twitter trending topics (Apache Spark, Spark Streaming, Python):
• Developed a Spark Streaming application that finds trending topics on Twitter in real time.
• Retrieved tweets with a Twitter HTTP client app and sent them to Spark over a TCP connection.
• Applied transformation and aggregation logic to compute hashtag frequencies and built a dashboard to present the results.
IMDB dataset:
• Built a classification algorithm to categorize the movie dataset.
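The hashtag-frequency aggregation from the Twitter trending project above can be sketched outside Spark with plain Python; the sample tweets are made up for illustration:

```python
from collections import Counter

# Hypothetical tweet stream standing in for the TCP feed from the client app.
tweets = [
    "Loving the new release #spark #bigdata",
    "#spark streaming makes this easy",
    "Dashboards everywhere #tableau #bigdata",
]

def extract_hashtags(text):
    # Treat any whitespace-separated token starting with '#' as a hashtag.
    return [word.lower() for word in text.split() if word.startswith("#")]

# Count hashtag occurrences across the stream and take the most frequent.
counts = Counter(tag for tweet in tweets for tag in extract_hashtags(tweet))
trending = counts.most_common(2)  # top-2 hashtags by frequency
```

In the actual streaming application, the same logic runs per micro-batch over a DStream rather than over a fixed list.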