
HEMANTH ANNAVARAPU

DATA ENGINEER

adjwst@r.postjobfree.com

SUMMARY:

• 5 years of experience in Big Data analysis across multiple projects involving Hadoop MapReduce, Apache Spark, HDFS, Hive, Impala, Sqoop, Kafka and Oozie.

• 2 years of experience implementing Big Data solutions on cloud-based infrastructure (AWS).

• Experienced in working with a range of AWS services, including EC2, EMR, IAM, RDS, DMS, Glue, S3 and Elasticsearch.

• Excellent knowledge of Hadoop architecture and Hadoop ecosystem components.

• Experienced in building data pipelines using Spark, Scala/Python, Hive and HDFS to collect huge volumes of structured, unstructured and semi-structured data from various upstream sources and store it on HDFS for analysis and reporting.

• Developed Spark/Kafka Streaming jobs to process incoming streams of data and store them in Hive tables.

• Experienced in writing complex SQL queries to generate reports from large amounts of data stored in Hive tables and to perform data transformations in data pipelines built using Spark DataFrames.

• Worked on cloud-based Big Data platforms using technologies such as AWS EC2, S3, RDS, DMS and EMR.

• Experienced in working with various Spark data structures such as RDDs, Datasets and DataFrames; used Spark's built-in Web UI to monitor jobs and improved processing times using partitioning, broadcasting and checkpointing techniques.
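
A minimal PySpark sketch of the tuning techniques above (broadcast joins and explicit repartitioning); the table, column and output names are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("tuning-sketch").enableHiveSupport().getOrCreate()

    large_df = spark.table("events")       # hypothetical large fact table
    small_df = spark.table("dim_country")  # hypothetical small dimension table

    # Broadcast the small table so the join avoids a full shuffle of the large table.
    joined = large_df.join(broadcast(small_df), "country_code")

    # Repartition on the key used downstream to spread work evenly across executors.
    joined = joined.repartition(200, "country_code")

    joined.write.mode("overwrite").saveAsTable("events_enriched")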

• Experienced in designing and scheduling workflows for ETL using Airflow and Oozie.
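
A minimal Airflow sketch of such a scheduled workflow (Airflow 2.x-style imports; the DAG name, task names and commands are hypothetical):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_ingest",            # hypothetical DAG name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        load = BashOperator(task_id="load", bash_command="echo load")

        # Run the load step only after extraction succeeds.
        extract >> load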

• Experienced in automating manual tasks using Bash and Python scripts.

• Analyzed large sets of structured data using Hive queries. Developed custom Hive UDFs for data processing and transformation.

• Experienced in ingesting data from traditional database systems into the Hadoop data lake using Sqoop for analysis.

• Experienced in relational database systems (RDBMS) such as MySQL, MSSQL, Oracle.

• Excellent knowledge of Java and SQL for application development and deployment.

• Good understanding of core computer science concepts such as data structures and algorithms.

• Strong knowledge of programming languages such as C and C++.

• Experienced in Linux administration and operation.

• Exceptional ability to learn new technologies and deliver results on short deadlines.

• Excellent technical communication, analytical, problem-solving and troubleshooting skills, with the ability to work well with people from cross-cultural backgrounds.

EDUCATION:

Master’s in Computer Science · University of Louisiana at Lafayette, 2014 - 2015

Bachelor’s in Computer Science and Engineering · KL University, 2010 - 2014

TECHNICAL SKILLS:

Big Data: Apache Hadoop, Apache Spark, Hive, Impala, Sqoop, Kafka, Oozie, Ranger, Airflow, Elasticsearch

Hadoop Distributions: Cloudera, Hortonworks

Cloud Technologies: Amazon Web Services (AWS)

Languages: Core Java, C, C++, Scala, Python

Web Services & Technologies: REST, PHP, JavaScript, jQuery, XML, HTML, CSS, AJAX, JSON

Databases: SQL Server, MySQL, Oracle, Postgres

IDEs: Eclipse, IntelliJ IDEA

Build Tools: Maven, SBT

CI/CD Technologies: Git, Jenkins

PROJECT EXPERIENCE:

Cloudwick Technologies Inc., Newark, CA (Mar 2020 - Present)

Project: Freddie Mac - DOCSS

Role: Cloud Data Engineer

Description: The purpose of this team is to migrate the existing DOCSS (Document Storage Service) web application from the on-premises Big Data platform of Freddie Mac onto a cloud-based architecture utilizing AWS Elasticsearch and S3.

• Develop enterprise Java code to add new features/functionality to the application for writing data to Elasticsearch and S3.

• Implement store, search, and retrieval functionalities of documents to/from Elasticsearch and S3.

• Implement an authentication mechanism with the Ping IdP to authenticate to AWS using a SAML assertion and thereby access Elasticsearch and S3.
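
A rough Python/boto3 sketch of this SAML flow (the application code itself is Java; the role and provider ARNs and the assertion value are placeholders):

    import boto3

    saml_assertion_b64 = "..."  # base64-encoded SAML assertion returned by the Ping IdP

    # Exchange the SAML assertion for temporary AWS credentials via STS.
    sts = boto3.client("sts")
    resp = sts.assume_role_with_saml(
        RoleArn="arn:aws:iam::123456789012:role/app-role",            # placeholder
        PrincipalArn="arn:aws:iam::123456789012:saml-provider/ping",  # placeholder
        SAMLAssertion=saml_assertion_b64,
    )

    # Use the temporary credentials to talk to S3 (and, similarly, Elasticsearch).
    creds = resp["Credentials"]
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )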

• Develop Spark applications for migrating data from Solr to Elasticsearch and from HDFS/HBase to S3.
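
A minimal PySpark sketch of the HDFS-to-S3 portion of such a migration (paths and bucket names are hypothetical; the Solr and HBase paths depend on the specific connectors used and are omitted):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-to-s3").getOrCreate()

    # Read the existing documents from HDFS and rewrite them to S3 unchanged.
    docs = spark.read.parquet("hdfs:///data/docss/documents")               # hypothetical source path
    docs.write.mode("overwrite").parquet("s3a://docss-archive/documents/")  # hypothetical bucket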

• Tune migration applications on Spark to reduce bottlenecks and throttling to increase throughput and efficiency.

• Use complex SQL queries over large amounts of data stored in Hive tables to generate reports on data migrations.

• Act as a big data consultant in recommending the right tools/libraries to solve big data problems.

• Design, develop, test (QA) and maintain application code.

• Work with lines of business and development management to provide effective technical designs aligned with industry best practices.

Environment: HDP (Hortonworks Data Platform) 2.6.4, Hadoop 2.7.3, Spark, HBase, AWS S3, AWS EC2, AWS Elasticsearch, Java, Solr.

Cloudwick Technologies Inc., Newark, CA (Mar 2019 – Mar 2020)

Project: Freddie Mac - BDAP

Role: Big Data Engineer

Description: The purpose of this team is to migrate and implement the existing on-premises Big Data platform of Freddie Mac onto a cloud-based architecture utilizing AWS.

• Developed migration scripts for migrating various components of the on-premises Hadoop platform including HDFS, Hive Metastore, HBase, Atlas and Ranger.

• Used AWS DMS to migrate the on-premises Hive metastore from Oracle to Postgres on AWS RDS.

• Developed custom shell/Python scripts to validate the migration of various components by doing a one-to-one comparison of each component between the on-premises and cloud platforms.

• Developed PySpark scripts for validation of the Hive metastore migration and comparison of row counts of Hive tables between the on-premises and cloud clusters.
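
A minimal PySpark sketch of the row-count comparison (the table list and output path are hypothetical); the same script can be run on both the on-premises and cloud clusters and the two reports diffed:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rowcount-validation").enableHiveSupport().getOrCreate()

    tables = ["sales.orders", "sales.customers"]  # hypothetical table list

    # Count rows per table on the current cluster and write a small report.
    counts = [(t, spark.table(t).count()) for t in tables]
    report = spark.createDataFrame(counts, ["table_name", "row_count"])
    report.coalesce(1).write.mode("overwrite").csv("/tmp/rowcount_report", header=True)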

• Used complex SQL queries to generate reports from large amounts of data stored in Hive tables and to perform data transformations in data pipelines built using Spark DataFrames.

• Developed data pipelines for data experimentation using a variety of tools such as Kafka, Hadoop, Hive and Spark.

• Designed and maintained Hadoop Workflows/ETL for all the data products.

• Acted as a big data consultant in recommending the right tools/libraries to solve big data problems.

• Designed, developed, tested (QA) and maintained application code.

• Worked with lines of business and development management to provide effective technical designs aligned with industry best practices.

• Developed a highly scalable and extensible Big Data platform enabling collection, storage, modeling and analysis of massive data sets from numerous channels.

Environment: HDP (Hortonworks Data Platform) 2.6.4, Hadoop 2.7.3, Hive, Atlas, Ranger, Spark, HBase, Kafka, AWS S3, AWS EC2, AWS RDS, AWS DMS, Python, Shell scripting.

Bank of America, Charlotte, NC (May 2017 – Jan 2019)

Project: GIS Cloudera Platform

Role: Hadoop/Spark Developer

Description: This team is part of GIS (Global Information Security) at Bank of America. The purpose of the project is to provide a reliable and scalable Big Data platform supporting various projects in GIS and other LOBs at Bank of America. The objectives include fetching data from sources ranging from traditional RDBMS systems to real-time data pipelines, supporting various types of clients with different data consumption patterns, and generating reports based on ad-hoc requests from the business.

• Built data pipelines using Spark/Scala, Hive and HDFS to ingest data from various upstream sources into the production cluster for analysis and reporting.

• Developed distributed Spark/Scala programs for data processing and transformations.

• Developed standalone Java programs for fetching Email data and generating reports.

• Used complex SQL queries to generate reports from large amounts of data stored in Hive tables and to perform data transformations in data pipelines built using Spark DataFrames.

• Developed Kafka producer and consumer programs for message handling; developed Spark programs to parse logs and structure them in tabular format for effective querying of the log data.
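
A rough sketch of a Kafka producer in Python using kafka-python (the broker address, topic name and payload here are hypothetical):

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],  # hypothetical broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Publish one structured log record to the topic; flush before exiting.
    producer.send("app-logs", {"host": "web01", "event": "login", "status": "ok"})  # hypothetical topic
    producer.flush()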

• Developed complex multi-layered workflows for ETL using Oozie and Airflow.

• Developed custom Spark/Kafka streaming applications for ingesting data from various real-time data sources into Hive tables.
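
A minimal Structured Streaming sketch of this ingestion pattern (the original jobs may have used DStreams; it assumes the spark-sql-kafka connector is on the classpath, and the broker, topic and paths are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-hive").enableHiveSupport().getOrCreate()

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
              .option("subscribe", "security-events")             # hypothetical topic
              .load())

    # Keep the raw key/value as strings; real jobs would parse value into columns.
    events = stream.select(col("key").cast("string"), col("value").cast("string"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///warehouse/security_events")       # hypothetical table location
             .option("checkpointLocation", "hdfs:///tmp/chk/security")  # hypothetical checkpoint dir
             .start())
    query.awaitTermination()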

• Debugged and maintained Spark/Java code in case of any issues.

• Extracted data from various sources such as APIs using Java programs and shell scripting and ingested the data into Hive.

• Generated data analysis reports and handled ad-hoc data requests from the business using Hive and Impala.

Environment: Apache Spark, Apache Hadoop, HDFS, Hive, Impala, Sqoop, Kafka, Oozie, Maven, IntelliJ, Linux, Cloudera CDH 5.

Client: Bank of America, Charlotte, NC (Dec 2016 – Apr 2017)

Project: Wholesale Loss Forecasting

Role: Hadoop/Spark Developer

Description: This project supports Bank of America’s Comprehensive Capital Analysis and Review (CCAR) reporting. CCAR is a regulatory framework introduced by the US Federal Reserve to assess, regulate, and supervise large banks and financial institutions.

• Converted MapReduce code to Spark code for better performance.

• Debugged and maintained Spark model code in case of any issues.

• Coded Data Quality checks in Spark to ensure the quality of data intended for analysis.

• Built data pipelines to ingest data from various upstream sources to the production cluster for analysis and reporting.

• Performed data quality validation using transient and staging tables in Hive; once all checks completed, data was loaded into the final tables.
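
A minimal PySpark sketch of this staging-then-load pattern (the database, table and column names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dq-check").enableHiveSupport().getOrCreate()

    # Count rows in the staging table that violate basic quality rules.
    bad_rows = spark.sql("""
        SELECT COUNT(*) AS bad
        FROM staging.loans
        WHERE loan_id IS NULL OR balance < 0
    """).collect()[0]["bad"]

    # Load into the final table only when the staging data is clean.
    if bad_rows == 0:
        spark.sql("INSERT INTO final.loans SELECT * FROM staging.loans")
    else:
        raise ValueError("Data quality check failed: {} bad rows in staging.loans".format(bad_rows))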

• Provided production support during the CCAR reporting process when business was generating reports for the CCAR submission process.

• Wrote shell scripts for deployment validations and data copy between various lanes in the Development cluster.

• Performed deployments of the application to various environments for development and testing.

Environment: Apache Spark, HDFS, Hive, Sqoop, Linux, Cloudera CDH 5, Tableau, Oracle and WebLogic.

Stiaos Technologies, Houston, TX (Feb 2016 – Dec 2016)

Project: Clickstream Data Analysis

Role: Hadoop Developer

Description: Clickstream data is the path a user takes through the resources or web pages on a website. Clickstream data can give great insights for strategic decisions in the best interest of the business. Clickstream data analysis is the process of harnessing the value of clickstream data to improve user experience and sales.

• Worked on a Hadoop cluster with a current size of 56 nodes and 896 terabytes of capacity.

• Developed custom Apache Spark programs for data validation to filter unwanted data and cleanse the data.

• Ingested traditional RDBMS data into the HDFS from the existing SQL Server using Sqoop.

• Used Pig to process small samples of transformed data for product purchase prediction, to create an efficient product recommendation system.

• Created Hive tables and queried large amounts of data to apply the transformations prototyped on data samples in Pig at scale across the cluster.

Environment: MapReduce, Hive, Impala, Sqoop, Linux, Cloudera CDH 5, Scala, Pig, Spark, ZooKeeper, HBase, Tableau and SQL Server.


