Sr. Hadoop/Spark Developer

Location: Columbus, OH, 43232
Posted: July 12, 2022

Name: Chaitanya Maloth

Phone: +1-703-***-**** Email: adrppn@r.postjobfree.com

BIG DATA ENGINEER

Professional Summary:

7+ years of technical expertise across the complete software development life cycle (SDLC), including 6 years of data engineering experience with the Hadoop and Big Data stack.

Hands-on experience working with Spark and the Hadoop ecosystem, including MapReduce, Sqoop, Hive, Pig, Flume, Kafka, Zookeeper, and NoSQL databases like HBase.

Excellent knowledge and understanding of Distributed Computing and Parallel processing frameworks.

Strong experience with developing end-to-end Spark applications in Scala.

Worked extensively on troubleshooting issues related to memory management and resource management within Spark applications.

Strong knowledge of fine-tuning Spark applications and Hive scripts.

Wrote complex MapReduce jobs to perform various data transformations on large-scale datasets.

Experience in installing, configuring, and monitoring Hadoop clusters both in-house and in the cloud (AWS).

Good experience working with AWS cloud services like S3, EMR, Redshift, Athena, and Glue Metastore.

Extended Hive core functionality by writing custom UDFs for data analysis.

Handled importing data from various data sources, performed transformations, and developed and debugged MR2 jobs to process large datasets.

Experience in writing queries in HQL (Hive Query Language) to perform data analysis.

Created Hive External and Managed Tables.

Implemented partitioning and bucketing on Hive tables for Hive query optimization.

Experienced in writing Oozie workflows and coordinator jobs to schedule sequential Hadoop jobs.

Experience in using Apache Flume for collecting, aggregating, and moving large amounts of data from application servers.

Good experience utilizing Sqoop extensively for ingesting data from relational databases.

Good knowledge of Kafka for streaming real-time feeds from external REST applications to Kafka topics.

Worked on building real time data workflows using Kafka, Spark Streaming and HBase.

Good understanding of Relational Databases like MySQL, Postgres, Oracle, and Teradata.

Experienced in using GIT, SVN.

Proficient with build tools like Apache Maven and SBT.

Technical Skills

Big Data Ecosystem

HDFS, MapReduce, Pig, Hive, Spark 2.x/1.x, YARN, Kafka 2.10, Flume, Sqoop, Impala, Oozie, Zookeeper, Ambari

Cloud Environment

AWS, Google Cloud

Hadoop Distributions

Cloudera CDH 6.1/5.12/5., Hortonworks, MapR

ETL

Talend

Languages

Python, Shell Scripting, Scala.

NoSQL Databases

MongoDB, HBase, DynamoDB.

Development / Build Tools

Eclipse, Git, IntelliJ and log4J.

RDBMS

Oracle 10g, 11i, MS SQL Server, DB2

Testing

MRUnit Testing, Quality Center (QC)

Virtualization

VMWare, AWS/EC2, Google Compute Engine.

Build Tools

Maven, Ant, SBT.

EDUCATION:

Bachelor’s in Computer Science Engineering, JNTUH, India. May 2013

PROFESSIONAL EXPERIENCE:

Client: CVS, Woonsocket, RI Jan 2020 – Present

Sr. Hadoop/Spark Developer

Responsibilities:

Developed custom input adapters for ingesting clickstream data from external sources like FTP servers into S3-backed data lakes on a daily basis.

Created various Spark applications in Scala to perform a series of enrichments on this clickstream data, combining it with enterprise data about the users.
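
A minimal sketch of this kind of Scala enrichment job; the paths, table names, and columns (s3://datalake/clickstream/, enterprise.users, user_id, and so on) are illustrative assumptions, not client specifics:

```scala
// Hypothetical clickstream enrichment job; all names and paths are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ClickstreamEnrichment {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ClickstreamEnrichment")
      .enableHiveSupport()
      .getOrCreate()

    // Daily clickstream files landed in the S3-backed data lake
    val clicks = spark.read.json("s3://datalake/clickstream/dt=2020-01-15/")

    // Enterprise user data maintained in Hive
    val users = spark.table("enterprise.users")

    // Enrich click events with user attributes and keep active accounts only
    val enriched = clicks
      .join(users, Seq("user_id"), "left")
      .withColumn("event_date", to_date(col("event_ts")))
      .filter(col("account_status") === "active")

    enriched.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("s3://datalake/clickstream_enriched/")

    spark.stop()
  }
}
```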

Implemented batch processing of jobs using Spark Scala API.

Developed Sqoop scripts to import/export data from Teradata to HDFS and into Hive tables.

Optimized Hive tables using techniques like partitioning and bucketing to provide better performance for Hive QL queries.
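
As an illustration of the partitioning and bucketing approach, a sketch of the kind of Hive DDL involved, issued through Spark SQL; the database, columns, bucket count, and location are assumptions:

```scala
// Illustrative partitioned + bucketed external Hive table; names and counts are assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HiveTableSetup").enableHiveSupport().getOrCreate()

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS sales.transactions (
    txn_id   STRING,
    store_id STRING,
    amount   DOUBLE
  )
  PARTITIONED BY (txn_date STRING)
  CLUSTERED BY (store_id) INTO 32 BUCKETS
  STORED AS PARQUET
  LOCATION 's3://datalake/sales/transactions/'
""")
```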

Worked with multiple file formats like Avro, Parquet, and ORC.

Converted existing MapReduce programs to Spark applications for handling semi-structured data like JSON files, Apache log files, and other custom log data.

Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
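
A minimal sketch of such a producer; the broker list, endpoint URL, and topic name are assumed placeholders:

```scala
// Hypothetical Kafka producer pushing REST API payloads to a topic; all names are placeholders.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ClickEventProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092,broker2:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Poll the external REST endpoint and forward the JSON payload to Kafka
    val payload = scala.io.Source.fromURL("https://api.example.com/events").mkString
    producer.send(new ProducerRecord[String, String]("click-events", payload))

    producer.flush()
    producer.close()
  }
}
```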

Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase.
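
A sketch of the consumer side, assuming the spark-streaming-kafka-0-10 integration and the HBase client API; the topic, table, and column-family names are illustrative:

```scala
// Hypothetical Kafka -> Spark Streaming -> HBase path; all names are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object KafkaToHBase {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaToHBase"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "clickstream-consumers",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("click-events"), kafkaParams))

    // Write each micro-batch to HBase; one connection per partition
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("click_events"))
        records.foreach { rec =>
          val rowKey = Option(rec.key()).getOrElse(java.util.UUID.randomUUID().toString)
          val put = new Put(Bytes.toBytes(rowKey))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("payload"), Bytes.toBytes(rec.value()))
          table.put(put)
        }
        table.close()
        conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```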

Experienced in handling large datasets using Spark's in-memory capabilities, broadcast variables, efficient joins, transformations, and other features.
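
For example, a broadcast join sketch; it assumes an existing SparkSession named spark, and the table names and output path are placeholders:

```scala
// Broadcasting the small dimension table avoids shuffling the large fact table.
import org.apache.spark.sql.functions.broadcast

val events = spark.table("dw.click_events_enriched")   // large fact table (assumed)
val stores = spark.table("dw.store_dim")               // small dimension table (assumed)

val joined = events.join(broadcast(stores), Seq("store_id"))
joined.write.mode("overwrite").parquet("s3://datalake/marts/events_by_store/")
```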

Worked extensively with Sqoop for importing data from Teradata.

Implemented business logic in Hive and wrote UDFs to process the data for analysis.
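
A small sketch of a Hive UDF written against the classic org.apache.hadoop.hive.ql.exec.UDF API; the masking logic and function name are illustrative only:

```scala
// Hypothetical Hive UDF (requires the hive-exec dependency); logic is illustrative.
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

class MaskEmail extends UDF {
  // Masks the local part of an email address, e.g. "john@x.com" -> "j***@x.com"
  def evaluate(input: Text): Text = {
    if (input == null) return null
    val s = input.toString
    val at = s.indexOf('@')
    if (at <= 1) new Text(s)
    else new Text(s.substring(0, 1) + "***" + s.substring(at))
  }
}

// In Hive, after ADD JAR:
//   CREATE TEMPORARY FUNCTION mask_email AS 'MaskEmail';
//   SELECT mask_email(email) FROM customers;
```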

Utilized AWS services like S3, EMR, Redshift, Athena, and Glue Metastore for building and managing data pipelines in the cloud.

Automated EMR Cluster creation and termination using AWS Java SDK.
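
A sketch of this automation using the AWS SDK for Java (v1) called from Scala; the region, roles, release label, log location, and instance types shown are assumptions:

```scala
// Hypothetical EMR lifecycle automation; configuration values are placeholders.
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
import com.amazonaws.services.elasticmapreduce.model._

object EmrLifecycle {
  private val emr = AmazonElasticMapReduceClientBuilder.standard()
    .withRegion("us-east-1")
    .build()

  // Launch a small Spark/Hive cluster and return its cluster id
  def createCluster(): String = {
    val request = new RunJobFlowRequest()
      .withName("nightly-etl")
      .withReleaseLabel("emr-5.29.0")
      .withApplications(new Application().withName("Spark"), new Application().withName("Hive"))
      .withServiceRole("EMR_DefaultRole")
      .withJobFlowRole("EMR_EC2_DefaultRole")
      .withLogUri("s3://datalake/emr-logs/")
      .withInstances(new JobFlowInstancesConfig()
        .withInstanceCount(3)
        .withMasterInstanceType("m5.xlarge")
        .withSlaveInstanceType("m5.xlarge")
        .withKeepJobFlowAliveWhenNoSteps(true))
    emr.runJobFlow(request).getJobFlowId
  }

  // Terminate the cluster once the pipeline finishes
  def terminateCluster(clusterId: String): Unit =
    emr.terminateJobFlows(new TerminateJobFlowsRequest().withJobFlowIds(clusterId))
}
```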

Loaded the processed data to Redshift clusters using Spark-Redshift integration.
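
A sketch of the Redshift load step, assuming the open-source spark-redshift connector and an enriched DataFrame from an earlier step; the JDBC URL, IAM role, table, and temp directory are placeholders:

```scala
// Hypothetical Redshift load via the spark-redshift connector; all option values are placeholders.
enriched.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/analytics?user=USER&password=PASSWORD")
  .option("dbtable", "analytics.click_events")
  .option("tempdir", "s3://datalake/tmp/redshift/")   // S3 staging area used by the COPY step
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-copy")
  .mode("append")
  .save()
```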

Created views within Athena to allow the downstream reporting and data analysis teams to query and analyze the results.

Environment: AWS Services (S3, EMR, Redshift, Athena, Glue Metastore), Spark, Hive, Teradata, Scala, Python.

Client: TARGET, Minneapolis, MN Sep 2018 – Dec 2019

Sr. Hadoop Developer

Responsibilities:

Developed Spark applications using Scala utilizing Data frames and Spark SQL API for faster processing of data.

Developed highly optimized Spark applications to perform various data cleansing, validation, transformation, and summarization activities according to the requirements.

Built a data pipeline consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.

Developed Spark jobs and Hive Jobs to summarize and transform data.

Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for high data volumes.

Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
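
An illustrative example of this kind of conversion; the HiveQL shown, the table, and the columns are assumptions, and an existing SparkSession named spark is assumed:

```scala
// HiveQL version (illustrative):
//   SELECT store_id, SUM(amount) AS total_sales
//   FROM sales.transactions
//   WHERE txn_date = '2019-06-01'
//   GROUP BY store_id;

// Equivalent DataFrame transformations in Scala
import org.apache.spark.sql.functions._

val dailySales = spark.table("sales.transactions")
  .filter(col("txn_date") === "2019-06-01")
  .groupBy("store_id")
  .agg(sum("amount").alias("total_sales"))
```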

Analyzed the SQL scripts and designed the solution to implement using Scala.

Built real-time data pipelines by developing Kafka producers and Spark Streaming consumer applications.

Ingested syslog messages, parsed them, and streamed the data to Kafka.

Handled importing data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and loaded the transformed data back into HDFS.

Exported the analyzed data to relational databases using Sqoop so the BI team could visualize it and generate reports.

Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.

Analyzed the data by performing Hive queries (Hive QL) to study customer behavior.

Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.

Developed Hive scripts in Hive QL to de-normalize and aggregate the data.

Scheduled and executed workflows in Oozie to run various jobs.

Environment: Hadoop, HDFS, HBase, Spark, Scala, Hive, MapReduce, Sqoop, ETL, Java

Client: AIG, Jersey City, NJ Jan 2017 – Aug 2018

Role: Data Engineer

Responsibilities:

Developed and maintained a Data Lake containing regulatory data for federal reporting, using big data technologies such as the Hadoop Distributed File System (HDFS), Apache Impala, Apache Hive, and the Cloudera distribution.

Developed ETL jobs to extract data from different data sources like Oracle and Microsoft SQL Server, transform the extracted data using Hive Query Language (HQL), and load it into the Hadoop Distributed File System (HDFS).

Involved in importing data from different sources into HDFS using Sqoop, applying transformations using Hive and Spark, and then loading the data into Hive tables.
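
A minimal sketch of this load path in Scala, assuming the data was already landed on HDFS by a prior Sqoop import; the paths, column names, and target table are placeholders:

```scala
// Hypothetical HDFS-to-Hive load step; paths, schema, and table names are assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RegulatoryLoad")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Read the Sqoop-imported extract (assumed to have exactly these four columns)
val raw = spark.read
  .option("header", "false")
  .csv("hdfs:///data/landing/policies/")
  .toDF("policy_id", "holder_id", "premium", "effective_date")

// Basic cleansing, then load into the target Hive table
raw.filter($"premium".isNotNull)
  .write
  .mode("overwrite")
  .saveAsTable("regulatory.policies")
```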

Fixed data-related issues within the Data Lake.

Primarily involved in the data migration process to AWS, integrating with a GitHub repository and Jenkins.

Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment, working with both traditional and non-traditional source systems as well as RDBMS and NoSQL data stores for data access and analysis.

Leveraged Spark's in-memory computing capabilities to perform procedures such as text analysis and processing using Scala.

Primarily responsible for designing, implementing, testing, and maintaining database solutions on AWS.

Worked with Spark Streaming, dividing incoming data into batches for processing through the Spark engine.

Implemented new functionality in the Data Lake using big data technologies such as HDFS, Apache Impala, and Apache Hive based on the requirements provided by the client.

Communicated regularly with the business teams and the project manager to ensure that any gaps between the client's requirements and the project's technical requirements were resolved.

Developed Python scripts using HDFS APIs to generate curl commands to migrate data and to prepare different environments within the project.

Coordinated production releases with the change management team using the Remedy tool.

Communicated effectively with team members and conducted code reviews.

Environment: Hadoop, Data Lake, AWS, Python, Spark, Hive, Cassandra, ETL Informatica, Cloudera, Oracle 10g, Microsoft SQL Server, Control-M, Linux

Azine Technologies, India Sep 2014 - Nov 2016

Java Developer

Responsibilities:

Involved in various SDLC phases like requirements gathering, design, analysis, and code development.

Developed applications using Java, Spring Boot, and JDBC.

Worked on various use cases in development using Struts and tested the functionalities.

Involved in preparing the high-level and detailed design of the system using J2EE.

Created Struts form beans, action classes, and JSPs following Struts framework standards.

Involved in the development of model, library, Struts, and form classes (MVC).

Used the Display Tag library for decoration and display tables for reports and grid designs.

Designed and developed file upload and file download features using JDBC with Oracle Blob.

Worked on Core Java, using file operations to read system files (downloads) and present them on JSPs.

Involved in the development of the underwriting process, which involved communication with outside systems using IBM MQ and JMS.

Used PL/SQL stored procedures for applications that needed to execute as part of a scheduling mechanism.

Designed and developed Application based on Struts Framework using MVC design pattern.

Developed Struts Action classes using Struts controller component.

Developed SOAP based XML web services.

Used SAX XML API to parse the XML and populate the values for a bean.

Used Jasper to generate rich content reports. Developed XML applications using XSLT transformations.

Created XML documents using the StAX XML API to pass the XML structure to web services.

Used Apache Ant for the entire build process.

Used Rational Clear Case for version control and JUnit for unit testing.

Used the Quartz scheduler to process or trigger the applications daily.

Configured WebSphere Application server and deployed the web components.

Provided troubleshooting and error handling support in multiple projects.

Deployed applications and patches in all environments and provided production support.

Environment: Adobe Flex, Struts, Spring, JMS, IBM MQ, XML, SOAP, JDBC, JavaScript, Oracle 9i, IBM WebSphere 6.0, ClearCase, Log4J, ANT, JUnit, IBM RAD, and Apache Tomcat.


