
Hadoop/Spark Developer

Location:
Texas City, TX
Posted:
June 10, 2022

Resume:

Sai Teja Devarakonda

Hadoop / Spark Developer

Phone: 469-***-**** | Email: adrczq@r.postjobfree.com

Professional Summary:

Around 7+ years of experience with the Hadoop/Spark stack and strong experience with the programming languages Scala and Python.

Good experience in application development, primarily using Hadoop and Python, and in data analysis.

In-depth understanding of Spark Architecture and performed several batch and real-time data stream operations using Spark (Core, SQL, Streaming).

Experience working with StreamSets Data Collector; processed 4 million records through the pipelines I built as part of P2P Money Movement.

Experience with NoSQL databases including HBase, Cassandra.

Experienced in the integration of various data sources such as RDBMS, spreadsheets, and text files.

Set up standards and processes for Hadoop based application design and implementation.

Experience in designing, developing and implementing connectivity products that allow efficient exchange of data between our core database engine and the Hadoop ecosystem.

Experienced in handling large datasets using Spark in-memory capabilities, partitioning, broadcast variables, accumulators, and effective, efficient joins; used Scala to develop Spark applications (a sketch of the broadcast-join pattern follows).
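
A minimal PySpark sketch of the broadcast-join and accumulator pattern mentioned above; the paths, table names, and the merchant_id/merchant_name columns are illustrative assumptions, not code from the original projects.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical inputs: a large fact table and a small lookup table.
transactions = spark.read.parquet("hdfs:///data/transactions")
merchants = spark.read.parquet("hdfs:///data/merchants")

# Broadcasting the small side lets the join run map-side and avoids
# shuffling the large table.
joined = transactions.join(broadcast(merchants), on="merchant_id", how="left")

# Accumulator updated on the executors to count rows with no match.
unmatched = spark.sparkContext.accumulator(0)
joined.foreach(lambda row: unmatched.add(1) if row["merchant_name"] is None else None)
print("rows with no matching merchant:", unmatched.value)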

Tested and Optimized Spark applications.

Experience with different Hadoop distributions such as Cloudera, Hortonworks Data Platform (HDP), and Elastic MapReduce (EMR).

Hands-on experience in Azure development; worked on Azure Web Apps, App Services, Azure Storage, Azure SQL Database, Virtual Machines, Azure AD, Azure Search, and Notification Hubs.

Hands on knowledge in installing Hadoop cluster using different distributions of Apache Hadoop, Cloudera and Hortonworks.

Experience in creating Tableau dashboards using stacked bars, bar graphs, and geographical maps.

Good understanding of Mapper, Reducer and Driver class structures for Map-Reduce.

Hands-on work experience in writing applications on NoSQL databases like HBase and Cassandra.

Extensive experience with big data ETL and query tools such as Pig Latin and HiveQL.

Hands-on experience with big data ingestion tools like Flume and Sqoop.

Experience in managing and reviewing Hadoop Log files.

Good experience in performing data analytics and generating insights using Impala and Hive; working knowledge of Kubernetes.

Experience in installation, configuration, supporting and monitoring Hadoop clusters using Apache.

In-depth understanding of Hadoop architecture and its components such as HDFS, Job Tracker, Task Tracker, NameNode, DataNode, and MapReduce concepts.

Experience in using Pig, Hive, Sqoop, and Oracle VM.

Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice-versa.
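
Sqoop transfers like these are normally issued from the shell; the following is a minimal sketch of a typical incremental import wrapped in Python, where the JDBC URL, credentials file, table, and check column are placeholder assumptions.

import subprocess

def sqoop_incremental_import(last_value):
    # Append-mode incremental import keyed on an ever-increasing order_id.
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db-host:3306/sales",
        "--username", "etl_user",
        "--password-file", "/user/etl/.db_password",
        "--table", "orders",
        "--target-dir", "/data/raw/orders",
        "--incremental", "append",
        "--check-column", "order_id",
        "--last-value", str(last_value),
        "--num-mappers", "4",
    ]
    subprocess.run(cmd, check=True)

sqoop_incremental_import(last_value=1000000)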

Technical Skills

Big Data Ecosystem

Hadoop, Spark, MapReduce, YARN, Hive, Spark SQL, Pig, Sqoop, HBase, Flume, Oozie, Zookeeper, Avro, Parquet, Maven, Snappy, StreamSets SDC

Hadoop Architecture

HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Map Reduce

Hadoop Distributions

Cloudera, MapR, Hortonworks

NoSQL Databases

Cassandra, MongoDB, HBase

Languages

Python, Scala, SQL

Databases

SQL Server, MySQL, PostgreSQL, Oracle

ETL/BI

Talend, Tableau

Operating Systems

UNIX, Linux, Windows Variants

Work Experience:

Client: AT&T, Remote Sep 2021 to Present

Role: Data Engineer

Responsibilities:

Worked on the Hadoop ecosystem, building data pipelines that connect Hadoop to SQL Server and on to the end application.

Built Hadoop-to-Hadoop data pipelines and extensively used Sqoop for incremental imports to SQL Server.

Experience in developing a batch processing framework to ingest data into HDFS, Hive, and HBase.

Responsible for developing data pipeline using Flume, Sqoop, and PIG to extract the data from weblogs and store in HDFS.

Involved in moving all log files generated from various sources to HDFS for further processing through Flume.

Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.

Good experience writing GCP BigQuery queries and implementing the code using dbt.

Worked on dbt transformations, Terraform, and BigQuery as part of the GCP migration.

Extensively worked on PySpark; transformations were done using Spark SQL.

Performance-tuned Spark applications for memory usage, batch interval time, and parallelism.

Optimized existing Hadoop algorithms using Spark Context, Spark SQL, and DataFrames.

Experienced in handling large datasets (close to 800 billion records) using partitioning and Spark in-memory capabilities.

Highly experienced in debugging Spark SQL code to identify problems against the requirements and modifying code accordingly.

Implemented partitioning, dynamic partitioning, and bucketing in Hive (see the sketch below).
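
A sketch of the Hive dynamic-partitioning pattern, issued here through Spark SQL with Hive support; the table, columns, and the staging_transactions source are illustrative assumptions.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-dynamic-partition-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Allow Hive to create partitions on the fly from the data itself.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    CREATE TABLE IF NOT EXISTS txn_by_day (
        txn_id BIGINT,
        amount DOUBLE,
        customer_id STRING
    )
    PARTITIONED BY (txn_date STRING)
    STORED AS ORC
""")

# The dynamic partition column must come last in the SELECT list;
# each row is routed to its txn_date partition automatically.
spark.sql("""
    INSERT OVERWRITE TABLE txn_by_day PARTITION (txn_date)
    SELECT txn_id, amount, customer_id, txn_date
    FROM staging_transactions
""")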

Collaborated with Architects to design Spark model for the existing MapReduce model and migrated them to Spark models using Scala.

Experience with Agile and development tools like Jira, Bitbucket, and Splunk for continuous integration on the project.

Experience working on Linux servers for data management and checking process integrity of data flow through pipelines.

Developed Spark jobs for continuous integration of error records (read, write, and error counts) that pull logs from a Kafka topic into MySQL server tables; a sketch of this pattern follows.
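
A hedged sketch of that Kafka-to-MySQL counting job using Spark Structured Streaming with a foreachBatch JDBC sink; the broker, topic, schema, and connection details are placeholders, not the production values.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("error-record-counts").getOrCreate()

log_schema = StructType([
    StructField("pipeline", StringType()),
    StructField("read_count", LongType()),
    StructField("write_count", LongType()),
    StructField("error_count", LongType()),
])

raw = (spark.readStream
       .format("kafka")  # requires the spark-sql-kafka package on the classpath
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "pipeline-metrics")
       .load())

metrics = (raw.select(from_json(col("value").cast("string"), log_schema).alias("m"))
              .select("m.*"))

def write_to_mysql(batch_df, batch_id):
    # Each micro-batch is appended to a MySQL table over JDBC.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:mysql://db-host:3306/metrics")
        .option("dbtable", "pipeline_counts")
        .option("user", "etl_user")
        .option("password", "REDACTED")
        .mode("append")
        .save())

query = (metrics.writeStream
         .option("checkpointLocation", "hdfs:///checkpoints/pipeline-metrics")
         .foreachBatch(write_to_mysql)
         .start())
query.awaitTermination()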

Performed Sqoop incremental imports from different sources into HDFS and performed transformations using Hive and MapReduce.

Developed a Spark Streaming model that takes transactional data from multiple sources as input, creates micro-batches, and feeds them to an already trained fraud detection model while capturing error records.

Experienced working with Hadoop big data technologies (HDFS), Hadoop ecosystem components (HBase, Hive, Pig), and the NoSQL database MongoDB.

Proven skills in establishing strategic direction while remaining technically strong in design, implementation, and deployment; collected and translated business requirements into distributed architectures and robust, scalable designs.

Collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.

Environment: Hadoop YARN, AWS, EC2, Spark Core, Spark SQL, Scala, Python, Data Collector, GCP, Terraform, dbt, CLI, Kafka, Bitbucket, Jira, Scala SBT, Hive, Sqoop, Impala, Oracle, HBase, Oozie, Redshift, Jenkins, HDFS, Zookeeper, SQL, Shell Script, Cloudera, MapReduce, Avro, JSON, Linux, Git.

Client: Early Warning Services, Phoenix, AZ October 2020 to Sep 2021

Role: Data Engineer

Description: Early Warning Services provides Zelle and fraud detection tools for banks and other companies. Zelle is a widely used money movement service, and this project works with Zelle data generated by banks and applications.

Responsibilities:

Working with the StreamSets tool for continuous integration of data related to money movement.

Using StreamSets to develop pipelines that connect to Kafka topics and stream data to Oracle databases.

Experience working with AWS Kinesis, Cucumber, and Scala SBT for pipeline-related automation.

Automated the pipelines I developed using Cucumber BDD with Scala automation scripts.

Working on Zelle transaction data with a daily volume of up to 4 million records entering the pipelines I built using StreamSets.

Worked on Spark Streaming to stream data from Kafka to Kafka and, in parallel, developed a StreamSets pipeline doing the same in order to compare processing time and error record counts.

Developed Kafka topics for test data streaming and Spark jobs using Python and shell scripting; the Spark jobs are also used to pull counts of internally processed data for maintaining metrics.

Configured Zookeeper to coordinate the servers in clusters to maintain the data consistency.

Built pipelines with the help of Python and developed JSON files to support staging.

Developed various pipelines for Mastercard and Visa card data integration and continuous streaming of data to databases, and developed Kafka topics for one of the organizations using StreamSets Data Collector.

Developed pipeline for continuous integration of streaming data from Kafka to Oracle database with error count logs using Streamsets tool.

Involved in creating Hive tables using Impala, working on them with HiveQL, and performing data analysis using Hive and Pig.

Responsible for designing and implementing data pipelines using big data tools including Hive, Airflow, Spark, Drill, Sqoop, NiFi, EC2, S3, and EMR (an orchestration sketch follows).
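
One way such a pipeline could be orchestrated in Airflow; the DAG id, schedule, and commands below are illustrative assumptions rather than the project's actual configuration.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {"retries": 1, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="daily_transactions_ingest",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:

    # Pull the day's rows from the source database into HDFS.
    ingest = BashOperator(
        task_id="sqoop_import",
        bash_command="sqoop import --connect jdbc:oracle:thin:@db-host:1521/ORCL "
                     "--username etl_user --table TRANSACTIONS "
                     "--target-dir /data/raw/transactions/{{ ds }}",
    )

    # Transform with Spark, then load the results into Hive.
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit --master yarn transform_transactions.py --dt {{ ds }}",
    )

    load_hive = BashOperator(
        task_id="hive_load",
        bash_command="hive -f load_transactions.hql --hivevar dt={{ ds }}",
    )

    ingest >> transform >> load_hive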

Worked on Hive tables, with extensive experience creating Spark jobs for transforming data and writing HiveQL.

Developed Kafka producer code to publish test data into Kafka topics, which is consumed by pipelines for continuous streaming of data (see the producer sketch below).
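
A minimal sketch of such a test-data producer using the kafka-python client; the broker address, topic name, and record fields are placeholders.

import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a batch of synthetic transaction records to the test topic.
for i in range(1000):
    record = {"txn_id": i, "amount": round(10 + i * 0.25, 2), "ts": int(time.time() * 1000)}
    producer.send("test-transactions", value=record)

producer.flush()
producer.close()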

Created a Kafka -> Spark -> HDFS data pipeline along with the team.

Worked on CI/CD with DevOps tools like Docker, Kubernetes, and Jenkins.

Developed an automated job control system to start and stop StreamSets pipelines using the StreamSets CLI.

Performed analytical operations, Log analysis and data reporting on Redshift.

Worked on configuring multiple MapReduce Pipelines, for the new Hadoop Cluster.

Tested and Optimized Spark applications.

Extensively involved in developing Kafka producer and consumer code for test data streaming; experienced in managing Cloudera Manager logs, status, and configurations through the Cloudera UI.

Environment: Hadoop YARN, AWS, EC2, Spark Core, Spark SQL, Scala, Python, StreamSets, StreamSets Data Collector, StreamSets CLI, Kafka, Bitbucket, Jira, Scala SBT, Hive, AWS S3, Sqoop, Impala, Oracle, HBase, Oozie, Redshift, Jenkins, HDFS, Zookeeper, Windows, SQL, Shell Script, Cloudera, MapReduce, Avro, JSON, Linux, Git.

Client: TJX Companies, Boston, MA Sep 2019 to Sep 2020

Role: Hadoop / Spark Developer

Description: The TJX Companies is a multinational off-price department store corporation delivering great value to customers through the combination of brand, fashion, price, and quality. The purpose of the project is to store terabytes of information generated by online sales and by various stores' sales data in a big data lake. The data is stored in the Apache Hadoop file system and processed using Spark.

Responsibilities:

Responsible for scalable, distributed infrastructure for model development and code promotion using Hadoop.

Developed Spark scripts by using Python and Shell scripting commands as per the requirement.

Used Spark API over Cloudera/Hadoop/YARN to perform analytics on data in Hive and MongoDB.

Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDDs/MapReduce in Spark for data aggregation and queries, and for writing data back into the OLTP system through Sqoop.

Performance-tuned Spark applications by setting the right batch interval time, the correct level of parallelism, and appropriate memory settings.

Used Spark for series of dependent jobs and for iterative algorithms. Developed a data pipeline using Kafka and Spark Streaming to store data into HDFS.

Developed MapReduce jobs in Java to convert data files into Parquet file format.

Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.

Developed workflow in Oozie to automate the tasks of loading the data into HDFS.

Developed a Spark Streaming pipeline in Python to parse JSON data and store it in Hive tables.

Developed Hive queries to process the data and generate the data cubes for visualizing.

Implemented schema extraction for Parquet and Avro file Formats in Hive/MongoDB.

Experienced working on the AWS cloud using EMR; performed operations on AWS using EC2 instances and S3 storage, and performed RDS, Lambda, and analytical Redshift operations.

Worked with Talend open studio for designing ETL Jobs for Processing of data.

Used Zookeeper to manage Hadoop clusters and Oozie to schedule job workflows.

Worked with continuous Integration of application using Jenkins.

Developed ETL applications using Hive, Spark, Impala, and Sqoop, with automation using Oozie.

Used Reporting tools to connect with Hive for generating daily reports of data.

Wrote HiveQL as per the requirements, processed data in the Spark engine, and stored results in Hive tables.

Imported existing datasets from Oracle into the Hadoop system using Sqoop.

Wrote Spark Core programs for processing and cleansing data, then loaded that data into Hive or HBase for further processing.

Used partitioning and bucketing techniques in Hive to improve performance; involved in choosing columnar file formats like ORC and Parquet over the plain text format.

Responsible for importing data from Postgres to HDFS, HIVE, MongoDB, HBASE using SQOOP tool.

Experienced in migrating HiveQL to Impala to minimize query response time.

Worked on Sequence files, RC files, Map side joins, bucketing, partitioning for Hive performance enhancement and storage improvement.

Environment: Hadoop YARN, AWS, EC2, Spark Core, Spark SQL, Scala, Python, Hive, AWS S3, RDS, Lambda, Redshift, MongoDB, Sqoop, Impala, Oracle, HBase, Oozie, Jenkins, HDFS, Zookeeper, Autosys, Windows, SQL, OLTP, Shell Script, Cloudera, MapReduce, Parquet, Avro, Linux, Git.

Client: MassMutual, Boston, MA Jan 2018 – Aug 2019

Role: Spark/ Hadoop Developer

Description: MassMutual is a life insurance company offering life, disability income, and long-term care insurance; annuities; mutual funds; individual retirement accounts; and income protection services. The aim of this project is to analyze the effectiveness of various plans offered by MassMutual.

Responsibilities:

Developed Spark applications using Scala.

Used Data frames/ Datasets to write SQL type queries using Spark SQL to work with datasets.

Performed real-time streaming jobs using Spark Streaming to analyze incoming data from Kafka over regular window time intervals (a windowed-aggregation sketch follows).
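
A hedged sketch of a windowed aggregation over a Kafka source, written with Structured Streaming as one way to express the pattern; the topic, schema, and 5-minute window are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("windowed-plan-metrics").getOrCreate()

schema = StructType([
    StructField("plan_id", StringType()),
    StructField("premium", DoubleType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
          .format("kafka")  # requires the spark-sql-kafka package
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "policy-events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Sum premiums per plan over 5-minute event-time windows,
# tolerating data that arrives up to 10 minutes late.
per_window = (events
              .withWatermark("event_time", "10 minutes")
              .groupBy(window(col("event_time"), "5 minutes"), col("plan_id"))
              .sum("premium"))

query = (per_window.writeStream
         .outputMode("update")
         .format("console")
         .option("truncate", "false")
         .start())
query.awaitTermination()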

Created Hive tables and had extensive experience with HiveQL.

Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.

Extended Hive functionality by writing custom UDFs, UDAFs, UDTFs to process large data.

Performed Hive UPSERTS, partitioning, bucketing, windowing operations, efficient queries for faster data operations.

Involved in moving data from HDFS to AWS Simple Storage Service (S3) and extensively worked with S3 bucket in AWS.

Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.

Responsible for bulk loading large amounts of data into HBase using MapReduce by directly creating HFiles and loading them.

Developed a Spark application to filter JSON source data in an AWS S3 location and store it in HDFS with partitions, and used Spark to extract the schema of the JSON files (see the sketch below).
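
A hedged sketch of that S3-to-HDFS job; the bucket, paths, status filter, and submitted_at column are illustrative assumptions, and Spark infers the JSON schema on read.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("s3-json-filter-sketch").getOrCreate()

# Schema is inferred from the JSON files themselves (requires hadoop-aws for s3a).
raw = spark.read.json("s3a://claims-landing/2019/*/*.json")
print(raw.schema.simpleString())  # the extracted schema

valid = (raw
         .filter(col("status") == "APPROVED")
         .withColumn("claim_date", to_date(col("submitted_at"))))

# Write to HDFS partitioned by the derived date column.
(valid.write
    .mode("overwrite")
    .partitionBy("claim_date")
    .parquet("hdfs:///data/claims/approved"))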

Imported and exported data between relational database systems and HDFS/Hive using Sqoop.

Wrote custom Kafka consumer code and modified existing producer code in Python to push data to Spark Streaming jobs.

Scheduled jobs and automated workflows using Oozie.

Automated the movement of data using NIFI dataflow framework and performed streaming and batch processing via micro batches. Controlled and monitored data flow using web UI.

Worked with the HBase database to perform operations on large sets of structured, semi-structured, and unstructured data coming from different data sources.

Exported analytical results to MS SQL Server and used Tableau to generate reports and visualization dashboards.

Environment: AWS, S3, Cloudera, Spark, Spark SQL, HDFS, HiveQL, Hive, Zookeeper, Hadoop, Python, Scala, Kafka, Sqoop, MapReduce, Oozie, Tableau, MS SQL Server, HBase, Agile, Eclipse.

Client: Boston Scientific, Boston, MA June 2016 – Dec 2017

Role: Hadoop/Python Developer

Responsibilities:

Created Hive tables, loaded data, executed HQL queries and developed MapReduce programs to perform analytical operations on data and to generate reports.

Involved in converting Hive queries into Spark transformations using Python and Scala.

Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.

Created Hive internal and external tables and used MySQL to store table schemas; wrote custom UDFs in Python (a sketch follows).
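
A minimal sketch of one common way to write a Python UDF and use it in SQL over Hive tables, here registered through Spark SQL; the function, the web_logs table, and the device_id column are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("python-udf-sketch")
         .enableHiveSupport()
         .getOrCreate())

def normalize_device_id(raw_id):
    # Strip separators and upper-case the id, e.g. "ab-12:cd" -> "AB12CD".
    return "".join(ch for ch in (raw_id or "") if ch.isalnum()).upper()

spark.udf.register("normalize_device_id", normalize_device_id, StringType())

spark.sql("""
    SELECT normalize_device_id(device_id) AS device_id, COUNT(*) AS events
    FROM web_logs
    GROUP BY normalize_device_id(device_id)
""").show()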

Moved data between MySQL and HDFS using Sqoop.

Used NoSQL databases, including HBase and MongoDB.

Developed Spark jobs and Hive Jobs to summarize and transform data.

Developed MapReduce jobs in Python for log analysis, analytics, and data cleaning (a Hadoop Streaming sketch follows).
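
A hedged sketch of the Hadoop Streaming style used for such jobs: a mapper and reducer in one Python file that count HTTP status codes, assuming a common/combined web log format (the field position is an assumption).

import sys

def mapper():
    # Emit "<status_code>\t1" for every request line read from stdin.
    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 8:
            print(f"{fields[8]}\t1")

def reducer():
    # Hadoop Streaming delivers mapper output sorted by key,
    # so counts can be accumulated in a single pass.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key == current_key:
            count += int(value)
            continue
        if current_key is not None:
            print(f"{current_key}\t{count}")
        current_key, count = key, int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

if __name__ == "__main__":
    # Run with "map" as the mapper and "reduce" as the reducer, e.g.
    # hadoop-streaming -mapper "python logcount.py map" -reducer "python logcount.py reduce".
    mapper() if sys.argv[1:] == ["map"] else reducer()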

Wrote complex MapReduce programs to perform operations by extracting, transforming, and aggregating to process terabytes of data.

Designed E-R diagrams to work with different tables.

Used Spark for interactive queries, processing of streaming data and integration with HBase database for huge volume of data.

Wrote many SQL queries, stored procedures, triggers, and views on top of Oracle.

Involved in the implementation of the Software development life cycle (SDLC) that includes Development, Testing, Implementation, and Maintenance Support.

Environment: Python, Hive, Sqoop, MapReduce, SQL, MySQL, NOSQL-MongoDB, Oracle, HQL, Scala, HDFS, SDLC.

Client: Code Master Technology, India Jan 2013 to Nov 2015

Role: Hadoop Developer

Responsibilities:

Utilized Agile and Scrum Methodology to help manage and organize a team of developers with regular code review sessions.

Collected logs injected into MongoDB using Storm and Kafka; partitioned the collected logs by date/timestamp and host name, and developed programs to extract the required data from the logs.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.

Deployed the application on the Hadoop cluster in cluster mode using spark-submit scripts.

Queried and analyzed data from Cassandra for quick searching, sorting and grouping through CQL.
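
A minimal sketch of querying Cassandra with CQL from Python via the DataStax driver; the contact points, keyspace, table, and columns are placeholders.

from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-node1", "cassandra-node2"])
session = cluster.connect("analytics")

# Simple CQL read; ALLOW FILTERING is used here only because the example
# filters on a non-key column.
rows = session.execute(
    "SELECT customer_id, region, total_spend "
    "FROM customer_summary WHERE region = %s ALLOW FILTERING",
    ("SOUTH",),
)
for row in rows:
    print(row.customer_id, row.total_spend)

cluster.shutdown()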

Worked on a large-scale Hadoop YARN cluster for distributed data processing and analysis using Spark and Hive; wrote Pig scripts to process the data; worked with the Python plugin for MySQL and frequently imported data from MySQL to HDFS using Sqoop.

Created Hive tables, loaded the structured data resulting from jobs into the tables, and wrote Hive queries to further analyze the logs to identify issues and behavioral patterns.

Created scripts for importing data into HDFS/Hive using Sqoop from MySQL/Oracle.

Extracted data from Teradata into HDFS using Sqoop. Analyzed the data by running Hive queries and Pig scripts to understand customer behavior such as health conditions and travel patterns, and exported the analyzed patterns back into Teradata using Sqoop.

Managed the Oozie workflow engine to run multiple Hive jobs, and continuously monitored and managed the Hadoop cluster through Cloudera Manager.

Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the transformed data back into HDFS.

Developed Hive queries to process the data and generate the data cubes for visualizing.

Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.

Extensive knowledge of Pig scripts using bags and tuples, and of Pig UDFs to pre-process the data for analysis.

Implemented usage of Amazon EMR for processing big data across a Hadoop cluster of virtual servers.

Environment: Cloudera Manager, HDFS, Hive, Pig, MongoDB, Cassandra, Scala, Oracle, Oozie, Hadoop, Spark, CQL, Sqoop, Kafka, Storm, Flume, Tableau, SQL, MySQL, Shell Scripting.


