Surya R
Sr. Hadoop / Spark Developer
******************@*****.***
PROFESSIONAL SUMMARY:
•Hadoop / Spark developer with 8+ years of experience in software development, including 4 years of hands-on proficiency in Big Data and Hadoop ecosystem components.
•In-depth experience and strong working knowledge of HDFS, MapReduce, Hive, Pig, Sqoop, Flume, YARN/MRv2, Spark, Kafka, Impala, HBase, and Oozie.
•Extensive experience with various Hadoop distributions, including enterprise versions of Cloudera (CDH4/CDH5) and Hortonworks, with good knowledge of IBM BigInsights and Amazon EMR (Elastic MapReduce).
•Currently working extensively on the Spark and Spark Streaming frameworks, using Scala as the main programming language.
•Used Spark DataFrames, Spark SQL, and the Spark RDD API for various data transformations and dataset building.
•Exposure to data lake implementation using Apache Spark; developed data pipelines and applied business logic with Spark.
•Worked extensively with Spark Streaming and Apache Kafka to ingest live streaming data.
•Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark (see the sketch following this summary).
•Designed and implemented complete end-to-end Hadoop infrastructure including Pig, Hive, Sqoop, Oozie, Flume, and ZooKeeper.
•Experience developing data pipelines using Pig, Sqoop, and Flume to extract data from weblogs and store it in HDFS. Accomplished in developing Pig Latin scripts and using Hive Query Language for data analytics.
•Solid fundamental knowledge of distributed computing and distributed storage concepts for highly scalable data engineering.
•Extended Hive and Pig core functionality with custom User Defined Functions (UDF), User Defined Table-Generating Functions (UDTF), and User Defined Aggregate Functions (UDAF).
•Created multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats, including Parquet, Avro, XML, JSON, and CSV, with compression codecs such as Gzip, Snappy, and LZO.
•Expertise in designing column families in Cassandra and writing CQL queries to analyze data in Cassandra tables.
•Experience using the DataStax Spark-Cassandra Connector to read data from Cassandra tables and process it with Apache Spark.
•Strong expertise in troubleshooting and performance tuning of Spark, MapReduce, and Hive applications.
•Experience optimizing MapReduce jobs using mappers, reducers, combiners, and partitioners to deliver the best results on massive datasets.
•Experience loading streaming data into HDFS and performing streaming analytics using Flume and the Apache Kafka messaging system.
•Worked extensively with clickstream data to build visitor behavior patterns and enable the data science team to run predictive models.
•Working knowledge of Amazon Elastic Compute Cloud (EC2) for computational tasks and Simple Storage Service (S3) as a storage mechanism.
•Worked on NoSQL data stores, primarily HBase, using the HBase Java API and Hive integration.
•Extensively handled data migrations from diverse databases into HDFS and Hive using Sqoop.
•Implemented dynamic partitioning and bucketing in Hive for efficient data access.
•Expertise in installing, configuring, supporting, and managing Hadoop clusters using Apache, Cloudera (CDH3, CDH4), and Hortonworks distributions, as well as on Amazon Web Services (AWS).
•Strong expertise in Unix shell scripting.
•Exposure to Mesos and ZooKeeper cluster environments for application deployments and Docker containers.
•Knowledge of Enterprise Data Warehouse (EDW) architecture and data modeling concepts such as star schema and snowflake schema, as well as Teradata.
•Highly proficient in Scala programming.
•Worked with data warehousing, ETL, and reporting tools such as Informatica, Pentaho, and Tableau.
•Extensive experience developing applications that perform data processing tasks using Teradata, Oracle, SQL Server, and MySQL databases.
•Experience understanding Hadoop security requirements and integrating with Kerberos authentication and authorization infrastructure.
•Experience in software design, development, and implementation of client/server web-based applications using jQuery, JavaScript, JavaBeans, Struts, PL/SQL, SQL, HTML, CSS, PHP, XML, and AJAX.
•Experience in database design, data analysis, SQL programming, PL/SQL stored procedures, and triggers in Oracle and SQL Server.
•Familiar with Agile and Waterfall methodologies. Handled numerous client-facing meetings with strong communication skills.
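As a simple illustration of the Hive-to-Spark conversion work summarized above, the sketch below expresses a HiveQL aggregation both through Spark SQL and through the DataFrame API. The table name weblogs, its columns, and the output path are hypothetical placeholders, not taken from any actual project.

// Minimal sketch: a Hive/SQL aggregation run through Spark SQL and re-expressed
// with the DataFrame API. Table, columns, and paths are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HiveToSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveToSparkSketch")
      .enableHiveSupport()          // read existing Hive tables through the metastore
      .getOrCreate()

    // Original HiveQL, run unchanged through Spark SQL.
    val bySql = spark.sql(
      """SELECT region, COUNT(*) AS visits
        |FROM weblogs
        |WHERE event_date >= '2018-01-01'
        |GROUP BY region""".stripMargin)
    bySql.show(10)                   // sanity-check the SQL version

    // The same logic expressed as DataFrame transformations.
    val byApi = spark.table("weblogs")
      .filter(col("event_date") >= "2018-01-01")
      .groupBy("region")
      .agg(count(lit(1)).as("visits"))

    byApi.write.mode("overwrite").parquet("/data/curated/visits_by_region")
    spark.stop()
  }
}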
TECHNICAL COMPETENCIES:
Hadoop Ecosystem: HDFS, MapReduce, Pig, Hive, Sqoop, Flume, YARN, Oozie, ZooKeeper, Impala, Ambari, Spark, Spark SQL, Spark Streaming, Apache Kafka, Storm
Languages: C, C++, Java, Scala, Python, C#, SQL, PL/SQL, Pig Latin, HiveQL
Frameworks: J2EE, Spring, Hibernate, AngularJS
Web Technologies: HTML, CSS, JavaScript, jQuery, AJAX, XML, SOAP, REST API, ASP.NET
NoSQL: HBase, Cassandra, MongoDB, DynamoDB
Cluster Management and Monitoring: Cloudera Manager, Hortonworks Ambari, Apache Mesos
Cloud Platforms: Amazon Web Services (AWS), Microsoft Azure
Relational Databases: Oracle 11g, MySQL, SQL Server, Teradata
Build Tools: ANT, Maven, SBT, Jenkins
Application Servers: Tomcat 6.0, WebSphere 7.0
Version Control: GitHub, Bitbucket, SVN
Security: Kerberos, OAuth
Development Methodologies: Agile, Scrum, Waterfall
PROFESSIONAL EXPERIENCE:
Sr. Hadoop / Spark Developer Mar 2017 – Present
Oncor Electric – Dallas, TX
Description: Oncor is a regulated electric transmission and distribution service provider that serves east, west, and north-central Texas. Oncor is committed to delivering reliable electricity, providing award-winning service, reducing its environmental impact, and promoting healthy lifestyles. Oncor Electric is working in partnership with Siemens Power Technologies to build a smart grid as part of an exploration of the use of micro-grids.
Roles & Responsibilities:
•Worked on Big Data infrastructure for batch and real-time processing using Apache Spark. Built a scalable, distributed Hadoop cluster running Hortonworks Data Platform (HDP 2.6).
•Responsible for design and development of Spark SQL Scripts based on Functional Specifications.
•Worked on the large-scale Hadoop YARN cluster for distributed data processing and analysis using Spark, Hive, and MongoDB.
•Involved in creating a data lake by extracting customer data from various sources into HDFS, including Excel files, databases, and server log data.
•Developed Apache Spark applications using Scala for data processing from various streaming sources.
•Processed web server logs by developing multi-hop Flume agents using the Avro sink and loaded the data into MongoDB for further analysis.
•Extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them using Hive.
•Wrote MapReduce programs to convert text files into Avro format and load them into Hive tables.
•Developed a NiFi workflow to pick up data from the data lake as well as from an SFTP server and send it to the Kafka broker.
•Involved in loading and transforming large sets of structured data from router locations to the EDW using an Apache NiFi data pipeline flow.
•Worked with Spark to improve performance and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
•Developed a preprocessing job using Spark DataFrames to transform JSON documents into flat files.
•Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
•Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
•Involved in loading data from REST endpoints into Kafka producers and transferring the data to Kafka brokers.
•Implemented a Kafka event log producer to publish logs to a Kafka topic, which is consumed by the ELK (Elasticsearch, Logstash, Kibana) stack to analyze logs produced by the Hadoop cluster.
•Used Apache Kafka capabilities such as distribution, partitioning, and the replicated commit log service for messaging, maintaining data feeds.
•Partitioned data streams using Kafka; designed and configured the Kafka cluster to accommodate heavy throughput.
•Used Spark Streaming to collect data from Kafka in near real time, perform the necessary transformations and aggregations to build the common learner data model, and store the data in a NoSQL store (HBase); see the sketch at the end of this role.
•Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
•Migrated an existing on-premises application to Amazon Web Services (AWS) and used services such as EC2 and S3 for processing and storage of small data sets; experienced in maintaining the Hadoop cluster on AWS EMR.
•Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
•Used Apache Oozie for scheduling and managing multiple Hive jobs. Knowledge of HCatalog for Hadoop-based storage management.
•Developed solutions to pre-process large sets of structured, semi-structured data, with different file formats (Text, Avro, Sequence, XML, JSON and Parquet).
•Generated various kinds of reports using Pentaho and Tableau based on Client specification.
•Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
Environment: Hadoop, MapReduce, HDFS, Yarn, Hive, Sqoop, HBase, Apache Solr, Oozie, Spark, Scala, Python, AWS, Flume, Kafka, Tableau, Linux, Shell Scripting.
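The Kafka-to-Spark-Streaming-to-HBase flow described in this role can be outlined with the sketch below. The topic name meter-readings, the HBase table meter_usage and its column family, the record layout, and the broker addresses are all hypothetical placeholders, and connection handling and offset management are simplified.

// Minimal sketch of a Kafka -> Spark Streaming -> HBase pipeline; all names are hypothetical.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object MeterReadingsStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MeterReadingsStream")
    val ssc  = new StreamingContext(conf, Seconds(30))   // 30-second micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092,broker2:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "meter-readings-consumer",
      "auto.offset.reset"  -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("meter-readings"), kafkaParams))

    // Parse "meterId,timestamp,kwh" records and aggregate usage per meter per batch.
    val usagePerMeter = stream
      .map(_.value.split(","))
      .filter(_.length == 3)
      .map(fields => (fields(0), fields(2).toDouble))
      .reduceByKey(_ + _)

    // Persist each micro-batch to HBase, opening one connection per partition.
    usagePerMeter.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = connection.getTable(TableName.valueOf("meter_usage"))
        records.foreach { case (meterId, kwh) =>
          val put = new Put(Bytes.toBytes(meterId))
          put.addColumn(Bytes.toBytes("u"), Bytes.toBytes("kwh"), Bytes.toBytes(kwh))
          table.put(put)
        }
        table.close()
        connection.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}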
Sr. Hadoop Developer Feb 2016 – Mar 2017
Arris – Denver, CO
Description: Arris International is a telecommunications equipment manufacturing company that provides cable operators with data, video, and telephony systems for homes and businesses. The firm is best known for producing telephony modems, wireless cable modem-and-router units, and other telecommunications and data-transfer products.
Roles & Responsibilities:
•Implemented Big Data platforms using Cloudera CDH4 for data storage, retrieval, and processing. Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
•Involved in the complete implementation lifecycle; specialized in writing custom MapReduce, Pig, and Hive programs.
•Managed and scheduled jobs in Oozie to remove duplicate log data files in HDFS.
•Used Apache Oozie workflow engine for scheduling and managing interdependent Hadoop Jobs.
•Implemented custom serializers, interceptors, sources, and sinks in Flume to ingest data from multiple sources (see the interceptor sketch at the end of this role).
•Experience setting up a fan-in (consolidation) flow in Flume, a V-shaped architecture that takes data from many sources and ingests it into a single sink.
•Created Cassandra tables to load large sets of structured, semi-structured, and unstructured data coming from the Linux file system, other NoSQL databases, and a variety of portfolios.
•Implemented a data service as a REST API project to retrieve server utilization data from these Cassandra tables.
•Involved in loading data into HBase using HBase Shell, HBase Client API, Pig and Sqoop.
•Developed Java APIs for retrieval and analysis of data in NoSQL databases such as HBase and Cassandra.
•Involved in creating data-models for data in Cassandra tables using Cassandra Query Language (CQL).
•Created a StreamSets pipeline to parse files in XML format and convert them to a format that is fed to Solr.
•Created near-real-time Solr indexing on HBase (using Lucidworks plugins) and HDFS.
•Used File System Check (FSCK) to check the health of files in HDFS.
•Worked with different compression codecs like GZIP, SNAPPY, and BZIP2 in MapReduce, Pig, and Hive for better performance.
•Analyzed substantial data sets to determine the optimal way to aggregate and report on them.
•Involved in the pilot of Hadoop cluster hosted on Amazon Web Services (AWS).
•Used Sqoop to import the data from databases to Hadoop Distributed File System (HDFS) and performed automated data auditing to validate the accuracy of the loads.
•Proactively researched Microsoft Azure and presented a demo giving an overview of cloud computing with Azure.
•Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, with a view to adopting the former in the project.
•Created Impala views on top of Hive tables for faster data analysis through Hue/TOAD.
Environment: Apache Hadoop, Map Reduce, Solr, HDFS, Hive, Sqoop, Microsoft Azure, Oozie, SQL, Flume, Cassandra, HBase, Java, GitHub.
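The custom Flume components mentioned above can be illustrated with a minimal interceptor sketch. The class name and the header it adds are hypothetical, and the production code may well have been written in Java; Scala is used here for consistency with the other sketches.

// Minimal sketch of a custom Flume interceptor that stamps each event with an ingest timestamp.
import java.util.{List => JList}
import scala.collection.JavaConverters._
import org.apache.flume.{Context, Event}
import org.apache.flume.interceptor.Interceptor

class TimestampEnrichInterceptor extends Interceptor {
  override def initialize(): Unit = ()

  // Add a hypothetical "ingest_ts" header to a single event.
  override def intercept(event: Event): Event = {
    event.getHeaders.put("ingest_ts", System.currentTimeMillis().toString)
    event
  }

  // Apply the same enrichment to a batch of events.
  override def intercept(events: JList[Event]): JList[Event] = {
    events.asScala.foreach(intercept)
    events
  }

  override def close(): Unit = ()
}

object TimestampEnrichInterceptor {
  // Flume instantiates interceptors through a Builder named in the agent configuration.
  class Builder extends Interceptor.Builder {
    override def configure(context: Context): Unit = ()
    override def build(): Interceptor = new TimestampEnrichInterceptor
  }
}

Such an interceptor would be wired into a Flume agent by pointing the source's interceptor type at the fully qualified name of its Builder class in the agent configuration.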
Hadoop Developer Dec 2014 to Feb 2016
Bancorp South – Tupelo, MS
Description: Bancorp South is a regional financial services company that provides banking, investment, mortgage, and commercial financial services through a variety of online and mobile channels.
Roles & Responsibilities:
•Worked on analyzing the Hadoop cluster using different Big Data analytic tools including Hive, Pig, and MapReduce.
•Installed and configured the Hadoop cluster using Cloudera's CDH distribution and monitored cluster performance using Cloudera Manager.
•Monitored workload, job performance and capacity planning using Cloudera Manager.
•Implemented schedulers on the JobTracker to share cluster resources among the MapReduce jobs submitted to the cluster.
•Involved in designing and developing the Hive data model, loading it with data, and writing Java UDFs for Hive (see the UDF sketch at the end of this role).
•Developed solutions for importing and exporting data into HDFS, analyzed the data using MapReduce and Hive, and produced summary results from Hadoop for downstream systems.
•Used Sqoop to import and export the data from Hadoop Distributed File System (HDFS) to RDBMS.
•Created Hive tables and loaded data from HDFS to Hive tables as per the requirement.
•Built custom MapReduce programs to analyze data and used HiveQL queries for data cleansing.
•Created components such as Hive UDFs to fill in missing Hive functionality and to analyze and process large volumes of data extracted from the NoSQL database Cassandra.
•Worked on performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
•Collecting and aggregating substantial amounts of log data using Apache Flume and staging data in HDFS for further analysis.
•Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
•Worked on optimizing MapReduce jobs using combiners and partitioners to deliver the best results, and worked on application performance optimization for the HDFS cluster.
•Comprehensive knowledge and experience in process improvement, normalization/de-normalization, data extraction, data cleansing, and data manipulation within Scrum-based projects.
Environment: Cloudera Distribution (CDH), HDFS, Pig, Hive, Map Reduce, Sqoop, Hbase, Impala, Java, SQL, Cassandra.
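A minimal sketch of a Hive UDF of the kind described above. The original UDFs were written in Java; this sketch uses Scala on the JVM for consistency with the other examples, and the function's name and purpose are hypothetical.

// Minimal sketch using Hive's simple (reflection-based) UDF API.
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Hypothetical UDF: normalizes free-form account identifiers before aggregation.
class NormalizeAccountId extends UDF {
  def evaluate(input: Text): Text = {
    if (input == null) null
    else new Text(input.toString.trim.toUpperCase.replaceAll("[^A-Z0-9]", ""))
  }
}

Packaged in a JAR, the class would be registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION before being called from HiveQL.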
Hadoop Developer Oct 2013 to Dec 2014
Ochsner Health Systems – Jefferson, LA
Description: Ochsner Health Systems is Louisiana’s largest academic and multispecialty healthcare delivery system that provides high-quality clinical and hospital patient care by collaborating with clinicians and scientists to bring innovative medical discoveries to patients across Louisiana.
Roles & Responsibilities:
•Hands-on experience loading data from the UNIX file system and Teradata into HDFS.
•Installed and configured Flume, Hive, Pig, Sqoop, and Oozie on the Hadoop cluster running Cloudera's CDH distribution.
•Developed Pig scripts to process semi-structured data using sorting, joins, and grouping.
•Developed Java MapReduce programs to transform log data into a structured form for deriving user location, age group, and time spent (see the mapper sketch at the end of this role).
•Collected and aggregated large amounts of weblog data from different sources such as web servers, mobile, and network devices using Apache Flume and stored the data in HDFS for analysis.
•Created HBase tables to store variable data formats coming from different portfolios; performed real-time analytics on HBase using the Java API and REST API.
•Extracted files from CouchDB and MongoDB through Sqoop and placed them in HDFS for processing.
•Integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts).
•Developed ETL using Hive, Oozie, shell scripts, and Sqoop, and analyzed the weblog data using HiveQL.
•Supported Data Analysts in running MapReduce Programs.
•Experienced in working with Avro data files using the Avro serialization system.
Environment: Cloudera Distribution (CDH), HDFS, Map Reduce, Pig, Hive, Sqoop, Flume, HBase, Java, Maven, Avro, Oozie, ETL and Unix Shell Scripting.
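The log-structuring MapReduce work described above can be outlined with the mapper sketch below. The original programs were written in Java; Scala is used here for consistency with the other sketches, and the tab-delimited layout with the age group in the fourth field is a hypothetical assumption.

// Hypothetical mapper: counts weblog hits per age group, assuming a tab-delimited
// log line whose fourth field carries the visitor's age group.
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

class AgeGroupMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one    = new IntWritable(1)
  private val outKey = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    val fields = value.toString.split("\t", -1)
    if (fields.length > 3 && fields(3).nonEmpty) {
      outKey.set(fields(3))          // emit (ageGroup, 1) for each log line
      context.write(outKey, one)
    }
  }
}

The matching reducer would sum the counts per age group; job configuration (Job setup, input and output formats) is omitted for brevity.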
Java Developer July 2011 to Sep 2013
Persistent Systems – India
Description: Persistent Systems is a technology services company that offers a secure and true mobile ad-hoc networking system with its Wave Relay product line.
Roles & Responsibilities:
•Involved in the analysis, design, development, and testing phases of the Software Development Life Cycle (SDLC).
•Analysis, design, and development of a J2EE-based application using Struts with Tiles, Spring 2.0, and Hibernate 3.0.
•Involved in interacting with the Business Analyst and Architect during the Sprint Planning Sessions.
•Used XML Web Services for transferring data between different applications.
•Used Apache CXF web service stack for developing web services and SOAP UI and XML-SPY for testing web services.
•Used JAXB for binding XML to Java. Used SAX and DOM parsers to parse XML data.
•Hibernate was used for Object-Relational mapping with Oracle database.
•Worked with Spring IOC for injecting the beans and reduced the coupling between the classes.
•Implemented Spring IOC (Inversion of Control)/DI (Dependency Injection) for wiring the object dependencies across the application.
•Implemented Spring transaction management for handling transactions in the application.
•Implemented the Service Locator design pattern.
•Performed unit testing using JUnit 3 and the EasyMock testing framework.
•Worked on PL/SQL stored procedures using PL/SQL Developer.
•Involved in Fixing the production Defects for the application.
•Used ANT as build-tool for building J2EE applications.
Environment: Java 1.6, Struts, PL/SQL, Spring Transaction Management, Hibernate 3.0, Spring 2.0, JSP 2.0, Oracle 11g, Eclipse, JUnit 3, JDBC, ANT, Maven, XML Web Services.
Jr. Java Developer May 2009 to July 2011
Shore Info Tech – India
Description: Shore Info Tech is a leading provider of various IT solutions and data services. It is involved in developing Smart Data solutions – yielding clean, organized, actionable data to extract information and insight.
Roles & Responsibilities:
•Extensive Involvement in Requirement Analysis and system implementation.
•Actively involved in SDLC phases like Analysis, Design and Development.
•Responsible for developing modules and assisting in deployment as per the client's requirements.
•The application was implemented using JSP, with servlets used to implement the business logic.
•Developed utility and helper classes and server-side functionality using servlets.
•Created DAO classes and wrote various SQL queries to perform DML operations on the data as per requirements.
•Created custom exceptions and implemented exception handling using try, catch, and finally blocks.
•Developed user interface using JSP, JavaScript and CSS Technologies.
•Implemented User Session tracking in JSP.
•Involved in Designing DB Schema for the application.
•Implemented complex SQL queries and reusable triggers, functions, and stored procedures using PL/SQL.
•Worked in pair programming, Code reviewing and Debugging.
•Involved in Tool development, performing Unit testing and Bug Fixing.
•Involved in UAT and production deployments and support activities.
Environment: Java, J2EE, Servlets, JSP, SQL, PL/SQL, HTML, JavaScript, CSS, Eclipse, Oracle, MySQL, IBM WebSphere, JIRA.