7+ years of experience in the IT industry with proven expertise in Big Data analytics and development.
4 years of experience with Big Data technologies, including the Hadoop framework, Spark, Scala, HDFS, MapReduce, Hive, Pig, Impala, Kafka, YARN, HBase, Oozie, ZooKeeper, Flume, and Sqoop.
Managed, monitored, and administered multi-node Hadoop clusters with Ambari and Cloudera Manager.
Experience working with the Cloudera and Hortonworks Hadoop distributions.
Worked with both Scala and Java; created frameworks for processing data pipelines with Spark.
Expertise with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
Developed Sqoop scripts for importing large datasets from RDBMSs (MySQL, Oracle, SQL Server) into HDFS.
Hands-on experience setting up workflows in Control-M for managing and scheduling Hadoop jobs.
Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
Experience converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
Improved the performance of existing Hadoop jobs using SparkContext, Spark SQL, and Spark on YARN with Scala.
Migrated MapReduce programs into Spark transformations using Apache Spark and Scala.
Experience with batch processing of data sources using Apache Spark, Hive, and Impala.
Experience writing Hive queries in Hive Query Language (HiveQL); hands-on experience with SQL.
Hands-on experience configuring and working with Flume to load data from multiple sources directly into HDFS.
Worked with NoSQL databases such as HBase and good knowledge on Cassandra and MongoDB.
Experience writing custom UDFs in Java and Scala to extend Hive and Pig functionality.
Familiar with designing and developing custom processors and data flow pipelines between systems using flow-based programming in Apache NiFi, via NiFi's web-based UI.
Knowledge of cloud computing infrastructure such as Microsoft Azure.
Hands-on experience with real-time data tools such as Kafka, and with developing data flow pipelines.
Good knowledge of creating custom SerDes in Hive.
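To illustrate the Sqoop-based ingestion described above, a minimal import of one RDBMS table into HDFS can be sketched as below. The host, database, credentials file, and table names are placeholders, not details from any actual engagement, and the command is printed as a dry run rather than executed:

```shell
# Sketch: assemble a Sqoop import from MySQL into HDFS (placeholder values).
# A real job would run the printed command on a cluster edge node.
sqoop_import_cmd() {
  local table="$1" target_dir="$2" splits="$3"
  echo "sqoop import" \
       "--connect jdbc:mysql://dbhost:3306/sales" \
       "--username etl_user --password-file /user/etl/.pw" \
       "--table ${table} --split-by id -m ${splits}" \
       "--target-dir ${target_dir} --as-avrodatafile"
}
# Dry run: print the command instead of submitting it.
sqoop_import_cmd orders /data/raw/orders 4
```

The `--split-by` column and `-m` mapper count control import parallelism; `--target-dir` is the HDFS landing path.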
Hadoop Environment: Hortonworks and Cloudera Hadoop distributions.
Big Data Stack: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Avro, Hadoop Streaming, Storm, Spark, Kafka, YARN, Crunch, ZooKeeper, HBase, Impala, Cassandra, MongoDB, Spark MLlib.
Languages: Scala, SQL, PL/SQL, Java, Shell Script.
IDEs: Eclipse, IntelliJ.
Build Tools: Maven, SBT and Gradle.
Production Support Tools: Control-M.
Configuration Tools: TFS, SVN, GitHub, CVS.
Databases: Cassandra, HBase, Oracle, SQL Server and Teradata.
Cloud Solutions: AWS EC2, S3, Microsoft Azure.
Defect Triage: IBM SCCD, JIRA, and Bugzilla.
Reporting Tools: Power BI, Tableau.
Networking and Protocols: TCP/IP, HTTP, FTP, SNMP, LDAP, DNS.
Operating Systems: RHEL, Ubuntu, CentOS, Windows.
Hortonworks HDP Certified Spark Developer
Hadoop Developer / Spark Developer
Client: Navistar, Inc. - IL - Mar 2018 to present
Project: Navistar IoT devices send telemetry data every 10 seconds to an on-premises data science environment and the Azure cloud. This data consists of geospatial information and 1,500+ unique attributes, including but not limited to odometer readings, engine faults, and brake acceleration. The project requires strong knowledge of Apache Spark, Spark SQL, Spark Streaming, Spark MLlib, Scala, Java, SQL Server, Impala, Kafka, HBase, data analytics, and near-real-time and batch jobs to ingest, compute, transform, and visualize the data.
Designed and developed modules in a Big Data environment using Apache Spark, Spark SQL, Spark Core, Spark Streaming, Structured Streaming, Scala, Impala, Kafka, and HBase.
Developed the data ingestion pipeline for Navistar's IoT devices; defined and implemented data governance and data retention policies.
Loaded data from RDBMSs such as Oracle and SQL Server, and from Microsoft Azure, into Hive using Sqoop.
Loaded data from UNIX file systems into HDFS.
Used JSON and XML SerDe properties to load JSON and XML data into Hive tables.
Designed and implemented a geospatial application following core software development practices, using Scala, Java, and SQL Server.
Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
Optimized Hive tables using techniques such as partitioning and bucketing to improve HiveQL query performance.
Worked extensively on importing metadata into Hive and migrating existing tables and applications to Hive and Spark.
Built a data model to pull data from external sources (weather, crash, pothole, and highway geospatial data) and combined it with real-time streaming data from OCC telematics devices using Kafka, Flume, HBase, Spark, and Hive.
Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
Developed and converted HiveQL queries into Spark transformations using Spark RDDs and Scala.
Used different data formats (text, Avro, Parquet, JSON, XML) while loading data into HDFS.
Led and trained junior Hadoop engineers in Lisle and offshore for knowledge transition, and created guidelines for best practices.
Performed data ingestion between database tables in MySQL, SQL Server, Oracle, and Teradata and HDFS using Sqoop.
Dealt with several source systems (RDBMS, HDFS, Azure) and file formats (JSON, XML, Parquet) to ingest, transform, and persist data in Hive for downstream consumption.
Expertise in writing and parameterizing shell scripts to trigger Spark, Hive, Impala, and Sqoop jobs and schedule them in Control-M.
Developed Spark programs for batch and real-time processing of data from Kafka, transforming it into DataFrames and loading those DataFrames into Hive and HDFS.
Worked closely with data science team to identify predictive model performance.
Used Flume to collect log data from different sources and loaded it into Hive tables using different SerDes to store data in JSON, XML, and SequenceFile formats.
Monitored production and non-production jobs using the Control-M scheduler.
Involved as a member of the enterprise DevOps team in code reviews and in implementing sound software development practices for data operations.
Environment: CDH, Cloudera Manager, HDFS, YARN, Spark, Scala, Hive, Impala, Sqoop, HBase, ZooKeeper, Control-M, Kafka, Flume, Shell Scripting, Linux/Unix, MySQL, Oracle, SQL Server, Git, TFS.
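As a sketch of the parameterized shell wrappers used to trigger Spark jobs from Control-M in this project, the helper below assembles (and, here, only prints) a spark-submit command. The jar name, main class, and queue are hypothetical placeholders, not actual project values:

```shell
# Sketch: build the spark-submit invocation a Control-M job would execute.
# Jar, class, queue, and run date are placeholders.
build_spark_cmd() {
  local jar="$1" main_class="$2" run_date="$3"
  echo "spark-submit --master yarn --deploy-mode cluster" \
       "--queue etl --class ${main_class}" \
       "${jar} --run-date ${run_date}"
}
# Dry run: print the command; a real wrapper would exec it and
# return its exit code to Control-M for job-status tracking.
build_spark_cmd telemetry-ingest.jar com.example.TelemetryBatch 2018-03-01
```

Passing the run date as a parameter lets the scheduler rerun any business date without editing the script.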
Hadoop Developer / Spark Developer
Innovative Consulting Solutions LLC - IL - Jan 2016 to Feb 2018
Project: Delivery of ERP structured data is needed in an accurate and timely manner for regulatory and tax audits globally. JCI needs to improve timeliness and accuracy of providing electronic data for future audits in order to avoid onerous penalties for active (production) and non-active ERP systems for non-compliance. JCI needs to create a formal structured data archiving process and enforce the record retention policy to properly secure, archive and dispose of ERP data. The solution accommodates retiring systems that have been acquired, changed, closed or sold.
Performed data migration, data profiling, extraction, transformation, and loading using Talend; designed data conversions from a large variety of source systems, including Oracle, DB2, SQL Server, Teradata, and Hive, as well as non-relational sources such as flat files and XML files.
Migrated existing MapReduce programs to Spark models using Scala.
Converted Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
Developed Hive scripts on Avro and Parquet file formats.
Imported and exported data between database tables (MySQL, Oracle, Teradata) and HDFS using Sqoop.
Handled importing of data from various sources and performed transformations using Hive and Spark to load data into HDFS.
Built Spark Applications using IntelliJ and Maven.
Worked with Parallel connectors for Parallel Processing to improve job performance while working with bulk data sources in Talend.
Analyzed the data by performing Hive SQL queries (HiveQL or HQL) and running Pig scripts (Pig Latin) to study customer behavior.
Developed a Storm topology and integrated it with a Kafka topic to stream and process data.
Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from different sources such as UNIX file systems and NoSQL stores.
Good knowledge of configuring, maintaining, monitoring, and deploying Hadoop clusters and Big Data analytics tools, including Hive, Oozie, Sqoop, Flume, and Spark, with the Hortonworks distribution.
Designed Spark code using Scala and Spark SQL for faster data processing and data validation.
Used Flume to collect log data from different sources and transfer it to Hive tables using different SerDes to store data in JSON, XML, and SequenceFile formats.
Developed Spark programs for batch and real-time processing of incoming streams from Kafka, transforming them into DataFrames and loading those DataFrames into Hive and HDFS.
Developed Spark code to process data from different sources and store it in Hive/HBase (data is pre-processed and stored in HDFS using NiFi before Spark consumption).
Experience in managing and reviewing Hadoop log files.
Debugging and troubleshooting the issues in development and Test environments.
Environment: HDP, Ambari, HDFS, Hive, Pig, Sqoop, HBase, Apache NiFi, ZooKeeper, Scala, Oozie, Talend, Storm, Kafka, Shell Scripting, Linux/Unix, IntelliJ, MySQL, Git, Oracle, DB2, SQL Server.
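As an illustration of the partitioned Hive tables behind the Avro/Parquet archive work in this project, the helper below emits the kind of DDL involved. The table and column names are hypothetical placeholders; a wrapper script would pass the generated statement to `hive -e`:

```shell
# Sketch: generate HiveQL for a date-partitioned Parquet table.
# All identifiers here are illustrative placeholders.
hive_ddl() {
  local table="$1"
  cat <<EOF
CREATE TABLE IF NOT EXISTS ${table} (
  record_id STRING,
  payload   STRING
)
PARTITIONED BY (load_date STRING)
STORED AS PARQUET;
EOF
}
# A real job would run: hive -e "$(hive_ddl archive_orders)"
hive_ddl archive_orders
```

Partitioning by load date lets the retention policy drop expired data one partition at a time instead of rewriting the table.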
PurpleTalk - Hyderabad, India - Nov 2013 to Dec 2014
Responsibilities include building scalable distributed data solutions using Hadoop ecosystem.
Imported and exported data between RDBMS systems and HDFS/Hive using Sqoop.
Experience creating Hive tables and loading and analyzing data using Hive queries.
Implemented Partitioning, Bucketing in Hive for better organization of the data.
Loading data from UNIX file system to HDFS.
Developed a workflow by scheduling Hive processes for log file data streamed into HDFS using Flume.
Implemented the workflows using Apache Oozie framework to automate tasks.
Continuous monitoring and managing the Hadoop cluster using Ambari.
Developed shell scripts to automate and provide control flow to Pig scripts.
Developed UDFs using Java.
Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
Wrote custom shell scripts to automate redundant tasks on the cluster.
Involved in managing and reviewing Hadoop log files.
Worked with Git for version control and JIRA for project tracking.
Migrated ETL processes from Oracle to Hive to test ease of data manipulation.
Environment: Hortonworks HDP, Ambari, Oozie, Hive, Pig, Sqoop, ZooKeeper, Java, Ganglia, Shell Scripting, Linux/Unix, MySQL, JIRA, Oracle, SQL Server, Git.
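The shell-driven control flow around Pig scripts described in this role can be sketched as a simple loop, one Pig run per daily log partition. The script path and dates are placeholders, and the commands are echoed as a dry run rather than submitted to the cluster:

```shell
# Sketch: drive one Pig run per daily log partition via shell control flow.
# The script path and dates are illustrative placeholders.
run_pig_for_dates() {
  local script="$1"; shift
  for d in "$@"; do
    # A real wrapper would execute this and stop on a non-zero exit code.
    echo "pig -param run_date=${d} ${script}"
  done
}
# Dry run over two example dates.
run_pig_for_dates /apps/etl/clean_logs.pig 2014-01-01 2014-01-02
```

The `-param` flag substitutes the date into the Pig Latin script, so the same script serves every partition.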
PurpleTalk - Hyderabad, India - Jun 2011 to Sep 2013
Implemented user acceptance testing with a focus on documenting defects and executing test cases.
Involved in development, implementation, and integration with existing modules.
Used Maven to build and run the project and to create JAR and WAR files, among other uses.
Executed development processes for the project, such as build automation, unit testing, software configuration management, and packaging.
Implemented modules as REST web services.
Deployed the application to a Tomcat server.
Performed unit testing using the JUnit framework and used MDB listeners and services for testing CRD and trust queue messages.
Used protocols such as HTTP, HTTPS, and FTP to connect to the server.
Used Git for version control of the project.
Used Eclipse for development, testing, and code review.
Implemented messaging and interaction of web services using SOAP.
Used JDBC, SQL, and PL/SQL programming for storing, retrieving, and manipulating data.
Developed JUnit test cases for unit tests as well as system and user test scenarios.
Developed and wrote UNIX shell scripts to automate various tasks.
Master's in Computer Science, University of Central Missouri, May 2016.
Bachelor's in Computer Science, JNTU Kakinada, India, May 2011.