Sribhanu
Data Scientist / Hadoop & Spark
********.****@*****.***
PROFESSIONAL SUMMARY:
•Over 8 years of experience in Machine Learning and Data Mining with large sets of structured and unstructured data, covering Data Acquisition, Data Validation, Predictive Modeling and Data Visualization.
•Extensive experience in Text Analytics, developing Statistical Machine Learning and Data Mining solutions to various business problems, and generating Data Visualizations using R, Python and Tableau.
•Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
•Sound knowledge of statistical learning theory with a postgraduate background in mathematics.
•Experience in extracting source data from sequential files, XML files and Excel files, and transforming and loading it into the target Data Warehouse.
•Expertise in Java/J2EE technologies such as Core Java, Struts, Hibernate, JDBC, JSP, JSTL, HTML, JavaScript, JSON
•Experience with SQL and NoSQL databases (MongoDB, Cassandra).
•Hands on experience with Hadoop Core Components (HDFS, MapReduce) and Hadoop Ecosystem (Sqoop, Flume, Hive, Pig, Impala, Oozie, HBase).
•Experience in ingesting real-time/near-real-time data using Flume, Kafka and Storm.
•Experience in importing and exporting data between relational databases and HDFS using Sqoop.
•Hands-on experience with Linux systems.
•Experience using SequenceFile, Avro and Parquet file formats; managing and reviewing Hadoop log files.
•Proficient in Statistical Modeling and Machine Learning techniques (Linear Regression, Logistic Regression, Decision Trees, Random Forest, SVM, K-Nearest Neighbors, Bayesian methods, XGBoost) for Forecasting/Predictive Analytics, Segmentation methodologies, Regression-based models, Hypothesis testing, Factor Analysis/PCA and Ensembles (see the sketch following this summary).
•Experience with foundational Machine Learning models and concepts: Regression, Random Forest, Boosting, GBM, Neural Networks, HMMs, CRFs, MRFs and Deep Learning.
•Working knowledge of Apex code.
•Adept in statistical programming languages such as R and Python, as well as Big Data technologies such as Hadoop and Hive.
•Skilled in using dplyr (R) and pandas (Python) for exploratory data analysis.
•Experience working with data modeling tools like Erwin, Power Designer and ER Studio.
•Experience in designing Star schema and Snowflake schema for Data Warehouse and ODS architectures.
•Experience and technical proficiency in designing and data modeling online applications; Solution Lead for architecting Data Warehouse/Business Intelligence applications.
•Good understanding of Teradata SQL Assistant, Teradata Administrator and data load/export utilities such as BTEQ, FastLoad, MultiLoad and FastExport.
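The following is a minimal, hedged sketch of the kind of supervised modeling listed above (a Random Forest classifier with a held-out evaluation); the dataset is synthetic and all names are illustrative, not taken from any client engagement:

    # Illustrative sketch only: train and evaluate a Random Forest classifier
    # on synthetic data standing in for a real business dataset.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    # Synthetic binary-classification data
    X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Fit the model and score it on the held-out split
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    print("Test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))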
PROFESSIONAL EXPERIENCE:
Client: BBVA Compass, Birmingham, Alabama Feb 2017 – Present.
Role: Data Scientist (Hadoop/Spark)
Responsibilities:
•Imported data from different source systems using Sqoop and automated the loads with Oozie workflows.
•Good knowledge of Apache NiFi for automating data movement between different Hadoop systems.
•Generated business reports from the data lake using Hadoop SQL (Impala) as per business needs.
•Automated business reports on the data lake using Bash scripts in Unix and delivered them to business owners.
•Knowledge of Amazon EC2 Spot integration and Amazon S3 integration.
•Performed Hadoop ETL using Hive on data at different stages of the pipeline.
•Worked in an Agile environment with Scrum.
•Developed PySpark and Scala code to cleanse and perform ETL on data at different stages of the pipeline (see the sketch following this list).
•Worked in different environments, including DEV, QA, Data Lake and Analytics Cluster, as part of Hadoop development.
•Snapshotted the cleansed data to the Analytics Cluster for business reporting.
•Developed Pig scripts and Python streaming jobs, and created Hive tables on top of the resulting data.
•Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark against SQL.
•Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
•Developed PySpark code and Spark SQL/Streaming jobs for faster testing and processing of data.
•Developed Oozie workflows to run multiple Hive UDF, Pig, Sqoop and Spark jobs.
•Handled importing of data from various data sources, performed transformations using Hive and Spark, and loaded the data into HDFS.
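A minimal sketch of the PySpark cleansing/ETL pattern referenced above; the database, table and column names are hypothetical:

    # Hypothetical PySpark sketch: read a raw Hive table, cleanse it,
    # and write the result to a curated table used for Impala reporting.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("cleanse-stage")
             .enableHiveSupport()
             .getOrCreate())

    raw = spark.table("raw_db.transactions")             # hypothetical source table
    cleansed = (raw
        .dropDuplicates(["txn_id"])                       # drop duplicate records
        .filter(F.col("amount").isNotNull())              # remove rows missing amounts
        .withColumn("txn_date", F.to_date("txn_ts")))     # normalize the timestamp

    # Persist to the curated layer queried by the Impala reports
    cleansed.write.mode("overwrite").saveAsTable("curated_db.transactions_clean")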
Environment: Hadoop, AWS, Java, HDFS, MapReduce, Spark, PySpark, Pig, Hive, Impala, Sqoop, Flume, Kafka, HBase, Oozie, SQL scripting, Linux shell scripting, Eclipse and Cloudera.
Client: Apple Inc Dec 2015 – Jan 2017.
Role: Data Scientist / Hadoop
Responsibilities:
•Helped the team to increase cluster size from 55 nodes to 145+ nodes. The configuration for additional data nodes was managed using Puppet.
•Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs.
•Integrated Apache Spark with Hadoop components.
•Used Java for data cleaning and preprocessing.
•Extensive experience in writing HDFS and Pig Latin commands.
•Developed complex queries using Hive and Impala.
•Developed data pipelines using Flume, Sqoop, Pig and Java MapReduce to ingest claim data and financial histories into HDFS for analysis.
•Worked on importing data between HDFS and a MySQL database in both directions using Sqoop.
•Implemented MapReduce jobs through Hive by querying the available data.
•Configured Hive metastore with MySQL, which stores the metadata for Hive tables.
•Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
•Wrote Hive and Pig scripts as per requirements.
•Developed Spark applications using Scala.
•Implemented Apache Spark data processing project to handle data from RDBMS and streaming sources.
•Designed batch processing jobs using Apache Spark to achieve roughly ten-fold speedups compared to equivalent MapReduce jobs.
•Developed Spark SQL jobs to load tables into HDFS and run select queries on top of them.
•Used Spark Streaming to divide streaming data into micro-batches as input to the Spark engine for batch processing.
•Highly skilled in integrating Kafka with Spark Streaming for high-speed data processing (see the sketch following this list).
•Used Spark DataFrames, Spark SQL and Spark MLlib extensively.
•Integrated Apache Storm with Kafka to perform web analytics; uploaded clickstream data from Kafka to HDFS, HBase and Hive through the Storm integration.
•Designed the ETL process and created the high-level design document covering logical data flows, the source data extraction process, database staging, extract creation, source archival, job scheduling and error handling.
•Worked on the Talend ETL tool, using features such as context variables and database components such as tOracleInput, tOracleOutput, tOracleClose, tFileCompare and tFileCopy.
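A minimal sketch of the Kafka-to-Spark-Streaming integration described above, assuming the Kafka direct-stream API shipped with that generation of Spark; topic, broker and path names are hypothetical:

    # Hypothetical sketch: consume clickstream messages from Kafka in
    # micro-batches with Spark Streaming and persist per-message counts to HDFS.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils    # Kafka 0.8 direct API (Spark 1.x/2.x)

    sc = SparkContext(appName="clickstream-ingest")
    ssc = StreamingContext(sc, 10)                     # 10-second micro-batches

    stream = KafkaUtils.createDirectStream(
        ssc, ["clickstream"],                          # hypothetical topic
        {"metadata.broker.list": "broker1:9092"})      # hypothetical broker list

    # Count events per message value in each batch and append the results to HDFS
    counts = stream.map(lambda kv: (kv[1], 1)).reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFiles("hdfs:///data/clickstream/counts")

    ssc.start()
    ssc.awaitTermination()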
Environment: Hadoop, AWS, Java, HDFS, MapReduce, Spark, Pig, Hive, Impala, Sqoop, Flume, Kafka, HBase, Oozie, SQL scripting, Linux shell scripting, Eclipse and Cloudera.
Client: FIS Global, Omaha, NE Feb 2014 – Nov 2015.
Role: Sr. Hadoop Developer
Responsibilities:
•Involved in ingesting data received from various relational database providers onto HDFS for analysis and other big data operations.
•Wrote MapReduce jobs to perform operations such as copying data on HDFS, defining job flows on EC2, and loading and transforming large sets of structured, semi-structured and unstructured data.
•Created Hive tables to import large data sets from various relational databases using Sqoop and exported the analyzed data back for visualization and report generation by the BI team.
•Used Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which consumes data from Kafka in near real time and persists it into Cassandra.
•Configured, deployed and maintained multi-node Dev and Test Kafka clusters.
•Developed Spark scripts using Java and Python shell commands as per requirements.
•Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive (see the sketch following this list).
•Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation, queries and writing data back into the OLTP system through Sqoop.
•Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism and memory tuning.
•Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
•Supported analytics use cases such as predictive analytics for monitoring inventory levels and ensuring product availability, analysis of customers' purchasing behaviors, and value-added service responses based on clients' profiles and purchasing habits.
•Designed and implemented MapReduce jobs to support distributed processing using Java, Hive and Apache Pig.
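A minimal sketch of running Spark analytics over Hive data on YARN, with one explicit parallelism setting as an example of the tuning mentioned above; the application, database, table and column names are hypothetical:

    # Hypothetical sketch: aggregate a Hive table with Spark on YARN.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("claims-analytics")
             .master("yarn")
             .config("spark.sql.shuffle.partitions", "200")   # parallelism tuning knob
             .enableHiveSupport()
             .getOrCreate())

    claims = spark.table("warehouse.claims")                  # hypothetical Hive table
    summary = (claims
               .groupBy("region", "product")
               .agg(F.sum("paid_amount").alias("total_paid"),
                    F.count("*").alias("claim_count")))

    summary.write.mode("overwrite").saveAsTable("analytics.claims_summary")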
Environment: Apache Hadoop, Hive, Pig, HDFS, Java MapReduce, Core Java, Scala, Maven, Git, Jenkins, UNIX, MySQL, Eclipse, Oozie, Sqoop, Flume, Cloudera Distribution, Oracle and Teradata
Client: Capgemini, Richmond, VA Nov 2012 - Jan 2014.
Role: Hadoop Developer/Data Modeler
Responsibilities:
•Participated in JAD sessions with business users and sponsors to understand and document the business requirements in alignment with the financial goals of the company.
•Involved in business requirement analysis, design and development of high-level and low-level designs, and unit and integration testing.
•Performed data analysis and data profiling using complex SQL on various source systems, including Teradata and SQL Server.
•Developed logical and physical data models that capture current-state and target-state data elements and data flows using ER Studio.
•Created the conceptual model for the data warehouse using the Erwin data modeling tool.
•Reviewed and implemented the naming standards for entities, attributes, alternate keys and primary keys in the logical model.
•Performed second and third normal form normalization of the ER data model for the OLTP system.
•Worked with data compliance and data governance teams to maintain data models, metadata and data dictionaries, and to define source fields and their definitions.
•Translated business and data requirements into logical data models in support of Enterprise Data Models, ODS, OLAP, OLTP, operational data structures and analytical systems.
•Designed and built dimensions and cubes with Star schema and Snowflake schema using SQL Server Analysis Services (SSAS).
•Implemented end-to-end systems for data analytics and data automation, integrated with custom visualization tools, using R, Mahout, Hadoop and MongoDB.
•Gathered all required data from multiple data sources and created datasets used in analysis.
•Performed Exploratory Data Analysis and data visualizations using R and Tableau.
•Performed thorough EDA, including univariate and bivariate analysis, to understand individual and combined effects (see the sketch following this list).
•Worked with Data Scientists to create data marts for data-science-specific functions.
•Created stored procedures using PL/SQL and tuned the databases and backend processes.
•Determined data rules and conducted logical and physical design reviews with business analysts, developers and DBAs.
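A minimal sketch of the univariate/bivariate EDA workflow referenced above, using pandas on a hypothetical CSV extract (the file and column names are illustrative):

    # Hypothetical sketch: quick univariate and bivariate EDA with pandas.
    import pandas as pd

    df = pd.read_csv("analysis_dataset.csv")           # hypothetical extract

    # Univariate: summary statistics and a binned distribution of one measure
    print(df.describe())
    print(df["sales_amount"].value_counts(bins=10))    # hypothetical column

    # Bivariate: correlations among numeric columns and a grouped comparison
    print(df.select_dtypes("number").corr())
    print(df.groupby("region")["sales_amount"].mean())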
Environment: Erwin, Teradata, SQL Server 2008, Oracle 9i, SQL*Loader, PL/SQL, ODS, OLAP, OLTP, SSAS, Informatica Power Center.
Client: Velvan Soft Solution Pvt Ltd, Hyderabad, India July 2011 – Oct 2012.
Role: Data Analyst/Data Engineer
Responsibilities:
•Deployed GUI pages using JSP, JSTL, HTML, DHTML, XHTML, CSS, JavaScript and AJAX.
•Configured the project on WebSphere 6.1 application servers.
•Implemented the online application using Core Java, JDBC, JSP, Servlets, EJB 1.1, Web Services, SOAP and WSDL.
•Communicated with other healthcare information systems using Web Services with SOAP, WSDL and JAX-RPC.
•Worked as a Data Modeler/Analyst to generate data models using Erwin and developed the relational database system.
•Analyzed the business requirements of the project by studying the Business Requirement Specification document.
•Extensively worked with the Erwin Data Modeler tool to design the data models.
•Designed mappings to process incremental changes in the source tables; whenever source data elements were missing in source tables, they were modified or added in consistency with the third-normal-form OLTP source database.
•Designed tables and implemented the naming conventions for logical and physical data models in Erwin 7.0.
•Provided expertise and recommendations for physical database design, architecture, testing, performance tuning and implementation.
•Exported data from HDFS to MySQL using Sqoop and an NFS mount approach.
•Moved data from HDFS to Cassandra using MapReduce and the BulkOutputFormat class.
•Developed MapReduce programs for applying business rules to the data (see the sketch following this list).
•Developed and executed Hive queries for denormalizing the data.
•Worked on ETL workflows, analyzed big data and loaded it into the Hadoop cluster.
•Created PL/SQL packages and database triggers, developed user procedures, and prepared user manuals for the new programs.
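A minimal sketch of applying a business rule in MapReduce, shown here as Python Hadoop Streaming scripts rather than the Java jobs used on the project; the field layout and the rule itself are hypothetical:

    # mapper.py - hypothetical rule: keep only active accounts, emit (account, amount)
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                                   # skip malformed records
        account, status, amount = fields[0], fields[1], fields[2]
        if status == "ACTIVE":                         # hypothetical business rule
            print("%s\t%s" % (account, amount))

    # reducer.py - sum amounts per account (streaming input arrives sorted by key)
    import sys

    current_key, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print("%s\t%.2f" % (current_key, total))
            current_key, total = key, 0.0
        total += float(value)
    if current_key is not None:
        print("%s\t%.2f" % (current_key, total))

These two files would be submitted with the Hadoop Streaming jar, passing mapper.py and reducer.py via the -mapper and -reducer options.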
Environment: Erwin r7.0, SQL Server 2000/2005, Linux, Java, MapReduce, HDFS, DB2, MS-DTS, UML, UAT, SQL*Loader, OOD, OLTP, PL/SQL, MS Visio, Cassandra
Client: Excell Media limited, India Aug 2010 – Jun 2011.
Role: Java Developer
Responsibilities:
•Involved in Requirement gathering, Analysis and Design using UML and OOAD
•Worked on the presentation layer using JSP, Servlets and Struts
•Extensively used the Struts framework for MVC, UI design and validations
•Created and deployed dynamic web pages using HTML, JSP, CSS and JavaScript
•Worked on coding and deployments of EJB Stateless session beans
•Interacted with Developers to follow up on Defects and Issues
•Involved in the design and development of HTML presentation using XML, CSS, XSLT and XPath
•Deployed J2EE web applications in BEA Weblogic.
•Ported the application onto the MVC Model 2 architecture using the Struts framework
•Tested the applications, with review and troubleshooting
•Migrated existing flat-file data to a normalized Oracle database
•Used XML, XSD, DTD and the SAX and DOM parsing APIs for XML-based documents for information exchange
•Coded SQL and PL/SQL for backend processing and retrieval logic
•Tested and implemented the system and handled system installation
•Involved in building and deploying the application using the ANT build tool
•Used Microsoft Visual SourceSafe (VSS) and CVS as version control systems
•Worked on bug fixing and Production Support
Environment: J2EE, XML, Informatica, HTML, JSP, PL/SQL