Sribhanu
Data Scientist / Hadoop & Spark
********.****@*****.***
PROFESSIONAL SUMMARY:
•Over 8 years of experience in Machine Learning and Data Mining with large sets of structured and unstructured data, covering Data Acquisition, Data Validation, Predictive Modeling and Data Visualization.
•Extensive experience in Text Analytics, developing Statistical Machine Learning and Data Mining solutions to various business problems, and generating Data Visualizations using R, Python and Tableau.
•Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
•Sound knowledge of statistical learning theory with a postgraduate background in mathematics.
•Experience in extracting source data from sequential files, XML files and Excel files, and transforming and loading it into the target Data Warehouse.
•Expertise in Java/J2EE technologies such as Core Java, Struts, Hibernate, JDBC, JSP, JSTL, HTML, JavaScript, JSON
•Experience with SQL and NoSQL databases (MongoDB, Cassandra).
•Hands on experience with Hadoop Core Components (HDFS, MapReduce) and Hadoop Ecosystem (Sqoop, Flume, Hive, Pig, Impala, Oozie, HBase).
•Experience in ingesting real-time/near-real-time data using Flume, Kafka and Storm.
•Experience in importing and exporting data between relational databases and HDFS using Sqoop.
•Hands-on experience with Linux systems.
•Experience using SequenceFile, Avro and Parquet file formats; managing and reviewing Hadoop log files.
•Proficient in Statistical Modeling and Machine Learning techniques (Linear Regression, Logistic Regression, Decision Trees, Random Forest, SVM, K-Nearest Neighbors, Bayesian methods, XGBoost) for Forecasting/Predictive Analytics, Segmentation methodologies, Regression-based models, Hypothesis testing, Factor Analysis/PCA and Ensembles (see the sketch following this summary).
•Experience with foundational Machine Learning models and concepts: Regression, Random Forest, Boosting, GBM, Neural Networks, HMMs, CRFs, MRFs and Deep Learning.
•Working knowledge of Apex code.
•Adept in statistical programming languages such as R and Python, as well as Big Data technologies such as Hadoop and Hive.
•Skilled in using dplyr (R) and pandas (Python) for exploratory data analysis.
•Experience working with data modeling tools like Erwin, Power Designer and ER Studio.
•Experience in designing Star schema and Snowflake schema for Data Warehouse and ODS architectures.
•Experience and technical proficiency in designing and data modeling online applications; Solution Lead for architecting Data Warehouse/Business Intelligence applications.
•Good understanding of Teradata SQL Assistant, Teradata Administrator and data load/export utilities such as BTEQ, FastLoad, MultiLoad and FastExport.
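The following is a minimal, hedged sketch of the kind of supervised modeling listed above (a Random Forest classifier with a held-out evaluation); the dataset is synthetic and all names are illustrative, not taken from any client engagement:

    # Illustrative sketch only: train and evaluate a Random Forest classifier
    # on synthetic data standing in for a real business dataset.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    # Synthetic binary-classification data
    X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Fit the model and score it on the held-out split
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    print("Test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))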
PROFESSIONAL EXPERIENCE:
Client: BBVA Compass, Birmingham, Alabama Feb 2017 – Present.
Role: Data Scientist (Hadoop/Spark)
Responsibilities:
•Imported data from different source systems using Sqoop and automated the loads with Oozie workflows.
•Good knowledge of Apache NiFi for automating data movement between different Hadoop systems.
•Generated business reports from the data lake using Hadoop SQL (Impala) as per business needs.
•Automated business reports on the data lake using Bash scripts in Unix and delivered them to business owners.
•Knowledge of Amazon EC2 Spot integration and Amazon S3 integration.
•Performed Hadoop ETL using Hive on data at different stages of the pipeline.
•Worked in an Agile environment with Scrum.
•Developed PySpark and Scala code to cleanse and perform ETL on data at different stages of the pipeline (see the sketch following this list).
•Worked in different environments, including DEV, QA, Data Lake and Analytics Cluster, as part of Hadoop development.
•Snapshotted the cleansed data to the Analytics Cluster for business reporting.
•Developed Pig scripts and Python streaming jobs, and created Hive tables on top of the resulting data.
•Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark against SQL.
•Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
•Developed PySpark code and Spark SQL/Streaming jobs for faster testing and processing of data.
•Developed Oozie workflows to run multiple Hive UDF, Pig, Sqoop and Spark jobs.
•Handled importing of data from various data sources, performed transformations using Hive and Spark, and loaded the data into HDFS.
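A minimal sketch of the PySpark cleansing/ETL pattern referenced above; the database, table and column names are hypothetical:

    # Hypothetical PySpark sketch: read a raw Hive table, cleanse it,
    # and write the result to a curated table used for Impala reporting.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("cleanse-stage")
             .enableHiveSupport()
             .getOrCreate())

    raw = spark.table("raw_db.transactions")             # hypothetical source table
    cleansed = (raw
        .dropDuplicates(["txn_id"])                       # drop duplicate records
        .filter(F.col("amount").isNotNull())              # remove rows missing amounts
        .withColumn("txn_date", F.to_date("txn_ts")))     # normalize the timestamp

    # Persist to the curated layer queried by the Impala reports
    cleansed.write.mode("overwrite").saveAsTable("curated_db.transactions_clean")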
Environment: Hadoop, AWS, Java, HDFS, MapReduce, Spark, PySpark, Pig, Hive, Impala, Sqoop, Flume, Kafka, HBase, Oozie, SQL scripting, Linux shell scripting, Eclipse and Cloudera.
Client: Apple Inc Dec 2015 – Jan 2017.
Role: Data Scientist / Hadoop
Responsibilities:
•Helped the team to increase cluster size from 55 nodes to 145+ nodes. The configuration for additional data nodes was managed using Puppet.
•Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs.
•Integrated Apache Spark with Hadoop components.
•Used Java for data cleaning and preprocessing.
•Extensive experience in writing HDFS and Pig Latin commands.
•Developed complex queries using Hive and Impala.
•Developed data pipelines using Flume, Sqoop, Pig and Java MapReduce to ingest claim data and financial histories into HDFS for analysis.
•Worked on importing data between HDFS and a MySQL database in both directions using Sqoop.
•Implemented MapReduce jobs through Hive by querying the available data.
•Configured Hive metastore with MySQL, which stores the metadata for Hive tables.
•Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
•Wrote Hive and Pig scripts as per requirements.
•Developed Spark applications using Scala.
•Implemented Apache Spark data processing project to handle data from RDBMS and streaming sources.
•Designed batch processing jobs using Apache Spark to achieve roughly ten-fold speedups compared to equivalent MapReduce jobs.
•Developed Spark SQL jobs to load tables into HDFS and run select queries on top of them.
•Used Spark Streaming to divide streaming data into micro-batches as input to the Spark engine for batch processing.
•Highly skilled in integrating Kafka with Spark Streaming for high-speed data processing (see the sketch following this list).
•Used Spark DataFrames, Spark SQL and Spark MLlib extensively.
•Integrated Apache Storm with Kafka to perform web analytics; uploaded clickstream data from Kafka to HDFS, HBase and Hive through the Storm integration.
•Designed the ETL process and created the high-level design document covering logical data flows, the source data extraction process, database staging, extract creation, source archival, job scheduling and error handling.
•Worked on the Talend ETL tool, using features such as context variables and database components such as tOracleInput, tOracleOutput, tOracleClose, tFileCompare and tFileCopy.
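A minimal sketch of the Kafka-to-Spark-Streaming integration described above, assuming the Kafka direct-stream API shipped with that generation of Spark; topic, broker and path names are hypothetical:

    # Hypothetical sketch: consume clickstream messages from Kafka in
    # micro-batches with Spark Streaming and persist per-message counts to HDFS.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils    # Kafka 0.8 direct API (Spark 1.x/2.x)

    sc = SparkContext(appName="clickstream-ingest")
    ssc = StreamingContext(sc, 10)                     # 10-second micro-batches

    stream = KafkaUtils.createDirectStream(
        ssc, ["clickstream"],                          # hypothetical topic
        {"metadata.broker.list": "broker1:9092"})      # hypothetical broker list

    # Count events per message value in each batch and append the results to HDFS
    counts = stream.map(lambda kv: (kv[1], 1)).reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFiles("hdfs:///data/clickstream/counts")

    ssc.start()
    ssc.awaitTermination()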
Environment: Hadoop, AWS, Java, HDFS, MapReduce, Spark, Pig, Hive, Impala, Sqoop, Flume, Kafka, HBase, Oozie, SQL scripting, Linux shell scripting, Eclipse and Cloudera.
Client: FIS Global, Omaha, NE Feb 2014 – Nov 2015.
Role: Sr. Hadoop Developer
Responsibilities:
•Involved in ingesting data received from various relational database providers onto HDFS for analysis and other big data operations.
•Wrote MapReduce jobs to perform operations such as copying data on HDFS, defining job flows on EC2, and loading and transforming large sets of structured, semi-structured and unstructured data.
•Created Hive tables to import large data sets from various relational databases using Sqoop and exported the analyzed data back for visualization and report generation by the BI team.
•Used Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which consumes data from Kafka in near real time and persists it into Cassandra.
•Configured, deployed and maintained multi-node Dev and Test Kafka clusters.
•Developed Spark scripts using Java and Python shell commands as per requirements.
•Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive (see the sketch following this list).
•Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation, queries and writing data back into the OLTP system through Sqoop.
•Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism and memory tuning.
•Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
•Supported analytics use cases such as predictive analytics for monitoring inventory levels and ensuring product availability, analysis of customers' purchasing behaviors, and value-added service responses based on clients' profiles and purchasing habits.
•Designed and implemented MapReduce jobs to support distributed processing using Java, Hive and Apache Pig.
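A minimal sketch of running Spark analytics over Hive data on YARN, with one explicit parallelism setting as an example of the tuning mentioned above; the application, database, table and column names are hypothetical:

    # Hypothetical sketch: aggregate a Hive table with Spark on YARN.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("claims-analytics")
             .master("yarn")
             .config("spark.sql.shuffle.partitions", "200")   # parallelism tuning knob
             .enableHiveSupport()
             .getOrCreate())

    claims = spark.table("warehouse.claims")                  # hypothetical Hive table
    summary = (claims
               .groupBy("region", "product")
               .agg(F.sum("paid_amount").alias("total_paid"),
                    F.count("*").alias("claim_count")))

    summary.write.mode("overwrite").saveAsTable("analytics.claims_summary")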
Environment: Apache Hadoop, Hive, Pig, HDFS, Java MapReduce, Core Java, Scala, Maven, Git, Jenkins, UNIX, MySQL, Eclipse, Oozie, Sqoop, Flume, Cloudera Distribution, Oracle and Teradata
Client: Capgemini, Richmond, VA Nov 2012 - Jan 2014.
Role: Hadoop Developer/Data Modeler
Responsibilities:
•Participated in JAD sessions with business users and sponsors to understand and document the business requirements in alignment with the financial goals of the company.
•Involved in business requirement analysis, design and development of high-level and low-level designs, and unit and integration testing.
•Performed data analysis and data profiling using complex SQL on various source systems, including Teradata and SQL Server.
•Developed logical and physical data models that capture current-state and target-state data elements and data flows using ER Studio.
•Created the conceptual model for the data warehouse using the Erwin data modeling tool.
•Reviewed and implemented the naming standards for entities, attributes, alternate keys and primary keys in the logical model.
•Performed second and third normal form normalization of the ER data model for the OLTP system.
•Worked with data compliance and data governance teams to maintain data models, metadata and data dictionaries, and to define source fields and their definitions.
•Translated business and data requirements into logical data models in support of Enterprise Data Models, ODS, OLAP, OLTP, operational data structures and analytical systems.
•Designed and built dimensions and cubes with Star schema and Snowflake schema using SQL Server Analysis Services (SSAS).
•Implemented end-to-end systems for data analytics and data automation, integrated with custom visualization tools, using R, Mahout, Hadoop and MongoDB.
•Gathered all required data from multiple data sources and created datasets used in analysis.
•Performed Exploratory Data Analysis and data visualizations using R and Tableau.
•Performed thorough EDA, including univariate and bivariate analysis, to understand individual and combined effects (see the sketch following this list).
•Worked with Data Scientists to create data marts for data-science-specific functions.
•Created stored procedures using PL/SQL and tuned the databases and backend processes.
•Determined data rules and conducted logical and physical design reviews with business analysts, developers and DBAs.
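A minimal sketch of the univariate/bivariate EDA workflow referenced above, using pandas on a hypothetical CSV extract (the file and column names are illustrative):

    # Hypothetical sketch: quick univariate and bivariate EDA with pandas.
    import pandas as pd

    df = pd.read_csv("analysis_dataset.csv")           # hypothetical extract

    # Univariate: summary statistics and a binned distribution of one measure
    print(df.describe())
    print(df["sales_amount"].value_counts(bins=10))    # hypothetical column

    # Bivariate: correlations among numeric columns and a grouped comparison
    print(df.select_dtypes("number").corr())
    print(df.groupby("region")["sales_amount"].mean())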
Environment: Erwin, Teradata, SQL Server 2008, Oracle 9i, SQL*Loader, PL/SQL, ODS, OLAP, OLTP, SSAS, Informatica Power Center.
Client: Velvan Soft Solution Pvt Ltd, Hyderabad, India July 2011 – Oct 2012.
Role: Data Analyst/Data Engineer
Responsibilities:
•Deployed GUI pages using JSP, JSTL, HTML, DHTML, XHTML, CSS, JavaScript and AJAX.
•Configured the project on WebSphere 6.1 application servers.
•Implemented the online application using Core Java, JDBC, JSP, Servlets, EJB 1.1, Web Services, SOAP and WSDL.
•Communicated with other healthcare information systems using Web Services with SOAP, WSDL and JAX-RPC.
•Worked as a Data Modeler/Analyst to generate data models using Erwin and developed the relational database system.
•Analyzed the business requirements of the project by studying the Business Requirement Specification document.
•Extensively worked with the Erwin Data Modeler tool to design the data models.
•Designed mappings to process incremental changes in the source tables; whenever source data elements were missing in source tables, they were modified or added in consistency with the third-normal-form OLTP source database.
•Designed tables and implemented the naming conventions for logical and physical data models in Erwin 7.0.
•Provided expertise and recommendations for physical database design, architecture, testing, performance tuning and implementation.
•Exported data from HDFS to MySQL using Sqoop and an NFS mount approach.
•Moved data from HDFS to Cassandra using MapReduce and the BulkOutputFormat class.
•Developed MapReduce programs for applying business rules to the data (see the sketch following this list).
•Developed and executed Hive queries for denormalizing the data.
•Worked on ETL workflows, analyzed big data and loaded it into the Hadoop cluster.
•Created PL/SQL packages and database triggers, developed user procedures, and prepared user manuals for the new programs.
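A minimal sketch of applying a business rule in MapReduce, shown here as Python Hadoop Streaming scripts rather than the Java jobs used on the project; the field layout and the rule itself are hypothetical:

    # mapper.py - hypothetical rule: keep only active accounts, emit (account, amount)
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                                   # skip malformed records
        account, status, amount = fields[0], fields[1], fields[2]
        if status == "ACTIVE":                         # hypothetical business rule
            print("%s\t%s" % (account, amount))

    # reducer.py - sum amounts per account (streaming input arrives sorted by key)
    import sys

    current_key, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print("%s\t%.2f" % (current_key, total))
            current_key, total = key, 0.0
        total += float(value)
    if current_key is not None:
        print("%s\t%.2f" % (current_key, total))

These two files would be submitted with the Hadoop Streaming jar, passing mapper.py and reducer.py via the -mapper and -reducer options.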
Environment: Erwin r7.0, SQL Server 2000/2005, Linux, Java, MapReduce, HDFS, DB2, MS-DTS, UML, UAT, SQL*Loader, OOD, OLTP, PL/SQL, MS Visio, Cassandra
Client: Excell Media limited, India Aug 2010 – Jun 2011.
Role: Java Developer
Responsibilities:
•Involved in Requirement gathering, Analysis and Design using UML and OOAD
•Worked on the presentation layer using JSP, Servlets and Struts
•Extensively used the Struts framework for MVC, UI design and validations
•Created and deployed dynamic web pages using HTML, JSP, CSS and JavaScript
•Worked on coding and deployments of EJB Stateless session beans
•Interacted with Developers to follow up on Defects and Issues
•Involved in the design and development of HTML presentation using XML, CSS, XSLT and XPath
•Deployed J2EE web applications in BEA Weblogic.
•Ported the application onto the MVC Model 2 architecture using the Struts framework
•Tested the applications, with review and troubleshooting
•Migrated existing flat-file data to a normalized Oracle database
•Used XML, XSD, DTD and the SAX and DOM parsing APIs for XML-based documents for information exchange
•Coded SQL and PL/SQL for backend processing and retrieval logic
•Tested and implemented the system and handled system installation
•Involved in building and deploying the application using the ANT build tool
•Used Microsoft Visual SourceSafe (VSS) and CVS as version control systems
•Worked on bug fixing and Production Support
Environment: J2EE, XML, Informatica, HTML, JSP, PL/SQL