Big Data Visualization

Location:
Menlo Park, CA, 94025
Posted:
March 14, 2024

Qualifications

*+ years of experience in Big Data Development.

Hands-on experience with AWS, Azure, and GCP.

Deal with multiple terabytes of data stored in AWS using Elastic MapReduce (EMR) and Redshift (PostgreSQL-based).

Understands customer use cases and can create a vision on how to design and implement a solution.

Understands and articulates the overall value of big data; works effectively and proactively with internal and external partners.

Provide actionable recommendations to meet Hadoop data analytics needs on a continuous basis using Hadoop distributed systems and cloud platforms.

Made use of Python libraries for analytic processing, such as SciPy, Pandas, and NumPy.
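
A minimal sketch of this kind of analytic processing with Pandas, NumPy, and SciPy; the file name and the region/amount columns are illustrative placeholders, not taken from an actual engagement.

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical input extract; column names are placeholders.
df = pd.read_csv("events.csv")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Per-region summary statistics.
summary = df.groupby("region")["amount"].agg(["count", "mean", "std"])

# Flag potential outliers with a simple z-score rule.
amounts = df["amount"].dropna()
outliers = amounts[np.abs(stats.zscore(amounts)) > 3]

print(summary)
print(f"{len(outliers)} potential outliers (|z| > 3)")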

Display analytics and insights using data visualization tools such as Tableau, along with Hadoop tools, to generate reports and dashboards that drive key business decisions.

Experience with data visualization tools, data analysis, and business recommendations (cost-benefit, invest-divest, forecasting, impact analysis).

Deliver effective presentations of findings and recommendations to multiple levels of stakeholders, creating visual displays of quantitative information.

Cleanse, aggregate, and organize Hadoop HDFS data lake.

Skilled in Kerberos authentication and Blockchain cryptography using Ethereum and Solidity.

Exposure to Hyperledger in Hadoop data projects.

Skilled with the Spark framework for both batch and real-time data processing.

Hands-on experience processing data using Spark Streaming API with Scala.

Experience using Hadoop clusters, HDFS, Hadoop tools, Spark, Kafka, and Hive for social and media analytics across the Hadoop ecosystem.

Highly knowledgeable in data concepts and technologies, including AWS pipelines and cloud repositories (Amazon AWS, MapR, Cloudera).

Use Hadoop ecosystem tools for ETL, pipelines, and cleaning data in preparation for analysis.

Experience migrating data using Sqoop from HDFS to relational database systems and vice versa, according to client requirements.

Experience with data processing such as collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.

Hands-on experience using Cassandra, Hive, NoSQL databases (HBase, MongoDB), and SQL databases (Oracle, SQL Server, PostgreSQL, MySQL), as well as data lakes and cloud repositories, to pull data for analytics.

Hands-on programming using Kafka, Spark, Scala, Python, Java, and R to refine Hadoop data analytics.

Multiple years of dedicated Linux systems administration experience.

Technical Skills

APACHE – Hadoop, Flume, Hive, Kafka, Oozie, Spark, Tez

PROGRAMMING/SCRIPTING – Scala, HiveQL, MapReduce, XML, FTP, Python, Java, UNIX, Shell scripting

OPERATING SYSTEMS – Unix/Linux, Windows

FILE FORMATS – Parquet, JSON, ORC

DISTRIBUTIONS – Cloudera, Hortonworks, MapR, EMR

DATA PROCESSING (COMPUTE) ENGINES – Apache Spark

DATA VISUALIZATION TOOLS – QlikView, Tableau, Kibana

DATABASES and DATA STRUCTURES – Microsoft SQL Server Database Administration, Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Elasticsearch

SOFTWARE – Microsoft Project, VMware, Microsoft Word, Excel, Outlook, PowerPoint; Technical Documentation Skills

CLOUD SERVICES – Amazon Web Services (AWS), Microsoft Azure

Work Experience

Yahoo, Sunnyvale, CA

December 2019 – Current

Big Data Engineer

Implement solutions for ingesting data from various sources and processing data at rest utilizing Big Data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive.

Work with Amazon Web Services (AWS) on ETL, data integration, and migration.

Work on AWS Kinesis Consumer/Producer library for processing real-time data.
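
The Kinesis Producer/Consumer Libraries themselves are Java-based; as a simplified stand-in, the boto3 sketch below shows the basic put/read pattern. The stream name, region, and record fields are assumptions.

import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "clickstream-demo"            # hypothetical stream name

# Producer side: put a single JSON record, partitioned by user id.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"user_id": "u-123", "event": "page_view"}).encode("utf-8"),
    PartitionKey="u-123",
)

# Consumer side: read a batch of records from the first shard.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
    print(json.loads(record["Data"]))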

Configure StreamSets to store converted data in SQL Server using JDBC drivers.

Create Hive external tables and design data models in Apache Hive.

Expertise in optimizing the storage in Hive using partitioning and bucketing mechanisms on managed and external tables.
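
A minimal PySpark sketch of the partition-plus-bucket layout described here; the paths, database, table, and column names are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-layout").enableHiveSupport().getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

# Hypothetical cleansed source data on HDFS.
sales = spark.read.parquet("/data/staging/sales")

# Partition by a low-cardinality column and bucket by a high-cardinality key,
# so date filters prune partitions and joins on customer_id avoid full shuffles.
(sales.write
      .mode("overwrite")
      .partitionBy("order_date")
      .bucketBy(32, "customer_id")
      .sortBy("customer_id")
      .saveAsTable("analytics.sales_bucketed"))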

Implement Spark on AWS EMR using PySpark and utilize the DataFrame and Spark SQL APIs for faster data processing.

Use Spark SQL and the DataFrame API to process structured and semi-structured data in Spark clusters.
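
A short sketch of mixing the DataFrame API and Spark SQL over structured (Parquet) and semi-structured (JSON) inputs; the paths and column names are placeholders.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mixed-sources").getOrCreate()

# Structured source (Parquet) and semi-structured source (JSON).
orders = spark.read.parquet("/data/curated/orders")
events = spark.read.json("/data/raw/events/*.json")

# DataFrame API for the join and aggregation...
daily = (orders.join(events, "order_id")
               .groupBy("order_date")
               .agg(F.sum("amount").alias("revenue"),
                    F.countDistinct("user_id").alias("buyers")))

# ...and Spark SQL over the same result via a temp view.
daily.createOrReplaceTempView("daily_revenue")
spark.sql("SELECT order_date, revenue FROM daily_revenue ORDER BY revenue DESC LIMIT 10").show()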

Work on CI/CD pipeline for code deployment by engaging different tools (Git, Jenkins, Code Pipeline).

Use Cloudera Manager for installation and management of a multi-node Hadoop cluster.

Work on Cassandra design as well as data modeling.

Create, configure, and monitor Cassandra shard sets.

Perform analysis of data governance and distribution, choosing a shard key to distribute data evenly.
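
In Cassandra terms, even data distribution comes from the choice of partition key; below is a minimal sketch using the DataStax Python driver, with contact points, keyspace, and table names as assumptions.

from cassandra.cluster import Cluster

# Hypothetical contact points for the cluster.
cluster = Cluster(["10.0.0.11", "10.0.0.12"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS analytics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Partition by user_id so writes spread evenly across nodes; cluster by event
# time so reads within a partition come back newest-first.
session.execute("""
    CREATE TABLE IF NOT EXISTS analytics.user_events (
        user_id    text,
        event_time timestamp,
        event_type text,
        payload    text,
        PRIMARY KEY ((user_id), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")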

Work with Spark core, SparkSQL, Data Frames, and Pair RDDs.

Enforce YARN resource pools to share cluster resources among YARN jobs submitted by users.

Explore Spark for improving the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark-SQL, DataFrames, Pair RDDs, Spark YARN.

Use Spark to export transformed streaming datasets into Redshift on AWS cloud.
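
A hedged sketch of one way to land transformed data in Redshift from Spark, via a plain JDBC write; the cluster endpoint, credentials, and table are placeholders, the Redshift JDBC driver jar is assumed to be on the Spark classpath, and production pipelines more commonly stage to S3 and use a COPY-based connector.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-export").getOrCreate()

# Hypothetical transformed dataset produced by the streaming job.
transformed = spark.read.parquet("/data/curated/stream_metrics")

(transformed.write
    .format("jdbc")
    .option("url", "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "public.stream_metrics")
    .option("user", "etl_user")          # placeholder credentials
    .option("password", "*****")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("append")
    .save())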

Install and configure Kafka cluster.

Monitor the Kafka cluster and architect a lightweight Kafka broker.

Work on integration of Kafka with Spark for real-time data processing.
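
A minimal Structured Streaming sketch of the Kafka-to-Spark integration; the broker address, topic, and message schema are assumptions, and the spark-sql-kafka connector package is assumed to be available.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

schema = (StructType()
          .add("user_id", StringType())
          .add("amount", DoubleType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
          .option("subscribe", "purchases")                    # placeholder topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Running per-user totals, written to the console for illustration.
query = (events.groupBy("user_id")
               .agg(F.sum("amount").alias("total"))
               .writeStream.outputMode("complete").format("console").start())
query.awaitTermination()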

Extract needed data from the server into Hadoop file system (HDFS) and bulk load cleaned data into HBase using Spark.

Access Hadoop file system (HDFS) using Spark and manage data in Hadoop data lakes.

Work with the Spark-SQL context to create data frames to filter input data for model execution.

Utilize the Spark DataFrame and Dataset APIs from Spark SQL widely for data handling.

Thermo Fisher Scientific Inc., Waltham, MA

November 2017 – December 2019

Hadoop Cloud Engineer

Met with stakeholders, SMEs, and Data Scientists to gather, determine, and document requirements for the company’s Affymetrix microarray analysis product tools used by researchers studying plant and animal genomics and transcriptomics.

Created a variant of the Lambda architecture consisting of near real-time data processing with Spark Streaming, Spark SQL, and Spark clusters.

Deployed Hadoop clusters of HDFS, Spark Clusters, and Kafka clusters on virtual servers in Azure environment.

Used Azure HDInsight as the interface to manage the online clusters.

Worked in an Azure cloud environment implementing Azure HDInsight.

Ran the Database Migration Assistant to upgrade the existing SQL Server to Azure SQL Database.

Used Scala libraries to process XML data stored in HDFS and wrote the processed data back to HDFS.

Performed streaming data ingestion to the Spark distribution environment, using Kafka.

Built a prototype for real-time analysis using Spark streaming and Kafka.

Worked on escalated tasks related to interconnectivity issues, complex cloud-based identity management and user authentication, and service interruptions with Virtual Machines (and their host nodes) and associated virtual storage (Blobs, Tables, Queues).

Used the Spark DataFrame API on the Azure HDInsight platform to perform analytics on Hive data.

Extensively used transformations such as Router, Aggregator, Normalizer, Filter, Join, Expression, Source Qualifier, unconnected and connected Lookup, Update Strategy, Stored Procedure, and XML transformations, along with error handling and performance tuning.

Used Sqoop to extract the data back to relational databases for business reporting.

Extensively worked on DataStage server and parallel job controls and sequencers. Designed and developed parallel jobs using stages such as Transformer, Aggregator, Merge, Join, Lookup, Sort, Remove Duplicates, Funnel, Filter, Pivot, and Shared Container.

Implemented all SCD types using server and parallel jobs. Extensively applied error handling, testing, debugging, and performance tuning of targets, sources, and transformation logic, and used version control to promote jobs.

Involved in loading and transforming large sets of structured, semi-structured and unstructured data.

Involved in transforming data from legacy tables to HDFS and HBase tables using Sqoop.

Kellogg’s, Battle Creek, MI

September 2015 – November 2017

Big Data Engineer (AWS)

Used Spark SQL to perform transformations and actions on data residing in Hive.

Created EC2 instances and configured auto scaling.

Completed CloudFormation scripting, security, and resource automation.

Configured CloudWatch monitoring for S3 and Glacier storage management, access control, and policies.
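
A hedged boto3 sketch of this kind of setup: a lifecycle rule that transitions objects to Glacier plus a CloudWatch alarm on bucket size. The bucket name, thresholds, and retention periods are illustrative assumptions.

import boto3

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")
BUCKET = "example-archive-bucket"        # placeholder bucket name

# Transition objects to Glacier after 90 days, expire them after 5 years.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-to-glacier",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 1825},
        }]
    },
)

# Alarm when the bucket grows past ~1 TB (BucketSizeBytes is reported daily).
cloudwatch.put_metric_alarm(
    AlarmName="archive-bucket-size",
    Namespace="AWS/S3",
    MetricName="BucketSizeBytes",
    Dimensions=[{"Name": "BucketName", "Value": BUCKET},
                {"Name": "StorageType", "Value": "StandardStorage"}],
    Statistic="Average",
    Period=86400,
    EvaluationPeriods=1,
    Threshold=1e12,
    ComparisonOperator="GreaterThanThreshold",
)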

Wrote numerous Spark programs in Scala for data extraction, transformation, and aggregation across multiple file formats.

Implemented Spark using Scala and utilized DataFrames and Spark SQL API for faster processing of data.

Involved in converting HiveQL/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
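
An illustrative PySpark sketch of converting a simple HiveQL aggregate into pair-RDD transformations; the table and column names are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hql-to-spark").enableHiveSupport().getOrCreate()

# Original HiveQL (illustrative):
#   SELECT category, SUM(amount) FROM sales GROUP BY category;

# Equivalent pair-RDD version over the same Hive table.
totals = (spark.table("sales")
               .rdd
               .map(lambda row: (row["category"], row["amount"]))
               .reduceByKey(lambda a, b: a + b))

for category, total in totals.collect():
    print(category, total)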

Used Spark DataFrame API over Cloudera platform to perform analytics on Hive data.

Used Spark Streaming to receive real-time data from Kafka.

As part of Batch Modernization initiative in E2C, analyzed existing batch ingestion developed in Oracle Data Integrator and developed PySpark application as ETL tool. This reduced the batch ingestion time from 3.5 hours to 15 minutes.
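
A skeletal PySpark ETL job in the spirit of that migration: read the landing-zone extracts, deduplicate to the latest record version, and write partitioned Parquet. The paths, columns, and dedup rule are placeholders, not the actual ODI logic.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("batch-ingestion").getOrCreate()

raw = spark.read.option("header", True).csv("/landing/daily_extract/*.csv")

# Keep only the latest version of each record, then partition output by load date.
latest = Window.partitionBy("record_id").orderBy(F.col("updated_at").desc())
clean = (raw.withColumn("updated_at", F.to_timestamp("updated_at"))
            .withColumn("load_date", F.to_date("updated_at"))
            .withColumn("rn", F.row_number().over(latest))
            .filter("rn = 1")
            .drop("rn"))

(clean.write.mode("overwrite")
      .partitionBy("load_date")
      .parquet("/curated/daily_extract"))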

Set up cloud compute engines in managed and unmanaged modes and handled SSH key management.

Bed Bath and Beyond, Northern, NJ

Jan 2014 – September 2015

Data Engineer

Planned and built a daily process to incrementally import raw data from DB2 into Hive tables using Sqoop.

Engaged in troubleshooting MapReduce jobs using MRUnit.

Used Hive/HQL to query data in Hive tables and loaded data into Hive.

Created data pipelines using Flume, Sqoop, Spark, and MapReduce to ingest data into HDFS for analysis.

Used Oozie and ZooKeeper for workflow scheduling and monitoring.

Used Sqoop to move data from databases (SQL, Oracle) to HDFS and Hive.

Integrated Apache Storm with Kafka to perform web analytics.

Transferred clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.

Designed Hive external tables using a shared metastore with dynamic partitioning and buckets.

Worked on migrating programs into Spark transformations using Spark and Scala.

Worked extensively with Sqoop for importing and exporting data between HDFS and relational database systems/mainframes and vice versa.

Loaded data into HDFS.

Enabled concurrent access to Hive tables with shared/exclusive locks.

Implemented solutions using Scala and SQL for faster testing and processing of data.

Streamed data in real time using Kafka.

Used Oozie for batch processing and scheduling workflows dynamically.

Worked on building end-to-end data pipeline orchestration using Oozie.

Populated HDFS and Cassandra with large amounts of data using Apache Kafka.

Used Sqoop to transfer all the tables and their data into HDFS into specific directories that were created for each one.

Added an extra table into HDFS (originally it was just an ordinary CSV file converted into a table with data).

After the Sqoop transfer, the tables were loaded into Hive (a data warehouse) in two databases I created: one called “raw” for external tables and one called “dsl” for internal tables. The last step in this project was to move the internal tables into distributed HBase.

Executed commands to create tables in HBase and call back through Hive.

The Weinberg Group, Washington, D.C.

August 2012 - December 2013

Hadoop Data Engineer

Collected and aggregated large amounts of log data using Apache Flume and staging data in HDFS for further analysis.

Developed Hadoop pipeline jobs to process Hadoop Distributed File System (HDFS) data and used Avro, Parquet, and ORC file formats with compression.

Used Zookeeper for providing coordinating services to the Hadoop cluster.

Documented Technical Specs, Dataflow, Data Models and Class Models in the Hadoop system.

Used Zookeeper and Oozie for coordinating the cluster and scheduling workflows in Hadoop.

Worked on cluster installation, commissioning and decommissioning of data nodes, NameNode recovery, capacity planning, and slot configuration in Hadoop.

Involved in production support, which included monitoring server and error logs, foreseeing and preventing potential issues, and escalating issues when necessary.

Implemented partitioning, bucketing in Hive for better organization of the Hadoop Distributed File System (HDFS) data.

Used Linux shell scripts to automate the build process, and regular jobs like ETL.

Imported data using Sqoop to load data from MySQL and Oracle to Hadoop Distributed File System (HDFS) on a regular basis.

Created Hive external tables to store the Pig script output. Worked on them for data analysis to meet business requirements.

Successfully loaded files to HDFS from Teradata and loaded from HDFS to HIVE.

Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.

Involved in loading the created files into HBase for faster access to all the products in all the stores without taking a performance hit.

Installed and configured Pig for ETL jobs and wrote Pig scripts with regular expressions for data cleaning.

Involved in loading data from Linux file system to Hadoop Distributed File System (HDFS).

Responsible for building scalable distributed data solutions using Hadoop.

Moved data from Oracle to Hadoop Distributed File System (HDFS), and vice-versa (ETL) using Sqoop.

Prism Data Systems – San Antonio, TX

March 2009 – August 2012

Linux System Administrator

Performed system administration technical monitoring and maintenance, upgrades, and support of Linux-based systems.

Configured DNS and DHCP on clients’ networks.

Provided technical support via telephone and email to thousands of users.

Created database tables with various constraints for clients accessing FTP.

Experienced as Red Hat Enterprise Linux Administrator.

Built, installed, and configured servers from scratch with Red Hat Linux as the operating system.

Performed Red Hat Linux Kickstart installations on RedHat 4.x/5.x, Red Hat Linux kernel tuning, and memory upgrades.

Installed, configured, and performed troubleshooting activities specific to Solaris, Linux RHEL, HP-UX, and AIX operating systems.

Applied OS patches and upgrades on a regular basis and upgraded administrative tools and utilities, and configured or added new services.

Installed and configured Apache, Tomcat, WebLogic, and WebSphere applications.

Provided remote system administration using tools such as SSH, Telnet and Rlogin.

Education

Bachelor of Science in Mathematics from Sonoma State University


