
Big Data

Location: Columbia, OH

Posted: July 20, 2023


Name: Yojana

Email: adye7s@r.postjobfree.com

Phone No: 832-***-****

PROFESSIONAL SUMMARY:

7+ years of IT experience in software development, big data management, data modeling, data integration, implementation, and big data frameworks.

Experience includes Big Data technologies, the Hadoop ecosystem, data warehousing, and SQL-related technologies in the Retail, Manufacturing, Financial, and Communications sectors.

5+ years of experience in the design, development, and deployment of big data analytics using Azure, AWS, Databricks, Airflow, and the Hadoop ecosystem, including HDFS, Hive, Pig, HBase, Sqoop, Flume, MapReduce, Spark, and Oozie.

Strong working experience with ingestion, storage, processing and analysis of big data.

Proficient knowledge of and hands-on experience with writing shell scripts in Linux.

Implemented several optimization mechanisms such as combiners, distributed cache, data compression, and custom partitioners to speed up MapReduce jobs.

Good experience in writing MapReduce programs using Java.

Hands-on experience with performance-improvement techniques for data processing in Hive, Pig, Impala, and MapReduce, using methods such as dynamic partitioning, bucketing, and file compression.

Worked with Sqoop to import and export data between databases such as MySQL and Oracle and HDFS/Hive.

Experience with data formats such as JSON, Parquet, Avro, RC, and ORC, and compression codecs such as Snappy and bzip2.

Developed Oozie workflows by integrating all tasks related to a project and scheduled the jobs per requirements.

Worked on analyzing data in NoSQL databases such as HBase.

Experience in database design, entity relationships, and database analysis.

Experience in handling large datasets using partitions, Spark in-memory capabilities, and broadcast variables in Spark with Scala and Python, applying effective and efficient joins, transformations, and other operations during the ingestion process itself (a brief sketch follows this summary).

Good experience in writing Sqoop queries for transferring bulk data between Apache Hadoop and structured data stores

Strong expertise in troubleshooting and performance fine-tuning Spark, MapReduce and Hive applications

Good experience working with the Amazon EMR framework for processing data on EMR and EC2 instances.

Created AWS VPC networks for the installed instances and configured security groups and Elastic IPs accordingly.

Developed AWS CloudFormation templates to create custom-sized VPCs, subnets, EC2 instances, ELBs, and security groups.

Extensive experience in developing applications that perform data-processing tasks using Teradata, Oracle, SQL Server, and MySQL databases.

Familiar with Agile and Waterfall methodologies; handled numerous client-facing meetings with strong communication skills.
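
Illustrative example: the broadcast-join and repartitioning pattern mentioned in the summary above, shown as a minimal PySpark sketch. The paths, column names, and partition count are hypothetical placeholders rather than details from any specific project below.

    # Minimal PySpark sketch: broadcast join plus repartitioning during ingestion.
    # Paths, column names, and partition counts are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("ingestion-broadcast-join").getOrCreate()

    # Large fact-style dataset read from a landing zone.
    events = spark.read.parquet("/data/landing/events")

    # Small dimension table, broadcast to every executor to avoid a shuffle join.
    regions = spark.read.parquet("/data/reference/regions")

    enriched = (
        events
        .join(F.broadcast(regions), on="region_id", how="left")
        .withColumn("event_date", F.to_date("event_ts"))
        .repartition(200, "event_date")  # control partition count before writing
    )

    enriched.write.mode("overwrite").partitionBy("event_date").parquet("/data/curated/events")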

TECHNICAL SKILLS:

Big Data Technologies: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala, Cloudera, Hortonworks

Hadoop Distributions: Cloudera, Hortonworks, Azure, AWS

Cloud Technologies: Azure, ADF2, Databricks, AWS, Airflow, ADLS, S3

Spark Components: Apache Spark, DataFrames, Spark SQL

Languages: Java, SQL, PL/SQL, Python, HiveQL, Scala, Unix shell scripting

Operating Systems: Windows, UNIX, Linux, Ubuntu, CentOS

Portals/Application Servers: WebLogic, WebSphere Application Server, WebSphere Portal Server, JBoss

Build Automation Tools: SBT, Ant, Maven

Version Control: Git

IDE & Build Tools, Design: Eclipse, Visual Studio, NetBeans, Rational Application Developer, JUnit

Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL databases (HBase, Cassandra, MongoDB), Teradata

PROFESSIONAL EXPERIENCE:

Client: Honeywell March 2022 – Present

Role: Data Engineer

Responsibilities:

Designed and built data platforms and systems with modern cloud-based technologies such as Azure, Databricks, and Snowflake.

Designed and developed data pipelines using Azure Databricks notebooks, Azure Data Factory (ADF2), and Azure Data Lake Storage (ADLS).

Extracted data from a variety of legacy data sources (SQL Server, Teradata, DB2, etc.) using Spark and Sqoop.

Responsible for creating database objects such as tables and views using SnowSQL to structure stored data and maintain the database efficiently.

Performed data-quality analysis using SnowSQL by building analytical warehouses on Snowflake.

Created linked services in ADF2 to connect cloud and on-prem data to the pipeline

Designed and developed automated processes to perform scheduled tasks, ETL, and maintenance activities in ADF2.

Created a GitHub repository to store the project and keep track of changes to files.

Involved in daily SCRUM meetings to discuss the development/progress and was active in making scrum meetings more productive.

Worked on analyzing the Hadoop cluster and various big data analytics and processing tools, including Hive, Sqoop, Python, Spark (with Scala and Java), and Spark Streaming.

Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase.

Worked on the large-scale Hadoop YARN cluster for distributed data processing and analysis using Spark, Hive, and HBase

Involved in creating a data lake by extracting customer data from various sources into HDFS, including Excel files, databases, and server log data.

Developed Apache Spark applications using Scala and Python for processing data from various streaming sources.

Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark

Implemented Spark solutions to generate reports, fetch and load data in Hive

Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system

Wrote HiveQL to analyze the number of unique visitors and their visit information, such as page views and most-visited pages.

Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS using Scala (a brief sketch follows this section).

Experienced in working with the Amazon EMR framework for processing data on EMR and EC2 instances.

Also used Pig for transformations, event joins (via the Elephant Bird API), and pre-aggregations before loading JSON-format files onto HDFS.

Environment: Spark Streaming, Spark Core, Spark SQL, Scala, Hive, Kafka, JSON, HBase, Python, Azure, ADLS Gen2, Databricks, AWS S3, Redshift, ADF, Snowflake
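
Illustrative example: the Kafka-to-HDFS streaming flow described in this section used Spark Streaming with Scala; the following is a minimal PySpark (Structured Streaming) equivalent sketch. The broker address, topic name, and output paths are hypothetical placeholders, and it assumes the spark-sql-kafka connector package is available on the cluster.

    # Minimal PySpark Structured Streaming sketch: consume a Kafka topic and land it in HDFS.
    # Broker, topic, and paths are hypothetical; the Kafka connector package is assumed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "server-logs")
        .option("startingOffsets", "latest")
        .load()
    )

    # Kafka delivers key/value as binary; cast to string for downstream parsing.
    parsed = raw.select(
        F.col("key").cast("string").alias("key"),
        F.col("value").cast("string").alias("value"),
        "topic", "partition", "offset", "timestamp",
    )

    query = (
        parsed.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/streams/server-logs")
        .option("checkpointLocation", "hdfs:///checkpoints/server-logs")
        .trigger(processingTime="1 minute")
        .start()
    )
    query.awaitTermination()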

Client: Huntington Bank Feb 2021 – Feb 2022

Role: Spark Developer

Responsibilities:

Responsible for planning and executing automation and platform-management solutions to integrate data into the Business Intelligence and Analytics platform.

Designed and developed scalable data warehousing solutions, building ETL pipelines in AWS environment

Maintained detailed documentation of my work and changes to support data quality and data governance

Ensured high operational efficiency and quality of solutions to meet SLAs and support commitments to internal customers (the Data Science and Data Analytics teams).

Experience loading data into Snowflake internal and external tables from AWS S3 buckets.

Helped new team members by providing training to processes used in the project.

Designed and built star-schema models in Snowflake to load data into facts and dimensions consumed by Data Scientists and Data Analysts.

Worked with distributed systems such as Spark, Hadoop (HDFS, Hive, Presto, PySpark) to query and process data.

Designed, built, and maintained data pipelines using Airflow (a brief sketch follows this section).

Designed and implemented the dbt pipeline to load data into facts and dimensions.

Created automated tests and deployment processes using dbt.

Worked with scripting languages such as Bash and Python.

Experience in using Jenkins to build CI/CD pipeline for code deployment process

Familiar with Scrum and Agile methodologies

Responsible for architecting Hadoop clusters with CDH3; involved in installing CDH3 and upgrading from CDH3 to CDH4.

Worked on creating a keyspace in Cassandra for saving Spark batch output.

Worked on a Spark application to compact the small files in the Hive warehouse so that file sizes align with the HDFS block size.

Managed migration of on-prem servers to AWS by creating golden images for upload and deployment.

Managed multiple AWS accounts with multiple VPCs for both production and non-production, with primary objectives of automation, build-out, integration, and cost control.

Implemented the real time streaming ingestion using Kafka and Spark Streaming

Loaded data using Spark-streaming with Scala and Python

Involved in the requirements and design phases to implement a streaming Lambda architecture for real-time processing using Spark, Kafka, and Scala.

Experience loading data into Spark RDDs and performing in-memory computation to generate output responses.

Migrated complex MapReduce programs into in-memory Spark processing using transformations and actions.

Developed a full-text search platform using NoSQL, Logstash, and Elasticsearch, allowing much faster, more scalable, and more intuitive user searches.

Developed Sqoop scripts to enable interaction between Pig and the MySQL database.

Worked on Performance Enhancement in Pig, Hive and HBase on multiple nodes

Worked with Distributed n-tier architecture and Client/Server architecture

Supported MapReduce programs running on the cluster and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.

Developed MapReduce applications using Hadoop and HBase.

Evaluated usage of Oozie for Work Flow Orchestration and experienced in cluster coordination using Zookeeper

Developed ETL jobs following organization- and project-defined standards and processes.

Experienced in enabling Kerberos authentication in ETL process

Implemented data access using Hibernate persistence framework

Designed the GUI using the Model-View-Controller architecture (Struts framework).

Integrated Spring DAO for data access using Hibernate and involved in the Development of Spring Framework Controllers

Environment: Hadoop 2.X, HDFS, MapReduce, Hive, Pig, Sqoop, Oozie, HBase, Java, J2EE, Eclipse, HQL, Spark, Python, AWS, S3, Snowflake, Databricks, Jenkins, Data Modelling
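
Illustrative example: a minimal Airflow 2.x DAG sketch of the kind of scheduled pipeline described in this section. The DAG id, schedule, task names, and commands are hypothetical placeholders, not the actual pipeline definitions.

    # Minimal Airflow 2.x DAG sketch for a daily extract-and-load pipeline.
    # DAG id, schedule, and commands are hypothetical placeholders.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="daily_warehouse_load",
        default_args=default_args,
        start_date=datetime(2021, 1, 1),
        schedule_interval="0 6 * * *",  # run every day at 06:00
        catchup=False,
    ) as dag:

        extract = BashOperator(
            task_id="extract_from_source",
            bash_command="spark-submit extract_job.py --run-date {{ ds }}",
        )

        load = BashOperator(
            task_id="load_to_warehouse",
            bash_command="python load_warehouse.py --run-date {{ ds }}",
        )

        extract >> load  # load runs only after the extract task succeeds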

Client: Universal Studios Oct 2019 – Dec 2020

Role: Hadoop Developer

Responsibilities:

Optimized overall cluster performance by caching/persisting and unpersisting data wherever possible.

Active member in developing a POC on streaming data using Apache Kafka, Flume, and Spark Streaming.

Ingested structured and unstructured data in all formats, including logs/transactions and relational database data, into HDFS using Sqoop and Flume.

Designed and implemented Sqoop incremental imports on Teradata tables without primary keys or date columns and appended the data directly into the Hive warehouse.

Created Hive Generic UDF's to process business logic that varies based on policy.

Used partitioning, bucketing, map-side joins, and parallel execution to optimize Hive queries.

Queried Hive tables through Impala for faster execution.

Ran Hive queries, Pig scripts, Sqoop jobs, and MapReduce programs using Oozie workflows and sub-workflows.

Used joins in Pig scripts such as map-side join, skewed join, and merge join for performance tuning.

Worked with Pig to flatten the multiple datasets.

Experience customizing the MapReduce framework at different levels, such as input formats, data types, custom SerDes, and partitioners.

Responsible for handling different data formats such as Avro, Parquet, and SequenceFile.

Used compression techniques (Snappy) with file formats to conserve storage in HDFS.

Automated workflows using shell scripts to pull data from various databases into Hadoop.

Involved in daily SCRUM meetings to discuss the development/progress and was active in making scrum meetings more productive.

Participated in gathering and analyzing requirements and designing technical documents for business requirements.

Analyzed SQL scripts and designed the solution to implement using PySpark.

Worked in Spark Streaming to receive ongoing data from Kafka and store the streamed data in HDFS.

Developed and Configured Kafka brokers to pipeline server logs data into Spark Streaming.

Completed data extraction, aggregation, and analysis in HDFS using PySpark and stored the required data in Hive (a brief sketch follows this section).

Worked on MySQL database to retrieve information from storage using Python.

Experienced in implementing and running Python code via shell scripting.

Responsible for developing a data pipeline with Apache NiFi and Spark/Scala to extract data from the vendor and store it in HDFS and Redshift.

Developed Python code to gather data from HBase (Cornerstone) and designed the solution to implement using PySpark.

Optimized poorly performing Spark/Scala jobs by monitoring the YARN UI and increasing parallelism.

Worked with Python's unittest and pytest frameworks.

Used Python/ HTML / CSS to help the team implement dozens of new features in a massively scaled Google App Engine web application.

Loaded all datasets into Hive and Cassandra from source CSV files using Spark/PySpark.

Good knowledge of Apache Cassandra for storing data in a cluster.

Developed and analyzed the SQL scripts and designed the solution to implement using Pyspark

Built an ingestion framework using Apache NiFi to ingest financial data files from SFTP into HDFS.

Developed a fully automated continuous integration system using Git, Jenkins.

Performed job functions using Spark API's in Scala for real time analysis and for fast querying purposes.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.

Performed data analysis on NoSQL databases such as HBase and Cassandra.

Developed and implemented core API services using Python and Spark (PySpark).

Used PySpark to process and analyze the data.

Wrote test cases using the PyUnit test framework and Selenium automation testing for better management of test scripts.

Leveraged cloud and GPU computing technologies for automated machine learning and analytics pipelines, such as AWS, GCP

Conducted systems design and feasibility studies to recommend cost-effective cloud solutions such as Amazon Web Services (AWS), Microsoft Azure, and Rackspace.

Developed and deployed data pipelines in the cloud on AWS and GCP.

Used Hive join queries to join multiple tables of a source system and load them into Elasticsearch.

Performed batch processing of data sources using Apache Spark and Elasticsearch.

Implemented a Python-based distributed random forest via Python streaming.

Worked on the MySQL migration project to make the system completely independent of the database being used.

Environment: Python, MapReduce, Spark, Hadoop, HDFS, HBase, Scala, Kafka, Hive, Pig, Sqoop, RBAC, ACL, Pandas, Docker, ReactJS, Google APIs, SOAP, REST, Azure APIs, Shell Scripting, Selenium, AWS, CDH, Unix, Oozie, Autosys.
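
Illustrative example: a minimal PySpark sketch of the extract-aggregate-store-to-Hive flow described in this section. The input path, column names, and target database/table are hypothetical placeholders, and the target Hive database is assumed to already exist.

    # Minimal PySpark sketch: aggregate raw data in HDFS and store the result as a Hive table.
    # Paths, columns, and the database/table names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("hdfs-aggregation-to-hive")
        .enableHiveSupport()  # lets saveAsTable register the result in the Hive metastore
        .getOrCreate()
    )

    # Read raw CSV data from HDFS; header and schema inference used for brevity.
    raw = (
        spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv("/data/raw/transactions")
    )

    daily_summary = (
        raw.withColumn("txn_date", F.to_date("txn_ts"))
        .groupBy("txn_date", "store_id")
        .agg(
            F.count("*").alias("txn_count"),
            F.sum("amount").alias("total_amount"),
        )
    )

    # Overwrite the managed Hive table with the latest aggregation.
    daily_summary.write.mode("overwrite").saveAsTable("analytics.daily_store_summary")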

Client: Data Group Geospatial Technologies Pvt Ltd, India June 2017 – Sep 2019

Role: Big Data Engineer

Responsibilities:

Well-versed in Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming.

Developed MapReduce programs to remove irregularities and aggregate the data.

Implemented Hive UDF's and did performance tuning for better results

Implemented optimized map joins to get data from different sources to perform cleaning operations before applying the algorithms.

Experience in using Sqoop to import and export the data from Oracle DB into HDFS and HIVE.

Implemented CRUD operations on HBase data using the Thrift API to get real-time insights (a brief sketch follows this section).

Developed workflow in Oozie to manage and schedule jobs on Hadoop cluster for generating reports on nightly, weekly and monthly basis.

Used various compression codecs to effectively compress the data in HDFS.

Responsible for troubleshooting issues in the execution of MapReduce jobs by inspecting and reviewing log files.

Implemented POC to migrate map reduce jobs into Spark RDD transformations.

Worked in Agile development environment in sprint cycles of two weeks by dividing and organizing tasks. Participated in daily scrum and other design related meetings.

Created final reports of analyzed data using Apache Hue and Hive Browser and generated graphs for studying by the data analytics team.

Understood business needs, analyzed functional specifications, and mapped them to the design and development of MapReduce programs and algorithms.

Designed and implemented MapReduce-based large-scale parallel relation-learning system.

Customized Flume interceptors to encrypt and mask customer sensitive data as per requirement

Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.

Built a web portal using JavaScript that makes REST API calls to Elasticsearch and retrieves the row key.

Used Kibana, an open-source browser-based analytics and search dashboard for Elasticsearch.

Used Amazon web services (AWS) like EC2 and S3 for small data sets.

Performed importing data from various sources to the Cassandra cluster using Java APIs or Sqoop.

Developed iterative algorithms using Spark Streaming in Scala for near real-time dashboards.

Installed and configured Hadoop and Hadoop stack on a 40-node cluster.

Involved in customizing the MapReduce partitioner to route key-value pairs from mappers to reducers in XML format per requirements.

Configured Flume for efficiently collecting, aggregating and moving large amounts of log data.

Involved in creating Hive tables, loading the data using it and in writing Hive queries to analyze the data.

Implemented AWS services to provide a variety of computing and networking services to meet the needs of applications

Involved in scheduling Oozie workflow engine to run multiple Hive and pig jobs

Designed and built the Reporting Application, which uses the Spark SQL to fetch and generate reports on HBase table data.

Extracted the needed data from the server into HDFS and Bulk Loaded the cleaned data into HBase.

Used different file formats such as text files, SequenceFiles, Avro, Record Columnar (RC), and ORC.

Responsible for creating Hive UDF’s that helped spot market trends.

Developed custom aggregate functions using Spark SQL and performed interactive querying

Environment: Hadoop, CDH 5.5, HDFS, MapReduce, Cloudera, HBase, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Spark, Scala, Java, Cassandra, Elasticsearch, Kibana, AWS, Maven, Linux, UNIX Shell Scripting
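
Illustrative example: CRUD against HBase over the Thrift gateway, sketched with the happybase Python client (one common Thrift-based client; this section does not name the library actually used). The host, table, and column names are hypothetical placeholders.

    # Minimal sketch of HBase CRUD via the Thrift gateway using the happybase client.
    # Host, table, and column names are hypothetical placeholders.
    import happybase

    connection = happybase.Connection(host="hbase-thrift-host", port=9090)
    table = connection.table("customer_events")

    # Create / update: put a row with two columns in the 'info' column family.
    table.put(b"cust#1001", {b"info:name": b"Acme Corp", b"info:status": b"active"})

    # Read: fetch a single row, then scan a key range by prefix.
    row = table.row(b"cust#1001")
    print(row.get(b"info:status"))

    for key, data in table.scan(row_prefix=b"cust#"):
        print(key, data.get(b"info:name"))

    # Delete: remove one column, then the whole row.
    table.delete(b"cust#1001", columns=[b"info:status"])
    table.delete(b"cust#1001")

    connection.close()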

Client: Vitwit Technologies Private Limited, India May 2015 – May 2017

Role: Big Data Analyst

Responsibilities:

Collaborated with the internal/client BAs to understand the requirements and architect a data-flow system.

Developed complete end-to-end big data processing in the Hadoop ecosystem.

Extracted the needed data from the server into HDFS and Bulk Loaded the cleaned data into HBase.

Led NoSQL column-family design, client access software, and Cassandra tuning during the migration from Oracle-based data stores.

Designed, implemented, and deployed a series of custom parallel algorithms for various customer-defined metrics and unsupervised learning models within a customer's existing Hadoop/Cassandra cluster.

Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark framework.

Wrote DataStax Cassandra CQL queries to create, alter, insert, and delete elements (a brief sketch follows this section).

Created Hive schemas using performance techniques like partitioning and bucketing.

Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.

Involved in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, and effective and efficient joins and transformations such as reduceByKey and aggregateByKey during the ingestion process itself.

Used JUnit for unit testing of MapReduce code.

Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.

Developed Hive queries for the analysts.

Created an e-mail notification service that alerts the team that requested the data upon job completion.

Defined job work flows as per their dependencies in Oozie.

Played a key role in productionizing the application after testing by BI analysts.

Delivered a POC of Flume to handle real-time log processing for attribution reports.

Maintained system integrity of all Hadoop-related sub-components.

Environment: Apache Hadoop, HDFS, Spark, Hive, Datastax Cassandra, Map Reduce, Pig, Java, Flume, Cloudera CDH4, Oozie, Oracle, MySQL, Amazon S3.
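
Illustrative example: the CQL create/insert/select/delete operations mentioned in this section, sketched with the DataStax Python driver (cassandra-driver); the driver choice, contact point, keyspace, and table names are hypothetical placeholders.

    # Minimal sketch of CQL CRUD using the DataStax Python driver (cassandra-driver).
    # Contact point, keyspace, and table names are hypothetical placeholders.
    from datetime import datetime

    from cassandra.cluster import Cluster

    cluster = Cluster(["cassandra-node1"])
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS analytics
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.set_keyspace("analytics")

    session.execute("""
        CREATE TABLE IF NOT EXISTS page_views (
            page_id text,
            view_ts timestamp,
            visitor_id text,
            PRIMARY KEY (page_id, view_ts)
        )
    """)

    # Insert with a prepared statement (parameterized rather than string-built).
    insert = session.prepare(
        "INSERT INTO page_views (page_id, view_ts, visitor_id) VALUES (?, ?, ?)"
    )
    session.execute(insert, ("home", datetime.utcnow(), "visitor-42"))

    # Select all rows in one partition.
    for row in session.execute("SELECT view_ts, visitor_id FROM page_views WHERE page_id = 'home'"):
        print(row.view_ts, row.visitor_id)

    # Delete the whole partition.
    session.execute("DELETE FROM page_views WHERE page_id = 'home'")

    cluster.shutdown()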

Education Details

University of Dayton Jan 2021 - Aug 2022


