Big Data Engineer

Location:
United States
Salary:
$60
Posted:
August 09, 2022

Resume:

PRIYANKA

Email: adr2os@r.postjobfree.com

Phone: +1-609-***-****

PROFESSIONAL SUMMARY:

7+ years of experience working with Big Data technologies on systems comprising massive amounts of data running in highly distributed Hadoop environments.

Hands-on experience in using Hadoop ecosystem components such as Hadoop, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, Zookeeper, Kafka, Flume, MapReduce, and YARN.

Strong knowledge of Spark architecture and components; efficient in working with Spark Core, Spark SQL, and Spark Streaming.

Implemented Spark Streaming jobs by developing RDDs (Resilient Distributed Datasets) and used Spark and spark-shell accordingly.

Experience in configuring Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to HDFS using Scala.
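
For illustration, a minimal PySpark sketch of this kind of pipeline (the resume cites Scala; PySpark is used here for consistency with the other examples); the broker, topic, and HDFS paths are hypothetical, and the spark-sql-kafka connector is assumed to be available:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Read the Kafka topic as a streaming DataFrame (placeholder broker/topic).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(value AS STRING) AS value"))

    # Continuously append the stream to HDFS as Parquet, with checkpointing.
    (events.writeStream
           .format("parquet")
           .option("path", "hdfs:///data/streaming/events")
           .option("checkpointLocation", "hdfs:///checkpoints/events")
           .start()
           .awaitTermination())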

Experience in importing and exporting data using stream processing platforms like Flume and Kafka.

Wrote complex HiveQL queries for required data extraction from Hive tables and wrote Hive UDFs as needed.

Solid experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
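
For illustration, a sketch of the kind of Hive DDL this involves, issued here through Spark SQL; table, column, and location names are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # External table: data stays at the given HDFS location, partitioned by load date.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales_ext (id BIGINT, amount DOUBLE)
        PARTITIONED BY (load_date STRING)
        STORED AS PARQUET
        LOCATION 'hdfs:///warehouse/external/sales'
    """)

    # Managed table: bucketed by customer_id to speed up joins and sampling.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_managed (id BIGINT, customer_id BIGINT, amount DOUBLE)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS ORC
    """)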

Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
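
As a small hypothetical example of such a conversion (shown in PySpark), the same aggregation expressed first as a Hive/SQL query and then as DataFrame transformations:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Original HiveQL-style query run through Spark SQL.
    sql_result = spark.sql(
        "SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id")

    # Equivalent DataFrame transformation on the same Hive table.
    df_result = (spark.table("sales")
                 .groupBy("customer_id")
                 .agg(F.sum("amount").alias("total")))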

Used Spark Data Frames API over Cloudera platform to perform analytics on Hive data.

Used Spark DataFrame operations to perform required validations on the data.

Experience in integrating Hive queries into Spark environment using Spark SQL.

Expert in PySpark and Python.

Good understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra.

Worked on HBase to load and retrieve data for real-time processing using its REST API.

Excellent understanding and knowledge of job workflow scheduling and locking tools/services like Oozie and Zookeeper.

Expert in JSON, XML, AVRO, Parquet and other data formats and storage types.

Experienced in designing different time driven and data driven automated workflows using Oozie.

Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data Warehouse tools for reporting and data analysis.

Worked on developing ETL Workflows on the data obtained using Python for processing it in HDFS and HBase using Oozie.

Experience in configuring Zookeeper to coordinate the servers in clusters and to maintain data consistency.

Experience in importing and exporting data using Sqoop from HDFS to relational database systems and vice versa.

Good Knowledge in UNIX Shell Scripting for automating deployments and other routine tasks.

Experience in relational databases like Oracle, MySQL and SQL Server.

Experienced in using Integrated Development environments like Eclipse, NetBeans, IntelliJ, Spring Tool Suite.

Used various Project Management services like JIRA for tracking issues, bugs related to code and GitHub for various code reviews and worked on various version control tools like GIT, SVN.

Experienced in working in SDLC, Agile and Waterfall Methodologies.

Excellent communication, interpersonal, and problem-solving skills; a team player with the ability to quickly adapt to new environments and technologies.

Good understanding of Scrum methodologies, Test Driven Development and continuous integration.

Major strengths include familiarity with multiple software systems, the ability to learn new technologies quickly, and adaptability to new environments; a self-motivated, focused, and quick-learning team player with excellent interpersonal, technical, and communication skills.

Experience in defining detailed application software test plans, including organization, participants, schedule, and test and application coverage scope.

Experience in gathering and defining functional and user interface requirements for software applications.

Experience in real time analytics with Apache Spark (RDD, Data Frames and Streaming API).

Experience in integrating Hadoop with Kafka. Expertise in uploading clickstream data from Kafka to HDFS.

Expert in utilizing Kafka as a publish-subscribe messaging system.
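
An illustrative publish/subscribe sketch using the kafka-python client; the client library choice, broker address, and topic name are assumptions made for the example:

    from kafka import KafkaProducer, KafkaConsumer

    # Publish a message to a topic (placeholder broker and topic names).
    producer = KafkaProducer(bootstrap_servers="broker1:9092")
    producer.send("clickstream", b'{"page": "/home", "user": 42}')
    producer.flush()

    # Subscribe to the same topic and read messages from the beginning.
    consumer = KafkaConsumer("clickstream",
                             bootstrap_servers="broker1:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)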

PROFESSIONAL EXPERIENCE

AT&T, Plano, TX Aug 2021 – Present

Job Title: Big Data Engineer

Responsibilities:

Developed data pipelines using StreamSets Data Collector to store data from Kafka into HDFS and HBase.

Implemented event streaming across different stages in StreamSets Data Collector, running a MapReduce job on event triggers to convert Avro to Parquet.

Performed real-time streaming and transformations on the data using Kafka and Kafka Streams.

Developing business logic using Scala.

Developed Pig Latin scripts to extract the data from the web server output files and load it into HDFS.

Implemented File Transfer Protocol operations using Talend Studio to transfer files in between network folders.

Created Hive UDFs in Java, compiled them into JARs, added them to HDFS, and executed them in Hive queries.

Developed Hive queries for the analysts.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.

Configured big data workflows to run on top of Hadoop using Oozie; these workflows comprise heterogeneous jobs such as Pig, Hive, and Sqoop, with cluster coordination services through Zookeeper.

Loaded and transformed data into HDFS from large sets of structured data in Oracle and SQL Server using Talend Big Data Studio.

Hands-on experience with cloud services such as Amazon Web Services (AWS).

Worked on migration of Oozie workflows into Apache Airflow DAGs.

Used Airflow workflows to automate jobs on Amazon EMR.
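
A minimal sketch of such automation, assuming an Airflow 2.x environment and boto3; the cluster ID, script location, and schedule are hypothetical placeholders:

    from datetime import datetime

    import boto3
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def submit_emr_step():
        # Add a spark-submit step to an already-running EMR cluster (placeholder IDs/paths).
        emr = boto3.client("emr", region_name="us-east-1")
        emr.add_job_flow_steps(
            JobFlowId="j-XXXXXXXXXXXXX",
            Steps=[{
                "Name": "daily-etl",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://my-bucket/jobs/daily_etl.py"],
                },
            }],
        )

    with DAG("emr_daily_etl", start_date=datetime(2022, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        PythonOperator(task_id="submit_step", python_callable=submit_emr_step)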

Worked with PySpark to improve the performance and optimization of existing applications, moving them from the EMR cluster to AWS Glue.

Wrote AWS Glue scripts to extract, transform, and load the data.

Worked with the Parquet data format for faster data transformation and to optimize query performance for Athena.
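
For illustration, a skeleton of the kind of Glue ETL script described above, writing Parquet so downstream Athena queries stay fast; the catalog database, table, and S3 paths are hypothetical:

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Extract: read the source table from the Glue Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="analytics_db", table_name="raw_events")

    # Transform: keep only the columns needed downstream.
    trimmed = source.select_fields(["event_id", "event_ts", "user_id"])

    # Load: write Parquet to S3 for efficient Athena queries.
    glue_context.write_dynamic_frame.from_options(
        frame=trimmed,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/curated/events/"},
        format="parquet")

    job.commit()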

Worked with Bitbucket, Git, and Bamboo to deploy EMR clusters.

Worked with different file formats such as TextFile, Avro, ORC, and Parquet for Hive querying and processing.

Integrated Hive tables to HBase to perform row level analytics.

Configured various property files like core-site.xml, hdfs-site.xml, mapred-site.xml based upon the job requirement.

Analyzed metadata as per the requirements.

Loaded data into HDFS from the local file system.

Wrote Pig scripts for processing the data.

Created Hive tables with schemas and loaded the data using Sqoop.

Wrote queries in Hive to map the data residing in HDFS.

Wrote Java code for MapReduce jobs.

Loaded data from CSV files into Spark, created DataFrames, and queried the data using Spark SQL.
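
For example, a minimal PySpark version of that workflow (file path and query are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-sql").getOrCreate()

    # Load CSV files into a DataFrame, inferring column types from the data.
    df = spark.read.csv("hdfs:///landing/orders/*.csv", header=True, inferSchema=True)

    # Register a temporary view so the data can be queried with Spark SQL.
    df.createOrReplaceTempView("orders")
    spark.sql("SELECT status, COUNT(*) AS cnt FROM orders GROUP BY status").show()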

Created external tables in Hive, Loaded JSON format log files and ran queries using HiveQL.

Involved in designing Avro schemas for serialization and converting JSON data to Avro format.
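
A one-step sketch of such a conversion in PySpark, assuming the external spark-avro package is on the classpath; here the Avro schema is derived from the JSON data rather than hand-designed, and the paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-to-avro").getOrCreate()

    # Read JSON log files and rewrite them in Avro format.
    logs = spark.read.json("hdfs:///logs/json/")
    logs.write.format("avro").mode("overwrite").save("hdfs:///logs/avro/")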

Designed HBase row key and Data-Modelling of data to insert to HBase Tables using concepts of Lookup Tables and Staging Tables.

Used the Agile Scrum methodology (Scrum Alliance) for development.

Involved in developing Hive DDLs to create, alter and drop tables.

Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig Scripts.

Also explored Spark for improving the performance and optimization of the existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, Pair RDDs, Storm, and Spark on YARN.

Used features such as parallelize, partitioning, caching (both in-memory and on-disk serialization), and Kryo serialization. Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
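
A small PySpark sketch of those tuning features; the Kryo setting and storage level shown are generic examples rather than the project's actual configuration:

    from pyspark import SparkConf, SparkContext, StorageLevel

    conf = (SparkConf()
            .setAppName("tuning-demo")
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
    sc = SparkContext(conf=conf)

    # Parallelize a local collection into an RDD with an explicit partition count.
    rdd = sc.parallelize(range(1_000_000), numSlices=8)

    # Cache across memory and disk so repeated actions avoid recomputation.
    squares = rdd.map(lambda x: x * x).persist(StorageLevel.MEMORY_AND_DISK)
    print(squares.count(), squares.sum())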

Provided input on the long-term strategic vision and direction for the data delivery infrastructure, including Microsoft BI stack implementations and Azure advanced data analytics solutions.

Evaluated the existing data platform and applied technical expertise to create a data modernization roadmap and architect solutions to meet business and IT needs.

Utilized Azure Databricks to process Spark jobs and Blob Storage services to process data.

Worked on data fabrics to process data silos of a big data system.

Built data fabrics to simplify and integrate data management across cloud and on-premises environments to accelerate digital transformation.

Environment: HDFS, Hadoop, Kafka, MapReduce, NiFi, Elasticsearch, Spark, Impala, Hive, Avro, Parquet, Grafana, Scala, Java.

Cigna, Chicago, IL

Job Title: Hadoop Developer Sep 2020 – July 2021

Responsibilities:

Implemented and maintained the monitoring and alerting of production and corporate servers/storage using AWS CloudWatch.

Managed server instances on the Amazon Web Services (AWS) platform using Puppet and Chef configuration management.

Developed PIG scripts to transform the raw data into intelligent data as specified by business users.

Worked in AWS environment for development and deployment of Custom Hadoop Applications.

Worked closely with the data modelers to model the new incoming data sets.

Involved in the end-to-end process of Hadoop jobs that used various technologies such as Sqoop, Pig, Hive, MapReduce, Spark, and shell scripts (for scheduling of a few jobs).

Expertise in designing and deployment of Hadoop clusters and different big data analytic tools including Pig, Hive, Oozie, Zookeeper, Sqoop, Flume, Spark, Impala, and Cassandra with the Hortonworks distribution.

Involved in creating Hive tables and Pig tables, loading data, and writing Hive queries and Pig scripts.

Assisted in upgrading, configuration and maintenance of various Hadoop infrastructures like Pig, Hive, and HBase.

Explored Spark for improving the performance and optimization of the existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, Pair RDDs, and Spark on YARN.

Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data. Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.

Performed transformations, cleaning, and filtering on imported data using Hive and MapReduce, and loaded the final data into HDFS.

Worked on tuning Hive and Pig to improve performance and solve performance-related issues in Hive and Pig scripts, with a good understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.

Imported data from different sources like HDFS and HBase into Spark RDDs.

Developed a data pipeline using Kafka and Storm to store data into HDFS.

Performed real time analysis on the incoming data.

Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
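
An illustrative micro-batch sketch using the classic DStream API in PySpark; the socket source and 10-second batch interval are stand-ins for the real ingestion path:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="microbatch-demo")
    ssc = StreamingContext(sc, batchDuration=10)  # group the stream into 10-second batches

    # Each batch of lines is processed by the Spark engine like a small RDD job.
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()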

Implemented Spark using Scala and SparkSQL for faster testing and processing of data.

Environment: Apache Hadoop, HDFS, MapReduce, Sqoop, Flume, Pig, Hive, HBASE, Oozie, Scala, Spark, Linux.

Employer: Oracle, India Jan 2017 – June 2020

Role: Big Data Consultant

Responsibilities:

Experience with the Hadoop ecosystem and NoSQL databases.

Expertise in implementing Spark Scala applications using higher-order functions for both batch and interactive analysis requirements.

Created Reports with different Selection Criteria from Hive Tables on the data residing in Data Lake.

Deployed Hadoop components on the Cluster like Hive, HBase, Spark, Scala and others with respect to the requirement.

Used Spark UI to observe the running of a submitted Spark Job at the node level.

Developed Spark jobs and Hive jobs to summarize and transform data.

Developed code from scratch in Spark using Scala according to the technical requirements.

Involved in developing different components of the system, such as Hadoop processes involving MapReduce and Hive.

Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and Scala.

Used the Scala collections framework to store and process complex consumer information.

Worked on tuning Hive and Pig to improve performance and solved performance issues in Hive and Pig scripts, with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.

Used both the Hive context and the SQL context of Spark to do the initial testing of Spark jobs.

Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and Spark.

Involved in creating Hive tables, loading with data and writing hive queries.

Developed Sqoop scripts for importing and exporting data into HDFS and Hive.

Exported the data into an RDBMS using Sqoop for the BI team to perform visualization and generate reports.

Used Spark Streaming to receive near real-time data from Kafka and store the streamed data in HDFS and NoSQL stores using Scala.

Involved in analyzing data by writing HiveQL queries for faster data processing.

Created a customized BI tool for the management team that performs query analytics using HiveQL.

Developed Oozie workflows scheduled through AutoSys on a monthly basis.

Created a complete processing engine, based on the Hortonworks distribution, enhanced for performance.

Utilized the Agile and Scrum methodology to help manage and organize a team of developers with regular code review sessions.

Environment: Hadoop, HDFS, MapReduce, Sqoop, Flume, Pig, Hive, HBASE, Oozie, Scala, Spark, Linux.

Indus Infotech, India April 2014 – Dec 2016

Role: Hadoop Developer

Responsibilities:

Developed several advanced MapReduce programs to process data files received.

Developed MapReduce programs for data analysis and data cleaning.

Firm knowledge of various summarization patterns to calculate aggregate statistical values over datasets.

Experience in implementing joins in the analysis of datasets to discover interesting relationships.

Completely involved in the requirement analysis phase.

Extending Hive and Pig core functionality by writing custom UDFs.

Worked on partitioning Hive tables and running the scripts in parallel to reduce the run time of the scripts.

Strong expertise in internal and external tables in Hive; created Hive tables to store the processed results in a tabular format.

Implemented partitioning, dynamic partitions, and buckets in Hive.

Developed Pig Scripts and Pig UDFs to load data files into Hadoop.

Analyzed the data by performing Hive queries and running Pig scripts.

Developed Pig Latin scripts for the analysis of semi-structured and unstructured data.

Strong knowledge on the process of creating complex data pipelines using transformations, aggregations, cleansing and filtering.

Experience in writing cron jobs to run at regular intervals.

Wrote ETL scripts in Python/SQL for extraction and validating the data.

Created data models in Python to store data from various sources.
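
As a toy sketch of this kind of Python data model with a validation step (field names and rules are invented for illustration):

    from dataclasses import dataclass
    from typing import Iterable, Iterator

    @dataclass
    class LogRecord:
        """Simple data model for one parsed log line."""
        user_id: int
        url: str
        status: int

    def parse_and_validate(lines: Iterable[str]) -> Iterator[LogRecord]:
        """Extract fields from comma-separated lines and drop malformed rows."""
        for line in lines:
            parts = line.strip().split(",")
            if len(parts) != 3:
                continue  # skip rows that do not match the expected layout
            user_id, url, status = parts
            if not user_id.isdigit() or not status.isdigit():
                continue  # basic type validation before loading
            yield LogRecord(int(user_id), url, int(status))

    if __name__ == "__main__":
        sample = ["42,/home,200", "bad row", "7,/cart,500"]
        for record in parse_and_validate(sample):
            print(record)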

Developed MapReduce jobs for Log Analysis, Recommendation and Analytics.

Experience in using Flume to efficiently collect, aggregate and move large amounts of log data.

Involved in loading data from edge node to HDFS using shell scripting.

Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.

Experience in managing and reviewing Hadoop log files.

Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and Hadoop log files.

Environment: Hadoop, Python, Apache Pig, Apache Hive, MapReduce, HDFS, Flume, GIT, UNIX Shell scripting, PostgreSQL, Linux.


