Data Engineer

Location:
The Bronx, NY
Posted:
April 02, 2023

Rosie DasGupta

Email: adv99r@r.postjobfree.com

Status: Green Card Holder

Phone: 646-***-****

Bronx, New York

Professional Summary:

4+ years of data processing and analysis experience; handled high-volume data and supported various technology stacks.

Experienced in use-case development and in software methodologies such as Agile and Waterfall.

Experience in Big Data and Hadoop administration ecosystems, including HDFS, Pig, Hive, Impala, HBase, YARN, Sqoop, Flume, Oozie, Hue, MapReduce, and Spark.

Worked with the Play framework and Akka for parallel processing.

Responsible for delivery in big data engineering with Python and Spark, with a good understanding of machine learning algorithms and data analysis.

Installed, supported, and managed Hadoop clusters using CDH and HDP

Extensively analyzed data using HQL, Pig Latin, and custom MapReduce programs in Java and Python.

Expertise in deployment of Hadoop and YARN, and in Spark integration with Cassandra.

Partitioned and bucketed data, designed and managed datasets, and created external tables in Hive to optimize performance.

Experience in importing and exporting data between HDFS and RDBMS using Sqoop

Strong experience and knowledge of real time data analytics using Spark, Kafka, and Flume

Carried out POCs on migrating to Spark Streaming to process live data.

Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS to Hive.

Utilized Spark Core, Spark Streaming and Spark SQL API for faster processing of data instead of using MapReduce in Java.

Responsible for data extraction and data integration from different data sources into Hadoop Data Lake by creating ETL pipelines Using Spark, MapReduce, Pig, and Hive.

Involved in converting Hive/SQL queries into Spark transformations using Spark Data frames and Scala.
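
A minimal Scala sketch of this kind of Hive-to-Spark conversion, assuming a Hive-enabled Spark 2.x session; the orders table and its columns are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HiveToSparkExample {
  def main(args: Array[String]): Unit = {
    // Hive-enabled session; table and column names below are illustrative only
    val spark = SparkSession.builder()
      .appName("hive-to-spark-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Equivalent of: SELECT customer_id, SUM(amount) FROM orders
    //                WHERE order_date >= '2022-01-01' GROUP BY customer_id
    val orders = spark.table("orders")
    val totals = orders
      .filter(col("order_date") >= lit("2022-01-01"))
      .groupBy("customer_id")
      .agg(sum("amount").alias("total_amount"))

    totals.write.mode("overwrite").saveAsTable("orders_totals")
    spark.stop()
  }
}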

Worked with the team on fetching live-stream data from DB2 into HBase tables using Spark Streaming and Apache Kafka.

Loaded data into Spark RDDs and performed in-memory computation to generate output responses.

Used Spark for interactive queries, streaming data processing, and integration with popular NoSQL databases for high-volume data.

Wrote Pig scripts to clean up ingested data and created partitions for daily data.

Developed Spark programs in Scala, applying functional programming principles to process complex unstructured and structured data sets.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.

Analyzed the SQL scripts and designed the solution to implement using Spark.

Involved in converting MapReduce programs into Spark transformations using Spark RDD in Scala.

Experienced in Apache Spark for implementing advanced procedures like text analytics and processing using the in-memory computing capabilities written in Scala.

Hands-on with real-time data processing using distributed technologies such as Storm and Kafka.

Experienced in implementing Kerberos authentication protocol in Hadoop for data security.

Experienced in writing queries and sub-queries for SQL, Hive, Impala and Spark; and used different Spark modules like Spark Core, Spark RDDs, Spark Data frame and Spark SQL.

Experienced in converting Hive queries into Spark Transformations and Actions

Worked with data serialization formats such as Avro, Parquet, JSON, and CSV for converting complex objects into sequences of bytes.
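
As an illustration, a hedged Spark/Scala sketch of moving the same data between these formats; the paths and schema are placeholders, and the Avro output assumes the external spark-avro package is on the classpath:

import org.apache.spark.sql.SparkSession

object FormatConversionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("format-sketch").getOrCreate()

    // Read CSV with a header and infer the schema (paths are placeholders)
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/in/events.csv")

    // Write the same data out as columnar Parquet and row-oriented JSON
    raw.write.mode("overwrite").parquet("/data/out/events_parquet")
    raw.write.mode("overwrite").json("/data/out/events_json")

    // Avro requires the external spark-avro package (assumption)
    raw.write.mode("overwrite").format("avro").save("/data/out/events_avro")

    spark.stop()
  }
}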

Strong command over relational databases including MySQL, Oracle, MS SQL Server, and MS Access

Experience providing 24x7 production support, including weekends, on a rotation basis.

Professional Experience:

Data Engineer
Company Name: TCS, Client Name: VISA (Apr 2022 – Feb 2023)

Responsibilities:

Involved in requirement gathering and coordinated with the offshore team on task assignment and completion within SLA.

Responsible for creating Hive external tables and views, validating metadata, and running Autosys jobs to process the end-to-end Sqoop ingestion pipeline.

Created files in formats such as Parquet and ORC and loaded them into the Hive data warehouse.

Wrote Spark programs to ingest data into Hive tables for downstream access.

Involved in developing scripts that ingest data from various data sources using table metadata and connection credentials, implemented with shell scripts and Sqoop.

Ran jobs on the server from shell scripts using DB2, Hive, and HQL.

Created external and managed tables in Hive.

Created and loaded delta tables.

Worked with distributed data processing frameworks such as Hive on MapReduce.

Responsible for data extraction and data integration from different data sources into Hadoop Data Lake by creating ETL pipelines Using Spark, Dataframe API, and Hive.

Implemented Partitioning, Dynamic Partitions, and Buckets in HIVE.
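
A hedged Scala sketch of dynamic partitioning and bucketing along these lines; the txn_by_day, staging_txn, and txn_bucketed tables are hypothetical and assumed to exist where required:

import org.apache.spark.sql.SparkSession

object HivePartitionBucketSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-bucket-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Allow dynamic-partition inserts (Hive settings passed through Spark SQL)
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // Dynamic-partition insert: the txn_date partition is derived per row
    // (assumes the partitioned table txn_by_day already exists)
    spark.sql(
      """INSERT OVERWRITE TABLE txn_by_day PARTITION (txn_date)
        |SELECT txn_id, amount, txn_date FROM staging_txn""".stripMargin)

    // Bucketed table written through the DataFrame API (Spark-managed buckets)
    spark.table("staging_txn")
      .write
      .mode("overwrite")
      .bucketBy(32, "txn_id")
      .sortBy("txn_id")
      .saveAsTable("txn_bucketed")

    spark.stop()
  }
}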

Codebase was mostly Scala and SQL, with extensive Spark calls.

Worked on a Hadoop migration in which the Hortonworks platform was migrated to an open-source system.

Technology: Hadoop, Hive, HQL, DB2, Git, Maven, Unix/Linux, Cloudera, Hortonworks.

Deutsche Bank, New York City, NY (April 2019 – March 2022) Data Engineer/Big Data Developer

Responsibilities:

Designed and developed various modules in Hadoop Big Data platform and processed data using MapReduce, Hive, Sqoop, Kafka and Oozie

Managed jobs using the Fair Scheduler and developed job-processing scripts using Oozie workflows.

Used Spark and Hive to implement the transformations needed to join daily ingested data with historic data.

Used Spark-Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model which gets the data from Kafka in near real time.
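
A hedged sketch of consuming Kafka in near real time from Spark; note it uses the newer Structured Streaming API rather than the DStream API of the Spark 1.6 listed in the environment below, and the broker, topic, and paths are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-stream-sketch").getOrCreate()

    // Subscribe to a Kafka topic (broker and topic names are assumptions)
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "learner-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS payload", "timestamp")

    // Simple on-the-fly transformation before landing the data
    val parsed = events.withColumn("ingest_date", to_date(col("timestamp")))

    val query = parsed.writeStream
      .format("parquet")
      .option("path", "/data/stream/learner")
      .option("checkpointLocation", "/data/stream/_checkpoints/learner")
      .start()

    query.awaitTermination()
  }
}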

Developed Spark scripts by using Scala shell commands as per the requirement.

Used Spark API over EMR Cluster Hadoop YARN to perform analytics on data in Hive.

Developed Scala scripts and UDFs using DataFrames, Spark SQL, Datasets, and RDDs for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
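
A minimal Scala sketch of a DataFrame UDF plus aggregation along these lines; the table and column names are hypothetical, and the Sqoop export back to the OLTP system would run as a separate step:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object UdfAggregationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-agg-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // UDF that normalizes a free-text region code (illustrative logic)
    val normalizeRegion = udf((region: String) =>
      Option(region).map(_.trim.toUpperCase).getOrElse("UNKNOWN"))

    val sales = spark.table("sales_raw")
      .withColumn("region_norm", normalizeRegion(col("region")))

    // Aggregate per normalized region; results would be exported downstream
    val summary = sales
      .groupBy("region_norm")
      .agg(count(lit(1)).alias("txn_count"), sum("amount").alias("total_amount"))

    summary.write.mode("overwrite").saveAsTable("sales_region_summary")
    spark.stop()
  }
}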

Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory tuning.

Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.

Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.

Developed reusable transformations to load data from flat files and other data sources to the Data Warehouse.

Assisted the operations support team with transactional data loads by developing SQL Loader and Unix scripts.

Implemented Spark SQL queries that intermix Hive queries with the programmatic data manipulations supported by RDDs and DataFrames in Scala and Python.

Implemented a Python script to call the Cassandra REST API, performed transformations, and loaded the data into Hive.

Worked extensively with Python and built a custom ingest framework.

Experienced in handling large datasets using partitioning, Spark's in-memory capabilities, broadcast variables, and efficient joins and transformations during the ingestion process itself.
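
As one illustration of this join tuning, a hedged Scala sketch of broadcasting a small dimension table against a large fact table; the table names are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Large fact table and small dimension table (hypothetical names)
    val clicks = spark.table("clicks_fact")
    val devices = spark.table("device_dim")

    // Broadcasting the small side avoids a full shuffle of the large side
    val enriched = clicks.join(broadcast(devices), Seq("device_id"), "left")

    // Repartition before writing to control output file sizes (tuning choice)
    enriched.repartition(200).write.mode("overwrite").saveAsTable("clicks_enriched")
    spark.stop()
  }
}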

Experienced in writing real-time processing jobs using Spark Streaming with Kafka.

Created Cassandra tables to store various data formats of data coming from different sources.

Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.

Worked extensively with Sqoop for importing metadata from Oracle.

Involved in creating Hive tables and loading and analyzing data using hive queries.

Developed Hive queries to process the data and generate the data cubes for visualizing.

Implemented schema extraction for Parquet and Avro file Formats in Hive.

Good experience with Talend Open Studio for designing ETL jobs for data processing.

Implemented Partitioning, Dynamic Partitions, Buckets in HIVE.

Involved in file movements between HDFS and AWS S3.

Extensively worked with S3 bucket in AWS.

Used Reporting tools like Tableau to connect with Hive for generating daily reports of data.

Collaborated with the infrastructure, network, database, application, and BI teams to ensure data quality and availability.

Environment: Hadoop YARN, Spark 1.6, Spark Streaming, Spark SQL, Scala, Kafka, Python, Hive, Sqoop 1.4.6, Impala, Tableau, Talend, Oozie, Control-M, Java, AWS S3, Oracle 12c, Linux

New York Life Insurance, Sleepy Hollow, NY (Jan 2018 – Mar 2019) Data Engineer/Big Data Developer

Responsibilities:

Implemented Spark applications in Scala, utilizing DataFrames and the Spark SQL API for faster data processing.

Converted existing MapReduce jobs into Spark transformations and actions using Spark RDDs, Data frames and Spark SQL APIs.
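
A minimal sketch of this MapReduce-to-Spark conversion pattern: a word-count style job expressed as RDD transformations and actions, with placeholder input and output paths:

import org.apache.spark.sql.SparkSession

object MapReduceToSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mr-to-spark-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Mapper -> flatMap/map, Reducer -> reduceByKey
    val counts = sc.textFile("/data/in/logs")
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    // The action triggers the computation, analogous to submitting the MR job
    counts.saveAsTextFile("/data/out/word_counts")
    spark.stop()
  }
}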

Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze customer behavioral data.

Worked on Big Data infrastructure for batch processing as well as real-time processing. Responsible for building scalable distributed data solutions using Hadoop.

Developed real time data processing applications by using Scala and Python and implemented Apache Spark Streaming from various streaming sources like Kafka.

Developed Spark jobs and Hive Jobs to summarize and transform data

Expertise in implementing Spark Scala application using higher order functions for both batch and interactive analysis requirement.

Experienced in developing Spark scripts for data analysis in Scala.

Used Spark-Streaming APIs to perform necessary transformations.

Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and Scala.

Worked with Spark to consume data from Kafka and convert it to a common format using Scala.

Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and Spark.

Wrote new Spark jobs in Scala to analyze customer and sales-history data.

Involved in requirement analysis, design, coding, and implementation phases of the project.

Used Spark API over Hadoop YARN to perform analytics on data in Hive.

Experience with both SQLContext and SparkSession.

Developed Scala based Spark applications for performing data cleansing, data aggregation, de-normalization and data preparation needed for machine learning and reporting teams to consume.
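
A hedged Scala sketch of the cleansing and de-normalization steps described here, assuming hypothetical policies_raw and customers_dim tables:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CleansePrepareSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cleanse-prepare-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Cleansing: drop rows missing keys, remove duplicates, standardize dates
    val policies = spark.table("policies_raw")
      .na.drop(Seq("policy_id", "customer_id"))
      .dropDuplicates("policy_id")
      .withColumn("issue_date", to_date(col("issue_date"), "yyyy-MM-dd"))

    // De-normalization: flatten customer attributes onto each policy row
    val customers = spark.table("customers_dim")
    val prepared = policies.join(customers, Seq("customer_id"), "left")

    prepared.write.mode("overwrite").saveAsTable("policies_prepared")
    spark.stop()
  }
}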

Worked on troubleshooting Spark applications to make them more error-tolerant.

Involved in HDFS maintenance and loading of structured and unstructured data; imported data from mainframe datasets to HDFS using Sqoop and wrote Spark scripts to process the HDFS data.

Extensively worked on the core and Spark SQL modules of Spark.

Involved in Spark and Spark Streaming, creating RDDs and applying transformations and actions.

Used Impala to read, write and query the data in HDFS.

Stored output files on HDFS for export; these files were later picked up by downstream systems.

Loaded data into Spark RDDs and performed in-memory computation to generate output responses.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.

Environment: Hadoop 2.x, Spark Core, Spark SQL, Spark Streaming, PySpark, Hive, Oozie, Amazon EMR, Tableau, Impala, RDBMS, YARN, JIRA, MapReduce.


