Data Engineer

Jersey City, NJ


Shashank Jadhav

Mobile Number: 917-***-**** Email: adz5mt@r.postjobfree.com

LinkedIn: https://www.linkedin.com/in/shashank-j/

Data Engineer: 9+ years of overall experience as a Data Engineer with expertise in the extraction, transformation, and analysis of data using PySpark and Hadoop ecosystem technologies. In-depth knowledge of Hadoop and PySpark architecture. Experience working with tools and technologies such as HDFS, Hive, Spark SQL, Spark Streaming, Kafka, and Sqoop. Worked in corporate domains such as Transportation, Sales and Retail, and Information Systems and Software.

PROFESSIONAL SUMMARY:

●Hands-on experience in developing and deploying enterprise applications using major Hadoop ecosystem components such as MapReduce and Hive.

●Good experience working as a Data Engineer on Amazon cloud services, Big Data/Hadoop applications, and product development.

●Hands-on experience developing Spark applications, with strong exposure to Spark SQL, Spark Streaming, and the core Spark API for building data pipelines.

●Worked on data cleansing using scripts written in Spark, MapReduce, and Pig.

●Good working experience using Sqoop to import data from RDBMS into HDFS and export it back.

●Extensively developed Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation, queries, and data transformation.

●Experience developing data pipelines in Spark using Python to perform ETL jobs (a minimal sketch follows this list).

●Good experience in creating and designing data ingestion pipelines using technologies such as Kafka to handle real-time data.

●Orchestrated multiple Spark application jobs and monitored their scheduling and triggering using the Apache Airflow scheduler.

●Proficiently extracted data from Amazon S3 buckets with PySpark and stored it in Hive tables for further analysis.

●Implemented various optimization techniques to improve execution time and resource consumption by writing queries in HiveQL and Spark SQL.

●Extensively worked with AWS S3 bucket storage to store data files in different formats.

●Hands-on experience with sequence files, combiners, counters, dynamic partitions, and bucketing for best practices and performance improvement using HiveQL.

●Experience in extraction, transformation, and loading (ETL) of data from various sources into data warehouses using Kafka and AWS Kinesis for further analysis.

●Hands-on knowledge in extracting source data from sequential files, XML files, CSV files, and JSON files, then transforming and loading it into the target data warehouse.

●Worked on data processing such as collecting, aggregating, and moving data from various sources using Spark, Kafka, and Power BI.

●Excellent coding experience in Python and worked with different Python libraries to develop applications.

●Extensively worked on SQL scripts and developed stored procedures, triggers, functions, and packages using SQL and PL/SQL.

●Experienced with dynamic partitioning and bucketing on big data to optimize the execution time of MapReduce operations using Hive.

●Created custom functions, alongside Hive built-in functions, to process records or groups of records and perform different data transformations.

●Experience working with cloud services like Amazon Web Services and Azure.

●Followed software methodologies such as Agile and Waterfall while working on different projects.

●Ample knowledge of data architecture including data ingestion pipeline design, Hadoop/Spark architecture, data modeling, data mining, and advanced data processing.

●Maintained project repositories on GitHub for source and version control.
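A minimal PySpark batch-ETL sketch of the kind of pipeline described above. The bucket, column, and table names are hypothetical and shown only for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical bucket, column, and table names, for illustration only.
spark = (
    SparkSession.builder
    .appName("daily-sales-etl")
    .enableHiveSupport()
    .getOrCreate()
)

# Extract: read raw CSV files landed in S3.
raw = spark.read.option("header", "true").csv("s3://example-bucket/raw/sales/")

# Transform: basic cleansing and enrichment.
clean = (
    raw.dropDuplicates(["order_id"])
       .withColumn("order_ts", F.to_timestamp("order_ts"))
       .withColumn("order_date", F.to_date("order_ts"))
       .filter(F.col("amount").cast("double") > 0)
)

# Load: write to a partitioned Hive table for downstream analysis.
(clean.write
      .mode("overwrite")
      .partitionBy("order_date")
      .saveAsTable("analytics.sales_clean"))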

EDUCATION DETAILS:

Master of Science in Computer Science 2019 - 2021

Pace University, NYC

Bachelor of Engineering in Information Technology 2010 - 2014

Savitribai Phule Pune University, Pune

TECHNICAL SKILLS:

Big Data Technologies: PySpark, Hadoop, HDFS, Spark, Hive, Sqoop, Pig, MapReduce, Kafka

Databases: MySQL, Oracle, Microsoft SQL Server, PostgreSQL, MongoDB, Cassandra, AWS Redshift, AWS RDS

Programming Languages and Tools: Python, HiveQL, SQL, C, C++, Visual Studio Code, PyCharm, Jupyter Notebooks

Python Libraries: Matplotlib, Pandas, NumPy, Scikit-learn

Cloud Services: AWS S3, AWS EC2, AWS Redshift, AWS Kinesis, AWS EMR, AWS Glue, Snowflake

BI Tools: Power BI, Tableau

Version Control and Project Tools: Git, Jira, Azure DevOps

PROFESSIONAL EXPERIENCE:

Hewlett Packard Inc, San Diego, CA Feb 2022 - Present

AWS Data Engineer

Responsibilities:

●Worked on designing and developing data pipelines for a product recommendation system using PySpark, AWS, and REST APIs, covering data extraction, transformation, manipulation, and computation.

●Worked on querying structured data stored in the AWS data lake using AWS EMR clusters to perform various PySpark operations.

●Implemented Spark applications in Python, utilizing DataFrames and the PySpark SQL API for faster data processing.

●Extracted data from different file formats such as text, CSV, XML, and JSON, then transformed and loaded it into the target data warehouse.

●Efficiently handled big data by implementing multiple PySpark jobs to integrate a variety of raw data sources such as AWS Redshift and RDS.

●Responsible for migrating consumed data into the Amazon S3-based data lake using PySpark for further batch processing.

●Extracted raw data from S3 buckets using PySpark SQL queries and processed it on-premises for further transformation.

●Developed Spark scripts in Python using RDDs to read and write files in CSV, Parquet, and JSON formats.

●Leveraged the PySpark DataFrame API and PySpark SQL functions such as aggregate, window, JSON, and date/timestamp functions to process raw data.

●Joined and aggregated PySpark DataFrames to combine relational data and group it by different conditions.

●Leveraged AWS Glue for data integration, data migration, and analysis while designing the data pipeline.

●Configured AWS Glue jobs and development endpoints to use the Glue Data Catalog as an external Apache Hive metastore, running Spark SQL queries directly against the catalog (see the sketch after this list).

●Developed PySpark jobs on RDDs for creating them, inspecting their contents, and parsing the data they hold.

●Filtered data using PySpark DataFrame transformations, performing row and column operations such as dropping, replacing, modifying, and sorting.

●Implemented multiple PySpark jobs for filtering raw data and applied different business rules and algorithms for enrichment.

●Responsible for storing processed results in Hive tables using PySpark for further business analysis and data science research.

●Implemented partitioning and bucketing on records in Hive tables and applied different types of joins to multiple bucketed tables using HiveQL.

●Monitored task scheduling and triggering using the Apache Airflow scheduler to control the different ETL jobs.

●Maintained and monitored PySpark jobs that handled data migration, transformation, and loading into various target data sources.

●Worked with GitHub and Azure DevOps as CI/CD tools for version control and maintained project data in different repositories.
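A hedged sketch of the Glue Data Catalog / Hive metastore setup mentioned in this list. The factory-class property shown is the one commonly used on EMR, but the database, table, and query here are hypothetical, and in practice this setting is often applied through cluster configuration rather than in application code:

from pyspark.sql import SparkSession

# Point Spark's Hive client at the AWS Glue Data Catalog (EMR-style setting;
# often configured on the cluster rather than in code).
spark = (
    SparkSession.builder
    .appName("glue-catalog-queries")
    .config(
        "spark.hadoop.hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

# With the catalog wired in, Spark SQL can query Glue-registered tables directly.
# Database and table names below are hypothetical.
daily_totals = spark.sql("""
    SELECT product_id, SUM(amount) AS total_amount
    FROM recommendations_db.order_events
    GROUP BY product_id
""")
daily_totals.show()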

Environment and Technologies: PySpark, Python, Spark SQL, AWS EMR, AWS S3, GitHub, Azure DevOps.

Metropolitan Transportation Authority, New York City, NY Feb 2020 – Feb 2022

Data Engineer

Responsibilities:

●Worked on designing and developing ETL pipelines using PySpark, Kafka, and Spark Structured Streaming, covering data ingestion, transformation, manipulation, and computation (see the streaming sketch after this list).

●Worked on querying structured data stored in DataFrames or Hive tables using the Spark SQL API to perform various SQL operations on the data.

●Implemented Spark applications in Python, utilizing DataFrames and the Spark SQL API for faster data processing.

●Extracted data from different file formats such as text, CSV, XML, and JSON, then transformed and loaded it into the target data warehouse.

●Efficiently handled real-time data by implementing multiple PySpark jobs to integrate a variety of raw data sources such as Kafka.

●Responsible for migrating consumed data into the Amazon S3-based data lake using PySpark for further batch processing.

●Extracted raw data from S3 buckets using Databricks to write PySpark SQL queries, and processed the raw data into HDFS for further transformation.

●Developed Spark scripts in Python using RDDs to read and write files in CSV, Parquet, and JSON formats.

●Used the PySpark DataFrame API and PySpark SQL functions such as aggregate, window, JSON, and date/timestamp functions to process raw data.

●Joined and aggregated PySpark DataFrames to combine relational data and group it by different conditions.

●Developed PySpark jobs on RDDs for creating them, inspecting their contents, and parsing the data they hold.

●Filtered data using PySpark DataFrame transformations, performing row and column operations such as dropping, replacing, modifying, and sorting.

●Ran multiple PySpark jobs for cleansing raw data and applied different business rules and algorithms for enrichment.

●Processed PySpark DataFrames efficiently using different PySpark UDFs (user-defined functions).

●Responsible for storing processed results in Hive tables using PySpark for further business analysis and data science research.

●Implemented partitioning and bucketing on records in Hive tables and applied different types of joins to multiple bucketed tables using HiveQL.

●Monitored task scheduling and triggering using the Apache Airflow scheduler to control the different ETL jobs.

●Involved in data visualization of results stored in Hive tables using Tableau for further data analysis.

●Worked with GitHub for version control and CI/CD and maintained project data in different repositories.
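A minimal sketch of the Kafka-to-S3 Structured Streaming flow referenced in the first bullet of this list, including a small UDF of the kind mentioned above. The broker address, topic, message schema, and S3 paths are hypothetical, and the Spark-Kafka connector package is assumed to be available on the classpath:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructType, StructField, DoubleType

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Hypothetical schema for JSON messages on the topic.
schema = StructType([
    StructField("route_id", StringType()),
    StructField("status", StringType()),
    StructField("delay_minutes", DoubleType()),
])

# Example UDF: normalize free-text status values.
normalize_status = F.udf(lambda s: (s or "").strip().upper(), StringType())

# Read the raw stream from Kafka and parse the JSON payload.
events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("subscribe", "transit-events")
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
         .select("e.*")
         .withColumn("status", normalize_status(F.col("status")))
)

# Land the parsed events in S3 as Parquet for later batch processing.
query = (
    events.writeStream
          .format("parquet")
          .option("path", "s3://example-bucket/lake/transit-events/")
          .option("checkpointLocation", "s3://example-bucket/checkpoints/transit-events/")
          .start()
)
query.awaitTermination()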

Environment and Technologies: PySpark, Python, Linux, Kafka, StreamSets, Spark SQL, Spark, AWS Glue, Structured Streaming, AWS EC2, S3, Git.

Python Spark Developer Nov 2017 – Aug 2019

Dell Technologies, Dallas, TX

Responsibilities:

●Implemented Spark using Python (PySpark) and Spark SQL for faster testing and processing of data and loading the data into Spark RDDs.

●Developed Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries.

●Improved the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

●Created modules for streaming data into the data lake using Spark and performing batch processing on that data.

●Developed various PySpark jobs to manipulate and apply feature engineering to large-scale data using AWS EMR.

●Scheduled and triggered Airflow DAGs at various frequencies to execute different PySpark service jobs (see the DAG sketch after this list).

●Migrated data from sources such as Redshift, S3 buckets, and RDS to target data stores.

●Extracted and transformed big data by applying dedicated business rules, leveraging tools such as EMR, PySpark, PySpark SQL, and Airflow.

●Performed transformations, cleaning, and filtering on imported data using the Spark DataFrame API, Hive, and MapReduce, and loaded the final data into Hive.

●Involved in converting MapReduce programs into Spark transformations using Spark RDDs in Python with the help of user-defined functions.

●Developed Spark scripts using Python and shell commands as required by different operating environments.

●Developed PySpark scripts to enrich raw data by applying hashing algorithms to client-specified columns.

●Utilized the Spark SQL API in PySpark to extract and load data, and performed SQL queries applying functions such as aggregate, window, JSON, and date/timestamp.

●Aggregated daily team updates to send reports to executives and to organize jobs running on Spark clusters.

●Extensively used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.

●Developed custom aggregate functions using Spark SQL and performed interactive querying.

●Implemented Apache Airflow for authoring, scheduling, and triggering various ETL jobs and monitored different data pipelines.

●Responsible for processing data and performing extract, transform, and load (ETL) operations, as well as creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.

●Performed various tasks on Amazon Redshift to optimize large-scale data queries and extract data from S3 into Redshift.
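A hedged sketch of an Airflow DAG that schedules a PySpark job via spark-submit, as in the Airflow bullets above. The DAG id, schedule, script path, and operator import path (which varies between Airflow versions) are assumptions for illustration:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator  # import path differs in older Airflow versions

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Hypothetical DAG: run a daily enrichment PySpark job.
with DAG(
    dag_id="daily_enrichment",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    run_enrichment = BashOperator(
        task_id="run_enrichment_job",
        bash_command=(
            "spark-submit --deploy-mode cluster "
            "s3://example-bucket/jobs/enrich_records.py --run-date {{ ds }}"
        ),
    )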

Environment and Technologies: Python, NumPy, Pandas, PySpark, Spark, Spark SQL, AWS Glue, AWS Redshift, Pig, Apache Airflow.

Software Engineer Sep 2015 - Oct 2017

Siemens Information Systems Ltd., India.

Responsibilities:

●Involved in the complete Software Development Life Cycle (SDLC) to develop the application.

●Designed and developed a web application using the Django web framework with Python models and scripts.

●Developed web applications implementing Model-View-Controller (MVC) architecture with server-side applications in the Django framework.

●Developed a server-side web traffic statistical analysis tool exposing RESTful APIs with Flask (a minimal sketch follows this list).

●Designed and developed the user interface using front-end technologies such as HTML, CSS, JavaScript, jQuery, Bootstrap, and JSON.

●Worked with the MySQL database, writing queries and stored procedures to reduce network traffic and implement business logic on the data.

●Wrote Python code handling JSON and XML to issue HTTP GET requests and parse HTML data from websites.

●Implemented and modified various SQL queries, functions, cursors, and triggers as per client requirements.

●Worked on the design, development, and testing phases of the application using Agile methodology.

●Implemented MVC architecture in developing the web application with the help of the Django framework.

●Worked in various Python integrated development environments such as Visual Studio Code and Jupyter Notebook.

●Performed data collection and data visualization using Python packages such as NumPy, Pandas, Web2py, and CherryPy.

●Used Python to write data into JSON files for testing Django websites.

●Created the front-end design of web applications using HTML, CSS, and JavaScript.
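A minimal Flask sketch of a REST-style web traffic statistics endpoint of the kind described in this list. The routes, payload fields, and in-memory counter are hypothetical; a real deployment would persist counts to MySQL as noted above:

from collections import Counter

from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory store of page hits, for illustration only.
page_hits = Counter()

@app.route("/hit", methods=["POST"])
def record_hit():
    # Record one page view sent by the front end.
    page = request.get_json(force=True).get("page", "unknown")
    page_hits[page] += 1
    return jsonify({"page": page, "hits": page_hits[page]})

@app.route("/stats", methods=["GET"])
def stats():
    # Return aggregate traffic counts for the dashboard.
    return jsonify(dict(page_hits))

if __name__ == "__main__":
    app.run(debug=True)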

Environment and Technologies: Python, Node.js, HTML, CSS, Django, JavaScript, jQuery, Bootstrap, JSON, MySQL, Jupyter Notebook, NumPy, Pandas, Matplotlib, Seaborn, Web2py, and CherryPy.

Jr. Python Developer Oct 2013 - Aug 2015

Zensar Technologies, India.

Responsibilities:

●Designed the front end of the application using Python, HTML, CSS, Ajax, JSON, and jQuery.

●Analyzed system requirement specifications through client interaction during requirements gathering.

●Used JavaScript and XML to update portions of a web page and Node.js for server-side interaction.

●Worked as part of the team implementing REST APIs in Python using the Flask micro-framework with SQLAlchemy on the backend for management of the data center resources on which OpenStack would be deployed.

●Worked with Python OpenStack APIs and used NumPy for numerical analysis.

●Implemented the Rally OpenStack benchmarking tool across the entire cloud environment.

●Designed and configured database and backend applications and programs.

●Developed, tested, and debugged software tools used by clients and internal customers.

●Experienced in designing test plans and test cases, and verifying and validating web-based applications.

●Used a test-driven approach for developing the application and implemented unit tests using the Python unittest framework (a minimal sketch follows this list).

●Reviewed Python code, ran troubleshooting test cases, and resolved bug issues.
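A minimal sketch of the test-driven style mentioned above, using the Python unittest framework. The helper function under test is hypothetical and only illustrates the pattern:

import unittest

def normalize_hostname(name: str) -> str:
    # Hypothetical helper of the kind covered by unit tests: trim and lowercase.
    return name.strip().lower()

class NormalizeHostnameTest(unittest.TestCase):
    def test_strips_whitespace(self):
        self.assertEqual(normalize_hostname("  Compute-01  "), "compute-01")

    def test_empty_string(self):
        self.assertEqual(normalize_hostname(""), "")

if __name__ == "__main__":
    unittest.main()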

Environment and Technologies: Django, Flask, HTML, jQuery, MySQL, Python, PyCharm IDE, NumPy, Pandas, Matplotlib, Seaborn.


