SANJAY S
Email: ******.******@*****.*** LinkedIn: www.linkedin.com/in/sanjay-sihag Mobile: 919-***-****
Professional Summary:
5 years of professional experience in the IT industry as a Data Engineer/Big Data Engineer, with expertise in PySpark and the Hadoop ecosystem for data ingestion, transformation, analysis, and deployment.
Proficient with Spark Core, Spark SQL, and Spark Streaming for processing and transforming complex data in PySpark using in-memory computing capabilities.
Worked with Spark to improve the efficiency of existing algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
Experience in analyzing data using Python, SQL, Hive, PySpark, and Spark SQL for data mining, data cleansing, and data munging.
Expertise in using PySpark to read data from external sources, merge and enrich it, and load it into data warehouses.
Experienced in optimizing Spark performance using persist, cache, broadcast, efficient joins, and Spark SQL for faster testing and processing of data.
Proficient in Hive Query Language (HiveQL) and experienced in Hive performance optimization using static partitioning, dynamic partitioning, bucketing, and parallel execution.
Executed complex HiveQL queries to extract required data from Hive tables and wrote Hive UDFs.
Experience in loading and analyzing large datasets with the Hadoop framework (MapReduce, HDFS, Pig, Hive, Sqoop, Spark) and databases such as MongoDB, HBase, and Cassandra.
Experience in importing and exporting data with Sqoop between HDFS and relational database management systems (RDBMS).
Expertise in creating, debugging, scheduling, and monitoring jobs using Apache Airflow and Oozie.
Good understanding of Hadoop and YARN architecture, along with Hadoop daemons such as JobTracker, TaskTracker, NameNode, DataNode, and Resource/Cluster Manager.
Developed Spark applications using Spark SQL in Databricks to extract, transform, and aggregate data from multiple file formats, uncovering insights into customer usage patterns.
Experience working with the Spark ecosystem using Scala and Hive queries on data formats such as text files and Parquet.
Experience using Apache Spark with Scala to implement advanced procedures such as text analytics and processing using its in-memory computing capabilities.
Developed and maintained Scala applications that run on a cloud platform.
Developed an automated process in Azure that ingests data daily from a web service and loads it into Azure SQL DB.
Built pipelines in Azure Data Factory to move data from on-premises systems to Azure SQL Data Warehouse and from Amazon S3 buckets to Azure Blob Storage.
Worked on ETL migration services by developing and deploying AWS Lambda functions to build a serverless data pipeline whose output is written to the Glue Catalog and can be queried from Athena.
Well versed in big data on AWS, including storage and query services such as S3, Redshift, and Athena, and ETL tools such as EMR, Glue, and Airflow.
Used Kafka producers to ingest raw data into Kafka topics and ran a Spark Streaming application to process clickstream events.
Implemented a data warehouse solution comprising ETL processes and on-premises-to-cloud migration; strong expertise in building and deploying batch and streaming data pipelines in cloud environments.
Created a few Tableau dashboard reports and heat map charts, and supported numerous dashboards, pie charts, and heat map charts built on a Teradata database.
Experienced in writing SQL queries, stored procedures, functions, packages, tables, views, and triggers in relational databases such as Oracle, DB2, MySQL, PostgreSQL, and MS SQL Server.
Worked in a production environment, building CI/CD pipelines in Jenkins with stages ranging from code checkout from GitHub to deployment in a specific environment.
Working experience with UNIX/Linux commands, shell scripting, and deploying applications on servers.
Performed complex data enrichments and provided critical reports to support various departments.
Experienced with various Python IDEs and editors, including PyCharm, PyScripter, Spyder, PyStudio, PyDev, IDLE, NetBeans, and Sublime Text.
TECHNICAL SKILLS
Big Data Ecosystem: PySpark, HDFS, Hive, YARN, Pig, Sqoop, Kafka, Oozie, Spark, MapReduce, Airflow, PyTorch, TensorFlow
Hadoop Distributions: Cloudera CDP, Hortonworks HDP
Programming Languages: Python, SQL, HiveQL, Scala, JavaScript, Shell scripting
Databases and Data Warehouse: MySQL, MS SQL Server, Oracle, PostgreSQL, Apache Cassandra, MongoDB, HBase
Software Methodologies: Agile, SDLC Waterfall
Data Visualization: Tableau, Power BI, Alteryx
CI/CD Tools: Git, GitHub, Bitbucket, SVN, Jenkins
IDEs: PyCharm, Jupyter Notebook, Visual Studio
Cloud: AWS, EC2, S3, EMR, IAM, Glue, MS Azure, Azure Data Factory, ADLS, Azure Blob Storage, Snowflake, Databricks, GCP
Certification:
Google Data Analytics Professional Certificate, July 2021
Data Engineer
WBA, Illinois December 2021 – November 2023
Developed PySpark/Scala scripts to process streaming data ingested from data lakes using Spark Streaming.
Developed PySpark data processing pipelines that read data from external sources, merge and enrich it, and load it into data warehouses.
Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.
Proficient in developing APIs using Spark with Scala for distributed data processing and analytics.
Developed robust and scalable APIs that efficiently manage large volumes of data and leverage Spark’s parallel processing capabilities.
Developed a reusable Spark framework to enhance the efficiency and productivity of data processing tasks.
Transformed, cleaned, and filtered imported data using the Spark DataFrame API and loaded the final data into Hive.
Converted Hive/SQL queries into Spark transformations using Spark DataFrames and Python; analyzed the SQL scripts and designed the solution for implementation in PySpark.
Experience working with YAML configuration files for Spark to tune cluster and application settings.
Proven record of configuring Spark applications to efficiently utilize available cluster resources.
Strong critical thinking skills to troubleshoot and resolve configuration issues, ensuring smooth execution of Spark applications.
Developed scripts to load data from HDFS into Hive and ingested data into the data warehouse using various data loading techniques.
Collaborated with the data science team and developed pipelines to meet their needs.
Proficient in designing, developing, and maintaining Oracle Stored Procedures using PL/SQL to meet specific business requirements.
Developed error handling and exception management strategies within Oracle stored procedures to automate complex data operations.
Developed and maintained Airflow DAGs (Directed Acyclic Graphs) to schedule and monitor the execution of Oracle stored procedures.
Experience in implementing hybrid connectivity between Azure and on-premises environments using virtual networks, VPN, and ExpressRoute.
Planned and developed roadmaps and deliverables to advance the migration of existing on-premises systems and applications to Azure.
Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store (ADLS).
Successfully developed ADF pipelines to extract data from various source systems, including databases, APIs, and flat files, and load it into target destinations such as data lakes, data warehouses, and cloud storage.
Implemented data pipeline orchestration strategies, including scheduling, monitoring, and error handling, to ensure data integration processes run smoothly and reliably.
Optimized ADF pipeline performance by tuning data processing activities, partitioning, and parallelizing data loads, and leveraging ADF’s integration with Azure Databricks.
Experience developing Spark applications using Spark SQL in Databricks to extract, transform, and aggregate data from multiple file formats and uncover insights.
Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
Integrated Azure API Management to effectively manage and monitor API calls made from the framework.
Environment: PySpark, Python, Scala, Hadoop, Hive, Sqoop, Hive UDFs, MySQL, Linux, Ambari, Azure, ADLS, Databricks, Power BI, Tableau.
Data Engineer
Visa Inc, Texas October 2019 – November 2021
Analyzed large and critical datasets using Spark, Hive, Hive UDFs, MapReduce, Sqoop, Pig and HDFS.
Developed Spark jobs to validate, standardize, process, and transform data from the data center.
Used Spark SQL to load JSON data, create a SchemaRDD, and load it into Hive tables; managed structured data using Spark SQL.
Wrote complex Hive queries that join multiple tables to extract meaningful information related to the Spark job.
Developed Python scripts and UDFs using DataFrames and MapReduce in Spark for data aggregation, querying, and writing data back to HDFS.
Performed data cleansing, enrichment, mapping tasks and automated data validation processes to ensure meaningful and accurate data was reported efficiently.
Designed and created Hive external tables with static partitioning, dynamic partitioning, and bucketing, using a shared metastore instead of Derby.
Performed data wrangling to clean, transform, and reshape data using the pandas library.
Analyzed data using SQL, Python, Apache Spark and presented analytical reports to management and technical teams.
Performed data wrangling to clean, transform, and reshape data using the NumPy and pandas libraries.
Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
Created a data pipeline to ingest, aggregate, and load consumer response data from an AWS S3 bucket into Hive external tables in HDFS, serving as a feed for Tableau dashboards.
Worked on data ingestion using Airflow operators for data orchestration and other related Python libraries.
Scheduled Airflow DAGs to run multiple Hive and Spark jobs that run independently based on time and data availability.
Performed transformation and data cleansing activities using control flow and data flow tasks in SSIS packages during data migration.
Developed SQL queries using stored procedures, common table expressions (CTEs), and temporary tables to support SSRS and Power BI reports.
Developed a comprehensive migration strategy for the implementation of Power BI, considering the organization’s unique requirements, existing data sources, and technical infrastructure.
Derived data from relational databases to perform complex data manipulations and conducted extensive data checks to ensure data quality.
Worked with file formats such as Text, Avro, ORC, and Parquet for Hive querying and processing.
Involved in performance tuning of the ETL process by addressing various performance issues at the extraction and transformation stages.
Performed unit testing, provided bug fixes and deployment support.
Environment: PySpark, Hadoop, Hive, Sqoop, Hive UDFs, MySQL, Linux, Python, Airflow, Snowflake, AWS, RDBMS, Power BI.