Dheeraj Kumar Email: ************@*****.***
Data Engineer Mobile: +44-740*******
• 6+ years of professional experience specializing in Data Engineering, encompassing Data Warehousing and Data Pipeline Design, Development, and Implementation. My skill set spans Analysis, Design, Integration, Deployment, and Maintenance of robust software applications, leveraging Big Data and Cloud technologies.
• Proficient in data modeling concepts, including dimensional and relational approaches such as star schema, snowflake schema, and fact & dimension tables.
• Skilled in structured data sets, data pipelines, ETL tools, and data reduction, transformation, and aggregation techniques. Proficient with tools like DBT and DataStage.
• Worked with the Requests, NumPy, Matplotlib, Beautiful Soup, and Pandas Python libraries during the development lifecycle.
• Conducted regular reviews of operational data sources, performed data structuring, cleaning, and ETL processes to maintain data quality and integrity.
• Well-versed in Amazon Web Services such as S3, IAM, EC2, EMR, Kinesis, VPC, DynamoDB, Redshift, Amazon RDS, Lambda, Athena, Glue, DMS, QuickSight, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SQS, and other AWS services.
• Proficient in data manipulation using SQL, Python, and PySpark for analytics, dashboard development, and data-driven decision making.
• Hands-on experience with SQL Server constraints, T-SQL and dynamic queries, and developing workflows and macros for data extraction from various sources, including flat files, Excel, and SQL Server databases.
• Demonstrated expertise in SQL across Oracle, Hadoop/Hive databases, and experience with ETL systems like Batch SQL and Data Stage, facilitating efficient data extraction, transformation, and loading processes.
• Skilled in Python and Spark Scala programming, and in the Databricks platform, for effective task accomplishment.
• Worked with diverse data sources including Oracle, Teradata, Greenplum, and Hadoop databases to extract information and establish key performance indicators (KPIs), serving as benchmarks for critical program success.
• Proficient in Azure services, including Azure Analytics Services, Azure Data Lake Store (ADLS), Azure Data Lake Analytics (ADLA), Azure SQL DW, Azure Data Factory (ADF), Azure Databricks (ADB), Azure Blob Storage, IAM roles, and Azure Cosmos DB.
• Good knowledge of DevOps tools and techniques such as Jenkins and Docker, with experience in creating impactful dashboards using Power BI and Tableau for data visualization and forecasting.
• Expert in performance tuning for Azure SQL Synapse dedicated SQL Pools and serverless SQL Pools, as well as optimizing consumption for Cosmos, Mongo, and SQL APIs. Skilled in utilizing the Databricks platform for creating Spark applications.
• Expert in developing SSIS Packages for ETL processes, extracting, transforming, and loading data into data warehouses/marts from heterogeneous sources.
• Familiarity with data security best practices, including role-based access control (RBAC) and encryption methods in Azure.
• Proficient in setting up and managing Git repositories, with experience in version control using GitHub.
• My comprehensive experience spans the Software Development Life Cycle (SDLC), encompassing Requirements Analysis, Design Specification, and Testing, with proficiency in both Waterfall and Agile methodologies.
Technical Skills
ETL: DBT, DataStage, Azure Data Factory, Databricks, Informatica Cloud, Kafka, Spark, MS SSIS, AWS Glue, Amazon EMR, Talend, Sqoop.
Programming Languages: C, Python, PySpark, Scala, SQL, PL/SQL, T-SQL.
Database Technologies: MySQL, Oracle, Teradata, DB2, NoSQL databases, PostgreSQL.
Tools: Microsoft Visual Studio, SQL Server Management Studio, PyCharm, PostgreSQL, Azure Data Studio.
Reporting Tools: Tableau, Power BI, MS Excel.
DevOps Tools: Jenkins, Kubernetes, Docker.
Cloud Technologies: Microsoft Azure, AWS
Version Control Tools: Git and GitHub.
Methodologies: Agile, Waterfall.
Work Experience
Epam, London Jan 2024 – present
Role: Data Engineer
Responsibilities:
• Responsible for providing data modeling, development, and strategic direction using Azure-based services, implementing Data Engineering techniques to develop business solutions.
• Devise and deploy multi-stage ETL processes involving file-to-staging and staging-to-target transformations, orchestrated using Apache Airflow (a minimal DAG sketch follows this section).
• Transfer data to Snowflake through the creation of Staging Tables, facilitating the loading of data from various files stored in Azure.
• Create complex stored procedures, cursors, tables, views, joins, statements, and sophisticated T-SQL scripts to optimize performance and enhance slow-performing SQL queries.
• Performed performance tuning for Spark Streaming, e.g., setting the right batch interval, choosing the correct level of parallelism, and tuning serialization and memory settings.
• Enhanced the Data Ingestion Framework with resilient, secure data pipelines, improving robustness and dependability.
• Migrate SQL databases to Azure Data Lake, Azure SQL Database, Databricks, and Azure Synapse Analytics, seamlessly controlling access and managing on-premises database migration to Azure Data Lake Store through Azure Data Factory (ADF).
• Design, develop, and integrate Lambda functions to enrich Azure Data Lake Storage structures, seamlessly transferring refined data to Azure SQL Database.
• Worked on the development of data ingestion pipelines using the ETL tool Talend and bash scripting, with big data technologies including but not limited to Hive, Impala, Spark, and Kafka.
• Configured encryption mechanisms for data at rest and in transit using Azure Key Vault and Transport Layer Security (TLS), safeguarding data integrity and confidentiality.
• Implemented and maintained robust security measures in Azure data environments, ensuring compliance with industry standards and regulations such as GDPR, HIPAA, and SOC 2.
• Used different control flow elements such as the Data Flow Task, Sequence Container, Execute SQL Task, expressions, variables, and Script Task.
• Developed SSIS packages to load data from XML files into T-SQL tables using stored procedures, and built a testing document to test the entire ETL workflow.
• Experience working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
• Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility. Worked on analyzing and resolving production job failures in several scenarios.
Environment: Azure Data Factory, Apache Airflow, Talend, Spark, Azure SQL Database, Snowflake, SQL Server, Hive, T-SQL, Kafka, Azure Data Lake, Azure Synapse Analytics, Python, Scala, Tableau, Git.
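A minimal sketch of the multi-stage Airflow orchestration referenced above (file-to-staging, then staging-to-target); the DAG id, task callables, and schedule are hypothetical placeholders rather than the production definitions.

# Minimal Airflow DAG sketch: file -> staging -> target, run daily.
# DAG id, callables, and schedule are illustrative placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def load_files_to_staging(**context):
    # Placeholder: copy the day's source files into staging tables.
    print(f"Loading files for {context['ds']} into staging")

def transform_staging_to_target(**context):
    # Placeholder: transform and merge staging data into target tables.
    print(f"Merging staging into target for {context['ds']}")

with DAG(
    dag_id="multi_stage_etl",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    file_to_staging = PythonOperator(
        task_id="file_to_staging",
        python_callable=load_files_to_staging,
    )
    staging_to_target = PythonOperator(
        task_id="staging_to_target",
        python_callable=transform_staging_to_target,
    )
    file_to_staging >> staging_to_target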
Royal Bank of Canada, London April 2022 – Dec 2023
Role: Data Engineer
Responsibilities:
● Designed and set up an Enterprise Data Lake to provide support for various use cases, including analytics, processing, storage, and reporting of voluminous, rapidly changing data.
● Responsible for maintaining quality reference data in source systems by performing operations such as cleaning and transformation, and ensuring integrity in a relational environment by working closely with stakeholders and the solution architect.
● Extensively used Bitbucket (Stash) for code control and worked on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
● Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
● Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce access for new users.
● Performed end-to-end architecture and implementation assessment of various AWS services such as Amazon EMR, Redshift, and S3.
● Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item so it can be suggested automatically, using Kinesis Data Firehose and the S3 data lake.
● Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
● Used Spark SQL through the Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs.
● Imported data from different sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output response.
● Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce EC2 costs (see the sketch after this section).
● Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS packages).
● Created external and permanent tables in Snowflake on the AWS data.
● Coded Teradata BTEQ scripts to load, transform data, fix defects like SCD 2 date chaining, cleaning up duplicates.
● Developed reusable framework to be leveraged for future migrations that automates ETL from RDBMS systems to the Data Lake utilizing Spark Data Sources and Hive data objects.
● Conducted Data blending, Data preparation using Alteryx and SQL for Tableau consumption and publishing data sources to Tableau server.
● Worked extensively on importing metadata into Hive, migrated existing tables and applications to Hive and the AWS cloud, and made the data available in Athena and Snowflake.
● Experience in handling Python and Spark contexts when writing PySpark programs for ETL.
● Developed Kibana dashboards based on the Logstash data and integrated different source and target systems into Elasticsearch for near-real-time log analysis and monitoring of end-to-end transactions.
● Developed ETL jobs using PySpark, using both the DataFrame API and the Spark SQL API.
● Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
● Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with tasks running on Amazon SageMaker.
Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python, SSRS, Tableau, PySpark.
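A minimal sketch of the AMI-cleanup Lambda mentioned above; the region list, tag key, and "unused" rule are hypothetical assumptions rather than the production logic.

# Lambda handler sketch: deregister account-owned AMIs not tagged as in use.
# Region list and the InUse tag convention are illustrative assumptions.
import boto3

REGIONS = ["eu-west-1", "eu-west-2"]  # hypothetical application regions

def lambda_handler(event, context):
    deregistered = []
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        for image in ec2.describe_images(Owners=["self"])["Images"]:
            tags = {t["Key"]: t["Value"] for t in image.get("Tags", [])}
            if tags.get("InUse") != "true":  # placeholder "unused" rule
                ec2.deregister_image(ImageId=image["ImageId"])
                deregistered.append(image["ImageId"])
    return {"deregistered": deregistered}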
DBS Bank, India April 2021 – March 2022
Role: Data Engineer
Responsibilities:
● Processed web server logs by developing multi-hop Flume agents using the Avro sink and loaded them into MongoDB for further analysis; also extracted files from MongoDB through Flume and processed them.
● Expert knowledge of MongoDB, NoSQL data modeling, tuning, and disaster recovery backups; used it for distributed storage and processing via CRUD operations.
● Extracted and restructured data into MongoDB using the import and export command-line utilities.
● Used Snowflake extensively to do the ETL operations and imported the data from Snowflake to S3 and S3 to Snowflake.
● Experience in setting up fan-out workflows in Flume to design a V-shaped architecture that takes data from many sources and ingests it into a single sink.
● Experience in creating, dropping, and altering tables at run time without blocking updates and queries, using HBase and Hive.
● Experience in working with different join patterns and implemented both Map and Reduce Side Joins.
● Responsible for creating on-demand tables on S3 files using Lambda Functions and AWS Glue using Python and PySpark.
● Wrote Flume configuration files for importing streaming log data into HBase with Flume.
● Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
● Used Flume with a spooling directory source to load data from the local file system (LFS) into HDFS.
● Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
● Created Partitioned Hive tables and worked on them using HiveQL.
● Loaded data into HBase using both bulk and non-bulk loads.
● Worked with the continuous integration tool Jenkins and automated end-of-day JAR builds.
● Performed extensive data analysis and data validation on Teradata and designed Star and Snowflake data models for the Enterprise Data Warehouse using ERwin.
● Worked with Tableau, integrated Hive with Tableau Desktop reports, and published them to Tableau Server.
● Developed MapReduce programs in Java to parse raw data and populate staging tables.
● Experience in setting up the whole application stack, and in setting up and debugging Logstash to send Apache logs to AWS Elasticsearch.
● Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
● Analyzed the SQL scripts and designed the solution to implement using Scala.
● Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, and handled structured data using Spark SQL (see the sketch after this section).
● Implemented Spark scripts using Scala, PySpark, and Spark SQL to access Hive tables in Spark for faster data processing.
● Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL DW) and processed the data in Azure Databricks.
● Tested Apache Tez for building high performance batch and interactive data processing applications on Pig and Hive jobs.
● Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using the Spark context, Spark SQL, PostgreSQL, Scala, DataFrames, Impala, OpenShift, Talend, and pair RDDs.
Environment: Hadoop (HDFS, MapReduce), Databricks, Spark, Talend, Impala, Snowflake, Hive, PostgreSQL, Jenkins, NiFi, Scala, PySpark, MongoDB, Cassandra, Python, Pig, Sqoop, Hibernate, Spring, Oozie, AWS services (EC2, S3, Auto Scaling), Azure, Elasticsearch, DynamoDB, UNIX shell scripting, Tez.
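A PySpark sketch in the spirit of the Spark SQL JSON loading described above; the input path, column names, temp view, and Hive table are hypothetical placeholders.

# PySpark sketch: read JSON, shape it with Spark SQL, and persist to a Hive table.
# The path, columns, temp view, and table name are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("json_to_hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Read semi-structured JSON; Spark infers the schema into a DataFrame.
events_df = spark.read.json("/data/raw/events/")  # hypothetical path
events_df.createOrReplaceTempView("events_raw")

# Handle the structured data with Spark SQL.
cleaned_df = spark.sql("""
    SELECT event_id, user_id, CAST(event_ts AS timestamp) AS event_ts
    FROM events_raw
    WHERE event_id IS NOT NULL
""")

# Append the result into a Hive table for downstream queries.
cleaned_df.write.mode("append").format("hive").saveAsTable("analytics.events")  # hypothetical table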
Value Labs, India June 2016 – December 2018
Role: Data Engineer
Responsibilities:
• Spearheaded the acquisition of new data sources, ensuring comprehensive lifecycle management, data quality checks, transformations, and metadata enrichment.
• Developed a Python program to manage the archival of raw files within AWS S3 storage (see the sketch after this section).
• Conducted in-depth analysis of diverse raw file formats, including JSON, CSV, and XML, utilizing Python libraries such as Pandas and NumPy.
• Created and executed SQL queries on Big Data platforms like Impala and Hive for data extraction, utilizing join and aggregation operations.
• Configured Spark Streaming for real-time data collection, using Python to store data in AWS DynamoDB and Delta Lake for comprehensive processing and analytics.
• Demonstrated expertise in migrating SQL databases to AWS services including Amazon S3, Amazon RDS, Amazon Redshift, and AWS Glue, while ensuring controlled access and management.
• Performed performance tuning at source, target, mapping, and session levels to optimize data processes.
• Utilized database management systems such as Apache Hadoop/HDFS, Apache Cassandra, Apache Hive, and NoSQL databases.
• Managed the entire software development life cycle, including planning, designing, testing, and deploying applications.
• Leveraged business intelligence tool Tableau to craft dashboards and ad hoc reports, effectively addressing business challenges, and optimizing processes.
• Proficient in designing and implementing IAM policies and roles in AWS using Terraform, emphasizing the principle of least privilege for enhanced security.
• Actively participated in the entire software development lifecycle, from design and development to unit testing and system integration testing.
Environment: Python, Pandas, NumPy, SSIS, PySpark, AWS, AWS S3, AWS DynamoDB, Amazon RDS, Amazon Redshift, AWS Glue, SQL Server, PostgreSQL, Tableau, Medallion Architecture, Apache Airflow.
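An illustrative sketch of the S3 archival routine noted in this section; the bucket, prefixes, age threshold, and storage class are hypothetical assumptions.

# Sketch: move raw files older than a cutoff to an archive prefix with a colder
# storage class. Bucket, prefixes, and the 90-day threshold are illustrative.
from datetime import datetime, timedelta, timezone
import boto3

BUCKET = "raw-data-bucket"          # hypothetical bucket
SOURCE_PREFIX = "raw/"              # hypothetical prefixes
ARCHIVE_PREFIX = "archive/raw/"
CUTOFF = datetime.now(timezone.utc) - timedelta(days=90)

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET, Prefix=SOURCE_PREFIX):
    for obj in page.get("Contents", []):
        if obj["LastModified"] < CUTOFF:
            archive_key = ARCHIVE_PREFIX + obj["Key"][len(SOURCE_PREFIX):]
            # Copy to the archive prefix with a colder storage class, then delete the original.
            s3.copy_object(
                Bucket=BUCKET,
                Key=archive_key,
                CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
                StorageClass="GLACIER",
                MetadataDirective="COPY",
            )
            s3.delete_object(Bucket=BUCKET, Key=obj["Key"])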
Education
BCOM Computers, Osmania University.
MBA- International Business, University of Greenwich.