SUJITH KUMAR REDDY BATHULA
Azure Data Engineer
Phone: +1-469-***-****.
Email: ***********************@*****.***.
PROFESSIONAL SUMMARY
Overall 9+ years of experience in IT, including Big Data technologies, the Hadoop ecosystem, data warehousing, and SQL-related technologies across various sectors. 5 years of experience in Big Data analytics using Hadoop ecosystem tools, the Spark framework, and Azure cloud services, with Scala and Python as the main programming languages. 4 years of experience in a data warehouse developer role.
Good experience working with Azure Blob and Data Lake Storage and loading data into Azure Synapse Analytics (SQL DW).
Hands-on experience in Python programming, PySpark implementation, and Azure Data Factory, building data pipeline infrastructure to support deployments of machine learning models.
Proficient in writing complex Spark (PySpark) user-defined functions (UDFs), Spark SQL, and HiveQL (a minimal UDF sketch follows this summary).
Experience working on Azure services such as Data Lake, Data Lake Analytics, SQL Database, Synapse, Databricks, Data Factory, Logic Apps, Function Apps, and Event Hubs.
Experience in developing data pipelines using Hive and Sqoop to extract data from weblogs and store it in HDFS, and in developing HiveQL for data analytics.
Worked extensively with Spark Streaming and Apache Kafka to ingest live streaming data.
Experience in converting Hive/SQL queries into Spark transformations using Java, and in ETL development using Kafka and Sqoop.
Proficient in Hive, Oracle, SQL Server, SQL, PL/SQL, T-SQL and in managing very large databases.
Experience writing in-house UNIX shell scripts for Hadoop and Big Data development.
Developed Spark scripts by using Scala shell commands as per the requirement.
Good experience in writing Sqoop queries for transferring bulk data between Apache Hadoop and structured data stores.
Substantial experience in writing MapReduce jobs in Java.
Experience in developing Spark applications using PySpark and Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats (structured/unstructured) to analyze and transform the data and uncover insights into customer usage patterns.
Experience in Data Warehousing, Data Mart, Data Wrangling using Azure Synapse Analytics
Experience in understanding business requirements for analysis, database design & development of applications
Worked with Kafka tools like Kafka migration, MirrorMaker, and Consumer Offset Checker.
Experience with real time data ingestion using Kafka.
Experience with CI/CD pipelines with Jenkins, Bitbucket, GitHub etc.
Strong expertise in troubleshooting and performance tuning of Spark and Hive applications.
Hands on experience in developing SPARK applications using Spark tools like RDD transformations, Spark core, Spark Streaming and Spark SQL
Extensive experience in developing applications that perform Data Processing tasks using Teradata, Oracle, SQL Server, and MySQL database.
Worked on data warehousing and ETL tools like Informatica and Power BI.
Acquainted with Agile and Waterfall methodologies. Responsible for handling several client-facing meetings, with strong communication skills.
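A minimal PySpark UDF sketch of the pattern referenced above (illustrative only; the table and column names are hypothetical):

# Minimal PySpark UDF sketch; table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

# A simple UDF that normalizes free-text usage categories.
@udf(returnType=StringType())
def normalize_category(raw):
    return raw.strip().lower().replace(" ", "_") if raw else "unknown"

df = spark.table("usage_events")  # hypothetical source table
cleaned = df.withColumn("category", normalize_category(col("category_raw")))

# Register the same logic for use from Spark SQL / HiveQL-style queries.
cleaned.createOrReplaceTempView("usage_events_clean")
spark.udf.register("normalize_category", normalize_category)
spark.sql("""
    SELECT category, COUNT(*) AS events
    FROM usage_events_clean
    GROUP BY category
""").show()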
EDUCATION
Bachelor of Technology from National Institute of Technology Calicut, India.
TECHNICAL SKILLS
Azure Services
Azure Data Factory, Azure Databricks, Logic Apps, Function Apps, Synapse, Event Hubs, Power BI, Airflow, Snowflake, Azure DevOps
Hadoop Distribution
Cloudera, Hortonworks
Big Data Technologies
HDFS, MapReduce, Hive, Sqoop, Oozie, Zookeeper, Kafka, Apache Spark, Spark Streaming
Languages
Java, SQL, PL/SQL, Python, HiveQL, Scala.
Web Technologies
HTML, CSS, JavaScript, XML, JSP, RESTful, SOAP
Operating Systems
Windows (XP/7/8/10), UNIX, Linux, Ubuntu, CentOS
Build Automation tools
Ant, Maven
Version Control
GIT, GitHub.
IDE &Build Tools, Design
Eclipse, Visual Studio.
Databases
MS SQL Server 2016/2014/2012, Azure SQL DB, Azure Synapse, MS Excel, MS Access, Oracle 11g/12c, Cosmos DB
PROFESSIONAL EXPERIENCE
Client: LexisNexis Risk Solutions Mar 2022 – Present
Role: Azure Data Engineer
Responsibilities:
Working knowledge of Azure cloud components (Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, Cosmos DB).
Experience in analyzing data from Azure data storage using Databricks to derive insights using Spark cluster capabilities.
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Databricks, PySpark, Spark SQL, and U-SQL (Azure Data Lake Analytics).
Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from Azure SQL, Blob Storage, and Azure SQL Data Warehouse.
Worked with Azure Blob and Data Lake Storage and loaded data into Azure Synapse Analytics (SQL DW).
Involved in developing data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL.
Designed and implemented a real-time data streaming solution using Azure EventHub.
Conducted performance tuning and optimization activities to ensure optimal performance of Azure Logic Apps and associated data processing pipelines.
Developed a Spark Streaming application to process real-time data from various sources such as Kafka and Azure Event Hubs.
Built streaming ETL pipelines using Spark Streaming to extract data from various sources, transform it in real time, and load it into a data warehouse such as Azure Synapse Analytics.
Used tools such as Azure Databricks or HDInsight to scale out the Spark Streaming cluster as needed.
Developed Spark API to import data into HDFS from Teradata and created Hive tables.
Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression (a minimal PySpark sketch of this pattern follows this role's environment line).
Loaded data into Parquet Hive tables from Avro Hive tables.
Worked on a learner data model that gets data from Kafka in real time and persists it to Cassandra.
Hands-on programming experience in scripting languages such as Python and Scala.
Involved in running all the Hive scripts through Hive on Spark and some through Spark SQL.
Used the JSON and XML SerDes for serialization and deserialization to load JSON and XML data into Hive tables.
Implemented Hive partitioning, bucketing, and optimization through set parameters.
Performed different types of joins on Hive tables and implemented Hive SerDes such as Avro and JSON.
Managed logs through Kafka with Logstash.
Involved in performance tuning of Hive from design, storage, and query perspectives.
Developed Kafka consumer's API in Scala for consuming data from Kafka topics.
Monitored Spark cluster using Log Analytics and Ambari Web UI.
Developed Spark core and Spark SQL scripts using Scala for faster data processing.
Worked on Hadoop ecosystem in PySpark on HDInsight and Databricks.
Extensively used Spark Core (SparkContext), Spark SQL, and Spark Streaming for real-time data processing.
Performed data profiling and transformation on the raw data using Python.
Orchestrated a number of Sqoop and Hive scripts using Oozie workflows and scheduled them with the Oozie coordinator.
Implemented RDD/Dataset/DataFrame transformations in Scala through SparkContext and HiveContext.
Used Jira for bug tracking and Bitbucket to check-in and checkout code changes.
Experienced in version control tools like GIT and ticket tracking platforms like JIRA.
Expert at handling unit testing using JUnit 4, JUnit 5, and Mockito.
Environment: Azure, Hadoop, HDFS, Yarn, MapReduce, Hive, Sqoop, Oozie, Kafka, Spark SQL, Spark Streaming, Eclipse, Informatica, Oracle, CI/CD, PL/SQL, UNIX Shell Scripting, Cloudera.
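Referenced above: a minimal PySpark/Databricks sketch of landing raw data from ADLS Gen2 and publishing it as a partitioned, Snappy-compressed Parquet table. The storage account, container, schema, and table names are hypothetical, and cluster access to the storage account is assumed.

# Illustrative Databricks/PySpark job: land raw data from ADLS Gen2 and publish a
# partitioned, Snappy-compressed Parquet table. Paths and names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = (SparkSession.builder
         .appName("raw-to-curated")
         .enableHiveSupport()
         .getOrCreate())

raw_path = "abfss://raw@examplestorage.dfs.core.windows.net/weblogs/"  # hypothetical

raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv(raw_path))

curated = (raw
           .withColumn("event_date", to_date(col("event_ts")))
           .dropDuplicates(["event_id"]))

# Partitioned Parquet with Snappy compression, exposed as a Hive table.
spark.sql("CREATE DATABASE IF NOT EXISTS curated_db")
(curated.write
 .mode("overwrite")
 .format("parquet")
 .option("compression", "snappy")
 .partitionBy("event_date")
 .saveAsTable("curated_db.weblogs"))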
Client: Verizon Dec 2019 – Feb 2022
Role: Azure Data Engineer
Responsibilities:
Hands-on experience in Azure cloud services: Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
Created batch and streaming pipelines in Azure Data Factory (ADF) using linked services, datasets, and pipelines to extract, transform, and load data.
Created Azure Data Factory (ADF) batch pipelines to ingest data from relational sources into Azure Data Lake Storage (ADLS Gen2) in an incremental fashion and then load it into Delta tables after cleansing.
Created Azure Logic Apps to trigger when a new email with an attachment is received and load the file to blob storage.
Implemented CI/CD pipelines using Azure DevOps in cloud with GIT, Maven, along with Jenkins plugins.
Built a Spark Streaming application to perform real-time analytics on streaming data (see the Structured Streaming sketch after this role).
Experience in designing and developing POCs in Spark using Scala to compare the performance of Spark with MapReduce and Hive.
Used Spark SQL to query and aggregate data in real time and output the results to visualization tools such as Power BI or Azure Data Studio.
Developed a Spark Streaming application that integrates with event-driven architectures such as Azure Functions and Azure Logic Apps.
Used Spark Streaming to process events in real time and trigger downstream workflows based on the results.
Involved in creating Hive tables and loading and analyzing data using hive queries.
Designed and developed custom Hive UDFs.
Used the JSON and XML SerDes for serialization and deserialization to load JSON and XML data into Hive tables.
Involved in migration of ETL processes from Oracle to Hive to test easy data manipulation.
Implemented reprocessing of failed messages in Kafka using offset IDs.
Used HiveQL to analyze the partitioned and bucketed data.
Developed a Spark job in Java that indexes data into Azure Functions from external Hive tables stored in HDFS.
Wrote Hive queries to analyze data for aggregation and reporting.
Developed Sqoop jobs to load data from RDBMS to external systems such as HDFS and Hive.
Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats.
Worked on converting dynamic XML data for ingestion into HDFS.
Transformed and copied data from JSON files stored in Data Lake Storage into an Azure Synapse Analytics table using Azure Databricks.
Used Azure Databricks, Azure Storage accounts, and related services for source stream extraction, cleansing, consumption, and publishing across multiple user bases.
Created resources using Azure Terraform modules and automated infrastructure management.
Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
Loaded data from the UNIX file system into HDFS.
Configured Spark Streaming to receive real-time data from Apache Flume and store the stream data in Azure Tables using Scala.
Loaded the data into Spark RDDs and performed in-memory computation to generate the output response.
Used several RDD transformations to filter the data ingested into Spark SQL.
Used HiveContext and SQLContext to integrate the Hive metastore with Spark SQL for optimum performance.
Used the version control system GIT to access the repositories and used in coordinating with CI tools.
Environment: Spark SQL, HDFS, Hive, Pig, Apache Sqoop, Java (JDK SE 6, 7), Scala, Shell scripting, Linux, MySQL, Oracle Enterprise DB, IntelliJ, CI/CD, Oracle, Subversion, and Agile Methodologies.
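Referenced above: a minimal Spark Structured Streaming sketch of the streaming-ingest pattern. The broker address, topic, schema, checkpoint, and output paths are hypothetical, and the sketch assumes the Spark-Kafka connector and Delta Lake support (e.g. a Databricks runtime) are available.

# Illustrative streaming ETL: read events from Kafka, parse JSON, and append to a
# Delta table. Broker, topic, schema, and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

event_schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("event_ts", TimestampType()),
])

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
          .option("subscribe", "orders")                       # hypothetical topic
          .option("startingOffsets", "latest")
          .load())

parsed = (stream
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

query = (parsed.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/mnt/checkpoints/orders")  # hypothetical
         .start("/mnt/delta/orders"))                              # hypothetical path

query.awaitTermination()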
Client: HTC Global Svc, Michigan Nov 2018 – Dec 2019
Role: Hadoop Developer/Data Engineer
Responsibilities:
Worked on development of data ingestion pipelines using the ETL tool Talend and bash scripting, with big data technologies including but not limited to Hadoop, Hive, Spark, and Kafka.
Experience in developing scalable & secure data pipelines for large datasets.
Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.
Developed data pipeline using Flume, Sqoop, Pig and Java Map Reduce to ingest customer behavioural data into HDFS for analysis.
Supported data quality management by implementing proper data quality checks in data pipelines.
Enhanced the data ingestion framework by creating more robust and secure data pipelines.
Implemented data streaming capability using Kafka and Informatica for multiple data sources.
Involved in Sqoop implementation, which helps in loading data from various RDBMS sources to Hadoop systems and vice versa.
Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Azure SQL).
Optimized query performance in Hive using bucketing and partitioning techniques (a Spark-side sketch of this pattern follows this role).
Created and managed partitions and buckets in Hive tables.
Responsible for maintaining and handling data inbound and outbound requests through big data platform.
Used Sqoop to transfer data between relational databases and Hadoop.
Knowledge of implementing JILs to automate jobs in the production cluster.
Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
Worked on analyzing and resolving the production job failures in several scenarios.
Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
Environment: Spark, Azure SQL, Python, HDFS, Hive, Sqoop, Scala, Kafka, Shell scripting, Linux, Eclipse, Git, Oozie, Informatica, Agile Methodology.
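Referenced above: a minimal Spark-side sketch of the partitioning and bucketing pattern, using the DataFrame writer's partitionBy and bucketBy (note that Spark's bucket metadata is not identical to Hive-native bucketing). The staging and target table names are hypothetical.

# Illustrative partitioned + bucketed table write: partition by date for pruning,
# bucket by customer_id to co-locate rows for joins/aggregations. Names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-bucket-demo")
         .enableHiveSupport()
         .getOrCreate())

events = spark.table("staging.customer_events")   # hypothetical staging table

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
(events.write
 .mode("overwrite")
 .partitionBy("event_date")          # partition pruning on date filters
 .bucketBy(32, "customer_id")        # bucketed layout for the join/group key
 .sortBy("customer_id")
 .format("parquet")
 .saveAsTable("analytics.customer_events"))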
Client: Illumina, Redwood City, CA Aug 2015 – Oct 2016
Role: Hadoop Developer.
Responsibilities:
Installed/configured/maintained Apache Hadoop clusters for application development and Hadoop tools like HDFS, Hive, ZooKeeper, and Sqoop.
Implemented Partitioning, Dynamic Partitions, Buckets in HIVE.
Installed and Configured Sqoop to import and export the data into Hive from Relational databases.
Administered large Hadoop environments, including cluster build, setup, and support.
Performed performance tuning and monitoring in an enterprise environment.
Closely monitored and analyzed MapReduce job executions on the cluster at the task level and optimized Hadoop cluster components to achieve high performance.
Used Python & SAS to extract, transform & load source data from transaction systems, generated reports, insights, and key conclusions.
Designed and developed data mapping procedures for ETL (data extraction, data analysis, and loading) to integrate data using Python programming (a minimal Python ETL sketch follows this role).
Used Hive and created Hive tables.
Involved in data loading and writing Hive UDFs.
Worked with Linux server admin team in administering the server hardware and operating system.
Designed, developed, and maintained data integration programs in Hadoop and RDBMS environments.
Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS.
Environment: Hadoop YARN, Spark, Spark SQL, Python, Hive, Sqoop, Map Reduce, Power BI, Oracle, Linux
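Referenced above: a minimal Python ETL sketch of the extract, transform/map, and load flow. The input file, column mappings, and SQLite target are hypothetical stand-ins for the actual transaction-system sources and reporting database.

# Minimal Python ETL sketch (extract -> transform/map -> load). File names, column
# mappings, and the SQLite target are hypothetical.
import sqlite3
import pandas as pd

COLUMN_MAP = {"txn_id": "transaction_id", "amt": "amount_usd", "dt": "transaction_date"}

def extract(path: str) -> pd.DataFrame:
    """Read a raw extract from the transaction system."""
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the data-mapping rules and basic quality checks."""
    df = df.rename(columns=COLUMN_MAP)
    df["transaction_date"] = pd.to_datetime(df["transaction_date"])
    df = df.dropna(subset=["transaction_id", "amount_usd"])
    return df.drop_duplicates(subset=["transaction_id"])

def load(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    """Load the curated frame into the reporting database."""
    df.to_sql("transactions", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    with sqlite3.connect("reporting.db") as conn:                    # hypothetical target
        load(transform(extract("transactions_raw.csv")), conn)       # hypothetical file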
Client: Callidus Cloud, Dublin, CA July 2013 —July 2015
Role: Data Warehouse Developer
Responsibilities
Created, manipulated, and supported SQL Server databases.
Involved in data modeling and the physical and logical design of databases.
Helped in integration of the front end with the SQL Server backend.
Created Stored Procedures, Triggers, Indexes, User defined Functions, Constraints etc. on various database objects to obtain the required results.
Imported and exported data from one server to other servers using tools like Data Transformation Services (DTS).
Wrote T-SQL statements for retrieval of data and was involved in performance tuning of T-SQL queries.
Transferred data from various data sources/business systems including MS Excel, MS Access, Flat Files etc. to SQL Server using SSIS/DTS.
Supported the team in resolving SQL Reporting Services and T-SQL related issues; proficient in creating and formatting different types of reports such as crosstab, conditional, drill-down, Top N, summary, form, OLAP, and sub-reports.
Provided application support via phone. Developed and tested Windows command files and SQL Server queries for production database monitoring in 24/7 support.
Developed, monitored, and deployed SSIS packages.
Generated multiple enterprise reports (SSRS/Crystal/Impromptu) from the SQL Server database (OLTP) and SQL Server Analysis Services database (OLAP), including various reporting features such as group by, drill-down, drill-through, sub-reports, and navigation reports (hyperlinks).
Worked on all types of report types like tables, matrix, charts, sub reports etc.
Created linked reports, ad-hoc reports, etc., based on requirements. Linked reports were created in the Report Server to reduce repetition of reports.
Environment: Microsoft Office, Windows 2007, T-SQL, DTS, SQL Server 2008, HTML, SSIS, SSRS, XML.
Client: Re-Nu Technology, Chennai, India Sep 2011 —June 2013
Role: Data Warehouse Developer
Responsibilities
Expert in designing ETL data flows using SSIS, creating mappings/workflows to extract data from SQL Server, and performing data migration and transformation from Access/Excel sheets using SQL Server SSIS.
Efficient in dimensional data modeling for data mart design, identifying facts and dimensions, and developing fact tables and dimension tables using Slowly Changing Dimensions (SCD).
Experience in Error and Event Handling: Precedence Constraints, Break Points, Check Points, Logging.
Experienced in Building Cubes and Dimensions with different Architectures and Data Sources for Business Intelligence and writing MDX Scripting.
Thorough knowledge of Features, Structure, Attributes, Hierarchies, Star and Snowflake Schemas of Data Marts.
Good working knowledge on Developing SSAS Cubes, Aggregation, KPIs, Measures, Partitioning Cube, Data Mining Models and Deploying and Processing SSAS objects.
Experience in creating Ad hoc reports and reports with complex formulas and to query the database for Business Intelligence.
Expertise in developing Parameterized, Chart, Graph, Linked, Dashboard, Scorecards, Report on SSAS Cube using Drill-down, Drill-through and Cascading reports using SSRS.
Flexible, enthusiastic, and project-oriented team player with excellent written, verbal communication and leadership skills to develop creative solutions for challenging client needs.
Environment: MS SQL Server 2016, Visual Studio 2017/2019, SSIS, SharePoint, MS Access, Team Foundation Server, Git.