Name: Ajay Cell No: +1-204-***-****: Email: ****.*********@*****.***
LinkedIn: https://www.linkedin.com/in/ajay-babu-koleti-a28368272/
Professional summary:
Experienced Big Data Engineer with around 6 years of experience designing and implementing large-scale distributed systems using a range of Big Data technologies. Expertise in building data processing pipelines, designing and developing ETL workflows, and managing and analyzing large datasets. Proficient in programming languages such as SQL, PL/SQL, Python, and R, and in Big Data technologies such as Apache Spark, Kafka, Cassandra, HBase, and Hadoop. Experienced in working with cloud infrastructures such as AWS, Azure, and GCP.
Skilled in Big Data technologies such as Kafka, Cassandra, Snowflake, Apache Spark, Spark Streaming, HBase, Flume, Impala, HDFS, MapReduce, Hive, Pig, BDM, Sqoop, Oozie, Zookeeper, Kerberos, PySpark, and Airflow, and in Hadoop distributions such as Cloudera CDH, Apache Hadoop, AWS, and Hortonworks HDP.
Proficient in programming and query languages such as SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, shell scripting, and regular expressions. Experienced in working with cloud infrastructure such as AWS, Azure, and GCP, and with AWS services such as S3, EC2, EMR, Redshift, RDS, Lambda, Kinesis, SNS, SQS, AMI, IAM, and CloudFormation. Skilled in working with databases such as Oracle, Teradata, MySQL, and SQL Server, and NoSQL databases such as HBase and MongoDB.
Proficient in version control tools such as CVS, SVN, ClearCase, and Git, and build tools such as Maven and SBT.
Experienced in working with containerization tools such as Kubernetes, Docker, and Docker Swarm.
Skilled in development and reporting tools such as JUnit, Eclipse, Visual Studio, NetBeans, Azure Databricks, CI/CD pipelines, Linux/UNIX, Google Cloud Shell, Power BI, SAS, and Tableau.
Proficient in ETL tools such as Talend Open Studio and Talend Enterprise Platform. Adept at working with NoSQL databases such as Apache HBase, MongoDB, and Cassandra; distributed platforms such as Hortonworks, Cloudera, and Azure HDInsight; and operating systems such as UNIX, Ubuntu Linux, and Windows 2000/XP/Vista/7/8.
Designed, developed, and implemented big data solutions using technologies such as Kafka, Cassandra, Snowflake, Apache Spark, HBase, Flume, Impala, HDFS, MapReduce, Hive, Pig, BDM, Sqoop, Oozie, and Zookeeper on Hadoop clusters (Cloudera CDH, Apache Hadoop, AWS, Hortonworks HDP).
Worked with programming languages such as SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, and shell scripting, along with regular expressions, to build big data applications.
Developed Spark components such as RDDs, Spark SQL (DataFrames and Datasets), and Spark Streaming to process and analyze large datasets.
Worked on cloud infrastructure platforms such as AWS, Azure, and GCP, and utilized services like S3, EC2, EMR, Redshift, RDS, Lambda, Kinesis, SNS, SQS, AMI, IAM, and CloudFormation to develop big data solutions.
Worked with different databases including Oracle, Teradata, MySQL, SQL Server, and NoSQL databases like HBase and MongoDB to store and process large datasets.
Developed shell scripts and SQL queries to automate various processes and tasks, and managed version control using CVS, SVN, ClearCase, and Git.
Utilized build tools like Maven and SBT, and containerization tools like Kubernetes, Docker, and Docker Swarm to automate deployment and manage big data solutions.
Technical Skills:
Big Data Technologies: Kafka, Cassandra, Snowflake, Apache Spark, Spark Streaming, HBase, Flume, Impala, HDFS, MapReduce, Hive, Pig, BDM, Sqoop, Oozie, Zookeeper, and Hadoop
Hadoop Distributions: Cloudera CDH, Apache Hadoop, AWS, and Hortonworks HDP
Programming Languages: SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Shell Scripting, and Regular Expressions
Spark Components: RDD, Spark SQL (DataFrames and Datasets), and Spark Streaming
Cloud Infrastructure: AWS, Azure, and GCP
AWS Services: S3, EC2, EMR, Redshift, RDS, Lambda, Kinesis, SNS, SQS, AMI, IAM, and CloudFormation
Azure Services: Azure Data Lake, Data Factory, Azure Databricks, Azure SQL Database, and Azure SQL Data Warehouse
Databases: Oracle, Teradata, MySQL, SQL Server, and NoSQL databases (HBase, MongoDB)
Scripting & Query Languages: Shell scripting and SQL
Version Control: CVS, SVN, ClearCase, and Git
Build Tools: Maven and SBT
Containerization Tools: Kubernetes, Docker, and Docker Swarm
Reporting & Development Tools: JUnit, Eclipse, Visual Studio, NetBeans, Azure Databricks, CI/CD, Linux/UNIX, Google Cloud Shell, Power BI, SAS, and Tableau
NoSQL Databases: Apache HBase, MongoDB, and Cassandra
Distributed Platforms: Hortonworks, Cloudera, and Azure HDInsight
Operating Systems: UNIX, Ubuntu Linux, and Windows 2000/XP/Vista/7/8
ETL Tools: Talend Open Studio and Talend Enterprise Platform
Methodologies: Agile, Waterfall, and Scrum
Professional Experience
Client: Manulife - Toronto, CA Jan 2022 – Present
Senior Data Engineer
Responsibilities:
Participated in the analysis, design, and development phases of the Software Development Lifecycle (SDLC). Worked in an Agile environment with sprint planning meetings, scrum calls, and retrospective meetings for every sprint.
Used JIRA for project management and GitHub for version control. Designed, developed, and implemented big data solutions using tools and technologies such as Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Flume, Impala, HDFS, MapReduce, Hive, Pig, BDM, Sqoop, Oozie, and Zookeeper.
Designed and tested highly scalable, mission-critical systems and Spark jobs in both Scala and PySpark, integrating with Kafka.
Worked with programming languages such as SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, and shell scripting, along with regular expressions.
Collaborated with unit managers, end users, development staff, and other stakeholders to integrate data mining results with existing systems.
Designed and developed Flink pipelines to consume streaming data from Kafka and applied business logic to cleanse, transform, and serialize the raw data.
Involved in Test coordination, resource allocation and Test spec preparation.
Widely used AngularJS UI-Bootstrap components like date picker, directives, modal pop-ups, ng-grid, $routeProvider, progress bar, ng-Idle, and ng-Upload.
Data Engineer with cloud (AWS, Azure) experience across data warehousing, data engineering, feature engineering, Hadoop big data, ETL/ELT, and business intelligence. As a cloud data architect and engineer, specialized in AWS and Azure frameworks, Cloudera, the Hadoop ecosystem, Spark/PySpark/Scala, Databricks, Hive, Redshift, Snowflake, relational databases, tools like Tableau, Airflow, dbt, and Presto/Athena, and data DevOps frameworks/pipelines, with programming skills in Python.
Hands-on experience in designing UI web applications using front-end technologies like HTML, CSS, JavaScript, jQuery, AngularJS, and Bootstrap.
Experience working with RDBMS including Oracle, DB2, SQL Server, PostgreSQL 9.x, MS Access, and Teradata for faster access to data on HDFS.
Responsible for interacting with developers and project managers to determine requirements (test cases, test data, and bug reports) and for ensuring effective coordination between the development and testing teams.
Developed a Spark Streaming script that consumes topics from the distributed messaging source Kafka and periodically pushes batches of data to Spark for real-time processing.
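A minimal Spark Structured Streaming sketch of this consumption pattern, assuming a local Kafka broker and a hypothetical topic named "events"; the broker address, topic name, and console sink are illustrative placeholders, not the production configuration:

# Sketch: consume a Kafka topic with Spark Structured Streaming and write
# micro-batches to the console. Requires the spark-sql-kafka connector
# package on the classpath. "localhost:9092" and "events" are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Read the Kafka topic as an unbounded streaming DataFrame.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast to strings before processing.
events = raw.select(
    col("key").cast("string"),
    col("value").cast("string"),
    col("timestamp"),
)

# Push each micro-batch downstream (console here; a real job would write
# to HBase, Hive, or another sink).
query = (
    events.writeStream
    .outputMode("append")
    .format("console")
    .option("truncate", "false")
    .trigger(processingTime="30 seconds")
    .start()
)
query.awaitTermination()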
Used HBase as the database to store application data, since HBase offers high scalability, a distributed column-oriented NoSQL model, and real-time data querying, to name a few features.
Developed Spark programs used by the application for faster data processing than standard MapReduce programs.
Involved in identifying job dependencies to design workflow for Oozie & Yarn resource management.
Working in an Agile team to deliver and support required business objectives by using Python, Shell Scripting and other related technologies to acquire, ingest, transform and publish data both to and from Hadoop Ecosystem.
Extended support to Oracle, Hadoop, and MongoDB along with DB2 to embrace a multi-DBMS support model, adopting open-source software and commodity hardware to reduce costs.
Directed all activities related to communications network planning, design, development, operations, help desk, and implementation for the RL Polk/Acxiom network. This included management of 75 people and an annual budget of $22 million supporting application development, voice/data networking, client/server, AS/400, and OS/390 mainframe business systems. Served as the onsite business liaison between Acxiom and RL Polk.
Checked data and table structures in the PostgreSQL and Redshift databases and ran queries to generate reports.
Worked on the SAP MTD process and the SRM (Supplier Relationship Management) portal.
Orchestrated and migrated CI/CD processes using CloudFormation, Terraform, and Packer templates, and containerized the infrastructure using Docker, which was set up in OpenShift, AWS, and VPCs.
Worked with Spark components such as RDD, Spark SQL (Data Frames and Dataset), and Spark Streaming.
Worked with cloud infrastructure such as AWS, Azure, and GCP to deploy and manage big data solutions.
Worked with various databases such as Oracle, Teradata, MySQL, SQL Server, and NoSQL Database (HBase, MongoDB) to extract and process data.
Used Kafka, a publish-subscribe messaging system, by creating topics with producers and consumers to ingest data into the application for Spark to process, and created Kafka topics for application and system logs.
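As an illustration of the publish-subscribe pattern described above, a small sketch using the kafka-python client; the broker address and the "app-logs" topic are assumptions, not the actual environment:

# Sketch: publish and consume JSON messages on a Kafka topic with the
# kafka-python client. "app-logs" and "localhost:9092" are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("app-logs", {"service": "orders", "level": "INFO", "msg": "started"})
producer.flush()

consumer = KafkaConsumer(
    "app-logs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.topic, message.offset, message.value)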
Experience in RDBMS database design, data warehousing, performance tuning, optimization, client requirement analysis, logical design, development, testing, deployment, and support.
Developed and maintained ETL pipelines to extract, transform, and load data from different sources into Hadoop using tools such as Sqoop, Flume, and Oozie.
Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil and bq command-line utilities, Dataproc, and Stackdriver.
Tested IDocs on SAP FI/CO; testing included verifying and validating that all segments within the IDocs were fulfilled and passed through ports.
Provided guidance and implemented Postgresql database solution in AWS (Amazon Web Services) EC2 & RDS environments.
Experience in writing MapReduce and YARN jobs, Pig scripts, Hive queries, and Apache Kafka/Storm applications for analyzing data.
Developed data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data into HDFS for analysis.
Experience processing Avro data files using Avro tools and MapReduce programs.
Developed and managed data processing pipelines using distributed frameworks such as Apache Spark, Spark Streaming, and HBase.
Used Google Cloud Functions with Python to load data into BigQuery on arrival of CSV files in a GCS bucket.
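A minimal sketch of this pattern: a GCS-triggered Cloud Function that loads a newly arrived CSV into BigQuery. The project, dataset, and table names are illustrative placeholders:

# Sketch: background Cloud Function triggered by a GCS object-finalize event
# that loads the new CSV into a BigQuery table. Destination table is a placeholder.
from google.cloud import bigquery

def load_csv_to_bq(event, context):
    bucket = event["bucket"]
    name = event["name"]
    if not name.endswith(".csv"):
        return  # ignore non-CSV objects

    client = bigquery.Client()
    table_id = "my-project.raw_zone.events"  # placeholder destination table

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    uri = f"gs://{bucket}/{name}"
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for completion so failures surface in the logs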
Developed a PySpark script to protect raw data by applying hashing algorithms to client-specified columns.
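A minimal sketch of this column-hashing approach, assuming SHA-256 hashing; the column names and storage paths are placeholders, not the client's actual fields:

# Sketch: hash a configurable list of sensitive columns with SHA-256.
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col

spark = SparkSession.builder.appName("column-hashing-sketch").getOrCreate()

sensitive_columns = ["email", "phone_number"]  # assumed example columns

df = spark.read.parquet("s3://example-bucket/raw/customers/")  # placeholder path

for c in sensitive_columns:
    # sha2 expects a string column; cast defensively, 256-bit digest.
    df = df.withColumn(c, sha2(col(c).cast("string"), 256))

df.write.mode("overwrite").parquet("s3://example-bucket/curated/customers/")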
Good knowledge of SAP MTD, SAP FI/CO processes, and SRM (Supplier Relationship Management).
Trained a Random Forest algorithm on customer web-activity data from media applications to predict potential customers. Worked with Google TensorFlow and the Keras API, using convolutional neural networks for classification problems.
Expert DB2 database administrator experienced in database and systems administration, DB2 installation support, new-release availability, scalability, and performance enhancement support, planning for migration, conversion, and fallback, support for the SMP/E steps involved in DB2 migrations, RSU support, DB2 capacity planning, and performance monitoring and tuning of DB2 subsystems and databases.
Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks to automate the ingestion, processing, and delivery of both structured and unstructured batch and real-time streaming data using Python.
Worked with relational SQL and NoSQL databases, including Postgresql and Hadoop.
Developed a big data web application using Agile methodology in Scala, as Scala combines functional and object-oriented programming.
Hands-on experience with Avro and Parquet file formats, dynamic partitions, and bucketing for best practices and performance improvement.
Functional testing skills include SAP Warehouse Management, Inventory Management, SAP Console/RF solutions, ITSmobile implementation, Handling Unit Management and Batch Management, Task and Resource Management, other LE functions including shipping and transportation, MM purchasing, SD sales order management, pricing, delivery, billing, and FI/CO (Treasury, outbound/inbound payments).
Created pipelines for deploying code from GitHub to Kubernetes (K8s) cluster in the form of Docker containers using Spinnaker platform.
Expertise in implementing a DevOps culture through CI/CD tools like Repos, CodeDeploy, CodePipeline, and GitHub.
Worked on creating a POC utilizing ML models and Cloud ML for table quality analysis in the batch process.
Worked on ETL scripts to pull data from the Denodo database into HDFS.
Good knowledge in using cloud shell for various tasks and deploying services.
Good knowledge of image classification problems using Keras models with weights pre-trained on ImageNet, such as VGG16, VGG19, ResNet, ResNetV2, and InceptionV3. Knowledge of OpenCV for real-time computer vision.
Developed the roadmap and future state architecture for CMS/e-Commerce/OMS/APIs for RL Digital
Implemented and optimized MapReduce jobs to process large datasets in Hadoop using Hive and Pig.
Worked with development and reporting tools such as JUnit, Eclipse, Visual Studio, NetBeans, Azure Databricks, CI/CD pipelines, Linux/UNIX, Google Cloud Shell, Power BI, SAS, and Tableau.
Created ETL pipelines using Apache Crunch to read data from HBase, calculate KPI metrics, store the results in HDFS, and bulk-load data into HBase.
Created Amazon EC2 instances using command-line calls, troubleshot the most common problems with instances, and monitored the health of Amazon EC2 instances and other AWS services.
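The bullet above uses the AWS CLI; as a hedged illustration, here is an equivalent sketch with the boto3 Python SDK. The AMI ID, key pair name, and region are placeholders:

# Sketch: launch an EC2 instance and check its status checks with boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="example-keypair",          # placeholder key pair
)
instance_id = response["Instances"][0]["InstanceId"]

# Wait for instance and system status checks to pass ("monitor the health").
waiter = ec2.get_waiter("instance_status_ok")
waiter.wait(InstanceIds=[instance_id])

status = ec2.describe_instance_status(InstanceIds=[instance_id])
for s in status["InstanceStatuses"]:
    print(s["InstanceId"], s["InstanceStatus"]["Status"], s["SystemStatus"]["Status"])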
Developed REST APIs using Python with the Flask and Django frameworks and integrated various data sources including Java/JDBC, RDBMS, shell scripting, spreadsheets, and text files.
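A minimal Flask sketch in the spirit of the APIs described above; the SQLite connection and the customers table stand in for the real JDBC/RDBMS sources, and error handling is trimmed for brevity:

# Sketch: a read-only REST endpoint backed by a relational source.
from flask import Flask, jsonify
import sqlite3  # stand-in for the JDBC/RDBMS sources mentioned above

app = Flask(__name__)

def get_connection():
    return sqlite3.connect("example.db")  # placeholder database

@app.route("/customers/<int:customer_id>", methods=["GET"])
def get_customer(customer_id):
    conn = get_connection()
    try:
        row = conn.execute(
            "SELECT id, name, email FROM customers WHERE id = ?",
            (customer_id,),
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        return jsonify({"error": "not found"}), 404
    return jsonify({"id": row[0], "name": row[1], "email": row[2]})

if __name__ == "__main__":
    app.run(port=5000)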
Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, Databricks, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling in AWS CloudFormation.
Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications; executed machine learning use cases under Spark ML and MLlib on Databricks.
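A minimal Spark ML sketch illustrating the pipeline style referenced above; the toy DataFrame, feature names, and logistic regression model are synthetic placeholders, not the actual use case:

# Sketch: a small Spark ML pipeline (feature assembly + logistic regression).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()

df = spark.createDataFrame(
    [(34.0, 1, 0.0), (51.0, 0, 1.0), (22.0, 1, 0.0), (47.0, 0, 1.0)],
    ["age", "is_new_customer", "label"],
)

assembler = VectorAssembler(
    inputCols=["age", "is_new_customer"], outputCol="features"
)
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)

model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("features", "label", "prediction").show()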
Developed Spark/Scala, Python for regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows for big data resources.
Built a program with Python and Apache Beam and executed it in Cloud Dataflow to run data validation between raw source files and BigQuery tables.
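A minimal Apache Beam sketch of this kind of validation, comparing the row count of a raw CSV in GCS with the row count of the corresponding BigQuery table; the file path, table name, and project are placeholders, and Dataflow runner options are omitted:

# Sketch: row-count validation between a GCS file and a BigQuery table.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

RAW_FILE = "gs://example-bucket/raw/events.csv"   # placeholder
BQ_TABLE = "my-project:raw_zone.events"           # placeholder

def run():
    options = PipelineOptions()  # add --runner=DataflowRunner etc. as needed
    with beam.Pipeline(options=options) as p:
        file_count = (
            p
            | "ReadRawFile" >> beam.io.ReadFromText(RAW_FILE, skip_header_lines=1)
            | "CountFileRows" >> beam.combiners.Count.Globally()
        )
        table_count = (
            p
            | "ReadBQTable" >> beam.io.ReadFromBigQuery(table=BQ_TABLE)
            | "CountTableRows" >> beam.combiners.Count.Globally()
        )
        # Bring both singleton counts together and report a simple pass/fail.
        (
            file_count
            | "Compare" >> beam.Map(
                lambda file_rows, table_rows: print(
                    f"file={file_rows} table={table_rows} "
                    f"match={file_rows == table_rows}"
                ),
                table_rows=beam.pvalue.AsSingleton(table_count),
            )
        )

if __name__ == "__main__":
    run()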
Created Hive-based reports to support the application metrics (used by UI teams for reporting).
Authored replication and retention policies to automate weekly dataset copies with six-month retention using the NiFi UI and Ranger Access Manager.
Performed root cause analysis on Corso anomaly alerts and resolved anomaly bugs, with comments for interested internal stakeholders, in the Confidential Music Data Warehouse.
Created an internal tool, using shell scripts, for comparing RDBMS and Hadoop data so that all data in the source and target matches.
Experience working with the deep learning frameworks TensorFlow, Keras, and PyTorch.
Created a continuous integration and continuous delivery (CI/CD) pipeline on AWS that helps automate steps in the software delivery process.
Performed multiple Data Mining techniques like Classification, Clustering, Outlier detection and derived new insights from the data during exploratory analysis
Streaming data analysis using Dataflow templates by leveraging Cloud Pub/Sub service.
Experience with ODS (Operational Data Store) to integrate data from different heterogeneous data sources in order to facilitate operational reporting in real-time or near real-time.
Extracted and analyzed data from social media platforms like Google, Yelp, Foursquare, and White Pages using REST APIs, Spark, NLTK, and HBase.
Used Python and SAS to extract, transform, and load source data from transaction systems, and generated reports, insights, and key conclusions. Created Python functions to transform the data on the AWS Databricks platform.
Involved in automating the scheduling of AWS Databricks jobs and built SSIS packages to push data to on-premises servers.
Built ETL solutions using Databricks by executing code in notebooks to load data into the AWS data warehouse following the bronze, silver, and gold layer architecture.
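A minimal sketch of the bronze → silver → gold flow in a Databricks notebook, assuming the notebook-provided spark session; the paths, columns, and aggregation are illustrative placeholders:

# Sketch: medallion-style layering of a raw dataset (Databricks notebook).
from pyspark.sql.functions import col, to_date, sum as spark_sum

# Bronze: land raw files as-is.
bronze = spark.read.json("s3://example-lake/bronze/orders/")

# Silver: clean types and drop obviously bad records.
silver = (
    bronze
    .withColumn("order_date", to_date(col("order_ts")))
    .filter(col("order_id").isNotNull())
)
silver.write.mode("overwrite").format("delta").save("s3://example-lake/silver/orders/")

# Gold: business-level aggregate ready for the warehouse/BI layer.
gold = (
    silver.groupBy("order_date")
    .agg(spark_sum("amount").alias("daily_revenue"))
)
gold.write.mode("overwrite").format("delta").save("s3://example-lake/gold/daily_revenue/")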
Experience in data transformation, data mapping from source to target database schemas, and data cleansing procedures.
Implemented a data lake for user behavior, engagement, retention, and sales analytics.
Proficiency in R (e.g., ggplot2, cluster, dplyr, caret), Python (e.g., pandas, Keras, PyTorch, NumPy, scikit-learn, bokeh, NLTK), Spark MLlib, H2O, and other statistical tools.
Used AWS services to orchestrate Databricks data preparation and load it into the SQL data warehouse. Responsible for ingesting data from the data lake to the data warehouse using AWS Databricks; experience with Elastic IP, Security Groups, and Virtual Private Cloud in AWS. Generated SQL and PL/SQL scripts to install, create, and drop database objects, including tables and views.
Experience in developing machine learning models like Classification, Regression, Clustering, Decision Tree.
Worked on natural language processing for document classification, text processing to find sensitive information in electronically stored files, and text summarization, using NLTK, spaCy, and TextBlob.
Experienced in building and optimizing big data pipelines, architectures, and datasets using the TensorFlow Data API, Spark, and Hive.
Created logical and physical data models using Erwin and reviewed these models with business team and data architecture team.
Used the Python modules urllib, urllib2, and Requests for web crawling. Experience using ML techniques such as clustering, regression, classification, and graphical models.
Built machine learning models in Python and Java that convert unstructured data into structured representations. Architected and coded performance enhancements enabled by the Semantic Web for a life sciences genome mapping project, using the RDF/XML data model.
Strong experience in data migration, data cleansing, transformation, integration, data import, and data export.
Solid knowledge of data marts, Operational Data Stores (ODS), OLAP, and dimensional data modeling with the Ralph Kimball methodology (star schema and snowflake modeling for fact and dimension tables) using Analysis Services.
Informatica development of the Master Data Portal from Siebel, global data conversion, and performance tuning.
Used the NLTK and Gensim libraries and GloVe embeddings for NLP preprocessing and embedding.
Involved in supporting cloud instances running Linux and Windows on AWS, experience with Elastic IP, Security Groups and Virtual Private Cloud in AWS.
Applied text pre-processing and normalization techniques such as tokenization, POS tagging, and parsing. Expertise using NLP techniques (BOW, TF-IDF, Word2Vec) and toolkits such as NLTK, Gensim, and spaCy.
Designed and developed the conceptual then logical and physical data models to meet the needs of reporting.
Extensive experience configuring Amazon EC2, Amazon S3, Amazon Elastic Load Balancing, IAM, and Security Groups in public and private subnets in a VPC, and other AWS services; managed network security using load balancers, auto-scaling, and security groups.
Hands on experience on logical and physical data modelling, code review process, data normalization processes, data integration methods, Agile SDLC and production support.
Utilized the AWS CLI to automate backups of ephemeral data stores to S3 buckets and EBS, and created nightly AMIs of mission-critical production servers as backups.
Experience coordinating with Data Integration, Network, DBA, Server Systems, and Storage Systems teams for cross-team work.
Installed and configured the automation tool Puppet, including installation and configuration of the Puppet master, agent nodes, and an admin control node; involved with Chef and Puppet for deployment on multiple platforms.
Worked on OpenShift PaaS product architecture and created OpenShift namespaces for on-premises applications migrating to the cloud. Virtualized servers using Docker containers for test and dev environment needs.
Worked with Amazon AWS/EC2, and Google's Docker based cluster management environment Kubernetes.
Creating Jenkins jobs and distributing load on Jenkins server by configuring Jenkins nodes which will enable parallel builds.
Extensively worked on Jenkins CI/CD pipeline jobs for end-to-end automation to build, test and deliver artifacts and troubleshoot the build issue during the Jenkins build process.
Managing Jenkins artifacts in Nexus repository and versioning the artifacts with time stamp, Deploying artifacts into servers in AWS cloud with ansible and Jenkins.
Created a continuous integration system using Ant, Jenkins, and Puppet with full automation, enabling continuous integration and faster, flawless deployments. Installed and administered the Git source code tool and ensured the reliability of the application.
Installed/Configured and Managed Nexus Repository Manager and all the Repositories, Created the Release process of the artifacts.
Experience on working with on-premises network, application, server monitoring tools like Splunk, AppDynamics and on AWS with CloudWatch monitoring tool.
Involved in setting up JIRA as defect tracking system and configured various workflows, customizations, and plugins for the JIRA bug/issue tracker.
Environment: AWS, Ansible, ANT, MAVEN, Jenkins, Bamboo, Splunk, Confluence, Bitbucket, GIT, JIRA, Python, SSH, Shell Scripting, Docker, JSON, Kubernetes, Nagios, Red Hat Enterprise Linux, Terraform, Kibana.
Client: Skip the Dishes - Winnipeg, CA Feb 2020– Dec 2021
Senior Data Engineer
Responsibilities:
●Participated in the analysis, design, and development phases of the Software Development Lifecycle (SDLC). Worked in an Agile environment with sprint planning meetings, scrum calls, and retrospective meetings for every sprint. Used JIRA for project management and GitHub for version control.
●Designed and implemented real-time data processing pipelines using Apache Kafka, Apache Spark, and Spark Streaming to process and analyze large volumes of data. Developed and maintained HBase clusters on Hortonworks and Cloudera distributed platforms for storing large scale unstructured data.
●Reverse engineered legacy undocumented systems; performed data consolidation/standardization, data analysis, data migration and cleaning, and reports/dashboard development. Developed several complex ETLs and designed and developed the data warehouse. Worked on SQL Server 2000/2005/2008 R2, SSIS, SSAS, SSRS, DB2, XML, Informatica, SAS, ProClarity, and Showcase reporting; built several ad hoc queries for reporting and data manipulation.
●Worked on various Azure services such as Azure Data Lake, Data Factory, Azure Databricks, and Azure SQL Database to store and process data. Utilized NoSQL databases such as MongoDB and HBase for data storage and retrieval.
●Designed and deployed data pipelines using Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
●Developed solutions in Databricks for Data Extraction, transformation, and aggregation from multiple data sources. Designed and implemented highly performant data ingestion pipelines from multiple sources using Azure Data Factory and Azure Databricks.
●Extensive knowledge of data modeling, data conversions, data integration, and data migration, with specialization in Informatica PowerCenter.
●Experienced in setting up VPCs in AWS and in security, identity, and compliance disciplines such as IAM, Organizations, and S3 bucket policies.
●Erwin 7.3/8/9.5 (dimensional data modeling, relational DM, star schema, and snowflake modeling).
●Built SCD Type 2 dimensions and facts using Delta Lake and Databricks (a minimal merge sketch appears at the end of this section).
●Involved in Data mapping specifications to create and execute detailed system test plans. The data mapping specifies what data will be extracted from an internal data warehouse, transformed, and sent to an external entity
●Responsible for analyzing various data sources such as flat files, ASCII Data, EBCDIC Data, Relational Data (Oracle, DB2 UDB, MS SQL Server) from various heterogeneous data sources.
●Developed custom-built ETL solution, batch processing, and real-time data ingestion pipeline to move data in and out of the Hadoop cluster using PySpark and Shell Scripting.
●Created Azure Databricks (Spark) notebook to extract the data from Data Lake storage accounts and Blob storage accounts to load on premises SQL server database.
●Performed statistical analysis using SQL, Python, R Programming and Excel. Worked extensively with Excel VBA Macros, Microsoft Access Forms.
●Used Python& SAS to extract, transform & load source data from transaction systems, generated reports, insights, and key conclusions. Created Python functions to transform the data from Azure storage to Azure SQL on Azure Databricks platform.
●Involved in Automation scheduling Azure Databricks jobs and build SSIS packages to push data from Azure SQL to on-premises server.
●Data profiling, data migration, process design and re-engineering, DevOps CI/CD, and technical & user documentation.
●Built ETL solutions using Databricks by executing code in Notebooks against data in Data Lake and Delta Lake and loading data into Azure DW following the bronze, silver, and gold layer architecture.
●Used Azure Data Factory to orchestrate Databricks data preparation and load them into SQL Data warehouse. Responsible for ingesting data from Data Lake to Data Warehouse using Azure services such as Azure Data Factory, Azure Databricks
●Integrated on-premises data (MySQL, HBase) with cloud (Blob Storage, Azure SQL DB) and applied transformations to load back to Azure Synapse using Azure Data Factory.
●Built and published Docker container images using Azure Container Registry and deployed them into Azure Kubernetes Service (AKS).
●Imported metadata into Hive and migrated existing tables and applications to work on Hive and Azure, created complex data transformations and manipulations using ADF and Python.
●Configured Azure Data Factory (ADF) to ingest data from different sources like relational and non-relational databases to meet business functional requirements. Improved performance of Airflow by exploring and implementing the most suitable configurations.
●Optimized workflows by building DAGs in Apache Airflow to schedule the ETL jobs and implemented additional components in Apache Airflow such as pools, executors, and multi-node functionality (a sample DAG sketch appears at the end of this section).
●Configured Spark streaming to receive real-time data from Apache Flume and store the stream data using Scala to Azure Table and Data Lake is used to store and do all types of processing and analytics. Created data frames using Spark Data frames.
●Designed cloud architecture and implementation plans for hosting complex app workloads on MS Azure.
●Performed operations on the transformation layer using Apache Drill, Spark RDD, Data frame APIs, and Spark SQL and applied various aggregations provided by Spark framework.
●Provided real-time insights and reports by mining data using Spark Scala functions. Optimized existing Scala code and improved the cluster performance. Processed huge datasets by leveraging Spark Context, Spark SQL, and Spark Streaming.
●Enhanced reliability of Spark cluster by continuous monitoring using Log Analytics and Ambari WEB UI.
●Improved the query performance by transitioning log storage from Cassandra to Azure SQL Datawarehouse.
●Implemented custom-built input adapters using Spark, Hive, and Sqoop to ingest data for analytics from various sources (Snowflake, MS SQL, MongoDB) into HDFS. Imported data from web servers and Teradata using Sqoop, Flume, and Spark Streaming API.
●Improved efficiency of large datasets processing using Scala for concurrency support and parallel processing.
●Developed MapReduce jobs using Scala, compiling program code into bytecode for the JVM for data processing. Ensured faster data processing.
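A minimal sketch of the SCD Type 2 pattern referenced in the Delta Lake bullet above, assuming a Databricks notebook where spark is pre-created; the table path, business key, and tracked column are illustrative placeholders, and filtering of unchanged records is omitted for brevity:

# Sketch: SCD Type 2 upsert into a Delta dimension table using MERGE.
from delta.tables import DeltaTable
from pyspark.sql.functions import current_date, lit

DIM_PATH = "/mnt/lake/gold/dim_customer"   # placeholder Delta path

updates = spark.read.format("delta").load("/mnt/lake/silver/customers_changes")
dim = DeltaTable.forPath(spark, DIM_PATH)

# Step 1: close out current rows whose tracked attribute changed.
(
    dim.alias("d")
    .merge(
        updates.alias("u"),
        "d.customer_id = u.customer_id AND d.is_current = true",
    )
    .whenMatchedUpdate(
        condition="d.address <> u.address",
        set={"is_current": lit(False), "end_date": current_date()},
    )
    .execute()
)

# Step 2: insert the new versions as current rows.
new_rows = (
    updates
    .withColumn("is_current", lit(True))
    .withColumn("start_date", current_date())
    .withColumn("end_date", lit(None).cast("date"))
)
new_rows.write.format("delta").mode("append").save(DIM_PATH)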
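And a minimal Apache Airflow DAG sketch in the spirit of the ETL scheduling bullet above; the DAG id, schedule, pool name, and Python callables are illustrative placeholders (the pool is assumed to already exist in Airflow):

# Sketch: a daily extract → transform → load DAG with retries and a pool.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from source")

def transform():
    print("apply business transformations")

def load():
    print("write to the warehouse")

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_etl_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",   # daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract,
                               pool="etl_pool")  # pool must exist in Airflow
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load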