Swathi K
Email: ***********@*****.***
Phone: +1-651-***-****
SUMMARY:
Azure Data Engineer with 12+ years of experience in analysis, design, and development with Big Data technologies such as Spark, MapReduce, Hive, YARN, and HDFS, using programming languages including Java, Scala, and Python.
Extensive experience in IT data analytics projects; hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.
Strong experience building data pipelines and performing large-scale data transformations.
In-depth knowledge of distributed computing systems and parallel processing techniques for efficiently handling Big Data.
Firm understanding of Hadoop architecture and its components, including HDFS, YARN, MapReduce, Hive, Pig, HBase, Kafka, and Oozie.
Strong experience building Spark applications in Scala and Python.
Good experience troubleshooting and fine-tuning long-running Spark applications.
Strong experience using the Spark RDD API, Spark DataFrame/Dataset API, Spark SQL, and Spark ML for building end-to-end data pipelines (a minimal PySpark sketch follows this summary).
Good experience working with real-time streaming pipelines using Kafka and Spark Streaming.
Strong experience working with Hive for data analysis.
Detailed exposure to Hive concepts such as partitioning, bucketing, join optimizations, SerDes, built-in UDFs, and custom UDFs.
ETL/SSIS data processing experience with various source systems: databases, Dynamics CRM, Dynamics Finance, Informix, XML, JSON, and Parquet.
Good experience automating end-to-end data pipelines using the Oozie workflow orchestrator.
Strong experience leading multiple Azure Big Data and data transformation implementations in the healthcare domain.
Worked on Docker-based containers for running Airflow.
Hands-on experience setting up workflows with Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
Implemented IAM policies, VPC Service Controls, and encryption strategies across GCP services to ensure security and compliance.
Expertise in configuring and installing PostgreSQL and Postgres Plus Advanced Server for OLTP and OLAP systems, from high-end to low-end environments.
Experience in backup/restore of PostgreSQL databases; strong experience in performance tuning and index maintenance.
Detailed exposure to Azure tools such as Azure Data Lake, Azure Databricks, Azure Data Factory, HDInsight, Azure SQL Server, Azure DevOps, Azure Synapse, Azure SQL Database, Azure Monitoring, Key Vault, and Azure Storage to build robust and scalable data pipelines.
Expertise in monitoring GCP workloads using Google Cloud Operations Suite (formerly Stackdriver): logging, monitoring, alerting, and profiling. Experience in analyzing, designing, and developing ETL strategies and processes, and writing ETL specifications.
Analyzed, designed, and built modern data solutions using Azure PaaS services to support data visualization; assessed the current production state of applications and determined the impact of new implementations on existing business processes.
Excellent understanding of NoSQL databases such as HBase, Cassandra, and MongoDB.
Set up CloudWatch alarms and CloudTrail, created CloudFormation templates, and created S3 buckets.
Experienced in requirement analysis, application development, application migration and maintenance using Software Development Lifecycle (SDLC) and Python/Java technologies.
Excellent technical and analytical skills with a clear understanding of design goals and development for OLTP and dimensional modeling for OLAP.
Adequate knowledge and working experience in Agile and Waterfall methodologies.
Defined user stories and drove the Agile board in JIRA during project execution; participated in sprint demos and retrospectives.
Completed POCs on newly adopted technologies such as Apache Airflow, Snowflake, and GitLab.
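As a quick illustration of the Spark DataFrame / Spark SQL pipeline work summarized above, a minimal PySpark sketch is shown below; the input path, column names (customer_id, event_ts, duration_sec), and aggregations are hypothetical placeholders rather than details from any specific engagement.

```python
# Minimal PySpark sketch of a DataFrame / Spark SQL pipeline.
# Paths, columns, and aggregations below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usage-summary").getOrCreate()

# Read raw usage events (schema inferred here for brevity).
events = spark.read.option("header", True).csv("/data/raw/usage_events.csv")

# DataFrame API: derive a date column and aggregate daily usage per customer.
daily_usage = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("customer_id", "event_date")
    .agg(F.count("*").alias("events"),
         F.sum("duration_sec").alias("total_duration_sec"))
)

# Spark SQL over the same data via a temporary view.
daily_usage.createOrReplaceTempView("daily_usage")
top_customers = spark.sql("""
    SELECT customer_id, SUM(events) AS total_events
    FROM daily_usage
    GROUP BY customer_id
    ORDER BY total_events DESC
    LIMIT 10
""")

# Persist the curated output as Parquet.
top_customers.write.mode("overwrite").parquet("/data/curated/top_customers")
```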
TECHNOLOGY AND TOOLS:
Big Data Ecosystem
Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Spark, Avro, Splunk, Parquet and Snappy.
Languages
Python, Scala, SQL, Linux shell scripting
Databases
Oracle, DB2, SQL Server, MySQL, PL/SQL, NoSQL, RDS, HBase, MongoDB.
Cloud [Azure, AWS]
ADF, Azure AD, Azure DevOps, Azure Synapse Analytics, Azure Data Lake, EC2, S3, EMR, Redshift, Athena, Snowflake.
IDE/ Programming Tools
Eclipse, PyCharm, IntelliJ IDEA
Operating Systems
Unix, Linux, Windows.
Web Technologies
HTML, CSS, XML, JavaScript, jQuery, Bootstrap
Libraries and Tools
PySpark, Jira, Scrum, Agile methodologies
Industrial Production Topics
ERP, CRM
PROFESSIONAL EXPERIENCE:
Client: Campbell Soup Company – New Jersey  April 2023 – Present
Role: Sr Azure Data Engineer
Responsibilities:
Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, HDInsight, and Azure SQL Server.
Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
Built scalable, cost-effective ETL pipelines using Databricks, PySpark, and Delta Lake.
Leveraged Unity Catalog features (data access control, data discovery) to improve governance.
Performed data ingestion from ADLS, Blob Storage, SAP HANA, and legacy systems.
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, and Spark SQL, and processed the data in Databricks.
Analyzed data from different sources using Hadoop-based Big Data solutions, implementing Azure Data Factory, Azure Data Lake, Azure Data Lake Analytics, HDInsight, Hive, and Sqoop.
Built database models, APIs, and views using Python to deliver an interactive web-based solution.
Involved in requirements gathering, data modeling, table creation, ETL code design, testing, and support.
Created end-to-end data pipelines using ADF services to load data from on-premises sources to Azure SQL Server for data orchestration.
Performed performance tuning on existing processes.
Created and ran new and existing SQL queries to feed business intelligence (BI) tools, including Jupyter Notebooks (Python), Power BI dashboards, and Excel reports.
Ingested data from various sources such as SAP HANA, Blob Storage, SharePoint, and raw Excel files.
Created new medallion-architecture jobs to extract data from legacy sources and load it into Snowflake.
Prepared SQL scripts to test data and stored source code in Bitbucket.
Consumed APIs using the Python requests library to read JSON reports and files.
Analyzed legacy code in SSIS packages and implemented similar, fine-tuned code in Snowflake.
Actively participated in migrating legacy servers to Snowflake.
Automated ETL orchestration tasks and log archiving through Bash/Shell scripting, streamlining daily data pipeline monitoring and reducing manual intervention.
Designed end-to-end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Azure Monitoring, Key Vault, Function app and Event Hubs.
Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
Refactored legacy ETL logic into OOP, leading to a 30% reduction in code duplication across multiple data pipelines.
Exposure to Jupyter Notebook in Azure Databricks environment.
Good experience tracking and logging end-to-end software application builds using Azure DevOps.
Worked with OpenShift platform in managing Docker containers and Kubernetes Clusters.
Involved in various SDLC Life cycle phases like development, deployment, testing, documentation, implementation & maintenance of application software.
Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka (a streaming sketch follows this role's responsibilities).
Moved data from SAP HANA into Snowflake.
Designed and configured a Kafka cluster to accommodate a heavy throughput of 1 million messages per second; used Kafka 0.6.3 producer APIs to produce messages.
Experience in transporting and processing real-time stream data using Kafka.
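The Kafka and Spark Streaming work above can be sketched as a Spark Structured Streaming job that lands Kafka events in a bronze Delta table. This is a hedged sketch assuming a Databricks runtime with the Kafka connector and Delta Lake available; the broker address, topic, event schema, and storage paths are hypothetical.

```python
# Hedged sketch: Spark Structured Streaming from Kafka into a bronze Delta table.
# Broker, topic, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

event_schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "orders")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes; parse the JSON value into typed columns.
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
       .select("e.*")
)

# Append into a bronze Delta table; the checkpoint enables fault-tolerant restarts.
query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/orders_bronze")
    .outputMode("append")
    .start("/mnt/delta/bronze/orders")
)
query.awaitTermination()
```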
Environment: Azure, Data Lake, Data Factory, Jupyter Notebook, SQL, Python, Event Hubs, Kafka, Function App, Key Vault, Azure SQL, Power BI, Azure Monitoring, Azure DevOps
Client: Charles Schwab – Chicago IL July 2021 – Feb 2023
Role: Data Engineer
Responsibilities:
Worked with Azure cloud services including App Services, Azure SQL Database, Azure Blob Storage, Azure Functions, Virtual Machines, Azure AD, Azure Data Factory, Event Hubs, and Event Queues.
Worked on building data pipelines using Azure Data Factory to load data from legacy and SQL Server sources into Azure SQL Data Warehouse with PolyBase, using Data Factory pipelines and Databricks notebooks.
Built complex ETL jobs that transform data visually with data flows or by using compute services such as Azure Databricks and Azure SQL Database.
Developed CI/CD pipelines for Databricks using Azure DevOps.
Created Pipeline to extract data from on-premises source systems to Azure cloud data lake storage.
Created end-to-end data pipelines using ADF services to load data from on-premises sources to Azure SQL Server for data orchestration.
Used various activity types: data movement, transformation, and control activities, including Copy Data, Data Flow, Get Metadata, Lookup, Stored Procedure, and Execute Pipeline.
Migrated relational data to Cosmos DB using Azure Data Factory pipelines and Azure Functions, improving query latency by 40% for distributed workloads. Implemented performance tuning techniques in Azure Data Factory and Synapse.
Implemented PolyBase on SQL instances to process data from external data sources using native SQL queries.
Monitored the scheduled Azure Data Factory pipelines and configured the alerts to get notification of failure pipelines.
Managed real-time data ingestion from Event Hubs into Synapse and Snowflake.
Ensured security using Key Vault, role-based access, and VNETs.
Configured and implemented the Azure Data Factory Triggers and scheduled the Pipelines.
Involved in designing and developing Azure Stream Analytics jobs to process real-time data using Azure Event Hubs.
Worked on building custom ETL workflows using Spark/Hive to perform data cleaning and mapping.
Implemented Copy activity and Data Flow jobs using Azure Data Factory.
Created reusable shell scripts for managing Azure CLI-based deployments and resource provisioning as part of infrastructure automation.
Managed host Kubernetes environment, making it quick and easy to deploy and manage containerized applications without container orchestration expertise.
Extensively used Azure Key Vault to configure connections in linked services.
Automated ETL orchestration tasks and log archiving through Bash/Shell scripting, streamlining daily data pipeline monitoring and reducing manual intervention.
Create and maintain optimal data pipeline architecture in cloud Microsoft Azure using Data Factory and Azure Databricks.
Used a combination of Python and Snowflake SnowSQL to create ETL pipelines into and out of the data warehouse.
Wrote SQL queries in Snowflake.
Exposed transformed data on the Azure Databricks Spark platform in Parquet format for efficient data storage.
Followed Test-Driven Development (TDD) methodology using PyTest to write unit tests for data transformation and validation functions in Databricks, resulting in improved test coverage and fewer production defects.
Extensive experience creating pipeline jobs, scheduling triggers, and mapping data flows using Azure Data Factory (v2), and using Key Vault to store credentials.
Extensively worked on Azure Data Lake Analytics with the help of Azure Databricks to implement SCD-1 and SCD-2 approaches (an SCD merge sketch follows this role's responsibilities).
Worked on ARM templates to deploy to production using Azure DevOps.
Applied OOP principles in Python to design modular, reusable data transformation classes used in Azure Databricks notebooks, improving code maintainability and scalability.
Loaded data into Snowflake tables from the internal stage using SnowSQL.
Developed APIs to access enterprise metadata platform for registering datasets and migrated all the data pipeline applications from legacy to new platform.
Created and configured a new event hub with the provided event hubs namespace.
Extensive hands-on experience tuning Spark jobs and Spark performance.
Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats for analysis and transformation.
Developed a customer-message consumer to consume data from the Kafka producer and push the messages to HDFS.
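A minimal sketch of the SCD-1 upsert pattern referenced above, using the Delta Lake merge API in Databricks; the table path, key column, and staging table name are hypothetical, and a full SCD-2 variant would additionally stage expiring rows (effective/end dates) before the merge.

```python
# Hedged sketch of an SCD-1 upsert with the Delta Lake merge API (Databricks).
# Table path, key column, and staging table name are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("scd1-merge").getOrCreate()

updates = spark.table("staging.customer_updates")                  # incoming batch
customers = DeltaTable.forPath(spark, "/mnt/delta/silver/customers")

(
    customers.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()      # SCD-1: overwrite changed attributes in place
    .whenNotMatchedInsertAll()   # insert customers not seen before
    .execute()
)
```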
Environment: Azure, Azure Data Factory, Azure Synapse, performance tuning, ADLS Gen2, Data Flow jobs, Copy activity, Lookup activity, Data Flow, linked services, Logic Apps, Event Hub, Databricks, PolyBase, Snowflake, PySpark, Python, PyCharm, JIRA, GitHub.
Client: IMO – Rosemont, IL  May 2019 – June 2021
Role: Data Engineer
Responsibilities:
Designed and set up an Enterprise Data Lake to support various use cases including analytics, processing, storing, and reporting of voluminous, rapidly changing data.
Worked on creating tabular models in Azure Analysis Services to meet business reporting requirements.
Created and optimized ADF dataflows and orchestrated pipelines integrating Snowflake and Databricks.
Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) as part of cloud migration, processing the data in Azure Databricks.
Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
Worked on Azure Blob and Data Lake Storage, loading data into Azure Synapse Analytics (SQL DW).
Worked on Azure SQL Database Import and Export Service.
Developed Python, PySpark, and Bash scripts to transform and load data across on-premises and cloud platforms.
Worked on Apache Spark, utilizing the Spark SQL and Spark Streaming components for real-time data processing.
Set up and worked on Kerberos authentication principles to establish secure network communication on clusters.
Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs.
Imported data from different sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output response.
Implemented performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
Good knowledge of Spark platform parameters such as memory, cores, and executors.
Developed a reusable framework, to be leveraged for future migrations, that automates ETL from RDBMS systems to the Data Lake using Spark Data Sources and Hive data objects (sketched below).
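A hedged sketch of the reusable RDBMS-to-Data-Lake ingestion pattern described above: a parameterized Spark JDBC read followed by a partitioned write registered as a Hive table. The JDBC URL, table names, partition column, and credentials are hypothetical placeholders, and the matching JDBC driver is assumed to be on the cluster classpath.

```python
# Hedged sketch of a reusable RDBMS-to-Data-Lake ingestion helper using the
# Spark JDBC data source and a partitioned Parquet-backed Hive table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rdbms-to-lake")
    .enableHiveSupport()
    .getOrCreate()
)

def ingest_table(jdbc_url, source_table, target_db, partition_col, user, password):
    """Copy one RDBMS table into the lake as a partitioned Hive table."""
    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", source_table)
        .option("user", user)
        .option("password", password)
        .option("fetchsize", 10000)
        .load()
    )
    (
        df.write.mode("overwrite")
        .format("parquet")
        .partitionBy(partition_col)
        .saveAsTable(f"{target_db}.{source_table}")
    )

# Example invocation; real credentials would come from a vault, not literals.
ingest_table(
    "jdbc:sqlserver://dbhost:1433;databaseName=sales",
    "orders", "lake_raw", "order_date",
    user="etl_user", password="********",
)
```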
Environment: Azure, Azure Data Factory, Databricks, PySpark, Python, Apache Spark, HBase, Hive, Sqoop, Snowflake, SSRS, Azure Synapse Analytics, Tableau.
Client: Medline Industries – Chicago, IL  August 2016 – May 2019
Role: ETL/Data Engineer
Responsibilities:
Designed end-to-end scalable architecture to solve business problems using various Azure components such as HDInsight, Data Factory, Data Lake, Storage, and Machine Learning Studio.
Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity.
Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
Wrote multiple Hive UDFs using core Java and OOP concepts, along with Spark functions, within ETL programs.
Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed file formats.
Managed host Kubernetes environment, making it quick and easy to deploy and manage containerized applications without container orchestration expertise.
Undertook data analysis and collaborated with the downstream analytics team to shape the data according to their requirements.
Used Azure Event Grid as a managed event service to easily manage events across many different Azure services and applications.
Implemented code in Python to retrieve and manipulate data.
Used Service Bus to decouple applications and services from each other, providing benefits such as load-balancing work across competing workers.
Leveraged Delta Lake for scalable metadata handling and unification of streaming and batch processing.
Used Delta Lake time travel, as data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments (a time-travel sketch follows this role's responsibilities).
Used Delta Lake merge, update, and delete operations to enable complex use cases.
Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.
Used Databricks for its easy integration with the whole Microsoft stack.
Wrote Spark SQL and PySpark scripts in the Databricks environment to validate monthly account-level customer data.
Created Spark clusters and configured high-concurrency clusters using Azure Databricks (ADB) to speed up the preparation of high-quality data.
Spun up HDInsight clusters and used Hadoop ecosystem tools such as Kafka, Spark, and Databricks for real-time streaming analytics, and Sqoop, Pig, Hive, and Cosmos DB for batch jobs.
Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Used Azure Data Catalog to help organize data assets and get more value from existing investments.
Used Azure Synapse to bring these capabilities together with a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs.
Utilized clinical data to generate features describing different illnesses using LDA topic modeling.
Processed image data through the Hadoop distributed system using MapReduce, then stored the results in HDFS.
Created Session Beans and Controller Servlets for handling HTTP requests from Talend.
Performed Data Visualization and Designed Dashboards with Tableau and generated complex reports including charts, summaries, and graphs to interpret the findings to the team and stakeholders.
Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
Utilized Waterfall methodology for team and project management.
Used Git for version control with the data engineering team and data scientist colleagues.
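A short sketch of the Delta Lake time-travel usage mentioned above, assuming a Delta-enabled Spark environment such as Databricks; the table path and version number are hypothetical.

```python
# Hedged sketch of Delta Lake time travel for audit/rollback comparisons.
# Table path and version number are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

current = spark.read.format("delta").load("/mnt/delta/accounts")
as_of_v5 = (
    spark.read.format("delta")
    .option("versionAsOf", 5)          # or .option("timestampAsOf", "2019-01-01")
    .load("/mnt/delta/accounts")
)

# Audit: rows present in the current table that were absent in version 5.
new_rows = current.subtract(as_of_v5)
new_rows.show(truncate=False)
```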
Environment: Ubuntu 16.04, Hadoop 2.0, Spark (PySpark, Spark Streaming, Spark SQL, Spark MLlib), NiFi, Jenkins, Java, Pig 0.15, Python 3.x (NLTK, Pandas), Tableau 10.3, GitHub, Data Modeling, Azure (Storage, DW, ADF, ADLS, Databricks), AWS Redshift, and OpenCV.
Client: Delta Airlines – Atlanta, GA  June 2015 – Aug 2016
Role: ETL/Data Warehouse Developer
Responsibilities:
Responsible for mentoring developers and reviewing code for mappings developed by other developers.
Met with Business Process and Solution Development teams to deliver technical solutions based on functional requirements.
Coordinated with business users to understand business needs and implement them in a functional data warehouse design.
Used various transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and for migrating clean, consistent data.
Built enterprise ETL solutions and performed extensive data modeling for OLAP systems.
Developed mapping using Informatica transformations.
Efficient in dimensional data modeling for data mart design, identifying facts and dimensions, and developing fact tables and dimension tables using Slowly Changing Dimensions (SCD).
Involved in ETL design for the new requirements.
Implemented procedures, triggers, cursors using Sybase T-SQL.
Finalized flat file structures with the business; used Informatica PowerCenter as the ETL tool for constantly moving data from sources into the staging area, creating complex SQL.
Troubleshot and tuned SQL using EXPLAIN PLAN.
Environment: Oracle, Informatica PowerCenter, Sybase, UNIX scripting, Selenium, Maven, Eclipse, TOAD.
Client: Value Labs - India August 2010 – April 2013
Role: ETL Developer
Responsibilities:
Designed, developed, and maintained ETL processes to extract, transform, and load data from various source systems into data warehouses.
Analyzed business requirements and translated them into technical specifications for ETL processes.
Developed complex SQL queries, stored procedures, and scripts for data extraction and transformation.
Worked on performance tuning of ETL jobs to optimize load times and resource utilization.
Implemented error handling, auditing, and data validation mechanisms within ETL workflows.
Collaborated with business analysts, data architects, and QA teams to ensure data quality and consistency.
Migrated data across heterogeneous databases and performed data cleansing and data enrichment tasks.
Scheduled and monitored ETL workflows and jobs using scheduling tools and managed job failures effectively.
Documented ETL processes, mappings, and transformations for better maintainability and knowledge sharing.
Provided production support for ETL processes, troubleshooting issues and implementing fixes as required.
Environment: ETL, Informatica PowerCenter / SSIS, Oracle, SQL Server, MySQL, UNIX Shell Scripting, PL/SQL, Control-M, Windows, UNIX/Linux, SVN / TFS