
Power BI Data Engineering

Location:
Owings Mills, MD
Posted:
November 14, 2024

Resume:

Name: Sushmitha Kanapuram

Mobile: 443-***-****

Email Id: *********.*********@*****.***

Professional Summary:

6+ years of hands-on experience in Data Engineering, Data Modeling, Cloud Migration, and ETL Development, specializing in AWS, Azure, and Google Cloud Platform (GCP).

Expertise in designing, building, and managing large-scale data pipelines using Python, PySpark, Scala, SQL, and Apache Beam.

Proficient in GCP services like BigQuery, Dataflow, DataProc, and Pub/Sub for high-performance data processing and transformation.

Strong experience with AWS services: S3, Redshift, Glue, Lambda, and EMR for ETL and real-time analytics.

Proficient in Azure services: HDInsight, Data Lake, Databricks, Data Factory, and Synapse.

Extensive experience with Hadoop ecosystems, including Hive, HDFS, Kafka, Spark, and MapReduce.

Expertise in CI/CD automation with tools like Jenkins, Git, Docker, and Kubernetes for scalable cloud data solutions.

Proficient in SQL and NoSQL databases such as PostgreSQL, MongoDB, Cosmos DB, and DynamoDB.

Knowledge of data visualization using Tableau, Power BI, and Matplotlib for actionable insights.

Experience in Data Analysis, Data Profiling, Data Migration, Data Integration and validation of data across all the integration points.

Experience in configuring Spark Streaming to receive real-time data from Apache Kafka and store it in HDFS, and expertise in using Spark SQL with various data sources such as JSON, Parquet, and CSV (see the first code sketch following this summary).

Hands-on experience architecting legacy data migration projects, such as Teradata to AWS Redshift.

Worked on ETL by developing and deploying AWS Lambda functions that generate a serverless data pipeline whose metadata is registered in the Glue Data Catalog and queried from Athena (see the second code sketch following this summary).

Experience in Developing ETL solutions using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats.

Good understanding of ETL tools such as SnapLogic, Informatica, and Ab Initio, and helped ETL teams with mapping documents and transformations.

Extensive experience on usage of ETL & Reporting tools like SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS).

Experience in analyzing data using Python, SQL, Hive, PySpark, and Spark SQL for Data Mining, Data Cleansing, Data Munging and Machine Learning.

Working knowledge of data migration, data profiling, data cleaning, and transformation using a variety of ETL tools such as Apache NiFi, Informatica, and SSIS.

Extensive experience in setting up CI/CD pipelines using tools such as Jenkins, Bitbucket, GitHub, Maven, SVN, and Azure DevOps.

Knowledge in Data Visualization and analytics using Tableau desktop, Matplotlib, Seaborn, Plotly, and Power BI desktop.

Experienced in containerization of applications (microservices) with Docker and Kubernetes.

Capable of working within the SDLC using Agile and Waterfall methodologies.
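
The following is a minimal, illustrative PySpark Structured Streaming sketch of the Kafka-to-HDFS pattern mentioned in this summary; the broker address, topic name, and paths are placeholder assumptions, and the job additionally requires the spark-sql-kafka connector package on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-hdfs-sketch")
         .getOrCreate())

# Read a stream from a hypothetical Kafka topic; broker and topic are illustrative.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "events")
       .load())

# Kafka delivers key/value as binary; cast the value column to string before writing out.
events = raw.select(col("value").cast("string").alias("payload"))

# Land the stream as Parquet files on an illustrative HDFS path, with checkpointing.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events/")
         .option("checkpointLocation", "hdfs:///checkpoints/events/")
         .outputMode("append")
         .start())

query.awaitTermination()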
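
Below is a hedged sketch of the serverless Lambda/Glue/Athena pattern referenced in this summary; the crawler, database, table, and S3 locations are hypothetical names used only for illustration.

import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

def handler(event, context):
    # Refresh metadata in the Glue Data Catalog by running a crawler
    # ("raw-data-crawler" is a hypothetical crawler name).
    glue.start_crawler(Name="raw-data-crawler")

    # Submit an Athena query against the cataloged table; results land in S3.
    athena.start_query_execution(
        QueryString="SELECT count(*) FROM raw_db.events",
        QueryExecutionContext={"Database": "raw_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    return {"status": "submitted"}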

Technical Skills:

Programming languages

Python, PySpark, Scala, Java, R, SQL, PL/SQL

Cloud

AWS (S3, EMR, RDS, Redshift, DynamoDB, Lambda, Glue, Athena), Azure (Data Lake, Data Factory, Databricks, Synapse), GCP (BigQuery, Dataflow, Pub/Sub, BigTable, DataProc)

Big Data Technologies

HDFS, Hive, MapReduce, Pig, Sqoop, HBase, Kafka, Airflow, Spark, Zookeeper, Oozie, and Flume.

Data Warehousing Tools

Talend, SSIS, SSAS, SSRS, Toad Data Modeller

Python Libraries/Packages

NumPy, SciPy, Boto, PySide, PyTables, Pandas, Matplotlib, httplib2, urllib2, Beautiful Soup, PyQuery

Automation Tools

Puppet, Chef, Ansible, Terraform

ETL Tools

DataStage, Informatica, Power BI

SDLC Methodologies

Agile, Waterfall, Scrum

Build and CI Tools

Docker, Jenkins, Kubernetes

Databases

PostgreSQL, SQL Server, MySQL, MongoDB, Cosmos DB, DynamoDB

Version Control Tools

Git, Git Repository, Bitbucket

Professional Experience:

Client: Southwest Airlines, Austin, TX Oct 2023 - Present

Role: Data Engineer

Responsibilities:

Spearheaded the design and implementation of scalable data infrastructure leveraging Azure services to enable real-time data processing and analytics, significantly enhancing operational efficiency across the organization.

Designed and implemented Azure-based data pipelines to streamline data ingestion, transformation, and storage, enhancing real-time data processing capabilities for operational analytics.

Utilized Azure Data Factory to orchestrate data workflows across multiple systems, ensuring seamless data integration from on-premises and cloud sources.

Deployed and managed Azure Databricks to enable large-scale data processing and analytics, improving the team's ability to run machine learning models and complex data transformations.

Leveraged Azure Synapse Analytics to build a unified analytics platform, allowing for efficient querying and analysis of structured and unstructured data across the organization.

Integrated Azure Blob Storage and Azure Data Lake Storage for cost-effective and scalable storage solutions, enhancing data availability for various departments.

Developed and optimized complex SQL queries for data transformation and integration within the Azure ecosystem, ensuring seamless data flow across systems.

Utilized NiFi for automating data ingestion workflows, enabling the efficient movement of data from on-premises systems to cloud storage.

Developed and maintained Azure SQL Databases for high-performance data storage, ensuring reliability and scalability of mission-critical systems.

Implemented Azure Kubernetes Service (AKS) for containerized application deployment, ensuring high availability, scalability, and efficient resource management.

Managed security and access control using Azure Active Directory (AAD), ensuring compliance with industry standards while controlling access to sensitive data and resources.

Built automated monitoring solutions using Azure Monitor and Azure Log Analytics, providing real-time insights into application health, data workflows, and resource utilization.

Developed scripts to load data into Hive from HDFS and was involved in ingesting data into the data warehouse using various data loading techniques.

Developed PySpark jobs on Databricks to perform data cleansing, data validation, and transformations, creating datasets tailored to machine learning use cases (see the sketch at the end of this section).

Developed microservice onboarding tools leveraging Python and Jenkins, allowing for easy creation and maintenance of build jobs and Kubernetes deployments and services.

Containerized applications using Docker to ensure consistency across development, testing, and production environments, reducing deployment errors and improving overall efficiency.

Analyzed business user requirements, analyzed data, and designed software solutions in Tableau Desktop based on the requirements.

Optimized and scaled big data processing frameworks using Apache Spark, Kafka, and Hive on GCP DataProc for high-volume data streams.

Used Jira to track and report performance metrics such as burn-up/down charts and sprint reports, allowing teams to closely monitor productivity over time.

Collaborated with cross-functional teams to define data engineering best practices, driving improvements in performance and reliability for data systems.
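
A minimal sketch of the kind of PySpark cleansing and validation job run on Databricks, as referenced earlier in this section; the input/output paths, column names, and rules are illustrative assumptions rather than the project's actual logic.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleansing-sketch").getOrCreate()

# Hypothetical mounted input path; the real datasets are not described in this resume.
df = spark.read.parquet("/mnt/raw/customers/")

cleaned = (df
           .dropDuplicates(["customer_id"])                       # de-duplicate on the business key
           .filter(F.col("customer_id").isNotNull())              # basic validation rule
           .withColumn("email", F.lower(F.trim(F.col("email"))))  # normalize a string column
           .withColumn("signup_date", F.to_date("signup_date", "yyyy-MM-dd")))

# Write the curated dataset to an illustrative output location for downstream ML use.
cleaned.write.mode("overwrite").parquet("/mnt/curated/customers/")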

Environment: AWS (S3, Glue, Redshift, EMR, RDS, DynamoDB, Lambda, Athena, KMS), MongoDB, SQL, Python, PySpark, Scala, Airflow, Hive, HDFS, Snowflake, Databricks, Talend, Kubernetes, Tableau, Git, Linux.

Client: UHG, Minnesota Feb 2023 - Sep 2023

Role: Data Engineer

Responsibilities:

Integrated Azure Data Factory for orchestrating data workflows, extracting data from Azure Data Lake into HDInsight Spark clusters for transformation.

Configured real-time data ingestion pipelines using Apache Kafka and applied PySpark transformations for data processing.

Created CI/CD pipelines in Azure DevOps and automated deployment processes using Jenkins and Kubernetes.

Configured input and output bindings of an Azure Function with an Azure Cosmos DB collection to read and write data from the container whenever the function executes (see the sketch at the end of this section).

Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory, Azure Databricks, and Spark SQL.

Orchestrated and managed containerized microservices with Kubernetes, enabling seamless scaling, self-healing, and high availability of applications in a cloud-native environment.

Implemented CI/CD pipelines with Azure DevOps, leveraging Docker for containerization and Kubernetes for automated deployment and management of microservices in the Azure cloud environment.

Extracted data from Azure Data Lake into the HDInsight cluster (Intelligence + Analytics), applied PySpark transformations and actions on the data, and loaded it into HDFS.

Involved in loading and transforming large datasets from relational databases into HDFS and vice versa using Sqoop imports and exports.

Created the Detail Technical Design Documents which have the ETL technical specifications for the given functionality.

Worked on Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations to load it back into Azure Synapse.

Configured Spark Streaming to receive real-time data from Apache Kafka and stored the streamed data to Azure Table storage using Scala.

Designed and wrote the entire ETL/ELT process to support Data Warehouse with complex dependencies in hybrid Business Intelligence environment (Azure & SQL Server).

Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing and transforming the data to uncover insights into the customer usage patterns.

Configured Azure Encryption for Azure Storage and Virtual Machines, Azure Key Vault services to protect and secure the data for cloud applications.

Closely worked with Artificial Intelligence Team to create and build a Machine learning layer in the final Product.

Created CI/CD pipelines in Azure DevOps environments by defining their dependencies and tasks, and built end-to-end CI automation by integrating automated Maven builds with Jenkins.

Worked on Kubernetes deployment and operations pipelines in which all code was written in Java and Python and stored in Bitbucket for staging and testing purposes.

Designed and developed business intelligence dashboards, analytical reports, and data visualizations in Power BI, creating multiple measures with DAX expressions for user groups such as the sales, operations, and finance teams.
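
A minimal sketch of an Azure Function with Cosmos DB input and output bindings, as referenced earlier in this section; it assumes the classic function.json programming model, and the binding names, container, and fields shown here are hypothetical.

import azure.functions as func

# The Cosmos DB input binding ("inputDocuments") and output binding ("outputDocument")
# are assumed to be declared in function.json; database and container names are illustrative.
def main(inputDocuments: func.DocumentList, outputDocument: func.Out[func.Document]) -> None:
    # Read the documents supplied by the input binding...
    count = len(inputDocuments)
    summary = {"id": "latest-run-summary", "documentsSeen": count}
    # ...and write a single summary document back to the container via the output binding.
    outputDocument.set(func.Document.from_dict(summary))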

Environment: Azure (HDInsight, Data Lake, Blob Storage, Data Factory, Synapse, DynamoDB, Lambda, Data Warehouse, Key Vault, SQL DB, DevOps), Databricks, SQL, Python, PySpark, Scala, Hadoop, Cassandra, Power BI, Java, Bitbucket, Jenkins.

Client: JME Insurance, India Feb 2021 - Dec 2021

Role: Data Engineer

Responsibilities:

Participated in the full data lifecycle from analysis to production using Agile methodology.

Cleaned, transformed, and visualized data with SQL and Python, applying data warehousing techniques such as Star Schema and Snowflake Schema modeling (see the sketch at the end of this section).

Designed and implemented cloud-based data pipelines on Azure to integrate data from multiple sources and ensure real-time analytics.

Utilized Python and SQL to develop efficient ETL processes, ensuring accurate data transformation and integration for business intelligence.

Worked with Docker to containerize Python applications, enabling fast deployment and improved resource management.

Integrated Kubernetes to automate the orchestration of containers, enabling dynamic scaling and load balancing of data processing services.

Improved data accessibility and query performance by applying indexing strategies within Azure SQL Database and Azure Data Lake.

Utilized NiFi for streamlining data ingestion processes from various sources, ensuring continuous data flow into Azure storage.

Implemented Elastic Stack for enhancing search and analytics capabilities across insurance datasets, allowing for better insights into customer behavior and claims data.

Worked closely with the data science team to develop data-driven solutions, utilizing cloud-native tools and services for advanced analytics.
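
A small, illustrative pandas sketch of a star-schema load, as referenced earlier in this section; the staging file, dimension tables, and column names are assumptions, not the project's actual warehouse model.

import pandas as pd

# Hypothetical staging feed and dimension tables.
orders = pd.read_csv("staging_orders.csv")        # raw transactional feed
dim_customer = pd.read_csv("dim_customer.csv")    # customer_sk, customer_nk, attributes
dim_date = pd.read_csv("dim_date.csv")            # date_sk, calendar_date

# The fact table keeps surrogate keys and measures only; descriptive attributes stay
# on the dimension tables, which is the defining property of a star schema.
fact_orders = (orders
               .merge(dim_customer[["customer_sk", "customer_nk"]],
                      left_on="customer_id", right_on="customer_nk", how="left")
               .merge(dim_date[["date_sk", "calendar_date"]],
                      left_on="order_date", right_on="calendar_date", how="left")
               [["customer_sk", "date_sk", "order_id", "order_amount"]])

fact_orders.to_csv("fact_orders.csv", index=False)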

Environment: Tableau, SQL Server, NumPy, Seaborn, SciPy, Matplotlib, Python, SDLC - testing, Agile/Scrum, Data Warehouse, MongoDB, PostgreSQL, Oracle, Teradata, Informatica ETL, Data Modelling - Star Schema, Snowflake Schema, KPI.

Client: Sagar soft Pvt Limited, India Dec 2019 - Jan 2021

Role: Data Analyst

Responsibilities:

Generated energy consumption reports using SSRS, which showed the trend over day, month and year.

Performed ad-hoc analysis and data extraction to resolve 20% of the critical business issues.

Well versed in Agile SCRUM development methodology, used in day-to-day work in application for Building Automation Systems (BAS) development.

Streamlined and automated Excel/Tableau dashboards through Python- and SQL-based solutions to improve speed and utilization.

Designed creative dashboards, storylines for dataset of a fashion store by using Tableau features.

Developed SSIS packages for extract/load/transformation of source data into a DW/BI architecture/OLAP cubes as per the functional/technical design and conforming to the data mapping/transformation rules.

Developed data cleaning strategies in Excel (multilayer fuzzy match) and SQL (automated typo detection and correction) to organize alternative datasets daily and produce consistent, high-quality reports (see the sketch at the end of this section).

Created views to facilitate easy user interface implementation, and triggers on them to facilitate consistent data entry into the database.

Involved in Data Analysis and Data Validation by extracting records from multiple databases using SQL in Oracle SQL Developer tool.

Identified the data source and defining them to build the data source views.

Involved in designing the ETL specification documents like the Mapping document (source to target).

Used ETL (SSIS) to develop jobs for extracting, cleaning, transforming and loading data into data warehouse.

Created Stored Procedures and executed the stored procedure manually before calling it in the SSIS package creation process.

Wrote SQL test scripts to validate data for different test cases and test scenarios.

Created SSIS Packages to export and import data from CSV files, Text files and Excel Spreadsheets.

Developed various stored procedures for the data retrieval from the database and generating different types of reports using SQL reporting services (SSRS).
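
An illustrative Python sketch of the fuzzy-match/typo-correction idea described earlier in this section (the original work used Excel formulas and SQL); the canonical list and similarity cutoff are made-up examples.

import difflib

# Hypothetical reference list of clean, canonical values.
CANONICAL_NAMES = ["Acme Insurance", "Globex Corp", "Initech Ltd"]

def correct_name(raw_name: str, cutoff: float = 0.85) -> str:
    """Map a possibly misspelled name to its closest canonical form."""
    matches = difflib.get_close_matches(raw_name, CANONICAL_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else raw_name  # leave unmatched values untouched

print(correct_name("Acme Insurence"))  # -> "Acme Insurance"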

Environment: Windows, SDLC-Agile/Scrum, SQL Server, SSIS, SSAS, SSRS, ETL, PL/SQL, Tableau, Excel, CSV Files, Text Files, OLAP, Data Warehouse, SQL - join, inner join, outer join, and self-joins.

Client: Hudda Infotech Private Limited, India Oct 2017 - Nov 2019

Role: ETL Developer

Responsibilities:

Involved in complete SDLC ETL process from development to testing and production environments.

Designed, set up, maintained, and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.

Orchestrated real-time warehouse management reports, synergizing with automated sorting machinery data, using SSRS and Tableau.

Installed, configured, and maintained Apache Hadoop clusters for application development and major components of Hadoop Ecosystem: Hive, Pig, HBase, Sqoop, Flume, Oozie and Zookeeper.

Worked on creating dashboards in Tableau for a collection of several views and comparing the data simultaneously.

Involved in loading data into HBase using HBase Shell, the HBase Client API, Pig, and Sqoop.

Wrote MapReduce programs in Python with the Hadoop Streaming API (see the sketch at the end of this section).

Performed ETL analysis and designed, developed, tested, and implemented ETL processes, including performance tuning and query optimization of the database.

Authored SQL scripts for database optimization, including long-running query identification, session blocking resolution, and data archiving from production to archive servers.

Developed ETL programs using Informatica PowerCenter 9.6.1/9.5.1 to implement the business requirements.

Performed data scraping, cleaning, analysis, and interpretation and generated meaningful reports using Python libraries such as pandas and Matplotlib.

Continuously monitored and managed the Hadoop cluster using Cloudera manager and Web UI.

Used Maven as the build tool and SVN for code management.

Implemented testing scripts to support test driven development and continuous integration.
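
A minimal word-count-style sketch of a Python Hadoop Streaming job, as referenced earlier in this section; mapper and reducer are shown in one file for brevity, and the jar location and paths in the comment are illustrative.

#!/usr/bin/env python3
# Run (illustrative paths and jar location):
#   hadoop jar hadoop-streaming.jar -input /data/input -output /data/output \
#       -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
import sys
from itertools import groupby

def mapper(stream=sys.stdin):
    # Emit key<TAB>1 for every word; Hadoop sorts lines by key before the reducer runs.
    for line in stream:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(stream=sys.stdin):
    # Input arrives sorted by key, so consecutive lines with the same word can be grouped.
    pairs = (line.rstrip("\n").split("\t", 1) for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()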

Environment: Hadoop, MapReduce, HDFS, HBase, Hive, Impala, Pig, SQL, Ganglia, Sqoop, Flume, Oozie, Unix, Java, JavaScript, Maven, Eclipse.

Education:

Master of Science in Information Systems, University of Maryland, Baltimore County
