Monish Bandapalli
Data Engineer
******************@*****.*** 513-***-**** LinkedIn Cincinnati, Ohio
As a seasoned Data Engineer with over 5 years of experience in designing, developing, and optimizing data pipelines and architectures on AWS, Azure, and GCP, I am eager to leverage my expertise in cloud-based data engineering to support data-driven decision-making and enhance organizational efficiency.
Professional Summary:
Over 5 years of professional experience in full life cycle system development, covering analysis, design, development, testing, documentation, implementation, and maintenance of data engineering solutions in web-based and client/server environments.
Hands-on experience with Amazon Web Services (AWS) including S3, EC2, Elastic MapReduce, Redshift, Relational Database Service (RDS), Lambda, Glue, Kinesis, Simple Notification Service (SNS), Simple Queue Service (SQS), Identity and Access Management (IAM), and CloudFormation.
Extensive experience with Microsoft Azure services such as Azure Data Factory, Azure Data Lake, Azure Databricks, Azure Synapse Analytics, Cosmos DB, HDInsight, and Azure Stream Analytics for building scalable data lakes, ETL processes, and big data solutions.
Expertise in migrating on-premises SQL databases to Azure Data Lake, Databricks, and Azure Synapse. Proficient in managing and granting database access, and implementing big data solutions using Hadoop, PySpark, Hive, Sqoop, Avro, and Thrift.
Experienced with Google Cloud Platform (GCP) including BigQuery, Dataproc, Dataflow, Pub/Sub, and Cloud Functions for large-scale data processing and analytics.
Expertise in real-time data integration using Kafka, Spark Streaming, and HBase to process and analyze data in real-time, ensuring responsive analytics.
Developed batch processing solutions using Azure Data Factory and Databricks, handling complex data transformations and improving processing efficiency by up to 40%.
Extensive experience with file formats such as Avro and Parquet, converting and transforming data using PySpark DataFrames, and optimizing query performance in both batch and real-time environments.
Strong knowledge of CI/CD pipelines using Jenkins, Kubernetes, and Docker for deployment and management of data pipelines across cloud platforms. Experience in using AWS CloudFormation, API Gateway, and Lambda to automate and secure infrastructure in AWS.
Practical experience in data warehousing with Snowflake and in integrating data from multiple source systems, including loading and transforming nested JSON data into Snowflake tables (a brief sketch follows this summary).
Proficient in utilizing Python and Apache Airflow to create, schedule, and monitor workflows, ensuring high availability and performance of data processing pipelines.
Hands-on experience with SQL and NoSQL databases such as Oracle, SQL Server, MySQL, Teradata, HBase, Cassandra, and MongoDB, ensuring efficient data storage, retrieval, and analysis.
Involved in all phases of Software Development Life Cycle (SDLC) for large-scale enterprise software, utilizing Object-Oriented Analysis and Design methodologies, Agile, and Waterfall frameworks.
Experienced in performance tuning, data profiling, and troubleshooting across cloud environments, with proficiency in using monitoring tools such as Nagios, CloudWatch, and ELK Stack to optimize system performance.
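Below is a minimal sketch of the nested-JSON-to-Snowflake pattern referenced above, using the snowflake-connector-python client. The connection parameters, stage (@json_stage), and table names (RAW_EVENTS, EVENTS) are hypothetical placeholders rather than details from any specific engagement.
```python
# Minimal sketch: land nested JSON in a Snowflake VARIANT column, then
# flatten it into a typed table. Stage/table names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()

# Land the raw JSON documents as-is in a single VARIANT column.
cur.execute("CREATE TABLE IF NOT EXISTS RAW_EVENTS (v VARIANT)")
cur.execute("""
    COPY INTO RAW_EVENTS
    FROM @json_stage
    FILE_FORMAT = (TYPE = 'JSON')
""")

# Flatten the nested array into a relational, queryable table.
cur.execute("""
    CREATE OR REPLACE TABLE EVENTS AS
    SELECT
        v:id::STRING            AS event_id,
        v:timestamp::TIMESTAMP  AS event_ts,
        item.value:sku::STRING  AS sku,
        item.value:qty::NUMBER  AS qty
    FROM RAW_EVENTS,
         LATERAL FLATTEN(input => v:items) item
""")
conn.close()
```
The VARIANT-then-flatten approach keeps ingestion schema-agnostic while still exposing strongly typed columns for downstream analytics.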
Technical Skills:
Amazon Web Services (AWS): S3, EC2, Elastic MapReduce, Amazon Redshift, Relational Database Service, AWS Lambda, AWS Glue, Amazon Kinesis, Simple Notification Service (SNS), Simple Queue Service (SQS), Amazon Machine Image (AMI), IAM, AWS CloudFormation, Athena
Microsoft Azure: Azure Data Factory, Azure Data Lake, Azure Databricks, Azure Synapse Analytics, Azure Data Explorer, Azure Stream Analytics, Azure Blob Storage, Azure Cosmos DB, Azure Event Hubs, Azure HDInsight, Azure Data Share, Azure Logic Apps, Azure Machine Learning, Azure Cognitive Services, Azure DevOps
Google Cloud Platform (GCP): Google BigQuery, Google Cloud Storage, Google Dataflow, Google Dataproc, Google Cloud Pub/Sub, Google Cloud Functions
Cloud Service Models: IaaS, PaaS, SaaS
Hadoop Ecosystem and Components: Hadoop Distributed File System (HDFS), Hadoop User Experience (Hue), MapReduce, Apache Pig, Apache Hive, HCatalog, Apache Sqoop, Apache Impala, Apache Zookeeper, Apache Flume, Apache Kafka, YARN, Cloudera Manager, Kerberos, Druid, Presto
Real-Time Data Processing: Apache Spark, PySpark, Apache Airflow, Apache Kafka
Data Warehousing: Snowflake
Currently Exploring: Apache Flink, Apache Drill, Apache Tachyon, SAP
Relational Databases: Oracle, Microsoft SQL Server, MySQL, Teradata
NoSQL Databases: Apache HBase, Apache Cassandra, MongoDB
Programming and Query Languages: Java, Scala, Python, Go, C, C++, SQL, T-SQL
Python Libraries: Pandas, NumPy, Scikit-learn, TensorFlow
Scripting Languages: Shell Scripting, Bash Scripting
Machine Learning: Supervised learning, Unsupervised learning, Regression, Classification, Decision trees, Clustering, K-means
Web Servers: Apache Tomcat, Oracle WebLogic
ETL Tools: SQL Server Integration Services (SSIS), Informatica, IBM DataStage
Reporting Tools: Tableau, Microsoft Power BI, SQL Server Reporting Services (SSRS), Celebrus, Quantum Metric, Tealium, Grafana
Methodologies: Agile (Scrum, Kanban), Waterfall, Software Development Life Cycle (SDLC)
DevOps Tools: Jenkins, Kubernetes, Docker, Terraform
Integration Tools: REST, SOAP, GraphQL
Version Control: Git, GitHub, Bitbucket
Tools: Jira
Operating System: Linux
Education:
Master of Engineering in Computer Science Dec 2023
University of Cincinnati, Cincinnati, Ohio CGPA: 3.98
Work Experience:
Client: Techvajra Inc. (RiteRug Flooring), Cincinnati, Ohio, USA Mar 2024 - Present
Role: Azure Data Engineer
Description: RiteRug Flooring specializes in providing a diverse selection of flooring solutions for residential and commercial needs. Developed and optimized data pipelines, APIs, and monitoring systems to manage complex data transformations and migrations across various platforms.
Responsibilities:
Collaborated in various phases of SDLC including requirements gathering, design, development, deployment, and analysis of the application.
Executed a seamless data migration project, transferring 1.5 TB of multi-state data from SQL Server to Snowflake using Python scripts and SnowSQL, resulting in a 60% improvement in data accessibility and query performance.
Developed and automated data pipelines using Azure Data Factory (ADF), PySpark, and T-SQL for ETL processes, achieving a 40% reduction in manual workload and a 30% increase in data processing efficiency.
Implemented HBase schema design improvements for storing processed streams, reducing storage overhead by 15% and increasing data retrieval performance by 25%.
Replaced Hive's default Derby metastore with a MySQL-backed metastore.
Configured and upgraded On-Premises Data Gateway to connect SQL Server, Azure Analysis Services, and Power BI Service, improving data transfer speed by 20%.
Utilized Spark Core, Spark ML, Spark Streaming, and Spark SQL to optimize data processing workflows.
Developed and implemented a scalable Java API (Commerce API) to seamlessly connect with Cassandra, increasing system efficiency by 30%.
Leveraged Apache Druid to implement real-time data ingestion and aggregation for high-velocity datasets, reducing data latency by 50% and enabling near real-time analytics for business-critical decisions.
Applied database migration methodologies and integration conversions to migrate legacy ETL processes into an Azure Synapse-compatible architecture.
Automated monitoring and CI/CD pipelines using Jenkins, Kubernetes, Docker, Git, and Bash scripting, reducing deployment time by 40% and pipeline issue resolution time by 40%.
Developed intricate data pipelines using Azure Data Factory (ADF) and PySpark in Databricks, resulting in a 40% increase in data processing efficiency.
Designed and developed ETL workflows using Informatica PowerCenter to streamline data integration processes, improving data accuracy by 30% and reducing processing time for complex data transformations by 25%.
Controlled and granted access to databases for a team of 50 analysts, enhancing data security measures by 20%.
Implemented Presto for ad-hoc querying across multiple data sources, improving query performance by 35% and reducing response times for complex analytics workloads.
Developed Kibana Dashboards based on Logstash data and integrated source and target systems into Elasticsearch for near real-time log analysis and end-to-end transaction monitoring.
Used Jira for ticketing and issue tracking, and Jenkins for continuous integration and deployment.
Developed and implemented PySpark code to validate data from raw sources, reducing data errors by 95% and ensuring accurate data migration to Snowflake tables (a brief sketch follows this role).
Implemented and managed data and analytics tools such as Celebrus, Quantum Metric, and Tealium, optimizing data collection, user behavior analysis, and delivering real-time insights for enhanced business performance.
Environment: Azure, Azure Analysis Services, Azure Data Factory, Azure Synapse Analytics, Cassandra, CI/CD, Docker, Elasticsearch, ETL, HBase, Hive, Java, Jenkins, Jira, Kafka, Kubernetes, MySQL, Power BI, PySpark, Python, Snowflake, Spark, Spark Core, Spark SQL, Spark Streaming, SQL
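A minimal sketch of the PySpark validation step referenced above, assuming a Spark runtime with mounted storage; the paths, column names, and checks shown are illustrative assumptions, not project specifics.
```python
# Sketch of a PySpark validation pass over raw source files before data is
# promoted to Snowflake. Paths and column names are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw-validation").getOrCreate()

raw = spark.read.option("header", True).csv("/mnt/raw/orders/")  # hypothetical path

# Structural checks: required keys present and non-null, no duplicate IDs.
required_cols = ["order_id", "customer_id", "order_ts"]
missing = [c for c in required_cols if c not in raw.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")

null_counts = raw.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in required_cols]
).collect()[0].asDict()

duplicate_ids = raw.groupBy("order_id").count().filter("count > 1").count()

# Split clean rows from rejects so bad records never reach the warehouse.
clean = raw.dropna(subset=required_cols).dropDuplicates(["order_id"])
rejects = raw.subtract(clean)

clean.write.mode("overwrite").parquet("/mnt/validated/orders/")
rejects.write.mode("overwrite").parquet("/mnt/rejects/orders/")

print(f"null counts: {null_counts}, duplicate ids: {duplicate_ids}")
```
Writing rejects to a separate path keeps the validated set loadable into Snowflake while preserving bad records for root-cause analysis.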
Client: Fifth Third Bank, Cincinnati, Ohio, USA Nov 2022 - Feb 2024
Role: Associate Data Engineer
Description: Fifth Third Bank is a diversified financial services company specializing in small business lending, retail banking, and investments. Played a key role in designing and implementing scalable data solutions, automating infrastructure management, and optimizing system performance across environments.
Responsibilities:
Provisioned AWS EC2 instances and automated infrastructure tasks using Terraform and Lambda, reducing resource costs by 20%.
Used Terraform templates to automate AWS infrastructure provisioning, resulting in a 40% decrease in deployment time.
Optimized Kubernetes and Docker deployments, improving system performance and application scalability across OpenShift environments.
Implemented and optimized CI/CD pipelines using Jenkins, Maven, GitHub, Chef, Terraform, and AWS, supporting parallel job execution and increasing successful builds by 60%.
Developed efficient data pipelines to load over 1TB of data daily from BDW Oracle database and Teradata into HDFS using Sqoop, improving data processing speed by 50%.
Implemented automated data pipelines for seamless migration of 150TB of data to AWS Cloud, resulting in a 50% reduction in migration time and ensuring zero data loss during the transition process.
Conducted performance tuning and optimization of Snowflake data warehouse, resulting in improved query execution times and reduced operational costs.
Configured Spark Streaming to continuously retrieve data from Kafka and persist the streams to DBFS, resulting in a 20% reduction in data processing time (a brief sketch follows this role).
Utilized advanced programming techniques to optimize Spark applications, leading to a 50% reduction in runtime for data transformation tasks.
Utilized advanced SQL querying techniques to analyze over 1TB of data from Oracle 10g/11g and SQL Server 2012, resulting in the identification of key trends and patterns that informed business decision-making processes.
Collaborated with cross-functional teams including business partners, Business Analysts, and product owners to gather requirements for scalable distributed data solutions utilizing the Hadoop ecosystem.
Integrated Control-M with various AWS services, orchestrating job schedules, data processing, and workflow automation across the AWS ecosystem.
Monitored server performance using Nagios, CloudWatch, and ELK Stack to identify and resolve issues proactively, resulting in a 20% decrease in downtime.
Utilized Amazon Athena to efficiently query large data sets stored in Amazon S3, optimizing query times by 35% and reducing data processing latency for business-critical analytics.
Environment: AWS, CI/CD, Docker, EC2, ETL, Git, HDFS, Java, Jenkins, Kafka, Kubernetes, Lambda, Oracle, Snowflake, Spark, SQL, Sqoop, Teradata
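A minimal sketch of the Kafka-to-DBFS streaming path referenced above, using Spark Structured Streaming; the broker address, topic name, schema, and DBFS paths are illustrative assumptions.
```python
# Sketch: read a Kafka topic with Spark Structured Streaming and persist the
# parsed records as Parquet on DBFS. Broker, topic, and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-to-dbfs").getOrCreate()

schema = StructType([
    StructField("txn_id", StringType()),
    StructField("account", StringType()),
    StructField("amount", DoubleType()),
])

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "transactions")                # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers bytes; decode the value column and apply the JSON schema.
parsed = stream.select(
    F.from_json(F.col("value").cast("string"), schema).alias("rec")
).select("rec.*")

query = (
    parsed.writeStream.format("parquet")
    .option("path", "/dbfs/mnt/streams/transactions/")
    .option("checkpointLocation", "/dbfs/mnt/checkpoints/transactions/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```
The checkpoint location gives the stream exactly-once file output and lets it resume from the last committed Kafka offsets after a restart.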
Client: HCL Technologies (Alcon), Chennai, India Jan 2021 - Jul 2022
Role: GCP Data Engineer
Description: Alcon Inc. is a Swiss-American pharmaceutical and medical device company specializing in eye care products. Contributed to the design and implementation of data processing solutions, including data migration, transformation, and visualization, while optimizing performance and ensuring data integrity.
Responsibilities:
Streamlined data migration and transfer processes using Cloud SDK, GCP tools, and Azure Data Factory, improving accuracy and efficiency by 50%.
Developed advanced HiveQL queries to analyze data stored in various formats like Avro and Parquet within the created tables, resulting in a 20% increase in data processing efficiency.
Automated the creation and modification of Azure SQL Database objects using advanced T-SQL scripting techniques, resulting in a 20% reduction in manual tasks.
Created custom Data Studio reports to analyze billing and usage data for cloud services, reducing unnecessary queries by 15% and saving the company $50,000 annually.
Managed petabytes of real-time streaming data using HBase, increasing processing efficiency by 40%.
Configured Google Compute Engine, Google Storage, VPC, Cloud Load Balancing, and IAM to enhance system security and improve performance metrics by 20%.
Processed and analyzed large datasets using Azure Databricks, leading to a 40% increase in data processing efficiency.
Deployed various services through Cloud Shell, reducing deployment times by 40% and streamlining operational processes.
Designed Tableau dashboards to visualize insights from processed data, resulting in a 30% improvement in decision-making and a 25% increase in actionable insights.
Developed innovative Scala-based Spark applications to enhance data processing, reducing data cleansing and transformation time by 40% across multiple report suites.
Updated Django models seamlessly using Django Evolution and manual SQL modifications, ensuring the site remained fully functional in production mode without data loss.
Migrated an entire Oracle database to BigQuery, optimizing data storage costs by 40% and improving query performance by 50%.
Automated data validation between raw source files and BigQuery tables using Python and Apache Beam, reducing manual validation efforts by 90% (a brief sketch follows this role).
Executed the data validation program with Cloud Dataflow, reducing processing time by 75% compared to the previous manual process.
Optimized data processing workflows using GCP tools such as Dataproc, BigQuery, and Cloud Functions, reducing overall processing time by 30%.
Environment: Azure, Azure Data Factory, Azure SQL Database, BigQuery, Cloud SDK, EC2, EMR, GCP, HBase, Hive, HiveQL, Power BI, PySpark, Python, S3, Scala, Spark, SQL, Tableau, VPC
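A minimal sketch of the Beam-based validation referenced above, comparing record counts between raw files in Cloud Storage and the loaded BigQuery table. The bucket, project, dataset, and query are hypothetical; the pipeline can run locally on DirectRunner or be submitted to Dataflow.
```python
# Sketch: count records in the raw GCS files and in the loaded BigQuery table,
# then compare the two counts. Bucket/table/project names are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # add --runner=DataflowRunner and GCP options as needed

with beam.Pipeline(options=options) as p:
    raw_count = (
        p
        | "ReadRawFiles" >> beam.io.ReadFromText("gs://my-raw-bucket/orders/*.json")
        | "CountRaw" >> beam.combiners.Count.Globally()
    )
    loaded_count = (
        p
        | "ReadBigQuery" >> beam.io.ReadFromBigQuery(
            query="SELECT order_id FROM `my-project.sales.orders`",
            use_standard_sql=True,
        )
        | "CountLoaded" >> beam.combiners.Count.Globally()
    )
    # Merge the two single-element counts and flag any mismatch.
    (
        (raw_count, loaded_count)
        | "Merge" >> beam.Flatten()
        | "Collect" >> beam.combiners.ToList()
        | "Compare" >> beam.Map(
            lambda counts: f"MATCH {counts}" if len(set(counts)) == 1 else f"MISMATCH {counts}"
        )
        | "Report" >> beam.Map(print)
    )
```
Count reconciliation is only a first-pass check; per-key hashing or column-level comparisons can be layered on the same pipeline shape.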
Client: CreditAccess Grameen, Chennai, India Mar 2019 - Dec 2020
Role: Data Engineer
Description: CreditAccess Grameen Limited (CA Grameen) is an Indian microfinance institution. Identified bottlenecks and inefficiencies in data pipelines and systems, and implemented optimizations to improve throughput, latency, and resource utilization.
Responsibilities:
Evaluated existing systems and proposed improvements for migrating legacy systems into an enterprise data lake on Azure Cloud and integrating modern scheduling tools like Airflow.
Automated AWS infrastructure with Terraform, implemented multi-node clusters on EC2, and built CI/CD pipelines with Jenkins on Kubernetes to enhance system reliability and deployment efficiency.
Developed complex ETL pipelines integrating Python, SQL, and Snowflake, streamlining data flow and reducing data processing time by 40%, significantly improving efficiency.
Optimized AWS Lambda functions written in Spark, reducing processing time by 30% and enhancing the efficiency of data delivery to the cloud.
Designed and implemented real-time and batch data pipelines using Spark, Kafka, and HDFS, optimizing data streaming and storage solutions for improved efficiency and data integrity.
Executed end-to-end data pipelines using Apache Airflow to ingest, process, and load data into AWS S3 and Snowflake, achieving a 40% increase in data processing efficiency (a brief sketch follows this role).
Leveraged T-SQL for MS SQL Server and ANSI SQL across various database platforms to efficiently manage and query data.
Collaborated with cross-functional teams to create business intelligence dashboards, providing real-time insights and contributing to a 40% improvement in decision-making processes.
Environment: AWS, CI/CD, Data Factory, Data Lake, Docker, EC2, ETL, Flume, HBase, HDFS, Java, Jenkins, Kafka, Kubernetes, Lambda, PySpark, Python, S3, Snowflake, Spark, Spark Streaming, SQL
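A minimal sketch of the Airflow ingestion pattern referenced above (assuming Airflow 2.4+). The bucket, file paths, stage, table, and inline credentials are placeholders; a production DAG would pull secrets from Airflow connections instead.
```python
# Sketch of an Airflow DAG that stages an extract to S3 and then issues a
# Snowflake COPY INTO. All names and credentials below are placeholders.
from datetime import datetime

import boto3
import snowflake.connector
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_s3():
    # Upload the day's extract to the raw zone of the bucket (path is illustrative).
    s3 = boto3.client("s3")
    s3.upload_file("/tmp/daily_extract.csv", "my-raw-bucket", "orders/daily_extract.csv")


def load_to_snowflake():
    # Load the staged file into Snowflake via an external stage over the bucket.
    conn = snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>",
        warehouse="<warehouse>", database="<database>", schema="<schema>",
    )
    conn.cursor().execute(
        "COPY INTO ORDERS FROM @s3_raw_stage/orders/ "
        "FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)"
    )
    conn.close()


with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
    load = PythonOperator(task_id="load_to_snowflake", python_callable=load_to_snowflake)
    extract >> load
```
Keeping extraction and loading as separate tasks lets Airflow retry the warehouse load independently of the upstream file transfer.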