
Data Engineer Azure

Location: Atlanta, GA
Posted: February 27, 2025

BENJIMIN KATTA

DATA ENGINEER

Objective: Motivated and results-oriented Data Engineer with 5+ years of experience in building and optimizing data pipelines, ETL processes, and data infrastructure. Seeking to leverage strong expertise in big data technologies, cloud platforms, and data modeling to contribute to data-driven initiatives. Eager to collaborate with cross-functional teams to deliver innovative solutions, improve data processes, and support informed decision-making across the organization.

Professional Summary:

Highly skilled Data Engineer with 5+ years of experience in designing, building, and optimizing end-to-end data pipelines, ETL workflows, and scalable data architectures. Proficient in both batch and real-time data processing using Spark, MapReduce, and PySpark, with a strong focus on improving system efficiency and reducing processing time.

Expertise in leveraging the Hadoop ecosystem, including Apache Spark, Hive, HBase, and Kafka, to design high-performance, scalable data pipelines and analytics solutions. Experienced in cloud-based data engineering with AWS, Azure, and GCP, ensuring seamless data flow across diverse platforms while maintaining high data integrity and security standards.

Adept at working with a variety of relational and NoSQL databases (SQL Server, PostgreSQL, MongoDB, Cassandra) and implementing data integration, migration, and quality strategies.

Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, and other services in the AWS family.

Implemented Azure Data Factory (ADF) extensively for ingesting data from different source systems, both relational and unstructured, to meet business functional requirements.

Created numerous pipelines in Azure Data Factory v2 to pull data from disparate source systems using activities such as Move & Transform, Copy, Filter, ForEach, and Databricks. Maintained and supported optimized pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks, automating data workflows and integrating them with CI/CD pipelines to streamline data operations and reduce time to insight.
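
For illustration only, the following is a minimal PySpark sketch of the kind of Databricks transformation step such an ADF pipeline might invoke; the ADLS paths, column names, and schema are hypothetical placeholders, not actual project code.

    # Minimal PySpark sketch of a Databricks transformation step (hypothetical
    # ADLS paths, columns, and schema; illustrative only).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("adf_databricks_transform").getOrCreate()

    # Read raw CSV files landed by an ADF Copy activity (placeholder container/path).
    raw = (spark.read
           .option("header", "true")
           .csv("abfss://raw@examplelake.dfs.core.windows.net/transactions/"))

    # Basic cleansing: cast amounts, drop duplicate transactions, stamp the load date.
    clean = (raw
             .withColumn("amount", F.col("amount").cast("double"))
             .dropDuplicates(["transaction_id"])
             .withColumn("load_date", F.current_date()))

    # Write the curated output as Delta for downstream analytics (placeholder path).
    (clean.write
     .format("delta")
     .mode("overwrite")
     .save("abfss://curated@examplelake.dfs.core.windows.net/transactions/"))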

Used Azure DevOps and Jenkins pipelines to build and deploy resources (code and infrastructure) in Azure.

Experienced in managing Azure Data Lake Storage (ADLS) and Data Lake Analytics, with a solid understanding of how to integrate them with other Azure services.

Specializing in Google Cloud Platform (GCP) technologies, including BigQuery, Dataflow, Pub/Sub, and Cloud Storage, with a proven ability to build scalable and cost-effective data solutions.

Implementing scalable data pipelines and ETL processes using Python, delivering high-performance data solutions.

Experience leveraging Snowflake for cloud data warehousing and implementing efficient data processing workflows using MapReduce to optimize performance and scalability in big data environments.

Expertise in version control using Git and automating build processes with Maven, ensuring seamless integration and deployment in data-driven projects. Designing serverless data processing workflows using AWS Lambda, enabling scalable and cost-efficient solutions for real-time data integration and analytics.
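
As a hedged illustration of the serverless pattern described above, the sketch below shows an S3-triggered AWS Lambda handler that parses newly landed records and forwards them to SQS; the bucket, queue URL, and record layout are hypothetical.

    # Minimal S3-triggered Lambda sketch (hypothetical bucket, queue URL, and
    # record layout; illustrative only).
    import json
    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-events"  # placeholder

    def lambda_handler(event, context):
        # Each record in the event describes an object that was written to S3.
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            # Fetch the newly landed JSON-lines object.
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

            # Forward each non-empty row to SQS for downstream processing.
            for line in body.splitlines():
                if line.strip():
                    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=line)

        return {"status": "ok"}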

A highly skilled and results-oriented DevOps Engineer with extensive experience in containerization, orchestration, and cloud-native technologies, specializing in Docker and Kubernetes.

Expertise in automating workflows, implementing continuous integration/continuous deployment (CI/CD) pipelines, and optimizing cloud infrastructure using Docker containers and Kubernetes clusters.

Technical Skills:

Programming Languages: Python, SQL, Scala

Relational Databases: MySQL, PostgreSQL

NoSQL Databases: MongoDB, Cassandra, HBase

Cloud Data Warehouses: Amazon Redshift, Google BigQuery, Snowflake

Big Data Storage: HDFS, Azure Data Lake

Hadoop Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop

Data Querying: Impala, Hue

Distributed Computing: Apache Spark (PySpark), Apache Flink

Orchestration Tools: Apache Airflow

AWS: S3, Redshift, EMR

Google Cloud: BigQuery, Dataflow

Microsoft Azure: Azure Synapse Analytics, Azure Databricks, Azure Data Lake

ETL Tools: Apache NiFi, Airflow, Talend, Informatica

Data Integration: Sqoop, Kafka, Flume

Version Control & Build Tools: Git, Maven

CI/CD: Jenkins, GitLab CI, CircleCI

Data Modeling: Star Schema, Snowflake Schema, OLAP

Containerization & Orchestration: Docker, Kubernetes

WORK EXPERIENCE:

TSYS | Azure Data Engineer | Georgia, USA | Nov 2023 – Present

Objective: Total System Services, Inc. (TSYS) is an American financial technology company. I design, develop, and optimize data models, schemas, and end-to-end data pipelines, ensuring seamless data integration from diverse sources into data warehouses and lakes while improving data processing performance for scalability and efficiency.

Responsibilities:

Design, develop, and maintain data pipelines to integrate data from various sources (transactional data, customer data, external systems).

Ensure seamless extraction, transformation, and loading (ETL) of data into data warehouses, data lakes, or other analytical platforms.

Design and implement data models (both relational and NoSQL) to support analytics, reporting, and business intelligence.

Optimize data storage and retrieval strategies for better performance and cost-effectiveness.

Work with big data technologies such as Hadoop, Spark, and Kafka, and with cloud platforms (Azure), to handle large-scale datasets; a brief streaming sketch follows this list of responsibilities.

Design and implement Docker containers for data processing pipelines, ensuring consistency across development, testing, and production environments.

Deploy and manage data processing workloads on Kubernetes clusters, ensuring seamless scaling, high availability, and fault tolerance of data applications.

Use Kubernetes to automate the orchestration of containerized data workflows, enabling efficient resource management and cost optimization.

Assist in the design of analytical queries, views, and aggregations in PostgreSQL and MongoDB, and support real-time analytics in HBase and Cassandra.

Enforce data governance policies for consistent, secure, and compliant data management in accordance with financial regulations (e.g., GDPR, PCI-DSS).

Implement CI/CD pipelines for automating data integration processes and system deployments.

Leverage DevOps tools for automated testing and deployment of data infrastructure and code.

Develop CI/CD pipelines for data workflows using Azure DevOps or GitHub Actions, and automate testing and deployment of data solutions using ARM templates and Bicep.

Implement big data processing workflows using Azure HDInsight with Hadoop for handling large-scale manufacturing data sets, including batch and real-time processing.

Develop and manage data pipelines using Hive on Azure HDInsight for querying and processing large datasets from manufacturing systems, and integrate Snowflake with Azure Data Factory to load and transform data from various manufacturing applications for advanced analytics and reporting.

Ensure data security and compliance with financial industry standards, such as PCI-DSS, SOX, and others.

Implement role-based access control (RBAC) and encryption protocols to protect sensitive financial data.

Support the development of financial reporting systems by making data available for business intelligence tools (e.g., Tableau, Power BI).
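
The sketch below illustrates, under stated assumptions, the Kafka-to-Spark streaming pattern referenced above; the broker address, topic, schema, and lake paths are hypothetical placeholders rather than TSYS systems.

    # Spark Structured Streaming sketch for Kafka ingestion (hypothetical broker,
    # topic, schema, and storage paths; illustrative only).
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka_stream_ingest").getOrCreate()

    schema = StructType([
        StructField("transaction_id", StringType()),
        StructField("card_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Subscribe to a Kafka topic of transaction events (placeholder broker and topic).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "transactions")
              .load())

    # Kafka delivers the payload as bytes; parse the JSON value into typed columns.
    parsed = (events
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # Continuously append the parsed stream to a Delta table in the lake.
    query = (parsed.writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/checkpoints/transactions")
             .start("/mnt/curated/transactions"))
    query.awaitTermination()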

ENVIRONMENT: ETL, Tableau, Power BI, RBAC, PCI-DSS, SOX, GDPR, DevOps, CI/CD pipelines, Azure Data Lake, Hadoop, Spark, Kafka, PostgreSQL, MongoDB, HBase, Azure, NoSQL, Hive on Azure HDInsight, Snowflake, Azure Data Factory, Docker, Kubernetes.

ROCHE | AWS Data Engineer | Georgia, USA | Dec 2022 – Nov 2023

Objective: Roche Holding AG, commonly known as Roche, is a multinational healthcare company. I worked with AWS tools such as AWS Glue, Redshift, S3, Lambda, and Kinesis to automate ETL (Extract, Transform, Load) processes and ensure seamless data integration.

Responsibilities:

Design, build, and maintain efficient and scalable ETL (Extract, Transform, Load) data pipelines using AWS services such as AWS Glue, Amazon Kinesis, and AWS Lambda; a brief orchestration sketch follows this list of responsibilities.

Implement and optimize cloud data storage solutions using Amazon S3, Amazon Redshift, and Amazon RDS to handle vast amounts of structured and unstructured data.

Design, implement, and maintain ETL (Extract, Transform, Load) pipelines to move data between PostgreSQL, MongoDB, Cassandra, HBase, and other data sources.

Implement fine-grained access control using AWS services like AWS IAM and AWS Lake Formation to manage permissions and ensure data security.

Leverage Apache Hadoop and Apache Spark on Amazon EMR for distributed data processing of large-scale datasets from flight data and customer activity logs.

Utilize Amazon Kinesis and AWS Lambda to process streaming data from flight status, baggage tracking, and customer requests in real time.

Optimize AWS infrastructure and resources for cost efficiency, including using Amazon S3 Lifecycle Policies, Redshift Spectrum, and Auto-scaling.

Design and manage cloud infrastructure for data engineering processes using AWS CloudFormation and Terraform, and work with DevOps teams to ensure continuous integration and deployment (CI/CD) of data pipeline code using Jenkins, AWS CodePipeline, and AWS CodeBuild.

Design RESTful APIs using Amazon API Gateway to expose data services for internal and external consumption, and create interactive dashboards in Amazon QuickSight for operational insights such as on-time performance and baggage handling.
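
The following is a minimal sketch of how such Glue and Redshift steps might be orchestrated from Python with boto3; the job name, cluster, IAM role, bucket, and table names are hypothetical placeholders, not Roche resources.

    # Sketch of triggering a Glue ETL job and a Redshift COPY from Python with
    # boto3 (hypothetical job, cluster, role, bucket, and table names).
    import boto3

    glue = boto3.client("glue")
    redshift_data = boto3.client("redshift-data")

    # Kick off the Glue job that curates raw files in S3 (placeholder job name).
    run = glue.start_job_run(JobName="example-curate-operational-data")
    print("Started Glue job run:", run["JobRunId"])

    # Once the curated files land, load them into Redshift through the Redshift
    # Data API. In practice the load would wait for the Glue run to succeed.
    copy_sql = """
        COPY analytics.operational_events
        FROM 's3://example-curated-bucket/operational_events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
        FORMAT AS PARQUET;
    """
    resp = redshift_data.execute_statement(
        ClusterIdentifier="example-cluster",
        Database="analytics",
        DbUser="etl_user",
        Sql=copy_sql,
    )
    print("Submitted Redshift statement:", resp["Id"])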

ENVIRONMENT: Amazon S3, Amazon Redshift, Amazon RDS, AWS IAM, AWS Lake Formation, AWS Glue, Amazon Kinesis, AWS Lambda, Amazon S3 Lifecycle Policies, PostgreSQL, MongoDB, Cassandra, HBase, Redshift Spectrum, Auto Scaling, Amazon API Gateway, Amazon QuickSight.

TRENT | GCP Data Engineer | Mumbai, India | May 2020 – July 2022

Objective: Trent Limited is an Indian retail company. Developed and maintained end-to-end data pipelines that collect, process, and transform retail data from various sources (e.g., POS systems, inventory, customer interactions). Worked with GCP tools such as BigQuery, Cloud Dataflow, Cloud Dataproc, Pub/Sub, and others to integrate and manage data across diverse sources.

Responsibilities:

Design and optimize table structures, indexes, and relationships in PostgreSQL for transactional and analytical workloads. Create flexible, schema-less collections and document structures in MongoDB to handle varied data types.

Implement partitioning and clustering strategies in Cassandra and HBase to support scalability and performance. Utilize GCP tools like BigQuery, Dataflow, Pub/Sub, Dataproc, and Cloud Storage for creating robust data workflows.

Integrate data from various sources, including on-premise databases, APIs, and third-party systems into GCP.

Develop and manage ETL (Extract, Transform, Load) processes to transform raw data into actionable insights.

Build and manage cloud-native data architectures on GCP, ensuring scalability, performance, and security.

Implement best practices for data storage, management, and optimization using GCP tools (BigQuery, Cloud Storage, etc.); a brief load sketch follows this list of responsibilities. Maintain clear documentation for data architectures, processes, and pipelines.

Build and optimize data pipelines using Docker containers for data extraction, transformation, and loading (ETL) processes.

Leverage Kubernetes to orchestrate and monitor the execution of complex data workflows across distributed environments. Troubleshoot and resolve issues related to data pipelines, data integration, and cloud infrastructure.

Stay current with new GCP tools, features, and best practices to drive innovation within the organization.
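
As a hedged example of the Cloud Storage-to-BigQuery loading mentioned above, the sketch below loads Parquet files into a date-partitioned table; the project, dataset, bucket, and column names are hypothetical.

    # Sketch of loading curated Parquet from Cloud Storage into a date-partitioned
    # BigQuery table (hypothetical project, dataset, bucket, and column names).
    from google.cloud import bigquery

    client = bigquery.Client(project="example-retail-project")
    table_id = "example-retail-project.sales.pos_transactions"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        time_partitioning=bigquery.TimePartitioning(field="transaction_date"),
    )

    # Load every Parquet file under the curated prefix (placeholder bucket).
    load_job = client.load_table_from_uri(
        "gs://example-curated-bucket/pos/*.parquet",
        table_id,
        job_config=job_config,
    )
    load_job.result()  # Block until the load job finishes.

    table = client.get_table(table_id)
    print(f"Loaded table now has {table.num_rows} rows")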

ENVIRONMENT: GCP, BigQuery, Dataflow, HBase, Cassandra, MongoDB, PostgreSQL, Pub/Sub, Dataproc, Cloud Storage, data pipelines, data integration, cloud infrastructure, APIs.

STANDARD CHARTERED | Data Engineer | Mumbai, India | May 2019 – May 2020

Objective: Standard Chartered PLC is a British multinational bank with operations in wealth management, corporate and investment banking, and treasury services. Contributed to traditional data management systems as well as modern ones like Hadoop, Spark, Kafka, etc.

Responsibilities:

Create robust and scalable data pipelines to ingest, process, and transform large volumes of data from multiple sources (e.g., POS systems, CRM systems, e-commerce platforms) into central repositories like data lakes or data warehouses.

Develop and maintain ETL (Extract, Transform, Load) processes to move data from disparate systems into a central data platform for analytics and reporting.

Design data models that support efficient data processing, including relational and NoSQL database structures, to meet business requirements. Build and maintain cloud-based data warehouses (e.g., Snowflake, Amazon Redshift, Google BigQuery) to store large datasets for reporting and analytics.

Leverage cloud platforms like AWS, and Google Cloud to create scalable, high-performance data systems for storing, processing, and analysing retail data.

Utilize big data technologies like Apache Hadoop, Apache Spark, Apache Kafka, and Apache Flink for managing large datasets, stream processing, and real-time data analytics. Manage and optimize cloud-based data lakes (e.g., AWS S3) to ensure high performance and data availability for analytical processing.

Select the appropriate database system based on use case, such as PostgreSQL for relational data, MongoDB for document-based storage, Cassandra for high availability and distributed data, and HBase for large-scale, real-time analytics.

Ensure compliance with data privacy regulations and industry standards (e.g., GDPR, PCI-DSS) by implementing proper data security measures, such as encryption, access control, and masking for sensitive data.

Implement continuous integration and continuous deployment (CI/CD) practices using Jenkins, GitLab, and other DevOps tools to automate the deployment and testing of data pipelines and infrastructure.

Utilize tools like Terraform or CloudFormation to manage and provision cloud-based infrastructure for data solutions.

Support the creation and maintenance of business intelligence dashboards and reports using tools like Tableau, Power BI, and Looker by providing clean, transformed data ready for analysis.

Enable real-time analytics and reporting for areas like inventory management, sales performance, and customer behaviour using real-time data processing frameworks like Apache Kafka and Apache Flink.
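
A minimal sketch of the ingestion side of this real-time pattern, assuming a hypothetical broker, topic, and event schema (not Standard Chartered systems), using the kafka-python client:

    # kafka-python producer sketch for streaming POS events (hypothetical broker,
    # topic, and event fields; illustrative only).
    import json
    import time
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["broker:9092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def publish_sale(store_id: str, sku: str, quantity: int, price: float) -> None:
        """Send one point-of-sale event to a 'pos-sales' topic."""
        event = {
            "store_id": store_id,
            "sku": sku,
            "quantity": quantity,
            "price": price,
            "event_time": time.time(),
        }
        producer.send("pos-sales", value=event)

    publish_sale("store-001", "SKU-12345", 2, 19.99)
    producer.flush()  # Ensure buffered events are delivered before exiting.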

ENVIRONMENT: Apache Kafka, Apache Flink, Tableau, Cassandra, MongoDB, PostgreSQL, HBase, Power BI, Looker, Terraform, CloudFormation, Jenkins, GitLab, CI/CD, Apache Hadoop, Apache Spark, AWS S3, AWS, Google Cloud, Snowflake, Amazon Redshift, Google BigQuery, ETL, NoSQL.

EDUCATION:

Auburn University at Montgomery: Master's in Information Systems, 2022 – 2024.


