
Senior Kafka Platform Engineer with Azure & Observability Expertise

Location:
Lake Forest, CA
Posted:
March 20, 2026


Resume:

Rahul P

361-***-****

***********@*******.***

Professional Summary:

•Over 12 years of experience in the IT industry.

•Experienced in building and operating enterprise Kafka platforms using Confluent Kafka on Azure and Linux environments, including end-to-end ownership of cluster deployment, scaling, and lifecycle management.

•Strong expertise in designing highly available, multi-region distributed streaming architectures across Azure Availability Zones, implementing replication and disaster recovery strategies for mission-critical workloads.

•Hands-on experience with the Confluent ecosystem (Schema Registry, Kafka Connect, ksqlDB, Control Center) to enable reliable data ingestion, schema governance, and real-time data processing.

•Proven ability to optimize Kafka performance and capacity, tuning broker configurations, partition strategies, JVM settings, and infrastructure resources to meet high-throughput and low-latency requirements.

•Skilled in implementing secure messaging platforms, leveraging TLS/SSL, SASL, ACLs/RBAC, Azure Active Directory, and secrets management practices to enforce enterprise-grade security and compliance.

•Strong background in observability and monitoring, using tools like Azure Monitor, Prometheus, Grafana, Datadog, and Splunk to track system health, troubleshoot issues, and improve platform reliability.

•Extensive experience in messaging systems beyond Kafka, including RabbitMQ on Kubernetes/OpenShift, with a focus on high availability, automation, and secure messaging infrastructure.

•Proven track record in production support and incident management, leading root cause analysis, improving system stability, and collaborating across teams to deliver scalable, reliable, and performance-optimized platforms.

Certifications:

Splunk Core Certified User, certifications in cloud computing, Scrum certification, PCEP – Certified Entry-Level Python Programmer, HashiCorp Certified Terraform Associate, Confluent Fundamentals Accreditation, Certified Kubernetes Administrator (CKA)

Technical Skills:

•Integration software : RabbitMQ, Kafka

•Operating Systems : Windows, Linux

•Infrastructure as Code : Terraform

•Orchestration : Kubernetes

•Languages : Python, Shell, Java

•Automation Engines : Ansible, Jenkins

•Cloud computing : Amazon Web Services (AWS), Azure, GCP

•Project Management : Jira, ServiceNow, Kanban

•Version control : Bitbucket, Git

•Monitoring : Dynatrace, Splunk, Datadog

Experience

Sept 2023 – Present

Kafka Engineer – State Farm, Inc.

•Deployed and operated Confluent Kafka clusters on Azure Kubernetes Service (AKS), managing broker provisioning, scaling, partition rebalancing, and rolling upgrades to support production-grade streaming workloads.

•Designed highly available Kafka architectures across Azure Availability Zones and regions, implementing Cluster Linking for cross-region replication, failover, and disaster recovery.

•Administered the Confluent Platform ecosystem, including Schema Registry, Kafka Connect, REST Proxy, and Control Center, enabling reliable data ingestion, schema governance, and platform visibility.

•Integrated Kafka with Azure Event Hubs (Kafka API) and built ingestion pipelines to Azure Data Lake Storage (ADLS Gen2), supporting hybrid streaming and analytics use cases.

•Performed Kafka performance tuning by optimizing broker configurations, partition strategies, replication settings, and JVM parameters to meet latency, throughput, and SLA requirements.
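As an illustration of broker-level tuning of this kind, a server.properties fragment might adjust thread pools, fetch sizes, and segment sizing (the values below are generic placeholders, not figures from this engagement; appropriate settings depend on measured workload):

```properties
# Illustrative values only -- tune against benchmarked throughput/latency.
num.network.threads=8
num.io.threads=16
num.replica.fetchers=4
replica.fetch.max.bytes=1048576
log.segment.bytes=1073741824
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
```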

•Managed Kafka storage using Azure Managed Disks, tuning disk throughput and IOPS to support high-ingest workloads while controlling infrastructure cost.

•Implemented secure Kafka deployments using Azure Active Directory (AAD), TLS/mTLS encryption, SASL authentication, RBAC/ACL policies, and Azure Key Vault for secrets management.

•Automated Kafka infrastructure provisioning and deployments using Terraform, Helm, and Azure DevOps pipelines, ensuring consistent environments across Dev, UAT, and Production.
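Helm-based provisioning driven from Terraform, as described above, can be sketched with the Terraform Helm provider (the release name, repository URL, chart, and values here are hypothetical placeholders):

```hcl
# Hypothetical release; repository, chart, and values are placeholders.
resource "helm_release" "kafka" {
  name       = "confluent-kafka"
  repository = "https://confluentinc.github.io/cp-helm-charts"
  chart      = "cp-helm-charts"
  namespace  = "kafka"

  set {
    name  = "cp-kafka.brokers"
    value = "3"
  }
}
```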

•Built observability and monitoring using Azure Monitor, Log Analytics, Prometheus, and Grafana, enabling visibility into consumer lag, broker health, ISR status, and cluster performance.
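Consumer lag, one of the key signals above, is simply the gap between each partition's log-end offset and the consumer group's committed offset; a minimal sketch with hypothetical offsets:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log-end offset minus the committed offset
    (partitions with no commit are treated as fully unread)."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

# Hypothetical offsets for a 3-partition topic.
end = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1480, 1: 1200, 2: 850}
print(consumer_lag(end, committed))  # {0: 20, 1: 0, 2: 50}
```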

•Troubleshot complex distributed system issues such as rebalance storms, consumer lag spikes, replication delays, and disk pressure, applying strong knowledge of partition leadership, ISR behavior, and exactly-once semantics.

August 2018 – August 2023

Kafka Admin – Capital Group Companies

•Administered and built enterprise Confluent Kafka clusters on Linux OS, installing and configuring brokers, Schema Registry, REST Proxy, Kafka Connect, ksqlDB, and Control Center to enable end-to-end streaming capabilities.

•Led capacity planning, benchmarking, and performance tuning across CPU, memory, disk, and network, while designing partitioning, replication, and rebalancing strategies to ensure high throughput and balanced cluster utilization.

•Implemented enterprise Kafka security using SSL/TLS encryption and fine-grained ACLs, enforcing least-privilege access for topics, producers, and consumer groups in compliance with organizational standards.

•Established centralized monitoring and observability for Kafka, RabbitMQ, and IBM MQ using Datadog and Confluent Control Center, enabling proactive alerting, faster troubleshooting, and improved platform reliability.

•Performed advanced operational activities including topic lifecycle management, partition reassignment, offset resets, schema deployment via Schema Registry, and deep-dive troubleshooting using strong knowledge of Kafka internals and ZooKeeper coordination.

RabbitMQ Admin:

•Supported the RabbitMQ messaging platform as a Solutions Reliability Engineer, ensuring high availability, performance, and secure messaging for application teams across DEV, QA, and PROD environments.

•Automated deployment and lifecycle management of RabbitMQ clusters, Messaging Topology Operator, and Kapp Controller using Git, Bitbucket, Jenkins, and Ansible Tower pipelines on OpenShift.

•Deployed and managed RabbitMQ (Tanzu) on OpenShift/Kubernetes, implementing a DevOps delivery model for consistent, repeatable, and scalable platform provisioning.

•Implemented end-to-end TLS/SSL encryption for RabbitMQ, including certificate provisioning, secure listener enablement, and client authentication for encrypted message flow.
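A rabbitmq.conf fragment for TLS of this kind might look as follows (the certificate paths are hypothetical; certificates themselves are provisioned separately):

```ini
# Illustrative paths; certificates are provisioned and rotated separately.
listeners.ssl.default = 5671
ssl_options.cacertfile = /etc/rabbitmq/tls/ca.pem
ssl_options.certfile   = /etc/rabbitmq/tls/server.pem
ssl_options.keyfile    = /etc/rabbitmq/tls/server-key.pem
ssl_options.verify     = verify_peer
ssl_options.fail_if_no_peer_cert = true
```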

•Managed TLS certificate lifecycle on OpenShift, integrating certificates with Kubernetes secrets, automating renewal/rotation, and performing zero-downtime updates to maintain compliance and platform security.

•Created and administered RabbitMQ objects including exchanges, queues, bindings, users, and vhosts, enforcing secure access controls and environment standardization.

•Led root-cause analysis of node outages and split-brain scenarios using OpenShift (OC) events and platform diagnostics, improving cluster resiliency and recovery time.

•Implemented chaos testing and synthetic health-check pods to validate high availability, failover readiness, and overall cluster resilience.

•Built comprehensive Datadog monitoring and alerting for nodes, memory, queue depth, connections, message rates, file/socket descriptors, and vhost metrics, with automated ServiceNow incident ticket generation.

•Enhanced observability by onboarding new clusters into Datadog via agent configuration updates and creating Splunk dashboards and correlation searches for error logs, audit events, quorum queue performance, and WAL checksum failures.

•Generated OSS diagnostic artifacts for vendor support and troubleshot Java/.NET application integration issues, reducing MTTR and improving platform adoption.

•Enforced secure communication and access governance by aligning RabbitMQ and OpenShift configurations with enterprise security standards, including secrets management, certificate policies, and audit readiness.

August 2013 – 2018

Application Support Engineer – Capital Group Companies

•Served as onsite production support lead, coordinating with offshore teams to manage daily support activities for retirement and investment plan applications, ensuring timely incident resolution and clear communication with business stakeholders.

•Performed daily system health checks and proactive monitoring using Dynatrace and Splunk to validate application availability and detect performance issues across distributed environments.

•Conducted end-to-end root cause analysis for major production incidents using Dynatrace Digital Experience and Splunk log analytics, identifying performance bottlenecks and working with engineering teams to implement permanent fixes.

•Developed and maintained operational dashboards, alerts, and reports in Splunk using advanced SPL queries, macros, lookup tables, and scheduled cron jobs to monitor application health and incident trends.
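An SPL search of the kind behind such dashboards (the index, sourcetype, and field names here are hypothetical) might chart server errors per host:

```
index=app_logs sourcetype=access_combined status>=500
| timechart span=15m count by host
```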

•Built and monitored Splunk ITSI glass tables and deep dive dashboards to track service health scores, detect anomalies, and provide proactive visibility into critical business services.

•Troubleshot Informatica PowerCenter workflows and AutoSys batch jobs, analyzing session logs and job dependencies to resolve failures and improve batch performance for retirement plan data processing.

•Wrote complex Oracle SQL queries involving joins, aggregations, and unions to support reporting and data analysis for retirement products including 401(k) and 403(b) plans.

•Collaborated with application, database, and infrastructure teams to optimize batch processing performance through query tuning, indexing strategies, and improved resource utilization.

•Supported enterprise platforms including Documentum (UNIX), SearchBlox indexing, Tomcat services, and Salesforce integrations, diagnosing middleware and application issues impacting business operations.

•Created knowledge base articles, operational runbooks, and performance dashboards (Tableau) while also supporting automation initiatives using Blue Prism RPA to streamline operational reporting and reduce manual support tasks.


