Senior Messaging Platform SRE and Kafka Expert

Location:
Prosper, TX
Posted:
March 19, 2026

Resume:

KESAVA RAO

Senior Messaging Platform SRE | Confluent Kafka | RabbitMQ | AWS | Kubernetes (EKS) | Terraform

Aubrey, TX, USA | +1-940-***-**** | *************@*****.*** | linkedin.com/in/kesava-rao-00350a71

PROFESSIONAL SUMMARY

Senior Messaging Platform SRE with 13+ years of experience owning the reliability, scalability, and operational excellence of enterprise messaging and event-streaming platforms. 6+ years of deep production expertise with Confluent Kafka and Apache Kafka, and 3+ years administering RabbitMQ in high-availability distributed environments. Proven SRE practitioner with hands-on experience defining and tracking SLOs and SLIs, leading incident response and post-incident reviews (PIRs), building observability pipelines, and systematically eliminating toil through automation. Strong AWS platform engineering background including MSK, EC2, EBS, ALB/NLB, and IAM. Hands-on Kubernetes and EKS experience deploying and operating Kafka platform workloads using Confluent Operator and Strimzi. Infrastructure-as-Code expertise in Terraform and Helm. Serves as single point of contact and primary escalation for all Kafka and RabbitMQ production issues. Experienced technical leader who mentors engineers and partners with architecture, security, and application teams to evolve the messaging platform roadmap.

KEY ACHIEVEMENTS

•99.99% platform availability sustained across Confluent Kafka and RabbitMQ clusters supporting mission-critical IoT, telemetry, and enterprise workloads.

•30% end-to-end latency reduction achieved through broker JVM heap tuning, G1GC policy optimization, log segment sizing, and thread pool configuration.

•35% faster MTTD and MTTR for streaming incidents via Prometheus, Grafana, Dynatrace, and Splunk with proactive alerting on consumer lag, under-replicated partitions, and broker JVM health.

•Zero-downtime rolling upgrades executed across multi-broker Kafka clusters with ISR validation, rack-aware partition rebalancing, and automated rollback safeguards.

•Days-to-minutes provisioning: eliminated manual toil through Ansible playbooks, Python scripts, and Terraform IaC for Kafka topic, ACL, and RabbitMQ queue lifecycle management.

•100% SLA adherence for P1 and P2 messaging incidents — led RCA, post-incident reviews, and implemented preventive controls to eliminate recurrence.

•Multi-DC disaster recovery replication implemented via MirrorMaker 2 for geo-redundancy and business continuity across real-time streaming pipelines.

•20% JVM performance improvement across IBM WebSphere and Kafka broker environments through heap, thread, and GC parameter tuning.

CORE COMPETENCIES

Kafka and Streaming Platforms: Confluent Kafka, Apache Kafka, Confluent Platform, Kafka Connect, Schema Registry, MirrorMaker 2, ksqlDB, KRaft, ZooKeeper (brokers, partitions, ISR, replication, rebalancing)

RabbitMQ: HA Clustering, Quorum Queues, Classic Mirrored Queues, Flow Control, Federation, Shovel, Dead-Letter Exchanges, Management API, Consumer Acknowledgement Tuning

SRE Practices: SLO/SLI/SLA Definition and Tracking, Error Budget Management, Incident Response, RCA, Post-Incident Reviews (PIRs), Runbook Authoring, Toil Elimination, Capacity Planning, On-Call Operations

AWS: MSK, EC2, EBS, ALB/NLB, IAM, VPC, Security Groups, NACLs, CloudWatch, S3, Route53

Kubernetes and EKS: EKS Cluster Operations, Helm Chart Deployment, Pod Scheduling, Resource Limits, ConfigMaps, Secrets, Horizontal Pod Autoscaling, Confluent Operator, Strimzi, Kafka on Kubernetes

IaC and Automation: Terraform, Helm, Ansible, Bash, Python, PowerShell, CI/CD with Jenkins and GitLab

Observability and Monitoring: Prometheus, Grafana, Splunk, Dynatrace, ELK Stack, Confluent Control Center, AWS CloudWatch — dashboards, alerting, log aggregation

Security: TLS/mTLS, SASL/SCRAM, Kafka ACLs, RBAC, PKI Certificate Lifecycle, Secrets Management, CIS Benchmarks, OS Hardening

Middleware Platforms: IBM WebSphere Application Server (ND/Standalone), IBM MQ, IBM DataPower, Apache Tomcat, Apache Kafka, Confluent Kafka, RabbitMQ

Linux and OS Administration: RHEL 7/8/9, CentOS, Ubuntu 18/20/22 — TCP/kernel tuning, NIC bonding, LVM/XFS, NVMe storage optimization, OS hardening

PROFESSIONAL EXPERIENCE

Senior Messaging Platform SRE | Kafka and RabbitMQ Platform Lead

DAF Connected Trucks (PACCAR) and Stellantis, USA | July 2019 – February 2026

Owned end-to-end reliability, scalability, and operational excellence of Confluent Kafka and RabbitMQ platforms supporting high-volume IoT, telemetry, and connected-vehicle event streaming across North America and Europe. Served as single point of contact and primary escalation for all Kafka and RabbitMQ production issues.

SRE Practices: SLOs, Incident Response, and Reliability

•Defined and tracked SLOs, SLIs, and operational KPIs for Kafka and RabbitMQ platforms — established error budgets and reported reliability metrics to engineering leadership on a monthly basis.

•Led incident response, root cause analysis (RCA), and post-incident reviews (PIRs) for all Kafka and RabbitMQ production outages — implemented preventive controls to eliminate recurrence and reduce error budget burn.

•Acted as single point of contact and primary escalation for Kafka and RabbitMQ production issues — coordinated cross-functional resolution with application, network, and security teams under SLA pressure.

•Standardized operational runbooks for cluster lifecycle management, broker and node failure recovery, partition rebalancing, quorum queue failover, and disaster recovery procedures.

•Mentored junior SREs and platform engineers on Kafka internals, RabbitMQ operations, observability best practices, and SRE methodologies — drove platform-wide SRE culture adoption.

Confluent Kafka Platform Operations and Reliability

•Designed, deployed, and operated multi-broker Confluent Kafka clusters with 3 to 9 brokers — capacity planning, partition strategy, ISR configuration, rack-aware replication, and broker tuning for production IoT and telemetry workloads processing millions of events per day.

•Executed zero-downtime rolling broker upgrades and cluster expansions — validated ISR health, performed controlled partition rebalancing using kafka-reassign-partitions, and enforced rack-aware placement to maintain durability during maintenance windows.

•Configured and maintained MirrorMaker 2 for multi-data center replication and geo-redundancy — ensured RPO targets and business continuity for real-time streaming pipelines.

•Administered Kafka Connect, Schema Registry, and ksqlDB — managed connector lifecycle, schema evolution, and compatibility policies for downstream consumers.

•Tuned broker JVM heap, G1GC policies, log retention, segment sizes, and thread pools — reduced end-to-end latency by 30% and improved sustained throughput for IoT ingestion pipelines.

•Managed Kafka security end-to-end: mutual TLS (mTLS) authentication, SASL/SCRAM broker-client authentication, ACL-based authorization, and RBAC via Confluent Platform, protecting sensitive IoT and enterprise data streams.

•Integrated Kafka workloads with AWS services: MSK for managed Kafka, EC2 and EBS for self-managed brokers, ALB/NLB for client connectivity, IAM role-based access, VPC security groups, and CloudWatch for metrics.

RabbitMQ HA Operations and Platform Reliability

•Deployed and administered RabbitMQ clusters in high-availability configurations — managed quorum queues and classic mirrored queues, configured flow control policies, and tuned prefetch and acknowledgement settings for throughput and reliability.

•Operated RabbitMQ clustering with multiple nodes across availability zones — managed node joins and removals, network partition handling with pause-minority mode, and cluster recovery procedures.

•Implemented and maintained RabbitMQ Federation and Shovel for cross-environment message routing and disaster recovery — ensuring message durability and delivery guarantees across distributed environments.

•Configured dead-letter exchanges (DLX), message TTLs, and queue length limits — implemented poison message handling and consumer retry strategies to prevent queue build-up and broker overload.

•Led RabbitMQ incident response for P1/P2 events including consumer stalls, memory alarms, disk alarms, network partition events, and node crashes — resolved issues under SLA with documented RCA and PIRs.

•Automated RabbitMQ queue, exchange, binding, and policy management using the RabbitMQ Management API, Python scripts, and Ansible — eliminated manual configuration drift across Dev, Staging, and Production environments.

Kubernetes (EKS) and Platform Infrastructure

•Operated Kafka platform workloads on AWS EKS — managed Confluent Operator and Strimzi deployments, Helm chart lifecycle, resource limits, pod scheduling, ConfigMaps/Secrets, and horizontal pod autoscaling for broker and connector pods.

•Managed EKS cluster operations including node group scaling, IAM OIDC integration, Kubernetes RBAC, network policies, and persistent volume management for stateful Kafka broker workloads.

•Deployed and maintained supporting microservices on EKS — collaborated with application teams on Kafka consumer and producer deployment patterns, resource sizing, and health probe configuration.

Infrastructure as Code: Terraform and Helm

•Authored Terraform modules for AWS infrastructure provisioning: MSK clusters, EC2 broker nodes, EBS volumes, ALB/NLB, IAM roles and policies, VPC security groups, and CloudWatch alarms — enabling repeatable and auditable deployments.

•Built and maintained Helm charts for Kafka, RabbitMQ, Kafka Connect, and Schema Registry deployments on Kubernetes — parameterized values for environment-specific configurations across Dev, Staging, and Production.

•Integrated IaC workflows into CI/CD pipelines using Jenkins and GitLab for consistent, peer-reviewed infrastructure changes with automated plan validation and drift detection.

Observability, Monitoring, and Alerting

•Built comprehensive Kafka observability stack using Prometheus JMX exporter and Grafana dashboards for consumer lag, under-replicated partitions (URP), broker JVM heap, network throughput, and request latency — reduced MTTD by 35%.

•Configured Splunk and ELK Stack for centralized Kafka and RabbitMQ log aggregation, correlation, and alerting — enabled real-time detection of broker errors, consumer rebalancing storms, and security anomalies.

•Integrated Dynatrace APM for end-to-end distributed tracing across Kafka producer and consumer applications — correlated infrastructure metrics with application-level event streaming performance.

•Established alerting runbooks for key health indicators: ISR shrinkage, consumer lag thresholds, broker offline events, RabbitMQ memory and disk alarms, and queue depth breaches.

Automation and Toil Elimination

•Automated Kafka topic provisioning, ACL management, consumer group resets, and broker configuration changes using Ansible playbooks and Python scripts — reduced provisioning time from days to minutes and eliminated configuration drift.

•Developed automated recovery scripts for common failure scenarios including broker restarts, log corruption recovery, consumer group rebalancing, and RabbitMQ node re-joins — minimized MTTR across on-call incidents.

•Built reusable Ansible roles for Kafka cluster bootstrapping in ZooKeeper and KRaft modes, Kafka Connect connector deployment, and RabbitMQ policy and queue configuration.

Stakeholder Partnership and Platform Roadmap

•Partnered with architecture, security, and application development teams on event-streaming design patterns, topic and queue naming conventions, partition key strategies, schema evolution, and consumer group best practices.

•Collaborated with security teams on Kafka and RabbitMQ compliance: TLS certificate lifecycle management, SASL/SCRAM credential rotation, secrets management, and audit log reviews.

•Influenced platform-wide SRE best practices — introduced error budget reviews, production readiness checklists, and incident severity classification for messaging platform changes.

Senior Middleware and Messaging Engineer

CNA Insurance and Henkel, USA | July 2017 – June 2019

•Administered enterprise messaging platforms including Apache Kafka, RabbitMQ, IBM MQ, and IBM WebSphere in production environments supporting financial services and manufacturing workloads with 99.9%+ platform availability.

•Operated RabbitMQ clusters with classic mirrored queues and quorum queues — managed consumer flow control, dead-letter exchange policies, and node failure recovery in HA environments.

•Tuned Kafka broker JVM, replication factors, and consumer group configurations to improve throughput and stability of real-time data streaming pipelines.

•Implemented TLS/SSL, SASL authentication, and ACL-based authorization across Kafka and RabbitMQ platforms — enforced secure messaging practices in regulated financial environments.

•Automated deployment and configuration workflows using Bash and Ansible — reduced manual deployment errors and provisioning time across messaging infrastructure.

•Led P1/P2 incident response and root cause analysis for messaging platform outages — implemented preventive controls and maintained 99.9%+ availability for business-critical systems.

Middleware and Messaging Administrator

Marriott International, USA | November 2013 – June 2016

•Supported distributed messaging and middleware environments across Dev, QA, and Production tiers — performed application deployments, patching, upgrades, and performance tuning for availability and stability.

•Administered IBM WebSphere Application Server (ND and Standalone) across multiple environments — managed application deployments, JVM heap tuning, thread pool configuration, DataSource management, and fix pack upgrades for hospitality-facing enterprise applications.

•Configured and maintained IBM WebSphere Network Deployment (ND) clusters — managed Deployment Manager, Node Agents, and cluster members; performed node federation, cluster scaling, and session replication to ensure high availability for customer-facing workloads.

•Administered IBM MQ queue managers across multiple environments — created and managed queues, channels, listeners, and transmission queues; configured dead-letter queues (DLQ) and trigger monitors for reliable message delivery across distributed applications.

•Monitored IBM MQ health using MQ Explorer and custom scripts — tracked queue depths, channel status, and dead-letter queue activity; resolved stuck messages, channel disconnections, and queue manager connectivity issues to maintain 99.9%+ messaging availability.

•Performed IBM MQ and WebSphere fix pack upgrades and security patching — coordinated maintenance windows, validated post-upgrade application connectivity, and documented rollback procedures.

•Monitored platform health using Dynatrace and Splunk — maintained 99.9%+ uptime for customer-facing platforms through proactive alerting and preventive maintenance.

•Collaborated with development and operations teams on release management, environment provisioning, and incident escalation workflows.

Subject Matter Expert, Middleware and Messaging

Max New York Life Insurance, USA | March 2012 – October 2013

•Managed day-to-day administration of IBM WebSphere Application Server, IBM MQ, IBM DataPower, and Apache Tomcat across Dev, QA, and Production environments — ensuring platform stability and availability for enterprise financial services applications.

•Performed application deployments, patches, upgrades, and middleware configuration changes using a zero-downtime approach across all environment tiers.

•Administered IBM MQ queue managers and channels — managed message flows, channel authentication, dead-letter queue handling, and trigger-based processing to ensure reliable point-to-point and publish/subscribe messaging.

•Managed IBM DataPower gateway administration — handled service configuration, SSL/TLS policy enforcement, and API proxy management for secure integration between internal and external systems.

•Managed SSL certificate and PKI lifecycle including generation, renewal, and integration with IBM WebSphere and DataPower — achieved 100% on-time renewals with zero certificate-related outages.

•Tuned JVM heap parameters, thread pool settings, and GC configurations across IBM WebSphere environments — achieved approximately 20% performance improvement in application response times.

•Led P1/P2 major incident response and root cause analysis — implemented preventive actions and maintained 95%+ SLA resolution rate across all production environments.

•Monitored system health using Splunk and Dynatrace for proactive detection of errors and performance bottlenecks — resolved 200+ production issues and reduced recurring incidents by 30% through automation and RCA improvements.

•Supported disaster recovery drills, backup strategy validation, and security compliance activities across middleware platforms.

TECHNICAL SKILLS

Kafka and Streaming: Confluent Kafka, Apache Kafka, Confluent Platform, Kafka Connect, Schema Registry, MirrorMaker 2, ksqlDB, KRaft, ZooKeeper

RabbitMQ: HA Clustering, Quorum Queues, Classic Mirrored Queues, Flow Control, Federation, Shovel, Dead-Letter Exchange, Management API

Middleware Platforms: IBM WebSphere AS (ND/Standalone), IBM MQ, IBM DataPower, Apache Tomcat, RabbitMQ

AWS: MSK, EC2, EBS, ALB/NLB, IAM, VPC, Security Groups, CloudWatch, S3, Route53

Kubernetes and EKS: EKS, Helm, Confluent Operator, Strimzi, Pod Scheduling, HPA, RBAC, PersistentVolumes

IaC and Automation: Terraform, Helm, Ansible, Bash, Python, PowerShell, CI/CD with Jenkins and GitLab

Observability: Prometheus, Grafana, Splunk, ELK Stack, Dynatrace, Confluent Control Center, AWS CloudWatch

Security: TLS/mTLS, SASL/SCRAM, Kafka ACLs, RBAC, PKI, Secrets Management, CIS Hardening

Linux and OS: RHEL 7/8/9, CentOS, Ubuntu 18/20/22 — TCP/kernel tuning, LVM/XFS, NVMe storage, OS hardening

Cloud: AWS (primary), Microsoft Azure

CERTIFICATIONS

•AWS Certified Solutions Architect — Associate

•Microsoft Certified: Azure Administrator Associate (AZ-104)

•Microsoft Certified: Azure Fundamentals (AZ-900)

•Dynatrace Associate Certified — Performance and Monitoring

•IBM WebSphere Application Server Certified Administrator

•ITIL Foundation — IT Service Management

•ServiceNow Certified System Administrator (CSA)

EDUCATION

Master of Science — Systems Management

Bachelor of Science — Computer Technology


