Sushant Chaudhary
469-***-**** ********************@*****.*** https://www.linkedin.com/in/sushant-chaudhary Euless, TX
Senior Platform Engineer
AWS · Kubernetes · Terraform · ArgoCD · Python · Istio
FinTech, Healthcare & SaaS Infrastructure
PROFESSIONAL SUMMARY
Senior Platform Engineer with 7+ years of progressive experience building and operating cloud-native infrastructure across financial services, healthcare, e-commerce, and enterprise SaaS domains. Proven ability to own full-lifecycle Kubernetes platform delivery, from cluster provisioning and GitOps rollout through SLO-based reliability operations and FinOps governance.
• 7+ years of hands-on DevOps and platform engineering experience spanning AWS, Azure, and GCP environments across regulated and high-scale industries
• Deep expertise in Amazon EKS, Terraform, ArgoCD, and HashiCorp Vault with production deployments supporting multi-account AWS organizations and NIST 800-53 / SOC 2 compliance frameworks
• Designed and operated GitOps delivery pipelines using ArgoCD App of Apps pattern, Argo Rollouts canary analysis, and Prometheus-gated progressive delivery across dev, staging, and production
• Built and enforced platform security posture using OPA Gatekeeper ConstraintTemplates, Falco runtime threat detection, Trivy image scanning, and Cosign image signing integrated into CI/CD gates
• Hands-on experience with Istio service mesh, Cilium eBPF networking, cert-manager, and ExternalDNS for zero-trust networking and mTLS enforcement across multi-cluster Kubernetes environments
• Instrumented distributed systems with OpenTelemetry SDKs, Prometheus recording rules, Thanos long-term metrics storage, and Grafana SLO dashboards with PagerDuty escalation routing
• Authored Python boto3 and Go-based automation tooling for platform bootstrapping, IRSA provisioning, Vault AppRole management, and cross-account AWS resource lifecycle operations
• Linux systems engineering background including CIS benchmark hardening, auditd and SELinux configuration, sysctl tuning, Packer AMI hardening, and Fluent Bit log pipeline management
TECHNICAL SKILLS
Cloud Platforms: Amazon EKS, AWS IAM, AWS KMS, AWS S3, AWS RDS, AWS Aurora, AWS DynamoDB, AWS SQS, AWS SNS, AWS EventBridge, AWS CloudTrail, AWS GuardDuty, AWS Secrets Manager, AWS Config, AWS ALB, AWS Route 53, AWS Lambda, Azure AKS, Azure Key Vault, Azure Monitor, Azure DevOps, Microsoft Entra ID, Azure Service Bus, GKE, Google Cloud Pub/Sub, Google Cloud KMS
Container Orchestration: Kubernetes, Amazon EKS, Helm, Kustomize, Karpenter, Cluster Autoscaler, Docker, Containerd
CI/CD and GitOps: ArgoCD, Flux, GitHub Actions, Jenkins, Azure DevOps, Flagger, Argo Rollouts
IaC and Provisioning: Terraform, Terragrunt, Ansible, Packer, AWS CloudFormation, AWS CDK, Crossplane
Observability and Monitoring: Prometheus, Alertmanager, Grafana, Datadog, OpenTelemetry, Jaeger, Fluent Bit, PagerDuty, Thanos
Security and Compliance: HashiCorp Vault, OPA Gatekeeper, Falco, Trivy, Snyk, Checkov, IRSA, Workload Identity, mTLS, Cosign, tfsec, Pod Security Standards
Languages and Scripting: Python, Go, Bash, Java, Node.js, jq, yq
Operating Systems and Systems Engineering: Linux, systemd, CIS Benchmarks, auditd, SELinux, AppArmor, cgroups, Logrotate
PROFESSIONAL EXPERIENCE
Senior Platform Engineer September 2022 – Present
Truist Financial · Charlotte, NC
• Built and maintained a multi-account AWS platform spanning 14 accounts using Terraform with S3 remote state, DynamoDB locking, and Terragrunt orchestration, provisioning VPCs, Transit Gateway peering, Route 53 private hosted zones, and IAM roles governed by least-privilege SCPs across dev, staging, and production environments
• Engineered Amazon EKS clusters across three AWS regions using eksctl and Terraform, configuring Karpenter for spot-aware node provisioning with consolidation policies, PodDisruptionBudgets for zero-downtime rollouts, IRSA for pod-level IAM access, and ResourceQuota enforcement across 40 tenant namespaces
• Deployed ArgoCD using the App of Apps pattern with ApplicationSets, sync wave sequencing, and Prometheus-based health checks, enabling GitOps delivery for 60 microservices with automated Argo Rollouts canary analysis gated by Datadog SLO burn-rate metrics across dev, staging, and production
• Implemented HashiCorp Vault on Kubernetes using the Vault Agent Injector, configuring dynamic AWS credentials via the AWS secrets engine, PKI certificate issuance for internal mTLS, AppRole authentication for GitHub Actions service accounts, and automated lease renewal for trading platform credential rotation
• Authored GitHub Actions reusable workflows for Java Spring Boot microservices supporting Truist's retail banking platform, integrating Trivy image scanning, Snyk dependency checks, Checkov IaC policy validation, Docker BuildKit layer caching, and Helm chart version promotion with environment-gated approval steps
• Configured Istio service mesh across multi-cluster EKS environments, enforcing PeerAuthentication for mutual TLS across all inter-service communication, authoring VirtualService and DestinationRule CRDs for weighted canary traffic splitting, and integrating OPA Gatekeeper ConstraintTemplates to block unapproved image registries
• Instrumented platform microservices with OpenTelemetry SDKs exporting traces via OTLP to Jaeger and metrics via Prometheus remote write to Thanos, building Grafana dashboards with SLO burn-rate alerting rules and Alertmanager routing trees delivering incidents to PagerDuty escalation policies
• Reduced AWS EC2 spend across dev and staging clusters by configuring Karpenter spot node groups with SQS and EventBridge interruption handlers, applying bin-packing consolidation policies, and deploying Kubecost with per-team chargeback dashboards aligned to Truist's FinOps tagging taxonomy
• Established SLO-based reliability practices using Prometheus recording rules and Alertmanager multi-window burn-rate alerts, authored blameless postmortem runbooks for the retail banking API tier, and ran AWS Fault Injection Simulator experiments targeting EKS node failures and Aurora failover scenarios
• Authored OPA Gatekeeper ConstraintTemplates and Falco custom rules to enforce Pod Security Standards restricted profiles, restrict privileged container execution, and route Falco runtime alerts through Falcosidekick to Datadog and a dedicated Slack incident channel for the platform security on-call rotation
• Built Go-based CLI tooling for internal platform bootstrapping, automating EKS cluster registration into ArgoCD, Vault AppRole provisioning, and IRSA annotation injection for new service accounts, reducing onboarding time for new microservices across dev, staging, and production namespaces
• Hardened EC2 AMIs using Packer with Ansible provisioners applying CIS Level 2 benchmarks, configuring auditd rules for syscall monitoring, enabling SELinux enforcing mode, tuning sysctl parameters for network stack performance, and embedding Fluent Bit for structured log forwarding to AWS CloudWatch and Datadog
Senior DevOps Engineer April 2020 – August 2022
Humana · Louisville, KY
• Provisioned Azure AKS clusters across dev, staging, and production environments using Terraform with Azure CNI networking, configuring Cluster Autoscaler for burst capacity, Workload Identity with Microsoft Entra ID for pod-level credential access, and PodDisruptionBudgets for zero-downtime rollouts on HIPAA-regulated clinical workloads
• Designed Azure DevOps multi-stage YAML pipelines for Python-based clinical data ingestion services handling HL7/FHIR payloads, integrating Trivy container scanning, tfsec IaC validation, SonarQube SAST analysis, and Helm chart promotion with manual approval gates between staging and production for PHI-adjacent services
• Deployed Flux as the primary GitOps delivery mechanism for 22 microservices, configuring HelmRelease and Kustomization objects with image automation controllers, and enforcing OPA Gatekeeper policies requiring all container images to be signed with Cosign before admission into AKS production namespaces
• Secured PHI workloads by integrating Azure Key Vault with AKS using the Secrets Store CSI Driver and Workload Identity, rotating database credentials for Azure SQL and Cosmos DB on 30-day cycles, enforcing Azure RBAC with least-privilege role assignments, and enabling Defender for Cloud across all AKS node pools
• Instrumented Python clinical data pipeline services with OpenTelemetry SDKs exporting traces to Jaeger and metrics to Prometheus, building Grafana dashboards for HL7/FHIR ingestion pipeline SLIs including message throughput, parse error rates, and downstream delivery latency across dev, staging, and production
• Hardened AKS node pools by authoring custom Packer images with CIS benchmark Level 2 controls, configuring AppArmor profiles for container runtime restriction, enabling auditd for syscall-level audit logging, and deploying Fluent Bit DaemonSets for structured log shipping to Azure Monitor Log Analytics for HIPAA audit trails
• Authored Terraform modules for Humana's Azure landing zone, provisioning VNets with hub-spoke topology, Azure Firewall policy rules, private DNS zones for Azure SQL and Cosmos DB, and Azure Service Bus namespaces for event-driven communication between clinical scheduling and claims processing services
• Automated batch job orchestration for nightly clinical data reconciliation using Python scripts with the Azure SDK, scheduling cron-based Kubernetes Jobs across AKS namespaces, configuring systemd timers for pre-job environment validation, and routing job failure alerts through Alertmanager to PagerDuty on-call schedules
DevOps Engineer July 2018 – March 2020
Chewy · Dania Beach, FL
• Provisioned Amazon EKS clusters using Terraform and eksctl for Chewy's order management and inventory microservices platform, configuring managed node groups with launch templates, IRSA for pod-level S3 and DynamoDB access, and Cluster Autoscaler for burst capacity during peak e-commerce traffic events
• Built Jenkins declarative pipelines with shared library functions for Java Spring Boot order processing services, integrating Docker BuildKit multi-stage builds, Trivy image scanning, Snyk dependency checks, and Helm chart deployments with automated rollback triggers on failed liveness probe checks post-deployment
• Deployed ArgoCD with the App of Apps pattern for GitOps delivery of Chewy's inventory and fulfillment microservices across dev, staging, and production EKS clusters, configuring sync wave sequencing to enforce infrastructure-before-application deployment ordering and Prometheus health check gates for canary promotion
• Automated infrastructure provisioning for AWS SQS queues, SNS topics, and EventBridge rules supporting Chewy's real-time order event streaming pipeline, using Terraform with S3 remote state and DynamoDB locking, and configuring dead-letter queues with CloudWatch alarms routing to PagerDuty for order processing failures
• Instrumented Node.js Lambda functions handling Chewy's promotional pricing engine with OpenTelemetry SDKs, exporting traces to Jaeger and custom metrics to Prometheus via a sidecar exporter, and building Grafana dashboards tracking Lambda cold start latency, invocation error rates, and SQS consumer lag
• Hardened EC2 worker nodes using Packer with Ansible provisioners applying CIS Amazon Linux 2 benchmarks, configuring auditd for file integrity monitoring, tuning sysctl network parameters including tcp_fin_timeout and somaxconn for high-throughput order API traffic, and shipping audit logs via Fluent Bit to AWS CloudWatch
• Authored Python boto3 automation scripts for cross-account AMI sharing, EBS snapshot lifecycle management, S3 bucket policy auditing, and automated tagging enforcement across Chewy's AWS environment spanning dev, staging, and production accounts using AWS Config custom rules and Lambda-based remediation functions
• Configured Prometheus with recording rules and Alertmanager routing trees for Chewy's order management and inventory platform, defining SLI metrics for order placement success rate and inventory sync latency, and integrating Alertmanager with PagerDuty escalation policies and Slack channels for on-call incident notification
Software Engineer June 2017 – June 2018
Rackspace Technology · San Antonio, TX
• Authored Terraform modules for multi-tenant AWS infrastructure provisioning including VPCs, public and private subnets, NAT gateways, security groups, IAM roles with least-privilege policies, and S3 buckets with versioning and lifecycle policies for Rackspace managed cloud customer environments
• Built Jenkins scripted pipeline jobs for Java Spring Boot microservice builds, integrating Maven dependency resolution, Docker image builds with multi-stage Dockerfiles, ECR push steps with image tagging conventions, and automated deployment to ECS Fargate task definitions across dev and staging environments
• Containerized Java Spring Boot REST APIs using Docker with multi-stage Dockerfiles and BuildKit layer caching, authoring Helm charts with configurable liveness and readiness probes, resource requests and limits, environment-specific values files, and Horizontal Pod Autoscaler definitions for EKS deployment
• Automated Linux system configuration for Rackspace managed customer EC2 instances using Ansible playbooks and roles, applying CIS benchmark hardening tasks including password policy enforcement, SSH daemon configuration, auditd rule installation, and sysctl kernel parameter tuning for network and memory performance
• Authored Python automation scripts using boto3 for scheduled EC2 instance right-sizing analysis, EBS volume snapshot creation, CloudWatch log group retention policy enforcement, and IAM access key rotation reporting across multi-account Rackspace customer environments using AWS Organizations and cross-account roles
• Configured CloudWatch alarms and SNS topics for EC2 instance health monitoring, RDS storage threshold alerting, and ECS service deployment failure notifications, integrating alert routing to PagerDuty on-call schedules and authoring runbooks for common infrastructure failure scenarios across managed customer environments
EDUCATION
Master of Science in Information Studies
Trine University
Bachelor of Science in Computer Networking and IT Security
Islington College