Sri Sai Teja J
Sr DevSecOps / Site Reliability Engineer
LinkedIn | *******.*@*******.*** | Ph: +1-856-***-****
PROFESSIONAL SUMMARY
Senior Platform / AI Infrastructure / Site Reliability Engineer with 7+ years of experience designing, automating, and operating mission-critical, cloud-native platforms at scale. Proven leader in building secure, highly available, and cost-efficient infrastructure across AWS, GCP, and Azure, enabling enterprise systems, data platforms, and modern AI/ML and LLM workloads. Expert in Kubernetes, Infrastructure as Code, CI/CD automation, reliability engineering, and full-stack observability, with hands-on experience delivering MLOps and LLM infrastructure, including GPU-enabled platforms, model serving, inference optimization, RAG pipelines, and AI system monitoring. Trusted to lead cloud migrations, platform modernization, security-by-design, and SRE practices (SLIs/SLOs, error budgets) while providing production ownership, incident leadership, and cross-team enablement. Recognized for building scalable platforms that accelerate engineering velocity, improve reliability, and drive measurable business outcomes across industries.
EDUCATION
Bachelor of Information Technology from Jawaharlal Nehru Technological University (JNTU), Kakinada, India.
Master's in Computer and Information Science from Texas A&M University-Kingsville, Texas.
TOOLS AND TECHNICAL SKILLS
Cloud Platforms: Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure
Platform Engineering: Cloud-native platforms, multi-cloud & hybrid architecture, internal developer platforms
Infrastructure as Code: Terraform, AWS CloudFormation, ARM Templates, AWS CDK
Containers & Orchestration: Docker, Kubernetes (EKS, GKE, OpenShift), Helm
AI/ML & LLM Infrastructure: MLOps, LLMOps, GPU-enabled Kubernetes, model serving (REST/gRPC), inference optimization, RAG pipelines, vector databases, embeddings, AI observability
CI/CD & Release Engineering: Jenkins, GitHub Actions, GitLab CI, CircleCI, Bamboo, Spinnaker
Artifact & Code Quality: SonarQube, JFrog Artifactory, Nexus Repository
SRE & Reliability: SLIs, SLOs, Error Budgets, Burn-rate Alerting, Incident Response, RCA
Observability & Monitoring: Prometheus, Grafana, Dynatrace, ELK Stack, CloudWatch, GCP Cloud Monitoring
Automation & Scripting: Python, Shell/Bash, Groovy
Configuration Management: Ansible, Chef, Puppet
Databases & Storage: PostgreSQL, MySQL, SQL Server, Oracle, MongoDB, Cloud Object Storage
Networking & Security: VPC/VNET design, Load Balancing, IAM, Zero-Trust concepts, Secure APIs
Operating Systems: Linux (RHEL/CentOS/Ubuntu), Windows Server
Version Control & Collaboration: Git, GitHub, GitLab, Bitbucket
Production Operations: 24/7 On-call, Release Management, Change Management, Platform Support
PROFESSIONAL EXPERIENCE
Client: Radiology Partners, CA March 2023 – Present
Role: Sr. DevSecOps/Infrastructure/Site Reliability Engineer
Architected and operated secure, highly available hybrid cloud platforms (AWS & GCP) supporting LLM-based AI services, ML pipelines, and enterprise applications, using VPCs, Transit Gateways, Direct Connect, Private Link, and zero-trust networking.
Led a large-scale GCP-to-AWS migration of mission-critical platforms, re-architecting IAM, DNS, and network segmentation to support data-intensive ML training and low-latency LLM inference.
Built AI & LLM-ready Kubernetes platforms (EKS) with GPU-aware node groups (NVIDIA), autoscaling (HPA/Karpenter), workload isolation, and multi-tenant namespaces to run training, batch inference, and real-time model serving.
Designed and implemented end-to-end MLOps & LLMOps platforms, covering model training, fine-tuning, versioning, evaluation, deployment, rollback, and monitoring, using Docker, Kubernetes, Terraform, Helm, and GitOps workflows.
Implemented LLM infrastructure patterns including RAG (Retrieval-Augmented Generation) pipelines, vector databases (OpenSearch / FAISS / embeddings storage), prompt orchestration, and secure inference endpoints.
Deployed and scaled model serving frameworks supporting REST/gRPC inference, autoscaling, and traffic management, achieving 1K+ requests/sec, p95 latency <120ms, and 99.9% availability.
Integrated CI/CD for ML & LLM pipelines using Jenkins, GitHub Actions, AWS CodePipeline, enabling automated model builds, prompt versioning, evaluation gates, artifact storage, and production releases.
Implemented LLM and ML observability using Prometheus, Grafana, CloudWatch, tracking inference latency, token throughput, GPU/CPU utilization, error rates, cost per request, and SLOs.
Automated infrastructure and AI operations using Python, including environment provisioning, health checks, drift detection, and ML job orchestration, reducing manual effort and release failures by 65%.
Designed secure model, data, and secrets management using IAM, KMS, AWS Secrets Manager, Azure Key Vault, Google Secret Manager, enforcing least-privilege access for AI workloads.
Embedded DevSecOps for LLM platforms, integrating Trivy, SonarQube, Checkov, container scanning, IaC policy enforcement, and runtime security controls for ML/AI services.
Provisioned and optimized PostgreSQL RDS and object storage (S3/GCS) to support TB-scale datasets, feature storage, embeddings, and ML metadata management.
Delivered 30%+ cloud cost reduction, optimizing compute-heavy ML/LLM workloads via rightsizing, autoscaling, spot & reserved instances, data lifecycle policies, and GPU utilization tuning.
Enabled multi-account governance using AWS Control Tower & Service Catalog, securely onboarding data science teams, AI vendors, and third-party integrations in isolated environments.
Provided 24/7 production support across cloud, Kubernetes, and AI platforms; resolved 30+ Sev-1/Sev-2 incidents, conducted RCA/postmortems, and implemented preventative automation.
Mentored engineers on LLM infrastructure, MLOps foundations, Kubernetes networking, secure AI architectures, and platform reliability engineering.
Client: Ford, Dearborn, MI March 2021 – Jan 2023
Role: Site Reliability Engineer/DevOps Engineer
Designed and operated internal cloud platforms on AWS using EC2, Auto Scaling, ELB/ALB, VPC, IAM, Route53, S3, RDS, Lambda, CloudFront, enabling high availability, fault tolerance, and secure multi-tenant workloads.
Built and managed Kubernetes-based platform infrastructure (EKS & GKE) supporting multiple application teams, implementing networking, ingress, autoscaling, cluster isolation, and workload governance.
Led on-prem to AWS migrations for applications and databases, implementing CloudEndure-based replication to support DR, resilience, and zero-downtime cutovers.
Developed Infrastructure as Code platforms using Terraform, AWS CloudFormation, and AWS CDK, provisioning Dev, QA, Staging, Prod, and DR environments with standardized modules.
Created self-service CI/CD pipelines and Jenkins shared libraries (Groovy) for automated platform provisioning and application deployments.
Automated configuration management using Ansible and Python, ensuring consistent, repeatable server and cluster configuration across environments.
Implemented platform security and access controls, designing IAM roles, API Gateway authentication, secrets management, DNS governance, and secure networking patterns.
Established SRE reliability frameworks, defining Service Level Indicators (SLIs), Service Level Objectives (SLOs), error budgets, and burn-rate alerts for both platform and tenant applications.
Built full-stack observability platforms using Prometheus, Grafana, Dynatrace, Google Cloud Monitoring, enabling visibility into latency, availability, saturation, and error rates.
Designed SLO-driven alerting strategies, integrating Alertmanager, Dynatrace alerts, ServiceNow, Slack, and Webex, reducing alert noise and improving MTTR.
Actively participated in 24/7 on-call rotations, resolving 30+ production incidents, leading RCA/postmortems, and feeding learnings back into reliability improvements.
Served as Release Manager, coordinating production releases, executing health checks, and managing change requests via ServiceNow.
Led SRE adoption and enablement, training multiple teams on observability best practices, dashboard design, and reliability engineering principles.
Conducted SRE tooling POCs (Nobl9) to formalize SLOs, error budgets, and reliability reporting across critical platforms.
Client: Equifax, Alpharetta, GA Feb 2019 – March 2021
Role: Site Reliability Engineer
Built and operated Google Cloud Platform (GCP)–based internal platforms using Compute Engine, GKE, Cloud Load Balancing, Cloud Storage, Cloud SQL, enabling scalable and highly available workloads.
Led server and application migrations from physical/on-prem environments to GCP, supporting cloud adoption and modernization initiatives.
Designed and maintained Infrastructure as Code on GCP using Terraform and Deployment Manager, provisioning projects, VPCs, subnetworks, GKE clusters, and compute resources across environments.
Standardized infrastructure by converting manually provisioned GCP resources into Terraform-managed state, improving reproducibility, auditing, and change control.
Built and operated Kubernetes platform infrastructure on GKE, deploying workloads using Deployments, Services, Ingress, ConfigMaps, health probes, and autoscaling.
Integrated Kubernetes networking, storage, and security, providing consistent scaling, load balancing, and service isolation from development through production.
Designed and supported CI/CD platforms using Jenkins, Python, and Shell, automating build, test, and deployment pipelines for cloud-native applications.
Developed an internal cloud security validation tool, scanning GCP resources to enforce governance, compliance, and standard usage patterns across projects.
Implemented observability and monitoring platforms using GCP Stackdriver (Cloud Monitoring & Logging) and Prometheus/Grafana, providing visibility into latency, availability, errors, and resource utilization.
Built cloud visibility and cost dashboards, enabling teams and leadership to track GCP project usage and consumption patterns.
Deployed and managed centralized logging platforms (ELK stack) for log analytics, troubleshooting, and operational insights.
Supported data platform POCs, including Kafka and Spark integrations, collaborating with data engineering teams.
Authored automation and platform tooling in Python, integrating Git and Jenkins to support application delivery and operational workflows.
Trained and supported application teams in platform usage, observability practices, and Kubernetes-based deployments.
Client: Conning, Hartford, CT March 2017 – Dec 2018
Role: DevOps Engineer
Designed, provisioned, and operated secure AWS cloud infrastructure using EC2, VPC, IAM, ALB/ELB, Auto Scaling, S3, RDS, Route53, CloudFormation, delivering high availability and fault tolerance.
Implemented network and security architecture with private subnets, security groups, NACLs, VPC Flow Logs, NAT Gateways, Bastion Hosts, MFA, and LDAP integration.
Planned and maintained cost-efficient, scalable compute platforms, managing AMIs, auto-scaling policies, and load balancing for production systems.
Built CI/CD pipelines and DevOps workflows using Git, Jenkins, Docker, Chef, and CloudFormation, automating build, test, release, and deployment stages.
Implemented Jenkins as a shared platform, creating pipelines, shared libraries, DSL jobs, and master/agent architecture.
Automated application deployments including build, AMI creation, integration testing, performance testing, and production releases.
Built and managed Docker-based application platforms, supporting containerized application lifecycles.
Implemented monitoring and observability using CloudWatch, Prometheus, and centralized logging with Splunk and EFK (Elasticsearch, Fluentd, Kibana).
Integrated logging and tracing using Fluentd, AWS Lambda, X-Ray, improving troubleshooting and incident response.
Automated server provisioning and patching using Chef, supporting both Linux and Windows environments.
Strategic Initiatives & Program Contributions
Time-bound initiatives delivered as an early contributing team member, focused on building platforms, enabling operations, and supporting initial business scale-out in startup environments.
Healthcare Startup – Platform & Operations Enablement
Acted as an early platform and operations contributor, supporting the startup in taking a healthcare workforce solution from concept to production delivery.
Built and supported a multi-application ecosystem consisting of:
1. Mobile and web app for healthcare staff
2. Mobile and web app for facilities
3. Admin platform for scheduling, compliance, and vendor management
Implemented AI-assisted validation and pre-check workflows to support compliant staff onboarding, verification, and scheduling.
Contributed to end-to-end platform delivery, working closely with product, engineering, and business stakeholders through launch and stabilization.
Supported setup of a cost-effective offshore delivery (GCC-style) team, helping define early team structure, workflows, and execution practices.
Assisted with operational and early growth enablement, including vendor onboarding processes and initial client enablement activities.
Startup Idea-to-Platform Enablement
Supported founders as an early team member in transforming business ideas into production-ready B2B and B2C platforms, taking products from concept to MVP to early adoption.
Defined and executed lean delivery and operating processes (requirements, sprints, releases), enabling 30–40% faster MVP delivery in small, fast-moving teams.
Provided market and product input, helping refine feature scope and positioning, and supported early client acquisition enablement and onboarding readiness.
Drove cost-efficient development, leveraging AI-assisted tooling and optimized architecture, reducing initial build and operational costs by 20–30%.
Established end-to-end metrics and KPIs, tracking delivery velocity, cost, adoption, and reliability, and continuously improved execution through data-driven reviews.