
AI Architect

Location:
Berkeley, CA
Salary:
$235,000
Posted:
November 04, 2025


Resume:

Daemeon Reiydelle

Email: ********@*****.***

Phone: 415-***-****

Address: San Francisco, California (Berkeley area)

https://www.linkedin.com/in/daemeonreiydelle/

Professional Summary

AI Architect: Hands-on AI/ML architect and results-driven technologist with 15+ years of experience leading AI-driven digital transformation, building high-performance teams, and architecting scalable AI/ML infrastructures. A unique ability to bridge C-level business objectives with cutting-edge AI, HPC, and cloud solutions, LowCode AI, highly automated CI/CD, AI Cybersecurity & AI GRC. Proven track record in growing emerging practices, modernizing enterprise infrastructures, and architecting, prototyping, and delivering complex AI/ML projects for Fortune 500 clients. Thousand-GPU HPC, bare metal, Kubernetes/KubeFlow/MLFlow, LowCode Agentic AI (Agentic, RAG, GAN, RLHF), multiple Agentic Flow tools. Recent focus on enhanced use of SLM & MLM models to offload LLM workloads, including guided MoEs (not multi-agentic).

Core Competencies

GenAI & AI Predictive Analytics Application experience (Tensor, Scikit, MatPlot, Torch, Numpy, Pandas, PySpark, Jupyter, MLFlow, etc.), LLM, MLM, SLM, IoT ML, MoEs, advanced GenAI and GAN.

AI/ML Infrastructure & HPC Architecture, including tuning and optimization of AI data pipelines, training, tuning, RAG, RLHF, etc.

C-Level AI Practical Advisory & Strategy (AI COE, AI Data COE, AI GRC Risk, ROI management, etc.)

Kubernetes, Slurm, Airflow, Kubeflow, ArgoCD

NVIDIA GPUs (Ampere, Hopper, Blackwell), AMD MI300/300X, InfiniBand, SuperPods, BasePods, Cuda, BCM, etc.

AI Cybersecurity & Governance

LLM Pipelines, PyTorch, TensorFlow, Hugging Face, LangChain, LlamaIndex

Digital Transformation & Modernization

Cloud (AWS, Azure, GCP), Hybrid/On-Prem, Multi-Cloud

Team Leadership (200+ engineers)

Observability (MLFlow, Grafana, Prometheus, ELK, DataDog)

RFP, Proposal, and POC Leadership

Infrastructure as Code (Terraform, Helm, Ansible)

AI Ethics, Compliance, and Risk

Data Mesh, DataOps, MLOps/AIOps, DevSecOps

Selected Projects & Technical Achievements

Project

Role

Technical Impact

Lucid Motors

Chief AI Data Architect

Implement Jupyter-based low-code templates for Agentic GAN, RAG, and AI Predictive Analytics data pipelines using MLM/SLM agents and MCP/Agentic AI (GenAI MLM)

Improve GPU utilization and throughput by 21% (multiple 1000 node K8S clusters)

Resolve throughput and stability issues with Trino, Kafka for 20+% improvement in AI DataOps pipeline throughput

Upgrade various AI Apps to Python 3.10 from 3.8.

Dell AI Professional Services

Chief AI Architect

Develop AI Center of Excellence collateral and consult with multiple clients on implementation.

Developed and delivered AI Cybersecurity risk assessments.

Designed and deployed AI factories for petrochemical, pharma, and banking clients, supporting 100+ NVIDIA GPU nodes and petabyte-scale data (Nvidia Joint Venture)

Optimize InfiniBand and UFM for high-performance networking. RAG, Predictives, BioDynamics.

Walmart Online AI/ML Modernization

SRE/Observability Lead

Improved HPC utilization by 3x, reduced cloud footprint by $25-30M, and accelerated AI projects to production using Kubernetes, Airflow, and NVIDIA T100s.

Bosch Self-Driving Car AI

Global Architect

Enabled global CI/CD for ADAS regression testing, reducing disconnects by 30% and recovering a 1+ year late project using HPC, Kubernetes, and NVIDIA DRIVE Hyperion.

Google Telco AI Integration

SRE/DevOps Engineer

Onboarded 5G Core Network Functions to GDCE Kubernetes, supporting Jio, Deutsche Telekom, and T-Mobile. Optimized Vertex AI Cloud for RAN/MIMO and MEC use cases.

Sikorsky Helicopter Predictive Analytics

Technical Architect

Delivered IoT+ML Predictive Analytics in 90 days, generating $125M first year US DOD credits

Certifications

-Agile Certified Practitioner, Agile SAFe

-AWS Certification: AWS Architect, AWS Security

-Azure Certifications: Certified Architect, (Cyber)security Engineer, Azure AI Architect, Azure Trainer, Azure Security Engineer, Azure DevOps, Azure AI/Big Data

-Cisco Certified Network Engineer

-GCP Certification: Certified GCP Architect, GCP Security

-NVidia Partner Certified Associate: AI/DC, Mellanox, Accelerated Computing Fundamentals, HPC containers, Data Science Workflows, GenAI, GNN, RNN, Bright Clusters SuperPod(DGX)/BasePods, Accelerated Computing, BCM, Cuda, etc.

Citizenship: United States

Security Clearance: US Top Secret/EBI (expired)

Professional Experience

Lucid Motors May - October 2025

Data Engineering: AI DataOps Architect Contract

Initial scope of engagement: modernize the architecture of various in-production and under-development AI applications, including migrating from Python 3.8 to 3.10 and updating all the AI, DataOps, etc. libraries, as the infrastructure had not been updated in 2+ years. Work with teams upgrading Trino, Spark, Kubernetes, Docker Compose, etc. for the AI containers, in a way that improves the CMM and Observability of the AI teams.

Develop AI Cybersecurity posture, develop AI COE and GRC (AI) collateral, checklists, etc. Leverage ISO, NIST, EU AIA, OWASP, build out the AI CyberSec COE.

Implement AI Data Ops Center of Excellence: merging separate groups in Data Engineering into a cohesive AI Data COE. Transferring to Director of AI. Set up AI Cybersecurity standards (NIST AI, OWASP AI, ISO 23894, ISO 42001, EU AI Act) in AI project risk & cybersecurity assessments.

Modernizing and extending the low-code AI and MLOps environment (Jupyter, Kubernetes, Kubeflow, MLFlow, Aero, Pydantic-AI, Autogen) to support scale-out advanced AI, AI-PA: petabytes of IoT eventing driven by exceeding targets of global vehicle shipments. Cutting edge (IoT, MFG 4.0, ADAS 3): NVidia, Stanford, KAUST, and ETH. Happily struggling to refresh the advanced math I learned at university! Working with RAG teams to leverage SLMs/MLMs in the AI DataOps and Vector pipelines (vehicle problem diagnosis, explaining predictive alerts, etc.). Develop Jupyter notebook templates for Agentic Model Context Protocol (MCP).

Data Mesh from Data Marts: extending existing transactional (SAP S4, Rockwell Plex/FactoryTalk, Salesforce) data sources to include semi-structured, text, and image objects. Apache Open Data Mesh, Trino (moving to Snowrocks)

Developed a scalable Jupyter based LLM infrastructure by modernizing existing low-code environments to support advanced AI applications, handling petabytes of IoT eventing data.

Utilization improvements and scaleout of high-performance computing infrastructures for automotive manufacturing and vehicle telematics, leveraging NVIDIA SuperPods (DGX) and Oracle Cloud for GPUs. (Nvidia joint venture)

Implemented core IaC practices by using Docker and Kubernetes to orchestrate and manage containerized AI workloads.

Designed LLM and AI training pipelines using a variety of frameworks, including PyTorch, TensorFlow, and HuggingFace, with GPU support from CUDA 10/12.

Optimized GPU clusters and AI stacks to improve performance for predictive analytics, event detection, and real-time AI agents.

IoT, OT, ADAS: architect implementing best practices for AI-enabled applications supporting the vehicles, ADAS 3 (Nvidia EGX/MGX DRIVE Hyperion w/Saudi KAUST, Mobileye), predictive maintenance, and supply chain AI for US and MEA manufacturing operations and on-road vehicles (OT, telemetry, analytics). Evangelize Observability to improve performance and scalability and to focus investment in the highest-impact areas. AWS & OCI for high-performance AI-enabled computing infrastructures for automotive manufacturing, vehicle telematics, predictive analytics, and ADAS. Nvidia DRIVE AGX, RTI-DSS, simulation and training (K8S, Airflow, Nvidia DRIVE ADX ADAS 3 ADX, existing AV platform, Slurm)

Develop templated PYNBs for AI-enabled predictive analytics (Spark/PySpark, Trino, Pandas, PyTorch, SKLearn, TensorFlow, CUDA (K8S), MLFlow, MCP), support 2 drive train projects to deploy AI PA.

Upgrade 10+ existing AI applications from Python 3.8 to 3.10.

Developed templated Python notebooks for AI-enabled predictive analytics (Spark/PySpark, Trino, Pandas, PyTorch, SKLearn, TensorFlow, CUDA, MLFlow). Supported implementation of two AI Predictive Analytics applications (identifying various reliability issues) and two GenAI applications (fine-tuned MLMs).

Kahn Ventures Oct 2024 - Mar 2025

Consulting AI Architect Consultant

Assess AI startup pitch decks, focusing on data and operational layers, including AIOps and DevOps.

Evaluate AI infrastructure strategies, cost optimizations, and security mitigations.

Assess AI data processing options and leverage existing vs. new model development.

Advise on dataset quality and provider alternatives.

Identify day 0 opportunities to enable simple templatized AI Observability

Evaluated and advised on LLM infrastructure strategies, focusing on cost optimization, security mitigations, and efficient deployment on cloud platforms like AWS Bedrock and GCP Vertex AI.

Identified opportunities to enable simple, templatized LLM Observability by leveraging tools like MLFlow and Elasticsearch in AWS, GCP, and other specialized cloud environments.

Provided guidance on dataset quality and data processing pipelines for new LLM development and existing models.

Assessed and advised on leveraging GPU-enabled infrastructure from providers like Lambda Labs and RunPod for scalable AI operations

TensorFlow, LangChain, MLFlow/ElasticSearch, Pandas, AI agents, OpenAI, HuggingFace, SAP HANA, Oracle ERP, ServiceNow, Docker, Kubernetes, Airflow, Slurm, Python, HPE for AI/Nvidia, SuperPods(DGX), Run:AI, Mission Control, JFrog, AWS (CodeWhisperer, Trainium, Bedrock, Q/Connect, Inferentia), GCP (Vertex, Gemini, AI HPC, Model/Garden Builder), Codeweaver, Linode, Banana, Lambda, RunPod, MLFlow for DevOps and baselined SRE/Observability.

Dell Professional Services Sept 2023 - Oct 2024

Chief AI Architect Consultant

AI Modernization Practice: develop the practice and execute. Develop processes, training, and customer collateral for AI Center of Excellence consulting (driven by Value Stream Mapping techniques from Accenture, Deloitte, Bain). Develop Unified Data architectures; explain and evangelize Bias and Variance processes, capabilities, readiness assessments, and offerings; lead presales engagements (Architect/Practice Lead); engage at CIO/CTO level; deliver several (ongoing) engagements: GenAI, Hybrid Cloud, Big Data for ML/GenAI. Develop the Digital Human/Social Human Capital models for understanding transformations and risk, including hybrid SAP BW/HANA, S4/HANA, Oracle Cloud Apps, ServiceNow.

Designed and implemented full-stack Kubernetes solutions with Run:AI to support emerging hardware platforms like NVIDIA AI Enterprise and AMD MI300/300X for LLM training and inference.

Developed AI modernization practices with a heavy focus on infrastructure readiness assessments for GenAI and LLM applications, including a go-to-market strategy for Dell/NVIDIA BasePod/SuperPod (DGX).

Created LLM data strategies for various industries, focusing on bias, variance, and model reliability in distributed training environments.

Led the installation and tuning of client LLM infrastructures, including Hugging Face and NVIDIA NeMo, and configured NVIDIA Unified Fabric Manager (UFM) and InfiniBand for high-performance networking. Worked to leverage purpose-built MLMs for clients' RAG and GAN architectures.

Architected and supported AI stacks with transformer networks like LLAMA2, FALCON, and MIXTRAL, and resolved bottlenecks in scalable model pretraining workflows.

AI Cybersecurity/Ethical/Reliability assessments: develop the assessments with Dell Delivery AI teams and execute (ISO 42001, OWASP AI, NIST AI V2, EU AI Act, GDPR, PII, HIPAA, etc.): Support emerging hardware platforms and readiness assessments: NVidia AI Enterprise, AMD MI300/300X. Full stack Kubernetes with Run:AI. Heavy focus on data (Bias, distribution, normalization, synthetic data, etc.). Health care, banking, sports medicine.

Key POC for the Dell/Nvidia BasePod/SuperPod(DGX) Go-To-Market strategy (and POC to Nvidia Partner Program). Heavy leveraging of Lean processes (VSM, Kaizen) to differentiate Dell as a trusted consulting partner.

Support Nvidia partnership, support in-flight and presales engagements (Dell BasePod/Nvidia SuperPod).

Focused on AI modernization, cybersecurity, governance, and infrastructure readiness assessments.

Developed unified AI data architectures and readiness assessments.

Designed GenAI data strategies for bioinformatics, sports medicine, and pharmaceutical data mining.

Created AI modernization frameworks, focusing on bias, variance, and model reliability.

Led presales engagements and technical implementations across multiple industries.

Developed reference implementations and white papers for the Nvidia NIM and Hugging Face HUGS ecosystems (GenAI, Image Tagging, including SRE Observability best practices for NIM configurations, MLFlow/Elasticsearch/Logstash/Kibana enablement in NIM containers, etc.).

AI DevSecOps: AI Cybersecurity/Ethical/Reliability assessments: bringing my experience with actual client AI application issues, develop the assessments with Dell Delivery AI teams and execute (ISO 42001, OWASP AI, NIST AI V2, EU AI Act, GDPR, PII, HIPAA, etc.): Support emerging hardware platforms and readiness assessments: NVidia AI Enterprise, AMD MI300/300X. Full stack Kubernetes with Run:AI. Heavy focus on data (Bias, distribution, normalization, synthetic data, etc.). Health care, banking, sports medicine. Customer Demos (and support other teams having performance issues in their demos): NVIDIA NVAIE, Triton Inference Server, NeMo, etc., and AMDs ROCm frameworks.

AI DevSecOps/Tanzu (VMWare Tanzu Labs): Tanzu for Nvidia AI DataOps NIMs: Tanzu Kubernetes Grid (TKG), tune client’s Spring apps, CI/CD security (Tanzu App Catalog), Tanzu AI Solutions (DevOps/MLOps), SRE/Observability improvements for clients with pre-configured K8S extensions to Tanzu Cloud Health.

Delivering training classes: (Internal) Nvidia Enterprise AI for Dell, Dell Validated Design: GenAI Clusters in the Data Center (Nvidia EAI on Dell); Dell Data Mesh (DataBricks) for semi-structured, image, and unstructured data, labelling, tagging, etc. processes for GenAI, AI Data and ML data utilizations and gotchas, etc.; Dell Validated Design: Data Wrangling for GenAI; (Internal) special topics for DVD for Nvidia (Hugging Face: Mistral, Run:AI; RAG w/Feedback (RLHF vs. RLAIF), DPO, Imitation Learning with DVD); Converged Data Centers: GenAI in the converged, hyperconverged, and virtualized data center; Cybersecurity for Generative AI; Digital Human: Implications for GRC, cybersecurity, and business processes.

Dell AI Factory: R760xa, through XE9680A/Dell AI Factory with Nvidia, Nvidia BCM, Nvidia Unified Fabric Manager (UFM)/Adaptive Routing/UFM Subnet Manager, etc.; RHEL AI: OpenShift AI, IBM OS Granite LLM, Large-scale Alignment for chatBots (LAB), IBM/RHEL InstructLab; NVidia Omniverse OVX; PyTorch, Llama 3, Pinecone DB, Pandas, NumPy, NextData, DataBricks, NER/Topic extraction, SpaCy, BigPanda, HuggingFace, MLFlow, Kubeflow, OpenAI API, etc. Support clients with IoT data volumes for AI Predictive Analytics, event detection, AI optimizations: Smart Grid, power distribution, solar/wind load optimizations; real-time human sports kinetics, etc. at petabyte scale. Installation/configuration/alerts setup in Splunk, DataDog, ELK.

Installation, tuning, and upgrades of client AI infrastructures: Databricks, Hugging Face, Nvidia NeMo, NIS, Nvidia Base cluster manager as part of hardware and solutions sales: petrochemical, pharma, banking, higher education, etc.: support AIOps teams porting applications (image, pharma, RAG). BCM installations and upgrades. 7+ clients. Work on various POCs (Emory SPARC medicine, University of Tennessee – student outcomes, City of Austin, George Bush Airport modernization, etc.)

MLOps/AIOps: Install, configure, support AI engineers: Dell Nvidia clusters (up to 200 H100 GPUs), Nvidia UFM (Unified Fabric Manager), InfiniBand setup and testing (5-20 Dell node clusters, XE9680s thru XE8640s) internally and at client sites (3rd level support), Kubernetes, Bright Cluster Manager, NeMo, Nvidia Inference Microservices/NI, Ceph, Slurm. Architect data meshes (unstructured/semi-structured) for multiple clients (health care, medical schools, manufacturer, midsized airport (DHS pilot), etc.). Hybrid cloud and private cloud solutions (AWS, Azure, some GCP), master data management, end-to-end governance (including reputational, legal, financial risk). Work with implementation teams to migrate and add semantic content to multi-terabyte scale data (Hadoop, cell tower IoT events, images, kinetic sports medicine modeling), semantic chunking, optimize AI stacks through UX, reduce UX response times, improve AI tech stack utilization, etc. in Kubernetes (Nvidia Base, GKE, AKS, EKS), Slurm, Jupyter, etc. NiFi, Kafka, ADS, Databricks, etc.

Perform AI stack (and task) tuning for multiple clients' deliveries: stabilize/tune scalable model pretraining workflows, debug/resolve bottlenecks as 3rd level Dell AI customer-facing support. Architecture, implementation, and architectural issues of transformer networks like LLAMA2, FALCON, MIXTRAL, T5. GenAI emerging issues: e.g., data quality, bias, explainability, training & tuning, data tagging, chunking, model degeneration, hallucination, overreliance.

Dell/Nvidia Presales, POC, MVP: EU Truck manufacturer IoT, ADAS events (training, regression testing); various sports medicine data storage and AI training/tuning processes, Dell Data Mesh (DataBricks) setup and flows, NiFi/Kafka optimizations; Banking credit card data storage and ML training/tuning; Cell phone 5G MEC and 4G/5G IoT event data flows (ML Predictives at scale)

Support internal POCs and client engagements of data, AI, and Ops layers per Nvidia BasePod architecture standards (K8S, BCM, DGX, Slurm, BCM Data Mesh, etc.): installation, tuning, and upgrades of client AI infrastructures: Databricks/HuggingFace/OpenAI/Nvidia NeMo, NIS, Bright/Base cluster manager as part of hardware and solutions sales: petrochemical, pharma, banking, higher education, etc.: support AIOps teams porting applications (image, pharma, RAG). BCM installations and upgrades. 7+ clients. Work on various POCs with the Databricks Mosaic AI team (Emory SPARC medicine, University of Tennessee – student outcomes)

Support presales and consultative POCs with Nvidia to showcase Nvidia BasePod architecture and solutions (Nvidia prebuilt solutions (NIMS), K8S/BCM, DGX, Slurm, BCM Data Mesh, etc.). Beta Nvidia DXC Accelerated Computing (BasePods)

Nvidia BasePods, Nvidia SuperPods (DGX), Nvidia BCM, Nvidia Accelerated Computing, Lambda, RunPod, MLFlow (specific offering for MLOps Observability), DataBricks, Run:AI (pre-acquisition), DataBricks Mosaic & Data Mesh, BCM DataMesh, Base/Bright Cluster Manager/Kubernetes, OWASP, NIST, ISO, EU AIA, OpenAI, HuggingFace.

Key Engagements

Bank (US, EU): Nvidia EAI/NIM setup, Ethical/Reliable AI Life Cycle assessment, AI Cybersecurity assessments

Bank (US/EU): Nvidia EAI/NIM setup, AI Cybersecurity assessments

Cellular provider (EU): Dell/Nvidia Clusters for GenAI in the 5G MEC.

Cloud provider (US) AI acquisition

Health Care: two Medical School sports medicine facilities

Insurance (US): GenAI Cybersecurity assessments, Ethical/Reliable AI Governance in the life cycle assessments, etc.

Global energy sector consulting firm: Support the buildout of their infrastructure to support 100 16 Nvidia GPU nodes: Nvidia BCM, Kubernetes, Slurm. Tuning, assist AIOps teams in migrating to current NeMo, etc.

Military medical facility: architect shared service model for advanced AI, Digital Assistant process enhancements.

Petroleum (US): GenAI Cybersecurity assessments (Dell/Nvidia AIE/NeMo/NIM setup)

Pharma (US): GenAI Digital Humans for health care professionals (AI Cybersecurity Assessment, Nvidia AIE/NeMo/NIM setup)

Pharma (US): GenAI for LIMS (AI Cybersecurity Assessment), AI Stack improvements

Truck manufacturer (EU) (AI Cybersecurity Assessment, Nvidia AIE/NeMo/NIM setup)

Anthropomorphics Inc. 2019 - 2023

AI SRE/Architect Employee

Provide expertise in SRE Optimization driven AI (GPU HPC, Kubernetes) Architecture, AI SRE, AI Cybersecurity assessments, AI SOWs/POCs/MVPs to a variety of consulting firms for their clients, supporting the major Python AI Stacks (TensorFlow, PyTorch, Keras, MLFlow, etc.)

Re-architected an open-source big data platform to a hybrid cloud SaaS model to support generative AI and LLM-specific data best practices

Architected and managed Kubernetes GPU clusters (100+ GPUs) to support large-scale LLM models for telcos, including 5G and IIoT use cases

Led efforts in network optimization using technologies like InfiniBand and RDMA to enhance GPU performance and reduce bottlenecks across distributed training systems

Partnered with Microsoft and OpenAI to enable LLM container deployments on Azure Edge AKS

Leveraged Kubernetes and cloud-native technologies (AWS, Azure, GCP) for provisioning, monitoring, and optimizing GPU clusters for predictive analytics and AI workloads.

Ligadata

Practice Modernization June - Sept 2023

Work with CIO to develop AI Center of Excellence. Taught management Value Stream Mapping techniques to add customer value. ML Predictive analytics extensions to BI, GenAI for problem resolution support for telcos in emerging markets: Work with CTO and team to rearchitect the existing open-source big data platform (Hadoop, Hive, Kafka) to hybrid Cloud SAAS (AWS, Azure, GCP) around GenAI Data specific MLOps best practices, unstructured and semi-structured big data, support Databricks partnership in AME. Identify opportunities around 5G IoT, continued support for 4G IoT, IIOS, MFG 4.0 leveraging current SAAS ML pipelines. Work with CEO and Heads of Sales teams on engagements in Dubai, UAE, Nigeria, Egypt. Extend existing predictive analytics and chat diagnostic RBESs to leverage generative AI over new data. Identify and resolve architectural and performance related issues across the client base (3rd level support). Hadoop optimization (HBase, Hive). Enhancements to ELK, Open Telemetry for observability improvements.

RFP responses and SOWs for AI modernization initiatives for telecoms in emerging markets, transitioning big data platforms to hybrid cloud SaaS. Incredible sales conversion rate improvements after sales teams began leveraging simple value stream mapping.

Worked with leadership teams and 2 key clients leveraging Value Stream Mapping to develop focused improvements to SAAS ML pipelines (Tensor, MLFlow) for 5G, IIoT, and Manufacturing 4.0.

Partnered with Microsoft/OpenAI to enable OpenAI container deployment on Azure Edge AKS.

Architected distributed AI/ML pipelines and managed GPU clusters (100+ GPUs) supporting large-scale ML models for telcos in emerging markets, including 5G and IIoT use cases.

Utilized Kubernetes and cloud-native technologies (AWS, Azure) for provisioning, monitoring, and optimizing GPU clusters for predictive analytics and AI workloads.

Led efforts in network optimization (VXLAN, fat-tree architecture) to enhance GPU performance and reduce network-related bottlenecks across distributed training systems.

Training sessions for sales and technical teams: hybrid cloud big data, best practices for migrations, etc.

Presales advisory for 3 Ligadata clients' AI cloud modernizations

GCP Edge, Azure Edge, AKS, GKS, Apache Kubernetes, HBase, HQL, Hadoop, BigData, BigQuery, AI Predictive Analytics, Generative AI/GenAI, NAG, LLM, Kubernetes, GPU scheduler, Nvidia Enterprise AI, Terraform, ETL via NiFi/Kafka, Elasticsearch

Google

Edge Compute SRE/Observability Architect Nov 2022 - June 2023

Supported AI and AIOps integration in Google Cloud for major telecom providers (Jio, Deutsche Telekom, Telus, TIM). Supported engagement management by teaching and using value stream mapping techniques for SOWs.

Assisted in onboarding 5G Core Network Functions to Kubernetes (GDCE) for enhanced automation and scalability.

Validated and stress-tested Google's internally developed AI predictive analytics for telco analytics, integrating Vertex AI Cloud to Edge for Telco GPU-enabled GKE clusters.

Technical integration engineer (DevOps/AIOps) supporting Google Cloud Telco (install, tune, optimize) in pilots for Ericsson, Nokia, Casa: onboarding of their 5G Core Network Functions to GDCE Kubernetes for various telco providers (Jio, Deutsche Telekom + T-Mobile, Telus, TIM (Italy, Brazil)) in 5G Core on Kubernetes (GCP Edge), supporting 5G RAN test integration and Spirent RAN load testing. Validating, load & stress testing (simulating 1-2Tb/sec of multi-node IoT eventing), CI/CD automation into Google Distributed Cloud Edge (GDCE) and hybrid cloud (GDCE/GKE): provide security, k8s configurations for NF networking, Mellanox NICs, performance tuning, Helm chart support: Google Telco Solutions, Kubernetes, NF Networking, GCP, ELK, Grafana, integration with Open Telemetry.

Telco Analytics Solutions: support integration (testing) Google’s Vertex AI Cloud (GenAI RAG) to Edge for Telco RAN/MIMO and MEC optimization, training pipelines via DataProc & DataFlow on BigTables/DataProc (similar to HBase/Hadoop).

Western Digital/SanDisk

SRE/O April - Nov 2022

Lead SRE/DevSecOps/Observability: support the migration from Red Hat OpenShift (Kubernetes) to hybrid Google Anthos Kubernetes for VMware (On Prem) + AWS EKS, with continued support of a specialized OpenShift cluster running DPDK (Data Plane Development Kit) for shared NVidia GPUs (and debugging out of vector issues due to pod affinity defaults for very large nodes). Improving AI-driven manufacturing insights & decision making, support the stabilization and migration of additional applications from VMWare: application (re)architecture, technology: buildout of Center of Excellence for Public/Hybrid/Private cloud, extending Splunk reporting and dashboards for reactive reporting of unexpected events (including security).

AIOps: Rearchitect virtualized (VMWare) GKE GPU HPC environment: set up MLFlow tracking to central tracking, identified causes of large model snapshot/folding timeouts, excessive data copying (shared storage/server, GCP/Azure cloud). Blue/green, canaries, etc. via MLFlow (and added model registry to check-ins) to improve throughput, worked with the Keras/MLFlow based teams to deliver 3x improvement in GPU utilization, 2x faster image recognition, observability improvements in client’s fabs. Reduced fab error rates by 3-7%, increased accuracy and utilization of cell test system by 4x, reduced expedited shipment delivery costs by 2x.

Technical architect/deployment support: supporting new and existing AIOps/IIoT/Digital Twin deployments to Fab data centers (GKE Anthos/VMware) worldwide; DataIQ integration; Cloudera (CDP/CDF) on Kubernetes. Architect new AI systems into Kubernetes, Rancher, Confluent Kafka, NiFi, DataIQ, CDP Spark, NVidia VMware nodes. Support Looker ETL and query performance improvements.

Observability – Security in Depth: Improve security posture of global private clouds (China, Thailand, Israel, US, India) - Kubernetes active intrusion detection, enable TLS, mesh, etc. Improved Splunk performance to resolve issues with IIoT image processing AI (Spark NVidia Tensor, 1200 to 200 msec improvement, increased ML parms). K8S Observability COE Chief Architect: DataIQ Integration, GCP, GCP Anthos Private Cloud for VMware, OpenShift Container Platform to Portworx, Anthos managed AWS EKS + GKE, AWS CI/CD Jenkins Pipelines, OpenShift/Anthos/Kubernetes/Container/Networking/NVidia/VMware vSphere. Hashicorp Vault, Active Intrusion Detection support, CI/CD code quality/Software Supply Chain Vulnerability testing, evaluate various security posture products (Qualys, BeyondTrust, Palo Alto Networks, Forcepoint, Proactive & reactive IDSs, etc.). AWS IAM, CloudWatch/Trails, Splunk, integration for EKS, Redshift, Cilium. External vendor (Workday, SAP, Oracle ERP) security and application integration technical architectures

Deep dive into complex technical issues affecting stability, scalability, and security. Lead relationship (technical) between Western Digital and Portworx, Google Support, F5, Palo Alto Networks, Mandiant, AWS Cilium, Cloudnetics.

Own the technical relationships with AWS, Google, Portworx, RedHat (Rancher, OpenShift) for all operational, technical and architectural asks (FAAS, SAAS, some IAAS/PAAS).

For the test cell analytics AIOps (ML feature detection) team(s), resolved scalability and performance issues around applications running in the pods, integrating to Vertex AI; helped the team optimize containers and code; worked with GCloud to identify Kubelet configuration issues and Linux kernel (CNI, conntrack, & NAT) issues; improved training, Looker performance, etc.

Improve application responsiveness by enhancing Kubernetes operators for AIOps training, inference exception eventing, monitoring, and complex Spark/Kafka/NiFi ELT needs.

AIOps improvements for GKE (TensorFlow: TPUs & GPUs; VMWare/GKE, HP HPC), NiFi, Kafka, Hadoop/HBase, MongoDB, Elasticsearch for predictive analytics and event-driven ML inference for various digital twin applications. Extensive buildout of ELK and Grafana.

11 Fabs (3 US, Israel, India, 2 Thailand, 2 China, 2 Japan) with collocated data centers, running 12 clusters of Anthos Kubernetes (GKE), ~2k K8S nodes, 1M pods, support subset of global business apps running in AWS and GCP, moving to Onprem Anthos, with global ML primarily in GCP, and general business in AWS: except for Fab based operations collocated due to 1-2 petabytes of daily data per fab, client is multicloud native (no data centers). VMware, Rancher,


