Henry Truong
Cell: 408-***-****, Email: *************@*****.***
OBJECTIVE
Seeking a senior technical and/or managerial position in the areas of AI Machine Learning, MLOps, Cloud DevOps SRE, QA/QE, and Analysis/Tuning/Testing for System Performance/Scaling/Reliability.
SUMMARY
Results-oriented, hands-on engineer and manager of DevOps/SRE/Test/QE/Development projects.
Strong focus on performance, scaling, resiliency, system monitoring, troubleshooting, security and quality.
Have built, mentored, and managed on-shore/off-shore engineering teams of 5-50, many of them QAEs/SDETs.
Worked at companies as director, senior manager, architect, and principal IC engineer.
EXPERIENCES
TikTok, Inc., Mountain View, CA - 01/2023 - Present
Sr Staff SRE, AI Machine Learning Engine and Platform, US Tech Services
AI ML Engine and Platforms (MLOps):
• TikTok core AI Machine Learning Recommendation Engine (AML engine) - CPU/GPU Model Serving Infrastructure, Training/Online Parameter Server infrastructure and CI/CD deployments
• TikTok AML Platform (k8s Training platform and GPU Model Services platform)
• Extended-hours on-call support for many ML services and models across regional time zones (China/Singapore/Europe/US), resolving an average of 40 on-call tickets per week with MTTA and MTTR within SLOs and SLAs, and responding promptly to many non-US off-hour calls.
• Capacity Planning and Tracking for CPU and GPU requests/allocation for all ML platforms (PS, Serving Engine, Training/Serving platforms) user groups
Tool Development:
• Implementation of AML Platform services' log censoring/redacting encoding system (flask, python, APIs), and Compliance system (extensive regex pattern-matching code)
• censoring/masking out sensitive data, unmasking of non-sensitive operational data.
• CI/CD Parameter Server preparation/upgrade/deployment/verification pipeline (mesos and k8s targets)
• Implementation complies with Data Privacy and Security requirements, as certified through all the Data Privacy and Security validation and training.
Technology Stack: mesos, k8s, tensorflow, pytorch, CPU/GPU models/services, parameter server, cuda/nvidia-smi, ES, grafana, Mongo DB, python, flask, a little go and JS, Rancher (Desktop)
eHealth, Inc., Santa Clara, CA - 11/2021 - 01/2023
Principal SRE & Automation Lead, TechOps SRE Group
SRE Lead for Observability CI/CD automation and troubleshooting of many kubernetes microservices.
Spearheaded, architected and coded tools for the Observability stack (prometheus, prometheus rules/metrics/alerts, thanos, jaeger, grafana, blackbox exporter, node exporter, postgres exporter, cloudwatch exporter, alertmanager, fluentbit/fluentd, spotio logs, status page, pagerduty, rancher, vault, docker-repo, aws lambda functions for log processing into AWS OpenSearch).
Automation includes gitops multi-branch project pipelines with ticket approval process integration (git, jenkins), pagerduty incident-triggered troubleshooting runbooks and statuspage, proactive monitoring based on jaeger traces, and proactive probing of portals from outside to simulate customers (argo-cd was not used, due to specific process/repo requirements):
All gitops automation is done with Jenkins groovy pipelines and bash scripts. Pipelines include prepare, dry run, apply/update, upgrade, test/verify, and rollback.
AWS Lambda functions for incidents and log transformation are mostly in python.
Involved in infrastructure terraform automation and self-service (Vue.js + java + terraform)
Automation of Runbooks and root-cause analysis troubleshooting based on incident alerts.
Hands-on troubleshooting of all the observability stack services on AWS EKS at the k8s command-line level, in addition to using prometheus/thanos and grafana dashboards. All command-line troubleshooting runbooks are automated over kubectl and helm.
PagerDuty admin and On-call daily for all Observability incidents & k8s service troubleshooting
Automation and integration - PagerDuty with statuspage, MS Teams, and JIRA.
Constant FinOps tuning - Optimization/Performance/Capacity monitoring and analysis of all Observability services, infrastructures, and third-party tools for cost-saving and scaling.
Configuration and monitoring:
security SIEM auditing system with IAM, API access cloudtrails
RBAC roles at namespace and cluster-level on EKS, k8s secrets, Secret vault
Sensitive log data masking with AWS Security
LinkedIn (through Harvey Nash), Sunnyvale, CA - 02/2021 – 10/2021
DevOps SRE Contractor, GRID SRE Group
Work on various SRE tools for big data Hadoop GRID cluster management:
Automation of the on-premise cluster management (expansion, migration, decommission, auditing of Hadoop clusters of 20K+ nodes across multiple region colos)
Jenkins self service pipeline design and automation (groovy, bash, python)
Extending and developing base service access layer in python to interface with Jenkins self service jobs, Kerberos, Jira, LDAP, Kafka, Slack, Salt, and other LinkedIn DevOps systems.
Work on Hadoop grid SRE Jenkins-based self services and pipelines.
Work on SRE library tools to move all current stateless Hadoop services (KDC, LDAP, Salt, Kafka, Spark, Presto, Azkaban, …) running on the on-premise GRID to k8s/Kubernetes and migrate them into Azure.
NTT Data (Wells Fargo client), Fremont, CA, 04/2020 – 01/2021
Software Development Advisor, DevSecOps SRE Group
Advise and help Wells Fargo DevOps tool group with architecture, implementation, and maintenance of DevOps automated tools and application pipeline jobs and tools:
Enhance, re-architect and maintain CI/CD Jenkins pipeline tools in groovy, bash, and python
Automated code integrating with Jira, github, and Jenkins (some groovy) to run pipeline build & release jobs
Automated tools for infrastructure checking, build metrics (could do it in python numpy & pandas)
Set up pipelines for deployment/decommission of Hadoop clusters in the on-premises data center (machines include standard high-end servers and Nvidia GPU servers)
Lead developer for DevOps node.js/express.js API services for management of configuration properties and the application onboarding process (involved with react.js UI development, but mainly responsible for the design and implementation of all backend API services).
Fairly fluent with bash scripting, python scripting
Strong knowledge of Linux system (OS, kernel) performance and system issues (CPU, memory, IOPS, kernel, drivers, disk space, network, sockets, resources, cron jobs, …)
Understanding and use of Salt, Cfg engine
Enlighted Inc., a Siemens Company, Sunnyvale, CA, 09/2018 – 04/2020
Principal Full-Stack (DevOps SRE & QE) Consultant, Enlighted TechOps (DevOps/IT/QE) Group
Lead performance analysis, reliability and scaling testing, and tool development for backend cloud services, which include clusters of Haproxy, Consul, Redis, Spring app servers, Kafka, Storm, and Cassandra services.
Develop test automation framework tools and test suites using JMeter (scripting in Java and python):
Responsible for SRE system testing and DevOps automation for Enlighted IoT products and cloud solution:
Responsible for all frontend API and sensor/gateway backend Cloud services’ SRE
Automation of SRE testing from different cloud interfaces for different product versions:
Management/Orchestration interface: simulating user REST interfaces hitting Tomcat application servers, Redis and Cassandra DB from the user side, and
Sensor & Gateway interface: Sensor data injection from the gateway side through Kafka, Storm, and Cassandra pipeline running on different environments on VM-based GCP cloud.
Automate test reports and tools for TPS, latency, and memory, and integrate them into Jenkins CI
Test Framework and automated scripts include performance analysis and problem detection tools and reports for LB haproxy/snapt log, JVM GC log, HeapDump, and ThreadDump
Help fix and troubleshoot a number of system performance and scaling issues, such as memory, sockets, CPU, threading, IO, Java hotspot profiling, thread-dump, stack-dump, and load-balancing algorithms for the product big data backend.
Work on and fix development code for web (Spring Boot) services and data (Kafka/Storm/Cassandra) services.
Responsible for GCP cloud deployment using automated Ansible and bash scripts (Nexus for artifacts)
Administer and maintain CI/CD Jenkins servers on GCP, hardware toolchain build servers, SDLC pipelines:
Set up and manage Jenkins pipelines for software development, QA, production
Manage and configure hardware toolchain build servers and pipelines
Help in evaluating/choosing, developing and maintaining other DevOps tools and cloud infrastructures: LBs (haproxy/nginx/snapt, i.e. LB ADC, WAF, Caching, SSL-offloading), Firewall, Splunk.
Look into security and privacy testing and assessment for the entire IoT product (frontend & backend)
Backend penetration testing, vulnerability, data leak, data loss, access control, authentication/authorization (SSL/certificates), data storage using tools nmap, OWASP ZAP, SonarQube.
Frontend Browser/mobile app privacy – autofill, account, fingerprinting
Comcast, Sunnyvale, CA, 08/2017 – 04/2018
Principal Cloud SRE Contractor, Advanced Security Group
Responsible for architecting, integrating, system testing, and (DevOps) laying out of an integrated security product, which involves millions of Comcast CPE Linux-based gateways and security analytics AWS cloud (Linux Ubuntu VMs), doing device fingerprinting, security checking, and parental control:
Work with partner company to architect and deploy services to various AWS regions, VPC, subnets (public/protected/private), Availability Zones to map analytics services, i.e. application service, Kafka messaging service, partner API (websocket) services, RDS, fingerprint/security services (user device analytics, url-checker, ipset, speed-test, parental control, etc.)
Responsible for the implementation of product compliance to Comcast security standards as the modems in the Comcast HFC network are connected to the AWS analytics cloud through a direct shared link.
Administer the AWS dev/test staging, scaling test, and production clouds.
Constantly monitoring application performance metrics with NewRelic and Admin tools
Cost planning and analysis through capacity modeling test (write scaling test in node.js)
Set up and monitor webapp (mortar) Jenkins branch builds. Use github, gerrit, Jenkins.
Deploy webapps to Cloud Foundry PAAS environment (PCF) and monitor them.
Testing:
Help automate security product functional test framework and components:
End-to-end test (gateways -> analytics cloud -> API cloud -> gateway UI) in python
Automate and run Cloud Scaling tests (up to 15M gateways, 15 devices/gateway) in node.js (REST over WebSocket over STOMP). IDEs used: Visual Studio Code, IntelliJ.
Do performance analysis, capacity modeling, and trouble-shoot system queueing issues.
Developed UI automation tests using Selenium and Appium frameworks
Use many monitoring tools such as CloudWatch, TICK/Chronograf, NewRelic, Kibana, and Splunk
Define and monitor key application and system performance metrics
Participate in and perform security reviews of the whole product. Familiar with OWASP ZAP (VA/port-scanning (nmap)/Pen-testing/Fuzzing-testing), OAuthv1/v2, Certificates, SSH-CA, Ansible Vault.
AutoDesk, San Francisco, CA, 03/2016 – 08/2017
Principal SDET Contractor, Cloud Licensing and Order Fulfillment Systems
Working on AutoDesk Cloud Licensing and Fulfillment Services product integration/QA/system testing with a nearly 100% test automation approach (AWS server cloud on Linux Ubuntu EC2 instances, docker containers running undertow, http-proxy, consul, Dynamo DB, ES, graphite/Datadog, Kibana/logstash, NewRelic)
Licensing Service Cloud:
Lead QE/Test architecture, automation design and implementation:
Developed UI Functional test suite using Selenium and TestNG frameworks, integrated it into Jenkins Selenium Grid
Maintained QA Functional API test automation in python (pytest framework)
Designed and implemented most parts of the new test automation in node.js using the Mocha test framework
Own system testing - implemented JMeter tests for Performance, Scaling, Stability, Stress, Auto-Scaling, and node.js tests for HA/Resiliency. Good understanding of OS/System performance metrics (OS kernel, I/O, TPS, latency, component/cluster input/output queuing).
Maintain a java8 & spring framework mock service for mocking all DEV and INT REST endpoint services test data to help in parallel TDD test automation development.
Pioneer all system-level, product-specific performance and API endpoint metrics and logging/tracing capabilities (TPS, latencies, count, error rate, requests, responses, etc.) with DevOps and Development on DataDog and Kibana.
Order Fulfillment Service Cloud:
Lead QE/Test architecture, automation design and mock server development:
Design and implement test automation framework in Java, TestNG, Rest-Assured, etc…
Build a java spring-based mock server for mocking of interfaces around the order fulfillment system, namely backend REST interface, SOAP WS, TIBCO EMS, and SAP IDOC.
iControl Networks, Redwood City, CA, 08/2014 – 03/2016
System Test Architect, Internet of Things Server Cloud Group
Mainly responsible for architecting and managing all test automation and system testing, including all IoT Server Cloud Performance, Scalability, Stress, and Resiliency/HA testing of the company's IoT home security & automation cloud, which manages & controls messages and alerts from home/business sensors, live/scheduled cameras, thermostats, outlets, door locks, etc.
Audio/Video (VOD/LiveStreaming) Scaling Test was done with JMeter (m3u8, adaptive bitrate, HLS)
Test Tool Development (Java), System Test Automation (Java, JavaScript, bash), and System Testing (HA Resiliency, Performance, Scaling, Stability, Capacity Modeling/Planning) of:
Backend cloud service clusters: httpd, haproxy, tomcat, Oracle, XMPP, RabbitMQ, Kafka, Zookeeper.
Testing/automation using open-source test tools and in-house proprietary test tools (Java, bash)
Automation with bash, Java, and REST API system scaling/stress with JMeter, some tsung.
Statistics collection and analysis with Logic Monitor and graphite, in addition to standard OS & JVM tools (ps, top, lsof, netstat, vmstat, iostat, sar, free, jvisualvm/jconsole, jps, jstat, jmap, Jenkins) and custom-written stat tools.
Responsible for internal/external customer deployment/solution troubleshooting & analysis
Optimizing system performance and scaling. Solved a number of critical internal issues
Partly helping DevOps to analyze and trouble-shoot customer issues, reproduce and perform customer solution testing, and propose fixes or solutions to DevOps for customer deployment (Linux VM cloud)
Functional Test Automation in Python (pytest framework).
Influenced, directly and indirectly, all QA testing activities in addition to system-level testing activities.
Prototyped and tested clusters of Cassandra and Spark for a short time
Familiar with many IoT devices (Linux-based IoT devices, Hue, LiFX, bands, Ring, Chamberlain, HomeBoy, cams, Sonos, Netgear Alos, etc.)
Nokia, Alcatel Lucent, Nuage Network Division, Mountain View, CA, 08/2012 – 08/2014
Sr. Manager & Architect of SDN Cloud Virtualization QE Test Group
Managed a team of 15 engineers doing testing/automation on development sanity, integration testing for UI, API, Functional and system-level testing with tight Agile build structure in Jenkins.
Architect/Implement the SDN cloud virtualization VSD PaaS/SaaS QE test infrastructure, automation framework and test releases for testing clusters of JBoss, ejabberd, MySQL, Hadoop, HBase, Cappuccino GUI to manage VMware & OpenStack VMs (all VMs are Linux CentOS based)
Backend Functional Feature Regression Testing and automation was done in Python Robot framework.
System Performance/Scaling/Stress/HA test automation is in Java & JMeter.
UI Functional test automation in Ruby and Cucumber.
Drive/Participate in all semi-SCRUM-like cycles, QE activities, standards, practices and criteria.
Write a number of test driver clients and simulators for Google Chat ejabberd XMPP, REST API, JMS, and HTTP as plugins into JMeter, mostly in Java & JavaScript.
Wrote SDN orchestration server and SDN controller test simulator and test clients using XMPP in Java as a plugin to JMeter, where Functional, Performance & Scaling test cases are driven from.
Work on Ruby, Perl scripts using Cucumber & CucApp for our Cappuccino GUI test automation.
Write bash, python, and Tcl/Expect scripts for accessing machine infrastructures.
Infrastructures: VMs & Hypervisors on CentOS, RHEL, OpenStack, vCenter/vCloud, CloudStack, etc.
Other Testing: NMAP, snort, vulnerability assessment, LDAP, REST API security/penetration.
Yahoo, Yahoo Mail Cloud, Sunnyvale, CA, 10/2009 – 08/2012
Principal QE Engineer, Yahoo Mail Cloud Backend QA/QE
Test Tool Development & Automation for both Functional and System-Level Testing, Performance/Scaling/Load/Stress/Capacity Testing, Analysis of Linux RedHat Mail Cloud Performance.
Test Tool Development / Automation Architecture & Implementation:
Introduce, architect, and implement QA test tools and test frameworks for unit, API, functional and system-level testing – design and implement many Java JMeter plugin test clients to automate API/Functional/System (scaling/load/capacity) testing of the Mail services. Received award for test tool development.
Propose/Design/Implement JMeter + TestNG + Selenium2 test framework (JMeter launched parallel Selenium tests).
Technical Lead for the IMAP/POP Mail Performance/Load/Stress/Scaling Testing & Automation
Simulation of millions of user thread traffic coming in from mobile phones, desktop clients, web services, and browsers (selenium-based) – test clients with very high level of concurrency.
Design and Implement a J2EE Statistics Collection and Data-Mining application server in Groovy and Grails and Java running in Apache Tomcat with MySQL DB.
System Leveled Testing & Analysis:
Leading component/system/Cloud Performance/Load/Stress and Capacity Planning & Modeling using well known Capacity planning/modeling theories and laws to predict software/hardware utilization for hardware ordering in the cloud and software performance optimization tuning on new cloud deployment.
Pinpointed a number of performance and system issues, including blocked or deadlocked threads, race conditions, most time-consuming code, hotspots, exceptions, and memory, with tools: YourKit, jps, jstat, jmap, jhat, jconsole, visualvm, and other performance tools such as sar, (h)top, mpstat, netstat, and Project R.
Cisco Systems, San Jose, CA, 02/2008 – 10/2009
Technical Lead Contractor, Sports & Entertainment Solution Engineering, AS Group
Received ‘Outstanding Performance’ award while working as a consultant.
Key Responsibilities include test planning, scoping, hands-on Integration/Functional/System test case writing & execution, test automation and training of most QA engineers for 7 SCRUM releases.
Found a majority of the bugs in QA, many of which were challenging system-level bugs
End-to-End Solution testing & automation for the Digital Signage Java Media Server running in tomcat and MySQL DB, Flex/Flash GUI, IPTV Flash-based Digital Media Player set-top boxes.
Test Automation & Tool Development: JMeter, Emma, FlexMonkey, JDBC
Worked with Linux RHEL 5.2, MySQL, Digital Media Players busy-box firmware/kernels, js prototype library, scripts, flash browser, VLC client/server, integrated VOIP phones (standard SIP phone), video equipment such as Scientific Atlanta Cisco Analog to Digital IPTV Transcoder, Channel-Mapping DCM, IP-Phone CM & CUAE, DSM, POS, ad insertion tools (Adtec), VideoLAN client/server tools.
QA Lead Contractor, Deep Packet Inspection Application Service Module, AON/PMBU
Tested low-end deep packet inspection appliance with focus on controlling different web traffic protocols, mainly HTTP/XPath/XML/JMS/REST (some WebDAV). The appliance OS is a multi-core, multi-threaded real-time kernel (MV CGE 5.0), JVM 1.5.x and Cisco PCube SML (Cisco DPI).
Helped with management of all QA resources, budgets, lab, projects & testing.
FOREX/Currency Trading, Founder, Partner @UpRiseCapital, San Jose, CA, 01/2006 – 01/2008
UpRise Capital is mainly just a small investment club for retirement and hedge fund piloting purposes.
Implemented the “CashAdvisor” product - a software-based Technical Analysis Auto-Trading system using MetaTrader MetaQuotes (C-like) programming language. CashAdvisor automatically places and exits orders, gives buy/sell signals through emails and alerts, and posts results on web site.
PacketMotion, San Mateo, CA, 11/2004 – 12/2005
Director of QA
Led the company QA group and directed all QA activities of PacketMotion PacketSentry product line, which is a deep-packet inspection and policy enforcement switch.
50% hands-on QA automation work, 50% on process and people management – researching, architecting, test plan writing/reviewing, test program writing, and traffic generator development
Testing application security features such as malicious URL strings, buffer overflow in application and data base, URL redirection, phishing, identity spoofing & stealing, illegal file copy/access.
Nokia, Mountain View, CA, 03/2000 – 09/2004
Director of QE (3rd Line Manager), Mobile Packet Core
Managed, architected & led directly and indirectly the global (US, India, Hungary, Finland) I&V / QE test automation effort (Integration & Verification, same as QA) organization of around 50 engineers, 6 1st and 2nd line managers and contractors, doing QA Testing (grey/black box testing, test automation).
Architected and implemented the base test automation framework for switch/router testing for both Amber Networks and Nokia Mobile Packet Core 3G Router. Knowledgeable of L2 to L7 switching/routing protocols.
Wrote, reviewed and approved all IV/QA test plans, program plans, and some development specifications.
Led the organization to CMMI Level 3 certification and tools for Quality assessment and achievement
Helped manufacturing group set up test benches and with training/running of ATE/diagnostics tests
Founding Consultant @FlowGate, Inc., San Jose, CA, 04/1997 – 03/2000
Hands-on & running 10+ contractors on various development & test projects for clients in networking industry:
1. Consultant @ Com21, Inc., 07/1997 - 03/2000
Managed a small group of full-time/part-time contracting engineers to architect & implement a complete end-to-end test automation system and testing for many Com21 products (DOCSIS 1.1/1.0 and proprietary cable modem/headend with data + VOATM).
2. Consultant Developer @ FlowWise Networks, Inc., 04/1997 - 07/1997
Architected and implemented a Java-based Device Management Application for a workgroup Gigabit Ethernet L3/L4 Switched Router, using Java over SNMP with VisualCafe, AdventNet toolkit.
Oversee SNMP agent and MIB development on the switch; SNMP agent written in C; MIBs include MIB-2, If Extensions, Bridge MIB, RIPv2 MIB, RMON, OSPF, FlowWise Proprietary MIB.
Member of Technical Staff @Network Peripherals Inc., CA, 01/95 - 03/97
Defined the Network Management product requirements and led the engineering development effort to:
Design and implement multi-OS management Agent/EMS applications for the Ethernet-to-FDDI switch: (1) UNIX OpenLook/XView/MOTIF/C based NMS GUI, (2) Windows Visual C++ NMS GUI.
Implement the embedded SNMP agent on the switching hub, and design part of the NPI MIBs.
Integrate NMS products with Sunnet Manager, HP OpenView and SNMPc for both UNIX and PC.
Sr Engineer @ Nortel Networks (Bay Networks, SynOptics Communications), CA, 01/92 - 01/95
Embedded System development for Ethernet and FDDI switches and concentrators (L2 to L4 services)
Mainly responsible for network management agent development (SNMPv1/v2) in C.
Development Environments: Intel x86, ICE, VxWorks RTOS, and Sun OS/Solaris.
EDUCATION - San Jose State University - MS Computer Engineering (3.6 GPA), 05/1994
RESIDENCY STATUS - U.S. Citizenship
REFERENCES – available upon request