SRE and Software Developer

Location:
Fremont, CA
Posted:
February 20, 2023

Resume:

Kevin Z. Zhou

Telephone: 925-***-**** Email: advgqp@r.postjobfree.com

OBJECTIVE Sr. Site Reliability Engineer

Highlights of Qualifications

Over 6 years of SRE experience with SaaS and enterprise software; projects cover storage design, networking, high availability, security, monitoring, troubleshooting, CI/CD pipelines, data lakes, etc.

15 years of hands-on SaaS development experience, covering distributed system design and frontend and backend development with Python, Java, Golang, and Bash scripts, plus Oracle, MySQL, MongoDB, Kafka, etc.

5 years of experience on AWS services, e.g. Lambda, CloudWatch, EventBridge, API Gateway, EC2, Cognito, IAM, RDS(MySQL/Oracle), EKS/ECS, S3, Glue ETL, Route53, Load balancer, Auto scaling, etc.

5 years of experience on Kubernetes and Docker, e.g. EKS/ECS/kOps, Docker Compose, Helm, OIDC, PV/PVC, service/ingress, AWS load balancer controller, statefulset/daemonset, horizontal pod scaling, etc.

5 years of experience optimizing distributed applications with an APM tool (based on Zipkin and Pinpoint), automating diagnosis and remediation of production deployments with StackStorm, and optimizing microservice performance with JProfiler, YourKit, cProfile, yappi, Chrome, etc.

10 years of experience on Linux. Good at kernel parameter tuning, system troubleshooting, security hardening, networking, and its virtualization technologies, e.g. KVM, QEMU, LXD.

Extensive experience with Prometheus for software observability, e.g. metrics exporters, the Prometheus adapter for Kubernetes, and time series storage, and with Grafana and its plugins for monitoring and anomaly detection.

5 years of experience on DevOps automation, e.g. Ansible playbooks, CI pipeline scripts and Terraform scripts to set up AWS services, create k8s clusters, deploy helm charts, deploy k8s objects, etc.

Professional Experience

Jul/2020 – Dec/2022: Cloud Software Architect (Atonarp, Pleasanton, CA)

Development lead, leveraging AWS as the cloud platform for an optical spectroscopy device for blood tests.

Built Kubernetes clusters with AWS EKS and deployed the whole system with Terraform and Helm, e.g. a Python Faust app to process Kafka streams, a Java Spring app, a Kafka cluster, a Redis distributed cache, etc. Also set up OIDC to grant the apps access to AWS services.

Set up Prometheus exporters to collect app metrics and monitored them with Grafana. Also fed the metrics to K8s via the Prometheus adapter so the K8s HPA can use them to auto scale the apps more accurately.

Implemented a data lake solution on AWS, which stores data in S3 buckets, creates metadata with Glue, executes Spark jobs for ETL, invokes Lambdas to handle CloudWatch events, and integrates with Apache Superset via REST for business intelligence and machine learning. Single sign-on is set up with Google IdP.

Built SRE tools, e.g. to monitor the Kafka message consumption rate for performance tuning of backend servers; to stress test and tune the resource parameters in Helm charts; to automate VPC subnet creation based on IP range and availability zones; and to sanity check and troubleshoot new setups and deployments. Also used cProfile/yappi to performance tune Python apps and YourKit/JProfiler for Java apps.

Set up a MongoDB cluster distributed across Japan, USA and India to comply with data retention regulations. Implemented the data layer to save data streams in MongoDB buckets to boost batch read performance.

Designed the way for networked devices to securely connect to the edge network and exchange messages via nat.io, which supports multi-tenancy, distributed key control and SSL.

Built a CI/CD pipeline with Jenkins, which uses Dockerfiles to create images and upload them to ECR, creates Helm charts for k8s objects, and performs sanity checks on new deployments.

Created Terraform scripts to set up the infrastructure, e.g. EKS cluster, ECS cluster/tasks/service, storage gateway for EFS, VPC and subnets, security groups, S3 buckets, roles, policies, Route53, etc. Also created Ansible playbooks to provision VMs locally and in the cloud, e.g. manage SSH keys, update software, etc.

Created POCs for various storage subsystems using AWS S3 (S3FS), EFS, NFS or Rook/EdgeFS and tested them on a Kubernetes cluster for performance and reliability.

Designed and coded 100% of the admin UI for the optical spectroscopy devices as a multi-tenant SaaS, with ReactJS as the framework, Redux for state management, GraphQL for data integration, and PlotlyJS for plotting graphs.

Nov/2018 – Jun/2020: Staff Cloud Software Engineer (Roche, Santa Clara, CA)

Development lead of NAVIFY Digital Pathology, a SaaS solution for cancer diagnostics on AWS. It receives tissue images from medical devices and runs algorithms for tumor detection. Key contributions include:

Ported the on-premise solution to AWS ECS platform by redesigning it as multi-tenant SaaS with Spring Boot micro-services and REST API. Coded 30% of the project including Java backend and React frontend.

The compute platform is based on ECS/Kubernetes and Docker containers, with ECR as image repo for automated release deployment and with auto scaling and load balancer for performance management.

Set up Prometheus to collect metrics from Spring Boot apps and monitored them with Grafana.

The storage subsystem includes AWS S3 for object storage and AWS FSx as file system in the cloud, Oracle and MySQL (AWS RDS) as database, Redis as distributed cache, etc.

The security infrastructure includes the integration to Okta for identity management, AWS API gateway and Lambda for authentication, and IAM roles at both EC2 and ECS task for authorization. The network security is built on private subnets, security groups and Route53 for traffic management and IP-based routing.

Coded 100% of the infrastructure scripts with Terraform, such as ECS, API Gateway with Open API, VPC peering, etc. The CI/CD pipeline includes Jenkins, Maven, Docker builds, and Sonar, with shell scripts to automate sanity tests and troubleshooting. Also provided Ansible playbooks for VM provisioning locally and in the cloud.

Redesigned and coded 100% of the Solr search module with Spring Data Solr to support multi-tenancy and Cloud mode. Also developed a POC of ElasticSearch with Spring Data ElasticSearch for future migration.

Oct/2014--Aug/2018: Java, Cloud & SOA Architect (Futurewei, Santa Clara, CA)

Developed various web tools to support application development and SRE operations. Key projects are:

Developed an APM tool for distributed systems, which uses a Java agent to auto-instrument the Java app. It can dynamically turn call tracing on or off for troubleshooting or monitoring.

Developed monitoring tools based on Prometheus, Grafana, Kafka, InfluxDB, and Elastic Search to support high throughput and long-term storage of time series metrics.

Developed an automated diagnosis and remediation tool for cloud operations. The technology stack includes Python, JavaScript, Mistral workflow, the StackStorm automation platform, MongoDB, MySQL, RabbitMQ, etc.

Developed an online IDE to support cloud-based application development. The backend modules are implemented as web services, which spin up Docker containers as a sandbox for each developer.

Set up and administered the CI/CD pipeline for the projects. It includes Jenkins and GitLab integration for build automation, and Docker-based CI/CD environments targeted at specific Linux distributions.

Developed a multi-tenant CRM SaaS on Huawei public cloud. Highlights are:

Designed the app with a highly scalable, metadata-driven software architecture so that each tenant can develop their own app on demand from an abstracted service-layer API.

Combined event-driven and multi-threading techniques to implement the web services, which optimizes CPU usage and reduces the impact of overload and back pressure.

Stress tested and performance tuned with YourKit and JMeter to reduce latency and increase throughput and resiliency.

Integrated the app with SRE monitoring tools by exporting the Prometheus metrics to InfluxDB.

Oct/2010--Oct/2014: Java Software Architect (Spigit, San Francisco, CA)

Developed SpigitEngage, an award-winning idea management SaaS for enterprises, which combines social network graph, data mining and machine learning technologies to find the top ideas.

Designed the application architecture to support customization on-demand, which uses XML metadata to describe UI components and backend domain logic. The backend uses Cassandra for scalability and performance and is implemented with Spring.

Implemented the business intelligence module, which combines ETL and data mining for data analysis and integrates Jasper Reports Enterprise for report generation.

Implemented the idea market module, which allows users to trade ideas in and out and later share the success with the idea owner.

Implemented the authentication module, which supports new authentication methods as plug-ins. Also implemented single sign-on (SSO) for SAML2 (e.g. Okta) and OAuth2 (e.g. Facebook, LinkedIn).

Implemented an asynchronous email engine, which can send large volumes of email with low resource consumption and also supports i18n content.

Used shell script, Awk script and Java to implement a migration tool, which updates database schema and application configuration data to support each new release.

Worked with the operations team to manage the production farm. Resolved issues with performance, Tomcat JDBC connection leaks, MySQL configuration, etc.

May/2008--Sep/2010: Sr. Java Software Engineer (Northrop Grumman, San Mateo, CA)

Worked as a US Postal Service contractor on PostalOne!, a billing system that generates $35B in revenue and handles 140B pieces of mail per year. Overall responsibilities include:

Provided technical requirements, system integration API, design document, development guidelines and some critical components to developers.

Developed prototype to verify important design decisions, such as system topology, performance, etc.

Architect of poboxes.usps.com website, an ecommerce site with SSO, POS, payment switch, etc.

Architect of MATC99 enhancement project - integrating Jasper Report to the system.

Architect of PostalOne! test platform - separating test data from code to cover more test cases.

Apr/2007--Apr/2008: Sr. Java Software Engineer (Xora, Mountain View, CA)

Developed the Xora GPS TT Asset inGeo service based on the Qualcomm inGeo low-cost GPS device.

Used Google Maps and YUI to display complex geographic information with rich features.

EDUCATION

MS in EE, Zhejiang University, China

BS in EE, Zhejiang University, China