Senior Software Data Products

Location:

Cupertino, CA

Posted:

June 03, 2024

Resume:

Ey-Chih Chow **** Bollinger Road, Cupertino, CA **014

******@*******.***

408-***-**** or 408-***-****

SUMMARY OF QUALIFICATIONS

Senior Software professional with 10+ years of experience in designing large-scale systems. Notable accomplishments in the last five years include:

·Develop and deploy cloud-native AI/ML models, applications, and data products for AWS and Azure platforms.

·Design and implement a scheme to efficiently and reliably extract data, in parallel, from each of 40+ Oracle tables to HDFS, which facilitates better performance for downstream applications of Apple ads-engineering.

·Architect, design, and implement Materialized View extension of SparkSQL/Spark streaming, in Java and Scala, to optimize Huawei cloud-based BI platform.

·Architect, design, and implement a data platform, which, combined with Kinesis (the AWS counterpart to Kafka), serves as a PubMatic data pipeline for reporting, OLAP, insights, and ad targeting.

TECHNICAL SKILLS

Systems: AWS, Azure, Mac, IntelliJ, VS Code, Git, Docker, Linux, Hadoop, FastAPI, Kafka, Concourse

Database: Redshift, MongoDB, Hive, MySQL, Oracle, SQL Server, Vertica

Languages: Python/PySpark/PyTorch, Pytest, Terraform, Java, Scala, JSON, Maven, Ant, YAML, XML

Protocol: HTTP, TCP/IP, UDP, SSH, Tunneling

PROFESSIONAL EXPERIENCE

Humana, KY (Remote 100%) 2021 – present

Engineering Lead for Vertical Engineering and Generative AI

·Work with data scientists to perform MLOps, i.e. deploy/retire models to/from production. Models are developed and tested in Azure cloud via PySpark, ADO pipelines, Databricks workspace, ADF and ADLS.

·Develop services to evaluate experiments for Generative-AI use cases and to fine-tune pre-trained models, from Hugging Face, for retrieval-augmented generation. Each experiment includes a set of questions and prompts for OpenAI LLM to answer. Technologies used for the services include: LangChain, MongoDB, Beanie, FastAPI, Pytest, and Pydantic.

·Look into ways to improve performance for LLM questions that require knowledge base and/or reasoning in generative AI chatbot applications.

·Create a QDrive ingestion pipeline whose JSON support files are exported from ASA. The pipeline can read a dataset from an arbitrary QDrive folder location and write it to ADLS.

Duke Energy, Charlotte, NC (Remote 100%) 2021 – 2021

Senior Consultant for AWS Cloud Native Analytic Applications

·Implement required infrastructure in Terraform and application code in Glue/Dynamic Frame for movement of data from on-prem or third-party data sources, such as Sprinklr, to AWS S3 and/or Redshift.

·Design and implement Lambda functions to upsert data, in micro-batches, from S3 to Redshift.

·Refactor existing Step Functions to perform ETL tasks in an efficient, real-time fashion.

·Use Terraform to configure an AWS API Gateway for Dynatrace to send data to S3.

·Create materialized views in a scalable manner using PySpark, rather than Redshift SQL.

·Other technologies involved include: Parquet, CloudWatch, Flyway, Vault, Kafka, Concourse CI/CD pipelines, and Qlik Replicate.

Apple Inc., Cupertino, California 2020 – 2021

Senior Consultant for Ads Engineering

·Propose, implement, and test a scheme, based on Sqoop, to extract data, in parallel, from each of 40+ Oracle tables to HDFS. The scheme eliminates the failures incurred when extracting large Oracle tables. It also facilitates better performance for data extraction and downstream applications. The implementation takes into account the migration of the job scheduling and monitoring tool to Airflow.

·Design and implement an alerting scheme for maintenance of 60+ Oracle materialized views. The model behind the scheme is based on the concepts of event time, processing time, and watermarks in Apache Beam. It is implemented on top of the open source monitoring platform developed by InfluxData.

·Implement ads spend generators for a new version of data pipelines based on Airflow, Docker, Kubernetes, Spark SQL/DataFrames.

·Propose an end-to-end exactly-once scheme for a critical job in the data pipeline.

Intuit, Mountain View, California 2019 – 2020

Senior Consultant for Identity Platform on AWS

·Propose and architect a backend store and GraphQL server for the Identity Platform. The backend store is based on Delta Lake, which meets the business requirements of transactional, exactly-once, low-latency, and high-throughput processing. The GraphQL server uses open source GraphQL-java and schema federation to achieve high performance and present a monolithic schema.

·Develop Jolt specifications for the streaming ingestor to transform JSON data into the corporate standard entity profile hierarchical schema and load it into DynamoDB. Generate SQL insert scripts from JSON metadata and load them into MySQL for the admin service.

·Use Kafka, Jenkins, GitHub, and AWS technologies such as EMR, EC2, Elasticsearch, and DynamoDB to deploy new financial attributes of Intuit business to the high-volume production Identity Platform running on AWS.

Huawei Technologies, Santa Clara, California 2016 – 2019

Senior Consulting BI Platform Architect

·Propose, architect, design, and implement Materialized View (MV) to drastically improve SparkSQL performance for interactive BI workloads against a data lake. The code, written in Scala, is now part of an Apache open source project used by Huawei, China.

·Extend SparkSQL's internal query framework, Catalyst, with a Modular Plan, corresponding to the Query Graph Model of DB2, to facilitate implementation of rule-based query rewrite, MV advisor, and data harmonization.

·Extend techniques of SparkSQL cache to handle query execution involving MV dataset as external data sources in the columnar format.

·Propose techniques to perform MV maintenance for SparkSQL, based on λ architecture, structured streaming, Delta table, and customized state store. Unlike the approach in conventional DBMS, the objective of proposed techniques is to balance latency, throughput, scaling, and fault-tolerance.

·Use Spark MLlib to implement a microservice, using Dropwizard, to cluster the tables of a database into fact and dimension tables with a novel method based on query usage and table statistics.

·Propose, architect, design, and implement a log data source for Apache Spark, which infers a (multi-table) schema for log data, then extracts the data and saves it in Parquet format for analytics.

·Implement a bottom-up hierarchical clustering scheme to infer log schema, based on the union-find algorithm and feature statistics, i.e. frequency histograms of tokens.

·Integrate schema inference and log-data parser into Spark framework of dataframe/dataset/datasource.

ClearStoryData, Menlo Park, California 2015 – 2016

Distinguished Engineer

·Architect, design, and implement Multi-Query Optimization (MQO) extension of SparkSQL to optimize analytical queries for Company’s cloud-based BI platform via in-memory cache and materialized derived dataset. The component extends SparkSQL’s rule-based optimization/execution engines and tree libraries such as TreeNode, QueryPlan, LogicalPlan and SparkPlan to identify summary datasets from query logs and to rewrite user queries against summary datasets instead. The component is written in Scala.

·Integrate MQO into Dropwizard framework, running as a Spark application.

PubMatic, Redwood City, California 2013 – 2014

Principal Data Architect

·Lead a team to re-architect, design, and implement company’s next generation big data platform for advertising insights, reporting, and ad targeting.

·Build an optimizer that transforms multiple MapReduce jobs into one for execution, based on Pig’s horizontal packing capability that processes Pig scripts of single input/multiple group-bys using a single MapReduce job. The optimizer has been used in various aggregations for report generations/cube materializations. Daily reports generated are saved as Hive external tables. The system is written in Scala/Java/Python and running on AWS/Qubole platform.

·Architect, design, and implement Company’s data science platform for mobile advertising, based on Amazon Web Service, using open source software such as Hadoop MapReduce, Parquet, Hive, Presto, and Spark.

·Use HiveQL and Python scripts to convert hourly rotated gzipped log files into Parquet format and to save on AWS S3.

·Build advertising OLAP applications, such as AVAILS, using Presto and Hive. Implement Hive UDAF for approximate counting, based on open source HyperLogLog libraries. Build summary tables and save them on S3 in Parquet format. Construct Presto OLAP queries, using built-in functions and types, to merge sparse/dense representations of HyperLogLog retrieved from summary tables.

·Write ETL in Scala, based on Spark RDD libraries, to transform and load incoming logs into Vertica for interactive analysis.

·Implement a pull-based data ingestion sub-system that pulls data from AWS Kinesis or Kafka and ingests it into HDFS. For Kinesis, the sub-system is embedded in a Pig UDF for an Amazon customized Pig script. For Kafka, it is embedded in the record writer of a Camus job.

·Propose architecture for cube materialization that uses HBase for on-the-fly computing of measures, algebraic or partially algebraic, for reducer-unfriendly cube regions.

·Build a predictive model for click-through rate using a Generalized Linear Model with Lasso regularization in MATLAB.

Booyah, San Francisco, California 2011 – 2013

Lead Data Scientist: Product

·Implemented Hadoop-based ETLT for processing event logs, saved on HDFS in Snappy-compressed, splittable Avro format and loaded into a column-based Vertica data warehouse for analysts. The MapReduce jobs in the ETLT (mostly in Java, one in Pig, all supporting schema evolution via Avro and modularized code) are: Data Cleansing (with the Pentaho tool integrated into the job), Place Expansion (expanding JSON event logs with place data moved from MySQL to HDFS via Sqoop), User Data Expansion (expanding events with user summary data), HBase to HDFS (saving sessionized events from HBase to HDFS and loading them from HDFS into Vertica), and User Data Aggregation (aggregating user data, moved from Vertica to HDFS via Pig, with post-ETLT data via a 3-way reduce-side join and approximate counting techniques from data mining). The workflow of the ETLT subsystem is coded in Python. The subsystem is deployed with Rundeck, Puppet, and Cloudera Manager.

·Initiated, formulated, architected, and implemented data products and a predictive dashboard for E-Merchandising and user acquisition, using machine learning techniques such as Random Forest and Logistic Regression to classify gamers into enthusiastic, value, and non-payers. Work included: data preparation based on the user summary data from ETLT; feature extraction (coded as Pig UDFs) by first identifying the most important feature and then adding other relevant features by intuition; large-scale machine learning via an enhanced version of Mahout Random Forest (adding class probability estimation and the concept of most important feature to RF); dissecting and evaluating models using Mahout libraries such as Tree Visualizer, AUC, Percent Correct, Log Likelihood, and Confusion/Entropy Matrix; tuning classifiers by comparing results of two distinct models, i.e. SGD and Random Forest; adjusting parameters of algorithms; and proposing an SOA architecture with Memcached and HBase for the game client to access user pay predictions.

Ngmoco, San Francisco, California 2010 – 2011

Senior Data Architect

·Initiated the Hadoop MapReduce project as a data platform for mobile social game analytics. The platform uses open source tools such as HDFS, Avro, HBase, and Pig.

·Implemented various MapReduce jobs including (1) ETL processing of incoming events for social games, (2) OLAP (i.e. MR-Cube) sourcing techniques for social game analytics such as Activity Gathering, Play Rank, Cohort Analysis, Unique Counting, Session Duration, etc.

·Explored Protocol Buffers and Twitter’s Elephant Bird as serialization tools to replace Avro for all the MapReduce jobs.

·Fixed various issues with the MapReduce jobs, such as memory exceptions (on the reducer side), counting errors, etc.

·Built Java libraries using design patterns, such as Chain of Responsibility, Adapter, etc., for complex MapReduce jobs and the Avro interface with HBase.

·Explored using Pig for ordinary users to build MapReduce jobs for experimentation.

Yahoo! Inc., Sunnyvale, California 2008 - 2009

Technical Yahoo/Senior Software Architect

·Designed and implemented stream mining techniques (with C++, STL, LAMP, the same below) for use in real-time opportunity forecasting and for selecting 3rd party ad networks with lower latency and higher click-through rate in ad auctions.

·Added frequency-capping capabilities to the system for targeting.

·Implemented an emitter to broadcast incoming ad calls to LWES (Light-Weight Event System) receiver for training new releases of ads server.

·Implemented Web Service API in Perl for 3rd party ad networks to use to control ad traffic flow when participating in Yahoo’s auction-based ads services.

·Enhanced an infrastructure that migrates cookies from browser to backend for display ads serving.

Google Inc., Mountain View, California 2007 - 2008

Senior Consultant

·Recommendation System based on in-house Search/Content ads technologies.

·Scalable on-line learning algorithm for user-behavior modeling.

·Query expansion based on relevancy analysis via noisy-OR Bayesian net.

·Built infrastructure (with Java, Eclipse, LAMP) to extract documents from Web.

·Designed and implemented an information retrieval model that takes advantage of GFS, Bigtable, Sharding, and MapReduce to rank documents and recommend to users.

Accelerate Software, Cupertino, California 2000- 2007

Founder and Chief Architect/Senior Consultant for Shopzilla, Retrevo, Dorado, and Lockheed

·Conducted behavior modeling and statistical data mining/database marketing using SAS for online advertising.

·Developed optimal sponsored-search bidding algorithms for online advertising.

·Architected and developed an on-demand event-driven information delivery framework for structured, semi-structured, and unstructured data.

·Developed a focused web crawler to fetch consumer product review data for summarization, phrase extraction, clustering, and categorization for Web 2.0 applications.

·Developed portal applications with WebSphere Portal and JSR 168.

·Java server infrastructure, including JMS, for SaaS implementation of CRM.

·Built a high performance, multithreaded domain-specific Java application platform that combines service oriented, event-based, model-based architecture by extending Hibernate, Spring, and EJB.

Western Technology Center, AT&T Laboratories, San Jose, California 1996- 1997

Senior Architect for Network Convergence Software

·Developed a scalable event notification agent to integrate update process for a secure and highly distributed system.

·Developed a high performance data exchange mechanism to load registration profile to various services.

·Developed an interoperable proxy for storing and retrieving information from various storage systems.

·Coded various network components using HTTP, URL encoding, HTML attribute model, sockets, non-blocking I/O, RPC, threads, MS Visual C++, and Java.

Gupta Technologies, Inc., Menlo Park, California 1990- 1994

Chief Architect for SQLBase Query Optimizer

Designed and implemented the following:

·a unified execution model in C (the same below).

·hybrid hash join with bit map filtering technique.

·data and demand driven query execution.

·asynchronous prefetch for sequential and clustered index scan.

·OR and Min/Max optimization using index.

·result set as scrollable programming language interface.

·query rewrite techniques (subquery elimination, magic-sets transformation).

·plan optimization techniques (dynamic programming, exhaustive search, plan costing).

EDUCATION

Ph.D. in Computer Science, University of California at Berkeley, Berkeley, California

Thesis: "Queries in Deductive Database Systems"

M.S. in Electrical Engineering, National Taiwan University, Taiwan, R.O.C.
