Data Engineer

Location:
San Jose, CA
Posted:
January 03, 2024

Summary

**+ years of industry experience as a data engineer/analyst with emphasis on data operations, analysis, scrubbing, testing, modeling, transformation, evaluation, and visualization.

Professional at developing Python data API services and creating Docker images for data processing and ETL, loading data into multiple data warehouses, data lakes, and column-based databases.

Expert at handling data on the Databricks Data Lakehouse with the Delta Table format, leveraging Databricks strengths such as data governance with Unity Catalog, and at communicating/configuring with other services such as cloud service providers, business intelligence tools, and APIs.

Data pipeline professional using Python, Hive UDFs (regexp, posexplode), and Databricks PySpark to extract structured/unstructured data and maintain pipelines for both batch and streaming jobs through Git, Jenkins, or Azure DevOps.
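
For illustration, a minimal PySpark sketch of this extraction pattern, assuming a hypothetical raw_events table with a JSON payload column:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("extract-demo").getOrCreate()

# Hypothetical source: one JSON string column holding an array of events.
raw = spark.table("raw_events")  # assumed table name

events = (
    raw
    # Parse the payload column into an array of JSON strings.
    .withColumn("items", F.from_json("payload", "array<string>"))
    # posexplode keeps the position alongside each exploded element.
    .select("event_id", F.posexplode("items").alias("pos", "item"))
    # get_json_object extracts a single field from each JSON element.
    .withColumn("user_id", F.get_json_object("item", "$.user_id"))
)
events.show()
```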

Solid at applying machine learning and statistical models such as traditional regression, logistic regression, support vector machines, random forest, gradient boosting, neural networks, and k-nearest neighbors using PySpark ML and Python scikit-learn.

Expert at mining features from structured (MySQL, SQL Server, PostgreSQL, MS Access; joins, unions, subqueries, aggregate functions, window functions, PARTITION BY) to unstructured (XML, JSON, NoSQL) data sources in cross-platform ETL using Python (Pandas SQL, DataFrame operations) and PySpark DataFrame/SQL (filter, groupBy, column- and row-based operations).

Statistical analysis professional applying techniques such as A/B testing, hypothesis testing, ANOVA, and Design of Experiments to explore data insights.
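
As an example, a minimal sketch of one such hypothesis test with SciPy, using synthetic conversion flags for an A/B comparison:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical per-user conversion flags for control (A) and treatment (B).
a = rng.binomial(1, 0.10, size=5000)
b = rng.binomial(1, 0.12, size=5000)

# Welch's t-test on the conversion rates (a z-test or chi-squared also fits here).
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"lift={b.mean() - a.mean():.4f}, p={p_value:.4f}")
```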

Strong experience defining success metrics and Key Performance Indicators (KPIs) from macro (total new customers, monthly revenue, growth) and micro (click-through rate, conversion rate, user engagement) perspectives, presented with Tableau dashboards and MS Power BI.

Professional at model validation, tuning hyperparameters with GridSearchCV and evaluating models by recall, precision, accuracy, F1 score, and ROC-AUC according to the use case.
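
A short scikit-learn sketch of this tuning-and-evaluation flow, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Tune hyperparameters by cross-validated grid search, scored on ROC-AUC.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_tr, y_tr)

pred = grid.predict(X_te)
proba = grid.predict_proba(X_te)[:, 1]
print(classification_report(y_te, pred))  # precision / recall / F1 / accuracy
print("ROC-AUC:", roc_auc_score(y_te, proba))
```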

Experienced at multidimensional data analysis (ingestion, processing, persistence, management) on HDFS and a range of relational and NoSQL databases using tools such as Python (Pandas, NumPy, SciPy, preprocessing), R (sapply, tapply, ggplot), SQL (joins, unions, subqueries, aggregate functions, window functions), Spark (column- and row-based operations), Hive (get_json_object, ETL, CASE WHEN), Pig (JOIN, SPLIT, JsonLoader), MongoDB (find, projection, forEach, $gt, $set), HBase (count, get), and ClickHouse (arrayJoin, Map).

Expert at data visualization through Tableau modeling, Power BI dashboard design, SQL Server SSRS, Python Matplotlib and Seaborn libraries, and HDFS Ambari presentation.

Energetically devoted to discovering the stories and insights behind data from different scopes of view as a data expert.

Skills

Programming Skills & Tools

Python, Java, Shell, R, HiveQL, Presto, Pig Latin, MySQL, MS SQL Server, Power BI, PySpark, H2O, MongoDB Shell, TensorFlow, Keras, XML, PMML, MS Access.

Project/Source Management

Agile methodology, Radar, Jira, Git.

IDE

PyCharm, IntelliJ, Lens, Docker, Ambari, Hue, Qubole, Google Colab Notebook, Jupyter Notebook, RStudio, Visual Studio, Teradata Studio, MongoDB Compass, Jenkins.

Work Experience

Mercedes-Benz Beijing CN

Big Data Engineer Oct 2021 – Nov 2023

Data Warehouse

Designed and managed ETL data pipelines from Azure Data Explorer (ADX/Kusto) to the Databricks Data Lakehouse using PySpark with the Delta Table format, triggered, monitored, and logged by Azure Data Factory; increased data query performance by 50% and monthly ETL reliability by 90%.

Designed and managed ETL data pipelines automatically ingesting daily from Postgres DB to Azure Data Lake Storage (ADLS) via user-defined Postgres stored procedures, triggered, monitored, and logged by Azure Data Factory; increased data query performance by 80%.

Took advantage of the Databricks Data Lakehouse to manage data access control via Unity Catalog, table version control via Change Data Feed, and efficient merge operations via Delta Tables, operated with PySpark and SQL (see the sketch below).
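
A minimal PySpark sketch of the Delta merge and Change Data Feed reads described here, assuming a delta-spark environment with an active spark session; the table and catalog names are hypothetical:

```python
from delta.tables import DeltaTable

# Hypothetical incoming batch; schema mirrors the target table.
updates_df = spark.createDataFrame(
    [("v1", "2024-01-01 00:00:00", 88.0)], ["vehicle_id", "ts", "speed"]
)

# Hypothetical target table registered in Unity Catalog.
target = DeltaTable.forName(spark, "main.telemetry.vehicle_signals")

# Upsert the batch: update matched keys, insert new rows.
(
    target.alias("t")
    .merge(updates_df.alias("u"), "t.vehicle_id = u.vehicle_id AND t.ts = u.ts")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# With Change Data Feed enabled on the table, row-level changes can be read back by version.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("main.telemetry.vehicle_signals")
)
changes.show()
```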

Researched and utilized the column-based OLAP database ClickHouse (CK), testing its performance against Postgres DB under CRUD operations, then deployed CK for data querying by web services with an 80% speedup.

Created and maintained daily/monthly data metrics and reports to fulfill product managers' requests, for example the usage of new online vehicle services and month-over-month comparisons with insights for strategy making, via PySpark and cron jobs in Azure Data Factory with designed triggers, linked services, runtimes, and alerts.

Experienced with data operation auditing and governance and with customer private data protection via data pseudonymization when data crosses borders or moves between entities; also designed the customer data deletion process.

Cooperated with DevOps colleagues to design SaaS communications with Private Link Services, operated on bunker hosts, and deployed integration runtimes to keep compute processes inside virtual networks, managing the product as admin.

End-to-End Services

Developed and expanded the big data platform business to multiple teams for processing vehicle signal datasets, saving 90% of cost compared with equivalent Azure products in Europe, and fulfilled engineers' requests to compare data across multiple dimensions.

Designed the end-to-end data processing pipeline and RESTful API services to manage high-level processes via Python Flask, with a Linux bunker crontab triggering the APIs for job dispatching and data processing (see the sketch below); also optimized algorithms to efficiently scan the latest uploaded files in ADLS and trigger batch jobs through K8s.
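
A stripped-down sketch of such a Flask dispatch API; the endpoints and the in-memory queue are hypothetical stand-ins for the real K8s job dispatcher:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory queue standing in for the real K8s job dispatcher.
PENDING_JOBS = []

@app.route("/jobs", methods=["POST"])
def dispatch_job():
    """Accept a job spec (e.g. an ADLS path to process) and enqueue it."""
    spec = request.get_json(force=True)
    PENDING_JOBS.append(spec)
    return jsonify({"status": "queued", "position": len(PENDING_JOBS)}), 202

@app.route("/jobs", methods=["GET"])
def list_jobs():
    """Return the pending job specs for monitoring."""
    return jsonify(PENDING_JOBS)

if __name__ == "__main__":
    # In the setup described above, a bunker-host crontab would call these endpoints.
    app.run(host="0.0.0.0", port=8080)
```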

Deployed code via Docker images and generated K8s Jobs by calling REST API services with provided YAML definitions of K8s Secrets and optimized K8s configs; K8s resources handled on Azure AKS with CI/CD in Azure DevOps.

Ingested data into a distributed ClickHouse cluster with well-designed table structures (MergeTree engines, partitions, key-value pair structures) plus high-performance views and index tables to speed up complex analytical queries and data visualizations in Grafana (see the sketch below).
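
A sketch of the kind of MergeTree table described here, issued through the clickhouse-driver Python package; the connection details, table, and columns are hypothetical, and the Map column assumes a recent ClickHouse version:

```python
from datetime import datetime

from clickhouse_driver import Client

client = Client(host="localhost")  # assumed connection details

# MergeTree engine with a monthly partition and an ordering key for fast range scans.
client.execute("""
    CREATE TABLE IF NOT EXISTS signals (
        vehicle_id String,
        ts         DateTime,
        metrics    Map(String, Float64)
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(ts)
    ORDER BY (vehicle_id, ts)
""")

# Parameterized batch insert of one hypothetical row.
client.execute(
    "INSERT INTO signals (vehicle_id, ts, metrics) VALUES",
    [("v1", datetime.now(), {"speed": 88.0})],
)
```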

Experienced in streaming data processing with Kafka (Azure Event Hubs), processing data with a configured Java Spring Framework service secured and authorized via OAuth2 tokens, deployed on Azure AKS with Helm templates under CI/CD, and consuming data to load into Databricks as Delta Tables.

Managed the project for access control of ADLS (Gen2) and provided the service across teams by developing an Access Control List (ACL) service via Java Spring Boot configured with the ADLS SDK; the project aimed to expand big data platform services to stakeholders for different use cases and to monitor user behavior against data storage abuse.

Soul App Beijing/Shanghai CN

Data Mining Engineer Apr 2021 – Oct 2021

Analyzed business processes, discussed the keys to business success with stakeholders, and created data features for further analysis, reporting, and predictive modeling with Spark, Presto, and Python.

Generated customer/membership churn models such as Logistic Regression, Random Forest, LightGBM, and neural nets, and selected fitting evaluation methods for each business use case.

Built the data mining team's platform from zero to one, using Python, Shell, and PySpark to generate automated pipelines in the Alibaba Cloud ecosystem with Zeppelin.

Predicted lists of customers with a high probability of churning within the next 14 days and made recall strategies for these customers' retention with the business operations team (see the sketch below).
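
A minimal sketch of this scoring-and-recall step, using scikit-learn's GradientBoostingClassifier as a stand-in for LightGBM; the feature table and column names are hypothetical:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for LightGBM

# Hypothetical feature table: one row per customer, label = churned within 14 days.
df = pd.DataFrame({
    "days_since_login": [1, 30, 2, 45, 7, 60],
    "sessions_30d":     [40, 2, 35, 1, 20, 0],
    "churn_14d":        [0, 1, 0, 1, 0, 1],
})
X, y = df.drop(columns="churn_14d"), df["churn_14d"]

model = GradientBoostingClassifier().fit(X, y)

# Score everyone and hand the highest-risk customers to the recall campaign.
df["churn_prob"] = model.predict_proba(X)[:, 1]
recall_list = df.sort_values("churn_prob", ascending=False).head(3)
print(recall_list)
```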

Recalled customers increased 90% compared with the previous period without model predictions; the recall strategy was deployed through marketing ad channels such as TikTok and JD.

Python and Spark programming for data analysis, fetching, scrubbing, transforming, and modeling with evaluation.

Cooperated with business users to define KPIs, metrics, and features with documentation.

Apple Inc. Sunnyvale, CA

Data Analyst May 2019 – March 2021

Project: Data Operations

Used Hive, Scala Spark, and Python for data analysis, fetching, scrubbing, transforming, and modeling with evaluation on Cloudera Hadoop DFS via Zeppelin.

Developed Python and Hive UDF (regexp, posexplode, get_json_object) advanced queries for unstructured data extraction and cleaning, built new features, and generated partitioned tables.

Applied supervised machine learning models including Logistic Regression, KNN, SVM, and Random Forest to make predictions; to mitigate overfitting, cross-validation was used to select optimal hyperparameters.

Constructed and maintained data pipelines under Git source management, importing HiveQL and PySpark SQL scripts configured with the Jenkins pipeline scheduler.

Cooperated with stakeholders to understand requests and helped design success metrics to monitor business progress weekly and monthly.

Designed generalized aggregate tables for downstream stakeholders, optimized table query efficiency, and improved running time from 2 hours to 1 hour using Hive UDFs.

Interpreted data, analyzed results using statistical techniques, and generated data visualization dashboards with Tableau Desktop/Server, applying Tableau Level of Detail (LOD) expressions (Tableau's aggregation methods).

Maintained and managed internal Tableau dashboards, monitoring TDS data sources and dashboard refresh issues caused by servers, ports, or data pipeline interruptions.

Conducted data migration from HiveQL to Scala Spark, completing 40% of the workload with improved data and query accuracy by removing temporary tables/views, and validated the results, such as the accuracy of the approx_percentile aggregate function in each QL.

Processed big data OLAP workloads with Spark SQL and applied the Spark ML module to cluster geo-related data, analyzing insights for the different clusters based on Spark SQL.

Performed scientific research to resolve stakeholder problems using statistical methodology and programming skills.

Nauto, Inc. Palo Alto, CA

Data Analyst Jan 2019 – Apr 2019

Project: Construct Data Pipeline and Data Visualization

Developed Hive/Presto queries (window functions, WITH clauses) for data extraction and cleaning on Hadoop DFS using Qubole.

Interpreted data, analyzed results using statistical techniques, and generated data visualizations with Tableau Desktop/Server as a Tableau admin.

Developed and implemented data analyses and estimated designed metrics, such as device LTE usage analysis aimed at saving monthly LTE costs and finding unhealthy devices.

Acquired data from primary and secondary data sources and built and maintained databases, such as Postgres and ScyllaDB, on AWS.

Maintained and managed internal Tableau dashboards for multiple stakeholder requests from customer success, marketing, manufacturing operations, and finance departments.

Python and Spark programming for data analysis, fetching, scrubbing, transforming, and modeling with evaluation.

Cooperated with business users to define KPIs, metrics, and features with documentation.

Provided data quality and reconciliation support across different data sources with data-refreshing cron jobs under the Qubole scheduler and notebooks.

Worked with the Data Infrastructure team to introduce a new data mart, data modeling, and cleaning, building aggregate and summary tables to improve the analytics team's efficiency.

Beshton Software Inc. San Jose, CA

Data Scientist Jun 2017 – Dec 2018

Project: Customer Gain and Credit Default Analysis

Operated on structured and unstructured data for analysis and visualization; data was mainly provided by customers from many sources, i.e., traditional SQL and NoSQL.

Fetched the client's database of users' demographic information with SQL, users' previous credit history from the credit bureau with Python, and public crime data with Spark.

Applied supervised machine learning models including Logistic Regression, KNN, SVM, and Random Forest; to mitigate overfitting, cross-validation was used to select optimal hyperparameters (e.g., k for KNN, number of features for RF).

Manipulated data in different formats such as log, JSON, or CSV files; performed format transformation, integration, ingestion, and processing into the Hadoop Distributed File System (HDFS) via Hive, Pig, or Spark managed by Ambari or Hue, with Shell operations as well.

Built new features from raw data with PySpark (filter, groupBy, split) or HiveQL (ISNULL, CLUSTER BY) for further analysis such as building predictive models.

Analyzed customer requests such as improving sales or increasing customer retention rate, and designed A/B tests based on discovered data insights; ran the A/B tests, analyzed short-term and long-term effects, and provided suggestions, with Tableau modeling for analysis and presentation.

Wrote HiveQL and Spark SQL queries to obtain insights from customer data, such as time spent on ads and CTR; queries were stored and exported for visualization analysis in Power BI or Tableau (multi-dimensional graphs).

Applied a K-means clustering model to generate users' location clusters with Spark MLlib, followed by further analysis such as credit default rates for the different clusters (see the sketch below).
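
A minimal Spark MLlib sketch of this geo clustering, assuming an active spark session; the user IDs and coordinates are toy values:

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Hypothetical user locations (user_id, latitude, longitude).
pts = spark.createDataFrame(
    [("u1", 37.77, -122.42), ("u2", 37.80, -122.41), ("u3", 40.71, -74.01)],
    ["user_id", "lat", "lon"],
)

# Assemble coordinates into the feature vector expected by MLlib.
features = VectorAssembler(inputCols=["lat", "lon"], outputCol="features").transform(pts)

# Fit K-means and attach a cluster id to each user for downstream analysis
# (e.g., credit-default rate per location cluster).
model = KMeans(k=2, seed=1).fit(features)
clustered = model.transform(features).select("user_id", "prediction")
clustered.show()
```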

Clustered temporal distributions of click volume over different time spans such as hourly, daily, and monthly; exported the data and built dashboards to present in Tableau.

Developed algorithms to predict customer churn probability from labeled data via Python, using libraries such as Pandas, NumPy, SciPy, and Matplotlib.

Cisco Systems, Inc. Phoenix, AZ

Data Analyst May 2015 – Aug 2016

Project: Telecom Customer Churn Analysis with a Designed Data Mining Model

Designed an association model focused on finding concurrent effects of features without prior knowledge of any variables.

Defined model parameters through high-level understanding and low-level detail exploration.

Validated the model on customers' datasets, with data sourced from various platforms: SQL Server, PeopleSoft, and MySQL (IFNULL, joins, unions, LIKE).

Created feature transformation methods from numerical to categorical data via Tableau (trend lines, aggregation modeling) and Python.

Adjusted parameters, such as the established-association parameter, for algorithms based on data interpretation.

Created new features from raw unstructured data and explored each unusual/unexplained feature with Python (Pandas), R (DataFrame operations, maps, lambdas), and SQL Server (SSIS).

Transformed numerical data to categorical data based on trend or cluster analysis before applying algorithms, using the Python Pandas library with aggregate and lambda functions; newly built features, such as percentage of usage per month, proved highly correlated with customer churn (see the sketch below).
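
A small Pandas sketch of this numerical-to-categorical binning; the usage values, bin edges, and labels are hypothetical:

```python
import pandas as pd

# Hypothetical monthly usage percentage per customer.
usage = pd.Series([3.0, 12.5, 27.0, 55.0, 81.0], name="usage_pct")

# Bin edges chosen from non-overlapping intervals seen in the cluster analysis.
categories = pd.cut(
    usage,
    bins=[0, 10, 30, 60, 100],
    labels=["low", "medium", "high", "heavy"],
)
print(pd.concat([usage, categories.rename("usage_band")], axis=1))
```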

Analyzed data clusters using visualization tools in Tableau and Python Matplotlib; line charts worked well for analyzing data trends, and non-overlapping cluster intervals defined the categories.

Built a data matrix for analyzing one-to-one or two-to-one associations for each instance and summarized feature importance, using the Python NumPy and Pandas libraries for data manipulation.

Designed feature analysis for different subgroups of observations, such as gender, occupation, and nationality, filtered with Python DataFrame operations (self-defined functions).

Compared different groups of observed significant features with the Tableau visualization tool; map plots and bar charts were deployed for multi-dimensional figures.

Discovered uncommon conditional variables and feature associations.

Provided suggestions for customers via Tableau and Power BI dashboards.

JD.com, Inc. Beijing, China

Data Analyst Feb 2013 – Aug 2014

Project: Ecommerce Customer Engagement Analysis

Analyzed ecommerce datasets, found customer shopping patterns, and provided business insights, with datasets from multiple SQL and NoSQL sources.

Analyzed historical datasets on digital marketing vehicle (DMV) performance, such as SEO, SEM, email, and display ads, with designed performance indicators, and made strategies after comparisons.

Studied lead nurturing with SQL queries, such as email lists for target customers and customers' search topics in search engines over certain time windows.

Ran A/B tests to understand customer preferences, such as email title and body format, and analyzed short-term and long-term results shown in Tableau; A/A tests were applied in certain cases.

Applied Python Pandas, NumPy, and visualization tools for data exploration and feature engineering, such as grouping order counts per customer over the last month.

Conducted statistical analysis of customer and transactional data to forecast purchase probability using statistical models such as Logistic Regression and Support Vector Machine, with Random Forest used to find the key features driving purchases.

Analyzed a fractional attribution model and advertisement effectiveness, built a regression model to assign weights to each marketing vehicle, and suggested building accurate databases for long-term analysis.

Estimated customer lifetime value with various models, derived customer segmentation, and created a Tableau dashboard to present gross merchandise value (GMV) after first purchase (see the sketch below).
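
A minimal Pandas sketch of a simple historical-CLV proxy per segment; the transaction data and segment labels are hypothetical:

```python
import pandas as pd

# Hypothetical transactions: customer, segment, and order value after first purchase.
tx = pd.DataFrame({
    "customer": ["a", "a", "b", "c", "c", "c"],
    "segment":  ["new", "new", "loyal", "loyal", "loyal", "loyal"],
    "gmv":      [20.0, 35.0, 120.0, 15.0, 40.0, 25.0],
})

# A simple historical CLV proxy: total GMV per customer, then averaged by segment.
clv = tx.groupby(["segment", "customer"])["gmv"].sum()
print(clv.groupby("segment").mean())  # feeds the segmentation dashboard
```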

Presented the effects of customer shopping time and marketing vehicles, and proposed new customer acquisition (SEO, organic) and retention strategies with predictive models in R and Python.

China Post Jilin, China

Database Developer Jun 2012 – Jan 2013

Project: Relational Database Design

Designed a relational database for the clothing industry; analyzed the business model and conducted Object-Oriented Analysis (OOA).

Analyzed data sources and formats from various tools such as Oracle Database and SQL Server; generated key entities and activities for the industry.

Started from Entity-Relationship model concepts, then applied methods to build the relational model and create or modify the database schema with foreign-key migration rules.

Reduced storage space by introducing the concepts of normal forms; demonstrated with MS Access (GUI) and MySQL (CREATE DATABASE, ALTER TABLE, NOT NULL).

Designed use cases and operated with SQL syntax such as joins, window functions, and subqueries to examine optimal time complexity.

Designed the Teradata database, applied the same schema, and created table relations; uploaded data from various sources such as tables generated from log files and SQL Server reports.

Generated analyses in Teradata using functions such as window aggregate functions and regression.

Education

Master of Science in Industrial Engineering - Statistics May 2017

Arizona State University, Tempe, AZ

Machine Learning, Data Science, Deep Learning, Regression Analysis, Optimization.

Master of Science in Construction Engineering May 2015

Arizona State University, Tempe, AZ

Algorithms, Operations Research, Quality Control, Reliability Engineering, Database.

Bachelor of Engineering in Civil Engineering Jun 2012

Jilin University, Changchun, China


