
Data Engineer Visualization

Location:
Boulder, CO, 80302
Salary:
75
Posted:
November 26, 2024


Resume:

Tommy Wang

732-***-****

************@*****.***

linkedin.com/in/tommywang091

Around 11 years of IT experience as a Data Engineer across a variety of industries, including hands-on experience in Hadoop, MapReduce, Hive, Spark (PySpark), Azure, AWS, Sqoop, Snowflake, Teradata, Oracle, RDBMS, Python, Scala, ETL, and Data Visualization.

Experienced in Databricks & Big Data Technologies mainly on Data and Delta Lake Implementation.

Excellent understanding of all stages in a typical SDLC, including Requirement Analysis, Design, Programming, Project Status Review (PSR), Unit Testing, Integration Testing, and Support.

Expert in designing and implementing efficient data processing pipelines using Spark, Hadoop, and cloud-based ETL tools such as AWS Glue and Azure Data Factory.

Extensive knowledge of Hadoop, Spark, PySpark, Hive, Impala, Sqoop, and RDBMS (Oracle, MariaDB).

Experienced in Spark on the Hadoop platform, using in-memory processing to handle billions of records.

Experience in data governance and data modelling, using ER/Studio to analyse and design relational databases.

Worked with NoSQL databases like HBase in creating tables to load large sets of semi-structured data.

Worked with Python and R Libraries like NumPy, Pandas, Scikit-Learn, SciPy, and TensorFlow.

Proficient in managing the entire data science (AI) project life cycle and actively involved in all phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modelling (decision trees, regression models, neural networks, SVM, clustering), dimensionality reduction using Principal Component Analysis and Factor Analysis, testing and validation using ROC plots and K-fold cross-validation, and data visualization.

Good knowledge of Data Visualization including producing tables, graphs, and listings using various procedures and tools such as Tableau.

Working knowledge of Integration Services (SSIS), Reporting Services (SSRS), and Analysis Services (SSAS).

Worked on extracting data from multiple sources into Power BI for data preparation.

Managed and optimized cloud-based infrastructure on AWS, Azure, and GCP to ensure high availability and scalability of data processing and machine learning workloads.

Supported migration of transactional and master data from RDBMS to Hadoop servers.

Designed EC2 instance architecture to meet high-availability and security requirements for applications.

Developed Python scripts to call REST APIs and extract data to AWS S3.
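
A minimal sketch of this extraction pattern, assuming a hypothetical endpoint, bucket, and key names (requests for the API calls, boto3 for S3):

    import json

    import boto3
    import requests

    API_URL = "https://api.example.com/v1/orders"   # hypothetical endpoint
    BUCKET = "raw-data-bucket"                      # hypothetical S3 bucket

    def extract_to_s3(page_size=500):
        """Page through the REST API and land the results in S3 as one JSON object."""
        s3 = boto3.client("s3")
        records, page = [], 1
        while True:
            resp = requests.get(API_URL, params={"page": page, "size": page_size}, timeout=30)
            resp.raise_for_status()
            batch = resp.json()
            if not batch:
                break
            records.extend(batch)
            page += 1
        s3.put_object(
            Bucket=BUCKET,
            Key="raw/orders/orders.json",
            Body=json.dumps(records).encode("utf-8"),
        )

    if __name__ == "__main__":
        extract_to_s3()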

Experienced in streaming analytics with Spark Streaming and Databricks Delta.

Good understanding of monitoring tools like AWS CloudWatch, Grafana, Splunk, and Nagios.

SKILLS

Programming Languages: Python, SQL, PL/SQL, Java, R

Data Visualization: Tableau, Microsoft Power BI

Libraries: NumPy, Pandas, Matplotlib, Scikit-learn, Seaborn, Plotly, NLTK, TensorFlow

Statistics/Machine Learning: Regression, Classification (Decision Trees, SVM, KNN), Bayesian Statistics, Time Series

Big Data Technologies: Hive, Spark, MapReduce, Databricks, Oozie, Pig, Zookeeper

Other Tools: Excel, Anaconda Jupyter, GitHub, Azure ML Studio, Visual Studio, Spyder, MATLAB

Cloud: Azure, AWS, Apple Services, Meta Services, GCP

EDUCATION

Shanghai University - Shanghai, China

Bachelor of Computer Science and Technology (Graduated 2005)

Shanghai University - Shanghai, China

Master of Finance (Graduated 2009)

PROFESSIONAL EXPERIENCE:

McKinsey & Company - Dallas, TX Jun 2020 – Present

Data Engineer/Data Analyst

McKinsey & Company is a management consulting firm that advises on strategic management to corporations, governments, and other organizations. McKinsey is the oldest and largest of the "Big Three" management consultancies (MBB), the world's three largest strategy consulting firms by revenue. It has consistently been recognized by Vault as the most prestigious consulting firm in the world.

Responsibilities

Worked on the VF Corporation loyalty team to design, build, and maintain data infrastructure and models supporting VF Corporation's different loyalty programs. Integrated data from various sources and monitored the performance of loyalty program initiatives using key performance indicators, including enrolment, order, and points redemption management. Managed loyalty data upstream and downstream, with an understanding of reporting requirements for loyalty as a critical business driver, across systems such as OMS, CDP, and GDF. Drove transformation of loyalty program enablement through improved technology, processes, data management, and use case refinement for the VF brands Vans, The North Face, and Timberland.

Wrote Scope scripts to extract user subscription and usage data from Cosmos and analysed different features. Created the connection between Cosmos and Azure Data Lake Storage Gen1 and transferred the data to the data lake. Used Databricks and Python to process the data and analyse the relationship and relevance between features and subscription. Used Logistic Regression and Random Forest to build machine learning models to identify and target 'at risk' users. Deployed and operated the machine learning model on Azure (DevOps), helping the Microsoft Xbox marketing team optimize and personalize campaigns to move members from the high churn-risk segment to the low-risk segment.
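
A minimal sketch of the churn-risk modelling step described above; the feature table and column names are assumptions, not the actual Xbox data:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Assumed feature table: one row per member with usage features and a churn label.
    df = pd.read_parquet("subscription_usage_features.parquet")
    X = df.drop(columns=["member_id", "churned"])
    y = df["churned"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Compare the two model families mentioned above on held-out data.
    for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=200)):
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(type(model).__name__, "AUC:", round(auc, 3))

    # Score all members and flag the high churn-risk segment for targeted campaigns.
    df["churn_risk"] = model.predict_proba(X)[:, 1]
    at_risk = df.loc[df["churn_risk"] > 0.7, ["member_id", "churn_risk"]]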

Worked for the Walmart supply chain e-commerce team. Wrote shell scripts to extract data from Kafka and transfer it to GCP buckets with data mapping and transformation, created different schemas in BigQuery, and used an ETL tool (Automic) to write pipelines and land data. Wrote queries to retrieve data, created dashboards, and performed quantitative data analysis.

Assist in building Data pipelines to improve data quality and facilitate iterations for accommodating new user requirements.

Proactively worked on building data lineage to display incoming and outgoing references of an asset, making extensive use of data pipelines and other Meta tools such as Unidash, Daiquery, Diffs, and SEVs.

Performed data transformation and processing tasks within Snowflake by leveraging Snowflake's built-in SQL capabilities. Also ensured data quality and consistency by implementing data validation checks, data cleansing, and normalization processes.
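
A minimal sketch of the validation-and-cleansing step using the snowflake-connector-python package; the credentials, staging table, and column names are assumptions:

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="***",   # assumed credentials
        warehouse="ETL_WH", database="ANALYTICS", schema="STAGING",
    )
    try:
        cur = conn.cursor()

        # Validation check: fail the run if the staged table has rows without a business key.
        cur.execute("SELECT COUNT(*) FROM stg_orders WHERE order_id IS NULL")
        null_keys = cur.fetchone()[0]
        if null_keys:
            raise ValueError(f"{null_keys} rows in stg_orders are missing order_id")

        # Cleansing / normalization: standardize text fields and de-duplicate into a curated table.
        cur.execute("""
            CREATE OR REPLACE TABLE curated_orders AS
            SELECT DISTINCT
                   order_id,
                   TRIM(customer_name)  AS customer_name,
                   UPPER(country_code)  AS country_code,
                   order_ts
            FROM stg_orders
        """)
    finally:
        conn.close()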

Designed Databricks notebooks using PySpark for data extraction, transformation, and landing the data into Databricks Delta tables.
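
A minimal sketch of that extract-transform-land pattern as it might look in a Databricks notebook (where a SparkSession named spark is already provided); the mount paths and column names are assumptions:

    from pyspark.sql import functions as F

    # Extract: read raw JSON events from an assumed raw-zone mount point.
    raw = spark.read.format("json").load("/mnt/raw/loyalty/events/")

    # Transform: type the columns, derive a partition date, and drop duplicate events.
    clean = (raw
             .withColumn("event_ts", F.to_timestamp("event_time"))
             .withColumn("event_date", F.to_date("event_ts"))
             .withColumn("points", F.col("points").cast("long"))
             .dropDuplicates(["event_id"]))

    # Land: append into a Delta table partitioned by event date.
    (clean.write
          .format("delta")
          .mode("append")
          .partitionBy("event_date")
          .save("/mnt/delta/loyalty/events/"))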

Utilize Azure DevOps for continuous integration and deployment (CI/CD) pipelines, integrating Kubernetes deployments and configuration management for automated deployment, monitoring, and management of data solutions.

Using Pandas, NumPy, SciPy, and Scikit-learn in Python for scientific computing and data analysis.

Working extensively with Databricks to build scalable and efficient data processing and transformation pipelines that handle high-volume data ingestion, transformation, and aggregation.

Leveraged Databricks notebooks and Spark clusters to perform data ingestion, ETL (Extract, Transform, Load), and data wrangling tasks.

Dumped data into Azure Data Lake Storage and analysed raw data using Azure Data Lake Analytics jobs.

Implemented complex data transformations using PySpark, SQL, and DataFrame operations within the Databricks environment.

Design, build and launch efficient & reliable ETL pipelines to move and transform data (both large and small amounts).

Perform bash operations, using GIT for version control.

Created Azure Databricks notebooks in Pyspark for transforming raw JSON data into structured data.

Led scrum meetings and performed other Agile ceremonies like effort estimation and user stories management using JIRA.

Intelligently design data models for optimal storage and retrieval.

Working on designing and implementing the architecture and data models in Snowflake.

Design and develop new systems in partnership with software engineers to enable quick and easy consumption of data.

Closely partner with our Data Science team to build dashboards, self-service tools, and reports to analyse and present data associated with customer experience, product performance, business operations and strategic decision-making.

Environment: SQL Server Management Studio, Azure Data Factory, Azure Databricks, Kubernetes, Visual Studio, Agile (SCRUM), Python, Power BI, Meta Services, JSON, Databricks, Spark

Mount Sinai Health - New York, NY Jan 2018 – Jun 2020

Data Analyst / Data Engineer

The Mount Sinai Health System is a hospital network in New York City. It was formed in September 2013 by merging the operations of Continuum Health Partners and the Mount Sinai Medical Center.

Responsibilities

Developed and maintained interactive dashboards and reports using Tableau, enabling business users to access and analyse critical data for decision-making purposes.

Assisted in building Data pipelines to improve data quality and facilitate iterations for accommodating new user requirements.

Optimized database performance by analysing query execution plans, implementing indexing strategies, and fine-tuning SQL queries, resulting in significant reduction in query response times.

Assisted in the administration and maintenance of data warehouses and databases, ensuring scalability, availability, and adherence to data security standards.

Created visualisations and dashboards using Tableau to communicate data insights to business stakeholders.

Conducted A/B testing and other experiments to determine the effectiveness of various solutions and improve product performance.

Conducted statistical analyses, including hypothesis testing and regression analysis, to determine the significance of various factors affecting business performance.

Developed simple and complex MapReduce programs in Hive, Pig, and Python for Data Analysis on different data formats.

Implemented Spark using Python and Spark SQL for faster processing and testing of the data.

Worked with Hadoop infrastructure to store data in HDFS and used Hive SQL to migrate the underlying SQL codebase to Azure.

Designed and created pipelines using Python to transform and persist data in Parquet format and expose it through Hive.
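
A minimal sketch of that pipeline with PySpark, assuming hypothetical source paths and table names; the data is persisted as Parquet and registered as a Hive table so downstream Hive SQL can query it:

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("parquet_hive_pipeline")
             .enableHiveSupport()
             .getOrCreate())

    # Assumed raw source files.
    src = spark.read.csv("/data/raw/claims/", header=True, inferSchema=True)

    # Example transformation: keep valid rows and stamp the load date.
    out = (src
           .filter(F.col("claim_amount") > 0)
           .withColumn("load_date", F.current_date()))

    # Persist as Parquet and register a Hive table over it.
    (out.write
        .mode("overwrite")
        .format("parquet")
        .saveAsTable("analytics.claims_parquet"))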

Applied ML techniques such as Classification, Clustering, and Neural Networks using libraries like scikit-learn, pandas, and NumPy to model and identify patterns in customer behaviour for grouping.
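
A minimal sketch of the customer-grouping step with scikit-learn; the feature file and column names are assumptions:

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Assumed purchasing-behaviour feature table, one row per customer.
    df = pd.read_csv("customer_behaviour.csv")
    features = ["order_count", "avg_order_value", "days_since_last_order"]

    # Scale features, then cluster customers into segments.
    X = StandardScaler().fit_transform(df[features])
    df["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    # Profile each segment by its average behaviour.
    print(df.groupby("segment")[features].mean())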

Ranked customers based on purchasing-behaviour data in corporate markets.

Managed and optimised cloud-based infrastructure on Azure to ensure high availability and scalability of data processing and machine learning workloads.

Developed data pipelines in Azure Data Factory using linked services, datasets, and pipelines for ETL from Azure SQL, Blob storage, Azure SQL Data Warehouse, and write-back tool, handling forward and backward data flow.

Run SQL DDL/DML scripts to set up database objects in Azure SQL Database.

Collaborated with business stakeholders to understand their reporting and analysis needs, providing insights and recommendations based on data analysis.

Set up access control and user permissions on Azure SQL database.

Developed and delivered training programs to educate business users on self-service reporting tools, empowering them to independently access and analyse data.

Collaborated with cross-functional teams to gather and analyse business requirements, translating them into technical specifications and data models.

Developed an ETL pipeline using Apache Spark to extract data from a public dataset, perform data cleaning and transformation, and load the data into a PostgreSQL database for analysis.
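
A minimal sketch of that ETL pipeline; the dataset path, JDBC URL, and table names are assumptions, and the PostgreSQL JDBC driver is pulled in via spark.jars.packages:

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("public_dataset_etl")
             .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
             .getOrCreate())

    # Extract: read the public dataset (assumed CSV layout).
    df = spark.read.csv("/data/public/trips.csv", header=True, inferSchema=True)

    # Clean and transform: drop incomplete rows, derive a date column, de-duplicate.
    clean = (df.dropna(subset=["trip_id"])
               .withColumn("trip_date", F.to_date("pickup_datetime"))
               .dropDuplicates(["trip_id"]))

    # Load: write into PostgreSQL over JDBC for downstream analysis.
    (clean.write
          .format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/analytics")   # assumed connection
          .option("dbtable", "public.trips_clean")
          .option("user", "etl_user")
          .option("password", "***")
          .mode("overwrite")
          .save())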

Designed and implemented ETL processes using SQL and Python, ensuring accurate and efficient data integration from multiple sources into the data warehouse.

Environment: SQL Server Management Studio, Azure SQL Database, Azure Data Factory, Pandas, Hive, Sci-kit Learn, NumPy, Visual Studio, Python, Tableau, Apple Services, Hydra

First National Bank Omaha - Omaha, Nebraska June 2014 – Dec 2017

Data Engineer

Chartered and headquartered in Omaha, Nebraska, United States, First National provides corporate banking, investment banking, retail banking, wealth management and consumer lending services at locations in Nebraska, Iowa, Colorado, Texas, Kansas, South Dakota and Illinois.

Responsibilities

Helped executives understand the usage of Azure services, with a focus on machine learning: designed, developed, and deployed Azure Machine Learning models with their activities and their ETL data pipelines in Azure Data Factory, which helped analyse the data and track the usage of Azure services.

Developed and optimized data warehouses for datasets using Azure Data Lake Storage Explorer and Azure Data Explorer (Kusto).

Supported senior management by providing automated metrics reporting and ad-hoc deep-dive analysis; managed SLAs; and implemented and automated reporting and analytics infrastructure for ML models using SQL Server, Excel, and Power BI.

Used Hive & Spark for data processing & Flume to transfer log files from multiple sources to HDFS.

Developed stream processing data pipelines using Spark Streaming and Kafka.
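
A minimal sketch of a Kafka-to-storage pipeline of this kind using Spark Structured Streaming; the broker, topic, schema, and paths are assumptions (the spark-sql-kafka package must be on the classpath):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("kafka_stream_pipeline").getOrCreate()

    # Assumed shape of the JSON messages on the topic.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Read from Kafka and parse the JSON payload into typed columns.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
              .option("subscribe", "transactions")                # assumed topic
              .load()
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # Continuously append the parsed events to storage with checkpointing.
    query = (events.writeStream
             .format("parquet")
             .option("path", "/data/stream/transactions/")
             .option("checkpointLocation", "/data/stream/_checkpoints/transactions/")
             .outputMode("append")
             .start())
    query.awaitTermination()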

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).

Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

Implemented a real-time data ingestion pipeline to ingest data from the trip booking system into Hadoop and HBase for analytical purposes using Kafka, Python, and PySpark, enabling loading of 100 million records per day into the data lake.

Worked in close partnership with cross-functional teams across organization to define business rules and cleansing requirements for data transformation process.

Responsible for technical coordination, including writing SQL queries and C# script logic and leveraging assets to build and deploy email campaigns.

Responsible for DevOps activities such as code maintenance in GIT, Jenkins.

Generated discovery reports using Power BI and presented campaign engagement statistics to business owners to direct future strategies and initiatives.

Environment: Azure Data Factory, Kafka, Pyspark, SQL Server Management Studio, Visual Studio, Power BI

Lufax - Shanghai, China Dec 2011 – May 2014

Python Developer/Data Engineer

Lufax, full name Shanghai Lujiazui International Financial Asset Exchange Co., Ltd., is an online Internet finance marketplace headquartered in Lujiazui, Shanghai. Founded in 2011, it is an associate of China Ping An Group. The company was founded in September 2011, and started with P2P lending as the only service.

Responsibilities

Created action filters, calculations, parameters, and calculated sets for preparing dashboards and worksheets.

Designed and developed Power BI graphical and visualization solutions with business requirement documents and plans for creating interactive dashboards and developed various DAX measures in Power BI to satisfy business needs.

Transformed stored data by writing Spark jobs based on business requirements.

Developed SSRS reports, SSIS packages, and data pipelines using Kafka for ETL jobs to optimize the TensorFlow model, and implemented efficient Hive SQL and Spark SQL.

Collaborated with data engineers and operation team to implement ETL process, wrote and optimized SQL queries to perform data extraction to fit analytical requirements.

Performed preliminary data analysis using descriptive statistics and handled anomalies such as removing duplicates and imputing missing values.

Performed data mining on large datasets (mostly structured) and raw text data using different Data Exploration techniques.

Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analysing and transforming the data to uncover insights into customer usage patterns.

Built and maintained Snowflake databases, including data modelling, partitioning, and clustering for optimized performance.

Wrote ETL jobs to read from web APIs using REST and HTTP calls and loaded the data into HDFS using Python.

Built Data Integration and Workflow Solutions for data warehousing using SQL Server Integration Service and OLAP.

Environment: Python, Azure PaaS services, Power BI, SSRS, T-SQL, Spark SQL, U-SQL, SOAP APIs, REST APIs, PySpark, Spark SQL, JSON, Azure Data Factory, Azure Databricks


