Name: Vishnu Kumar
Phone Number: +1-908-***-****
Email: ad4fvm@r.postjobfree.com
PROFESSIONAL SUMMARY
Dynamic and motivated professional with 7+ years of experience in the IT industry, working across multiple areas including data engineering, product management, and cloud engineering
Strong experience in leading teams, mentoring junior engineers and delegating tasks
Strong academic background in technology and management/business
Highly proficient in cloud platforms – Amazon Web Services (AWS) and Google Cloud Platform (GCP)
Strong experience with programming languages like Python and JavaScript (Node.js, React.js)
Deep working knowledge of multiple Python libraries and frameworks like pandas, NumPy, and scikit-learn
Strong understanding of Data Engineering tools – Databricks, Airflow, Snowflake, etc.
Strong understanding of APIs, RESTful services, API development and usage
Strong understanding of OLTP, OLAP, data governance and data modelling, data warehousing
Proficient and experienced in various project management methodologies like Scrum, Kanban, and Waterfall
Strong in Statistical Modeling and Machine Learning techniques
Demonstrated history of cross-functional collaboration across various disciplines in an organization
Strong ability to easily translate business requirements to technical workflows and tasks
Proficient in documentation and presenting technical details to non-technical audiences
Fast learner with divergent/creative thinking, able to come up with unique ideas and solutions
Experienced in automating tasks, saving organizations significant resources through innovation
Strong interest in data engineering, Artificial Intelligence, NLP, and LLMs (Large Language Models)
CERTIFICATIONS
Scrum.org – Professional Scrum Master – II
Meta – Introduction to Databases
IBM – Design Thinking Practitioner
EDUCATION
M.S. in Management Science
Columbia University in the City of New York
Columbia Business School (M7 business school)
Activities – Officer at Blockchain@CU, Columbia University Financial Engineering Club, Board member at Columbia University Product Management club
GPA – 3.41/4.00
B.Eng. in IT
B.M.S College of Engineering
Activities – NSS, BMSCE Investment Club, BMSCE Mathematics Club, Phase Shift Department coordinator
Thesis project – “Real-time conversion of text-to-braille using Python, Machine Learning, Optical Character Recognition (OCR) and Google Cloud Platform (GCP)”
Received the best project award at the university level (among 100+ projects)
GPA – 8.45/10.00
TECHNICAL SKILLS
Category
Skills
Programming Languages
Python (pandas, NumPy, scikit-learn), JavaScript, Go, Bash, SQL
Databases
MySQL, PostgreSQL, Oracle, MongoDB, AWS DynamoDB, Redis
AWS
Amazon S3, EC2, Redshift, Glue, Glue Studio, Amazon RDS, Aurora, Athena, AWS Lambda, AWS API Gateway, ElastiCache, Amazon SNS, Amazon CloudWatch, Amazon SQS, AWS EMR, AWS SageMaker
Big Data Technologies
Apache Spark, PySpark, Apache Hive, Databricks
Data Visualization
Tableau, Microsoft Power BI
Integration Tools
Mulesoft
Project Management
JIRA, Confluence, Asana, Monday.com
Workflow Automation
Apache Airflow, Databricks Jobs
Containerization
Docker
CI/CD
Jenkins
Front-end Development
HTML5, CSS3, React.js, Node.js
Google Cloud Platform
Google Cloud Storage, Google Compute Engine, Google BigQuery, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Functions, Google Cloud Endpoints, Google Cloud Pub/Sub, Google Cloud Monitoring
Snowflake
Snowflake, Snowpipe, Snowpark, SnowSQL, Snowsight
RELEVANT EXPERIENCE
Fox Sports–New York, NY Apr 2023 – Present
Senior Data Engineer (FOX Sports AI Platform)
Highlighted Projects
1. Fox Data Products – Data Platform (Medallion Architecture)
Designed and implemented scalable ETL pipelines using AWS S3, AWS Lambda, AWS DynamoDB, Databricks (PySpark), AWS CloudWatch, etc.
Utilized AWS S3 as a common data lake and data warehousing solution (bronze and silver layers)
Developed a data ingestion pipeline to ingest data from multiple vendors into AWS S3
Worked with data-feed APIs from multiple vendors, such as SR360 and SportRadar
Implemented a medallion architecture to process and deliver validated event-by-event data
Utilized Python based convertors that leverage Pydantic models to convert Bronze level data
Flattened tables and created views using SQL for ease of understanding and visualization
Worked with data from multiple sports and leagues like MLB, NCAAFB, NFL, etc.
Managed a team of 3 junior data engineers and delegated pipeline development
Worked closely with business analysts to understand and translate business requirements
Persisted silver tables onto S3 and ingested tables into DynamoDB through Databricks DLT
Developed and maintained robust, scalable data pipelines using Apache Airflow to orchestrate complex workflows across AWS services such as AWS Lambda, AWS Glue, and Amazon Redshift, ensuring efficient data extraction, transformation, and loading (ETL)
Worked on Live ETL Pipelines by leveraging Apache Kafka/AWS Kinesis
Worked extensively with Jupyter/IPython-style notebooks on Databricks for building pipelines
Utilized Git within the Databricks platform for version-control and collaboration
Extensively worked with PySpark on Databricks to perform DataFrame operations
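The bronze-to-silver conversion described above could be sketched as follows. This is a minimal illustration using a plain dataclass as a stand-in for the Pydantic models mentioned; the field names and vendor payload keys are hypothetical, not the actual Fox schema:

```python
from dataclasses import dataclass

# Hypothetical silver-layer schema for one play-by-play event.
@dataclass
class PlayEvent:
    game_id: str
    play_number: int
    description: str

def bronze_to_silver(raw: dict) -> PlayEvent:
    """Validate and coerce a raw (bronze) vendor record into the silver schema."""
    if "gameId" not in raw:
        raise ValueError("missing required field: gameId")
    return PlayEvent(
        game_id=str(raw["gameId"]),
        play_number=int(raw.get("playNumber", 0)),  # coerce string counts to int
        description=str(raw.get("description", "")).strip(),
    )
```

In a real medallion pipeline this conversion would run inside a Databricks job over the bronze tables, with validation failures routed to a quarantine path.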
2. Fox Data Platform – Cold Ranking
Devised a mechanism for cold ranking various events, metrics, and stats for MLB, NFL, etc.
Ranked career, season-level data by using business logic based on opportunity categories
Identified metrics for threshold-based filtering and stored them in relevant Python classes
Leveraged object-oriented programming in Python to create PySpark UDFs from reusable functions
Created and executed the cold-ranking mechanism in PySpark
Leveraged Apache Airflow to automate and schedule ETL tasks such as validation and processing
Compressed and synced data to AWS DynamoDB using a PySpark UDF
Extensively relied on SQL/PySpark SQL for all ETL tasks to arrive at the final table
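The threshold-filter-then-rank logic described above can be illustrated in plain Python (in production this would run as a PySpark UDF or window function; the record shape and metric names here are hypothetical):

```python
def cold_rank(records, metric, threshold):
    """Drop records whose `metric` falls below `threshold`, then rank the
    remainder in descending order of that metric (rank 1 = best).
    `records` is a list of dicts, e.g. season-level player stats."""
    eligible = [r for r in records if r.get(metric, 0) >= threshold]
    ranked = sorted(eligible, key=lambda r: r[metric], reverse=True)
    # Return copies annotated with a 1-based rank.
    return [dict(r, rank=i + 1) for i, r in enumerate(ranked)]
```

The same shape maps directly onto a PySpark `Window.orderBy(...desc())` with `rank()` after a `filter`, which is the more scalable form for full-league data.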
3. Fox Sports AI Platform – Generative AI Event Commentary (Speech/Voice AI)
Worked on a POC project highlighting the capabilities of Generative AI in the sports space
Leveraged the capabilities of GPT and other functional models on AWS Bedrock
Created dynamic, real-time commentary across various sports events (simulated NFL games)
Created a tool that generates Event-By-Event Description of sports play data for NFL
Exposure to Voice generative AI for Sports Data commentary
Tech Stack: Python (incl. pandas, NumPy, scikit-learn, matplotlib, etc.), SQL, Databricks, PySpark, Apache Airflow, Apache Spark, AWS S3, AWS EC2, AWS DynamoDB, AWS ElastiCache, Redis, AWS Lambda, AWS API Gateway, AWS Bedrock, Apache Kafka, AWS Kinesis, AWS CloudWatch, JIRA, Confluence, DBeaver
Deloitte – New York, NY Apr 2021 – Apr 2023
Client: The Walt Disney Company
Role: Sr Data Engineer
Highlighted Projects
1. Disney Experiences – Disneyland and Disney World Cancellations and Reservations
Integrated diverse data sources such as AWS RDS and DynamoDB into a unified data ecosystem
Deployed Python-based data-processing workflows on AWS Lambda and EC2 instances
Developed and maintained ELT/ETL pipelines using Apache Airflow
Ensured timely and error-free data ingestion from various sources into Snowflake
Leveraged the data warehousing capabilities of Snowflake to perform complex analytics
Worked with historical as well as live reservations, cancellations and guest experience data
Worked closely with the business intelligence and customer experience teams to translate data insights into actionable strategies
Staged extracted data in AWS S3 which functioned as a global data lake
Leveraged AWS SDK for Python (Boto3) to interact with AWS services
Utilized Snowpark to develop and execute complex data transformation and processing
2. Disneyland – FastPass
Used Python directly within the Snowflake environment to transform FastPass data
Utilized Snowpipe for continuous, near-real-time data ingestion from AWS S3
Leveraged Snowflake SQL to efficiently query and transform data
Employed Snowsight for advanced data visualization and dashboarding
Transformed data using a combination of Snowflake SQL and Snowpark
3. Disney Parks – CCPA, GDPR compliance and global reference table (GENIE)
Extensively worked with SQL on Snowflake to ensure GDPR and CCPA compliance
Engineered a pipeline to gather data from multiple sources to create a global metadata reference table that housed all unique IDs across Disney's business – GenieID, DisneyID, etc.
Utilized Snowflake’s Time Travel capability for analyzing historical data
Utilized the COPY command for bulk data load from S3 to Snowflake
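The bulk load from S3 to Snowflake mentioned above could look roughly like this helper, which assembles the COPY INTO statement (the table and stage names are placeholders; in practice the statement would be executed through the Snowflake Python connector or SnowSQL):

```python
def build_copy_statement(table: str, stage: str, file_format: str = "PARQUET") -> str:
    """Build a Snowflake COPY INTO statement for bulk-loading files
    already staged (e.g. an external stage backed by S3) into a table."""
    return (
        f"COPY INTO {table} "
        f"FROM @{stage} "
        f"FILE_FORMAT = (TYPE = {file_format}) "
        f"ON_ERROR = 'ABORT_STATEMENT'"
    )
```

Keeping the statement construction in one place makes it easy to parameterize per-table loads from an orchestration layer such as Airflow.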
Tech Stack: Python (incl. pandas, NumPy, scikit-learn, matplotlib, etc.), SQL, Apache Airflow, Apache Spark, AWS S3, AWS EC2, AWS DynamoDB, AWS ElastiCache, Redis, AWS Lambda, AWS RDS, AWS CloudWatch, JIRA, Confluence, Snowflake, Snowpark, Snowpipe, Snowsight, boto3, dbt
Client: Amazon
Role: AI Data Engineer
Highlighted Projects
1. CONFIDENTIAL – Generative AI Project
Devised a POC roadmap for creating an in-house generative AI model for product descriptions
Collected and prepared prompt-completion pairs in multiple styles across broad topics
Closely revised the data and handed off initial-stage development of the generative AI model
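Prompt-completion pairs like those described above are commonly serialized as JSON Lines for fine-tuning. A minimal sketch (the `prompt`/`completion` key names are a common convention, not necessarily the schema this project used):

```python
import json

def to_jsonl(pairs):
    """Serialize (prompt, completion) string pairs to JSON Lines,
    one JSON object per line."""
    return "\n".join(
        json.dumps({"prompt": p, "completion": c}) for p, c in pairs
    )
```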
Tech Stack: Python, AWS S3, AWS EC2, AWS EMR, Microsoft Excel, Microsoft PowerPoint, Microsoft Word, JIRA, Confluence, etc.
Client: Lumen
Role: Sr Data Engineer
Highlighted Projects
1. Churn Prediction – Using CDRs (Call Detail Records)
Worked extensively with Lambda and AWS API Gateway to build a microservices architecture
Extensively leveraged Jupyter/IPython notebooks for ETL scripts in Python/PySpark
Utilized AWS CloudWatch for scheduling and monitoring end-to-end ETL pipelines
Created pipelines to extract data from on-premises source systems to AWS S3
Utilized Amazon S3 events to trigger Lambda functions and microservices written in Python
Hands-on experience with PySpark, using Spark libraries and building UDFs for tasks
Extensive experience with Databricks notebooks for PySpark DataFrame manipulations
Worked extensively on ETL pipelines to consume data from APIs using Python, transform it, and load it into AWS S3 for reporting and analysis by data scientists
Managed codebase versions and changes effectively using Git
Created workflows and scheduled AWS Glue jobs
Leveraged Python (boto3) to configure AWS services including Glue, Redshift, EC2, S3, etc.
Used AWS Glue to catalog data stored in S3 for integration with AWS Redshift Spectrum
Engineered crawlers within AWS Glue to populate the AWS Glue Data Catalog
Triggered AWS Glue ETL jobs using AWS Lambda functions (event-driven ETL pipelines)
Loaded processed data onto AWS Redshift to function as an OLAP database
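The event-driven Glue trigger described above could be sketched as a Lambda handler like the one below. The job name `raw-to-curated` is a placeholder, and the Glue client is injected as a parameter so the logic can be exercised without AWS credentials (in a real Lambda it would be `boto3.client("glue")`):

```python
def handler(event, glue_client):
    """Lambda entry point for event-driven ETL: for each new S3 object in
    the event, start a Glue job run pointed at that object."""
    runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        resp = glue_client.start_job_run(
            JobName="raw-to-curated",  # hypothetical Glue job name
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        runs.append(resp["JobRunId"])
    return {"started": runs}
```

Wiring the S3 bucket's PutObject notification to this Lambda gives the event-driven pipeline behavior with no polling.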
Tech Stack: Python, AWS S3, AWS EC2, AWS Glue, AWS EMR, PySpark, Databricks, Jupyter, AWS Lambda, AWS API Gateway, AWS Redshift, Tableau, Git, AWS CloudWatch, JIRA, Confluence, etc.
Client: Capital One
Role: Data Engineer
Highlighted Projects
1. Credit Cards – default handling and spend patterns (at-risk accounts)
Worked extensively with AWS Lambda, built functions in Python
Designed efficient SQL queries to extract data from relational databases on Amazon RDS
Designed an ETL pipeline incorporating Amazon SNS and Kinesis Data Firehose for live transactions
Configured Amazon Kinesis Data Firehose to consume from Kinesis streams for batch processing of live data
Engineered Amazon Redshift to function as a data warehousing solution
Stored output of Amazon Kinesis Firehose stream in Amazon S3
Reprocessed data stored in S3 using Lambda functions and synced the data into DynamoDB
Created multiple AWS Glue scripts for validating/processing source systems
Scheduled AWS Glue jobs to be executed on job clusters for data transformation tasks
Utilized AWS Step Functions for orchestration tasks such as lookups, conditional branching, iteration, variable handling, metadata retrieval, filtering, and wait operations
Designed crawlers in AWS Glue to infer schema and create tables in the Glue Data Catalog
Structured PySpark scripts generated by AWS Glue to perform serverless ETL
Wrote queries on Amazon Athena to extract data from Amazon RDS and create files in S3
Extensively relied on PySpark and SparkSQL to perform analysis on large volumes of data
Leveraged Python (boto3) to configure multiple AWS services including Glue, Redshift, EC2, and S3
Experienced in using Git-based tools like GitHub and GitLab for collaboration and version control
Designed and implemented automated CI/CD pipelines using Jenkins, Terraform, and other DevOps tools, enabling efficient and reliable software delivery and deployment.
Implemented real-time data streaming solutions using Amazon Kinesis, processing high volumes of data with low latency to provide near-real-time insights and analytics.
Worked with AWS Glue Studio and Jupyter Notebooks hosted on Amazon Sagemaker to create and schedule data engineering and data science workflows using Python and Scala
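One practical detail of the Firehose-based streaming above: the PutRecordBatch API accepts at most 500 records per call, so producers typically chunk their records first. A minimal sketch of that chunking (the per-call 4 MiB size limit is noted but not enforced here):

```python
def batch_for_firehose(records, max_records=500):
    """Split a list of records into batches sized for Kinesis Data
    Firehose's PutRecordBatch API (max 500 records per call; the 4 MiB
    per-call byte limit is a separate constraint not checked here)."""
    return [records[i:i + max_records] for i in range(0, len(records), max_records)]
```

Each batch would then be passed to `firehose.put_record_batch(...)` via boto3, with failed records from the response retried.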
Tech Stack: Python, Go, AWS RDS, Amazon Aurora, SparkSQL, AWS Glue Studio, AWS S3, AWS EC2, AWS Glue, AWS EMR, PySpark, Databricks, Jenkins, Amazon SageMaker, Terraform, Docker, Kafka, AWS Kinesis, Jupyter, AWS Lambda, AWS API Gateway, AWS Redshift, Tableau, Git, AWS CloudWatch, JIRA, Confluence, etc.
ISERP (Columbia University) - New York, NY Sep 2020 – Feb 2021
Data Engineer
Highlighted Projects
1. AI Model Share Platform – Web-based platform for executing Machine Learning models
Created a Python based data ETL pipeline to extract semi-structured/structured data
Extracted data from sources like AWS DynamoDB, AWS ElastiCache Redis, and REST APIs
Cleaned and transformed data using AWS Lambda functions written in Python
Designed automated archiving for Redis using Lambda, CloudWatch cron jobs and S3 buckets
Designed and implemented high-performance caching solutions using AWS Elasticache Redis, enabling efficient and low-latency data retrieval across microservices
Configured and optimized Redis clusters and nodes using AWS ElastiCache Redis, enabling efficient and performant data storage and retrieval, and reducing operational costs
Leveraged the power of Python and the boto3 library to write efficient and scalable Lambda functions, enabling seamless integration with various AWS services
Designed and implemented an efficient and scalable data model using DynamoDB to store data related to ML models and their usage, enabling efficient querying and retrieval of data
Designed and maintained APIs that execute Machine Learning models on-demand
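A cache-aside pattern like the one described above (ElastiCache Redis in front of on-demand model execution) might look like this sketch. The key scheme and TTL are illustrative, and the cache client is injected (redis-py style `get`/`setex`) so the logic is testable without a Redis server:

```python
import json

def cached_predict(model_id, features, cache, predict_fn, ttl=3600):
    """Cache-aside lookup for model predictions: return a cached result
    if present, otherwise compute it and store it with a TTL.
    `cache` must expose get(key) and setex(key, ttl, value)."""
    # sort_keys makes the cache key deterministic for equal feature dicts
    key = f"pred:{model_id}:{json.dumps(features, sort_keys=True)}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = predict_fn(features)
    cache.setex(key, ttl, json.dumps(result))
    return result
```

With real ElastiCache, `cache` would be a `redis.Redis(...)` client; the TTL doubles as the archiving window before cold results fall out of the cache.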
Tech Stack: Python, AWS DynamoDB, AWS ElastiCache, Redis, REST APIs, AWS CloudWatch, AWS Lambda, Machine Learning, cron, microservices, boto3
United Nations (UNDP) - New York, NY Sep 2019 – Sep 2020
Data Engineer
Highlighted Projects
1. Sentiment Analysis to predict violent events in sub-Saharan Africa (Sahel)
Supervised a team of 8 to engineer a Python script to scrape news data for multiple countries in the Sahel region across 20+ years by leveraging Selenium, Requests, and BeautifulSoup
Engineered a Python script on Google Cloud VMs to scrape news data (headlines and text) from the web for 20+ years by utilizing Selenium, BeautifulSoup, and Requests
Authored cron jobs to perform periodical data extraction from various sources
Built an end-to-end ETL pipeline using Python and the Google Cloud Platform (GCP) to gather data from multiple sources and prepare it for Natural Language Processing (NLP)
Utilized IPython notebooks on Google Colab and leveraged Python, pandas, NumPy, and matplotlib to perform Exploratory Data Analysis (EDA) on scraped and cleaned data
Stored scraped data in the form of Parquet files on GCS (Google Cloud Storage)
Leveraged Google Cloud Platform (GCP) for hosting continuous data gathering scripts on virtual machines (VMs) /cloud compute resources
Utilized PySpark on Google Cloud Dataproc for performing sentiment analysis on large volumes of data and generating insights from Parquet files
Utilized Tableau for creating dashboards to visualize correlations and understand relationships between parameters
Structured Google Cloud Storage to function as a storage solution for all obtained file formats
Stored insights and sentiment analysis data as CSV files on GCS
Created multiple Databricks PySpark notebooks for validating/processing source-system data into the warehouse
Designed and developed tables on BigQuery, loaded CSV insights data into BigQuery
Implemented a data warehousing solution using Google BigQuery for analytical purposes
Connected Tableau to BigQuery to plot correlations and other statistical data for stakeholders
Designed vibrant and informative dashboards on Tableau providing a descriptive snapshot of metrics and data
Estimated sentiment using NLP through pandas, NumPy, and NLTK (VADER sentiment)
Utilized multiple operators in Apache Airflow to author DAGs to orchestrate the entire data pipeline/ETL flow
Utilized Google Cloud Dataproc for Pyspark based analysis of Parquet files
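The VADER-based sentiment scoring mentioned above produces a compound score in [-1, 1]; labeling it follows the thresholds commonly recommended for VADER (±0.05). A minimal sketch of that labeling step (the scoring itself would come from NLTK's `SentimentIntensityAnalyzer`):

```python
def label_sentiment(compound: float) -> str:
    """Map a VADER compound score to a categorical label using the
    commonly used +/-0.05 thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```

At scale this would run as a PySpark UDF over the scraped-headline DataFrame on Dataproc, yielding the per-country sentiment series fed to the violence-prediction model.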
Tech Stack: Python, GCP VMs, Selenium, BeautifulSoup, NLP, IPython, Google Colab, NumPy, matplotlib, Google Cloud Storage, PySpark, Dataproc, Tableau, BigQuery, Apache Airflow, Parquet, cron
Great West Financial - Bangalore, IN May 2017 – Sep 2019
Data Engineer
Highlighted Projects
1. Non-Qual US Retirement Plans
Architected a data warehouse solution on Redshift to store transformed data
Leveraged PySpark/Apache Spark to optimize extraction and transformation of data
Extensively worked on PySpark to analyze, transform and manipulate large datasets
Worked with PySpark SQL functions and wrote multiple UDFs to perform column operations
Designed ELT and ETL workflows using MuleSoft to load data onto Redshift
Created an AWS Lambda pipeline to migrate microservices from Mulesoft API Gateway to AWS API Gateway
Worked on migration of data from On-Prem SQL server to Cloud databases
Developed cron jobs running Bash scripts to automate server maintenance and health checks
Structured Node.js functions on AWS Lambda to provide API proxy integrations
Designed a web application to replace Oracle Forms and moved it to the cloud
Leveraged React.js, SASS, HTML5, CSS3, and Node.js to design a user-friendly web application
Worked in a fast-paced agile development environment that relied on JIRA & Confluence
Developed CI/CD processes for SDLC by leveraging Git, Jenkins, Docker, Kubernetes
Utilized Amazon S3 as a data lake to store raw data gathered in JSON, XML, text, and other related file formats
Integrated Salesforce Marketing Cloud with AWS services, using Amazon S3 for data storage, AWS Lambda for data processing, and Amazon Redshift for data warehousing
Worked extensively with Salesforce Marketing Cloud via MuleSoft connectors
Created a Salesforce Marketing Cloud-to-JIRA integration to automatically convert Salesforce Marketing Cloud service requests into JIRA tickets
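The Marketing Cloud-to-JIRA mapping described above could be sketched as a payload transformer. The Marketing Cloud request fields here are hypothetical; the output follows the shape of JIRA's REST create-issue payload (`fields.project.key`, `summary`, `issuetype.name`), and the project key `OPS` is a placeholder:

```python
def sfmc_request_to_jira_issue(request, project_key="OPS"):
    """Map a Salesforce Marketing Cloud service request (illustrative
    field names) to a JIRA REST API create-issue payload."""
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": request["subject"],
            "description": request.get("details", ""),
            "issuetype": {"name": "Task"},
        }
    }
```

The resulting dict would be POSTed to JIRA's `/rest/api/2/issue` endpoint by the integration layer (MuleSoft or a Lambda).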
Tech Stack: AWS Redshift, PySpark, MuleSoft, AWS Lambda, AWS API Gateway, MuleSoft API Gateway, REST APIs, cron, Bash, Shell Scripting, Node.js, Python, Java, HTML5, CSS3, Bootstrap, Oracle Forms, SASS, React.js, JIRA, Confluence, SDLC, Git, Jenkins, Docker, Kubernetes, Salesforce, Salesforce Marketing Cloud, Postman, Oracle DB