
Data Engineer Visualization

Location:
Texas
Posted:
February 20, 2024


Resume:

Shivani Shah

732-***-**** *****************@*****.*** https://www.linkedin.com/in/shivani-shah-300640158/

9+ years of IT experience as a Data Engineer across a variety of industries, with hands-on experience in Hadoop, MapReduce, Hive, Spark (PySpark), Azure, AWS, Sqoop, Snowflake, Teradata, Oracle, RDBMS, Python, Scala, ETL, and Data Visualization.

Experienced in Databricks and Big Data technologies, mainly in Data Lake and Delta Lake implementation.

Excellent understanding of all stages of a typical SDLC: Requirement Analysis, Design, Programming, Project Status Review (PSR), Unit Testing, Integration Testing, and Support.

Expert in designing and implementing efficient data processing pipelines using Spark, Hadoop, and cloud-based ETL tools such as AWS Glue and Azure Data Factory.

Extensive knowledge of Hadoop, Spark, PySpark, Hive, Impala, Sqoop, and RDBMS (Oracle, MariaDB).

Experienced in Spark on the Hadoop platform, using in-memory processing to handle billions of records.

Worked with NoSQL databases like HBase, creating tables to load large sets of semi-structured data.

Worked with Python and R Libraries like NumPy, Pandas, Scikit-Learn, SciPy, and TensorFlow.

Good knowledge of Data Visualization including producing tables, graphs, and listings using various procedures and tools such as Tableau.

Working knowledge of Integration Services (SSIS), Reporting Services (SSRS), and Analysis Services (SSAS).

Worked on extracting data from multiple sources into Power BI for data preparation.

Managed and optimized cloud-based infrastructure on AWS and Azure to ensure high availability and scalability of data processing and machine learning workloads.

Supported migration of transactional and master data loads from RDBMS to Hadoop servers.

Designed EC2 instance architecture to meet high availability application architecture and security parameters.

Developed Python scripts to hit REST APIs and extract data to AWS S3.
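
A minimal sketch of such an extract script, assuming a hypothetical JSON endpoint and bucket; `boto3` and AWS credentials are presumed configured, and all names here are illustrative:

```python
import json
from datetime import date

def make_s3_key(endpoint: str, day: date) -> str:
    """Build a date-partitioned S3 object key for a raw API extract."""
    return f"raw/{endpoint}/{day:%Y/%m/%d}.json"

def fetch_to_s3(api_url: str, bucket: str, endpoint: str, day: date) -> str:
    """Pull one endpoint's payload over HTTP and land it in S3 as JSON."""
    from urllib.request import urlopen  # stdlib HTTP client
    import boto3                        # assumes AWS credentials are configured

    with urlopen(f"{api_url}/{endpoint}") as resp:
        payload = json.load(resp)
    key = make_s3_key(endpoint, day)
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=json.dumps(payload).encode("utf-8")
    )
    return key
```

Date-partitioned keys like `raw/orders/2023/07/04.json` keep each day's extract addressable for downstream incremental loads.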

Experienced in streaming analytics with Spark Streaming and Databricks Delta.

Good understanding of monitoring tools like AWS CloudWatch, Grafana, Splunk, and Nagios.

SKILLS

Programming Languages:

Python, SQL, PL/SQL, Java, R

Data Visualization:

Tableau, Microsoft Power BI

Libraries:

NumPy, Pandas, Matplotlib, Scikit-learn, Seaborn, Plotly, NLTK, TensorFlow

Statistics/Machine Learning:

Regression, Classification (Decision Trees, SVM, KNN), Bayesian Statistics, Time Series

Big Data Technologies:

Hive, Spark, MapReduce, Databricks, Oozie, Pig, Zookeeper

Other Tools:

Excel, Anaconda Jupyter, GitHub, Azure ML Studio, Visual Studio, Spyder, MATLAB

Cloud:

Azure, AWS, Apple Services, Meta Services

EDUCATION

California State University, Los Angeles

Master of Science (MS) in Information Systems

Gujarat Technological University, India

Bachelor's in Information Technology

PROFESSIONAL EXPERIENCE:

Facebook (Meta)

Data Engineer Nov 2022 - Present

Meta Platforms, Inc., formerly named Facebook, Inc., is an American multinational technology conglomerate based in Menlo Park, California. The company owns Facebook, Instagram, and WhatsApp, among other products and services. Meta is one of the world's most valuable companies and among the ten largest publicly traded corporations in the United States. It is considered one of the Big Five American information technology companies, alongside Alphabet (Google), Amazon, Apple, and Microsoft.

Responsibilities

Assist in building Data pipelines to improve data quality and facilitate iterations for accommodating new user requirements.

Proactively working on building data lineage that displays incoming and outgoing references of an asset, making extensive use of data pipelines and other Meta tools such as Unidash, Daiquery, Diffs, and SEVs.

Performed data transformation and processing tasks within Snowflake by leveraging Snowflake's built-in SQL capabilities. Also ensured data quality and consistency by implementing data validation checks, data cleansing, and normalization processes.
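
The validation-and-cleansing step described above can be sketched in plain Python before being expressed as Snowflake SQL; the field names and rules here are illustrative assumptions, not the actual production logic:

```python
def cleanse_record(rec: dict, required: tuple = ("id", "email")) -> dict:
    """Trim strings, map empties to None, lower-case emails, check required fields."""
    out = {}
    for k, v in rec.items():
        if isinstance(v, str):
            v = v.strip() or None  # empty-after-trim becomes NULL-like
        out[k] = v
    if "email" in out and out["email"]:
        out["email"] = out["email"].lower()  # normalize for joins/dedup
    missing = [f for f in required if not out.get(f)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return out
```

In Snowflake itself the equivalent checks would typically be `TRIM`/`NULLIF`/`LOWER` expressions plus `NOT NULL` constraint validation.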

Design Databricks notebooks using PySpark for data extraction and transformation, landing the data into Databricks Delta tables.

Utilize Azure DevOps for continuous integration and deployment (CI/CD) pipelines, integrating Kubernetes deployments and configuration management for automated deployment, monitoring, and management of data solutions.

Using Pandas, NumPy, SciPy, and Scikit-learn in Python for scientific computing and data analysis.

Working extensively with Databricks to build scalable, efficient data pipelines that handle high-volume data ingestion, transformation, and aggregation.

Leveraged Databricks notebooks and Spark clusters to perform data ingestion, ETL (Extract, Transform, Load), and data wrangling tasks.

Dump data into Azure Data Lake Storage and analyse raw data using Azure Data Lake Analytics jobs.

Implement complex data transformations using PySpark, SQL, and DataFrame operations within the Databricks environment.

Design, build and launch efficient & reliable ETL pipelines to move and transform data (both large and small amounts).

Perform bash operations and use Git for version control.

Created Azure Databricks notebooks in Pyspark for transforming raw JSON data into structured data.
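
The core of a raw-JSON-to-structured transform is flattening nested objects into columns. In Databricks this is usually done with PySpark struct selection or `explode`; the flattening logic itself can be sketched in plain Python (field names are illustrative):

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested JSON into dot-separated column names, as a
    pre-step toward a tabular (Delta-table-shaped) schema."""
    flat = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            flat.update(flatten(v, key + "."))  # recurse into nested objects
        else:
            flat[key] = v
    return flat
```

Applied to an event like `{"user": {"id": 1, "geo": {"city": "LA"}}}`, this yields columns `user.id` and `user.geo.city`, which map directly onto a structured table.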

Led scrum meetings and performed other Agile ceremonies like effort estimation and user stories management using JIRA.

Intelligently design data models for optimal storage and retrieval.

Working on designing and implementing the architecture and data models in Snowflake.

Design and develop new systems in partnership with software engineers to enable quick and easy consumption of data.

Closely partner with our Data Science team to build dashboards, self-service tools, and reports to analyse and present data associated with customer experience, product performance, business operations and strategic decision-making.

Environment: SQL Server Management Studio, Azure Data Factory, Azure Databricks, Kubernetes, Visual Studio, Agile (SCRUM), Python, Power BI, Meta Services, JSON, Databricks, Spark

Apple Inc - Menlo Park, CA

Data Analyst / Data Scientist Nov 2021 - Nov 2022

Apple Inc. is an American multinational technology company headquartered in Cupertino, California. Apple is the world's largest technology company by revenue, with US$394.3 billion in revenue. Apple is the world's biggest company by market capitalization. Apple is the fourth-largest personal computer vendor by unit sales and the second-largest mobile phone manufacturer in the world.

Responsibilities

Developed and maintained interactive dashboards and reports using Tableau, enabling business users to access and analyse critical data for decision-making purposes.

Assisted in building Data pipelines to improve data quality and facilitate iterations for accommodating new user requirements.

Optimized database performance by analysing query execution plans, implementing indexing strategies, and fine-tuning SQL queries, resulting in significant reduction in query response times.

Assisted in the administration and maintenance of data warehouses and databases, ensuring scalability, availability, and adherence to data security standards.

Created visualisations and dashboards using Tableau to communicate data insights to business stakeholders.

Conducted A/B testing and other experiments to determine the effectiveness of various solutions and improve product performance.

Conducted statistical analyses, including hypothesis testing and regression analysis, to determine the significance of various factors affecting business performance.

Developed simple and complex MapReduce programs in Hive, Pig, and Python for Data Analysis on different data formats.

Implemented Spark using Python and Spark SQL for faster processing and testing of the data.

Worked with Hadoop infrastructure to store data in HDFS storage and use Hive SQL to migrate the underlying SQL codebase in Azure.

Designed and created pipelines using Python to transform and persist data in Parquet format and expose it through Hive.
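
A pipeline like this typically writes Parquet files into Hive-style `dt=` partition directories so Hive can discover them. A small sketch of that layout logic, with illustrative paths and field names:

```python
from collections import defaultdict
from datetime import date

def partition_path(base: str, table: str, day: date) -> str:
    """Hive-style partition directory for one day's Parquet output."""
    return f"{base}/{table}/dt={day:%Y-%m-%d}"

def bucket_by_day(records: list) -> dict:
    """Group records by their event date so each bucket is written
    into exactly one partition directory."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["event_date"]].append(rec)
    return dict(buckets)
```

Each bucket would then be written as Parquet (e.g. via pandas/pyarrow) under its `partition_path`, after which `MSCK REPAIR TABLE` or `ALTER TABLE ... ADD PARTITION` makes it queryable from Hive.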

Applied ML techniques like Classification, Clustering, and Neural Networks using libraries like scikit-learn, pandas, and NumPy to model and identify patterns in customer behaviour for grouping.

Ranked customers based on purchasing behaviour data in corporate markets.

Managed and optimised cloud-based infrastructure on Azure to ensure high availability and scalability of data processing and machine learning workloads.

Developed data pipelines in Azure Data Factory using linked services, datasets, and pipelines for ETL from Azure SQL, Blob storage, Azure SQL Data Warehouse, and write-back tool, handling forward and backward data flow.

Run SQL DDL/DML scripts to set up database objects in Azure SQL Database.

Collaborated with business stakeholders to understand their reporting and analysis needs, providing insights and recommendations based on data analysis.

Set up access control and user permissions on Azure SQL database.

Developed and delivered training programs to educate business users on self-service reporting tools, empowering them to independently access and analyse data.

Collaborated with cross-functional teams to gather and analyse business requirements, translating them into technical specifications and data models.

Developed an ETL pipeline using Apache Spark to extract data from a public dataset, perform data cleaning and transformation, and load the data into a PostgreSQL database for analysis.

Designed and implemented ETL processes using SQL and Python, ensuring accurate and efficient data integration from multiple sources into the data warehouse.

Environment: SQL Server Management Studio, Azure SQL Database, Azure Data Factory, Pandas, Hive, Sci-kit Learn, NumPy, Visual Studio, Python, Tableau, Apple Services, Hydra

Microsoft – Redmond, WA

Data Engineer Nov 2019 - Oct 2021

Responsibilities

Helped executives understand Azure service usage with a focus on machine learning: designed, developed, and deployed Azure Machine Learning models and their ETL data pipelines in Azure Data Factory, enabling analysis of the data to track Azure service usage.

Developed and optimized data warehouses for datasets using Azure Data Lake Storage Explorer and Azure Data Explorer (Kusto).

Supported senior management by providing automated metrics reporting and ad-hoc deep-dive analysis; managed SLAs and implemented and automated reporting and analytics infrastructure for ML models using SQL Server, Excel, and Power BI.

Used Hive & Spark for data processing & Flume to transfer log files from multiple sources to HDFS.

Developed stream processing data pipelines using Spark Streaming and Kafka.

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).

Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

Implemented a real-time data ingestion pipeline to ingest data from a trip booking system into Hadoop and HBase for analytical purposes using Kafka, Python, PySpark, and HBase, enabling loading of 100 million records per day into the Data Lake.

Worked in close partnership with cross-functional teams across organization to define business rules and cleansing requirements for data transformation process.

Responsible for technical coordination including writing SQL query and C# Script logic and leveraging assets to build and deploy email campaigns.

Responsible for DevOps activities such as code maintenance in GIT, Jenkins.

Generated discovery reports using Power BI and presented campaign engagement statistics to business owners to direct future strategies and initiatives.

Environment: Azure Data Factory, Kafka, Pyspark, SQL Server Management Studio, Visual Studio, Power BI

Los Angeles World Airport - Los Angeles, CA

Python Developer/Data Engineer Nov 2017 - Oct 2019

Los Angeles World Airports is the airport authority that owns and operates Los Angeles International Airport and Van Nuys Airport for the city of Los Angeles, California. LAWA also owns and manages aviation-related property near the Palmdale Regional Airport.

Responsibilities

Created action filters, calculations, parameters, and calculated sets for preparing dashboards and worksheets.

Designed and developed Power BI graphical and visualization solutions with business requirement documents and plans for creating interactive dashboards and developed various DAX measures in Power BI to satisfy business needs.

Transformed stored data by writing Spark jobs based on business requirements.

Developed SSRS reports, SSIS packages, and data pipelines using Kafka for ETL jobs to optimize the TensorFlow model, and implemented efficient Hive SQL and Spark SQL.

Collaborated with data engineers and operation team to implement ETL process, wrote and optimized SQL queries to perform data extraction to fit analytical requirements.

Performed preliminary data analysis using descriptive statistics and handled anomalies such as removing duplicates and imputing missing values.
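
The two anomaly-handling steps mentioned above, deduplication and mean imputation, can be sketched in plain Python (in practice these map to pandas `drop_duplicates` and `fillna`; the field names here are illustrative):

```python
def dedupe(rows: list, key: str = "id") -> list:
    """Keep the first row seen for each key value, drop later duplicates."""
    seen, out = set(), []
    for r in rows:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out

def impute_mean(rows: list, field: str) -> list:
    """Replace missing numeric values with the mean of the observed ones."""
    vals = [r[field] for r in rows if r[field] is not None]
    mean = sum(vals) / len(vals) if vals else None
    return [{**r, field: mean if r[field] is None else r[field]} for r in rows]
```

Mean imputation is a simple default; median imputation is often preferred when the field is skewed.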

Performed data mining on large datasets (mostly structured) and raw text data using different Data Exploration techniques.

Developed Spark applications using Pyspark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analysing & transforming the data to uncover insights into the customer usage patterns.

Build and maintain Snowflake databases, including data modelling, partitioning, and clustering for optimized performance.

Wrote ETL jobs to read from web APIs using REST and HTTP calls and loaded into HDFS using Python.

Built Data Integration and Workflow Solutions for data warehousing using SQL Server Integration Service and OLAP.

Environment: Python, Azure PaaS services, Power BI, SSRS, T-SQL, Spark SQL, U-SQL, SOAP APIs, REST APIs, PySpark, JSON, Azure Data Factory, Azure Databricks

Adani Group

Software Engineer Jan 2014 - Aug 2017

Adani Group is an Indian multinational conglomerate, headquartered in Ahmedabad. Founded by Gautam Adani in 1988 as a commodity trading business, the Group's businesses include port management, electric power generation and transmission, renewable energy, mining, airport operations, natural gas, food processing, and infrastructure.

Responsibilities

Designed and developed data migration processes to move data from on-premises systems to AWS storage buckets using Python scripts.

Managed data from multiple sources, including structured and unstructured data, and maintained the Hadoop Distributed File System (HDFS).

Partnered with ETL developers to ensure that data is well cleaned and the data warehouse is up to date for reporting purposes using Pig.

Developed parser and loader MapReduce applications to retrieve data from HDFS and store it in HBase and Hive.

Involved in the complete Big Data flow of the application, from upstream data ingestion into HDFS through processing and analysing the data in HDFS.

Applied machine learning algorithms like Random Forest and Decision Tree to predict demand for peak seasons, allowing the company to allocate facilities to match demand.

Used Python libraries, build predictive models using ML algorithms such as Linear Regression and Logistic Regression, Decision Tree, Random Forest, Neural Networks.

Optimized SQL queries for data extraction and merging, and converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala for improved performance.

Developed and maintained AWS Lambda functions for event-based triggers and assigned IAM roles for proper data access control.

Selected and exported data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and loaded the data into AWS Redshift.
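
The CSV-to-S3-to-Redshift step above can be sketched as two small helpers: serializing rows to CSV text for the S3 upload, and building the Redshift `COPY` statement that loads the file. Table, URI, and IAM role names are illustrative assumptions:

```python
import csv
import io

def rows_to_csv(rows: list, fields: list) -> str:
    """Serialize dict rows to CSV text, ready for an S3 put_object upload."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def redshift_copy_sql(table: str, s3_uri: str, iam_role: str) -> str:
    """Build a Redshift COPY statement for a headered CSV file in S3."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV IGNOREHEADER 1;"
    )
```

`COPY` loads directly and in parallel from S3, which is much faster than row-by-row `INSERT`s for bulk loads into Redshift.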

Streamlined data ingestion with an efficient ETL workflow and automated incremental data loads, reducing cloud storage costs by 30%.

Analysed pricing data and redesigned competitive ads pricing strategy based on predicted metrics.

Deployed applications on AWS EC2 and worked on AWS infrastructure and automation using Docker containers and CI/CD for improved deployment efficiency.

Environment: Python, AWS, Power BI, Spark SQL, REST APIs, Hive, Oozie, Pig, PySpark, JSON, Azure Data Factory, Azure Databricks


