Data Analyst Engineer

Location:

United States

Posted:

May 02, 2024

Resume:

SAI PRIYA

Email Id: *********@*****.***

Contact Details: 262-***-****

Data Engineer

LinkedIn: https://www.linkedin.com/in/priya-s-b191a1231/

Professional Summary:

Over 9 years of experience in analyzing, designing, developing, and implementing data architectures and frameworks as a Data Engineer.

Designed and deployed data pipelines on GCP using Cloud Composer, Cloud Functions, and various GCP services.

Implemented effective monitoring and alerting systems using Stackdriver, reducing issue resolution time.

Developed end-to-end testing strategies ensuring data accuracy and completeness, minimizing processing errors.

Utilized DevOps practices and tools to automate GCP infrastructure, reducing deployment time and errors.

Developed and maintained CI/CD pipelines on GCP, enabling controlled code deployment and testing.

Implemented data versioning and lineage tracking using Data Catalog and Data Studio, ensuring data traceability.

Conducted capacity planning and scaling for GCP data pipelines using Kubernetes and Cloud Autoscaling.

Enhanced Spark processes for ODS, ingesting data, and validating for integrity before conversion to parquet format.

Led the team in developing business intelligence solutions and business models.

Proficient in working with Spark and Hadoop architecture, using Google Stackdriver for monitoring, and configuring alerts.

Contributed as a Senior Big Data Engineer, validating the DSE Graph database.

Developed enterprise Data Lake and performed tasks related to project oversight and enterprise architecture design.

Proficient in developing Spark applications for data extraction, movement, and transformation.

Responsible for designing data pipelines and estimating, monitoring, and troubleshooting Spark clusters.

Worked extensively with Azure Data Lake, Snowflake, and Airflow for ETL and data migration processes.

Utilized Agile Scrum Methodology to manage and organize a team of developers.

Developed scalable distributed data solutions using Hadoop technologies and Azure platforms.

Led multiple phases of data collection, cleaning, model development, and machine learning.

Implemented various machine learning algorithms and developed Big Data analytic solutions.

Developed data pipelines for ingesting customer behavioral data and financial histories into HDFS.

Proficient in analyzing and maintaining big data technologies, automation, and data transformation.

Developed real-time analytics pipelines using Confluent Kafka, Storm, and Greenplum for data transfer.

Assisted teams with SQL, MPP databases, and worked on data transformation pipelines using technologies such as Hive and Splunk.

Technical Skills:

Big Data Technologies

HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Storm, Flume, Spark, Apache Kafka, Zookeeper, Ambari, Oozie, MongoDB, Cassandra, Mahout, Puppet, Avro, Parquet, Snappy, Falcon.

NoSQL Databases

Postgres, HBase, Cassandra, MongoDB, Amazon DynamoDB, Redis

Hadoop Distributions

Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR, and Apache.

Languages

Scala, Python, R, XML, XHTML, HTML, AJAX, CSS, SQL, PL/SQL, HiveQL, Unix, Shell Scripting

Source Code Control

GitHub, CVS, SVN, ClearCase

Cloud Computing Tools

Google Cloud Platform (GCP), Amazon Web Services (AWS: S3, EMR, EC2, Lambda, VPC, Route 53, CloudWatch, CloudFront), Microsoft Azure.

Databases

Teradata, Snowflake, Microsoft SQL Server, MySQL, DB2

DB languages

MySQL, PL/SQL, PostgreSQL & Oracle

Build Tools

Jenkins, Maven, Ant, Log4j

Business Intelligence Tools

Tableau, Power BI

Development Tools

Eclipse, IntelliJ, Microsoft SQL Studio, Toad, NetBeans

ETL Tools

Talend, Pentaho, Informatica, Ab Initio, SSIS

Development Methodologies

Agile, Scrum, Waterfall, V model, Spiral, UML

EDUCATION: New Horizon College, August 2009 to March 2013, B.E., Computer Science Engineering

LINKEDIN: https://www.linkedin.com/in/sai-priya-b191a1231

Professional Experience:

Quotient, Mountain View, CA April 2022 to Present

Role: GCP Data Engineer

Responsibilities:

Built and deployed data pipelines using Cloud Composer and Cloud Functions, enabling seamless integration with other GCP services such as BigQuery, Pub/Sub, and Cloud Storage.

Implemented monitoring and alerting mechanisms using Stackdriver, enabling proactive issue identification and resolution in GCP data pipelines.

Designed and executed end-to-end testing strategies for GCP data pipelines, ensuring the accuracy and completeness of data from ingestion to analysis.

Utilized DevOps practices and tools such as Jenkins, Terraform, and Ansible to automate GCP infrastructure deployment and configuration, resulting in a 50% reduction in deployment time.

Worked with Python, SQL, and Bash scripts to develop custom data transformations and data quality rules, resulting in a 25% reduction in data processing errors.

Developed and maintained CI/CD pipelines on GCP using Cloud Build and Cloud Run, enabling seamless code deployment and testing in a controlled environment.

Implemented data versioning and lineage tracking using tools such as Data Catalog and Data Studio, enabling auditability and traceability of healthcare data in GCP.

Conducted capacity planning and scaling of GCP data pipelines using Kubernetes and Cloud Autoscaling, ensuring optimal performance and cost-efficiency.

Developed multi-cloud strategies to make better use of GCP for its PaaS offerings.

Designed and developed Spark jobs with Scala to implement end-to-end data pipelines for batch processing.

Developed a data pipeline using Flume, Kafka, and Spark Streaming to ingest data from weblog servers and apply transformations.

Developed data validation scripts in Hive and Spark and performed validation in Jupyter Notebook by spinning up a query cluster in EMR.

Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.

Developed a PySpark script to encrypt raw data by applying hashing algorithms to client-specified columns.

Developed Stored Procedures, Views, and Triggers, and was responsible for the design, development, and testing of the database.

Developed a Python-based API (RESTful web service) to track revenue and perform revenue analysis.

Environment: GCP, GCS Bucket, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, Cloud SQL, BigQuery, Cloud Dataproc, Cloud Composer, Talend for Big Data, Airflow, Hadoop, Hive, Teradata, SAS, Spark, Python, SQL Server, Kubernetes, Docker.

Truist Bank, Charlotte, NC September 2019 to March 2022

Role: Sr. Data Engineer

Responsibilities:

Developed and refined the Spark process for the ODS (Operational Data Store), improving the performance of data ingestion from the raw and refined zones through to publishing data to Postgres, using Python and PySpark.

Responsible for the ingestion of data from various APIs and writing modules to store data in S3 buckets.

Validated data fields from the refined zone to ensure the integrity of the published table.

Converted ingested data (CSV, XML, JSON) to compressed Parquet file format.

Created business models from business cases and enterprise architecture requirements for process monitoring, improvement, and reporting, and led the team in developing business intelligence solutions.

Performed transformations and actions on RDDs, DataFrames, and Datasets using Apache Spark.

Good Knowledge of Spark and Hadoop Architecture and experience in using Spark for data processing.

Hands-on experience using Google Stackdriver to monitor the logs of both GKE and GCP instances, and configured alerts in Stackdriver.

Worked as a Senior Big Data Engineer validating the DSE Graph database on Unix.

Worked as a Data Engineer on Amazon cloud services, building pipelines using AWS EMR Spark clusters and Cloud Dataflow on GCP.

Developed gsutil scripts for Gzip compression, backup, and transfer to the edge node, covering all file operations required for BigQuery load jobs.

Initiated and oversaw the project to build an enterprise Data Lake for a large energy company, with an architecture designed for system reliability and scalability.

Created various PL/SQL objects like Stored Procedures, Functions, and packages as per business requirements, for multiple stages like the Delta and Beta process (Client-defined custom process).

Worked with Hadoop configuration files for application logs (yarn-site.xml, yarn-default.xml, mapred-site.xml) and set up log-aggregation properties in them.

Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.

Consolidated data from multiple sources and performed resampling to handle imbalanced data.

Extensive knowledge of importing and exporting data between RDBMS (relational database management systems) and HDFS using Sqoop; also worked on Google Cloud Platform (GCP).

Considerable experience in data engineering using GCP and client/server architecture, along with web and Windows-based applications.

Wrote PL/SQL Stored Procedures, Functions, Packages, and Triggers to implement business rules in the application.

Environment: Spark 2.4, HBase 1.2, Tableau 10, Power BI, Python 2.7 and 3.4, Scala, PySpark, HDFS, Flume 1.6, Hive, Zeppelin, PostgreSQL, MySQL, TFS, Linux, Spark SQL, Kafka, NiFi, Sqoop 1.4.6, GCP.

Mayo Clinic, Rochester, MN May 2017 to August 2019

Role: Data Engineer

Responsibilities:

•Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns

•Developed batch data pipelines for data movement from on-prem Teradata/Vertica databases to staging Databricks tables

•Worked on creating data lineage/ data pipelines in Palantir by migrating data from Teradata/Vertica

•Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks

•Designed the solution and developed the program for data ingestion using Sqoop, MapReduce, shell scripts, and Python, and implemented Spark jobs in PySpark for faster testing and processing of data

•Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Spark SQL and U-SQL (Azure Data Lake Analytics)

•Responsible for estimating the cluster size, monitoring, and troubleshooting of the Spark Databricks cluster.

•Created Databricks notebooks using SQL and Python, and automated the notebooks using Airflow.

•Extensively used Azure Data Lake and worked with different file formats: CSV, Parquet, and Delta

•Designed and implemented a daily Delta Optimizer job using Databricks optimization

•Responsible for migrating tables from traditional RDBMS into Snowflake tables using Spark and later generating required visualizations and dashboards using Tableau.

•Ingested data into stage tables using the COPY functionality in Snowflake.

•Scheduled Snowflake data pipelines using the Airflow integration tool

•Designed and Developed ETL pipelines in and out of Snowflake using SnowSQL

•Updated mappings, sessions, and workflows as part of ETL changes, modified existing ETL code, and documented the changes.

•Tuned complex queries so that Snowflake executes them efficiently

•Utilized Agile Scrum Methodology to help manage and organize a team of 4 developers with regular code review sessions.

Yana Software Private Limited, Hyderabad, India November 2016 to February 2017

Role: Data Engineer

Responsibilities:

•Provided technical expertise in Hadoop technologies as they relate to the development of analytics

•Responsible for building scalable distributed data solutions using Big Data technologies like Apache Hadoop, MapReduce, Shell Scripting, Hive

•Involved in developing data ingestion pipelines on Azure Databricks Spark cluster using Azure Data Factory and Spark SQL

•Implemented various Azure platforms such as Azure SQL Database, Azure SQL Data Warehouse, Azure Analysis Services, HDInsight, Azure Data Lake, and Data Factory

•Developed ADF pipelines to load data from on-prem to Azure cloud storage and databases

•Created data integration and technical solutions for Azure Data Lake Analytics, Azure Data Lake Storage, Azure Data Factory, Azure SQL databases, and Azure SQL Data Warehouse for providing analytics

•Worked in Azure environment for development and deployment of Custom Hadoop Applications

•Wrote complex Hive queries to extract data from heterogeneous sources (Data Lake) and persist the data into HDFS

•Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and Databricks

•Involved in all phases of data mining, data collection, data cleaning, developing models, validation, and visualization

•Worked on machine learning on large-size data using Spark and MapReduce

•Designed and implemented scalable Cloud Data and Analytical architecture solutions for various public and private cloud platforms using Azure

•Designed and developed Big Data analytic solutions on a Hadoop-based platform and engaged clients in technical discussions

•Responsible for loading and transforming huge sets of structured, semi-structured, and unstructured data

•Developed Spark scripts using Python and Bash shell commands as required

•Implemented Security in Web Applications using Azure and deployed Web Applications to Azure

•Responsible for translating business and data requirements into logical data models in support of Enterprise data models, ODS, OLAP, OLTP, and Operational data structures

•Developed numerous MapReduce jobs in Python for data cleansing and analyzed data in Impala

•Applied various machine learning algorithms and statistical models, such as decision trees, logistic regression, and Gradient Boosting Machines, to build a predictive model using the scikit-learn package in Python

•Worked with Microsoft Azure Cloud Services, Storage Accounts, Azure data storage, and Azure Data Factory

•Designed Data Marts following Star Schema and Snowflake Schema methodologies, using industry-leading data modeling tools

•Responsible for importing and exporting data between sources such as MySQL and Teradata databases and HDFS using Sqoop, saving it in Avro, JSON, and ORC file formats

Environment: Hadoop 3.0, MapReduce, Hive 2.3, Agile, Databricks, Azure, HDFS, Spark 2.4, Python, SQL, HBase 1.2, OLAP, OLTP.

Avon Technologies Pvt Ltd, Hyderabad, India July 2013 to October 2015

Role: Hadoop Developer

Responsibilities:

Developed a data pipeline using Flume, Sqoop, Pig, and MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.

Analyzed the Hadoop cluster and various big data analytics tools, including Pig, Hive, HBase, and Sqoop.

Implemented PySpark jobs in Python, using DataFrames and temporary-table SQL for faster data processing.

Implemented a real-time analytics pipeline using Confluent Kafka, Storm, Elasticsearch, Splunk, and Greenplum.

Designed and developed Informatica BDE applications and Hive queries to ingest data into the landing raw zone, transform it with business logic into the refined zone, and load Greenplum data marts for the reporting layer consumed through Tableau.

Installed, configured, and maintained big data technologies and systems. Maintained documentation and troubleshooting playbooks.

Automated the installation and maintenance of Kafka, Storm, ZooKeeper, and Elasticsearch using SaltStack.

Developed connectors for Elasticsearch and Greenplum for data transfer from a Kafka topic. Performed data ingestion from multiple internal clients using Apache Kafka. Developed Kafka Streams applications in Java for real-time data processing.

Responded to and resolved access and performance issues. Used the Spark API over Hadoop to perform analytics on data in Hive.

Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and Spark on YARN.

Imported and exported data into HDFS and Hive using Sqoop and developed a POC on Apache Spark and Kafka. Proactively monitored performance and assisted in capacity planning.

Worked on the Oozie workflow engine for job scheduling. Imported and exported data into MapReduce and Hive using Sqoop.

Developed Spark applications in Python (PySpark) in a distributed environment to load a large number of CSV files with different schemas into Hive ORC tables.

Worked on reading and writing multiple data formats like JSON, ORC, and Parquet on HDFS using PySpark.

Performed transformations, cleaning, and filtering on imported data using Hive and MapReduce, and loaded the final data into HDFS. Good understanding of performance tuning with NoSQL, Kafka, Storm, and SQL technologies.

Designed and developed a framework to leverage platform capabilities using MapReduce and Hive UDFs.

Worked on data transformation pipelines using Storm. Worked with operational analytics and log management using ELK and Splunk. Assisted teams with SQL and MPP databases such as Greenplum.

Worked on SaltStack automation tools. Helped teams working with batch processing and tools in the Hadoop technology stack (MapReduce, YARN, Pig, Hive, HDFS).

Environment: Hadoop, MapReduce, HDFS, Hive, HBase, Sqoop, Pig, Flume, Oracle 11g/10g, DB2, Teradata, MySQL, Eclipse, PL/SQL, Java, Linux, Shell Scripting, SQL Developer, SOLR.
