
Data Engineer Google Cloud

Location:
Charlotte, NC
Posted:
March 02, 2025


Resume:

VAMSI KRISHNA PARUCHOORI

*** -***-****

**********@*****.***

"linkedin.com/in/vamsi1617”

Professional Summary:

I am an experienced Data Engineer with over 8 years in the IT industry. I specialize in designing and developing cloud-native data solutions for the healthcare, banking, and insurance sectors. I have extensive experience working on the Google Cloud Platform (GCP) and possess deep knowledge of ETL pipelines, data processing architectures, and data governance.

• Data Engineer with hands-on experience installing and configuring Hadoop ecosystem components such as MapReduce, HDFS, HBase, ZooKeeper, Hive, Sqoop, Pig, Flume, Cassandra, Kafka, Flink, CouchDB, and Spark.

• Proficient in developing batch and streaming data pipelines using Google Cloud Dataflow (Apache Beam) and PySpark, orchestrated through Cloud Composer (Apache Airflow); a minimal orchestration sketch appears at the end of this summary.

• Experience in creating and executing data pipelines on the GCP, AWS, and Azure platforms.

• Expertise in migrating on-premises data to GCP using Pub/Sub, Cloud Functions, and Dataflow for seamless real-time and batch data processing.

• Extensive experience with GCP services, including BigQuery, Firestore, GCP ML, Cloud SQL, Datastore, Spanner, and Looker, designing and implementing scalable, high-performance data architectures for complex analytics and reporting needs.

• Skilled in using the Google Cloud Healthcare API to process and integrate healthcare data in FHIR and HL7 formats, ensuring secure and compliant data exchange.

• Experience in implementing Dataform for SQL-based transformations and orchestration within BigQuery.

• Skilled in shell scripting and Python libraries such as Pandas and NumPy, with extensive knowledge of SQL for data extraction, transformation, and analysis.

• Working knowledge of Core Java and the Java EE platform, including Servlets, Groovy, JSP, microservices, JDBC, multithreading, Hibernate, Spring MVC, and Spring Boot.

• Hands-on experience in GCP with BigQuery, GCS, Cloud Composer, Vertex AI, Cloud Spanner, Cloud Functions, Cloud Dataflow, Cloud Run functions, Data Fusion, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, and Dataproc.

• Experience in developing big data applications and services on the Amazon Web Services (AWS) platform using EMR, Kinesis, Glue, Lambda, S3, EC2, CloudWatch, and cloud computing with AWS Redshift.

• Set up and built AWS infrastructure using resources such as VPC, EC2, S3, EMR, Kinesis, IAM, EBS, Simple Notification Service (SNS), Simple Queue Service (SQS), security groups, Auto Scaling, and RDS via CloudFormation JSON templates.

• Experience in data integration and data warehousing using ETL tools such as Informatica PowerCenter, AWS Glue, SQL Server Integration Services (SSIS), and Talend.

• Integrated and managed Apache Kafka for real-time data streaming and event-driven architectures.

• Experience with software development tools such as JIRA, Bitbucket, Git, GitHub, and SVN.

• Strong focus on data governance, security, and compliance, leveraging Google Dataplex, IAM, Cloud Key Management Service, Cloud DLP, and audit logs to ensure data lineage, access control, encryption, and regulatory compliance (e.g., HIPAA).

• Worked closely with business stakeholders to gather requirements and deliver robust, data-driven solutions using Looker for interactive dashboards and reporting.

• Experience developing storytelling dashboards and data analytics, designing reports with visualization solutions using Tableau Desktop, and publishing onto the Tableau Server.

• Created reports using visualizations such as Bar charts, Clustered Column charts, Waterfall Charts, Gauges, Pie Charts, Treemaps, etc., in Power BI.

• Flexible working with operating systems such as Unix/Linux (CentOS, Red Hat, Ubuntu) and Windows.
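As an illustration of the Composer-orchestrated Dataflow pipelines noted in this summary, the following is a minimal sketch of an Airflow DAG that launches a Dataflow template on a daily schedule. It assumes the apache-airflow-providers-google package is installed; the DAG name, project, template path, and parameters are placeholders rather than actual pipeline objects.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)

with DAG(
    dag_id="daily_batch_ingest",          # placeholder DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Launch a Dataflow template with per-run parameters; the operator
    # submits the job and waits for it to reach a terminal state.
    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow_template",
        project_id="my-gcp-project",                       # placeholder project
        location="us-central1",
        template="gs://my-templates/ingest_template",      # placeholder template path
        parameters={
            "inputFilePattern": "gs://my-landing-bucket/raw/*.csv",  # placeholder
            "outputTable": "my-gcp-project:staging.raw_events",      # placeholder
        },
    )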

Certifications:

Google Cloud Certified - Professional Data Engineer [2025]

Google Cloud Certified - Cloud Engineer [2025]

AWS Certified Data Engineer - Associate [2024]

Education:

Bachelor’s in Computer Science from JNTUK, India, 2016.

TECHNICAL SKILLS:

Big Data Technologies: Kafka, Java, Apache Spark, HDFS, YARN, MapReduce, Hive, HBase, CouchDB, Cassandra

Cloud Services: GCP (BigQuery, Dataflow, Dataproc, Pub/Sub, Cloud Composer, Cloud Data Fusion, Cloud Functions); AWS (EMR, EC2, EBS, RDS, S3, Glue, Elasticsearch, Lambda, Kinesis, SQS, DynamoDB, Redshift, ECS); Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB, Azure DevOps, Active Directory)

Programming Languages: Python, SQL, Scala, Java

Data Processing &

Pipelines

Cloud Dataflow, Data Proc, Pub/Sub, Cloud Composer, Apache Airflow, Cloud Functions, Cloud Workflow

Data Management &

Analytics

Google Big Query, Spanner, Cloud SQL, Data plex, Pandas and Data Frames, PySpark.

Databases: Snowflake, MS SQL Server, Oracle, MySQL, PostgreSQL, DB2, DBT

Reporting Tools/ETL Tools: IICS, Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI, Qlik, Looker Studio

Methodologies: Agile/Scrum, Waterfall.

Others: Machine learning, NLP, Airflow, Jupyter Notebook, Microservices, Docker, Kubernetes, Jenkins, Jira.

Leadership Skills:

Strong expertise in leading teams and developers to meet stakeholders' business demands. Focused on scalable technology tools and on leveraging resources based on each project's needs. Mentored team members in designing and optimizing Dataflow pipelines, enabling skill development in data streaming and transformation.

Provided architectural guidance on building real-time data pipelines, ensuring optimal use of Cloud services.

Professional Experience:

Client: Mayo Clinic, US Dec 2022 – Present

Location: Remote, USA

Role: GCP Data Engineer

Mayo Clinic is a renowned nonprofit American academic medical center focused on integrated clinical practice, education, and research. It provides data-driven healthcare solutions through its unified analytics platform, consolidating data from various sources, including patient electronic health records (EHR), clinical trials, genomics, radiology, pathology, and public health databases. This platform ingests data from multiple healthcare systems, research initiatives, and patient management systems, transforming it according to clinical guidelines, research protocols, and patient care standards.

Requirement Gathering (Functional & Non-Functional): Documented functional requirements outlining features and functionalities. Defined non-functional requirements such as performance, security, and usability criteria. Obtained client sign-off on the documented requirements to ensure a shared understanding, prepared a high-level roadmap for the project, and checked its fit against the team's capacity.

RESPONSIBILITIES:

• Designed and developed Cloud Dataflow (Apache Beam) jobs using the Java and Python SDKs, packaged as microservices, to build ELT pipelines that extracted conversation logs from Google Dialogflow and loaded them into BigQuery for further processing.

• Ensured near real-time data ingestion by orchestrating data workflows with Cloud Composer (Apache Airflow), seamlessly integrating GCP services such as Pub/Sub and Dataflow to handle large-scale unstructured data and persist it into BigQuery (a streaming pipeline sketch appears at the end of this section). Transformed raw conversation logs into structured datasets to generate KPIs, including response time, customer satisfaction, escalation rates, and conversation abandonment metrics.

• Transferred patient information, customer transaction details, and insurance claim data from SQL Server to a GCS bucket.

• Utilized GCP IAM roles to establish connectivity between on-premises systems and GCP.

• Set up Dataflow for both batch processing and data streaming on GCP.

• Loaded data from the GCS bucket into a temporary staging table for downstream processing.

• Extensively used performance-tuning techniques while loading data into BigQuery using IICS.

• Designed and implemented the various layers of the data lake and designed star schemas in BigQuery.

• Used Cloud Functions (Python) to load data into BigQuery upon the arrival of CSV files in the GCS bucket; a minimal loader sketch follows this list.

• Created a data pipeline using Dataflow to import UDF and Parquet files and load the data into BigQuery.

• Automated data extraction and transformation processes on GCP with Impala, improving data pipeline efficiency and reliability.

• Designed, developed, and implemented ETL processes using IICS Data Integration.

• Created and optimized Hive queries on GCP to efficiently process large datasets.

• Developed data transformation scripts using Python and PySpark to clean and aggregate data for analysis.

• Solid understanding of Dataproc cluster creation, management, and configuration on GCP.

• Created real-time data streaming pipelines using GCP Pub/Sub and Dataflow.

• Created Dataflow templates so that jobs can be triggered automatically using Apache Beam.

• Worked with GCP services such as Cloud Storage, Compute Engine, App Engine, Cloud SQL, Cloud Functions, Storage Transfer Service, Cloud Run, Cloud Dataflow, Dataform, Data Fusion, Cloud Spanner, Cloud Bigtable, and Confluent Kafka to process data for downstream customers.

• Configured Looker to connect to GCP data sources such as BigQuery, Cloud Storage, and other relevant services.
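The GCS-triggered loader mentioned in this list could look roughly like the sketch below: a background (1st gen) Cloud Function that loads each newly arrived CSV into BigQuery using the google-cloud-bigquery client. The target table and schema handling are assumptions for illustration.

from google.cloud import bigquery

BQ_TABLE = "my-gcp-project.staging.patient_raw"   # placeholder target table


def load_csv_to_bq(event, context):
    """Background Cloud Function triggered by a GCS object-finalize event."""
    bucket = event["bucket"]
    name = event["name"]
    if not name.endswith(".csv"):
        return  # ignore non-CSV objects

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                  # infer the schema from the file
        write_disposition="WRITE_APPEND",
    )
    uri = f"gs://{bucket}/{name}"
    load_job = client.load_table_from_uri(uri, BQ_TABLE, job_config=job_config)
    load_job.result()                     # block until the load job finishes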

ENVIRONMENT: Java, Python, Google BigQuery, Microservices, IICS, Google Dialogflow, GCS, GKE, Google Cloud Storage, Cloud Dataflow, Cloud Spanner, Snowflake, Cloud Pub/Sub, DBT, Cloud Functions, Cloud Run, CouchDB, Cloud Composer, Workflows, Apache Airflow, Looker Studio, Terraform, Cloud Build, Cloud Repository, GitLab, GitHub
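The streaming ingestion described in this section could be sketched as the following Dataflow pipeline (Apache Beam, Python SDK), which reads JSON conversation events from Pub/Sub and appends them to BigQuery. The subscription, table, and field names are placeholder assumptions, not the actual Dialogflow schema.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_event(message: bytes) -> dict:
    """Decode a Pub/Sub payload into a flat dict matching the target schema."""
    event = json.loads(message.decode("utf-8"))
    return {
        "session_id": event.get("session_id"),
        "intent": event.get("intent"),
        "response_ms": event.get("response_ms"),
    }


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-gcp-project/subscriptions/dialog-logs"  # placeholder
            )
            | "ParseJson" >> beam.Map(parse_event)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-gcp-project:analytics.conversation_events",  # placeholder; table must already exist
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()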

Client: HCL Tech (Citi Bank) Mar 2021 – Nov 2022

Location: Chennai, INDIA

Role: GCP Data Engineer

Citi Bank provides a broad range of consumer, business, and institutional banking and wealth management services through a portfolio of financial services brands and businesses. Citi delivers a broad range of financial services to commercial, corporate, institutional, and government customers operating in, and with connections established across, the world.

RESPONSIBILITIES:

• Experience building multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks among the team.

• Used Dataform for complex transformations, reading data from the Bronze layer in BigQuery, writing it to the Silver layer, and applying further transformations to write it to the Gold layer.

• Designed and coordinated with the data science team in implementing advanced analytical models on Compute Engine and BigQuery over large datasets.

• Used Spark DataFrames and ETL tools such as DBT for data mapping, transformation, and loading in complex, high-volume environments.

• Extensive use of the Cloud SDK in GCP Cloud Shell to configure and deploy services such as Dataproc, Cloud Storage, and BigQuery.

• Ingested data from various sources, including CSV files, PDFs, storage containers, and streaming data, transforming and processing the data for meaningful insights.

• Wrote and maintained programs in Python notebooks and SQL for data extraction, cleaning, transformation, and processing to ensure quality and accuracy in the pipeline.

• Managed external sources using Cloud Spanner and scheduled workloads using Cloud Composer.

• Managed writing of SQL queries and schema evolution to develop the code using Cloud Spanner.

• Worked on developing streamlined workflows using high-performance API services dealing with large amounts of structured and unstructured data.

• Developed Spark jobs in Python to perform data transformations, creating DataFrames and using Spark SQL (see the PySpark sketch at the end of this section).

• Worked with data serialization formats, converting complex objects into byte sequences using Parquet, Avro, JSON, and CSV.

• Used Apache Airflow in the GCP Composer environment to build data pipelines and explored various Airflow operators, such as the Bash operator, Hadoop operators, and branching operators.

• Worked on scheduling the Control-M workflow engine to run multiple jobs.

• Coordinated with the team and developed a framework to generate daily ad-hoc reports and extracts from enterprise data in BigQuery (a sketch of such an extract follows this section).

• Created and deployed Kubernetes clusters on Google Cloud, created Docker images, and pushed them to the Google Cloud container registry using Jenkins.

• Automated data extraction, transformation, and loading processes using Python and Shell scripting

• Built business applications and data marts for reporting; involved in different phases of development, including analysis, design, coding, unit testing, integration testing, review, and release, per the business requirements.

• Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.

Environment: Kafka, Spark, GCP, Cloud SQL, GKE, Cloud Composer, Cloud Spanner, IICS, Snowflake, Dataflow, Data Fusion, Apache Airflow, Apache Beam, Cloud Shell, BigQuery, Tableau, SQL, Python, Hive, CouchDB, Spark SQL, MongoDB, TensorFlow, Jira.
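A brief PySpark sketch of the transformation work described in this section: read raw Parquet, clean it with DataFrame operations, and aggregate through Spark SQL. The bucket paths, layer names, and column names are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("txn_transform").getOrCreate()

# Read raw (Bronze-layer) transactions; Avro or CSV sources work similarly.
raw = spark.read.parquet("gs://my-bronze-bucket/transactions/")   # placeholder path

cleaned = (
    raw.dropDuplicates(["txn_id"])
       .filter(F.col("amount").isNotNull())
       .withColumn("txn_date", F.to_date("txn_ts"))
)

cleaned.createOrReplaceTempView("transactions")

# Aggregate with Spark SQL and write a Silver-layer output.
daily_totals = spark.sql("""
    SELECT txn_date, account_id, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY txn_date, account_id
""")

daily_totals.write.mode("overwrite").parquet("gs://my-silver-bucket/daily_totals/")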
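The daily ad-hoc extracts mentioned above could be generated with the google-cloud-bigquery client roughly as follows; the project, table, query, and output file are assumptions for illustration, not the actual reporting framework.

from datetime import date

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")   # placeholder project

QUERY = """
    SELECT report_date, branch_id, COUNT(*) AS txn_count
    FROM `my-gcp-project.gold.transactions`           -- placeholder table
    WHERE report_date = @report_date
    GROUP BY report_date, branch_id
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("report_date", "DATE", date(2022, 6, 1)),
    ]
)

# Run the parameterized query and write a simple CSV extract.
rows = client.query(QUERY, job_config=job_config).result()
with open("daily_extract.csv", "w") as out:
    out.write("report_date,branch_id,txn_count\n")
    for row in rows:
        out.write(f"{row.report_date},{row.branch_id},{row.txn_count}\n")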

Neo Soft

Client: Aditya Birla Insurance Mar 2019 to Feb 2021

Role: AWS Data Engineer

Description:

Aditya Birla Insurance refers to the insurance products and services offered by the ABC Group (ABCG), the financial services arm of the Aditya Birla Group, one of India’s largest and most diversified conglomerates. The company provides a comprehensive suite of financial solutions across various sectors, including insurance, asset management, lending, and advisory services. ABCG plays a significant role in helping individuals and businesses manage their finances and achieve their financial goals.

Responsibilities:

• Experience loading data into Spark RDDs and performing advanced procedures such as text analytics using Spark's in-memory computation capabilities, generating the output response in Scala.

• Wrote Terraform scripts to automate AWS services, including ELB, CloudFront distributions, RDS, EC2, database security groups, Route 53, VPC, IAM, EBS, Lambda, ECS Fargate, API Gateway, subnets, security groups, Glue, Auto Scaling, Kinesis, EMR, and S3 buckets, alongside CloudFormation JSON templates, and converted existing AWS infrastructure to AWS Lambda deployments managed via Terraform and CloudFormation.

• Designed and developed several complex mappings using various transformations such as Source Qualifier, Aggregator, Router, Joiner, Union, Expression, Lookup, Filter, Update Strategy, Stored Procedure, and Sequence Generator.

• Used Spark SQL in Scala to read Parquet and Avro data and load Hive tables into Spark (a Python sketch of this pattern follows this section).

• Developed Talend big data jobs to load heavy volumes of data into the S3 data lake and then into Snowflake.

• Performed loads into a Snowflake instance using the Snowflake connector in IICS for a separate project supporting data analytics and insight use cases for the sales team.

• Worked in AWS environment for development and deployment of custom Hadoop applications.

• Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.

• Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.

• Built machine learning models to showcase big data capabilities using Spark and MLlib.

• Experience developing big data applications and services on the Amazon Web Services (AWS) platform, including EMR, Kinesis, Glue, Lambda, S3, EC2, CloudWatch, and cloud data warehousing with Redshift.

Environment: AWS, S3, Spark Core, Spark Streaming, SQL, Jira, Lambda, ECS, Scala, DBT, IICS, AWS Kinesis, Apache Airflow, VPC, Python, Kafka, Hive, EC2, Elasticsearch, Impala, Cassandra, Tableau, Talend, ETL, Linux
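A Python sketch of the Spark SQL pattern referenced above (the original work used Scala): read Parquet from S3, register a temporary view, query it, and save the result as a table. The bucket, columns, and table names are placeholders, and the s3a connector (hadoop-aws) plus Hive support are assumed to be configured.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("s3_claims_sql")
    .enableHiveSupport()        # assumes a Hive metastore is available
    .getOrCreate()
)

# Read serialized claim records from the S3 data lake (Avro reads work similarly).
claims = spark.read.parquet("s3a://my-datalake/claims/2020/")   # placeholder path

claims.createOrReplaceTempView("claims")

open_claims_by_product = spark.sql("""
    SELECT product_code, COUNT(*) AS open_claims
    FROM claims
    WHERE status = 'OPEN'
    GROUP BY product_code
    ORDER BY open_claims DESC
""")

# Persist the aggregate as a managed table for downstream reporting
# (assumes the "analytics" database already exists).
open_claims_by_product.write.mode("overwrite").saveAsTable("analytics.open_claims_by_product")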

Client: United Health Group May 2016 to Feb 2019

Location: Hyderabad, INDIA

Role: Hadoop/Spark

Description:

UnitedHealth Group is one of the largest and most diversified healthcare companies in the world, providing healthcare benefits and insurance services to millions of people. It offers coverage across a range of segments, including employer-based health insurance, Medicare, Medicaid, and international health plans.

UnitedHealth Group leverages data and analytics through Optum to improve healthcare delivery and reduce costs, making it a leader in health IT.

Responsibilities:

• Worked on comprehending and examining client and business requirements.

• Imported and exported data between HDFS and databases such as MySQL, Oracle, Netezza, Teradata, and DB2 using Sqoop and Talend.

• Familiar with data architecture, including data ingestion pipeline design, Hadoop architecture, data modeling and mining, machine learning, and advanced data processing

• Strong understanding of NameNode, DataNode, ResourceManager, NodeManager, ApplicationMaster, YARN, and MapReduce concepts (a Hadoop Streaming sketch follows this section).

• Optimized SQL queries and database performance for faster data retrieval and processing

• Migrated an existing on-premises application to AWS, used AWS services such as EC2 and S3 for small-data-set processing and storage, and maintained the Hadoop cluster on AWS EMR.

• Experience in optimizing ETL workflows.

• Good experience working with SerDes, such as Avro Format and Parquet format data.

• Good experience with Hadoop data warehousing tools such as Hive and Pig, and involved in extracting data from these tools on the cluster using Sqoop.

• Used a CI/CD service to build, test, and deploy Python applications.

• Skilled in executing programming code for intermediate to complex modules following development standards, and in planning and conducting code reviews for changes and enhancements to ensure standards compliance and systems interoperability.

Environment: Python 3, Hadoop, HDFS, Big Table, IICS, Talend, Scala, Hive, HBase, Zookeeper, Sqoop, CouchDB, MySQL, MapReduce, Tableau, Cassandra, YARN, Jira, JSON, Pig scripts, Apache Spark, Linux, Git, Amazon S3, Jenkins, MongoDB, T-SQL, Eclipse.
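To illustrate the MapReduce concepts listed above, here is a minimal Hadoop Streaming pair in Python: a mapper that emits (member_id, 1) for each claim record and a reducer that sums the counts per member. The pipe-delimited layout and field positions are hypothetical.

# mapper.py -- emit (member_id, 1) for every claim record on stdin.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("|")
    if len(fields) > 1:
        member_id = fields[1]          # hypothetical position of member_id
        print(f"{member_id}\t1")

# reducer.py -- sum counts per member_id; Hadoop sorts mapper output by key.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{count}")
        current_key, count = key, 0
    count += int(value)

if current_key is not None:
    print(f"{current_key}\t{count}")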


