Vishal Halale
E-mail: *****.*.***@*****.***
Mobile : 916-***-****
Summary
● 9+ years of IT experience, including relevant experience in designing and developing Big Data/Hadoop applications.
● Excellent understanding of Hadoop architecture and components such as Spark, Hive, Kafka, Sqoop, HDFS, YARN, and High Availability.
● Expertise in using Spark RDD transformations and actions to process large-scale structured and unstructured data sets, including filtering, mapping, reducing, grouping, and aggregating data (see the sketch after this summary).
● Skilled in using Spark RDD persistency and caching mechanisms to reduce data processing overhead and improve query performance.
● Experienced in optimizing Spark RDD performance by tuning various configuration settings, such as memory allocation, caching, and serialization.
● Integrated Spark with data lakes such as AWS S3, HDFS, EMR, Google Cloud Storage, Azure Blob Storage.
● Developed large-scale distributed data pipelines using PySpark on AWS EMR, Google Dataproc, and Azure HDInsight.
● Used AWS Step Functions, Google Cloud Run, and Azure Data Factory to orchestrate PySpark workflows and automate data pipelines.
● Debugged PySpark code to resolve errors and improve efficiency on Spark jobs.
● Integrated Spark with messaging systems like Kafka and AWS Kinesis.
● Proficient in handling hive partitions and buckets with respect to the business requirement.
● Experienced in efficiently using Hive managed and external tables with respect to the business requirement.
● Knowledge of Hive table formats, including ORC, Parquet, and Avro, and their advantages and disadvantages for different use cases.
● Proficient in optimizing Hive query performance by tuning configuration settings such as parallelism, vectorization, and the execution engine.
● Skilled in writing SQL, HiveQL, and Spark SQL queries for data analysis.
● Experienced in importing large datasets to Hadoop environment from relational databases using Sqoop.
● Skilled in configuring Sqoop jobs for incremental data transfers using Sqoop's incremental import feature.
● Experienced in using Sqoop to import data from on-premises systems to cloud-based data storage services such as Amazon S3.
● Experience in working with version control systems like Git and using the source code management tool GitHub.
● Good knowledge of Linux and UNIX administration.
● Good team player with excellent interpersonal and leadership skills.
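The following is a minimal, illustrative PySpark sketch of the Spark RDD usage summarized above (filtering, mapping, aggregating, and caching); the sample records and field layout are invented for the example and are not taken from any specific project.

```python
# Minimal PySpark RDD sketch: filter, map, reduceByKey, and caching.
# The records and field meanings below are placeholders for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Example unstructured records: "<user> <event> <bytes>"
lines = sc.parallelize([
    "u1 click 120",
    "u2 view 300",
    "u1 click 80",
])

# Transformations: parse, filter, and aggregate bytes per user.
parsed = lines.map(lambda l: l.split())
clicks = parsed.filter(lambda f: f[1] == "click")
bytes_per_user = clicks.map(lambda f: (f[0], int(f[2]))).reduceByKey(lambda a, b: a + b)

# Cache the aggregated RDD so repeated actions avoid recomputation.
bytes_per_user.cache()
print(bytes_per_user.collect())   # first action triggers computation
print(bytes_per_user.count())     # second action reuses the cached data

spark.stop()
```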
Technical Skills:
Big Data Technologies : Spark, HDFS, YARN, Hive, Sqoop, Kafka, Hue, ZooKeeper
Cloud : AWS, GCP, Azure
AWS Services : EC2, S3, EMR, Athena, Kinesis, Step Functions
Azure Services : Blob Storage, Virtual Machines, HDInsight, Azure Data Factory
GCP Services : Cloud Storage, Compute Engine, Dataproc, Cloud Run, Cloud Functions
Distribution : Cloudera, Hortonworks
Databases : MySQL
Scripting : Shell Scripting, Python
Operating Systems : Linux, Unix, Windows
Monitoring Tools : Zenoss, Grafana, PagerDuty
Languages : PySpark, Scala, Unix/Linux shell, HiveQL, SQL
Central Repo : GitHub
Ticketing Tool : SonarX, Jira
Professional Experience
Client : Yahoo Inc, California (Oct 2022 to Present)
Role : Big Data Engineer
Responsibilities
● Developed large-scale distributed data pipelines using PySpark on AWS EMR (see the sketch at the end of this role).
● Integrated AWS S3 with PySpark jobs to handle large datasets in a distributed environment.
● Used AWS Step Functions to orchestrate PySpark workflows and automate data pipelines.
● Configured Spark jobs on AWS EMR to efficiently read and write data from AWS S3.
● Debugged Spark jobs running on AWS EMR, identifying performance bottlenecks.
● Built end-to-end PySpark pipelines on AWS EMR, reading data from AWS S3, Snowflake, and web APIs, and writing the results to S3 and Elasticsearch.
● Used AWS Step Functions to validate PySpark py-files before execution.
● Used Spark RDD transformations and actions to process large-scale structured and unstructured data sets, including filtering, mapping, reducing, grouping, and aggregating data.
● Implemented Spark DataFrame persistency and caching mechanisms to reduce data processing overhead and improve query performance.
● Used Hive and AWS Athena to perform SQL queries on datasets processed by Spark on AWS EMR.
● Managed on-premises Hive tables.
● Tuned Hive query performance by adjusting configurations such as parallelism and vectorization.
● Configured Sqoop incremental jobs to import incremental data from RDBMS into Hive.
Technologies: AWS S3, AWS EMR, AWS EC2, AWS Athena, PySpark, HDFS, Hadoop, Hive, Sqoop, Linux Bash & Python Scripting, Git.
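Below is a hedged sketch of the S3-in/S3-out PySpark pattern from the role above; bucket names, paths, and column names are placeholders, not actual project values.

```python
# Hypothetical sketch of a PySpark job on EMR that reads from and writes to S3.
# All bucket names, paths, and columns are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("emr-s3-pipeline").getOrCreate()

# On EMR the s3:// scheme is handled by EMRFS, so no extra client setup is usually needed.
raw = spark.read.option("header", "true").csv("s3://example-input-bucket/events/")

# Simple transformation: keep valid rows and aggregate events per day.
daily = (
    raw.filter(F.col("event_type").isNotNull())
       .groupBy("event_date")
       .agg(F.count("*").alias("events"))
)

# Write partitioned Parquet back to S3 for downstream Athena/Hive queries.
daily.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-output-bucket/daily_events/"
)

spark.stop()
```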
Client : Apple Inc, Bengaluru, India (May 2019 to Aug 2022)
Role : Big Data Engineer
Responsibilities
● Imported data from RDBMS into HDFS and Hive using Sqoop.
● Created and managed Hive tables, including managed, external, and partitioned tables.
● Utilized DataFrames for structured data manipulation and analysis.
● Optimized Spark jobs for performance and resource utilization.
● Implemented Spark SQL queries for data querying and aggregation (see the sketch after this list).
● Implemented Spark partitioning and caching strategies.
● Monitored Spark jobs using cluster management tools like YARN.
● Worked with Spark's data serialization formats (Avro, Parquet, JSON, etc.).
● Collaborated with DevOps teams for cluster provisioning and maintenance.
● Documented Spark workflows and best practices.
● Mentored junior Spark developers and engineers.
● Adjusted YARN queue priorities and capacities dynamically based on workload demands.
● Analyzed YARN job performance metrics and restructured queue hierarchies to reduce job wait times and prevent resource contention.
● Documented system configurations, procedures, and troubleshooting steps for streamlined team collaboration.
● Managed and optimized data storage using Google Cloud Storage, ensuring efficient data organization, access, and security.
● Provisioned and configured virtual machines on Google Compute Engine to meet project-specific computing requirements.
● Set up and customized Google Dataproc clusters, including cluster resizing and configuration tuning for optimal performance.
● Developed and deployed serverless applications using Google Cloud Functions, enabling cost-effective and scalable solutions.
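The sketch below illustrates the Hive and Spark SQL work described in this role; the database, table, column names, and output path are invented for the example.

```python
# Illustrative sketch of querying a partitioned Hive table with Spark SQL.
# sales_db.orders and its columns are hypothetical names, not real project objects.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-sparksql-sketch")
    .enableHiveSupport()          # lets Spark read managed/external Hive tables
    .getOrCreate()
)

# Filtering on the partition column allows partition pruning.
orders = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales_db.orders
    WHERE order_date = '2021-01-01'
    GROUP BY customer_id
""")

# Cache the aggregate if several downstream queries reuse it.
orders.cache()
orders.show(10)

# Persist results as Parquet, repartitioned to control output file counts.
orders.repartition(8).write.mode("overwrite").parquet("/tmp/orders_by_customer")

spark.stop()
```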
Technologies: GCP Cloud Storage, GCP Compute Engine, GCP Dataproc, GCP Cloud Functions, Linux, Cloudera Data Platform, HDFS, Hadoop, Spark, Spark SQL, Hive, Sqoop, Linux Bash Scripting, Git.
Client : Apple Inc, Bengaluru, India (Aug 2017 to Apr 2019)
Role : DevOps Engineer
Responsibilities
● Managed and supported 6+ Hadoop clusters with 500+ nodes and over 1 PB of storage, including cluster capacity planning, performance tuning, and monitoring.
● Hands-on experience commissioning, decommissioning, adding, and removing nodes in Hadoop clusters.
● Implemented SFTP for secure data transfer from external servers to Hadoop and utilized Sqoop for data import/export into HDFS.
● Tuned Spark applications for optimal batch interval and memory configuration; implemented Spark-Kafka streaming pipelines for real-time data processing (see the sketch after this list).
● Knowledgeable in Kafka partitioning, replication, and cluster setup for message distribution.
● Collaborated with engineering teams to design, build, and maintain large-scale distributed systems.
● Managed CI/CD pipelines, deployments, and source control systems (GIT); automated infrastructure monitoring and recovery using Ansible and Python.
● Ensured system health using tools like Grafana, Zenoss, Kibana, PagerDuty, and Splunk; performed RCA and communicated impacts to stakeholders during incidents.
● Worked with global teams to improve workflows, enhance application code for operational ease, and ensure high availability targets.
● Led troubleshooting, implemented fixes for SLA compliance, and contributed to scalability and performance enhancement initiatives.
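A minimal sketch of a Spark-Kafka streaming pipeline of the kind referenced above; the broker addresses, topic name, and output paths are placeholders, and it assumes the Spark Kafka connector package is already available on the cluster.

```python
# Hedged sketch: Spark Structured Streaming reading from Kafka and landing Parquet files.
# Brokers, topic, and paths below are placeholders, not real environment values.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

# Read a Kafka topic as a streaming DataFrame.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka delivers key/value as binary; cast the value to a string payload.
events = stream.select(F.col("value").cast("string").alias("payload"))

# Append each micro-batch to Parquet, with a checkpoint for fault tolerance.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/data/streams/events/")
    .option("checkpointLocation", "/data/streams/checkpoints/events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```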
Technologies: Linux, Cloudera Data Platforms, HDFS, Hadoop, Spark, Hive, Linux Bash Scripting, Zookeeper, Kafka, PagerDuty, Ansible, Git.
Client : Apple Inc, Bengaluru, India (Dec 2015 to Jul 2017)
Role : Production Support Engineer
Responsibilities
● Collaborated with System Admin, DBAs, and Network teams for change request activities, ensuring smooth implementation.
● Customized scripts for user data requests, fetching logs from servers and providing them in required formats (see the sketch after this list).
● Involved in capacity planning and estimating storage requirements to meet application needs.
● Conducted root cause analysis by reviewing application logs and data, implementing corrective and preventive actions.
● Evaluated logs to identify performance issues and recommended solutions for remediation.
● Participated in annual stress tests by monitoring server load and capacity under high data volume scenarios.
● Monitored server performance using Zenoss/Grafana, collaborating with teams to resolve issues.
● Managed production deployments and provided post-production support to address arising issues.
● Utilized Ansible for configuration management, creating and executing playbooks for service installation and configuration.
● Documented technical changes, procedures, and configurations on internal Wiki, assisting in knowledge transfer and training new team members.
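A simple illustration of the kind of log-collection scripting described above; the host names, remote log path, and use of key-based scp are assumptions made for the example.

```python
# Hypothetical log-collection helper: copy a log file from several hosts via scp.
# Hosts and paths are placeholders; SSH key access is assumed to be pre-configured.
import subprocess
from pathlib import Path

HOSTS = ["app01.example.com", "app02.example.com"]
REMOTE_LOG = "/var/log/app/app.log"
OUT_DIR = Path("collected_logs")
OUT_DIR.mkdir(exist_ok=True)

for host in HOSTS:
    dest = OUT_DIR / f"{host}.log"
    # Errors are reported per host but do not stop the remaining transfers.
    result = subprocess.run(
        ["scp", f"{host}:{REMOTE_LOG}", str(dest)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(f"failed to fetch logs from {host}: {result.stderr.strip()}")
    else:
        print(f"fetched {dest}")
```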
Technologies: Linux, Linux Bash Scripting, Jenkins, Ansible, SQL Developer, Git.
Education:
Bachelor of Engineering in Computer Science & Engineering
College : East West Institute of Technology (2011-2015)
University : Visvesvaraya Technological University