Vishal Halale
E-mail: *****.*.***@*****.***
Mobile : 916-***-****
Summary
● 9+ years of IT experience, including relevant experience in designing and developing Big Data/Hadoop applications.
● Excellent understanding of Hadoop architecture and components such as Spark, Hive, Kafka, Sqoop, HDFS, YARN, and High Availability.
● Expertise in using Spark RDD transformations and actions to process large-scale structured and unstructured data sets, including filtering, mapping, reducing, grouping, and aggregating data (see the sketch after this summary).
● Skilled in using Spark RDD persistency and caching mechanisms to reduce data processing overhead and improve query performance.
● Experienced in optimizing Spark RDD performance by tuning various configuration settings, such as memory allocation, caching, and serialization.
● Integrated Spark with data lakes such as AWS S3, HDFS, EMR, Google Cloud Storage, Azure Blob Storage.
● Developed large-scale distributed data pipelines using PySpark on AWS EMR, Google Dataproc, and Azure HDInsight.
● Used AWS Step Functions, Google Cloud Run, and Azure Data Factory to orchestrate PySpark workflows and automate data pipelines.
● Debugged PySpark code to resolve errors and improve efficiency on Spark jobs.
● Integrated Spark with messaging systems like Kafka and AWS Kinesis.
● Proficient in handling hive partitions and buckets with respect to the business requirement.
● Experienced in efficiently using Hive managed and external tables with respect to the business requirement.
● Knowledge of Hive table formats, including ORC, Parquet, and Avro, and their advantages and disadvantages for different use cases.
● Proficient in optimizing Hive query performance by tuning configuration settings such as parallelism, vectorization, and the execution engine.
● Skilled in writing SQL, HiveQL, and Spark SQL queries for data analysis.
● Experienced in importing large datasets to Hadoop environment from relational databases using Sqoop.
● Skilled in configuring Sqoop jobs for incremental data transfers using Sqoop's incremental import feature.
● Experienced in using Sqoop to import data from on-premises systems to cloud-based data storage services such as Amazon S3.
● Experience in working with version control systems like Git and using the source code management tool GitHub.
● Good knowledge of Linux and UNIX administration.
● Good team player with excellent interpersonal and leadership skills.
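The following is a minimal, illustrative PySpark sketch of the Spark RDD usage summarized above (filtering, mapping, aggregating, and caching); the sample records and field layout are invented for the example and are not taken from any specific project.

```python
# Minimal PySpark RDD sketch: filter, map, reduceByKey, and caching.
# The records and field meanings below are placeholders for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Example unstructured records: "<user> <event> <bytes>"
lines = sc.parallelize([
    "u1 click 120",
    "u2 view 300",
    "u1 click 80",
])

# Transformations: parse, filter, and aggregate bytes per user.
parsed = lines.map(lambda l: l.split())
clicks = parsed.filter(lambda f: f[1] == "click")
bytes_per_user = clicks.map(lambda f: (f[0], int(f[2]))).reduceByKey(lambda a, b: a + b)

# Cache the aggregated RDD so repeated actions avoid recomputation.
bytes_per_user.cache()
print(bytes_per_user.collect())   # first action triggers computation
print(bytes_per_user.count())     # second action reuses the cached data

spark.stop()
```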
Technical Skills:
Big Data Technologies : Spark, HDFS, YARN, Hive, Sqoop, Kafka, Hue, ZooKeeper
Cloud : AWS, GCP, Azure
AWS Services : EC2, S3, EMR, Athena, Kinesis, Step Functions
Azure Services : Blob Storage, Virtual Machines, HDInsight, Azure Data Factory
GCP Services : Cloud Storage, Compute Engine, Dataproc, Cloud Run, Cloud Functions
Distribution : Cloudera, Hortonworks
Databases : MySQL
Scripting : Shell Scripting, Python
Operating Systems : Linux, Unix, Windows
Monitoring Tools : Zenoss, Grafana, PagerDuty
Languages : PySpark, Scala, Unix/Linux shell, HiveQL, SQL
Central Repo : GitHub
Ticketing Tool : SonarX, Jira
Professional Experience
Client : Yahoo Inc, California (Oct 2022 to Present)
Role : Big Data Engineer
Responsibilities
● Developed large-scale distributed data pipelines using PySpark on AWS EMR (see the sketch at the end of this role).
● Integrated AWS S3 with PySpark jobs to handle large datasets in a distributed environment.
● Used AWS Step Functions to orchestrate PySpark workflows and automate data pipelines.
● Configured Spark jobs on AWS EMR to efficiently read and write data from AWS S3.
● Debugged Spark jobs running on AWS EMR, identifying performance bottlenecks.
● Built end-to-end PySpark pipelines on AWS EMR, reading data from AWS S3, Snowflake, and web APIs, and writing the results to S3 and Elasticsearch.
● Used AWS Step Functions to validate PySpark py-files before execution.
● Used Spark RDD transformations and actions to process large-scale structured and unstructured data sets, including filtering, mapping, reducing, grouping, and aggregating data.
● Implemented Spark DataFrame persistency and caching mechanisms to reduce data processing overhead and improve query performance.
● Used Hive and AWS Athena to perform SQL queries on datasets processed by Spark on AWS EMR.
● Managed on-premises Hive tables.
● Tuned Hive query performance by adjusting configurations such as parallelism and vectorization.
● Configured Sqoop incremental jobs to import incremental data from RDBMS into Hive.
Technologies: AWS S3, AWS EMR, AWS EC2, AWS Athena, PySpark, HDFS, Hadoop, Hive, Sqoop, Linux Bash & Python Scripting, Git.
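Below is a hedged sketch of the S3-in/S3-out PySpark pattern from the role above; bucket names, paths, and column names are placeholders, not actual project values.

```python
# Hypothetical sketch of a PySpark job on EMR that reads from and writes to S3.
# All bucket names, paths, and columns are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("emr-s3-pipeline").getOrCreate()

# On EMR the s3:// scheme is handled by EMRFS, so no extra client setup is usually needed.
raw = spark.read.option("header", "true").csv("s3://example-input-bucket/events/")

# Simple transformation: keep valid rows and aggregate events per day.
daily = (
    raw.filter(F.col("event_type").isNotNull())
       .groupBy("event_date")
       .agg(F.count("*").alias("events"))
)

# Write partitioned Parquet back to S3 for downstream Athena/Hive queries.
daily.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-output-bucket/daily_events/"
)

spark.stop()
```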
Client : Apple Inc, Bengaluru, India (May 2019 to Aug 2022)
Role : Big Data Engineer
Responsibilities
● Imported data from RDBMS into HDFS and Hive using Sqoop.
● Created and managed Hive tables, including managed, external, and partitioned tables.
● Utilized DataFrames for structured data manipulation and analysis.
● Optimized Spark jobs for performance and resource utilization.
● Implemented Spark SQL queries for data querying and aggregation (see the sketch after this list).
● Implemented Spark partitioning and caching strategies.
● Monitored Spark jobs using cluster management tools like YARN.
● Worked with Spark's data serialization formats (Avro, Parquet, JSON, etc.).
● Collaborated with DevOps teams for cluster provisioning and maintenance.
● Documented Spark workflows and best practices.
● Mentored junior Spark developers and engineers.
● Adjusted YARN queue priorities and capacities dynamically based on workload demands.
● Analyzed YARN job performance metrics and restructured queue hierarchies to reduce job wait times and prevent resource contention.
● Documented system configurations, procedures, and troubleshooting steps for streamlined team collaboration.
● Managed and optimized data storage using Google Cloud Storage, ensuring efficient data organization, access, and security.
● Provisioned and configured virtual machines on Google Compute Engine to meet project-specific computing requirements.
● Set up and customized Google Dataproc clusters, including cluster resizing and configuration tuning for optimal performance.
● Developed and deployed serverless applications using Google Cloud Functions, enabling cost-effective and scalable solutions.
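The sketch below illustrates the Hive and Spark SQL work described in this role; the database, table, column names, and output path are invented for the example.

```python
# Illustrative sketch of querying a partitioned Hive table with Spark SQL.
# sales_db.orders and its columns are hypothetical names, not real project objects.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-sparksql-sketch")
    .enableHiveSupport()          # lets Spark read managed/external Hive tables
    .getOrCreate()
)

# Filtering on the partition column allows partition pruning.
orders = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales_db.orders
    WHERE order_date = '2021-01-01'
    GROUP BY customer_id
""")

# Cache the aggregate if several downstream queries reuse it.
orders.cache()
orders.show(10)

# Persist results as Parquet, repartitioned to control output file counts.
orders.repartition(8).write.mode("overwrite").parquet("/tmp/orders_by_customer")

spark.stop()
```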
Technologies: GCP Cloud Storage, GCP Compute Engine, GCP Dataproc, GCP Cloud Functions, Linux, Cloudera Data Platform, HDFS, Hadoop, Spark, Spark SQL, Hive, Sqoop, Linux Bash Scripting, Git.
Client : Apple Inc, Bengaluru, India (Aug 2017 to Apr 2019)
Role : DevOps Engineer
Responsibilities
● Managed and supported 6+ Hadoop clusters with 500+ nodes and over 1 PB of storage, including cluster capacity planning, performance tuning, and monitoring.
● Hands-on experience commissioning, decommissioning, adding, and removing nodes in Hadoop clusters.
● Implemented SFTP for secure data transfer from external servers to Hadoop and utilized Sqoop for data import/export into HDFS.
● Tuned Spark applications for optimal batch interval and memory configuration; implemented Spark-Kafka streaming pipelines for real-time data processing (see the sketch after this list).
● Knowledgeable in Kafka partitioning, replication, and cluster setup for message distribution.
● Collaborated with engineering teams to design, build, and maintain large-scale distributed systems.
● Managed CI/CD pipelines, deployments, and source control systems (GIT); automated infrastructure monitoring and recovery using Ansible and Python.
● Ensured system health using tools like Grafana, Zenoss, Kibana, PagerDuty, and Splunk; performed RCA and communicated impacts to stakeholders during incidents.
● Worked with global teams to improve workflows, enhance application code for operational ease, and ensure high availability targets.
● Led troubleshooting, implemented fixes for SLA compliance, and contributed to scalability and performance enhancement initiatives.
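A minimal sketch of a Spark-Kafka streaming pipeline of the kind referenced above; the broker addresses, topic name, and output paths are placeholders, and it assumes the Spark Kafka connector package is already available on the cluster.

```python
# Hedged sketch: Spark Structured Streaming reading from Kafka and landing Parquet files.
# Brokers, topic, and paths below are placeholders, not real environment values.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

# Read a Kafka topic as a streaming DataFrame.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka delivers key/value as binary; cast the value to a string payload.
events = stream.select(F.col("value").cast("string").alias("payload"))

# Append each micro-batch to Parquet, with a checkpoint for fault tolerance.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/data/streams/events/")
    .option("checkpointLocation", "/data/streams/checkpoints/events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```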
Technologies: Linux, Cloudera Data Platforms, HDFS, Hadoop, Spark, Hive, Linux Bash Scripting, Zookeeper, Kafka, PagerDuty, Ansible, Git.
Client : Apple Inc, Bengaluru, India (Dec 2015 to Jul 2017)
Role : Production Support Engineer
Responsibilities
● Collaborated with System Admin, DBAs, and Network teams for change request activities, ensuring smooth implementation.
● Customized scripts for user data requests, fetching logs from servers and providing them in required formats (see the sketch after this list).
● Involved in capacity planning and estimating storage requirements to meet application needs.
● Conducted root cause analysis by reviewing application logs and data, implementing corrective and preventive actions.
● Evaluated logs to identify performance issues and recommended solutions for remediation.
● Participated in annual stress tests by monitoring server load and capacity under high data volume scenarios.
● Monitored server performance using Zenoss/Grafana, collaborating with teams to resolve issues.
● Managed production deployments and provided post-production support to address arising issues.
● Utilized Ansible for configuration management, creating and executing playbooks for service installation and configuration.
● Documented technical changes, procedures, and configurations on internal Wiki, assisting in knowledge transfer and training new team members.
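A simple illustration of the kind of log-collection scripting described above; the host names, remote log path, and use of key-based scp are assumptions made for the example.

```python
# Hypothetical log-collection helper: copy a log file from several hosts via scp.
# Hosts and paths are placeholders; SSH key access is assumed to be pre-configured.
import subprocess
from pathlib import Path

HOSTS = ["app01.example.com", "app02.example.com"]
REMOTE_LOG = "/var/log/app/app.log"
OUT_DIR = Path("collected_logs")
OUT_DIR.mkdir(exist_ok=True)

for host in HOSTS:
    dest = OUT_DIR / f"{host}.log"
    # Errors are reported per host but do not stop the remaining transfers.
    result = subprocess.run(
        ["scp", f"{host}:{REMOTE_LOG}", str(dest)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(f"failed to fetch logs from {host}: {result.stderr.strip()}")
    else:
        print(f"fetched {dest}")
```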
Technologies: Linux, Linux Bash Scripting, Jenkins, Ansible, SQL Developer, Git.
Education:
Bachelor of Engineering in Computer Science & Engineering
College : East West Institute of Technology (2011-2015)
University : Visvesvaraya Technological University