
Senior Data Engineer

Location:
Alameda, CA, 94501
Posted:
September 04, 2024



Kaushal Harshad Thakar

Contact: (***) ***- ****; Email: **********@*****.***

Sr. Big Data Engineer

PROFILE SUMMARY

A results-oriented Big Data Engineer with 10+ years of progressive experience in information technology, including over 8 years specializing in Big Data development. Demonstrated expertise in leveraging the leading cloud platforms (AWS, Azure, GCP) to architect, develop, and implement scalable, high-performance data processing pipelines.

Key Strengths:

•Big Data Mastery: Proven track record with Hadoop ecosystems (Cloudera, Hortonworks, AWS EMR, Azure HDInsight, GCP Dataproc); proficient in data ingestion, extraction, and transformation using tools such as AWS Glue, AWS Lambda, and Azure Databricks.

•Cloud Platforms Excellence:

•AWS: Skilled in designing and optimizing data architectures using AWS services such as Redshift, Kinesis, Glue, and EMR for efficient and real-time data processing. Implemented data security using AWS IAM and CloudTrail for auditing and compliance. Proficient in AWS CloudFormation and Step Functions for infrastructure as code and workflow automation.

•Azure: Competent in leveraging Azure Data Lake, Azure Synapse Analytics, Data Factory, and Databricks to build and manage robust data solutions. Experience with Azure HDInsight for big data processing and Azure Functions for serverless computing. Skilled in configuring and optimizing Azure HDInsight clusters and utilizing Azure Storage for efficient data storage and retrieval.

•GCP: Expertise in utilizing Google Cloud services such as Dataproc, Dataprep, Pub/Sub, and Cloud Composer for sophisticated data workflows and processing. Proficient in Google Cloud Pub/Sub for scalable event-driven messaging and Google Cloud Audit Logging for compliance monitoring. Experience with Terraform for managing GCP resources and implementing CI/CD pipelines.

•Advanced Analytics & Machine Learning: Adept in managing and executing data analytics, machine learning, and AI-driven projects, ensuring robust and insightful data-driven decision-making.

•Data Pipeline Engineering: Extensive experience in building and managing sophisticated data pipelines across AWS, Azure, and GCP, ensuring reliable and efficient data workflows.

•Performance Optimization: Skilled in optimizing Spark performance across platforms (Databricks, Glue, EMR, on-premises), enhancing the efficiency of large-scale data processing; an illustrative tuning sketch follows this list.

•Data Security & Compliance: In-depth knowledge of implementing data security measures, access controls, and compliance monitoring using tools like AWS IAM, CloudTrail, and Google Cloud Audit Logging.

•DevOps Integration: Hands-on experience with CI/CD pipelines (Jenkins, Azure DevOps, AWS CodePipeline), Kubernetes, Docker, and GitHub for seamless deployment and management of big data solutions.

•Comprehensive Data Handling: Expertise in working with diverse file formats (JSON, XML, Avro) and utilizing SQL dialects (HiveQL, BigQuery SQL) for robust data analytics.

•Agile & Collaborative: Active participant in Agile/Scrum processes, contributing to Sprint Planning, Backlog Management, and Requirements Gathering while effectively communicating with stakeholders and project managers.
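The Spark tuning noted above varies by platform and workload; the following is a minimal, illustrative PySpark sketch of the kinds of settings and join strategy involved. The configuration values, paths, and column names are hypothetical, not taken from any specific engagement.

    # Illustrative Spark tuning sketch; values and paths are hypothetical and would be
    # sized to the actual cluster and data volume.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = (SparkSession.builder
             .appName("tuned-batch-job")
             .config("spark.sql.shuffle.partitions", "400")   # match shuffle width to data volume
             .config("spark.sql.adaptive.enabled", "true")    # let AQE coalesce skewed partitions
             .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
             .getOrCreate())

    # Broadcast the small dimension table to avoid a shuffle-heavy join.
    facts = spark.read.parquet("s3://example-bucket/facts/")  # placeholder path
    dims = spark.read.parquet("s3://example-bucket/dims/")    # placeholder path
    joined = facts.join(broadcast(dims), "dim_id")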

Key Achievements:

•Architected and deployed scalable data processing solutions on AWS, Azure, and GCP, significantly improving data handling efficiency and performance.

•Led the migration of complex on-premises data ecosystems to cloud-based architectures, enhancing scalability and reducing costs.

•Developed robust data warehousing solutions using AWS Redshift and Azure Synapse Analytics, enabling advanced analytics and business intelligence.

•Orchestrated and optimized big data clusters using Azure HDInsight and Kubernetes, ensuring seamless data processing and resource management.

TECHNICAL SKILLS

Big Data Systems: Amazon Web Services (AWS), Azure, Google Cloud Platform (GCP), Cloudera Hadoop, Hortonworks Hadoop, Apache Spark, Spark Streaming, Apache Kafka, Hive, Amazon S3, AWS Kinesis

Databases: Cassandra, HBase, DynamoDB, MongoDB, BigQuery, SQL, Hive, MySQL, Oracle, PL/SQL, RDBMS, AWS Redshift, Amazon RDS, Teradata, Snowflake

Programming & Scripting: Python, Scala, PySpark, SQL, Java, Bash

ETL Data Pipelines: Apache Airflow, Sqoop, Flume, Apache Kafka, DBT, Pentaho, SSIS

Visualization: Tableau, Power BI, Amazon QuickSight, Looker, Kibana

Cluster Security: Kerberos, Ranger, IAM, VPC

Cloud Platforms: AWS, GCP, Azure

AWS Services: AWS Glue, AWS Kinesis, Amazon EMR, Amazon MSK, Lambda, SNS, CloudWatch, CDK, Athena

Scheduler Tools: Apache Airflow, Azure Data Factory, AWS Glue, Step Functions

Spark Framework: Spark API, Spark Streaming, Spark Structured Streaming, Spark SQL

CI/CD Tools: Jenkins, GitHub, GitLab

Project Methods: Agile, Scrum, DevOps, Continuous Integration (CI), Test-Driven Development (TDD), Unit Testing, Functional Testing, Design Thinking

PROFESSIONAL EXPERIENCE

Sr. Big Data Engineer

Exelixis, Alameda, CA Aug’23 - Present

•Led the secure and efficient migration of data from on-premises data centers to the cloud (AWS) using a meticulously planned and executed approach.

•Automated the CI/CD pipeline for migration deployment tasks with Jenkins.

•Set up a scalable and cost-effective data storage solution on AWS S3.

•Transitioned ETL processes and data cataloging to AWS Glue, utilizing Glue jobs written in PySpark.

•Migrated data warehousing and analytics to Amazon Redshift.

•Established a Hadoop cluster on AWS using EMR and EC2 for distributed data processing.

•Monitored and managed the entire migration process with CloudWatch and CloudTrail.

•Performed rigorous testing of migrated data and ETL processes to guarantee data accuracy and completeness.

•Implemented robust ETL pipelines using PySpark to clean, transform, and enrich data as it flows between various data sources (MySQL, NoSQL databases, Snowflake, MongoDB).

•Leveraged PySpark's capabilities for data manipulation, aggregation, and filtering for optimal data preparation.

•Utilized AWS Redshift for efficiently storing terabytes of data in the cloud.

•Loaded structured & semi-structured data from MySQL tables into Spark clusters using Spark SQL & Data Frames API.

•Designed and implemented data ingestion pipelines using AWS Lambda functions to efficiently bring data from diverse sources into the AWS S3 data lake.

•Built fully managed Kafka streaming pipelines using Amazon MSK to deliver data streams from company APIs to processing targets such as Spark clusters in Databricks, Redshift, and Lambda functions.

•Ingested large data streams from company REST APIs into EMR clusters via AWS Kinesis.

•Processed data streams from Kafka brokers using Spark Streaming for real-time analytics, applying explode transformations to flatten nested records into rows; a minimal sketch of this pattern follows this list.

•Utilized Python & SQL for data manipulation, joining data sets, and extracting actionable insights from large datasets.

•Integrated Terraform into the CI/CD pipeline to automate infrastructure provisioning alongside code deployment, including data services like Redshift, EMR clusters, Kinesis streams, and Glue jobs.

•Maintained control and review of infrastructure changes through Terraform's plan and apply commands.
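A minimal sketch of the MSK/Kafka-to-Spark Structured Streaming pattern referenced above, assuming a JSON payload with a nested event array; the broker addresses, topic, schema, and S3 paths are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json, explode
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    spark = SparkSession.builder.appName("msk-events-stream").getOrCreate()

    # Assumed payload: an order id plus a nested array of events.
    payload_schema = StructType([
        StructField("order_id", StringType()),
        StructField("events", ArrayType(StructType([
            StructField("type", StringType()),
            StructField("ts", StringType()),
        ]))),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "b-1.example-msk:9092")  # placeholder MSK brokers
           .option("subscribe", "orders-topic")                        # placeholder topic
           .load())

    # Parse the JSON value, then explode the nested array so each event becomes its own row.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), payload_schema).alias("p"))
              .select(col("p.order_id"), explode(col("p.events")).alias("event")))

    query = (events.writeStream
             .format("parquet")
             .option("path", "s3://example-bucket/streams/orders/")          # placeholder sink
             .option("checkpointLocation", "s3://example-bucket/chk/orders/")
             .start())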

Sr. Big Data Engineer

Valero Energy, San Antonio, TX Jun’21 - Jul’23

•Utilized AWS S3 for efficient data collection and storage, enabling easy access and processing of large datasets.

•Transformed data using Amazon Athena for SQL processing and AWS Glue for Python processing, encompassing data cleaning, normalization, and standardization.

•Collaborated with data scientists and analysts at Valero Energy to leverage machine learning for critical business tasks, such as fraud detection, risk assessment, and customer segmentation, using Amazon SageMaker.

•Leveraged CloudWatch and CloudTrail for a robust fault tolerance and monitoring setup.

•Set up the underlying infrastructure and leveraged EC2 instances in a load-balanced configuration.

•Monitored Amazon RDS and CPU/memory usage with Amazon CloudWatch.

•Utilized Amazon Athena for ad hoc analysis where it was faster than running Spark jobs, taking advantage of its serverless execution model within AWS.

•Orchestrated data pipelines with AWS Step Functions and facilitated event messaging with Amazon Kinesis.

•Containerized Confluent Kafka applications and configured subnets for secure communication between containers.

•Performed data cleaning and pre-processing using AWS Glue, writing Python transformation scripts; a sketch of such a script follows this list.

•Planned and executed data migration strategies to move data from legacy systems to MySQL and NoSQL databases.

•Implemented robust security measures, access controls, and encryption to guarantee data protection throughout the migration process.

•Automated the deployment of ETL code and infrastructure changes using a CI/CD pipeline built with Jenkins.

•Implemented and monitored scalable and high-performance computing solutions using AWS Lambda (Python), S3, Amazon Redshift, Databricks (with PySpark jobs), and Amazon CloudWatch.

•Maintained version control of ETL code and configurations using Git.

•Optimized query performance and throughput by fine-tuning database configurations for MySQL and NoSQL databases and optimizing the queries.

•Developed automated Python scripts to convert data from various sources and generate ETL pipelines.

•Converted SQL queries into Spark transformations using Spark APIs in PySpark.

•Collaborated with the DevOps team to deploy pipelines in AWS using CodePipeline and AWS CodeDeploy.

•Executed Hadoop/Spark jobs on Amazon EMR, utilizing programs and data stored in Amazon S3 buckets.

•Loaded data from diverse sources (S3, DynamoDB) into Spark data frames and implemented in-memory data computation for efficient output generation.

•Utilized Amazon EMR for processing Big Data across Hadoop clusters, along with Amazon S3 for storage and Amazon Redshift for data warehousing.

•Developed streaming applications with Apache Spark Streaming and Amazon Managed Streaming for Apache Kafka (Amazon MSK).
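A minimal sketch of a Glue PySpark cleaning script of the kind referenced above; the catalog database, table, column, and bucket names are hypothetical placeholders.

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext
    from pyspark.sql.functions import col, trim, lower

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the raw table registered in the Glue Data Catalog (placeholder names).
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="transactions")

    # Standard cleaning: drop duplicates, trim and lower-case string keys, drop null ids.
    df = (dyf.toDF()
          .dropDuplicates()
          .withColumn("customer_id", trim(col("customer_id")))
          .withColumn("channel", lower(col("channel")))
          .na.drop(subset=["customer_id"]))

    # Write the cleaned data back to S3 as Parquet for downstream Athena / Redshift loads.
    df.write.mode("overwrite").parquet("s3://example-curated-bucket/transactions/")  # placeholder bucket

    job.commit()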

Sr. Data Engineer

Deliverr, San Francisco, CA Dec’19 - May’21

•Utilized Cloud Storage Transfer Service for high-speed and secure data movement between on-premises systems and GCP at Deliverr.

•Developed and maintained large-scale data processing and analysis pipelines using Apache Spark and Python on Google Cloud Platform (GCP).

•Migrated data to optimal storage solutions like Google Cloud Storage (GCS), Bigtable, or BigQuery based on analytical needs.

•Utilized Google Dataprep to ensure clean and prepared data during migration, monitored by Cloud Monitoring.

•Orchestrated data migration with Cloud Composer for a smooth and controlled process.

•Designed & implemented efficient data models and schema designs in BigQuery for optimized querying and storage.

•Defined a scalable and comprehensive data architecture integrating Snowflake, Oracle, GCP services, and other crucial components.

•Utilized Vertex AI Pipelines (built on the Kubeflow Pipelines SDK) to orchestrate machine learning workflows on GCP.

•Created & configured BigQuery datasets, tables, and views for efficient storage and management of transformed data.

•Established data quality checks and validation rules to guarantee data accuracy and reliability in BigQuery.

•Integrated BigQuery with other GCP services for various purposes, including data visualization (Data Studio), AI/ML analysis, and long-term data archiving (Cloud Storage).

•Leveraged Google Cloud Storage for data ingestion and Pub/Sub for event-driven data processing.

•Developed ETL pipelines using efficient methods like CDC (Change Data Capture) or scheduled batch processing to extract data from Oracle databases and migrate it to BigQuery.

•Implemented Cloud Billing reports and recommendations to identify and optimize GCP resource usage for cost-efficiency.

•Created data models and schema designs for Snowflake data warehouses to support complex analytical queries and reporting.

•Handled diverse data sources (structured, semi-structured, unstructured) to design data integration solutions on GCP.

•Implemented real-time data processing using Spark, GCP Cloud Composer, and Google Dataflow with PySpark ETL jobs for efficient analysis.

•Built data ingestion pipelines (Snowflake staging) from disparate sources and data formats to enable real-time analytics.

•Integrated data pipelines with various data visualization and BI tools like Tableau and Looker for dashboard and report generation.

•Mentored junior data engineers, providing guidance on ETL best practices, Snowflake, Snowpipes, and JSON.

•Implemented infrastructure provisioning using Terraform for consistent and repeatable environments across project stages.

•Utilized Kubernetes to manage the deployment, scaling, and lifecycle of Docker containers.

•Optimized ETL and batch processing jobs for performance, scalability, and reliability using Spark, YARN, and GCP DataProc.

•Managed and optimized GCP resources (VMs, storage, network) for cost-effectiveness and performance.

•Configured Cloud Identity & Access Management (IAM) roles to grant least privilege access to GCP resources at Deliverr.

•Used Google Cloud Composer to build and deploy data pipelines as Apache Airflow DAGs; a minimal DAG sketch follows this list.

•Constructed a machine learning pipeline using Apache Spark and scikit-learn for training and deploying predictive models.
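A minimal sketch of a Cloud Composer (Airflow) DAG of the kind referenced above, loading daily GCS extracts into BigQuery and merging them into a curated table. The bucket, dataset, and table names are hypothetical, and the operator imports assume the Airflow Google provider package.

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

    with DAG(
        dag_id="orders_gcs_to_bq",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        # Load the day's Parquet extract from GCS into a staging table (placeholder names).
        load_staging = GCSToBigQueryOperator(
            task_id="load_staging",
            bucket="example-ingest-bucket",
            source_objects=["orders/{{ ds }}/*.parquet"],
            destination_project_dataset_table="analytics.orders_staging",
            source_format="PARQUET",
            write_disposition="WRITE_TRUNCATE",
        )

        # Merge staging rows into the curated table (simple upsert on order_id).
        merge_curated = BigQueryInsertJobOperator(
            task_id="merge_curated",
            configuration={
                "query": {
                    "query": """
                        MERGE analytics.orders AS t
                        USING analytics.orders_staging AS s
                        ON t.order_id = s.order_id
                        WHEN MATCHED THEN UPDATE SET t.status = s.status
                        WHEN NOT MATCHED THEN INSERT ROW
                    """,
                    "useLegacySql": False,
                }
            },
        )

        load_staging >> merge_curated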

Data Engineer

Truist Financial Corporation, Charlotte, NC Apr’18 - Nov’19

•Migrated ETL workflows and orchestration to Azure Data Factory for streamlined data movement.

•Ensured data security throughout migration using Azure Active Directory & Key Vault for access control & encryption.

•Transferred data to optimal storage solutions like Azure Blob Storage, Data Lake Storage, or Azure SQL Data Warehouse based on data needs.

•Extensively modeled Hive partitions for efficient data separation and faster processing, adhering to Hive best practices for tuning in Azure HDInsight.

•Cached RDDs within Azure Databricks to enhance processing performance & efficiently perform actions on each RDD.

•Successfully transferred data from Oracle and SQL Server to Azure HDInsight Hive and Azure Blob Storage using Azure Data Factory.

•Implemented Azure Stream Analytics for real-time data processing during migration, ensuring continuous data flow.

•Monitored and managed the entire migration process with Azure Monitor for performance insights and Logic Apps for automated tasks.

•Created data frames from various data sources (existing RDDs, structured files, JSON datasets, databases) using Azure Databricks; a PySpark sketch follows this list.

•Loaded large datasets (terabytes) into Spark (Scala/PySpark) RDDs for data processing and analysis, efficiently importing data from Azure Blob Storage.

•Conducted comprehensive testing to guarantee data integrity, performance, and scalability across RDBMS (MySQL, MS SQL Server) and NoSQL databases.

•Continuously monitored database performance (MySQL, NoSQL) and implemented optimizations for improved efficiency.

•Implemented data visualization solutions using Tableau and Power BI to translate data into insights and analytics for business stakeholders.

•Developed well-structured, maintainable Python and Scala code leveraging inbuilt Azure Databricks libraries to meet application requirements for data processing and analytics.

•Automated the ETL process using UNIX shell scripts for scheduling, error handling, file operations, and data transfer via Azure Blob Storage.

•Managed jobs and file systems using UNIX shell scripts within Azure Linux Virtual Machines.

•Migrated legacy MapReduce jobs to PySpark jobs using Azure HDInsight for improved processing capabilities.
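A minimal sketch of the Azure Blob Storage-to-Databricks load-and-cache pattern referenced above. The storage account, container, secret scope, and table names are hypothetical placeholders; dbutils is assumed to be supplied by the Databricks runtime, with the account key pulled from Azure Key Vault via a secret scope.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("blob-to-databricks").getOrCreate()

    storage_account = "examplestorageacct"   # placeholder storage account
    container = "raw"                        # placeholder container

    # dbutils is provided by the Databricks runtime; the secret scope is a placeholder.
    spark.conf.set(
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
        dbutils.secrets.get(scope="kv-scope", key="storage-key"))

    # Load semi-structured JSON dropped into Blob Storage into a DataFrame.
    path = f"wasbs://{container}@{storage_account}.blob.core.windows.net/transactions/*.json"
    df = spark.read.json(path)

    # Cache the working set so repeated aggregations do not re-read Blob Storage.
    df.cache()

    daily_totals = (df.groupBy("account_id", "txn_date")
                      .sum("amount")
                      .withColumnRenamed("sum(amount)", "daily_total"))
    daily_totals.write.mode("overwrite").saveAsTable("curated.daily_totals")  # placeholder table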

Hadoop Data Engineer

Allstate Corporation, Glenview, IL Jan’16 - Mar’18

•Performed data profiling and transformation on raw data leveraging Pig, Python, and Oracle for data preparation.

•Created Hive external tables, loaded data, and queried it using HQL for efficient data retrieval and analysis.

•Leveraged Sqoop for seamless data transfer between relational databases and HDFS and utilized Flume to stream log data from servers for real-time processing.

•Explored Spark's capabilities to optimize existing Hadoop algorithms, leveraging Spark Context, Spark SQL, DataFrames, Spark Paired RDDs, and Spark YARN for improved performance.

•Built ETL pipelines using Apache NiFi processors to automate data movement and transformation within the Hadoop ecosystem.

•Designed and implemented solutions to ingest and process data from diverse sources using Big Data technologies like MapReduce, HBase, and Hive.

•Developed Spark code using Scala & Spark SQL to accelerate data processing and enable rapid testing and iteration.

•Utilized Sqoop to efficiently import millions of structured records from relational databases into HDFS as CSV files for further processing with Spark; a PySpark sketch of the downstream read follows this list.
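A minimal sketch of reading the Sqoop-imported CSV files from HDFS and re-expressing a HiveQL-style aggregation in Spark SQL, as described above. The HDFS path, table, and column names are hypothetical placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("claims-hive-spark")
             .enableHiveSupport()       # lets Spark SQL read and write Hive tables
             .getOrCreate())

    # CSV records landed in HDFS by Sqoop imports from the relational source (placeholder path).
    claims = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("hdfs:///data/raw/claims/"))
    claims.createOrReplaceTempView("claims_raw")

    # The same aggregation previously expressed in HiveQL, now run through Spark SQL.
    summary = spark.sql("""
        SELECT policy_id, COUNT(*) AS claim_count, SUM(claim_amount) AS total_paid
        FROM claims_raw
        WHERE claim_status = 'CLOSED'
        GROUP BY policy_id
    """)

    summary.write.mode("overwrite").saveAsTable("analytics.claims_summary")  # placeholder Hive table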

Data Analyst

Harley-Davidson, Milwaukee, WI Mar’14 - Dec’15

•Analyzed customer data to understand buying behaviors, preferences, and brand loyalty.

•Leveraged analytical skills to assess marketing campaign performance and recommend data-driven strategies for maximizing sales effectiveness.

•Partnered with product development teams to analyze product usage data and customer feedback, guiding the creation of motorcycle features and improvements that meet the needs of Harley-Davidson riders.

•Transformed complex data into clear, compelling narratives that resonate with stakeholders across all levels of the organization.

•Stayed at the forefront of data analytics trends and technologies, exploring new tools and methodologies to enhance Harley-Davidson's data-driven decision-making.

EDUCATION

Master of Science in Computer Science

Binghamton University, State University of New York, Thomas J. Watson College of Engineering and Applied Science

Bachelor of Science in Computer

Mumbai University, Mumbai (India)


