Deepkumar Patel
Contact: (***) ***- ****; Email: *********@*****.***
Lead Big Data Engineer
PROFILE SUMMARY
A dynamic, results-oriented Big Data Engineer with 14+ years of progressive experience in information technology, including over 11 years specializing in Big Data development. Demonstrated expertise in leveraging leading cloud platforms (AWS, Azure, GCP) to architect, develop, and implement scalable, high-performance data processing pipelines.
Skilled in designing and optimizing data architectures with AWS services such as Redshift, Kinesis, Glue, and EMR for efficient, real-time data processing; implementing data security with AWS IAM and CloudTrail for auditing and compliance; and using AWS CloudFormation and Step Functions for infrastructure as code and workflow automation.
Proven track record with Hadoop ecosystems (Cloudera, Hortonworks, AWS EMR, Azure HDInsight, GCP Dataproc) and proficient in data ingestion, extraction, and transformation using tools such as AWS Glue, AWS Lambda, and Azure Databricks.
Expertise in Google Cloud services such as Dataproc, Dataprep, Pub/Sub, and Cloud Composer for sophisticated data workflows and processing, including Pub/Sub for scalable event-driven messaging and Cloud Audit Logs for compliance monitoring. Experienced with Terraform for managing GCP resources and implementing CI/CD pipelines.
Competent in leveraging Azure Data Lake, Synapse Analytics, Data Factory, and Databricks to build and manage robust data solutions. Experienced with Azure HDInsight for big data processing and Azure Functions for serverless computing, skilled in configuring and optimizing HDInsight clusters, and adept at using Azure Storage for efficient data storage and retrieval.
In-depth knowledge of implementing data security measures, access controls, and compliance monitoring using tools such as AWS IAM, AWS CloudTrail, and Google Cloud Audit Logs.
Adept in managing and executing data analytics, machine learning, and AI-driven projects, ensuring robust and insightful data-driven decision-making.
Extensive experience in building and managing sophisticated data pipelines across AWS, Azure, and GCP, ensuring reliable and efficient data workflows.
Skilled in optimizing Spark performance across various platforms (Databricks, Glue, EMR, on-premises), enhancing the efficiency of large-scale data processing.
Hands-on experience with CI/CD pipelines (Jenkins, Azure DevOps, AWS CodePipeline), Kubernetes, Docker, and GitHub for seamless deployment and management of big data solutions.
Expertise in working with diverse file formats (JSON, XML, Avro) and utilizing SQL dialects (HiveQL, BigQuery SQL) for robust data analytics.
Active participant in Agile/Scrum processes, contributing to Sprint Planning, Backlog Management, and Requirements Gathering while effectively communicating with stakeholders and project managers.
TECHNICAL SKILLS
Big Data Systems: Amazon Web Services (AWS), Azure, Google Cloud Platform (GCP), Cloudera Hadoop, Hortonworks Hadoop, Apache Spark, Spark Streaming, Apache Kafka, Apache NiFi, Hive, Amazon S3, AWS Kinesis
Databases: Cassandra, HBase, DynamoDB, MongoDB, BigQuery, SQL, Hive, MySQL, Oracle, PL/SQL, RDBMS, AWS Redshift, Amazon RDS, Teradata, Snowflake
Programming & Scripting: Python, Scala, PySpark, SQL, Java, Bash
ETL Data Pipelines: Apache Airflow, Sqoop, Flume, Apache Kafka, DBT, Pentaho, SSIS
Visualization: Tableau, Power BI, Amazon QuickSight, Looker, Kibana
Cluster Security: Kerberos, Ranger, IAM, VPC
Cloud Platforms: AWS, GCP, Azure
AWS Services: AWS Glue, AWS Kinesis, Amazon EMR, Amazon MSK, Lambda, SNS, CloudWatch, CDK, Athena
Scheduler Tools: Apache Airflow, Azure Data Factory, AWS Glue, AWS Step Functions
Spark Framework: Spark API, Spark Streaming, Spark Structured Streaming, Spark SQL
CI/CD Tools: Jenkins, GitHub, GitLab
Project Methods: Agile, Scrum, DevOps, Continuous Integration (CI), Test-Driven Development (TDD), Unit Testing, Functional Testing, Design Thinking
PROFESSIONAL EXPERIENCE
Lead Big Data Engineer
Alteryx, Irvine, CA Apr 2023 - Present
I led a comprehensive data modernization initiative, migrating the company's data infrastructure to AWS. I designed and implemented a scalable data architecture on S3, modernized ETL processes with AWS Glue and PySpark, and deployed a Hadoop cluster on EMR for big data processing. I also spearheaded the secure migration of data from on-premises systems to AWS, automating the CI/CD pipeline with Jenkins. Additionally, I architected real-time data pipelines using Kafka and Kinesis, optimized data warehousing with Redshift, and leveraged Terraform for infrastructure automation. By managing the entire migration process and conducting rigorous testing, I ensured seamless data integration and enhanced data accessibility. (An illustrative PySpark sketch of this ETL pattern follows the bullet list for this role.)
Designed and implemented a scalable, cost-effective data storage architecture on AWS S3 for optimal data management.
Transitioned legacy ETL processes and data cataloging to AWS Glue, utilizing PySpark for writing Glue jobs to streamline data transformation.
Deployed a Hadoop cluster on AWS EMR and EC2 for distributed data processing, enabling efficient handling of big data workloads.
Spearheaded the secure and efficient migration of data from on-premises data centers to AWS, employing a meticulously structured strategy to ensure seamless execution and minimal downtime.
Automated the CI/CD pipeline using Jenkins for migration deployment tasks, enhancing efficiency and reducing manual intervention.
Designed, implemented, and managed data flow pipelines using Apache NiFi to automate the movement and transformation of data between systems, ensuring real-time data ingestion, processing, and routing while monitoring performance and data lineage for compliance and auditing.
Developed robust ETL pipelines using PySpark to cleanse, transform, and enrich data, ensuring seamless data flow across multiple sources, including MySQL, Snowflake, MongoDB, and NoSQL databases.
Built fully managed Kafka streaming pipelines on AWS MSK to deliver data streams from company APIs to processing points such as Databricks Spark clusters, Redshift, and Lambda functions.
Applied Python and SQL for advanced data manipulation, joining datasets, and extracting actionable insights from large-scale datasets.
Optimized data preparation using PySpark for manipulation, aggregation, and filtering to enhance processing efficiency.
Migrated data warehousing and analytics workflows to Amazon Redshift, ensuring high performance and scalability for large datasets.
Integrated Terraform into the CI/CD pipeline to automate infrastructure provisioning, including resources like Redshift, EMR clusters, Kinesis streams, and Glue jobs.
Managed infrastructure changes and version control through Terraform's plan and apply workflow, maintaining governance over changes and deployments.
Monitored and managed the entire migration process using AWS CloudWatch and CloudTrail, ensuring real-time visibility and tracking of all activities.
Conducted extensive testing of migrated data and ETL workflows to ensure data integrity, accuracy, and completeness post-migration.
Ingested large data streams from REST APIs into AWS EMR clusters through AWS Kinesis for real-time data processing.
Leveraged Spark Streaming and Kafka brokers to perform real-time analytics, utilizing explode transformations for comprehensive data expansion.
Utilized AWS Redshift to store and manage terabytes of structured and semi-structured data efficiently in the cloud environment.
Loaded data from MySQL tables into Spark clusters using Spark SQL and DataFrames API for further analysis and transformation.
Architected data ingestion pipelines using AWS Lambda functions to bring data from diverse sources into AWS S3 for unified data lake management.
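A minimal, illustrative PySpark sketch of the kind of ETL job described above: loading a MySQL table over JDBC, applying a simple cleansing step, and writing Parquet to S3. The connection details, table, and bucket names are hypothetical placeholders, and the MySQL JDBC driver is assumed to be available on the cluster.

from pyspark.sql import SparkSession, functions as F

# Build a Spark session (on EMR or Glue the runtime typically provides one).
spark = SparkSession.builder.appName("mysql_to_s3_etl").getOrCreate()

# Hypothetical JDBC connection details for the source MySQL table.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://example-host:3306/sales")
    .option("dbtable", "orders")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Simple cleansing/enrichment step: drop incomplete rows, add a load date.
cleaned = (
    orders.dropna(subset=["order_id", "customer_id"])
    .withColumn("load_date", F.current_date())
)

# Write the result to a hypothetical S3 data-lake location, partitioned by load date.
cleaned.write.mode("overwrite").partitionBy("load_date").parquet(
    "s3://example-data-lake/curated/orders/"
)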
Sr. Data Engineer
Shell USA, Houston, Texas Oct 2020 – Mar 2023
I led a complex data migration project to Azure, leveraging Azure HDInsight, Databricks, and Data Factory. I optimized data processing pipelines using Hive partitioning and RDD caching, ensuring efficient data handling. I migrated data from Oracle and SQL Server to Azure storage solutions and used Stream Analytics for real-time processing during the migration. I also implemented robust data security measures and monitored the migration for optimal performance. By writing optimized Python and Scala code and automating ETL processes, I transformed the data infrastructure and delivered actionable insights through data visualization tools. (An illustrative Databricks PySpark sketch follows the bullet list for this role.)
Modeled Hive partitions for efficient data separation and accelerated processing, adhering to Hive best practices within Azure HDInsight environments.
Cached RDDs in Azure Databricks to optimize processing performance and efficiently execute operations on distributed datasets.
Transferred data from Oracle and SQL Server to Azure Blob Storage and Azure HDInsight Hive using Azure Data Factory for seamless migration.
Utilized Azure Stream Analytics for real-time data processing during migration, ensuring uninterrupted data flow and consistency.
Imported and processed large datasets (terabytes) into Spark RDDs for analysis, effectively utilizing Azure Blob Storage for seamless data integration.
Conducted comprehensive testing to validate data integrity, performance, and scalability across both RDBMS (MySQL, MS SQL Server) and NoSQL databases.
Ensured robust data security during migration by leveraging Azure Active Directory and Key Vault for secure access control and encryption protocols.
Transferred data to optimal Azure storage solutions, such as Blob Storage, Data Lake Storage, and Azure SQL Data Warehouse, depending on storage and analytical needs.
Monitored the migration process with Azure Monitor for performance optimization and automated repetitive tasks with Logic Apps.
Created data frames from diverse sources like existing RDDs, JSON datasets, and databases using Azure Databricks to streamline data analysis.
Managed job scheduling and file systems on Azure Linux Virtual Machines using UNIX shell scripting to streamline operations.
Migrated legacy MapReduce jobs to PySpark within Azure HDInsight, enhancing processing speed and scalability for large datasets.
Continuously optimized MySQL and NoSQL database performance, identifying improvements to enhance overall efficiency.
Developed data visualization solutions in Tableau and Power BI, translating complex data into actionable insights for business stakeholders.
Wrote maintainable and optimized Python and Scala code within Azure Databricks, utilizing built-in libraries to meet specific data processing requirements.
Automated ETL processes with UNIX shell scripts for scheduling, error handling, file management, and data transfer through Azure Blob Storage.
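A minimal sketch of the Databricks-style PySpark pattern described above: reading data from Azure Blob Storage, caching it for reuse, and producing an aggregate. The storage account, container, and column names are hypothetical, and storage credentials are assumed to be configured on the cluster.

from pyspark.sql import SparkSession, functions as F

# On Databricks a session already exists as `spark`; shown here for completeness.
spark = SparkSession.builder.appName("blob_aggregation").getOrCreate()

# Hypothetical Blob Storage path (wasbs://<container>@<account>.blob.core.windows.net/...).
readings = (
    spark.read.option("header", "true").option("inferSchema", "true")
    .csv("wasbs://raw@exampleaccount.blob.core.windows.net/sensor_readings/")
)

# Cache the dataset because several downstream aggregations reuse it.
readings.cache()

# Example aggregation: daily averages per site.
daily_avg = (
    readings.groupBy("site_id", F.to_date("event_time").alias("event_date"))
    .agg(F.avg("value").alias("avg_value"))
)

daily_avg.write.mode("overwrite").parquet(
    "wasbs://curated@exampleaccount.blob.core.windows.net/daily_averages/"
)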
Sr. Big Data Engineer
BNY Mellon, New York City, NY Jan 2018 – Sep 2020
I led a significant data modernization project on GCP, migrating and transforming complex datasets. I designed and implemented scalable data pipelines using Apache Spark and Python, optimizing for performance and cost efficiency. I migrated data to GCP storage solutions including GCS, BigQuery, and Bigtable, ensuring data integrity and accessibility, and leveraged Dataprep, Cloud Composer, and Cloud Monitoring for data preparation, orchestration, and monitoring. Additionally, I integrated BigQuery with AI/ML, data visualization, and archiving solutions to enable advanced analytics and insights. (An illustrative Cloud Composer/Airflow sketch follows the bullet list for this role.)
Designed and developed large-scale data processing pipelines using Apache Spark and Python on GCP, optimizing performance for data processing.
Migrated data to storage solutions such as Google Cloud Storage (GCS), BigQuery, and Bigtable based on specific analytical needs and requirements.
Used Google Dataprep to ensure data was clean and ready for migration, with monitoring enabled through Cloud Monitoring for real-time oversight.
Implemented and managed real-time data streaming solutions using Apache Kafka to enable reliable data ingestion and processing across distributed systems, ensuring high throughput and low latency while integrating seamlessly with various data producers and consumers.
Orchestrated smooth data migrations using Cloud Composer and automated workflows to ensure a controlled process.
Designed scalable data models and schema structures in BigQuery for enhanced data querying and storage optimization.
Defined a comprehensive data architecture integrating Snowflake, Oracle, and GCP services, ensuring seamless data flow and integration.
Utilized Vertex AI Pipelines for machine learning workflow orchestration and managed BigQuery datasets, tables, and views for efficient data handling.
Established robust data quality checks in BigQuery to ensure accuracy and reliability throughout the migration process.
Integrated BigQuery with GCP services for AI/ML analytics, data visualization with Data Studio, and long-term archiving via Cloud Storage.
Employed Pub/Sub for event-driven data processing and Google Cloud Storage for large-scale data ingestion.
Developed ETL pipelines utilizing Change Data Capture (CDC) and batch processing to migrate Oracle data to BigQuery.
Implemented cost-efficient resource management strategies by analyzing Cloud Billing reports and optimizing GCP usage.
Created advanced data models and schemas in Snowflake for complex analytics and reporting, managing diverse structured and unstructured data sources.
Led real-time data processing with Spark, Cloud Composer, and Dataflow, deploying PySpark ETL jobs for efficient analysis.
Built data ingestion pipelines to enable real-time analytics and integrated BI tools like Tableau and Looker for dashboard generation.
Mentored junior engineers on ETL best practices, Snowflake management, and JSON processing.
Used Terraform for infrastructure provisioning, ensuring consistent environments, and Kubernetes to manage Docker containers for scalable deployments.
Optimized Spark and GCP Dataproc jobs for performance and reliability, and managed GCP resources for cost efficiency.
Configured Cloud IAM roles to ensure least-privilege access to resources and used Cloud Composer with Apache Airflow for data pipeline automation.
Developed machine learning pipelines using Apache Spark and scikit-learn, focusing on model training and deployment.
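A minimal sketch of the Cloud Composer (Apache Airflow) orchestration pattern referenced above, using a PythonOperator and the BigQuery client library to run a daily load query. The project, dataset, table names, and SQL are hypothetical placeholders; an Airflow 2.x environment is assumed.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import bigquery


def load_daily_partition(**_):
    # Hypothetical query moving staged rows into a reporting table.
    client = bigquery.Client(project="example-project")
    sql = """
        INSERT INTO reporting.daily_trades
        SELECT * FROM staging.trades WHERE trade_date = CURRENT_DATE()
    """
    client.query(sql).result()  # Wait for the BigQuery job to finish.


with DAG(
    dag_id="daily_bigquery_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_task = PythonOperator(
        task_id="load_daily_partition",
        python_callable=load_daily_partition,
    )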
Sr. Data Engineer
Roche, San Francisco, CA Jun 2015 – Dec 2017
I led a data engineering initiative on AWS, deploying machine learning models with SageMaker and streamlining data pipelines with Step Functions and Kinesis. I designed fault-tolerant infrastructure on EC2, automated deployments with Jenkins, and built serverless functions with Lambda. I also migrated legacy data to MySQL and NoSQL databases, optimizing performance and security. I executed Hadoop and Spark jobs on EMR to process large datasets, used S3 for scalable storage, and implemented robust security and fine-grained access control with IAM to ensure compliance. (An illustrative AWS Lambda sketch follows the bullet list for this role.)
Collaborated with data scientists and analysts to deploy machine learning models for critical business applications such as fraud detection, customer segmentation, and risk assessment using Amazon SageMaker.
Streamlined event-driven messaging systems by orchestrating data pipelines with AWS Step Functions and Amazon Kinesis.
Designed fault-tolerant infrastructure on EC2 instances within a load-balanced architecture, incorporating comprehensive monitoring through CloudWatch and CloudTrail.
Automated ETL code deployment and infrastructure updates using Jenkins within a CI/CD pipeline, ensuring smooth and continuous operations.
Leveraged AWS Lambda to build serverless applications, ensuring scalable, low-maintenance processing for real-time data transformations and event-driven workflows.
Successfully migrated legacy data to MySQL and NoSQL databases, implementing advanced security measures like encryption and access control to ensure data protection.
Enhanced database performance by optimizing MySQL and NoSQL configurations for maximum query throughput and efficiency.
Developed automated Python scripts for data transformation and ETL pipeline generation, converting SQL queries into PySpark transformations for efficient large-scale data processing.
Executed Hadoop and Spark jobs on Amazon EMR, leveraging data from S3 and DynamoDB, enabling high-performance in-memory computations.
Utilized Amazon EMR to process Big Data across Hadoop clusters, with S3 for scalable storage solutions and Amazon Redshift for data warehousing and analysis.
Implemented AWS Identity and Access Management (IAM) for fine-grained access control, ensuring least-privilege permissions for users and services across AWS resources, thereby enhancing security and compliance.
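A minimal sketch of the serverless, event-driven pattern described above: an AWS Lambda handler triggered by S3 object-created events that applies a lightweight transformation and writes the result to a processed location. The bucket names and the enrichment step are hypothetical.

import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # Each record corresponds to one S3 object-created notification.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the raw JSON object, apply a trivial transformation, and re-write it.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        payload = json.loads(body)
        payload["processed"] = True  # Hypothetical enrichment step.

        s3.put_object(
            Bucket="example-processed-bucket",
            Key=f"processed/{key}",
            Body=json.dumps(payload).encode("utf-8"),
        )

    return {"status": "ok", "records": len(event["Records"])}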
Hadoop Data Administrator
Nissan, Franklin, Tennessee Feb 2013 – May 2015
I led a data engineering project focused on optimizing data processing and analysis. I used Pig, Python, and Oracle to profile and transform raw data, then designed efficient Hive external tables for storage and querying. I leveraged Sqoop and Flume for data transfer and real-time streaming, and optimized legacy Hadoop algorithms with Spark, significantly improving performance. By developing and managing ETL processes and leveraging big data technologies, I ensured data accuracy, consistency, and availability for downstream analytics. (An illustrative PySpark sketch follows the bullet list for this role.)
Utilized Pig, Python, and Oracle to perform comprehensive data profiling and transformation on raw datasets, ensuring readiness for downstream analysis.
Designed efficient Hive external tables for structured data storage, loading, and querying using HQL to enable optimized retrieval and analysis.
Implemented Sqoop to seamlessly transfer vast datasets between relational databases and HDFS, while leveraging Flume for real-time streaming of server logs into the Hadoop ecosystem.
Explored and optimized legacy Hadoop algorithms by leveraging Spark capabilities, including SparkContext, DataFrames, Spark SQL, paired RDDs, and Spark on YARN, to significantly improve performance.
Developed and managed ETL processes to extract raw data from various sources, transform it into structured formats, and load it into target data stores, ensuring data accuracy, consistency, and availability for analytics and reporting.
Engineered solutions to ingest and process data from diverse sources using big data technologies like HBase, Hive, and MapReduce.
Developed high-performance Spark code with Scala and Spark SQL to accelerate data processing and testing iterations.
Efficiently imported millions of structured records from relational databases into HDFS using Sqoop, ensuring data was stored in CSV format for further processing with Spark.
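A minimal sketch of the kind of Spark rewrite of a legacy MapReduce aggregation described above, reading Sqoop-exported CSV from HDFS and computing counts with paired RDDs. The HDFS paths and column positions are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("legacy_mr_rewrite").getOrCreate()
sc = spark.sparkContext

# Sqoop-exported CSV files landed in a hypothetical HDFS directory.
lines = sc.textFile("hdfs:///data/sqoop/vehicle_orders/*.csv")

# Paired-RDD equivalent of a MapReduce aggregation:
# map each row to (dealer_id, 1), then reduce by key.
orders_per_dealer = (
    lines.map(lambda line: line.split(","))
    .filter(lambda cols: len(cols) > 2)          # drop malformed rows
    .map(lambda cols: (cols[1], 1))              # cols[1] assumed to be dealer_id
    .reduceByKey(lambda a, b: a + b)
)

# Persist results back to HDFS for downstream Hive external tables.
orders_per_dealer.saveAsTextFile("hdfs:///data/curated/orders_per_dealer")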
Data Engineer
MetLife, New York, NY Jan 2012 – Jan 2013
I led a data engineering effort centered on automating data workflows and improving processing and reporting. I developed and optimized UNIX shell scripts to automate workflows and data transfers, maintained and upgraded Hadoop environments, built out Hadoop infrastructure, and wrote MapReduce programs for data processing. I used Sqoop, Oozie, Hive, and Pig for data transfer, orchestration, and transformation. Additionally, I monitored and tuned SQL Server performance, analyzed sales patterns, and created data visualizations in Power BI. By following Agile practices, I ensured a reliable and efficient data pipeline. (An illustrative Python MapReduce sketch follows the bullet list for this role.)
Developed and optimized UNIX shell scripts to automate build processes and workflows and to perform routine tasks such as data file transfers across systems.
Automated data extraction workflows to pull data from various databases into a shared file system (FS) and loaded it using terminal commands.
Maintained and updated Hadoop environments by performing upgrades, patches, and bug fixes in a clustered setup.
Created and validated Hadoop infrastructure, handled data center capacity planning for growth, and managed installation and configuration of Hadoop binaries (PoC).
Developed Java MapReduce programs for data processing and cleaning up data streams from server logs.
Transferred data between clusters and storage media using Sqoop, orchestrated data workflows with Oozie, and created Sqoop, Pig, and Hive scripts for data transformation.
Monitored and optimized SQL server performance and improved query efficiency across multiple servers.
Analyzed sales patterns and customer satisfaction indices through Hive queries, integrating data from various relational databases.
Designed data visualizations in Power BI to present real-time impact and growth metrics.
Developed test cases following Agile methodology during two-week sprints to ensure data accuracy and process functionality.
Collaborated with technology and business teams to create a Hadoop migration strategy, recommending appropriate technology stacks based on enterprise architecture.
Conducted performance optimization by evaluating MapReduce job efficiency, including I/O latency, map time, combiner time, and reduce time.
Followed Agile practices throughout the project, regularly updating and refining processes for data pipeline reliability and efficiency.
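The log-cleaning MapReduce jobs above were written in Java; a minimal Python analogue using the mrjob library is sketched here purely to illustrate the map/reduce structure, assuming a common access-log layout with hypothetical field positions.

from mrjob.job import MRJob


class CleanServerLogs(MRJob):
    """Count valid requests per endpoint from raw server logs (hypothetical format)."""

    def mapper(self, _, line):
        parts = line.split()
        # Skip malformed lines; assume the 7th field is the request path
        # and the 9th is the HTTP status code, as in common access-log layouts.
        if len(parts) >= 9 and parts[8].isdigit():
            yield parts[6], 1

    def reducer(self, endpoint, counts):
        yield endpoint, sum(counts)


if __name__ == "__main__":
    CleanServerLogs.run()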
Data Analyst
Target Corporation, Minneapolis, Minnesota Jan 2010 – Dec 2011
I led data analysis projects focused on extracting actionable insights. I collaborated with cross-functional teams to identify key performance indicators and measure the impact of marketing initiatives, transformed complex datasets into clear visualizations and reports, and developed dashboards to monitor business metrics. Using SQL and Python, I automated reporting processes and streamlined data collection. I also stayed current with industry trends and explored new methodologies to enhance data analysis capabilities. (A small pandas reporting sketch follows the bullet list for this role.)
Conducted in-depth analyses of large datasets to extract actionable insights, driving strategic business decisions.
Collaborated with cross-functional teams to identify key performance indicators (KPIs) and evaluate the effectiveness of marketing initiatives.
Transformed complex data sets into clear and compelling visualizations and reports, ensuring effective communication of findings to stakeholders at all levels.
Developed and maintained dashboards to monitor business metrics, enabling timely decision-making and performance tracking.
Conducted A/B testing and statistical analyses to measure the impact of various strategies and optimize performance.
Utilized SQL and Python to manipulate data, automate reporting processes, and streamline data collection methods.
Stayed current with industry trends and emerging analytical tools, continuously exploring innovative methodologies to enhance data analysis capabilities.
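A small, illustrative pandas sketch of the kind of automated reporting described above, aggregating a hypothetical sales extract into a KPI summary; file and column names are placeholders.

import pandas as pd

# Hypothetical daily sales extract produced by an upstream SQL export.
sales = pd.read_csv("daily_sales_extract.csv", parse_dates=["order_date"])

# Compute simple KPIs per store: revenue, order count, and average basket size.
kpis = (
    sales.groupby("store_id")
    .agg(
        revenue=("order_total", "sum"),
        orders=("order_id", "nunique"),
        avg_basket=("order_total", "mean"),
    )
    .reset_index()
)

# Write the summary that feeds a dashboard or scheduled report.
kpis.to_csv("store_kpi_summary.csv", index=False)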
EDUCATION
Master's in Computer Science (Big Data Systems)
Arizona State University, Tempe, AZ
Bachelor's in Computer Science
Portland State University, Portland, OR