Priyanshu Sharma
Lead Big Data Engineer
Email: **************@*****.*** | Phone: 704-***-****
+ PROFILE SUMMARY
•Versatile and Results-Oriented Big Data Engineer:
•10+ years of experience in IT, with over 8 years in Big Data development.
•Expertise in architecting, developing, and implementing scalable, high-performance data processing pipelines on leading cloud platforms (AWS, Azure, GCP).
•Key Strengths:
•Big Data Mastery:
•Extensive experience with Hadoop ecosystems (Cloudera, Hortonworks, AWS EMR, Azure HDInsight, GCP Dataproc).
•Proficient in data ingestion, extraction, and transformation using tools such as AWS Glue, AWS Lambda, and Azure Databricks.
•Cloud Platform Excellence:
•AWS:
•Skilled in designing and optimizing data architectures using AWS Redshift, Kinesis, Glue, and EMR.
•Implemented data security with AWS IAM and CloudTrail.
•Proficient in AWS CloudFormation and Step Functions for infrastructure as code and workflow automation.
•Azure:
•Competent in Azure Data Lake, Synapse Analytics, Data Factory, and Databricks for robust data solutions.
•Experienced with Azure HDInsight, Azure Functions, and Azure Storage for efficient data management.
•GCP:
•Expertise in Google Cloud services such as Dataproc, Dataprep, Pub/Sub, and Cloud Composer.
•Proficient in Terraform for managing GCP resources and implementing CI/CD pipelines.
•Advanced Analytics & Machine Learning:
•Adept at managing and executing data analytics, machine learning, and AI-driven projects.
•Data Pipeline Engineering:
•Extensive experience in building and managing sophisticated data pipelines across AWS, Azure, and GCP.
•Performance Optimization:
•Skilled in optimizing Spark performance across platforms such as Databricks, Glue, and EMR.
•Data Security & Compliance:
•In-depth knowledge of data security, access controls, and compliance monitoring using AWS IAM, CloudTrail, and Google Cloud Audit Logs.
•DevOps Integration:
•Hands-on experience with CI/CD pipelines, Kubernetes, Docker, and GitHub for seamless deployment and management of big data solutions.
•Comprehensive Data Handling:
•Expertise in working with diverse file formats (JSON, XML, Avro) and SQL dialects (HiveQL, BigQuery SQL).
•Agile & Collaborative:
•Active participant in Agile/Scrum processes, contributing to sprint planning, backlog management, and requirements gathering.
•Key Successes:
•Architected and deployed scalable data processing solutions on AWS, Azure, and GCP, improving data handling efficiency and performance.
•Led the migration of complex on-premises data ecosystems to cloud-based architectures, enhancing scalability and reducing costs.
•Developed robust data warehousing solutions using AWS Redshift and Azure Synapse Analytics for advanced analytics and business intelligence.
•Orchestrated and optimized big data clusters using Azure HDInsight and Kubernetes, ensuring seamless data processing and resource management.
+ TECHNICAL SKILLS
•Big Data Systems: Amazon Web Services (AWS), Azure, Google Cloud Platform (GCP), Cloudera Hadoop, Hortonworks Hadoop, Apache Spark, Spark Streaming, Apache Kafka, Hive, Amazon S3, AWS Kinesis
•Databases: Cassandra, HBase, DynamoDB, MongoDB, BigQuery, SQL, Hive, MySQL, Oracle, PL/SQL, RDBMS, AWS Redshift, Amazon RDS, Teradata, Snowflake
•Programming & Scripting: Python, Scala, PySpark, SQL, Java, Bash
•ETL Data Pipelines: Apache Airflow, Sqoop, Flume, Apache Kafka, DBT, Pentaho, SSIS, Databricks
•Visualization: Tableau, Power BI, QuickSight, Looker, Kibana
•Cluster Security: Kerberos, Ranger, IAM, VPC
•Cloud Platforms: AWS, GCP, Azure
•AWS Services: AWS Glue, AWS Kinesis, Amazon EMR, Amazon MSK, Lambda, SNS, CloudWatch, CDK, Athena
•Scheduler Tools: Apache Airflow, Azure Data Factory, AWS Glue, Step Functions
•Spark Framework: Spark API, Spark Streaming, Spark Structured Streaming, Spark SQL
•CI/CD Tools: Jenkins, GitHub, GitLab
•Project Methods: Agile, Scrum, DevOps, Continuous Integration (CI), Test-Driven Development (TDD), Unit Testing, Functional Testing, Design Thinking
+ PROFESSIONAL EXPERIENCE
Oct 2023 – Present · Lead Big Data/Cloud Engineer · Truist Financial Corporation, Charlotte, NC
•Directed the secure and efficient migration of data from on-premises data centers to the cloud (AWS) through a meticulously planned and executed strategy.
•Streamlined migration deployment tasks by automating the CI/CD pipeline using Jenkins.
•Implemented a scalable and cost-effective data storage solution on AWS S3.
•Shifted ETL processes and data cataloging to AWS Glue, leveraging Glue jobs written in PySpark.
•Transferred data warehousing and analytics to Amazon Redshift for improved performance and scalability.
•Established a Hadoop cluster on AWS using EMR and EC2 for efficient distributed data processing.
•Created and managed Hive databases, tables, and partitions to optimize data storage and retrieval for analytics and reporting purposes.
•Developed and maintained DBT models to transform raw data into structured, analyzable formats, ensuring data integrity and consistency for business intelligence and analytics.
•Supervised and managed the entire migration process with the help of CloudWatch and CloudTrail.
•Conducted rigorous testing of migrated data and ETL processes to ensure data accuracy and completeness.
•Designed and implemented DBT data transformation pipelines, applying SQL-based transformations to prepare data for downstream analysis and reporting.
•Developed robust ETL pipelines using PySpark to clean, transform, and enrich data across various data sources (MySQL, NoSQL databases, Snowflake, MongoDB).
•Utilized PySpark’s capabilities for data manipulation, aggregation, and filtering, optimizing data preparation.
•Developed and maintained Kafka producers and consumers to handle data flow between systems and applications, ensuring reliable and efficient data exchange.
•Employed AWS Redshift for efficient storage of terabytes of data in the cloud.
•Loaded structured and semi-structured data from MySQL tables into Spark clusters using Spark SQL and DataFrames API.
•Implemented and managed Microsoft Intune for overseeing mobile devices and applications accessing Big Data environments, ensuring secure and compliant usage.
•Developed and enforced security policies using Intune to protect sensitive data within Big Data environments, including encryption, data loss prevention (DLP), and access controls for mobile devices.
•Designed, built, and managed complex data pipelines using Apache Airflow, orchestrating data workflows across multiple systems and ensuring the timely execution of ETL processes.
•Automated end-to-end workflows by scheduling and monitoring tasks with Airflow DAGs (Directed Acyclic Graphs), improving the efficiency and reliability of Big Data processing pipelines.
•Designed and implemented data ingestion pipelines using AWS Lambda functions to efficiently bring data from diverse sources into the AWS S3 data lake.
•Constructed AWS Fully Managed Kafka streaming pipelines using MSK to deliver data streams from company APIs to processing points like Spark clusters in Databricks, Redshift, and Lambda functions.
•Ingested large data streams from company REST APIs into EMR clusters via AWS Kinesis.
•Processed data streams from Kafka brokers using Spark Streaming for real-time analytics, applying explode transformations for data expansion (a simplified example of this pattern is sketched after this section).
•Developed complex HiveQL queries to perform data extraction, transformation, and loading (ETL) tasks, enabling comprehensive data analysis and reporting.
•Utilized Python and SQL for data manipulation by joining datasets and extracting actionable insights from large data sets.
•Deployed, configured, and managed Kafka clusters to ensure high availability, scalability, and performance of data streaming and messaging systems.
•Integrated Terraform into the CI/CD pipeline to automate infrastructure provisioning alongside code deployment, covering data services like Redshift, EMR clusters, Kinesis streams, and Glue jobs.
•Designed and implemented distributed data processing applications using Scala, leveraging frameworks like Apache Spark for large-scale data transformation and analytics.
•Developed efficient, fault-tolerant, and scalable data pipelines using Scala to process high-volume, high-velocity data from various sources.
•Maintained control and review of infrastructure changes through Terraform's plan and apply commands.
•Led and mentored a cross-functional team of data engineers, data scientists, and analysts, guiding them in the design, development, and optimization of Big Data pipelines and architectures.
•Oversaw the end-to-end project lifecycle for Big Data initiatives, from requirement gathering and architecture design to development, testing, and deployment, ensuring timely delivery within scope and budget.
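Illustrative PySpark Structured Streaming sketch of the Kafka-to-Spark pattern referenced above; the broker address, topic, schema, and S3 paths are hypothetical placeholders, not Truist systems, and the spark-sql-kafka connector is assumed to be on the cluster:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, explode
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Hypothetical event schema: one account with a list of transaction payloads.
schema = StructType([
    StructField("account_id", StringType()),
    StructField("transactions", ArrayType(StringType())),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "msk-broker:9092")  # placeholder MSK broker
       .option("subscribe", "events")                         # placeholder topic
       .load())

# Parse the Kafka value as JSON and explode the nested transaction array.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select(col("e.account_id"), explode(col("e.transactions")).alias("txn")))

# Land the expanded records in S3 as Parquet, with checkpointing for fault tolerance.
(parsed.writeStream
 .format("parquet")
 .option("path", "s3a://example-bucket/stream-output/")            # placeholder path
 .option("checkpointLocation", "s3a://example-bucket/checkpoints/")
 .start()
 .awaitTermination())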
Aug 2021 – Sep 2023 · Sr. Data Engineer · Centene Corporation, St. Louis, MO
•Leveraged the Cloud Storage Transfer Service for swift and secure data movement between on-premises systems and GCP at Centene.
•Engineered and sustained large-scale data processing and analysis pipelines with Apache Spark and Python on Google Cloud Platform (GCP).
•Transferred data to optimal storage solutions like Google Cloud Storage (GCS), Bigtable, or BigQuery tailored to analytical needs.
•Employed Google Dataprep to ensure clean and prepared data during migration, with monitoring by Cloud Monitoring.
•Managed data migration using Cloud Composer for a smooth and controlled process.
•Crafted and executed efficient data models and schema designs in BigQuery for enhanced querying and storage.
•Integrated Hive with other Big Data tools and frameworks such as Apache Pig, Spark, and Kafka to build end-to-end data processing and analytics solutions.
•Defined a scalable and comprehensive data architecture that integrates Snowflake, Oracle, GCP services, and other essential components.
•Applied Vertex AI Pipelines (built on the Kubeflow Pipelines SDK) to orchestrate machine learning workflows on GCP.
•Set up and configured BigQuery datasets, tables, and views for effective storage and management of transformed data.
•Established data quality checks and validation rules to ensure data accuracy and reliability in BigQuery.
•Linked BigQuery with other GCP services for various purposes, including data visualization (Data Studio), AI/ML analysis, and long-term data archiving (Cloud Storage).
•Employed Google Cloud Storage for data ingestion and Pub/Sub for event-driven data processing.
•Designed and implemented data models and schemas in Hive to represent complex business requirements, ensuring data integrity and optimal performance for analytics queries.
•Developed ETL pipelines using methods like CDC (Change Data Capture) or scheduled batch processing to extract data from Oracle databases and transfer it to BigQuery.
•Implemented Cloud Billing reports and recommendations to enhance and optimize GCP resource usage for cost-efficiency.
•Implemented ETL pipelines using Apache Airflow to extract data from various sources, transform it into a usable format, and load it into data warehouses or data lakes for analytics and reporting (a simplified DAG is sketched after this section).
•Integrated DBT with data warehouse solutions such as Snowflake, BigQuery, Redshift, and Azure Synapse, ensuring seamless data transformation and modeling within the data ecosystem.
•Configured DBT to interact with data warehouses, including setting up connections, managing schema changes, and handling data versioning.
•Developed custom operators and sensors in Airflow to handle specific data transformation and loading tasks, ensuring seamless integration with Big Data technologies such as Hadoop, Spark, and Kafka.
•Formulated data models and schema designs for Snowflake data warehouses to support intricate analytical queries and reporting.
•Wrote Scala-based pipelines to extract, transform, and load (ETL) data into Big Data platforms, optimizing performance for handling massive datasets.
•Managed diverse data sources (structured, semi-structured, unstructured) to design data integration solutions on GCP.
•Executed real-time data processing using Spark, GCP Cloud Composer, and Google Dataflow with PySpark ETL jobs for effective analysis.
•Developed data ingestion pipelines (Snowflake staging) from various sources and data formats to enable real-time analytics.
•Connected data pipelines with various data visualization and BI tools like Tableau and Looker for generating dashboards and reports.
•Guided junior data engineers, offering mentorship on ETL best practices, Snowflake, Snowpipes, and JSON.
•Implemented infrastructure provisioning using Terraform for consistent and repeatable environments across different project stages.
•Utilized Kubernetes to manage the deployment, scaling, and lifecycle of Docker containers.
•Enhanced ETL and batch processing jobs for performance, scalability, and reliability using Spark, YARN, and GCP Dataproc.
•Administered and optimized GCP resources (VMs, storage, network) for cost-effectiveness and performance.
•Configured Cloud Identity & Access Management (IAM) roles to ensure least privilege access to GCP resources at Centene.
•Integrated Scala-based applications with Big Data ecosystems, including Apache Hadoop, Hive, HBase, and Cassandra, to enable efficient data storage and retrieval.
•Utilized Google Cloud Composer to build and deploy data pipelines as DAGs using Apache Airflow.
•Built a machine learning pipeline using Apache Spark and scikit-learn for training and deploying predictive models.
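Illustrative Airflow 2.x DAG of the kind deployed on Cloud Composer as described above; the DAG id, task names, and the extract/load bodies are hypothetical placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_from_source(**context):
    # Placeholder: pull a batch of records from the source system.
    return ["record-1", "record-2"]

def load_to_warehouse(**context):
    # Placeholder: write the transformed batch to BigQuery / Snowflake.
    records = context["ti"].xcom_pull(task_ids="extract")
    print(f"Loading {len(records)} records")

with DAG(
    dag_id="example_daily_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_source)
    load = PythonOperator(task_id="load", python_callable=load_to_warehouse)
    extract >> load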
Mar 2020 – Jul 2021 · Sr. Big Data Administrator · Hyundai Motors, Fountain Valley, CA
•Shifted ETL workflows and orchestration to Azure Data Factory for more efficient data movement.
•Maintained data security during migration by leveraging Azure Active Directory & Key Vault for access control and encryption.
•Relocated data to optimal storage solutions like Azure Blob Storage, Data Lake Storage, or Azure SQL Data Warehouse based on specific data requirements.
•Thoroughly modeled Hive partitions for efficient data segmentation and faster processing, adhering to Hive optimization best practices in Azure HDInsight.
•Cached RDDs within Azure Databricks to boost processing performance and efficiently execute actions on each RDD (a simplified DataFrame-based example is sketched after this section).
•Successfully relocated data from Oracle and SQL Server to Azure HDInsight Hive and Azure Blob Storage using Azure Data Factory.
•Integrated Azure Stream Analytics for real-time data processing during migration, ensuring uninterrupted data flow.
•Oversaw and managed the entire migration process using Azure Monitor for performance insights and Logic Apps for automated tasks.
•Generated data frames from various data sources (existing RDDs, structured files, JSON datasets, databases) using Azure Databricks.
•Loaded extensive datasets (terabytes) into Spark (Scala/PySpark) RDDs for data processing and analysis, efficiently importing data from Azure Blob Storage.
•Performed comprehensive testing to ensure data integrity, performance, and scalability across RDBMS (MySQL, MS SQL Server) and NoSQL databases.
•Continuously tracked database performance (MySQL, NoSQL) and implemented optimizations to enhance efficiency.
•Implemented data visualization solutions using Tableau and Power BI to convert data into insights and analytics for business stakeholders.
•Developed well-structured, maintainable Python and Scala code utilizing built-in Azure Databricks libraries to fulfill application requirements for data processing and analytics.
•Automated the ETL process using UNIX shell scripts for scheduling, error handling, file operations, and data transfer via Azure Blob Storage.
•Managed jobs and file systems using UNIX shell scripts within Azure Linux Virtual Machines.
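Simplified DataFrame-based sketch of the Databricks caching pattern referenced above; the storage account, container, paths, and column names are hypothetical, and cluster credentials for Blob Storage are assumed to be configured:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("blob-cache-sketch").getOrCreate()

# Read JSON event data from Azure Blob Storage (placeholder WASB path).
df = spark.read.json("wasbs://data@examplestorage.blob.core.windows.net/raw/events/")

# Cache so that multiple downstream actions reuse the in-memory copy
# instead of re-reading from Blob Storage.
df.cache()

daily_counts = df.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet(
    "wasbs://data@examplestorage.blob.core.windows.net/curated/daily_counts/"
)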
Apr 2018 – Feb 2020 · Sr. Big Data Engineer (AWS) · Pfizer, New York City, NY
•Applied AWS S3 for effective data collection and storage, enabling seamless access and processing of large datasets.
•Transformed data using Amazon Athena for SQL processing and AWS Glue for Python processing, covering data cleaning, normalization, and standardization.
•Partnered with data scientists and analysts to utilize machine learning for critical business tasks, such as fraud detection, risk assessment, and customer segmentation, via Amazon SageMaker.
•Employed CloudWatch and CloudTrail for a reliable fault tolerance and monitoring setup.
•Established the underlying infrastructure and leveraged EC2 Instances in a load-balanced configuration.
•Monitored Amazon RDS and CPU/memory usage using Amazon CloudWatch.
•Utilized Amazon Athena for faster ad hoc analysis than equivalent Spark jobs, taking advantage of its serverless AWS architecture.
•Coordinated data pipelines with AWS Step Functions and facilitated event messaging with Amazon Kinesis.
•Containerized Confluent Kafka applications and configured subnets for secure container-to-container communication.
•Conducted data cleaning and preprocessing with AWS Glue, demonstrating expertise in writing Python transformation scripts (a simplified example is sketched after this section).
•Planned and executed data migration strategies to transfer data from legacy systems to MySQL and NoSQL databases.
•Implemented strong security measures, access controls, and encryption to ensure data protection throughout the migration process.
•Automated the deployment of ETL code and infrastructure changes using a CI/CD pipeline built with Jenkins.
•Implemented and monitored scalable, high-performance computing solutions using AWS Lambda (Python), S3, Amazon Redshift, Databricks (with PySpark jobs), and Amazon CloudWatch.
•Managed version control of ETL code and configurations through Git.
•Enhanced query performance and throughput by fine-tuning database configurations for MySQL and NoSQL databases and optimizing the queries.
•Developed automated Python scripts for data conversion from various sources and for generating ETL pipelines.
•Translated SQL queries into Spark transformations using Spark APIs in PySpark.
•Collaborated with the DevOps team to deploy pipelines on AWS using CodePipeline and AWS CodeDeploy.
•Executed Hadoop/Spark jobs on Amazon EMR, leveraging programs and data stored in Amazon S3 buckets.
•Loaded data from various sources (S3, DynamoDB) into Spark data frames and implemented in-memory data computation for efficient output generation.
•Utilized Amazon EMR for processing Big Data across Hadoop clusters, alongside Amazon S3 for storage and Amazon Redshift for data warehousing.
•Developed streaming applications using Apache Spark Streaming and Amazon Managed Streaming for Apache Kafka (Amazon MSK).
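Simplified sketch of the kind of Python/PySpark cleaning and normalization logic run as an AWS Glue job above; the bucket names and column names are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim, lower, to_date

spark = SparkSession.builder.appName("glue-style-cleaning-sketch").getOrCreate()

# Read raw CSV records from a landing bucket (placeholder path and columns).
raw = spark.read.option("header", "true").csv("s3a://example-raw-bucket/claims/")

cleaned = (raw
           .dropDuplicates(["claim_id"])                                        # de-duplicate on key
           .withColumn("patient_name", trim(lower(col("patient_name"))))        # normalize text
           .withColumn("claim_date", to_date(col("claim_date"), "yyyy-MM-dd"))  # standardize dates
           .filter(col("claim_amount").isNotNull()))                            # drop incomplete rows

cleaned.write.mode("overwrite").parquet("s3a://example-curated-bucket/claims/")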
Jan 2016 – Mar 2018 · Hadoop Data Engineer · Cox Communications, Atlanta, GA
•Conducted data profiling and transformation on raw datasets using Pig, Python, and Oracle to prepare data effectively.
•Developed Hive external tables, populated them with data, and executed queries using HQL for efficient data retrieval and analysis.
•Utilized Sqoop for efficient data transfer between relational databases and HDFS and employed Flume to stream log data from servers for real-time processing.
•Explored Spark's features to optimize existing Hadoop algorithms, leveraging SparkContext, Spark SQL, DataFrames, paired RDDs, and Spark on YARN for enhanced performance.
•Constructed ETL pipelines using Apache NiFi processors to automate data movement and transformation within the Hadoop ecosystem.
•Designed and implemented solutions to ingest and process data from various sources using Big Data technologies such as MapReduce, HBase, and Hive.
•Developed Spark code using Scala & Spark SQL to accelerate data processing, facilitating rapid testing and iteration.
•Utilized Sqoop to efficiently import millions of structured records from relational databases into HDFS, storing the data in CSV format for further processing with Spark (a simplified example is sketched after this section).
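Simplified sketch of processing Sqoop-imported CSV data with Spark, as referenced above; the HDFS path and the assumed four-column layout are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqoop-import-processing-sketch").getOrCreate()

# Headerless CSV files written to HDFS by Sqoop; assign column names explicitly.
orders = (spark.read
          .option("header", "false")
          .option("inferSchema", "true")
          .csv("hdfs:///user/etl/sqoop_import/orders/")
          .toDF("order_id", "customer_id", "amount", "order_date"))

# Expose the data to Spark SQL for downstream analysis.
orders.createOrReplaceTempView("orders")
top_customers = spark.sql(
    "SELECT customer_id, SUM(amount) AS total_amount "
    "FROM orders GROUP BY customer_id ORDER BY total_amount DESC LIMIT 10"
)
top_customers.show()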
Jan 2014 - Dec 2015 · Sr. Software Engineer · ETSY Inc., Brooklyn, NY
•Developed over 30 large-scale distributed systems and client-server architectures, primarily utilizing Django, Spring Boot, and Go.
•Implemented a distributed, asynchronous task-queue architecture with Python Celery, improving memory utilization on the API server by 8-12% (a minimal example of the pattern is sketched after this section).
•Engineered critical, performance-driven backends using the CQRS pattern, and integrated Live Weather and Map services with PostGIS for real-time order updates.
•Enhanced and maintained critical B+-tree and R+-tree database indices for performance-sensitive services.
•Migrated 10+ Elixir/Java/Python/.NET microservices to Managed RabbitMQ and Elasticache, reducing API latency by 27%.
•Standardized a micro-frontend architecture across the UI using runtime Module Federation, accelerating product iteration by 5-10%.
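Minimal Celery sketch of the asynchronous task-queue pattern referenced above; the broker URL and task body are illustrative placeholders:

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def resize_image(image_id: int) -> str:
    # Placeholder for work offloaded from the API server to a worker process.
    return f"resized-{image_id}"

# A request handler enqueues the task instead of doing the work inline, e.g.:
# resize_image.delay(42)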
+ EDUCATION
M.S. (Computer Science)
University of California - Riverside, CA, US
B.S. (Computer Science)
University of California - Riverside, CA, US