Senior Data Engineer

Location:
Seattle, WA, 98109
Posted:
March 05, 2025

PRANAY PENTAPARTHY

Big Data Engineer / ETL Developer / Cloud Engineer

Email: *************@*****.***; Phone: 206-***-****
Databricks | GCP | AWS

Profile Summary

• Over 10 years of experience in big data pipeline architecture on premises and in the cloud, including Hadoop administration and performance optimization; 14+ years of overall IT experience

• Designed and implemented scalable data pipelines on GCP, utilizing services like BigQuery, Dataflow, and Dataprep to process and analyze large datasets efficiently

• Led requirements gathering efforts, working closely with stakeholders to define project objectives and scope, ensuring alignment with business goals

• Managed projects from inception to completion, employing Agile methodologies, conducting sprint planning, backlog management, and retrospectives to deliver high-quality big data solutions on time and within budget

• Demonstrated expertise in GCP's serverless offerings, including Cloud Functions and App Engine, to build real-time data processing solutions and web applications

• Managed Google Kubernetes Engine (GKE) clusters for containerized applications, achieving high availability and scalability through auto-scaling configurations

• Leveraged Google Cloud Storage and Cloud Pub/Sub for seamless data storage, archiving, and event-driven data processing

• Ensured data security and compliance by implementing IAM roles, VPC peering, and encryption for data at rest and in transit in GCP

• Orchestrated data ingestion, processing, and storage in AWS using services such as S3, EC2, Glue, and EMR, with PySpark jobs for processing big data workloads (a minimal sketch of this pattern appears at the end of this summary)

• Employed AWS Lambda and Step Functions to build serverless data workflows and automate ETL processes

• Leveraged AWS Glue for data cataloging, transformation, and seamless integration with various data sources

• Developed CI/CD pipelines with AWS CodePipeline and CodeBuild to automate the deployment of data applications and infrastructure as code

• Ensured high availability and disaster recovery in AWS by configuring multi-region architectures and implementing services like Route 53 and CloudFront

• Utilized Azure services such as Azure Data Factory and Azure Databricks for data integration, ETL, and advanced analytics using PySpark and Spark/Scala

• Managed Azure HDInsight clusters for big data processing, optimizing cluster performance and scalability

• Worked on AWS to provision and manage EC2 instances and Hadoop clusters

• Demonstrated proficiency in on-premises big data solutions, including Hadoop clusters, ensuring the efficient processing of large-scale data

• Optimized on-premises data storage and retrieval by implementing distributed file systems and data warehousing solutions

• Implemented CI/CD pipelines using tools like Jenkins, Git, and Docker to automate the deployment of data applications and ensure consistent, reliable releases

• Implemented comprehensive testing procedures, including unit testing, integration testing, and end-to-end testing, to validate data processing pipelines and maintain data quality

• Expertise in compiling data from diverse internal and external sources and applying advanced data collection techniques, such as geo-location data
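
Illustrative sketch (PySpark), referenced from the AWS ingestion bullet above. This is a minimal, hypothetical example of an EMR-style batch job that reads raw data from S3 and writes partitioned Parquet back to S3; the bucket names, paths, and columns are placeholders, not taken from any actual client pipeline.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal EMR-style PySpark batch job: read raw JSON from S3,
# apply a light transformation, and write partitioned Parquet back to S3.
spark = SparkSession.builder.appName("s3-batch-etl-sketch").getOrCreate()

# Hypothetical input location
raw = spark.read.json("s3://example-raw-bucket/events/")

cleaned = (
    raw.filter(F.col("event_type").isNotNull())          # drop incomplete records
       .withColumn("event_date", F.to_date("event_ts"))  # derive a partition column
)

# Hypothetical output location, partitioned for downstream querying
(cleaned.write
        .mode("overwrite")
        .partitionBy("event_date")
        .parquet("s3://example-curated-bucket/events/"))

spark.stop()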

Technical Skills

Big Data Systems Architecture: Amazon AWS (EC2, S3, Kinesis), Azure, Cloudera Hadoop, Hortonworks Hadoop, Spark, Spark Streaming, Hive, Kafka, Snowflake, Databricks

Programming Languages: Scala, Python, Java, Bash, PySpark

Hadoop Components: Hive, Pig, Zookeeper, Sqoop, Oozie, Yarn, Maven, Flume, HDFS, Airflow

Hadoop Administration: Zookeeper, Oozie, Cloudera Manager, Ambari, Yarn

Data Management: Snowflake, Apache Cassandra, AWS Redshift, Amazon RDS, Apache HBase, SQL, NoSQL, Elasticsearch, HDFS, Data Lake, Data Warehouse, Database, Teradata, SQL Server

ETL Data Pipeline Architecture: Apache Airflow, Hive, Sqoop, Flume, Scala, Python, Apache Kafka, Logstash

Scripting: HiveQL, SQL, Shell scripting

Big Data Frameworks: Spark/PySpark and Kafka

Spark Framework: Spark API, Spark Streaming, Spark SQL, Spark Structured Streaming

Visualization: Tableau, QlikView, Power BI

Software Development IDEs: Jupyter Notebooks, PyCharm, IntelliJ

Continuous Integration (CI/CD) and Versioning: Git, GitHub, Bitbucket

Methodologies: Agile Scrum, Test-Driven Development, Continuous Integration, Unit Testing, Functional Testing, Scenario Testing

Professional Experience

Lead Data/Cloud Engineer

Nordstrom Inc., Seattle, Washington, Nov’23 – Present

• Pioneered feature engineering techniques to optimize predictive models, improving model performance on Google Cloud Platform

• Fostered close collaboration with cross-functional teams, including infrastructure, network, database, application, and BI, ensuring robust data quality and accessibility within the GCP environment

• Orchestrated Airflow workflows in a hybrid cloud environment, spanning local on-premises servers and the cloud

• Diligently communicated deliverables' status to stakeholders and orchestrated regular review meetings to maintain transparency within GCP projects

• Enhanced data storage efficiency across Kafka brokers in the GCP-hosted Kafka cluster by strategically partitioning Kafka topics consumed with PySpark streaming

• Developed and maintained ETL pipelines using Apache Spark and Python on Google Cloud Platform (GCP) for large-scale data processing and analysis.

• Onboarded new hires and acted as their point of contact until they were comfortable with the project codebase

• Consistently met project timelines on GCP, adhering to GCP project requirements and quality benchmarks

• Leveraged GCP's data processing capabilities, including the use of Flume to seamlessly manage streaming data and efficiently load it into the GCP-hosted Hadoop cluster

• Integrated GCP's Kafka services with Spark streaming to enable high-speed data processing in a GCP environment

• Extracted real-time feeds using Kafka and Spark Streaming on GCP, transformed them into RDDs and processed the data as DataFrames, and stored the results in Parquet format in HDFS on GCP (see the sketch at the end of this role)

• Utilized Python libraries, including NumPy, SciPy, Scikit-Learn, and Pandas, within the GCP ecosystem to enhance the analytics components of the data analysis framework along with PySpark dataframes and RDDs

• Executed Hadoop/Spark jobs on Amazon EMR using programs and data stored in Amazon S3 buckets

• Created data models and schema designs for Snowflake data warehouses to support complex analytical queries and reporting

• Crafted materialized views, partitions, tables, views, and indexes on GCP for optimized data management within the GCP environment

• Harnessed the power of GCP's Spark SQL and DataFrames API to ingest structured and semi-structured data into PySpark clusters on GCP using Dataproc

• Collaborated closely with the data science team, contributing to the development of Spark MLlib applications for diverse predictive modeling initiatives within GCP

• Designed intricate data processing workflows using a range of GCP tools, including cloud-native services such as Google BigQuery, Google Dataflow, and Google Dataprep, ensuring a streamlined data pipeline within the GCP environment

• Loaded data from various sources into GCP's storage solutions, optimizing data accessibility within GCP

• Captured and transformed real-time data from external sources and GCP services into a scalable format suitable for analytics in the GCP environment
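
Illustrative sketch (PySpark), referenced from the Kafka/Spark Streaming bullet above. A minimal Structured Streaming job that consumes a Kafka topic and lands the records as Parquet on HDFS; it assumes the spark-sql-kafka connector is on the classpath, and the broker address, topic, and paths are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-parquet-sketch").getOrCreate()

# Consume a Kafka topic as a streaming DataFrame (placeholders throughout)
stream = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker-1:9092")
         .option("subscribe", "clickstream-events")
         .load()
)

# Kafka delivers key/value as binary; cast the payload to string for downstream parsing
parsed = stream.select(
    F.col("key").cast("string").alias("key"),
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp"),
)

# Land the stream as Parquet files on HDFS, with a checkpoint for fault tolerance
query = (
    parsed.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/curated/clickstream/")
          .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
          .start()
)

query.awaitTermination()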

Sr. Data Engineer

TD Bank, Cherry Hill, New Jersey, Sep’21 – Oct’23

• Applied large-scale data analytics on Microsoft Azure to evaluate market trends, payer behaviors, and healthcare economics, supporting data-driven decisions in market access and pricing strategies.

• Employed advanced analytics methodologies on Azure to analyze massive datasets, generating actionable insights for the pharmaceutical sector.

• Leveraged Azure services, including Azure Synapse Analytics and Azure Machine Learning, for specialized data processing and machine learning tasks, seamlessly integrating with the AWS ecosystem for a comprehensive Big Data approach.

• Worked with cross-functional teams to gather requirements and develop data-driven solutions that enhance pricing and reimbursement strategies.

• Designed and deployed scalable data pipelines to process, clean, and transform both structured and unstructured healthcare data.

• Utilized Azure Data Factory to automate data workflows, including data movement and transformation processes.

• Automated workflows in Databricks using Databricks Jobs and Workflows, improving productivity and consistency.

• Integrated Apache Flink with Kafka for low-latency, high-throughput stream processing tasks, complementing Apache Spark’s batch processing in Big Data projects.

• Leveraged Azure Databricks for scalable data processing with Apache Spark, enabling advanced ETL operations (see the sketch at the end of this role).

• Implemented Azure Logic Apps to automate workflows and integrate multiple Azure services and external systems.

• Employed Azure Functions for event-driven serverless computing, automating data processing tasks.

• Defined data validation rules and checks within Azure Data Factory pipelines to ensure data quality during ETL processes.

• Utilized Azure Data Lake Analytics for ad-hoc queries and data validation across large datasets.

• Applied Azure Synapse Analytics to build and manage enterprise-scale data warehouses.

• Utilized Azure Analysis Services for OLAP, constructing multidimensional data models for reporting and analytics.

• Leveraged Azure Stream Analytics for real-time data ingestion, processing, and analysis from various streaming sources.

• Implemented Azure Event Hubs for scalable ingestion and real-time processing of high-volume data streams.

• Configured Databricks clusters for optimal performance and cost-efficiency, tailoring resources to match workload requirements.

• Utilized Azure Cosmos DB for globally distributed, multi-model NoSQL data storage, supporting document, key-value, graph, and column-family models.

• Implemented Azure Table Storage for fast access to key-value and semi-structured data.

• Monitored and analyzed the performance and health of Azure services and resources using Azure Monitor.

• Deployed Azure Application Insights for application performance monitoring (APM) and diagnostics, facilitating proactive issue detection and resolution.

• Utilized Azure Data Explorer for real-time analysis of log and telemetry data, enabling rapid debugging and troubleshooting of data pipelines and applications.

• Integrated Databricks with cloud storage solutions like AWS S3 and Azure Data Lake Storage to streamline data ingestion and storage.

• Implemented Azure DevOps for end-to-end application lifecycle management (ALM), including code version control, continuous integration, and continuous delivery (CI/CD), streamlining debugging and issue resolution processes.

• Leveraged Docker Compose to define and manage multi-container Docker applications within development and testing environments, simplifying configuration and deployment.

• Utilized Jenkins Pipeline to define and execute CI/CD workflows as code, providing version- controlled, reusable pipelines for Big Data projects.

• Developed Terraform modules to enable modular and reusable infrastructure provisioning in Big Data environments, ensuring scalable and consistent deployments.
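
Illustrative sketch (PySpark on Databricks), referenced from the Azure Databricks ETL bullet above. A minimal notebook-style example that reads raw CSV from ADLS Gen2, standardizes a few columns, and writes a Delta table; the storage account, container, columns, and table names are hypothetical, and the cluster is assumed to already have access to the storage account.

from pyspark.sql import functions as F

# `spark` is provided by the Databricks runtime in a notebook.
raw_path = "abfss://raw@examplestorage.dfs.core.windows.net/claims/"  # hypothetical path

claims = (
    spark.read
         .option("header", "true")
         .csv(raw_path)
)

curated = (
    claims.withColumn("claim_amount", F.col("claim_amount").cast("double"))  # enforce types
          .withColumn("load_date", F.current_date())                         # audit column
          .dropDuplicates(["claim_id"])
)

# Persist as a managed Delta table for downstream analytics (hypothetical schema/table)
spark.sql("CREATE SCHEMA IF NOT EXISTS curated")
(curated.write
        .format("delta")
        .mode("overwrite")
        .saveAsTable("curated.claims"))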

Sr. Data Engineer

Bank of America, Chicago, Illinois, Jan’19 – Aug’21

• Developed and deployed Big Data analytics solutions on AWS to monitor and analyze customer transactions, detect fraud patterns, and identify financial risk signals, improving overall security and risk management for Bank of America.

• Utilized advanced analytics techniques, including Natural Language Processing (NLP) and machine learning, on AWS to extract insights from unstructured data sources such as customer feedback, transaction data, and market trends to enhance customer experience and support decision-making.

• Implemented automated testing frameworks using Scala, including ScalaTest and Specs2, ensuring code quality, reliability, and regulatory compliance for financial data processing systems at Bank of America.

• Designed and built data pipelines on AWS to ingest, clean, and transform large volumes of transaction and financial data, enabling real-time fraud detection and risk analysis, while adhering to financial regulations.

• Collaborated with cross-functional teams, including data scientists, financial analysts, and compliance officers, to define business requirements and deliver actionable insights for improving Bank of America's financial performance and regulatory compliance.

• Expert in utilizing AWS S3 for scalable and secure object storage, facilitating the storage of large financial datasets such as transaction histories, market data, and compliance reports, ensuring data durability and availability.

• Leveraged Databricks notebooks for data exploration, transformation, and development of machine learning models that predict customer behavior and enhance personalized banking services.

• Utilized Apache Spark's Structured Streaming API to process real-time transaction data from Kafka, enabling fraud detection and prevention by analyzing continuous data streams.

• Experienced with AWS Glue for automating ETL tasks, ensuring efficient data preparation and integration from various financial data sources for reporting and analytics at Bank of America.

• Developed Python scripts to process and analyze large datasets using distributed computing frameworks such as Apache Spark, Hadoop, and Dask.

• Proficient in configuring AWS EMR (Elastic MapReduce) clusters for processing large-scale financial datasets, enabling efficient analysis of market trends and customer transactions using Apache Hadoop and Spark.

• Skilled in querying data with AWS Athena, allowing the Bank of America team to conduct interactive querying of financial data stored in AWS S3 and other data sources, supporting compliance and audit needs.

• Implemented serverless architectures using AWS Lambda, facilitating event-driven processing for fraud detection, regulatory reporting, and real-time financial analytics.

• Deployed Kafka MirrorMaker to replicate financial transaction data across regions, ensuring data durability, redundancy, and disaster recovery for Bank of America's critical data systems.

• Orchestrated complex workflows using AWS Step Functions, coordinating distributed components in Bank of America's data architecture to streamline processing for risk management and compliance reporting.

• Automated data workflows using Apache Airflow, scheduling and managing complex financial data pipelines to ensure timely reporting for Bank of America's regulatory and compliance teams.

• Used Databricks REST API to automate administrative tasks and integrate with other financial tools, improving workflow automation for Bank of America's data management processes.

• Applied reactive programming principles using Scala and Play Framework to develop scalable financial services, such as data visualization for market analysis and customer portfolio management.

• Designed Directed Acyclic Graphs (DAGs) in Apache Airflow to manage and monitor data pipelines for real-time risk management, enhancing visibility and control over workflow dependencies in the bank's data processes (see the sketch at the end of this role).

• Leveraged Airflow's scheduling features to automate the execution of fraud detection and regulatory compliance tasks at Bank of America, ensuring data processing aligns with internal and external timelines.

• Partnered with data scientists to deploy machine learning models in Databricks, optimizing predictive models that enhance fraud detection, customer insights, and risk management for the bank.

• Monitored AWS resources using AWS CloudWatch, providing real-time monitoring, logging, and alerting for key AWS services used in Bank of America's financial data architecture.

• Automated CI/CD pipelines with AWS CodePipeline, enabling seamless deployment of updates to fraud detection systems and other critical financial services across development, testing, and production environments at Bank of America.

• Used Python to develop real-time data streaming applications, leveraging tools like Apache Kafka, Flink, and PySpark for handling continuous data streams.

• Provisioned AWS infrastructure using AWS CloudFormation, ensuring consistent and secure deployment of financial data processing resources via infrastructure as code (IaC).

• Implemented messaging and queuing systems using AWS SQS, enabling asynchronous communication between distributed components of Bank of America's financial systems, supporting real-time transaction processing and risk analysis.

• Built event-driven architectures using AWS SNS, facilitating messaging and notifications for key events such as transaction monitoring and fraud alerts across distributed systems at Bank of America.

• Applied statistical methods and predictive modeling techniques on AWS, identifying potential financial risks, fraud patterns, and customer behavior trends to enhance decision-making at Bank of America.

• Collaborated with IT infrastructure teams to optimize data storage and processing for large financial datasets, improving the efficiency of Bank of America's data-driven initiatives.

• Participated in the selection of Big Data technologies and platforms on AWS to support Bank of America's financial data analysis and regulatory compliance efforts, improving the bank's overall analytical capabilities.
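
Illustrative sketch (Python, Apache Airflow), referenced from the Airflow DAG bullet above. A minimal daily DAG with two dependent tasks; the DAG id, schedule, and callables are hypothetical stand-ins for the actual extraction and fraud-scoring logic.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the real pipeline steps
def extract_transactions(**context):
    print("extracting transactions for", context["ds"])

def score_for_fraud(**context):
    print("running fraud scoring for", context["ds"])

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="fraud_detection_daily",      # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",       # daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(
        task_id="extract_transactions",
        python_callable=extract_transactions,
    )
    score = PythonOperator(
        task_id="score_for_fraud",
        python_callable=score_for_fraud,
    )

    extract >> score  # scoring runs only after extraction succeeds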

Big Data Engineer

Volkswagen, Herndon, Virginia, Sep’16 – Dec’18

• Created and managed AWS CloudFormation templates to automate cloud deployments, ensuring consistency and efficiency

• Assumed responsibility for continuous monitoring and efficient management of EMR clusters using the AWS console, optimizing data processing capabilities

• Utilized the Amazon AWS IAM console to create custom users and groups, ensuring precise control over permissions and access to AWS resources

• Designed scalable ETL pipelines for data ingestion, transformation, and processing using Big Data technologies like Hadoop, Spark, Kafka, Hive, and AWS Glue.

• Played a pivotal role in designing serverless solutions, leveraging AWS API Gateway, Lambda, S3, and DynamoDB to optimize performance and scalability while reducing costs

• Successfully executed a migration from on-premises RDBMS to AWS RDS, concurrently implementing Lambda functions to process and store data efficiently in S3 using Python

• Introduced support for Amazon AWS S3 and RDS to host static and media files while housing the database within the Amazon Cloud environment for a streamlined architecture

• Proficiently configured and ensured secure connections to the AWS RDS database running on MySQL, facilitating data accessibility and integrity

• Managed policies for AWS S3 buckets and Glacier, enhancing data storage, backup, and retrieval within the AWS ecosystem

• Actively employed AWS CloudFormation to ensure consistent and automated infrastructure provisioning on AWS, enhancing the infrastructure management process

• Demonstrated expertise in installing and utilizing the AWS Command Line Interface (CLI) for efficient control of various AWS services via Shell/Bash scripting, streamlining administrative tasks

• Conducted data processing operations within AWS Glue (Scala and PySpark), efficiently extracting data from MySQL data stores and loading it into the Amazon Redshift data warehouse for in-depth analysis (see the sketch at the end of this role)

• Proficiently handled the deployment of code to AWS CodeCommit using Git commands such as pull, fetch, push, and commit from the AWS CLI, ensuring version control and collaboration

• Leveraged custom Amazon Machine Images (AMIs) to automate server builds during periods of peak demand, ensuring seamless auto-scaling capabilities for applications

• Deployed applications in AWS using Elastic Beanstalk for simplified management and scalability, enabling efficient application deployment and scaling
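
Illustrative sketch (PySpark, AWS Glue), referenced from the Glue MySQL-to-Redshift bullet above. A minimal Glue ETL job that reads a cataloged MySQL table as a DynamicFrame, drops an unneeded field, and loads the result into Redshift through a Glue connection; the catalog database, table, connection, and S3 staging path are hypothetical placeholders.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical Glue Data Catalog database/table backed by MySQL
orders = glue_context.create_dynamic_frame.from_catalog(
    database="vehicle_ops", table_name="mysql_orders"
)

trimmed = orders.drop_fields(["internal_notes"])  # drop a column not needed downstream

# Hypothetical Glue connection to Redshift, with an S3 temp dir for COPY staging
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=trimmed,
    catalog_connection="redshift-analytics",
    connection_options={"dbtable": "analytics.orders", "database": "dw"},
    redshift_tmp_dir="s3://example-glue-temp/orders/",
)

job.commit()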

Data Engineer

DaVita Inc., Denver, Jan’14 – Aug’16

• Conducted performance optimizations for HIVE analytics and SQL queries, which included the creation of tables, views, and the implementation of Hive-based exception processing to enhance query efficiency

• Developed numerous internal and external tables within the HIVE framework, addressing specific data storage and retrieval requirements

• Loaded extensive volumes of log data from diverse sources directly into HDFS using Apache Flume for streamlined data ingestion

• Employed advanced techniques such as partitioning and bucketing within Hive tables, ensuring organized data management and daily data aggregation.

• Executed the migration of ETL jobs to Pig scripts, facilitating complex data transformations, join operations, and pre-aggregations before storing data in HDFS.

• Leveraged multiple Java-based MapReduce programs for efficient data extraction, transformation, and aggregation, accommodating various file formats and processing needs

• Implemented the creation of Hive tables with dynamic partitions and buckets, enabling efficient data sampling and enhancing data manipulation through Hive Query Language (HiveQL); see the sketch at the end of this role

• Prepared, transformed, and analyzed large volumes of structured, semi-structured, and unstructured data using Hive queries and Pig scripts to extract valuable insights

• Developed UNIX shell scripts and Python scripts for generating reports based on data extracted from Hive, enhancing data reporting capabilities

• Managed the creation of both internal and external tables while utilizing Hive Data Definition Language (DDL) commands to create, alter, and drop tables as needed

• Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, paired RDDs, and Spark on YARN.

• Used Google Cloud Composer to build and deploy data pipelines as DAGs using Apache Airflow

• Contributed to data cleansing by writing Pig Scripts and implementing Hive tables to store processed data in a tabular format, enabling more accurate analysis

• Spearheaded the setup of the Quality Assurance (QA) environment and updated configurations to effectively implement data processing scripts with Pig

• Played a key role in the development of Pig Scripts for change data capture and delta record processing, allowing the seamless integration of newly arrived data with existing data in HDFS

• Employed Hive for the creation of tables, data loading, and the execution of Hive Queries, which internally triggered MapReduce jobs to process and analyze the data

• Prepared data for in-depth analysis by executing Hive (HiveQL) queries, Impala queries, and Pig Latin scripts, providing valuable insights into customer behavior and data patterns.
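
Illustrative sketch (PySpark with Hive support), referenced from the dynamic-partition bullet above. A minimal example of creating a dynamically partitioned Hive table and loading it from a staging table; the database, table, and column names are hypothetical, and bucketing is omitted for brevity.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-dynamic-partition-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow Hive to route rows to partitions derived from the data itself
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("CREATE DATABASE IF NOT EXISTS clinical")

spark.sql("""
    CREATE TABLE IF NOT EXISTS clinical.lab_results (
        patient_id STRING,
        test_code  STRING,
        result_val DOUBLE
    )
    PARTITIONED BY (result_date STRING)
    STORED AS ORC
""")

# Dynamic-partition insert: the partition column comes last in the SELECT list
spark.sql("""
    INSERT INTO TABLE clinical.lab_results PARTITION (result_date)
    SELECT patient_id, test_code, result_val, result_date
    FROM clinical.lab_results_staging
""")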

Incident Response Support Associate

Jackson National Life Insurance, Lansing, MI, Jan’12 – Dec’13

• Partnered with risk management teams to enhance and optimize network infrastructure policies concerning the development of new applications and the onboarding of new users.

• Oversaw the escalation and prioritization of support tickets based on established severity levels, from low to high, while actively monitoring and reducing resolution times for reported issues.

• Worked collaboratively in a two-person team to design and create Power BI reports that effectively summarize and visualize key data, including logging details and incident ticket trends.

Education

Bachelor of Science (BS), Data Science

Michigan State University, East Lansing, MI

Certification

Software Engineering Program, Certificate

Springboard


