Ian Mbaya
Lead Big Data Engineer
(AWS, GCP, Azure – ETL, Hadoop, Spark, PySpark, Snowflake)
Phone: 419-***-****; Email: *********@*****.***
Profile Summary
Big Data Engineer with 11+ years of experience in Big Data and 14+ years of overall IT experience, specializing in customer-facing products, with strong expertise in data transformation and ELT on large datasets. Proven track record in architecting, designing, and developing fault-tolerant data infrastructure and reporting solutions, with extensive experience in distributed platforms and systems. Deep understanding of databases, data storage, and data movement, coupled with hands-on collaboration with DevOps teams to identify business requirements and implement CI/CD pipelines.
Big Data Engineering & Architecture
Hadoop Architecture: Proficient in Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, and Pig for distributed data storage and processing.
Delta Lake: In-depth knowledge of Delta Lake architecture for managing big data workloads, including ACID transactions, schema enforcement, time-travel capabilities, and data versioning (see the usage sketch below).
Data Warehousing: Proficient in data warehousing concepts and technologies, including columnar storage, data partitioning, and compression techniques for efficient query performance.
Snowflake: Hands-on experience with Snowflake data cloud platform for secure, scalable, and high-performance data storage and processing, implementing multi-cloud architecture and automatic scaling.
Data Modeling: Expertise in designing robust, scalable data models for data warehouses and data lakes, ensuring high performance and ease of data retrieval.
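Illustrative sketch (not project code): a minimal PySpark example of the Delta Lake capabilities listed above, assuming the delta-spark package is installed; the table path and data are hypothetical.
# Minimal Delta Lake sketch: ACID write, schema enforcement, time travel.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# ACID write with schema enforcement: a later append with a mismatched
# schema would be rejected instead of silently corrupting the table.
events = spark.createDataFrame([(1, "login"), (2, "logout")], ["user_id", "event"])
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Time travel: read the table as it existed at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
v0.show()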
Data Engineering/ETL:
Open-source Distributed Computing Expertise: Designed efficient, low-latency data ingestion processes for distributed systems and data warehouses. In-depth knowledge of Hadoop architecture and components such as HDFS, YARN, and MapReduce, and of Delta Lake architecture, including ACID transactions, schema enforcement, and time-travel capabilities.
NoSQL Databases: Expertise with NoSQL databases such as HBase and Cassandra for low-latency data access. Well-versed in Spark performance tuning using indexes, hints, and partitioning. Experienced in data governance and data quality initiatives.
SQL and PL/SQL: Used SQL and PL/SQL for query development and built Python-based integrations with SQL databases.
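Illustrative sketch (not project code): a minimal Python-to-SQL integration using the standard-library sqlite3 module as a stand-in for a production database; the table and values are hypothetical.
# Minimal Python-to-SQL integration sketch with parameterized queries.
import sqlite3

with sqlite3.connect(":memory:") as conn:
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)"
    )
    # Parameterized inserts avoid SQL injection and handle type conversion.
    conn.executemany(
        "INSERT INTO orders (order_id, amount) VALUES (?, ?)",
        [(1, 120.50), (2, 75.00), (3, 230.10)],
    )
    total, = conn.execute("SELECT SUM(amount) FROM orders").fetchone()
    print(f"total order amount: {total}")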
Cloud Platforms:
Google Cloud Platform (GCP): Proficient in designing and implementing fault-tolerant data infrastructure using GCP services like Google Compute Engine, Google Cloud Storage, and Dataproc. Experienced in developing data pipelines using GCP services such as BigQuery, Cloud Dataflow, and Cloud Composer. Skilled in implementing best practices for data security and governance on GCP, including encryption, IAM policies, and compliance controls.
Amazon Web Services (AWS): Demonstrated proficiency with AWS services including EC2, S3, RDS, and EMR. Skilled in implementing CI/CD pipelines on AWS to automate deployment and testing processes. Experienced in using AWS Lambda for developing serverless solutions and event-driven architectures.
Microsoft Azure: Skilled in using Azure data services like Azure SQL Database and Azure Data Factory for managing large datasets. Proficient in implementing Azure Databricks for scalable ETL solutions. Extensive experience in managing and optimizing on-premises data storage systems, including relational databases and distributed file systems.
Data Governance & Quality
Data Governance Frameworks: Extensive experience implementing data governance frameworks to ensure compliance with industry standards (GDPR, HIPAA) and organizational policies.
Data Stewardship: Expertise in data stewardship to oversee data management practices, ensuring data integrity, security, and privacy across the data lifecycle.
Master Data Management (MDM): Proficient in implementing Master Data Management (MDM) solutions for accurate, consistent, and reliable data across multiple systems.
Data Quality Assurance: Proven track record in designing and implementing data quality assurance (DQA) processes, data lineage tools, data cleansing and data transformation, and metadata management.
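Illustrative sketch (not project code): one way the data quality checks described above could look in PySpark; the dataset, columns, and rules are hypothetical.
# Hypothetical PySpark data-quality check: null and range rules over an
# illustrative dataset; column names and thresholds are examples only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "2023-01-05", 42), (2, None, 37), (3, "2023-02-11", -1)],
    ["record_id", "visit_date", "age"],
)

checks = {
    "visit_date_not_null": F.count(F.when(F.col("visit_date").isNull(), 1)),
    "age_in_range": F.count(F.when(~F.col("age").between(0, 120), 1)),
}

violations = df.agg(*[expr.alias(name) for name, expr in checks.items()]).first()
for name in checks:
    print(f"{name}: {violations[name]} violating row(s)")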
Technical Skills
Development & Programming
Programming Languages: Python, Scala, Java, PySpark, SQL, R
Scripting & Query Languages: Shell Scripting, HiveQL, Spark SQL, MapReduce, Bash, PowerShell
Programming Paradigms: Object-Oriented Programming (OOP), Functional Programming
Big Data Ecosystem
Frameworks & Tools: Apache Hadoop, Apache Spark, Apache Hive, Apache HBase, Apache Impala, Apache Pig, Apache Kafka, Kafka MirrorMaker, Presto, Flink
Data Ingestion & Processing: Apache Nifi, AWS Glue, Databricks, Azure Data Factory, Google Dataflow, Flume, Sqoop
Workflow Orchestration: Apache Airflow, Apache Oozie, AWS Step Functions, Prefect, Luigi
Cloud Platforms & Services
Amazon Web Services (AWS): S3, EC2, EMR, RDS, Glue, Athena, Lambda, Step Functions, SQS, SNS, Redshift, DynamoDB, CodePipeline, CloudFormation, CloudWatch, IAM, KMS
Google Cloud Platform (GCP): BigQuery, Dataproc, Dataflow, Bigtable, Pub/Sub, Google Cloud Storage, Google Kubernetes Engine (GKE)
Microsoft Azure: Azure Blob Storage, HDInsight, Azure Functions, Azure Data Factory, Azure Monitor, Azure Synapse Analytics, Azure DevOps, Azure Resource Manager (ARM), Cosmos DB
Databases & Storage Solutions
Relational Databases: SQL Server, PostgreSQL, Oracle, AWS RDS, Google Cloud SQL, MySQL
NoSQL Databases: DynamoDB, MongoDB, Cassandra, HBase, Bigtable, Redis, Couchbase
Data Warehousing: Snowflake, Redshift, Synapse Analytics, BigQuery, Teradata
Data Formats & Storage
File Formats: CSV, JSON, Avro, Parquet, ORC, XML, YAML
File Systems: HDFS, S3, Google Cloud Storage, Azure Data Lake
DevOps & CI/CD
Continuous Integration Tools: Jenkins, AWS CodePipeline, Azure DevOps, CircleCI, TravisCI, Bamboo
Containerization & Orchestration: Docker, Kubernetes, OpenShift
Infrastructure as Code (IaC): Terraform, AWS CloudFormation, Azure ARM Templates, Ansible, Chef, Puppet
Performance Optimization: Data Pipeline Optimization, Caching Mechanisms, Parallel Processing, Distributed Computing
Data Analysis & Machine Learning (ML)
ML Frameworks: Scikit-learn, Apache Mahout, TensorFlow, PyTorch, H2O.ai, MLlib, Databricks Machine Learning
Advanced Analytics: Natural Language Processing (NLP), Predictive Modeling, Sentiment Analysis
Streaming Analytics: Kafka Streaming, Spark Streaming, Flink Streaming
Data Visualization Tools: Tableau, Power BI, AWS QuickSight, Kibana, Looker, D3.js
Security & Governance
Identity & Access Management: AWS IAM, Azure AD, Kerberos
Data Governance: Apache Ranger, Apache Atlas, GDPR Compliance, HIPAA Compliance
Project Management & Methodologies
Methodologies: Agile, Scrum, Kanban, DevOps
Project Practices: Continuous Integration (CI), Continuous Delivery (CD), Test-Driven Development (TDD), Behavior-Driven Development (BDD), Lean Six Sigma, Design Thinking
Version Control & Collaboration
Versioning Tools: Git, GitHub, Bitbucket, GitLab
Search Engines: Apache Lucene, Elasticsearch, Solr
Collaboration Tools: JIRA, Confluence, Slack, Microsoft Teams
Professional Experience
Lead Data Engineer
Huntington Bank, Columbus, Ohio Aug 2023 - Present
•Managed the complete lifecycle of data assets on GCP, ensuring seamless flow from ingestion through transformation to delivery for Huntington Bancshares Inc. stakeholders.
•Facilitated the integration of GCP data assets into external-facing data products & services.
•Spearheaded projects within the GCP platform to govern data assets, including ingestion, transformation, and pipelining processes.
•Developed Spark applications using Scala for efficient data processing.
•Develop and optimize Python scripts for data extraction, transformation, and loading (ETL) processes, handling large datasets efficiently using tools like Pandas, Dask, and PySpark.
•Design, develop, and maintain data warehouse solutions on the Snowflake platform, ensuring high performance and scalability.
•Led initiatives to develop data products and services utilizing GCP platform assets for external stakeholders.
•Designed and implemented data pipelines to extract, transform, and load (ETL) data from various sources into the Cloudera platform (CDH) using tools like Sqoop, Flume, or Kafka.
•Automate the build, testing, and deployment processes using CI/CD tools like Jenkins, GitLab CI, Azure DevOps, CircleCI, or Apache Airflow.
•Directed cross-functional teams in designing, developing, and implementing data projects on GCP, fostering collaboration across departments.
•Design, implement, and manage end-to-end data pipelines using Apache Airflow to automate ETL processes, data ingestion, transformation, and loading across Big Data environments (see the DAG sketch below).
•Architect, design, and implement scalable Snowflake data warehouses for Big Data environments, optimizing for performance, cost-efficiency, and security.
•Develop data models, schema designs, and data pipelines in Snowflake to support high-performance data analytics and business intelligence.
•Integrate version control with CI/CD pipelines to enable continuous integration, ensuring high code quality and consistency across deployments.
•Identified opportunities to optimize the GCP platform for improved performance, scalability, and reliability, implementing necessary solutions.
•Implemented robust quality assurance measures to validate data accuracy, completeness, and consistency across the GCP data pipeline.
•Utilized Kafka for real-time data streaming and messaging within big data projects.
•Ensured adherence to data governance policies & security standards on GCP, executing measures to protect sensitive data.
•Provided expertise in GCP Big Data technologies and best practices, guiding the efficient design and implementation of data solutions.
•Write efficient PySpark code for handling data processing workflows that scale with data volume and complexity.
•Develop and deploy PySpark applications to run on Hadoop clusters, leveraging Spark for distributed data processing.
•Optimize PySpark jobs by tuning performance parameters, partitioning, and caching to ensure scalability and efficiency.
•Collaborate with data scientists and business analysts to develop PySpark workflows for advanced analytics and machine learning model development.
•Develop ETL processes using tools like Snowflake's Snowpipe, Talend, Informatica, or custom scripts to ingest data from various sources.
•Develop and manage ETL pipelines to extract data from various sources and load it into Snowflake, ensuring clean and structured data for analytics.
•Architect Snowflake data models, creating efficient schema designs to support data analytics and business intelligence operations.
•Communicated with stakeholders to gather requirements, provide project updates, and address concerns related to data projects.
•Documented technical specifications and procedures for GCP data governance and platform usage, delivering training to team members as required.
•Managed and secured data stored in HDFS and other Cloudera storage solutions like HBase.
•Utilized Google Cloud Storage for scalable and durable storage of large datasets.
•Designed and orchestrated data processing pipelines using Google Cloud Dataflow for both real-time and batch modes.
•Design, implement, and manage automated Jenkins pipelines to support continuous integration and continuous deployment (CI/CD) for Big Data projects, ensuring seamless delivery of code and data processing tasks.
•Integrated Kafka with Spark Streaming for real-time data processing.
•Leveraged Google Cloud Functions for serverless event-driven computing.
•Configured and managed Kafka clusters for real-time data streaming and messaging.
•Utilized Google Cloud Dataproc to manage and scale Apache Spark and Hadoop clusters.
•Design, develop, and optimize Extract, Transform, Load (ETL) processes using AWS Glue to efficiently manage data pipelines for large-scale Big Data workloads.
•Leveraged Google Cloud Pub/Sub for reliable, asynchronous messaging between independent applications.
•Implemented streaming data pipelines in Databricks using Structured Streaming.
•Used Google BigQuery for high-performance, serverless data warehousing and analytics.
•Leveraged Google Bigtable for scalable NoSQL storage of structured and semi-structured data.
•Utilized Kafka Connect for integrating external data sources and sinks with Kafka clusters.
•Integrated Databricks with Delta Lake to enable ACID transactions and improve data reliability.
•Integrated Scala with Apache Spark for efficient distributed data processing.
•Implemented data engineering solutions on GCP with a focus on scalability, reliability, and cost-effectiveness, adhering to best practices and security standards.
•Collaborated with cross-functional teams to design, develop, and deploy data-driven applications and solutions on Google Cloud Platform.
•Stay current with the latest advancements in GCP's data engineering tools and technologies.
•Utilized GitHub for version control and collaboration on code repositories within big data projects.
•Leveraged Docker for containerization of big data applications and services.
•Implemented multi-tenancy in Kafka clusters to support isolation and resource management for different user groups or applications.
•Designed and implemented scalable ETL pipelines in Databricks.
•Develop and optimize scalable Databricks notebooks for data exploration, transformation, and machine learning model development.
•Use Databricks to implement real-time data processing pipelines by integrating with Apache Spark and other tools.
•Integrated PySpark with external data sources and storage systems for data ingestion and extraction.
•Implemented Jenkins for continuous integration and continuous deployment (CI/CD) pipelines within big data workflows.
•Utilized Scala's type safety and pattern matching capabilities for robust error handling and data validation in data processing pipelines.
•Utilized Terraform for infrastructure as code (IaC) provisioning within big data environments.
•Implemented data transformation and cleansing operations using PySpark and Scala, ensuring data quality and integrity throughout the pipeline.
•Develop and optimize PySpark applications for distributed data processing in big data environments such as Apache Hadoop and Apache Spark.
•Design and implement ETL pipelines using PySpark for processing large datasets in batch and real-time modes.
•Leverage PySpark to perform data transformation, cleaning, and aggregation for both structured and unstructured data.
•Developed Spark applications using Scala to process large-scale data, significantly improving the performance of data transformation and ETL pipelines.
•Develop and optimize Spark jobs using Scala to process large-scale datasets in distributed computing environments.
•Implement Scala-based data pipelines for efficient batch processing and real-time data ingestion.
•Design and implement scalable Spark/Scala applications for data transformation and data integration tasks.
•Leverage Scala to integrate with external data sources (e.g., HDFS, S3, and databases) for data extraction and loading.
•Leveraged PySpark and Databricks for efficient distributed data processing and transformation, improving data processing times by 25%.
•Designed and maintained scalable Snowflake data models, optimizing complex queries and ensuring high-performance data analytics for business stakeholders.
Technical Stack: GCP, Google BigQuery, Google Cloud Storage, Google Cloud Dataproc, Google Cloud Pub/Sub, Google Cloud Dataflow, Google Bigtable, Google Cloud Functions, Cloudera (CDH), HDFS, HBase, Kafka, Kafka Connect, Kafka Streaming, Jenkins, GitLab CI, Azure DevOps, CircleCI, Apache Airflow, Spark, Scala, PySpark, Python, Pandas, Dask, Snowflake, Snowpipe, Talend, Informatica, Databricks, Delta Lake, Structured Streaming, AWS Glue, Terraform, Docker, GitHub, ETL, CI/CD, data pipelines, real-time processing, NoSQL, serverless, data warehousing, data governance, data modeling, security standards.
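Illustrative sketch (not project code): a minimal Apache Airflow DAG of the extract-transform-load orchestration described in this role; the DAG id, callables, and schedule are hypothetical and assume Airflow 2.4+.
# Illustrative Airflow DAG: extract -> transform -> load, on a daily schedule.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Placeholder: pull source records (e.g., from an API or a landing bucket).
    return [{"id": 1, "amount": 120.5}, {"id": 2, "amount": 75.0}]


def transform(**context):
    rows = context["ti"].xcom_pull(task_ids="extract")
    # Placeholder transformation: filter and reshape records.
    return [r for r in rows if r["amount"] > 100]


def load(**context):
    rows = context["ti"].xcom_pull(task_ids="transform")
    print(f"would load {len(rows)} row(s) into the warehouse")


with DAG(
    dag_id="example_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Task dependencies define the DAG's execution order.
    extract_task >> transform_task >> load_task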
Lead Data Engineer
AbbVie Inc., North Chicago, IL Jan 2021 – Jul 2023
•Developed and deployed Big Data analytics solutions on AWS, monitoring and analyzing adverse event reports, safety signals, and other pharmacovigilance data sources.
•Leveraged advanced analytics techniques, including Natural Language Processing (NLP) and machine learning, on AWS to extract insights from unstructured data sources such as medical records, clinical trial data, and social media.
•Used Airflow to automate Big Data workflows, setting up DAGs (Directed Acyclic Graphs) for task orchestration, dependency management, and scheduling.
•Implemented Scala-based test suites using frameworks like ScalaTest and Specs2 to ensure code quality, facilitating automated testing and continuous integration.
•Designed and implemented data processing pipelines on AWS to ingest, cleanse, and transform large volumes of pharmacovigilance data for analysis and reporting.
•Utilized Python to interact with Big Data technologies such as Apache Hadoop, Spark, Kafka, and Hive, writing scalable code for distributed data processing and real-time analytics.
•Developed and deployed data analysis applications using tools like Spark or Cloudera Impala.
•Collaborated with data scientists, pharmacovigilance experts, and regulatory professionals on AWS to define requirements and deliver actionable insights.
•Implemented Snowflake database objects (tables, views, schemas, and stored procedures) to support data warehousing needs and integrated Snowflake with Big Data tools like Apache Spark, Kafka, and Hadoop.
•Leveraged AWS S3 for scalable, durable object storage, facilitating storage and retrieval of large datasets.
•Utilized Databricks notebooks for data exploration, transformation, and machine learning model development.
•Processed and analyzed continuous data streams from Kafka topics using Spark's Structured Streaming API, enabling complex event processing and pattern recognition.
•Automated the deployment and management of data processing frameworks by integrating Jenkins with Big Data tools like Apache Hadoop, Spark, Kafka, and Hive.
•Leveraged AWS Glue for ETL tasks, enabling automated data preparation and integration for analytics and machine learning.
•Managed AWS EMR (Elastic MapReduce) clusters for large-scale data processing with Apache Hadoop, Spark, and other frameworks.
•Queried data from AWS S3 and other sources using AWS Athena, enabling interactive querying and analysis.
•Designed logical and physical data models in Snowflake to support analytics and reporting.
•Integrated Snowflake with other cloud data services such as AWS S3, Google Cloud Storage, and Azure Blob Storage for seamless data ingestion.
•Implemented Snowflake's Snowpipe for continuous data ingestion, enabling real-time data streaming and processing.
•Designed and optimized Snowflake queries using SQL for complex data analysis, aggregation, and reporting.
•Implemented serverless computing solutions with AWS Lambda, enabling event-driven processing without managing servers.
•Utilized Kafka MirrorMaker for data replication across clusters, ensuring data durability and disaster recovery.
•Installed, configured, and maintained the Cloudera CDH environment.
•Orchestrated workflows using AWS Step Functions for serverless applications and coordinated distributed components.
•Employed Databricks REST API to automate administrative tasks and integrate with other platforms for workflow automation.
•Collaborated with cross-functional teams to design and deploy production-ready data solutions using Databricks.
•Designed and developed ETL pipelines in Databricks to process large-scale datasets for analytics and reporting.
•Utilized Databricks to implement Apache Spark applications, enhancing the speed and performance of data processing workflows.
•Created and maintained documentation for Jenkins pipelines, CI/CD best practices, and automated workflows for Big Data teams.
•Developed responsive and scalable web services for data visualization using Scala and Play Framework.
•Collaborated with data scientists to manage and deploy machine learning models in Databricks, enhancing predictive analytics.
•Monitored and managed AWS resources using AWS CloudWatch for real-time logging and alerting.
•Automated software release pipelines using AWS CodePipeline for continuous integration and delivery (CI/CD) across environments.
•Provisioned AWS infrastructure using AWS CloudFormation, enabling Infrastructure as Code (IaC) for consistent and repeatable deployments.
•Implemented messaging solutions using AWS SQS for asynchronous communication between distributed components.
•Built scalable, event-driven architectures with AWS SNS for messaging and push notifications across distributed systems.
•Applied predictive modeling techniques on AWS to identify potential safety risks, drug interactions, and adverse reactions in pharmaceutical products.
•Integrated PySpark into data pipelines for real-time data transformation and processing, reducing latency and improving data accuracy.
•Integrated PySpark with cloud storage solutions such as Amazon S3, Google Cloud Storage, and Azure Blob Storage for seamless data ingestion and storage.
•Built and maintained real-time data pipelines using PySpark and Kafka for continuous data processing (see the streaming sketch below).
•Wrote PySpark scripts to process and analyze data in HDFS, Google BigQuery, and other distributed storage systems.
•Implemented data validation and data cleansing using PySpark to ensure high-quality data for analytics and reporting.
•Worked with PySpark SQL for querying data within Spark, enabling complex transformations and aggregations on large datasets.
•Developed and optimized Scala-based Spark jobs to handle large volumes of pharmacovigilance data, enhancing the throughput of the data processing system.
•Designed and managed Snowflake data warehouses, ensuring efficient schema design and performance tuning for analytics and reporting purposes.
•Collaborated with IT infrastructure teams to optimize data storage and processing for pharmacovigilance datasets.
•Evaluated and selected Big Data technologies on AWS to enhance data analysis capabilities for pharmacovigilance initiatives.
Technical Stack: AWS, AWS S3, AWS Glue, AWS Athena, AWS EMR, AWS Lambda, AWS Step Functions, AWS SQS, AWS SNS, AWS CodePipeline, AWS CloudFormation, AWS CloudWatch, Apache Hadoop, Apache Spark, Apache Kafka, Kafka MirrorMaker, Hive, Cloudera (CDH), Scala, ScalaTest, Specs2, Python, PySpark, Airflow, Databricks, Databricks REST API, Snowflake, ETL, NLP, machine learning, pharmacovigilance, Jenkins, CI/CD, Infrastructure as Code (IaC), Play Framework, Structured Streaming, data warehousing, data modeling, event-driven architectures, predictive analytics, Big Data workflows, DAGs, distributed data processing.
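Illustrative sketch (not project code): a minimal PySpark Structured Streaming job of the Kafka-based real-time pipelines described in this role; the broker, topic, schema, and paths are hypothetical, and the spark-sql-kafka-0-10 connector is assumed to be on the classpath.
# Illustrative Structured Streaming job: read a Kafka topic, parse JSON
# events, and append them to Parquet. Broker, topic, schema, and paths are
# hypothetical examples.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

event_schema = StructType([
    StructField("case_id", StringType()),
    StructField("event_type", StringType()),
    StructField("reported_at", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "adverse-events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers bytes; cast the value column and parse it with the schema.
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "/tmp/events/")
    .option("checkpointLocation", "/tmp/checkpoints/events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()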
Sr. Data Engineer
Oscar Health, New York City, NY Apr 2018 – Dec 2020
•Developed custom connectors and leveraged Snowflake’s external table features for accessing external data sources, including Azure Blob Storage.
•Utilized PySpark in conjunction with Databricks notebooks for interactive data analysis, enabling faster prototyping and model development.
•Leveraged PySpark for large-scale joins, window functions, and aggregations on distributed datasets for analytics.
•Designed and implemented machine learning pipelines using PySpark MLlib for tasks such as classification, regression, clustering, and recommendation (see the pipeline sketch below).
•Integrated PySpark with Databricks for optimized big data processing and to take advantage of its collaborative notebook environment.
•Used PySpark for data serialization and deserialization with formats like Parquet, Avro, and ORC to ensure efficient data storage.
•Automated data pipelines and ETL jobs using PySpark, ensuring data flowed seamlessly between systems.
•Implemented and maintained PySpark-based jobs within CI/CD pipelines to ensure automated deployment and testing of data applications.
•Designed and implemented Spark/Scala data pipelines to process and aggregate large healthcare datasets, improving operational efficiency.
•Used Scala to implement data validation logic and ensure data integrity during ETL processes.
•Integrated Scala with Apache Kafka to support real-time data processing and streaming.
•Optimized the performance of Spark jobs written in Scala using techniques such as partitioning, shuffle tuning, and caching.
•Deployed data solutions on Snowflake for seamless data integration across multiple cloud platforms, ensuring scalable and optimized data warehousing.
•Integrated Snowflake with Big Data tools to support hybrid data architectures and ensure seamless data processing across platforms, including Azure.
•Managed and maintained Snowflake data security, enforcing access controls, encryption, and monitoring of user activity to ensure compliance.
•Created and managed data marts within Snowflake to provide business teams with optimized, subject-specific data storage solutions.
•Implemented data transformation workflows in Snowflake using SQL and Snowflake Streams to support business intelligence and analytics.
•Integrated IaC scripts with CI/CD pipelines to automate infrastructure deployments, including clusters, databases, and data storage systems across cloud providers (Azure, AWS, GCP).
•Leveraged Databricks notebooks on Azure for data exploration, transformation, and machine learning model development.
•Collaborated with data scientists to deploy and manage machine learning models in Databricks on Azure.
•Leveraged Databricks to build and automate scalable machine learning pipelines, using MLflow for experiment tracking.
•Used Databricks to orchestrate end-to-end data pipelines and automate data workflows with Apache Airflow.
•Implemented serverless computing solutions using Azure Functions for event-driven processing and execution of code without provisioning servers.
•Designed and implemented data processing pipelines on Azure to ingest, cleanse, and transform large volumes of data for analysis and reporting purposes.
•Developed and deployed Big Data analytics solutions on Azure, monitoring and analyzing data sources for insights.
•Configured and managed Azure HDInsight clusters for processing large-scale data using Apache Hadoop, Spark, and other frameworks.
•Utilized Azure Blob Storage for scalable, durable, and secure object storage, facilitating storage and retrieval of large data volumes.
•Monitored and managed Azure resources using Azure Monitor, enabling real-time monitoring, logging, and alerting for Azure services and applications.
•Automated software release pipelines using Azure DevOps, enabling continuous integration and delivery (CI/CD) of code changes across development, testing, and production environments.
•Provisioned and managed Azure infrastructure using Azure Resource Manager (ARM) templates, enabling infrastructure as code (IaC) for consistent and repeatable deployment of resources.
Technical Stack: Azure, Azure Blob Storage, PySpark, Scala, Azure HDInsight, Azure Functions, Azure Monitor, Azure DevOps, Azure Resource Manager (ARM), Snowflake, Databricks, Databricks notebooks, Apache Hadoop, Apache Spark, CI/CD, Infrastructure as Code (IaC), Python, machine learning, external tables, data pipelines, Big Data analytics, hybrid data architectures, scalable storage, event-driven processing, data transformation, data exploration, predictive analytics.
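Illustrative sketch (not project code): a minimal PySpark MLlib classification pipeline of the kind described in this role; the columns and sample data are hypothetical.
# Illustrative MLlib pipeline: assemble features, fit a logistic regression
# model, and score the same rows. Columns and data are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

data = spark.createDataFrame(
    [(34.0, 2.0, 0.0), (51.0, 7.0, 1.0), (29.0, 1.0, 0.0), (63.0, 9.0, 1.0)],
    ["age", "prior_claims", "label"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["age", "prior_claims"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])

model = pipeline.fit(data)
model.transform(data).select("age", "prior_claims", "prediction").show()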
Data Engineer
Constellation Energy Corporation, Baltimore, Maryland Dec 2015 – Mar 2018
•Leveraged Cloudera distribution of Apache Hadoop and HDFS for distributed storage and processing of large-scale data for Constellation Energy Corporation.
•Utilized Apache Hive for data warehousing and querying, enabling SQL-like queries and analytics on Hadoop Distributed File System (HDFS) data.
•Utilized Apache Spark for in-memory data processing and analytics, enabling high-speed processing and