Senior Big-Data Engineer

Location:
Manhattan, NY, 10016
Posted:
April 15, 2024

Resume:

Mahmoud Abusara

Phone: 516-***-**** Email: ad40t3@r.postjobfree.com

Professional Summary

Strategic professional offering 10+ years of rich, qualitative experience in Big Data development and ETL technologies, proposing effective solutions through an analytical approach, with a track record of building large-scale systems.

•Skilled in Amazon Web Services (AWS), with the ability to adapt cloud knowledge to other platforms such as Google Cloud Platform (GCP) and Microsoft Azure, ensuring flexibility across cloud environments.

•Demonstrated excellence in managing data lakes on GCP, including best practices for data lake architecture.

•Well-versed in Google Cloud Dataprep for data preparation and transformation.

•Applied Terraform with Google Cloud to manage and provision GCP resources for the data pipeline.

•Created workflows using Google Cloud Composer to orchestrate and manage complex data processing tasks.

•Proficient in Google Cloud Pub/Sub for scalable, event-driven messaging.

•Experienced with Google Cloud Audit Logs for auditing and compliance monitoring.

•Strong skills in designing scalable data architectures on Azure, utilizing services like Azure Data Lake, Azure Synapse Analytics, and Azure SQL.

•Proven track record in building and managing data pipelines on Azure Data Factory.

•Adept in utilizing AWS EMR and AWS Lambda to process and analyze large volumes of data.

•Successfully designed and implemented a data warehousing solution using Amazon Redshift and Amazon Athena to enable efficient querying and analysis of data.

•Possess expertise in AWS CloudWatch for monitoring and managing AWS resources.

•Hands-on experience in implementing data security and access controls using AWS Identity and Access Management (IAM).

•Created workflows using AWS Step Functions to orchestrate and manage complex data processing tasks.

•Formulated comprehensive technical strategies and devised scalable CloudFormation Scripts, ensuring streamlined project execution.

•Utilized AWS CloudFormation to manage and provision AWS resources for the data pipeline.

•Expertise in Spark performance optimization across multiple platforms, including Databricks, Glue, EMR, and on-premises clusters.

•Involved in migrating on-premises data to the cloud and implementing CI/CD pipelines with tools such as Jenkins, CodePipeline, Azure DevOps, Kubernetes, Docker, and GitHub.

•Skilled in data visualization techniques and tools such as Tableau, Power BI, and Matplotlib to present findings and trends effectively to stakeholders.

•Hands-on experience with major components in the Hadoop Ecosystem, such as Spark, HDFS, HIVE, HBase, Zookeeper, Sqoop, Oozie, and Kafka, enabling the development of robust data solutions.

•Experienced in performing exploratory data analysis (EDA) to identify patterns, anomalies, and relationships within datasets, facilitating informed decision-making.

•Expertise in designing and managing data transformation and filtration patterns using Spark, Hive, NiFi, and Python, ensuring efficient data processing (an illustrative PySpark sketch appears at the end of this summary).

•Demonstrates strong knowledge and experience in upgrading MapR, CDH (Cloudera Distribution for Hadoop), and HDP (Hortonworks Data Platform) clusters, ensuring seamless transitions and enhanced system performance.

•Strong understanding of data governance principles and compliance requirements to ensure data integrity, privacy, and security.

•Experienced in statistical analysis and hypothesis testing to interpret data trends, patterns, and relationships, and make data-driven recommendations.

•Worked directly with business stakeholders to understand project needs and objectives.
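
A minimal PySpark sketch of the transformation and filtration pattern described in this summary; the bucket paths, column names, and keys are hypothetical stand-ins, not details from any specific engagement.

    from pyspark.sql import SparkSession, functions as F

    # Minimal filter/transform pattern: drop malformed rows, derive a partition
    # column, de-duplicate, and write partitioned Parquet. All names are hypothetical.
    spark = SparkSession.builder.appName("filter-transform-sketch").getOrCreate()

    raw = spark.read.json("s3a://example-bucket/raw/events/")
    clean = (
        raw.filter(F.col("event_id").isNotNull())            # filtration step
           .withColumn("event_date", F.to_date("event_ts"))  # derived partition key
           .dropDuplicates(["event_id"])                     # remove repeated events
    )
    clean.write.mode("overwrite").partitionBy("event_date").parquet(
        "s3a://example-bucket/curated/events/"
    )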

Technical Skills

IDE: Jupyter Notebooks, Eclipse, IntelliJ, PyCharm, VS Code

Project Methods: Agile, Kanban, Scrum, DevOps, Continuous Integration, Test-Driven Development, Unit Testing, Functional Testing, Design Thinking, Lean Six Sigma

Hadoop Distributions: Hadoop, Cloudera Hadoop, Hortonworks Hadoop

Cloud Platforms: Amazon AWS, Google Cloud (GCP), Microsoft Azure

Database & Tools: Redshift, DynamoDB, Synapse DB, Big Table, Aws RDS, SQL Server, PostgreSQL, Oracle, MongoDB, Cassandra, HBase

Programming Languages: Python, Scala, PySpark, SQL

Scripting: Hive, MapReduce, SQL, Spark SQL, Shell Scripting

Continuous Integration CI/CD: Jenkins, CodePipeline, Docker, Kubernetes, Terraform, CloudFormation

Versioning: Git, GitHub

Orchestration Tools: Apache Airflow, Step Functions, Oozie

Programming Methodologies: Object-Oriented Programming, Functional Programming

File Format & Compression: CSV, JSON, Avro, Parquet, ORC

File Systems: HDFS, S3, Google Storage, Azure Data Lake

ELT Tools: Spark, NiFi, AWS Glue, AWS EMR, Databricks, Azure Data Factory, Google Cloud Dataflow

Data Visualization Tools: Tableau, Kibana, Power BI, Amazon QuickSight

Search Tools: Apache Lucene, Elasticsearch

Security: Kerberos, Ranger, IAM

Professional Experience

Aug 2023 – Present

Sr. Data Engineer/Analyst, Citigroup, New York City, NY

(Citigroup Inc. is an American multinational investment bank and financial services corporation headquartered in New York City.)

•Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.

•Set up GCP firewall rules to allow or deny traffic to and from VM instances based on specified configurations, and used GCP Cloud CDN (Content Delivery Network) to deliver content from GCP cache locations, drastically improving user experience and latency.

•Developed Terraform templates for provisioning infrastructure on the Google Cloud Platform.

•Administered Hadoop clusters on GCP using services like Dataproc or other related GCP tools.

•Processed real-time data using Google Cloud Pub/Sub or Dataflow.

•Created data pipelines using Google Cloud Dataflow or other GCP data processing services.

•Orchestrated data pipeline executions using Apache Airflow on GCP (an illustrative DAG sketch appears at the end of this role's bullets).

•Implemented custom NiFi processors to react to and process data for the pipelines.

•Managed data storage in Google Cloud Storage or BigQuery for data warehousing.

•Set up and managed compute instances on Google Cloud Platform.

•Containerized applications using Google Kubernetes Engine (GKE).

•Optimized data processing using Google Cloud Dataprep or Dataflow.

•Implemented data governance and security using Google Cloud IAM.

•Ran Python scripts for custom data processing on Google Cloud instances.

•Developed and optimized data processing workflows using GCP services such as Dataprep and Dataflow.

•Collaborated with data scientists and analysts to build data pipelines and perform data analysis on GCP.

•Conducted exploratory data analysis (EDA) to identify patterns, outliers, and anomalies within datasets, informing further analysis and investigation.

•Conducted data validation and quality assurance processes to ensure accuracy, completeness, and consistency of datasets, maintaining data integrity throughout the analysis process.

•Work closely with stakeholders to define key performance indicators (KPIs) that align with business objectives and strategies.

•Building real-time streaming systems using Amazon Kinesis to process data as it is generated.

•Setting up and maintaining data storage systems, such as Amazon S3 and Amazon Redshift, to ensure data is properly stored and easily accessible for analysis.

•Optimizing data processing systems for performance & scalability using Amazon EMR & Amazon EC2.

•Processed data with a natural language toolkit to count important words and generated word clouds.

•Implementing data governance and security protocols using AWS Identity and Access Management (IAM) and Amazon Macie to ensure that sensitive data is protected.

•Creating and optimizing data processing workflows using AWS services such as Amazon EMR and Amazon Kinesis to process and analyze large amounts of data in a timely and efficient manner.

•Offer timely and insightful ad-hoc analysis and support to various teams and departments, addressing complex data-related queries and facilitating informed decision-making.

•Collaborating with data scientists and analysts to develop data pipelines for tasks such as fraud detection, risk assessment, and customer segmentation; developing and implementing recovery plans and procedures.
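
A minimal sketch of the kind of Airflow DAG orchestration on GCP mentioned above (Cloud Composer runs standard Airflow DAGs); the DAG id, task callables, and schedule are hypothetical.

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Placeholder callables standing in for real extract/load logic.
    def extract_to_gcs(**context):
        print("extract source records and stage them in Cloud Storage")

    def load_to_bigquery(**context):
        print("load the staged files into the warehouse table")

    with DAG(
        dag_id="daily_ingest_sketch",                # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        extract = PythonOperator(task_id="extract_to_gcs", python_callable=extract_to_gcs)
        load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
        extract >> load    # run the load only after the extract succeeds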

Jan 2021 – Aug 2023

Sr. Big Data Engineer/Analyst, Liberty Mutual Insurance Group, Boston, Massachusetts, U.S.

(Liberty Mutual Group is an American diversified global insurer and the sixth largest property and casualty insurer in the United States.)

•Launched Amazon EC2 instances and S3 storage on AWS (Linux/Ubuntu/RHEL) and configured the launched instances for specific applications.

•Mastered the administration of critical AWS services, including IAM, VPC, Route 53, S3, EC2, ELB, CodeDeploy, RDS, ASG, and CloudWatch, orchestrating a resilient, high-performance cloud infrastructure.

•Provided insightful input on intricate questions related to AWS system architecture, showcasing a deep understanding of cloud principles and best practices.

•Established stringent security protocols within the AWS Cloud environment, ensuring the safeguarding of critical assets and compliance with security standards. Handled file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.

•Employed AWS Glue for data cleaning and preprocessing, performed real-time analysis with Amazon Kinesis Data Analytics, and leveraged AWS services such as EMR, Redshift, DynamoDB, and Lambda for scalable, high-performance data processing.

•Evaluated on-premises data infrastructure to identify migration opportunities to AWS; orchestrated data pipelines with AWS Step Functions and utilized Amazon Kinesis for event messaging.

•Proficient in Apache Airflow, an open-source workflow automation platform.

•Applied statistical methods such as regression, hypothesis testing, and clustering to identify meaningful patterns and relationships within data, supporting informed decision-making.

•Created visually appealing charts, graphs, and dashboards using tools like Tableau and matplotlib to effectively communicate analysis findings to stakeholders.

•Implemented data cleaning and preprocessing techniques to handle missing values, outliers, and inconsistencies, ensuring high data quality for analysis purposes.

•Ensured data governance and compliance by adhering to principles and regulations to maintain data integrity, privacy, and security throughout the analysis process.

•Used Hive scripts in Spark for data cleaning and transformation.

•Imported data from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.

•Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.

•Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.

•Monitored and troubleshot Hadoop jobs using the YARN Resource Manager and EMR job logs.

•Converted Cassandra/Hive/SQL queries into Spark transformations using Spark RDDs in Scala.

•Channeled log data collected from the web servers into HDFS using Kafka and Spark Streaming (an illustrative Structured Streaming sketch appears at the end of this role's bullets).

•Developed Spark jobs using Scala in a test environment for faster data processing and used Spark SQL for querying.

•Created Python Flask login and dashboard with Neo4j graph database and executed various cypher queries for data analytics.

•Designed efficient Spark code using Python and Spark SQL to load and transform data, which can be forward-engineered by our code-generation developers.

•Utilized large sets of structured, semi-structured, and unstructured data.

•Created big data workflows to ingest data from various sources into Hadoop using Oozie; these workflows comprised heterogeneous jobs such as Hive, Sqoop, and Python scripts.

•Proven expertise in data quality management, ensuring accuracy, consistency, and reliability of data assets.

•Implemented data quality frameworks and standards to maintain high data integrity.

•Conducted data profiling, cleansing, and validation processes to enhance data quality.

•Collaborated with cross-functional teams to establish data quality best practices and governance.
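
A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS log pipeline referenced above; the broker address, topic, and HDFS paths are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("weblog-stream-sketch").getOrCreate()

    # Read the raw log stream from a hypothetical Kafka topic.
    logs = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
        .option("subscribe", "web-server-logs")              # hypothetical topic
        .option("startingOffsets", "latest")
        .load()
    )

    # Kafka delivers binary key/value columns; cast the payload to text before writing.
    parsed = logs.select(F.col("value").cast("string").alias("log_line"),
                         F.col("timestamp"))

    # Land the parsed stream in HDFS as Parquet with checkpointing for fault tolerance.
    query = (
        parsed.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/weblogs/")              # hypothetical landing path
        .option("checkpointLocation", "hdfs:///checkpoints/weblogs/")
        .trigger(processingTime="1 minute")
        .start()
    )
    query.awaitTermination()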

Oct 2019 – Dec 2021

Cloud/Data Engineer, Berkshire Hathaway, Omaha, Nebraska

(Berkshire Hathaway Inc. is an American multinational conglomerate holding company headquartered in Omaha, Nebraska, United States. Its main business and source of capital is insurance, from which it invests the float (the retained premiums) in a broad portfolio of subsidiaries, equity positions, and other securities.)

•Established and designed Azure resources such as Azure Virtual Machines, Web Apps, Function Apps, SQL Databases, Azure Kubernetes Service, Azure Container Instances, and Azure Container Registry using ARM templates, Bicep, and Terraform through Azure DevOps CI/CD pipelines.

•Incorporated several Azure technologies, such as Azure Cosmos DB, Azure App Insights, Azure Blob Storage, Azure API Management, and Azure Functions, to create a modern web application with microservices.

•Prepared capacity and architecture plan to create the Azure Cloud environment to host migrated IaaS VMs and PaaS role instances for refactored applications and databases.

•Successfully migrated on-premises data to AWS, enhancing data accessibility and scalability, and reduced data processing time by implementing real-time data analysis with Amazon Kinesis Data Analytics.

•Achieved a unified metadata repository for streamlined data management using AWS Glue, and utilized AWS S3 for data collection and storage, facilitating easy access to large datasets.

•Created modules for Apache Airflow to call different services in the cloud, including EMR, S3, Athena, Glue crawlers, Lambda functions, and Glue jobs.

•Implemented fully managed Kafka streaming on AWS to send data streams from the REST API to the Spark cluster in AWS Glue.

•Utilized AWS Glue for data transformation and ETL processes to prepare data for analysis and visualization (an illustrative Glue job sketch appears at the end of this role's bullets).

•Proposed a serverless architecture to process data in AWS on an event-based architecture.

•Implemented data ingestion using Apache Kafka and AWS Kinesis to stream data from various sources into Amazon DynamoDB.

•Proficient in Snowflake, a cloud-based data warehousing platform.

•Expertise in designing scalable data solutions and optimizing query performance.

•Skilled in enabling data-driven insights through SQL-based analysis.

•Experience with Snowflake's unique architecture for data integration and security.

•Familiarity with cost-effective cloud computing on the Snowflake platform.

•Used DBT to test the data and ensure data quality.

•Used DBT to debug complex chains of queries by splitting them into multiple models and macros that can be tested separately.

•Performed architecture and data engineering for AWS cloud services, including planning, design, and DevOps support such as IAM user, group, role, and policy management.

•Improved data quality and reliability through automated testing in CI/CD pipelines while enabling interactive reporting with visualization tools, enhancing data-driven decision-making.

•Implemented automated ETL pipeline deployment using Jenkins and version-controlled code with Git.

•Designed and optimized Spark SQL queries for data transformation. Extracted real-time financial data of Bitcoin and alt-coin prices every minute using a REST API.

•Consumed data from Kafka topics using Spark Streaming and transformed the consumed data.

•Configured a full Kafka cluster with the multi-broker system for high availability.

•Involved in complete Big Data flow of applications starting from data ingestion to the data warehouse.

•Used AWS Glue to automate data processing and migration from on-premises systems to the cloud.

•Set up & implemented Kafka brokers to write data to topics and utilize its fault tolerance mechanism.

•Used Spark where possible to achieve faster results.
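
A minimal sketch of an AWS Glue PySpark ETL job of the kind referenced above; the catalog database, table, column mappings, and S3 path are hypothetical, and the script assumes it runs inside a Glue PySpark job.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Standard Glue job bootstrap: job name comes from the job arguments.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read from the Glue Data Catalog, cast a field, and write Parquet to S3.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="claims_db", table_name="raw_claims")            # hypothetical names
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("claim_id", "string", "claim_id", "string"),
                  ("amount", "string", "amount", "double")])
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-curated/claims/"},  # hypothetical bucket
        format="parquet")
    job.commit()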

Dec 2017 – Sep 2019

Hadoop Developer/Administrator, Lucid Motors, Newark, CA

(Lucid Group, Inc. is an American manufacturer of electric luxury sports cars and grand tourers headquartered in Newark, California.)

•Successfully migrated on-premises data to AWS, enhancing data accessibility and scalability, and reduced data processing time by implementing real-time data analysis with Amazon Kinesis Data Analytics.

•Evaluated on-premises data infrastructure to identify migration opportunities to AWS; orchestrated data pipelines with AWS Step Functions and utilized Amazon Kinesis for event messaging.

•Optimized data flow for analytics and configured AWS Lambda functions to automate PySpark ETL jobs, ensuring scheduled and error-handled data processing.

•Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, HBase, Zookeeper, and Sqoop.

•Implemented Neo4j to integrate graph databases with relational databases, efficiently storing, handling, and querying data in the data model (an illustrative driver sketch appears at the end of this role's bullets).

•Installed and upgraded Cloudera CDH and Hortonworks HDP versions in lower and POC environments.

•Installed and Configured Sqoop to import and export the data into HDFS and Hive from RDBMS

•Closely monitored and analyzed MapReduce job executions on the cluster at the task level and optimized Hadoop cluster components to achieve high performance.

•Deployed High Availability and automatic failover infrastructure for Name nodes using zookeeper services, ensuring continuous availability and reliability of the Hadoop cluster.

•Managed installation and version upgrades of Hadoop and related software components, ensuring compatibility and performance optimization for big data processing tasks.

•Integrated CDH clusters with Active Directory and enabled Kerberos for Authentication.

•Worked on commissioning & decommissioning of Data Nodes, Name Node recovery, capacity planning, and installed Oozie workflow engine to run multiple Hive Jobs.

•Implemented High Availability and automatic failover infrastructure to overcome single points of failure for Name nodes utilizing zookeeper services.

•Created Hive tables, loaded data, and wrote Hive UDFs; worked with the Linux server admin team to administer server hardware and operating systems.

•Automated data workflows using shell scripts to extract, transform, and load data from various databases into Hadoop, streamlining data ingestion processes.

•Worked closely with data analysts to construct creative solutions for their analysis tasks and managed and reviewed Hadoop and Hive log files.

•Collaborated with application teams to install operating system and Hadoop updates and version upgrades when required.

•Automated workflows using shell scripts to pull data from various databases into Hadoop.
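
A minimal sketch of integrating Neo4j from Python, as referenced above (neo4j Python driver 5.x API); the connection URI, credentials, node labels, and properties are hypothetical illustration data.

    from neo4j import GraphDatabase

    # Hypothetical connection details and graph model (Vehicle nodes keyed by VIN).
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def upsert_vehicle(tx, vin, model):
        # MERGE keeps the load idempotent if the same record arrives twice.
        tx.run("MERGE (v:Vehicle {vin: $vin}) SET v.model = $model", vin=vin, model=model)

    def vehicles_for_model(tx, model):
        result = tx.run("MATCH (v:Vehicle {model: $model}) RETURN v.vin AS vin", model=model)
        return [record["vin"] for record in result]

    with driver.session() as session:
        session.execute_write(upsert_vehicle, "VIN0001", "ExampleModel")   # hypothetical row
        print(session.execute_read(vehicles_for_model, "ExampleModel"))

    driver.close()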

Dec 2015 – Nov 2017

Cloud Data Engineer, Molina Healthcare, Long Beach, CA

(Molina Healthcare is a FORTUNE 500 company focused exclusively on government-sponsored health care programs for families and individuals who qualify for government-sponsored health care.)

•Loaded terabytes of raw data at different levels into Spark RDDs for computation to generate output responses, and imported data from Azure Blob Storage into Spark RDDs.

•Used HiveContext, which provides a superset of the functionality of SQLContext, and preferred writing queries with the HiveQL parser to read data from Azure HDInsight Hive tables.

•Modeled Hive partitions extensively for data separation and faster data processing, and followed Hive best practices for tuning in Azure HDInsight (an illustrative partition-pruning sketch appears at the end of this role's bullets).

•Developed highly complex Python and Scala code, which is maintainable, easy to use, and satisfies application requirements, data processing, and analytics using inbuilt libraries in Azure Databricks.

•Successfully loaded files to Azure HDInsight Hive and Azure Blob Storage from Oracle and SQL Server using Azure Data Factory. Environment: Azure HDInsight, Azure Blob Storage, Linux, Shell Scripting, Airflow.

•Migrated legacy MapReduce jobs to PySpark jobs using Azure HDInsight.

•Wrote UNIX scripts for ETL process automation and scheduling, job invocation, error handling and reporting, file operations, and file transfers using Azure Blob Storage.

•Worked with UNIX Shell scripts for Job control and file management in Azure Linux Virtual Machines.

•Experienced in working in offshore and onshore models for development and support projects in Azure.

•Mentored junior data engineers and provided technical guidance on best practices for ETL data pipelines.

•Developed and maintained ETL pipelines using Apache Spark and Python on Google Cloud Platform (GCP) for large-scale data processing and analysis.

•Assess the existing on-premises data infrastructure, including data sources, databases, and ETL processes.

•Set up a CI/CD pipeline using Jenkins to automate the deployment of ETL code and infrastructure changes.

•Version-controlled the ETL code and configurations using Git.

•Created automated Python scripts to convert data from different sources and to generate ETL pipelines.

•Optimized ETL and batch-processing jobs for performance, scalability, and reliability using Spark, YARN, and GCP Dataproc.
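
A minimal sketch of partition-aware Hive querying from Spark, as referenced above; the database, table, and partition column are hypothetical.

    from pyspark.sql import SparkSession

    # Hive-enabled Spark session so the Hive metastore and partitions are visible.
    spark = (
        SparkSession.builder
        .appName("hive-partition-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Partition pruning: filtering on the partition column keeps the scan to one day of data.
    claims = spark.sql("""
        SELECT member_id, claim_amount
        FROM healthcare_db.claims          -- hypothetical table, partitioned by load_date
        WHERE load_date = '2017-06-01'
    """)

    claims.groupBy("member_id").sum("claim_amount").show(10)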

Jan 2013 – Nov 2015

Data Analyst, Lutheran Services in America, Washington, DC

(Lutheran Services in America is the national office of a network of 300 Lutheran health and human services organizations across the United States.)

•Analyzed and decomposed business products' functional requirements to meet organizational project/product goals.

•Built out the data and reporting infrastructure from the ground up using Tableau and SQL to provide real-time insights.

•Used advanced SQL queries for analysis and research into specific requirements and solutions.

•Conducted data analysis and modeling to identify patterns and trends in large datasets.

•Developed and implemented data-driven strategies to improve business operations and decision-making.

•Collaborated with cross-functional teams to identify and prioritize data-related projects and initiatives.

•Created and maintained data visualizations and dashboards to communicate insights to stakeholders.

•Conducted A/B testing and experimentation to optimize business outcomes (an illustrative significance-test sketch appears at the end of this role's bullets).

•Performed data quality checks and troubleshot data-related issues as they arose.

•Stayed up to date with industry trends and emerging technologies in the field of big data analytics.

•Mentored and trained junior analysts on best practices and techniques for data analysis and modeling.

•Participated in the development and implementation of data governance policies and procedures to ensure data security and compliance.

•Built operational reporting in Tableau to find areas of improvement for contractors, resulting in ROI through annual incremental revenue.

•Applied models and data to understand and predict repair costs for vehicles on the market and presented findings to stakeholders.
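
A minimal sketch of the kind of A/B significance check referenced above; the conversion counts are hypothetical illustration data, not real program results.

    import numpy as np
    from scipy import stats

    # Hypothetical A/B numbers: sign-ups vs. non-sign-ups for two outreach page variants.
    variant_a = [130, 870]    # converted, not converted
    variant_b = [162, 838]
    table = np.array([variant_a, variant_b])

    # Chi-squared test of independence on the 2x2 contingency table.
    chi2, p_value, dof, expected = stats.chi2_contingency(table)
    print(f"chi2={chi2:.2f}, p={p_value:.4f}")
    if p_value < 0.05:
        print("Difference in conversion is statistically significant at the 5% level.")
    else:
        print("No significant difference detected; keep collecting data.")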

Education

Master of Science in Information Systems - Trine University, Angola, IN


