Sr Big Data Developer

Location:
Redwood City, CA, 94061
Posted:
April 16, 2024

Hocine Makhlouf

ad41tj@r.postjobfree.com

650-***-****

SR BIG DATA ENGINEER - DATA ANALYTICS

Accomplished Data Engineer with 13+ years of experience in Data Engineering, Data Analytics, Data Modeling, and Data Visualization

Profile Summary:

•Results-oriented professional with 13+ years of overall experience, including 10+ years in Big Data Engineering and 3+ years in Data Analytics.

•Skilled in utilizing AWS services such as Amazon S3, EC2, RDS, Lambda, and DynamoDB for building scalable and resilient cloud-based applications.

•Proficient in leveraging Google Cloud Platform (GCP) services such as Cloud Functions, Cloud Storage, Dataprep, Dataflow, Bigtable, BigQuery, and Cloud SQL for building scalable and efficient data pipelines.

•Proficient in leveraging Azure services such as Azure SQL Database and Azure Data Factory for data analytics and ETL processes.

•Experienced in deploying and managing containerized applications on AWS using services like Amazon ECS and EKS for container orchestration.

•Experienced in configuring AWS Kinesis Data Streams to collect and process large volumes of data records from various sources, including IoT devices, application logs, clickstreams, and social media feeds, in near real-time for analytics, monitoring, and decision-making purposes.

•Experienced in designing and implementing serverless architectures on AWS using services like AWS Lambda, API Gateway, and DynamoDB

•Proficient in setting up and managing AWS VPCs, subnets, security groups, and IAM policies for secure and isolated cloud environments.

•Proficient in designing, building, and maintaining end-to-end data pipelines on Azure using Azure Data Factory, ensuring efficient and reliable data ingestion, transformation, and loading (ETL) processes.

•Experienced in integrating various data sources and destinations, including Azure Blob Storage, Azure SQL Database, Azure Data Lake Storage, Azure Cosmos DB, and third-party systems, within Azure Data Factory pipelines.

•Skilled in implementing data transformation activities using Azure Data Factory Data Flows, including data cleansing, aggregation, enrichment, and normalization to prepare data for analytics and reporting purposes.

•Skilled in deploying and managing Azure Databricks clusters for big data analytics and machine learning workloads, including data exploration, feature engineering, model training, and inference within integrated data pipelines.

•Proficient in utilizing Azure HDInsight for running Apache Hadoop, Spark, and HBase clusters on Azure infrastructure, enabling distributed processing of large-scale data sets and advanced analytics capabilities.

•Skilled in integrating Azure Event Hubs with other Azure services such as Azure Functions, Azure Stream Analytics, Azure Databricks, and Azure Data Lake Storage to build end-to-end data pipelines for real-time analytics, monitoring, and decision-making.

•Experienced in designing and architecting data solutions within the GCP ecosystem, ensuring high availability, reliability, and performance.

•Skilled in implementing data governance policies and security standards on GCP, ensuring compliance with industry regulations and best practices

•Proficient in utilizing Google Cloud AI and Machine Learning services for predictive analytics and data-driven insights

•Familiar with GCP's managed services for machine learning, including AI Platform, AutoML, and TensorFlow, for developing and deploying machine learning models at scale.

•Experienced in deploying and managing Kubernetes clusters on GCP for containerized applications and microservices architectures

•Proficient in utilizing Google Cloud Pub/Sub for real-time messaging and event-driven architectures.

•Skilled in optimizing cost and resource utilization on GCP, leveraging billing reports and budgets to monitor and manage cloud spending effectively

•Experienced in managing on-premises data analytics and processing environments, including Hadoop, Spark, and relational databases.

•Skilled in installing, configuring, and maintaining on-premises infrastructure for big data processing and analytics.

•Adept at leveraging a diverse skill set to manipulate, analyze, and visualize large datasets using SQL, Python, R, and Tableau.

•Strong background in implementing Continuous Integration/Continuous Deployment (CI/CD) pipelines using tools like Jenkins, GitLab CI/CD, and AWS CodePipeline

•Proficient in setting up automated testing frameworks and conducting unit tests, integration tests, and end-to-end tests to ensure software quality and reliability

•Experienced in gathering and analyzing business requirements, translating them into technical specifications, and managing project timelines and deliverables

•In-depth knowledge of SQL and NoSQL databases, including incremental imports, partitioning, and bucketing concepts in Hive and Spark SQL

•Skilled in extending Hive core functionality using custom User Defined Functions (UDF), User Defined Table-Generating Functions (UDTF), and User Defined Aggregating Functions (UDAF)

•Strong background in Agile methodologies, including Scaled Agile Framework (SAFe) practices for project management

•Excellent analytical, logical, communication, and problem-solving skills with a proactive approach to self-learning and adapting to new technologies

Technical Skills:

Big Data

RDDs, UDFs, Data Frames, Datasets, Pipelines, Data Lakes, Data Warehouse, Data Analysis

Hadoop

Hadoop, Cloudera (CDH), Hortonworks Data Platform (HDP), HDFS, Hive, Zookeeper, Sqoop, Oozie, Yarn, MapReduce

Spark

Apache Spark, Spark Streaming, Spark API, Spark SQL

Apache

Apache Kafka, Apache MAVEN, Apache Oozie, Apache Pig, Apache Sqoop, Apache Flume, Apache Hadoop, Apache HBase, Apache Cassandra, Apache Lucene, Apache SOLR, Apache Airflow, Apache Camel, Apache Mesos, Apache Tez, Apache Zookeeper

Programming

PySpark, Python, Scala, Java, SQL

Development

Agile, Scrum, Continuous Integration, Test-Driven Development (TDD), Unit Testing, Functional Testing, Git, GitHub, Jenkins (CI/CD)

Query Language

SQL, Spark SQL, Hive QL

Databases /Data warehouses

MongoDB, AWS Redshift, Amazon RDS, Apache HBase, Elasticsearch, Snowflake

File Management

HDFS, Parquet, Avro, Snappy, Gzip, ORC, JSON, XML

Cloud Platforms

AWS (Amazon Web Services), Microsoft Azure, GCP (Google Cloud Platform)

Security and Authentication

AWS IAM, Kerberos

AWS Components

AWS Lambda, AWS S3, AWS RDS, AWS EMR, AWS Redshift, AWS Kinesis, AWS ELK, AWS CloudFormation, AWS IAM

Azure Components

Azure Functions, Azure Blob Storage, Azure SQL Database, Azure HDInsight, Azure Synapse Analytics, Azure Event Hubs, Azure Monitor, Azure Resource Manager, Azure Databricks

GCP Components

Cloud Functions, Cloud Storage, Cloud SQL, Dataproc, BigQuery, Cloud Pub/Sub, Stackdriver, Deployment Manager, Identity and Access Management (IAM)

Virtualization

VMware, VirtualBox, OSI, Docker

Data Visualization

Tableau, Kibana, Crystal Reports 2016, IBM Watson

Cluster Security

Ranger, Kerberos

Query Processing

Spark SQL, HiveQL, Data Frames

Professional Experience:

SR. BIG DATA ENGINEER

JANUARY 2023-PRESENT AMAZON, AUSTIN, TX

Project 1: Orchestrate Redshift ETL using AWS Glue and Step Functions

•Spearheaded the design and implementation of ETL processes on AWS Redshift, utilizing AWS Glue for seamless data extraction, transformation, and loading operations.

•Orchestrated intricate ETL workflows using AWS Step Functions, automating and refining data processing pipelines to enhance efficiency and reliability.

•Collaborated closely with cross-functional teams including data engineers and data scientists to uphold data accuracy, integrity, and consistency throughout the project lifecycle

•Implemented comprehensive data quality checks and monitoring mechanisms within AWS infrastructure, enabling real-time detection and resolution of anomalies

•Optimized ETL performance and scalability on AWS, adeptly managing and accommodating large volumes of data with efficiency

•Leveraged AWS services such as AWS Lambda for serverless computing to enhance ETL processes and streamline data workflows

•Employed Amazon S3 for cost-effective storage of raw and processed data, ensuring accessibility and durability

•Utilized AWS Glue Data Catalog to maintain metadata repositories and facilitate efficient data discovery and cataloging

•Integrated AWS CloudWatch for real-time monitoring and logging of ETL workflows, enabling proactive identification and mitigation of performance issues

•Implemented AWS IAM policies and security best practices to ensure data confidentiality, integrity, and availability within AWS environments.

•Leveraged AWS Redshift Spectrum for seamless querying and analysis of data stored in Amazon S3, enabling efficient data access and analysis.

•Utilized AWS Data Pipeline to automate the movement and transformation of data between different AWS services, streamlining data workflows and reducing manual intervention

Project 2: Build an ETL Data Pipeline on AWS EMR Cluster

•Created and managed ETL data pipelines on AWS EMR (Elastic MapReduce) cluster using Apache Spark and Hadoop ecosystem tools

•Designed data ingestion processes to extract data from various sources and load it into the EMR cluster for processing

•Implemented data transformations and aggregations using Spark RDDs, DataFrames, and Spark SQL

•Optimized EMR cluster performance and resource utilization to meet SLAs and handle dynamic workloads

•Implemented fault-tolerant and scalable ETL workflows to handle large-scale data processing tasks efficiently

SR. DATA ENGINEER

JULY 2021-JANUARY 2023 VERIZON, NEW YORK, NY

•Utilized the Apache Spark distributed framework along with Scala, Spark SQL, and CI/CD tools to design, implement, and maintain a data pipeline application.

•Built, tested, and maintained a Spark Scala application and managed code deployments across environments (dev, qa, ple, and prod) using CI/CD with Jenkins and Airflow, along with IntelliJ, GitLab, SourceTree, and DBeaver.

•Created a test case framework for data validation and wrote complex SQL queries for database testing to identify and analyze data discrepancies and data quality issues, ensuring data consistency and integrity.

•Worked with Spark Scala, Spark SQL, Spark RDDs, Spark DataFrames, HDFS, Hive, and JDBC, handling big data tasks such as ETL and data warehousing.

•Worked on cloud platforms with AWS EMR, AWS S3, Data Pipeline, and RDS (PostgreSQL).

•Collaborated across teams and actively participated in daily stand-ups within a Scaled Agile Framework (SAFe).

•Collaborated with cross-functional teams on product grooming, user stories, and leveraging services like Amazon EMR, S3, Data Pipeline, and Redshift for efficient data management and analysis.

•Developed and implemented User-Defined Functions (UDFs) in Scala to automate business logic within applications deployed on AWS, streamlining processes and improving overall efficiency.

•Managed, built, and deployed continuous integration systems within a cloud computing environment using AWS services such as CodeBuild and CodeDeploy, ensuring seamless development processes and rapid software delivery.

•Implemented Amazon Kinesis as a highly scalable messaging service for building event-driven architectures and real-time data processing pipelines on AWS, integrating Kinesis with services such as S3, Redshift, and Lambda for efficient data ingestion, processing, and analysis.

•Leveraged Amazon Kinesis and CloudTrail logs for analysis, error identification, and troubleshooting performance issues within data processing workflows on AWS

•Performed data analysis using libraries such as PySpark and designed extensive automated test suites using frameworks such as Selenium within the AWS ecosystem.

•Developed Spark Scala code to extract, transform, and aggregate information from various data formats on AWS, and designed and orchestrated AWS Glue data pipelines to ingest, process, and store data, integrating seamlessly with various AWS services for comprehensive data management.

SR. BIG DATA ENGINEER

MARCH 2018-JULY 2021 CITIGROUP, NEW YORK CITY, NEW YORK

•Implemented Azure Functions, Azure Blob Storage, and Azure SQL Database on Azure to establish a robust big data infrastructure at Citigroup, leveraging cloud-based solutions.

•Developed multiple data processing functions using Azure Functions in Java to perform data cleaning and preprocessing, ensuring data quality and consistency within the Azure environment.

•Utilized Azure HDInsight for managing and analyzing large-scale datasets, demonstrating expertise in installing, configuring, and utilizing various components of the Hadoop ecosystem on Azure to support diverse data processing requirements.

•Administered Azure Synapse Analytics to facilitate efficient data storage, retrieval, and analysis, leveraging Azure services.

•Leveraged Azure Event Hubs for real-time data ingestion and processing, ensuring seamless integration and consistency across the Azure data ecosystem.

•Utilized Azure Monitor to monitor system health and diagnose potential issues proactively, leveraging Azure monitoring and logging services.

•Stored and transformed large datasets in Azure Blob Storage, ensuring data integrity and accessibility within Azure data storage solutions.

•Managed data from various sources on Azure Blob Storage, ensuring seamless integration and consistency across the Azure data ecosystem.

•Leveraged Azure Databricks for scalable data analytics and machine learning, enabling advanced data processing and predictive analytics capabilities on Azure.

•Supported data processing tasks running on Azure HDInsight, providing technical assistance and troubleshooting within the Azure environment.

•Loaded data into Azure Blob Storage, ensuring efficient data ingestion and storage within Azure data storage solutions.

•Managed and analyzed data using Azure SQL Database, ensuring high availability and reliability of data storage solutions on Azure.

•Exported analyzed data to relational databases using Azure SQL Database for visualization and report generation, enabling data-driven decision-making within the Azure ecosystem.

•Developed scripts for data transformations using Scala on Azure Databricks, automating data processing tasks for enhanced efficiency within Azure services.

•Configured Azure Event Hubs to receive real-time data and process it using Azure Functions, enabling real-time data analytics and processing within the Azure environment.

•Utilized Azure Synapse Analytics for building and running enterprise-scale analytics and data warehousing solutions on Azure.

•Designed and implemented data pipelines in Azure Synapse Analytics to ingest, transform, and analyze large volumes of structured and unstructured data from diverse sources.

•Assisted in setting up Azure Resource Manager templates and updating configurations for implementing data processing tasks, ensuring data accuracy and consistency across environments within the Azure cloud.

•Implemented Azure Data Factory pipelines to orchestrate ETL workflows and data processing tasks, ensuring seamless data integration and management on the Azure cloud platform.

DATA ENGINEER

DECEMBER 2016-MARCH 2018 UNION PACIFIC CORPORATION, OMAHA, NE

•Utilized GCP Cloud Storage to store raw data, while also implementing on-premises storage solutions such as Hadoop Distributed File System (HDFS) for data storage

•Employed GCP Dataprep for pre-processing and cleaning tasks, alongside on-premises data pre-processing tools tailored for specific data requirements

•Leveraged GCP Dataflow to transform and process data in the cloud, while also utilizing on-premises data processing frameworks like Apache Spark for similar tasks

•Loaded data into GCP Bigtable for scalable access, in conjunction with on-premises NoSQL databases such as MongoDB for storing and querying processed data

•Utilized GCP BigQuery as a highly scalable data warehouse solution, while also leveraging on-premises data warehousing solutions for specific business needs

•Implemented custom business logic and triggered data processing events using GCP Cloud Functions, alongside on-premises event-driven architectures for similar functionalities

•Conducted data quality checks to ensure completeness and accuracy across GCP Cloud Storage, Bigtable, and BigQuery, as well as on-premises storage and processing systems

•Configured GCP security and access controls to protect sensitive data, ensuring compliance with data privacy regulations, while also implementing similar measures within the on-premises environment

•Used shell scripts to transfer data from MySQL to Hadoop Distributed File System (HDFS) within the Hadoop environment

•Developed and managed data visualization dashboards using tools like Tableau within the on-premises environment, delivering actionable insights to business users and executives

•Implemented data governance and security policies within the on-premises environment to maintain data quality, integrity, and confidentiality.

•Provided training on data resource management to staff members, ensuring effective utilization of on-premises data processing capabilities

•Implemented error handling and logging mechanisms within the on-premises data processing framework to ensure reliability and accuracy of the data pipeline

DATA ENGINEER

FEBRUARY 2014-DECEMBER 2016 RITE AID, PHILADELPHIA, PA

•Leveraged on-premises storage solutions such as Hadoop Distributed File System (HDFS) to store raw data and utilized Apache Spark for data processing and transformation.

•Ingested data from various sources into on-premises data processing frameworks like Apache Kafka and utilized custom scripts for preprocessing and data cleansing before storage.

•Orchestrated data movement and transformation across various on-premises services using custom-built ETL pipelines and workflow management tools.

•Utilized on-premises NoSQL databases such as MongoDB for storing and querying processed data, ensuring scalability and performance.

•Implemented data quality checks to ensure completeness and accuracy of data loaded into on-premises storage solutions and NoSQL databases.

•Optimized performance of the data pipeline through tuning Apache Spark and custom ETL configurations, adjusting hardware resources as necessary

•Configured on-premises security measures and access controls to safeguard sensitive data and ensure compliance with data privacy regulations.

•Collaborated with cross-functional teams to determine Big Data requirements and developed on-premises data processing systems accordingly.

•Finalized system scope and delivered Big Data solutions on-premises in alignment with business objectives.

•Collaborated with software R&D teams to integrate on-premises data processing systems with company applications.

•Implemented disaster recovery and backup solutions on-premises to ensure business continuity and data availability.

•Utilized DevOps practices like infrastructure as code, automated testing, and CI/CD within the on-premises environment to enhance agility and reduce time to market.

•Collected data from various sources within the on-premises environment, establishing secure connections and processing data using Apache Spark

ASSOCIATE-DATA ANALYTICS

OCTOBER 2011-FEBRUARY 2014 ENTERGY, NEW ORLEANS, LOUISIANA

•Enhanced and maintained existing Internet/Intranet applications.

•Developed efficient workflows using tools like GIT/SSH to facilitate collaborative development among multiple programmers.

•Demonstrated expertise in SQL and PL/SQL, crafting stored procedures to manipulate databases effectively.

•Integrated applications with database architecture and server scripting, leveraging technologies available at the time.

•Configured and managed server clusters using operating systems such as CentOS and Ubuntu for robust application deployment.

•Implemented optimal business logic solutions, adhering to industry best practices and design patterns prevalent during the period.

•Developed automated continuous integration systems utilizing tools such as Git, Jenkins, MySQL, Python, and Bash.

•Designed user information solutions to streamline backend operations and enhance data management efficiency.

•Employed Agile Methodologies and Scrum processes for project management and team collaboration, aligning with industry standards of the time.

•Utilized Python for data processing automation.

•Leveraged integrated development environments such as Eclipse, NetBeans, and PyCharm to support application development and debugging.

Academic Credentials:

•Bachelor of Science in Applied Statistics

•Associate of Applied Science in Data Technology

•Certificate in Data Management

•Boot Camp Certificate in Data Analytics and Visualization

Certifications:

•AWS Certified Cloud Practitioner

•NDG Linux Unhatched


