Daniel Perez Navarro

Senior Big Data Engineer

Phone: 408-***-**** | Email: adrbvb@r.postjobfree.com

PROFESSIONAL SUMMARY

•Experienced Big Data Engineer with 7+ years of expertise in designing, implementing, and managing big data technologies and solutions.

•Proficient in Linux-based big data technologies such as Apache Hadoop, Apache Spark, and Elasticsearch, with a deep understanding of the Linux operating system and its various tools and utilities.

•1.5+ years of hands-on experience as a Linux Systems Administrator.

•Skilled in designing and developing ETL workflows and data pipelines to extract, transform, and load data from various sources into data stores, with a focus on performance optimization, scalability, and data quality.

•Extensive experience in designing and implementing cloud-based big data solutions using AWS, Azure, and Cloudera, with a deep understanding of cloud-based data storage, processing, and analytics technologies.

•Experienced in designing and implementing real-time data processing and analytics solutions using tools such as Apache Kafka, Apache NiFi, AWS Kinesis, and Azure Stream Analytics.

•Experience with multiple terabytes of data stored in AWS S3 using Elastic Map Reduce (EMR) and Redshift for processing.

•Designed and implemented big data solutions using Azure cloud services such as Azure Blob Storage, Azure Data Factory, Azure Databricks, and Azure HDInsight.

•Implemented and optimized data pipelines using Azure Data Factory to extract, transform, and load data from various sources into Azure data storage solutions.

•Created Hive managed and external tables with partitioning and bucketing, and loaded data into Hive.

•Developed data queries using HiveQL and optimized the Hive queries.

•Created structured data from pools of unstructured data using Spark.

•Utilized Spark to optimize ETL jobs and reduce memory and storage consumption.

•Used Spark SQL and the DataFrame API extensively to build Spark applications.

•Experienced with CQL (Cassandra Query Language) for retrieving data from Cassandra clusters.

•Clearly documented Big Data systems, procedures, governance, and policies.

•Well versed in GCP services such as BigQuery, Dataflow, Dataprep, and Cloud Storage.

•Participated in the design, development, and system migration of high-performance, metadata-driven data pipelines with Kafka and Hive.

•Good knowledge of cluster coordination services (ZooKeeper) used by Kafka.

•Extensive experience streaming data with Kafka.

•Experienced in Cloudera and Hortonworks Hadoop distributions.

•Extended Hive core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregate Functions (UDAFs); a brief PySpark sketch follows this summary.
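
A minimal PySpark sketch of the Hive and Spark work summarized above; the events table, its columns, and the normalize_id function are hypothetical, and the Python UDF registered with Spark SQL stands in for the Hive Java UDF/UDTF/UDAF work.

# Minimal PySpark sketch: partitioned/bucketed Hive table, Spark SQL, DataFrame API.
# Table, columns, and the UDF are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("hive-spark-sketch")
         .enableHiveSupport()   # assumes a Hive metastore is reachable
         .getOrCreate())

# Managed table with a date partition and bucketing by user_id.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        user_id STRING,
        payload STRING
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (user_id) INTO 8 BUCKETS
    STORED AS ORC
""")

# A Python UDF registered for Spark SQL (a stand-in for the Hive UDFs above).
spark.udf.register("normalize_id",
                   lambda s: s.strip().lower() if s else None,
                   StringType())

# The same aggregation expressed with the DataFrame API and with Spark SQL.
daily = (spark.table("events")
         .where(F.col("event_date") == "2023-05-01")
         .groupBy("user_id")
         .count())
daily.show()

spark.sql("""
    SELECT normalize_id(user_id) AS uid, COUNT(*) AS n
    FROM events
    WHERE event_date = '2023-05-01'
    GROUP BY normalize_id(user_id)
""").show()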

TECHNICAL SKILLS

•Programming/Scripting – Scala, Python, SQL, HiveQL, Shell Scripting, Java, MySQL

•Data Visualization – Kibana, Tableau, Power BI

•ETL Pipelines – Elasticsearch, Elastic MapReduce, ELK Stack (Elasticsearch, Logstash, Kibana), NiFi, AWS Glue, ADF

•Hadoop and Big Data – Apache Flume, Kerberos, Yarn, Cluster Management, Cluster Security, Zookeeper, Oozie, Airflow, Snowflake

•Databases/Datastores – Hadoop HDFS, NoSQL, Cassandra, HBase, MongoDB, MySQL, MSSQL, Oracle, DynamoDB

•Hadoop Distributions and Cloud – Hadoop, Cloudera Hadoop (CDH), Hortonworks Hadoop (HDP), Databricks

•Cloud Platforms (AWS, Azure, GCP) – AWS RDS, AWS EMR, AWS Redshift, AWS S3, AWS Lambda, AWS Kinesis, Bigtable, AWS Glue, AWS IAM, CloudFormation

WORK EXPERIENCE

Sr. Big Data Engineer

Bytedance Ltd. (Santa Clara, CA) from 8/2022 to Present

TikTok operates a monetization platform in the US region that requires maintaining a data warehouse. On this project, I have been troubleshooting Hive, Unix, and Spark tasks on the Bytedance-owned data warehouse using internal cloud tools; communicating with an international team of developers and experts dedicated to specific tools (HDFS, Flink, Kafka, and Spark teams, among others) to solve problems in both ongoing tasks and tasks pending deployment; and automating the addition of users to Lark groups as required by management. In addition, the role includes a 24/7 on-call rotation for production support roughly once every couple of weeks, scheduled around team availability.

•Fine-tuning the memory usage of Hive jobs deployed on Spark (a configuration sketch follows this list).

•Troubleshooting data schema mismatch from recent deployments pushed by developers.

•Supporting teams overseas (China) and locally (US) with the status of running tasks.

•Verifying the presence and integrity of tables required by cron jobs scheduled for hourly, daily, and weekly tasks.

•Adjusting the retention (lifetime) of tables created by developers to optimize storage capacity.

•Implementing real-time data streaming solutions using Apache Kafka, Flink, and Kinesis.

•Designing and developing data visualization dashboards and reports to communicate insights to stakeholders.

•Developing and implementing data governance policies to ensure data accuracy and security.

•Overseeing and implementing the release of developer tasks that included back-filling tables, schema updates, and new logic added to Hive queries.

•Communicating with technical teams to mediate their collaboration and solve issues on released code.

•Managing and scaling big data systems and infrastructure, including Hadoop and Spark clusters.

•Working with business users to understand issues, develop root cause analysis, and work with the team for the development of enhancements/fixes.

•Identifying, gathering, analyzing, and automating responses to key performance metrics, logs, and alerts.

•Establishing best practices within Agile teams for model development workflows, DevOps/MLOps methodology, and productionizing ML models.

•Conducting model reviews, code refactoring and optimization, containerization, deployment, versioning, and quality monitoring.

•Creating tests and integrating them automatically into the workflows.

•Developing an efficient, reliable CI/CD pipeline that satisfies compliance requirements.

•Applying classification, regression, and clustering machine learning modeling techniques.

•Working in an Agile development environment.

•Building cloud-native infrastructure using Kubernetes, Knative, and TriggerMesh.

•Working with the product operations team to resolve trouble tickets, develop and run scripts, and troubleshoot services in a hosted environment.

•Working knowledge of virtualized environments, including VM management and provisioning.

•Providing technical insight on development projects and mentoring junior engineers.
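
A configuration sketch of the kind of Spark memory tuning and pre-run table checks described above; the settings, the ads_revenue table, and the dt partition column are illustrative assumptions, not Bytedance configuration.

# Illustrative memory-tuning knobs for a Spark-backed Hive job, plus a
# pre-cron sanity check that an expected partition exists and is non-empty.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuned-hive-on-spark-job")
         .config("spark.executor.memory", "6g")           # executor heap
         .config("spark.executor.memoryOverhead", "1g")   # off-heap headroom
         .config("spark.executor.cores", "4")
         .config("spark.sql.shuffle.partitions", "400")   # sized to shuffle volume
         .config("spark.sql.adaptive.enabled", "true")    # let AQE coalesce partitions
         .enableHiveSupport()
         .getOrCreate())

partition_ok = (spark.table("ads_revenue")     # hypothetical table
                .where("dt = '2023-05-01'")    # hypothetical partition column
                .limit(1)
                .count() == 1)
print("partition present and non-empty:", partition_ok)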

Big Data Engineer

The Intercontinental Exchange (ICE) (Atlanta, GA) from 11/2019 to 7/2022

The Intercontinental Exchange is an American Fortune 500 company that operates global exchanges and clearing houses and provides mortgage technology, data, and listing services. The company owns exchanges for financial and commodity markets and operates 12 regulated exchanges and marketplaces.

•Used Azure Data Factory to ingest data from various sources such as Azure Blob Storage, Azure SQL Database, and other third-party data sources.

•Created POCs using Azure Databricks notebooks to perform complex data transformations and machine learning tasks to assist the DS team.

•Set up Azure Data Lake Analytics to process large amounts of data in parallel and generate insights using U-SQL or Spark, alongside Databricks to analyze data using Apache Spark, PySpark, and deep learning libraries such as TensorFlow.

•Worked on an end-to-end ETL pipeline using Azure Data Lake Store and Blob Storage to store large volumes of structured and unstructured data, building complex SQL queries against Azure SQL Database for transactional data storage and management.

•Used Azure Cosmos DB for NoSQL data storage and management as key-value pairs.

•Configured Azure SQL Database and queried data using SQL (joins, window functions, and CTEs).

•Defined Spark data schema and set up a development environment inside the cluster.

•Processed data with a natural language toolkit to count important words and generated word clouds.

•Monitored the performance of Azure Data Factory pipelines and detected issues.

•Implemented Azure Log Analytics to collect and analyze log data from various Azure services and Azure Automation to schedule and automate maintenance tasks such as backups and updates.

•Started and configured master and slave nodes for Spark.

•Designed a Spark Python job to consume data from S3 buckets using Boto3.

•Set up cloud compute instances in managed and unmanaged modes, including SSH key management.

•Worked in virtual machines to run pipelines on a distributed system.

•Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.

•Created a pipeline to gather data using PySpark, Kafka, and HBase.

•Worked on the Spark Snowflake connector to read and write data from the Snowflake table to Spark.

•Used Spark Streaming to receive real-time data from Kafka (a minimal streaming sketch follows this list).

•Worked with unstructured data and parsed out the information using Python built-in functions.

•Configured a Python producer to ingest data from the Slack API into Kafka for real-time processing with Spark.

•Managed Hive connections, tables, databases, and external tables.

•Installed Hadoop from the terminal and set up its configuration.

•Formatted the response from Spark jobs to data frames using a schema containing News Type, Article Type, Word Count, and News Snippet to parse JSON.

•Interacted with data residing in HDFS using PySpark to process the data.

•Configured Linux across multiple Hadoop environments, setting up Dev, Test, and Prod clusters with the same configuration.

•Monitored HDFS job status and the health of DataNodes according to specs.

•Installed Spark and the PySpark library via the CLI in bootstrap steps.

•Used DynamoDB to store metadata and logs.

•Programmed Python classes to load data from Kafka into DynamoDB according to the target data model.

•Provided connections from Business Intelligence tools such as Tableau and Power BI to the tables in the data warehouse.
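
A minimal Structured Streaming sketch of the Kafka-to-Spark path described above; the broker address, topic, and checkpoint location are placeholders, and it assumes the spark-sql-kafka connector package is on the classpath.

# Consume a Kafka topic with Spark Structured Streaming and print the payloads.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "slack-events")                # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers key/value as binary; cast the value to a string for parsing.
messages = raw.select(F.col("value").cast("string").alias("json_payload"))

query = (messages.writeStream
         .format("console")   # the real pipeline wrote to HBase/DynamoDB instead
         .option("checkpointLocation", "/tmp/checkpoints/slack-events")
         .outputMode("append")
         .start())
query.awaitTermination()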

AWS Big Data Engineer

Hyatt Hotels Corporation (Chicago, IL) from 02/2017 to 11/2019

Hyatt Hotels Corporation, commonly known as Hyatt Hotels & Resorts, is an American multinational hospitality company that manages and franchises luxury and business hotels, resorts, and vacation properties.

•Configured, deployed, and automated instances on AWS and in data centers.

•Applied EC2, CloudWatch, and CloudFormation, and managed security groups on AWS.

•Created automated Python scripts to convert data from different sources and to generate ETL pipelines.

•Created Hive external tables and designed information models in Hive.

•Processed terabytes of data in real time using Spark Streaming.

•Applied Hive optimization techniques such as partitioning, bucketing, map join, and parallel execution.

•Implemented solutions for ingesting data from various sources and processed the data at rest utilizing Big Data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive.

•Programmed scripts to extract data from different databases and schedule Oozie workflows to execute daily tasks.

•Converted HiveQL/SQL queries into Spark transformations using Spark RDDs, Python, and Scala (an illustrative conversion follows this list).

•Produced distributed query agents to perform distributed queries against Hive.

•Loaded data from different sources such as HDFS and HBase into Spark data frames and implemented in-memory data computation to generate the output response.

•Monitored Amazon databases and CPU/memory usage using CloudWatch.

•Used Spark SQL to achieve faster results compared to Hive during data analysis.

•Implemented Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with storage in Amazon Simple Storage Service (S3) and AWS Redshift.

•Deployed the ELK (Elasticsearch, Logstash, Kibana) stack in AWS to gather and investigate the logs created by the website.

•Wrote streaming applications with Spark Streaming/Kafka.

•Developed JDBC/ODBC connectors between Hive and Spark for the transfer of newly populated data frames from MSSQL.

•Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 Buckets.

•Used the DataStax Spark Cassandra Connector to extract and load data to/from Cassandra.
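
An illustrative conversion of a HiveQL aggregation into Spark DataFrame transformations; the bookings table and its columns are hypothetical.

# HiveQL original (for reference):
#   SELECT property_id, COUNT(*) AS stays, AVG(nightly_rate) AS avg_rate
#   FROM bookings
#   WHERE stay_date >= '2019-01-01'
#   GROUP BY property_id;
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hiveql-to-dataframe")
         .enableHiveSupport()
         .getOrCreate())

# Equivalent DataFrame transformations.
result = (spark.table("bookings")                     # hypothetical Hive table
          .where(F.col("stay_date") >= "2019-01-01")
          .groupBy("property_id")
          .agg(F.count("*").alias("stays"),
               F.avg("nightly_rate").alias("avg_rate")))
result.show()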

Hadoop Data Engineer (GCP)

Genesis HealthCare (Kennett Square, PA) from 10/2015 to 02/2017

Genesis HealthCare is a provider of short-term post-acute, rehabilitation, skilled nursing, and long-term care services.

•Implemented Cloud Storage to ingest data from various sources, such as on-premises systems and other cloud providers.

•Used Dataproc to process large volumes of data using Apache Hadoop, Spark, and additional data processing frameworks.

•Built ETL jobs using Dataprep to clean and prepare data for analysis.

•Used BigQuery to store and analyze large volumes of structured and semi-structured data using SQL, and Dataproc to process large volumes of data in parallel and generate insights using Apache Spark and Hive.

•Set up Stackdriver to monitor the performance of GCP resources and identify issues.

•Developed all jobs with Apache Airflow/Cloud Composer to schedule and automate tasks such as backups and updates (a minimal DAG sketch follows this list).

•Installed and configured Hadoop HDFS and developed multiple jobs in Java for data cleaning and preprocessing.

•Developed Map/Reduce jobs using Java for data transformations.

•Developed different components of the systems' Hadoop processes involving MapReduce and Hive.

•Developed a data pipeline using Sqoop, MapReduce, and Hive to extract data from weblogs and store the results for downstream consumption.

•Developed Hive queries and UDFs to analyze and transform the data in HDFS.

•Designed and implemented static and dynamic partitioning and bucketing in Hive.

•Worked in a team to develop an ETL pipeline that involved extraction of Parquet serialized files from S3 and persisted them in HDFS.

•Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.

•Used Zookeeper and Oozie for coordinating the cluster and scheduling workflows.

•Configured Yarn capacity scheduler to support various business SLAs.

•Implemented and maintained security with LDAP and Kerberos as designed for the cluster.

•Coordinated cluster upgrade needs, monitored cluster health, and built proactive tools to look for anomalous behaviors.

•Worked with cluster users to ensure efficient resource usage in the cluster and alleviate multi-tenancy concerns.

•Migrated ETL processes from Oracle to Hive to enable easier data manipulation.

•Wrote HiveQL scripts to perform trend analysis on Big Data log data.

•Utilized Sqoop to extract data back to relational databases for business reporting.

•Created Hive tables, loaded data, and wrote Hive queries.

•Debugged and identified issues reported by QA in Hadoop jobs by reproducing them on a local file system.

•Implemented Flume to import streaming data logs and aggregate the data to HDFS.

•Used Cloudera Manager for installation and management of single-node and multi-node Hadoop Cluster.
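
A minimal Airflow DAG sketch of the scheduling work described above, written against Airflow 2.x import paths; the DAG id, schedule, and shell commands are placeholders.

# Daily DAG: back up a weblog directory in HDFS, then run a Hive trend query.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_backup_and_hive_load",      # placeholder id
    start_date=datetime(2016, 1, 1),
    schedule_interval="0 2 * * *",            # 2 AM daily
    catchup=False,
) as dag:
    backup = BashOperator(
        task_id="hdfs_backup",
        bash_command="hdfs dfs -cp /data/weblogs /backups/weblogs_{{ ds }}",
    )
    trend_query = BashOperator(
        task_id="hive_trend_query",
        bash_command="hive -f /opt/jobs/trend_analysis.hql",   # placeholder script
    )
    backup >> trend_query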

Linux Systems Administrator

CompuCom (Fort Mill, SC) from 05/2014 to 09/2015

CompuCom provides end-to-end managed services, technology, and consulting to enable the digital workplace for enterprises.

•Installed, configured, monitored, and administered Linux servers.

•Installed, deployed, and managed Red Hat Enterprise Linux, CentOS, and Ubuntu, and installed patches and packages for Red Hat Linux servers.

•Configured and installed Red Hat and CentOS Linux servers on virtual machines and bare-metal installations.

•Worked with the DBA team for database performance issues, network-related issues on LINUX/UNIX servers, and with vendors regarding hardware-related issues.

•Monitored CPU, memory, hardware, and software including raid, physical disk, multipath, filesystems, and networks using the Nagios monitoring tool.

•Hosted servers using Vagrant on Oracle virtual machines.

•Automated daily tasks using bash scripts while documenting the changes in the environment and each server, analyzing the error logs, user logs, and /var/log messages.

•Created and modified users and groups with root permissions.

•Administered local and remote servers using SSH daily.

•Created and maintained Python scripts for automating build and deployment processes.

•Utilized Nagios-based open-source monitoring tools to monitor Linux Cluster nodes.

•Created users, managed user permissions, maintained user and file system quotas, and installed and configured DNS.

•Adhered to industry standards by securing systems, directory and file permissions, and groups, and by supporting user account management and user creation.

•Performed kernel and database configuration optimization such as I/O resource usage on disks.

•Analyzed and monitored log files to troubleshoot issues (a sample scan script follows this list).
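
A sample Python scan script of the kind used for the log analysis and automation above; the log path and error patterns are placeholders.

#!/usr/bin/env python3
"""Count occurrences of common error patterns in a system log."""
import re
from collections import Counter
from pathlib import Path

LOG_FILE = Path("/var/log/messages")                      # placeholder path
PATTERNS = [r"\berror\b", r"\bfail(ed|ure)?\b", r"out of memory"]

def scan(path: Path) -> Counter:
    """Return a Counter of pattern -> number of matching log lines."""
    hits = Counter()
    with path.open(errors="replace") as fh:
        for line in fh:
            lowered = line.lower()
            for pattern in PATTERNS:
                if re.search(pattern, lowered):
                    hits[pattern] += 1
    return hits

if __name__ == "__main__":
    for pattern, count in scan(LOG_FILE).most_common():
        print(f"{count:6d}  {pattern}")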

EDUCATION

CENTRO DE INVESTIGACION Y DE ESTUDIOS AVANZADOS DEL IPN – Ph.D. in Physics

CENTRO DE INVESTIGACION Y DE ESTUDIOS AVANZADOS DEL IPN – Master's in Physics

INSTITUTO TECNOLOGICO Y DE ESTUDIOS SUPERIORES DE MONTERREY – B.Sc. in Mechatronics


