
Big Data Engineer

Location:
San Antonio, TX
Posted:
March 28, 2023


Resume:

PROFESSIONAL HIGHLIGHTS

IT professional with **+ years of experience, including more than 7 years building integrated solutions in on-premises Big Data environments and migrating them to the cloud, using Amazon Web Services (AWS) and Azure to host cloud-based data warehouses and databases built on Redshift, Cassandra, Snowflake, and RDBMS sources.

Successfully developed custom pipelines for real-time or near-real-time data analysis using Spark, PySpark, Spark Streaming, and Kafka.

Experienced in data ingestion, extraction, and transformation using ETL processes with Hive, Sqoop, Kafka, Firehose, Flume, Kinesis, and MapReduce. Fluent in the architecture and engineering of the Hadoop ecosystem.

Ability to conceptualize innovative data models for complex products and create design patterns.

Well versed in the implementation and management of data systems using Cloudera Hadoop, Hortonworks Hadoop, AWS cloud, and Azure platforms, or on premises.

Adept in designing and building big data architecture for unique projects, ensuring development and delivery of the highest quality, on-time and on budget.

Proficient in handling multiple terabytes of mobile ad data stored in AWS using Elastic MapReduce, Redshift, and RDS PostgreSQL/MySQL. Hands-on experience with AWS EMR, DynamoDB, Athena, Lambda functions, Glue jobs, Crawlers, and S3.

Experience working with Cloudera distributions (CDH 5) and knowledge of the Hortonworks and Amazon EMR Hadoop distributions.

Good experience working with cloud environments such as Amazon Web Services (AWS) EC2 and S3.

Experience working closely with operational data to provide insights and value to inform company strategy.

Capable of building data tools to optimize utilization of data and configure end-to-end systems.

TECHNICAL SKILLS

Apache

Apache Drill, Apache Kafka, Apache Maven, Apache Oozie, Apache Hue, Apache Sqoop, Apache Flume, Apache Hadoop, Apache HBase, Apache HCatalog, Apache Cassandra, Apache Lucene, Apache Solr, Apache Airflow, Apache Mesos, Apache Tez, Apache ZooKeeper

BI Visualization

Kibana, Tableau, Splunk, Grafana

Programming

C++, Java, PHP, Python, Scala, HTML/XHTML/CSS, SQL, Hive, Spark

File Types

XML, Ajax, JSON, Avro, Parquet, ORC

APIs

Spark API, REST API, SOAP API

Development

Agile, Kanban, Scrum, Continuous Integration, TDD, Unit Testing, Functional Testing, Design Thinking, Lean, Six Sigma

Security & Forensics

SQL Injection, FTK Imager, XSS

Soft Skills

Communication, Collaboration, Customer Service, Help Desk, Mentoring, Reviewing

Big Data

RDDs, UDFs, Dataframes, Datasets, Pipelines, Data Lakes, Data Warehouse, Data Analysis

Hadoop

Hadoop, HDFS, Hadoop YARN, Hortonworks, Cloudera, Impala

Spark & Hive

Apache Spark, Spark Streaming, Spark MLlib, GraphX, Apache Hive, HiveQL

Database

Redshift, DynamoDB, Cassandra, HBase, MongoDB, SQL, NoSQL, MySQL, RDBMS, Teradata, Oracle, Snowflake

File Management

HDFS, Snappy, Gzip, DAS, NAS, SAN

Cloud Services & Distributions

AWS, Azure, Anaconda Cloud, Elasticsearch, Solr, Lucene, Cloudera, Databricks, Hortonworks, Cloud Foundry, Elastic Cloud

Software

Microsoft Office Suite

Operating Systems

Linux/UNIX, Windows

Virtualization & Network

WAN/LAN, TCP/IP, Routing, VMware, VirtualBox, OSI Model

PROFESSIONAL EXPERIENCE

Chubb Corp. Sr. Data Engineer

Remote November 2020 – Present

Environment: Linux RHEL 6/7, Hortonworks 2.6/3.1, DELL S3, AWS S3, Windows, Scala, Kubernetes, Docker.

Technologies: Spark, Hortonworks, Knox, Hive, MongoDB, Ambari, MySQL, Linux Access Management (POSIX), PySpark

Configured Linux on multiple Hadoop environments, setting up Lab, Dev, and Prod clusters with the same configuration

Created configuration files to add an additional layer of security to the Hadoop environment

Designed a Spark Scala POC to consume data from S3 buckets

Defined Spark data schemas and set up the development environment inside the cluster
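
As an illustration of the schema work above, here is a minimal PySpark sketch of an explicit schema and an S3 read; the bucket, field names, and file format are hypothetical placeholders (the actual POC above was written in Scala):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("s3-consumer-poc").getOrCreate()

# Defining the schema explicitly avoids a costly inference pass over the raw files.
claim_schema = StructType([
    StructField("claim_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("created_at", TimestampType(), nullable=True),
])

# Hypothetical bucket and prefix standing in for the DELL S3 / AWS S3 buckets above.
df = spark.read.schema(claim_schema).json("s3a://example-bucket/claims/")
df.show(5)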

Managed spark-submit jobs across all environments

Monitored background operations in Hortonworks Ambari

Monitored HDFS job status and the health of the DataNodes according to the specs

Managed ZooKeeper configurations and znodes to ensure high availability of the Hadoop cluster

Collaborated with security management to sync Kerberos with Knox

Managed Hive Beeline connections to tables, databases, and external tables

Implemented Hortonworks medium- and low-priority recommendations in all environments

Worked using Agile methodology to organize tasks and delegate them among team members

Led development by managing and coaching non-Spark developers on the Hadoop team

Worked with Kubernetes containers, launching Spark applications written in Scala

Worked with Docker containers, launching Spark applications written in Scala

Developed a proof of concept to benchmark Kubernetes against Docker

Pushed containers to AWS EMR using Scala

Used Scala to connect to EC2 and push files to AWS S3

Worked with the offshore team to troubleshoot Airflow jobs, MongoDB, and Hive in all environments

Worked on setting up MySQL Enterprise to link with Puppet Enterprise

Ternium Sr. Data Engineer

Houston, TX July 2019 – October 2020

Moved the on-premises RDBMS database to Amazon Web Services (AWS) where the company could use various software solutions specific to the industry.

Recommended a solution that could accommodate petabytes of data, scale with infrastructure needs, and stay within budget.

Designed a data warehouse and performed the data analysis queries on Amazon Redshift clusters on AWS.

Designed AWS Glue pipelines and Crawlers to ingest, process, and store data, interacting with other AWS services such as Athena and Lambda functions.
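
A minimal sketch of what such a Glue job can look like in PySpark, assuming a Crawler has already registered the source table in the Glue Data Catalog; the database, table, and bucket names below are hypothetical, not the actual project values:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table that the Crawler registered in the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Write curated Parquet back to S3, where Athena can query it directly.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://example-curated-bucket/orders/"},
    format="parquet",
)
job.commit()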

Created Spark jobs that run in EMR clusters using interactive Notebooks for data analysis.

Developed Spark code using Python to run in the EMR clusters.

Created Hadoop clusters using HDFS, with Amazon Redshift and ArangoDB for a multi-model data warehouse solution.

Implemented Spark using Scala and utilized DataFrames and the Spark SQL API for faster processing of data.

Implemented an SAP HANA environment on Amazon Web Services (AWS) to achieve the speed, performance, and agility required without making a significant investment in physical hardware.

Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.

This AWS implementation allowed the company to use Accelerated Trade Promotion Planning, an SAP solution powered by SAP HANA, as a source of intelligent data.

Created PySpark scripts to fetch data from disparate sources such as Snowflake and ingest it using AWS Glue jobs along with Crawlers.
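
A hedged sketch of such a PySpark read through the Snowflake Spark connector (the connector JAR must be on the classpath); the account, credentials, warehouse, table, and target bucket below are placeholders, not the actual project values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-ingest").getOrCreate()

# Placeholder connection options; real credentials would come from a secrets store.
sf_options = {
    "sfURL": "example_account.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "********",
    "sfDatabase": "SALES",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

promotions = (
    spark.read.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "PROMOTIONS")
    .load()
)

# Land the extract in S3 so downstream Glue jobs and Crawlers can pick it up.
promotions.write.mode("overwrite").parquet("s3://example-landing-bucket/promotions/")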

Used Amazon Elastic Compute Cloud (Amazon EC2) instances to process 16 TB of sales data weekly from promotions in the U.S., modeling dozens of data simulations a day.

Implemented Amazon Virtual Private Cloud (Amazon VPC), connected directly to the data centers, so that employees on the company network could access SAP TPM directly.

Used Amazon Simple Storage Service (Amazon S3) for data backups, including HANA, and Amazon Elastic Block Store (Amazon EBS) Provisioned IOPS (PIOPS) volumes for storage.

Logged company events using AWS Identity and Access Management (AWS IAM).

Used Amazon CloudWatch for monitoring and to allocate costs to each department based on its individual infrastructure use.

For high availability, leveraged multiple AWS Availability Zones (AZs) without the additional cost of maintaining a separate data center.

Spotify Data Engineer

San Francisco, CA January 2018 – June 2019

Designed architecture/Hadoop infrastructure layout for development/test and production environments

Provisioned 20+ node Hadoop clusters on Microsoft Azure cloud infrastructure using the Cloudbreak service

Configured a thirty-plus node cluster using Ambari 2.1.2/HDP 2.3.2 on AT&T cloud services

Ingested data from disparate data warehouses to Amazon DynamoDB for storage using Glue jobs with PySpark scripts and Lambda functions.
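
As a sketch of the DynamoDB side of that ingestion, the snippet below shows a Lambda-style handler writing rows with boto3's batch writer; the table name and event shape are assumptions, not the actual project values:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("listening_events")  # hypothetical table name

def write_records(records):
    # batch_writer buffers writes and retries unprocessed items automatically.
    with table.batch_writer() as batch:
        for record in records:
            batch.put_item(Item=record)

def handler(event, context):
    # Assumes the triggering event already carries parsed rows under "records".
    records = event.get("records", [])
    write_records(records)
    return {"written": len(records)}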

While I was working on this project, the company adopted both AWS and Azure to support over 1,000 engineers concurrently configuring and deploying over 200 critical services, and I led the Azure Digital Data Team.

Ingested data from disparate sources such as data warehouses, RDBMS, CSV, and JSON files into Azure Data Lake Storage Gen2; set up and configured Azure Data Factory as the ETL tool and Databricks to process batch data and perform analytics, running and scheduling our machine learning models in QA, Test, and Production.

This enabled deployment ideas from QA to production in a matter of minutes, whereas before it used to take hours.

Evaluated Cloudera Hadoop on AWS cloud services to estimate the scalability/cost of the environment

Upgraded Ambari to 2.2.2 and the HDP stack to 2.4.3 (Hive, Spark, YARN, HBase, Ranger)

Performance-tuned and troubleshot issues related to service components and the cluster

Created a data lake for offloading infrequently accessed data from the data warehouse and for staging purposes

Ingested data to HDFS from IBM AS/400, MS SQL Server, Teradata, Oracle, DB2, and UniData databases using Sqoop

Designed managed/external Hive Tables as per the requirements and stored them in ORC format for efficiency
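
A minimal sketch of that kind of external ORC table DDL, issued here through Spark with Hive support enabled; the database, columns, and HDFS location are illustrative placeholders:

from pyspark.sql import SparkSession

# enableHiveSupport lets Spark register the table in the Hive metastore.
spark = SparkSession.builder.appName("hive-ddl").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.play_events (
        user_id   STRING,
        track_id  STRING,
        played_at TIMESTAMP
    )
    PARTITIONED BY (event_date DATE)
    STORED AS ORC
    LOCATION 'hdfs:///data/analytics/play_events'
""")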

Configured capacity scheduler to create multiple queues and assigned the users/groups based on the resource requirements

Migrated cluster/data from one datacenter to another

Configured SAP HANA Smart Data Access (SDA), SAP Data Services to integrate with Hadoop environments for reporting purposes

Installed/configured SAP VORA to integrate with HDP cluster to leverage HANA SPS11 functionality

Implemented authentication using Kerberos and integrated with Active Directory (AD)

Setup and configured Ranger for handling the authorization of the service components

Managed five-member offshore team.

Used Spark SQL and the DataFrame API extensively to build Spark applications.

Used the Spark engine and Spark SQL for data analysis and provided the results to the data scientists for further analysis.

Performed streaming data ingestion to the Spark distribution environment, using Kafka.

Built a prototype for real-time analysis using Spark Streaming and Kafka.
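
A minimal sketch of such a prototype using Spark Structured Streaming (the original prototype may equally have used the older DStream API); it requires the spark-sql-kafka package, and the broker and topic names are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, window

spark = SparkSession.builder.appName("kafka-streaming-poc").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
    .option("subscribe", "play-events")                 # placeholder topic
    .load()
)

# Count events per Kafka key in one-minute windows.
counts = (
    events.selectExpr("CAST(key AS STRING) AS key", "timestamp")
    .groupBy(window(col("timestamp"), "1 minute"), col("key"))
    .agg(count("*").alias("events"))
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()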

Worked on Spark SQL to check the pirated data.

Used the Spark SQL and DataFrame APIs to load structured and semi-structured data into Spark clusters.

Worked closely with the data science team, building Spark MLlib applications to develop various predictive models.

Configured the Data Ingestion Accelerator, an in-house tool, to automate the ingestion process

Created Executive Dashboards for ETL Process and reporting

Developed Oozie workflows for scheduling and orchestrating the ETL process

Integrated BI tools (Spotfire, Crystal Reports, Lumira, Tableau) with Hadoop

NETGEAR Hadoop Developer

San Jose, CA November 2016 – December 2017

Connected different branch offices and datacenters across the globe with consistent upload and download speeds using AWS.

Implemented the analytics platform using Amazon Virtual Private Cloud (Amazon VPC), allowing it to use security groups to control access to the platform.

Implemented Amazon Elastic Compute Cloud (Amazon EC2) for scalable compute capacity.

Generated around 200 million log entries (equivalent to 10GB of log data) per day.

Utilized 6 to 10 Amazon EC2 instances for data analysis, scaling them according to demand, adding more instances as needed to process extra data or extract more information.

Used Amazon S3 for data storage, and Elastic Load Balancing for performance and stability.

Contributed to a serverless architectural design using AWS APIs, Lambda, S3, Athena, RDS, and DynamoDB, with the design optimized for Auto Scaling performance.

Produced AWS CloudFormation templates to create a custom infrastructure for our pipeline.

Implemented the ELK (Elasticsearch, Logstash, Kibana) stack in AWS to gather and investigate the logs created by the website.

Wrote unit tests for all code using PyTest and PySpark.
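
A small sketch of how such PyTest/PySpark tests are commonly structured, with a shared local SparkSession fixture; the test case itself is a hypothetical example, not project code:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Local master so the tests run without a cluster.
    session = SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()
    yield session
    session.stop()

def test_dedup_keeps_one_row_per_key(spark):
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "value"])
    assert df.dropDuplicates(["id"]).count() == 2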

Created data loss prevention plan using Amazon S3 storage for backups with Amazon Glacier for archival.

Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.

Used Amazon Kinesis and Amazon Redshift to provide real-time analytics.

Transformed log data into a data model using Pig and wrote UDFs to format the log data.

Loaded and transformed large sets of structured and semi-structured data pulled through Sqoop and placed in HDFS for further processing.

Implemented Spark using Scala and utilized DataFrames and the Spark SQL API for faster processing of data.

Involved in transforming data from legacy tables to HDFS, and HBase tables using Sqoop.

Extensively used transformations such as Router, Aggregator, Normalizer, Filter, Joiner, Expression, Source Qualifier, unconnected and connected Lookup, Update Strategy, Stored Procedure, and XML transformations, along with error handling and performance tuning.

Used Sqoop to extract the data back to a relational database for business reporting.

Designed and developed parallel jobs using different types of stages, such as Transformer, Aggregator, Merge, Join, Lookup, Sort, Remove Duplicates, Funnel, Filter, Pivot, and Shared Container.

Implemented all SCD types using server and parallel jobs. Extensively applied error handling, testing, debugging, and performance tuning of targets, sources, and transformation logic, and used version control to promote the jobs.

Involved in loading and transforming large sets of structured, semi-structured and unstructured data.

Involved in loading data from UNIX file system to HDFS.

Developed ETLs to pull data from various sources and transform it for reporting applications using PL/SQL

Hands-on experience extracting data from different databases and scheduling Oozie workflows to execute tasks daily.

Successfully loaded files to HDFS from Teradata and loaded them from HDFS to Hive.

BB&T Database Developer

Charlotte, NC Dec 2013 – Sep 2016

Connected different branch offices and datacenters across the globe with consistent upload and download speeds using AWS.

Performed data profiling and transformation on the raw data using Python and Oracle.

Imported and exported data into HDFS and Hive using Sqoop.

Loaded data from different sources such as HDFS or HBase into Spark RDDs and implemented in-memory data computation to generate the output response.

Involved in loading data from the UNIX file system to HDFS.

Utilized the Keras API in TensorFlow to build a model predicting the payback probability of credit card loans.
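
A minimal tf.keras sketch of a binary classifier of that shape; the feature count, layer sizes, and training call are illustrative assumptions, since the real features and data are not shown here:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),              # 20 placeholder input features
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of payback
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC()],
)

# model.fit(X_train, y_train, validation_split=0.2, epochs=10)  # data preparation omitted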

Imported JSON objects/datasets from multiple businesses and built multiple models with K-nearest neighbors, Ridge regression, and Random Forest regression to retrieve vital information.

Development and debugging experience in Python, Scala, and Java.

Implemented workflows using the Apache Airflow framework to automate tasks.
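
A minimal Airflow DAG sketch of that kind of automation, written against the Airflow 2.x import paths; the DAG id, schedule, start date, and shell commands are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="sqoop_import", bash_command="echo 'sqoop import ...'")
    transform = BashOperator(task_id="spark_transform", bash_command="echo 'spark-submit ...'")
    extract >> transform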

Installed and configured Hive and wrote Hive UDFs.

Performed ETL into the Hadoop file system (HDFS) and wrote Hive UDFs.

Transferred data between a Hadoop ecosystem and structured data storage in an RDBMS such as MySQL using Sqoop.

Developed scripts to automate the workflow processes and generate reports.

Developed a POC using Scala, deployed it on a YARN cluster, and compared the performance of Spark with Hive and SQL.

Implemented different components on the cloud for the Kafka application messaging for data processing and analytics.

Well experienced with hyperparameter tuning techniques using grid search and random search, as well as FeatureUnion.

Built a Pipeline for analyzing reviews of various companies' data using Natural Language Processing techniques.

EDUCATION

Computer Systems Engineer

Mexico


