Hadoop Developer/Big Data Engineer

Location: New York, NY

Hagop Jay Tumayan

Big Data/ Cloud/ Hadoop Developer

Email: advf02@r.postjobfree.com Phone: 646-***-****

PROFESSIONAL PROFILE

9+ years of combined experience in databases/IT, Big Data, Cloud, and Hadoop, with over 25 years of overall IT experience

Hands-on experience developing Snowflake Cloud Procedures and Functions and SQL tuning of large databases.

Importing / exporting Terabytes of data between HDFS and Relational Database Systems using Sqoop.

Hands-on with Extract Transform Load (ETL) from databases such as SQL Server and Oracle to Hadoop HDFS in Data Lake.

Very comfortable writing SQL queries, Stored Procedures, Triggers, Cursors, and Packages.

Utilize a full spectrum of Python libraries for analytical processing.

Display analytical insights using Python data visualization libraries and tools such as Matplotlib, Tableau, Power BI, D3.js, and several others.

Develop Spark code for Spark-SQL/Streaming in Scala, PySpark & SQL.

Integrate Kafka and Spark using Avro for serializing and deserializing data, and for Kafka Producer and Consumer.

Convert Hive/SQL queries into Spark transformations using Spark RDD and Data Frame.
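
For illustration, a minimal PySpark sketch of this kind of Hive-to-DataFrame conversion; the table and column names (sales, region, amount) are assumptions rather than details from any project listed here.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Build a session with Hive support so existing Hive tables are visible.
    spark = SparkSession.builder.appName("hive-to-dataframe").enableHiveSupport().getOrCreate()

    # Original HiveQL form:
    #   SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region
    sql_result = spark.sql("SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region")

    # Equivalent DataFrame transformation chain.
    df_result = (
        spark.table("sales")
             .groupBy("region")
             .agg(F.sum("amount").alias("total_amount"))
    )

    df_result.show()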

Use Spark SQL to perform data processing on data residing in Hive.

Configure Spark Streaming to receive real-time data using Kafka.

Use Spark Structured Streaming for high-performance, scalable, fault-tolerant real-time data processing.
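
As a sketch of that pattern, a minimal Structured Streaming job that reads from Kafka and lands the stream as Parquet; the broker address, topic, paths, and the spark-sql-kafka connector dependency are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Requires the spark-sql-kafka connector package on the Spark classpath.
    spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

    events = (
        spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
             .option("subscribe", "events")                      # placeholder topic
             .option("startingOffsets", "latest")
             .load()
    )

    # Kafka delivers values as binary; cast to string for downstream parsing.
    parsed = events.select(F.col("value").cast("string").alias("raw_event"))

    query = (
        parsed.writeStream
              .format("parquet")
              .option("path", "/data/landing/events")                  # placeholder path
              .option("checkpointLocation", "/data/checkpoints/events")
              .outputMode("append")
              .start()
    )
    query.awaitTermination()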

Write Hive/HiveQL scripts to extract, transform, and load data into databases.

Configure Kafka cluster with Zookeeper for real-time streaming.

Build highly available, scalable, fault-tolerant systems using Amazon Web Services (AWS).

Hands-on with Amazon Web Services (AWS) cloud services such as EMR, EC2, S3, EBS, and IAM entities and roles.

Experience using Hadoop clusters, HDFS, Hadoop tools, Spark, Kafka, and Hive for social, media, and financial analytics within the Hadoop ecosystem.

Highly knowledgeable in data concepts and technologies, including AWS pipelines and cloud repositories (Amazon AWS, MapR, Cloudera).

Hands-on experience using Cassandra, Hive, NoSQL databases (HBase, MongoDB), and SQL databases (Oracle, SQL Server, PostgreSQL, MySQL), as well as data lakes and cloud repositories, to pull data for analytics.

Experience with Microsoft Azure and Google Cloud.

TECHNICAL SKILLS

Programming Languages: Python, Scala, Java, C, C++, JavaScript

Databases: MS SQL Server, Oracle, DB2, MySQL, PostgreSQL, Cassandra, MongoDB, BigQuery

Scripting: HiveQL, SQL, MapReduce, Python, PySpark, Shell.

Distributions: MapR, Databricks, AWS, MS Azure, GCP, Cloudera.

Big Data Primary Skills: Hive, Kafka, Hadoop, HDFS, Spark, Cloudera, Azure Databricks, HBase, Cassandra, MongoDB, Zookeeper, Sqoop, Tableau, Kibana, MS Power BI, QuickSight, Hive Bucketing/Partitioning, Spark Performance Tuning and Optimization, Streaming, Data Wrangling

Apache Components: Cassandra, Hadoop, YARN, HBase, HCatalog, Hive, Kafka, NiFi, Airflow, Oozie, Spark, Zookeeper, Cloudera Impala, HDFS, MapR, MapReduce.

Data Processing: Apache Spark, Spark Streaming, Storm, Pig, MapReduce.

Operating Systems: UNIX/Linux, Windows (including administration and networking).

Cloud Services: AWS S3, EMR, Lambda Functions, Step Functions, Glue, Athena, Redshift Spectrum, Quicksight, DynamoDB, Redshift, CloudFormation, MS ADF, Azure Databricks, Azure Data Lake, Azure SQL, Azure HDInsight, GCP, Cloudera, Anaconda Cloud, Elastic.

Testing Tools: PyTest, Selenium, ScalaTest, Scalactic. 2 decades of troubleshooting experience.

PROFESSIONAL EXPERIENCE

Assurant, Inc., New York, NY September 2021 – Present

Big Data Engineer

(Assurant, Inc. is a global provider of risk management products and services that include niche-market insurance products in the property, casualty, extended device protection, and preneed insurance sectors)

Worked closely with customers and stakeholders to optimize their SQL queries so that we could move to a more efficient data structure and improve performance.

Designed data models to optimize query performance in the AWS Redshift data warehouse.

Orchestrated workflows in Apache Airflow to run ETL pipelines using tools in AWS.
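
A minimal sketch of an Airflow DAG in this shape; the DAG id, schedule, bucket, prefixes, and task logic are placeholders, not the actual pipeline.

    from datetime import datetime

    import boto3
    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def copy_raw_files_to_staging(**context):
        # Placeholder ETL step: copy objects from a raw prefix to a staging prefix in S3.
        s3 = boto3.client("s3")
        listing = s3.list_objects_v2(Bucket="example-data-lake", Prefix="raw/")
        for obj in listing.get("Contents", []):
            s3.copy_object(
                Bucket="example-data-lake",
                CopySource={"Bucket": "example-data-lake", "Key": obj["Key"]},
                Key=obj["Key"].replace("raw/", "staging/", 1),
            )


    with DAG(
        dag_id="example_etl_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        stage_files = PythonOperator(
            task_id="copy_raw_files_to_staging",
            python_callable=copy_raw_files_to_staging,
        )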

Exported data from on-premises data sources to AWS RDS SQL.

Created Spark jobs to migrate data from on-premises to the AWS cloud.

Developed Spark jobs for data processing and cleaning using Python.

Created test data and unit-testing scripts to test Python scripts and validate data as part of the CI/CD pipeline.

Implemented AWS Secrets Manager into Glue jobs to help encrypt account numbers and other private information for client hashing and protection.
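
A hedged sketch of that approach, assuming a Glue Spark job that pulls a hashing salt from Secrets Manager; the secret name, source table, column, and output path are placeholders.

    import hashlib
    import json

    import boto3
    from awsglue.context import GlueContext  # available inside the Glue job runtime
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    sc = SparkContext.getOrCreate()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session

    # Fetch a salt from Secrets Manager instead of hard-coding it in the job.
    secrets = boto3.client("secretsmanager", region_name="us-east-1")
    secret = json.loads(secrets.get_secret_value(SecretId="example/hash-salt")["SecretString"])
    salt = secret["salt"]


    def hash_account(account_number):
        # One-way hash so raw account numbers never land in the output.
        return hashlib.sha256((salt + account_number).encode("utf-8")).hexdigest()


    hash_udf = F.udf(hash_account)

    accounts = spark.table("claims.accounts")  # placeholder source table
    protected = accounts.withColumn("account_number", hash_udf(F.col("account_number")))
    protected.write.mode("overwrite").parquet("s3://example-bucket/protected/accounts/")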

Hands-on application with AWS Cloud (PaaS & IaaS).

Developed large-scale enterprise applications using Spark for data processing in the AWS cloud platform.

Scheduled workflows and ETL processes, with constant monitoring and support, using AWS Step Functions to orchestrate Lambda functions written in Python.

Created SNS topics for email notifications to get updates on several Lambda functions, Glue jobs, and tables, using Python.
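
For illustration, the basic boto3 call behind one of those notifications; the topic ARN and message text are placeholders.

    import boto3

    sns = boto3.client("sns", region_name="us-east-1")

    # Publish a job-status update to the alerting topic; email subscribers receive it.
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:example-etl-alerts",
        Subject="Glue job status",
        Message="Job example-nightly-load finished with status SUCCEEDED.",
    )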

Applied extensive knowledge of exploring and updating table items in AWS DynamoDB.

Implemented SQL queries in AWS Athena to view table contents and data variables in multiple datasets for data profiling.
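
A minimal sketch of issuing such a profiling query from Python with boto3; the database, table, columns, and result location are assumptions.

    import time

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    execution = athena.start_query_execution(
        QueryString=(
            "SELECT COUNT(*) AS row_count, COUNT(DISTINCT customer_id) AS customers "
            "FROM analytics.orders"
        ),
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/profiling/"},
    )

    # Poll until the query reaches a terminal state, then read the result set.
    query_id = execution["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        results = athena.get_query_results(QueryExecutionId=query_id)
        print(results["ResultSet"]["Rows"])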

Partitioned and bucketed ranges of log file information to organize the data in a common place and combine it to support business needs.

Worked with support to familiarize on-call staff with Hadoop by creating hands-on demos and POCs.

Team communication over MS Teams with project tracking on Azure DevOps.

Followed Agile methodologies, with daily team stand-ups, weekly direct-report meetings, and weekly customer touchpoints throughout the software development life cycle.

Served in an advisory role regarding Spark best practices.

Windsor Fixtures Inc, Atlanta, GA March 2014 – August 2021

(Windsor Fixtures, Inc. manufactures millwork products – Office Furniture, Fixtures, and Equipment. The Company provides store fixtures and services for the department, convenience, specialty retailers, grocers, restaurants, drug stores, and institutional cafeterias)

Big Data Engineer Client: Harwil Fixtures, Hastings, FL Apr 2020 – Aug 2021

Ingested data from Salesforce and other systems via flat files and APIs using Kafka.

Developed the code which handles data type conversions, data value mappings, and checking for required fields.

Developed the programs using Spark with Python (PySpark).

Mapped the ingested data to the Databricks schema and loaded it into its table in the data lake and Snowflake.

Modified the streaming-message pipeline to receive new incoming records for merging, handle missing (NULL) values, and trigger a corresponding merge in Salesforce records on receipt of a merge event.

Followed Agile Scrum processes for Software Development Lifecycle (SDLC) with 2-week Sprints and daily 30-minute standups (Scrums).

All relevant documentation was created and recorded on Confluence pages.

Tasks, sprints, stories, and backlog management were tracked using Jira Agile development software.

Major contributions included design, code, configuration, and documentation for components that manage data ingestion, real-time streaming, batch processing, data extraction, and transformation.

Supported the QA and MuleSoft teams in testing and troubleshooting Spark job run failures.

Created and managed Kafka topics and producers for the streaming data.

Cloud Engineer Client: Southern Store Fixtures, Wyoming, MI Apr 2018 – Mar 2020

Installed Spark and configured the Spark config files, environment path, Spark home, and external libraries.

Wrote Python script using Spark to read and count word frequency from multiple files and generate tables to compare frequency across the files.
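
A sketch of that word-frequency comparison; the file paths and labels are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("word-frequency").getOrCreate()

    files = {"report_a": "/data/text/report_a.txt", "report_b": "/data/text/report_b.txt"}

    counts = None
    for label, path in files.items():
        # Split each line into lowercase words and count occurrences per file.
        words = (
            spark.read.text(path)
                 .select(F.explode(F.split(F.lower(F.col("value")), r"\s+")).alias("word"))
                 .where(F.col("word") != "")
                 .groupBy("word")
                 .count()
                 .withColumnRenamed("count", label)
        )
        counts = words if counts is None else counts.join(words, on="word", how="outer")

    # One row per word, one column per file; missing words show as 0 after the fill.
    counts.na.fill(0).orderBy(F.desc("report_a")).show(20)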

Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language.

Imported and exported data using Flume and Kafka.

Wrote shell scripts to automate workflows to pull data from various databases into the Hadoop framework for users to access the data through Hive-based views.

Loaded data from the Linux file system to HDFS.

Developed and maintained Spark/Scala application which calculates interchange between different credit cards.

Created a pipeline to gather new product releases of a country for a given week using PySpark, Kafka, and Hive.

Used Spark SQL to convert DStreams into RDDs or DataFrames.
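
A hedged sketch of that legacy DStream pattern, using a socket source as a stand-in; the host, port, and single-column schema are assumptions.

    from pyspark.sql import Row, SparkSession
    from pyspark.streaming import StreamingContext

    spark = SparkSession.builder.appName("dstream-to-dataframe").getOrCreate()
    ssc = StreamingContext(spark.sparkContext, batchDuration=10)

    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source


    def process_batch(time, rdd):
        # Convert each micro-batch RDD into a DataFrame and query it with Spark SQL.
        if rdd.isEmpty():
            return
        df = spark.createDataFrame(rdd.map(lambda line: Row(raw=line)))
        df.createOrReplaceTempView("events")
        spark.sql("SELECT COUNT(*) AS events_in_batch FROM events").show()


    lines.foreachRDD(process_batch)
    ssc.start()
    ssc.awaitTermination()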

Worked with Hive on MapReduce, and various configuration options for improving query performance.

Optimized data ingestion in Kafka Brokers within the Kafka cluster by partitioning Kafka Topics.
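
As a sketch, creating a topic with an explicit partition count using kafka-python's admin client; the broker address, topic name, and counts are assumptions.

    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers=["broker1:9092"])  # placeholder broker

    # More partitions allow more consumers to read the topic in parallel.
    admin.create_topics([
        NewTopic(name="ingest-logs", num_partitions=12, replication_factor=3)
    ])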

Sent requests to a REST-based API from a Python script via a Kafka producer.
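
One plausible shape of that script, sketched with the requests and kafka-python libraries; the endpoint URL, broker, and topic name are placeholders.

    import json

    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],  # placeholder broker
        value_serializer=lambda record: json.dumps(record).encode("utf-8"),
    )

    # Pull records from the REST API and publish each one to a Kafka topic.
    response = requests.get("https://api.example.com/v1/releases", timeout=30)
    response.raise_for_status()

    for record in response.json():
        producer.send("product-releases", value=record)

    producer.flush()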

Performed Data scrubbing and processing with Airflow and Spark.

Configured Flume agent batch size, capacity, transaction capacity, roll size, roll count, and roll intervals.

Provided connections from Business Intelligence tools such as Tableau and Power BI to the tables in the data warehouse.

Cloud Engineer Client: Ethan Allen Interiors Inc., Danbury, CT Feb 2016 – Mar 2018

Created custom Spark Streaming jobs to process data streaming events.

Orchestrated workflows in Apache Airflow to run ETL pipelines using tools in AWS.

Integrated streams with Spark Streaming for high-speed processing.

Configured, deployed, and automated instances on AWS, Azure environments, and Data Centers.

Developed Spark jobs for data processing and Spark-SQL/Streaming for distributed processing of data.

Created modules for Apache Airflow to call different services in the cloud including EMR, S3, and Redshift.

Ingested data into the data lake using AWS S3.

Created an AWS Lambda function to extract data from Kinesis Firehose and post it to an AWS S3 bucket on a scheduled basis (every 4 hours) using an AWS CloudWatch event.
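
A hedged sketch of what such a scheduled handler might look like; the bucket names, prefixes, and copy logic are assumptions rather than the actual function.

    import boto3

    s3 = boto3.client("s3")

    RAW_BUCKET = "example-firehose-landing"   # where Firehose delivers raw objects
    CURATED_BUCKET = "example-curated"


    def lambda_handler(event, context):
        # Triggered every 4 hours by a CloudWatch Events rule; copy the delivered
        # objects into the curated bucket.
        listing = s3.list_objects_v2(Bucket=RAW_BUCKET, Prefix="firehose/")
        contents = listing.get("Contents", [])
        for obj in contents:
            s3.copy_object(
                Bucket=CURATED_BUCKET,
                CopySource={"Bucket": RAW_BUCKET, "Key": obj["Key"]},
                Key="events/" + obj["Key"].split("/")[-1],
            )
        return {"copied": len(contents)}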

Utilized Python libraries such as PySpark for analysis.

Configured the spark-submit command to allocate resources to jobs across the cluster.

Collected log information using custom-engineered input adapters and Kafka.

Created a custom producer to ingest the data into Kafka topics for consumption by custom Kafka consumers.

Performed maintenance, monitoring, deployments, and upgrades across applications that support all Spark jobs.

Partitioned and bucketed log file information to organize the data in a common place and combine it to support business needs.
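
A minimal sketch of that partitioning and bucketing on write; the source path, columns, and table name are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("log-partitioning").enableHiveSupport().getOrCreate()

    logs = spark.read.json("/data/raw/logs/")  # placeholder source

    (
        logs.write
            .partitionBy("event_date")      # one directory per day for partition pruning
            .bucketBy(16, "customer_id")    # co-locate a customer's rows for joins
            .sortBy("event_time")
            .format("parquet")
            .mode("overwrite")
            .saveAsTable("analytics.web_logs")
    )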

Set up HBase and stored data in HBase.

Deployed the application JAR files to AWS EC2 instances.

Developed a task execution framework on EC2 instances using Lambda and Airflow.

Implemented ksqlDB queries to process the data ingested in the Kafka topic.

Created and maintained the data warehouse in AWS Redshift.

Implemented different optimization techniques to improve the performance of the data warehouse in AWS Redshift.

Authored queries in AWS Athena to query from files in S3 for data profiling.

Hadoop Administrator Client: Herman Miller Inc., Zeeland, MI Mar 2014 – Feb 2016

Configured Fair Scheduler to allocate resources to all the applications across the cluster.

Configured Zookeeper to coordinate the servers in clusters to maintain data consistency and monitor services.

Designed and presented a POC on introducing Impala in project architecture.

Implemented Yarn resource pools to share resources of the cluster for Yarn jobs submitted by users.

Performed code reviews of simple to complex Map/Reduce Jobs using Hive and Pig.

Implemented cluster monitoring using the Big Insights ionosphere tool.

Imported data from various data sources and parsed it into structured data.

Analyzed data by performing Hive queries and ran Pig scripts to study customer behavior.

Automated all the jobs for pulling data from the SFTP server to load data into Hive tables using Oozie workflows.

Optimized data storage in Kafka Brokers within the Kafka cluster by partitioning Kafka Topics.

Performed maintenance, monitoring, deployments, and upgrades across infrastructure that supported all Hadoop clusters.

Wrote Pig Latin script to read various file system sources of data and do the processing of data.

Used Zookeeper and Oozie for coordinating the cluster and scheduling workflows.

Used Log4j for logging the output to the files.

Managed and reviewed Hadoop log files.

Used Impala where possible to achieve faster results compared to Hive during data Analysis.

Edited and configured HDFS and tracker parameters.

PAST EXPERIENCE

CI/CD — CEO Jan 2008 – Feb 2014

Freelance Consultant - Dev

Bright House Networks, Tampa, FL Mar 2000 - Dec 2007

Software Engineer

DMA, Orlando, NYC Sep 1994 – Dec 1999

Developer

EDUCATION

BA in Philosophy

The Florida State University


