HADOOP BIG DATA ENGINEER
Phone: 408-***-**** Email: firstname.lastname@example.org
Skilled in managing data analytics, data processing, database, and data-driven projects
Skilled in Architecture of Big Data Systems, ETL Pipelines, and Analytics Systems for diverse end users
Skilled in Database systems and administration
Proficient in writing technical reports and documentation
Adept with various distributions such as Cloudera Hadoop, Hortonworks, MapR, Elastic Cloud, and Elasticsearch
Expert in bucketing and partitioning
Expert in Performance Optimization
Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache MAVEN, Apache Oozie, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala, HDFS
Hortonworks, MapR, MapReduce
HiveQL, MapReduce, XML, FTP
Python, UNIX, Shell scripting, LINUX
Unix/Linux, Windows 10, Ubuntu, Apple OS
Parquet, Avro & JSON, ORC, text, csv
Cloudera CDH 4/5, Hortonworks HDP 2.5/2.6, Amazon Web Services (AWS), Elastic, ELK
DATA PROCESSING (COMPUTE) ENGINES
Apache Spark, Spark Streaming, Flink
DATA VISUALIZATION TOOLS
Pentaho, QlikView, Tableau, Power BI, Matplotlib
Apache Spark, Spark Streaming, Storm
Microsoft SQL Server Database (2005, 2008R2, 2012)
Database & Data Structures, Apache Cassandra, Amazon Redshift, DynamoDB, Apache HBase, Apache Hive, MongoDB
Microsoft Project, Primavera P6, VMware, Microsoft Word, Excel, Outlook, PowerPoint; Technical Documentation Skills
Freddie Mac Remote
DATA ENGINEER December 2020 – Present
Implemented AWS EMR Spark using PySpark and utilized DataFrames and SparkSQL API for faster processing of data
Registered datasets in AWS Glue through the REST API
Used AWS API Gateway to trigger Lambda functions
Queried data residing in AWS S3 buckets with Athena
Used AWS Step Functions to run a data pipeline
Used DynamoDB to store metadata and logs
Monitored and managed services with AWS CloudWatch
Performed transformations using Apache SparkSQL
Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation
Developed Spark code using Python and Spark-SQL for faster testing and data processing.
Tuned Spark to improve job performance
Configured ODBC Driver, Presto Driver with Okera and RapidSQL
Used Dremio as a query engine, with Dremio data reflections, for faster joins and complex queries over data in AWS S3 buckets
Realtor Santa Clara, California
HADOOP DATA ENGINEER December 2018 – December 2020
Installed the entire Hadoop ecosystem on new servers, including Hadoop, Spark, PySpark, Kafka, Hortonworks, Hive, and Cassandra
Developed a data pipeline used for extracting historic flood information from online sources
Used Python to scrape relevant articles about the most recent floods
Used Python to make requests to news source APIs as well as social media APIs such as Facebook and Twitter
Stored unprocessed JSON and HTML files in HDFS data lake
Retrieved structured and unstructured data from HDFS and MySQL into Spark to perform MapReduce jobs
Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark in Scala
Used Spark Context and Spark Session to process text files by flat mapping, mapping to RDDs, and reducing RDDs by key to identify sentences containing valuable information
Worked with analytics team to provide querying insights and helped develop methods to map informative sentences more efficiently
Adjusted tables and schema to provide more informative data to be used in machine learning models
Worked with Apache Spark, a fast, general-purpose engine for large-scale data processing, using the functional programming language Scala.
Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
Handled schema changes in the data stream.
Created Kafka topics via the CLI for structured streaming to obtain structured data by schema.
Applied Hive partitioning and bucketing and performed joins on Hive tables.
Performed transformations and analysis using Hive
Medline Industries Northfield, IL
HADOOP DATA ENGINEER May 2017 – December 2018
Installed and configured Hadoop HDFS and developed multiple jobs in Java for data cleaning and preprocessing.
Developed MapReduce jobs using Scala for data transformations.
Developed system components such as Hadoop processes involving MapReduce and Hive.
Migrated ETL processes from Oracle to Hive to test ease of data manipulation.
Used Sqoop to export data back to relational databases for business reporting.
Helped create Hive and Pig tables, load data, and write Hive queries and Pig scripts.
Involved in Hadoop cluster administration, including adding and removing cluster nodes, cluster capacity planning, performance tuning, and cluster monitoring.
Developed Hive queries and UDFs to analyze and transform data in HDFS.
Designed and implemented partitioning (static, dynamic) and bucketing in Hive.
Used Sqoop to efficiently transfer data between databases and HDFS
Debugged and identified issues reported by QA in Hadoop jobs by configuring the jobs to run against the local file system.
Implemented Flume to import streaming data logs and aggregate the data into HDFS.
Ran Hadoop streaming jobs to process terabytes of data.
Evaluated and analyzed the Hadoop cluster and various big data analytics tools, including the HBase database and Sqoop.
Wells Fargo San Francisco, CA
AWS CLOUD DATA ENGINEER October 2015 – May 2017
Imported data from sources such as AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
Developed Spark scripts using Scala shell commands as required.
Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database for large volumes of data.
Developed Spark programs using PySpark APIs to compare the performance of Spark with Hive and SQL.
Used Scala libraries to process XML data stored in HDFS and stored the processed data back in HDFS.
Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
Wrote various Pig scripts to clean up the ingested data and created partitions for the daily data.
Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and buckets.
Helped set up HBase and store data in HBase for use in analysis.
Used Impala for querying HDFS data to achieve better performance.
Used Spark SQL to load JSON data, create a SchemaRDD, and load it into Hive tables; handled structured data using Spark SQL.
Implemented HQL scripts to load data from and store data into Hive.
Developed Spark jobs to parse JSON and XML data.
Used JSON and XML serialization and deserialization to load JSON and XML data into Hive tables.
Helped convert Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.
Analyzed the SQL scripts and designed the solution to implement using PySpark.
Tested MongoDB NoSQL data modeling, tuning, disaster recovery, and backup.
Used Avro, Parquet, and ORC data formats to store data in HDFS.
Used Oozie workflows to coordinate Pig and Hive scripts.
Deployed to Amazon Web Services (AWS) Cloud services like EC2, S3, EBS, RDS and VPC.
Worked with various HDFS file formats such as Avro and SequenceFile and compression formats such as Snappy.
Dick's Sporting Goods Oakdale, PA
DATA ENGINEER April 2013 – October 2015
Involved in architectural design of cluster infrastructure, resource mobilization, risk analysis, and reporting.
Commissioned and decommissioned DataNodes and assisted with NameNode maintenance.
Regularly backed up and cleared logs from HDFS space to utilize DataNodes optimally. Wrote shell scripts for time-bound command execution.
Edited and configured HDFS and tracker parameters.
Scripted requirements using BigSQL and provided timing statistics for running jobs.
Took part in code reviews of simple to complex MapReduce jobs using Hive and Pig
Monitored the cluster using the BigInsights ionosphere tool.
Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
Installed Oozie workflow engine to run multiple Hive and Pig jobs.
Bachelor of Science Bioengineering, Bioinformatics
University of Illinois at Chicago
IBM Scala Certificate
IBM Hadoop Certificate
IBM Big Data Certificate