Big Data Engineer

Location: Reston, VA
Posted: January 30, 2023

Name: Jeff UBA

Sr. Data Engineer

SUMMARY:

●8+ years of experience in Software Development with a strong emphasis on Data Engineering and Data Analytics using large-scale datasets.

●Strong experience in end-to-end data engineering including data ingestion, data cleansing, data transformations, data validations/auditing and feature engineering.

●Strong experience in programming languages like Java, Scala, and Python.

●Strong experience working with Hadoop ecosystem components like HDFS, Map Reduce, Spark, HBase, Oozie, Hive, Sqoop, Pig, Flume and Kafka

●Good hands-on experience working with various Hadoop distributions, mainly Cloudera (CDH), Hortonworks (HDP), and Amazon EMR.

●Good understanding of Distributed Systems architecture and design principles behind Parallel Computing.

●Expertise in developing production-ready Spark applications utilizing Spark Core, DataFrames, Spark SQL, Spark ML and Spark Streaming APIs.

●Strong experience troubleshooting failures in Spark applications and fine-tuning Spark applications and Hive queries for better performance.

●Worked extensively on building real time data pipelines using Kafka for streaming data ingestion and Spark Streaming for real time consumption and processing.

●Worked extensively on Hive for building complex data analytical applications.

●Strong experience writing complex map-reduce jobs including development of custom Input Formats and custom Record Readers.

●Experience in managing the Hadoop infrastructure with Cloudera Manager.

●Good exposure to NoSQL databases, including the column-oriented HBase and Cassandra and the document-oriented MongoDB.

●Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems (RDBMS) and vice-versa.

●Good experience working with Apache NiFi for building data flows from multiple sources such as FTP and REST APIs.

●Good experience working with AWS Cloud services such as S3, EMR, Lambda, Redshift, Athena and Glue.

●Solid experience working with CSV, text, Avro, Parquet, ORC, and JSON data formats (see the sketch at the end of this summary).

●Working experience in Core Java, including efficient use of the Collections framework, multithreading, I/O, JDBC and localization, with the ability to develop new APIs for different projects.

●Experience in building, deploying, and integrating applications in Application Servers with ANT, Maven and Gradle.

●Experience in using IDE tools such as Visual Studio, NetBeans, and Eclipse, and application servers such as WebSphere, WebLogic, and Tomcat.

●Expertise in all phases of System Development Life Cycle Process (SDLC), Agile Software Development, Scrum Methodology and Test-Driven Development.

●Used Tomcat server for application development and utilized JIRA for task scheduling.

●Experience in using Version Control tools like Git, SVN.

●Experience in web application design using open-source MVC frameworks such as Spring and Spring Boot.

●Adequate knowledge and working experience in Agile and Waterfall Methodologies.

●Defined user stories and drove the Agile board in JIRA during project execution; participated in sprint demos and retrospectives.

●Good interpersonal and communication skills, strong problem-solving skills, and the ability to explore and adapt to new technologies with ease; a good team player.
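
For illustration, a minimal Spark (Scala) sketch of the multi-format read/write work referenced in the summary above; the bucket, paths, and option values are hypothetical placeholders rather than project specifics.

import org.apache.spark.sql.SparkSession

object FormatConversionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("format-conversion-sketch").getOrCreate()

    // Read a headered CSV file (placeholder path and options)
    val csvDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://example-bucket/raw/input.csv")

    // Read JSON records from another placeholder location
    val jsonDf = spark.read.json("s3://example-bucket/raw/events/")

    // Persist both datasets in columnar formats for downstream analytics
    csvDf.write.mode("overwrite").parquet("s3://example-bucket/curated/input_parquet/")
    jsonDf.write.mode("overwrite").orc("s3://example-bucket/curated/events_orc/")
  }
}

Columnar formats such as Parquet and ORC are splittable and compress well, which is why they recur across the projects below.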

TECHNICAL SKILLS:

Hadoop Components: HDFS, Hue, MapReduce, Hive, Sqoop, Impala, Zookeeper, Flume

Spark Components: Spark RDD, DataFrames, Spark SQL, PySpark, Spark Streaming, Spark ML

Databases: Oracle, Teradata, Microsoft SQL Server, MySQL

Programming Languages: Java, Python, Scala

Web Servers: Windows Server 2005/2008/2012 and Apache Tomcat

Cloud Services: AWS S3, EMR, Redshift, Glue, Lambda, Step Functions, Athena, GCP Dataproc, BigQuery

IDEs: Eclipse, IntelliJ IDEA, PyCharm

NoSQL Databases: HBase, Cassandra, MongoDB

Release Management Tools: Jenkins, Maven, SBT, GitHub, Jira

Development Methodologies: Agile/Scrum

PROFESSIONAL EXPERIENCE:

Client: T-Mobile, Reston, VA

Hadoop Developer Jan 2021 – Present

Responsibilities

●Developed Spark applications to implement various data cleansing, validation, and processing activities on large-scale datasets ingested from traditional data warehouse systems.

●Worked with both batch and real-time streaming data sources.

●Developed custom Kafka producers to write streaming messages from external REST applications to Kafka topics.

●Developed Spark Streaming applications to consume streaming JSON messages from Kafka topics (see the sketch after this list).

●Developed data transformation jobs using Spark DataFrames to flatten JSON documents to CSV.

●Worked on improving the performance and optimization of existing Spark transformations.

●Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model, which receives data from Kafka in near real time and persists it to HBase.

●Worked with and gained deep experience in AWS Cloud services like EMR, S3, RDS, Redshift, Athena, and Glue.

●Migrated existing on-premises data pipelines to AWS.

●Worked on automating provisioning of AWS EMR clusters.

●Used HiveQL to analyze partitioned and bucketed data and executed Hive queries on Parquet tables to perform data analysis that meets the business specification logic.

●Experience using Avro, Parquet, ORC, and JSON file formats; developed UDFs in Hive.

●Worked with Log4j framework for logging debug, info & error data.

●Used Jenkins for Continuous integration.

●Generated various kinds of reports using Tableau based on client specification.

●Used Jira for bug tracking and Git to check-in and checkout code changes.

●Responsible for generating actionable insights from complex data to drive real business results for various application teams and worked in Agile Methodology projects extensively.

●Worked with Scrum team in delivering agreed user stories on time for every Sprint.
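
A minimal Scala sketch of the Kafka-to-flattened-CSV consumption described above, written against Spark Structured Streaming (the bullets above say Spark Streaming; this uses the structured API as one possible shape). The broker address, topic, schema fields, and S3 paths are hypothetical placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

object StreamConsumerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-json-consumer-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical schema for the incoming JSON messages
    val schema = new StructType()
      .add("eventId", StringType)
      .add("eventTime", TimestampType)
      .add("payload", StringType)

    // Subscribe to a placeholder Kafka topic
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events-topic")
      .load()

    // Parse the Kafka value as JSON and flatten it into top-level columns
    val flattened = raw.selectExpr("CAST(value AS STRING) AS json")
      .select(from_json($"json", schema).as("data"))
      .select("data.*")

    // Write the flattened records out as CSV micro-batches (placeholder paths)
    flattened.writeStream
      .format("csv")
      .option("path", "s3://example-bucket/flattened/")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/flattened/")
      .start()
      .awaitTermination()
  }
}

The checkpoint location is what lets a restarted stream resume from the last committed Kafka offsets instead of reprocessing the topic.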

Environment: AWS EMR, Spark, Hive, Kafka, S3, Scala, Athena, Redshift, Glue, Python

Client: The Kroger, Cincinnati, OH

Hadoop Developer Feb 2019 – Dec 2020

Responsibilities

●Ingested gigabytes of clickstream data from external sources such as FTP servers and S3 buckets on a daily basis using customized, home-grown input adapters.

●Created Sqoop scripts to import/export data from RDBMS to S3 data store.

●Developed various Spark applications in Scala to perform cleansing, transformation, and enrichment of the clickstream data.

●Involved in data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for machine learning and reporting.

●Troubleshot Spark applications to improve error tolerance and reliability.

●Fine-tuned Spark applications/jobs to improve efficiency and the overall processing time of the pipelines.

●Created a Kafka producer API to send live-stream JSON data into various Kafka topics.

●Developed Spark-Streaming applications to consume the data from Kafka topics and to insert the processed streams to HBase.

●Utilized Spark in-memory capabilities to handle large datasets.

●Used broadcast variables in Spark, effective and efficient joins, transformations, and other capabilities for data processing (see the sketch after this list).

●Experienced in working with EMR cluster and S3 in AWS cloud.

●Created Hive tables and loaded and analyzed data using Hive scripts; implemented partitioning, dynamic partitions, and bucketing in Hive.

●Involved in continuous integration of the application using Jenkins.

●Interacted with the infrastructure, network, database, application, and BA teams to ensure data quality and availability

●Followed Agile Methodologies while working on the project.
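
A minimal Scala sketch of the broadcast-style enrichment mentioned above; the dataset paths, join key, and partition column are hypothetical placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object EnrichmentSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("clickstream-enrichment-sketch").getOrCreate()

    // Placeholder inputs: a large clickstream dataset and a small product lookup table
    val clicks = spark.read.json("s3://example-bucket/clickstream/")
    val products = spark.read.parquet("s3://example-bucket/dim/products/")

    // Broadcast the small dimension table so the join avoids shuffling the large dataset
    val enriched = clicks.join(broadcast(products), Seq("product_id"), "left")

    // Write the enriched data partitioned by date for downstream Hive queries
    enriched.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("s3://example-bucket/enriched/clickstream/")
  }
}

Broadcasting the small side keeps the large clickstream dataset from being shuffled across the cluster, which is typically the point of using broadcast variables in joins.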

Environment: AWS EMR, Spark, Hive, HDFS, Sqoop, Kafka, Oozie, HBase, Scala, Java.

BBVA, Birmingham, AL

Data Engineer Dec 2017 – Jan 2019

●Involved in developing a roadmap for migration of enterprise data from multiple sources, such as SQL Server and provider databases, into S3, which serves as a centralized data hub across the organization.

●Loaded and transformed large sets of structured and semi-structured data from various downstream systems.

●Developed ETL pipelines using Spark and Hive to perform various business-specific transformations.

●Built data applications and automated pipelines in Spark for bulk loads as well as incremental loads of various datasets (see the sketch after this list).

●Worked closely with data science teams and business consumers to shape the datasets as per requirements.

●Automated the data pipeline to ETL all the datasets, covering both full loads and incremental loads.

●Performed bulk loads of JSON data from S3 buckets to Snowflake.

●Used Snowflake functions to perform semi-structured data parsing entirely with SQL statements.

●Utilized AWS services such as EMR, S3, the Glue metastore, and Athena extensively for building data applications.

●Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB; deployed AWS Lambda code from Amazon S3 buckets, created a Lambda deployment function, and configured it to receive events from the S3 bucket.

●Worked on building input adapters for data dumps from FTP servers using Apache Spark.

●Generated data models using Erwin 9.6, developed the relational database system, and was involved in logical modeling using dimensional modeling techniques such as Star Schema and Snowflake Schema.

●Wrote Spark applications to perform operations like data inspection, cleansing, loading, and transformation of large sets of structured and semi-structured data.

●Developed Spark with Scala and Spark-SQL for testing and processing of data.

●Made Spark job stats reporting, monitoring, and data quality checks available for each dataset.

●Used SQL programming skills to work with relational SQL databases.
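
A minimal Scala sketch of the incremental-load pattern referenced above; the high-water mark handling, column names, and S3 paths are hypothetical placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

object IncrementalLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("incremental-load-sketch").getOrCreate()

    // Placeholder high-water mark; in practice this would come from a control table or job argument
    val lastLoadedTs = args.headOption.getOrElse("2018-06-01 00:00:00")

    // Read only the source records that changed since the last successful load
    val delta = spark.read.parquet("s3://example-bucket/raw/transactions/")
      .filter(col("updated_at") > lastLoadedTs)
      .withColumn("load_date", to_date(col("updated_at")))

    // Append just the incremental slice to the curated zone, partitioned by load date
    delta.write
      .mode("append")
      .partitionBy("load_date")
      .parquet("s3://example-bucket/curated/transactions/")
  }
}

A full (bulk) load would simply drop the filter and switch the write mode to overwrite.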

Environment: AWS Cloud Services, Apache Spark, Spark-SQL, Unix, Kafka, Scala, SQL Server.

Indium Software, Singapore

Hadoop/Big Data Developer September 2016 – November 2017

●Involved in importing and exporting data between the Hadoop data lake and relational systems like Oracle and MySQL using Sqoop.

●Involved in developing Spark applications to perform ELT-style operations on the data.

●Migrated existing MapReduce jobs to Spark transformations and actions by utilizing Spark RDDs, DataFrames, and Spark SQL APIs.

●Utilized Hive partitioning and bucketing and performed various kinds of joins on Hive tables.

●Involved in creating Hive external tables to perform ETL on data produced on a daily basis.

●Validated the data being ingested into Hive for further filtering and cleansing.

●Developed Sqoop jobs to perform incremental loads from RDBMS into HDFS and applied further Spark transformations.

●Loaded data into Hive tables from Spark using the Parquet columnar format (see the sketch after this list).

●Created Oozie workflows to automate and productionize the data pipelines.

●Migrated MapReduce code into Spark transformations using Spark and Scala.

●Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.

●Did a PoC on GCP cloud services and the feasibility of migrating the on-prem setup to GCP, utilizing various GCP services such as Dataproc, BigQuery, and Cloud Storage.

●Designed and documented operational problems following standards and procedures using JIRA.
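
A minimal Scala sketch of loading Spark output into a Parquet-backed Hive table, as referenced above; the database, table, path, and partition column are hypothetical placeholders.

import org.apache.spark.sql.SparkSession

object HiveLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-parquet-load-sketch")
      .enableHiveSupport()   // lets Spark use the Hive metastore
      .getOrCreate()

    // Placeholder source: data landed in HDFS by an upstream Sqoop job
    val orders = spark.read.parquet("hdfs:///data/staging/orders/")

    // Write into a partitioned, Parquet-backed Hive table (names are illustrative)
    orders.write
      .mode("overwrite")
      .format("parquet")
      .partitionBy("order_date")
      .saveAsTable("analytics.orders")
  }
}

Partitioning by date keeps Hive queries that filter on the load date from scanning the whole table.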

Environment: CDH, Hadoop, Hive, Impala, Oracle, Spark, Pig, Sqoop, Oozie, MapReduce, Git, Confluence, Jenkins.

COSMICVENT Software, India June 2015 – August 2016

Java Developer

●Involved in designing Class and Sequence diagrams with UML and data flow diagrams.

●Implemented MVC architecture using the Struts framework to get the Free Quote.

●Designed and developed the front end using JSP, Struts (Tiles), XML, JavaScript, and HTML.

●Used Struts tag libraries to create JSPs.

●Implemented Spring MVC, dependency injection (DI), and aspect-oriented programming (AOP) features along with Hibernate.

●Experienced with implementing navigation using Spring MVC.

●Used Hibernate for object-relational mapping persistence.

●Implemented message-driven beans to pick messages up from queues and resend them to the support team using MSend commands.

●Experienced with Hibernate core interfaces such as Configuration, SessionFactory, Transaction, and Criteria.

●Reviewed the requirements and was involved in database design for new requirements.

●Wrote complex SQL queries to perform various database operations using TOAD.

●Used the Java Mail API to notify agents about the free quote and to send email to customers with a promotion code for validation.

●Involved in testing using JUnit.

●Performed application development using Eclipse and WebSphere Application Server for deployment.

●Used SVN for version control.

Environment: Java, Spring, Hibernate, JMS, Web Services, EJB, SQL, PL/SQL, HTML, CSS, JSP, JavaScript, Ant, JUnit, WebSphere.


