Hadoop Developer / Data Engineer

Location:
Raleigh, NC
Posted:
May 08, 2023

Resume:

VARUN KUMAR

AWS Data Engineer Hadoop Developer

PROFESSIONAL SUMMARY:

●Around 9 years of experience in information technology, with extensive knowledge of the Cloudera Data Platform (CDP), Big Data technologies and the AWS cloud.

●Hands-on experience with Hadoop ecosystem technologies including Apache Spark, Hive, Sqoop, Impala, Apache HBase, MapReduce/YARN and HDFS.

●Hands-on experience with complete life cycle implementations using CDP and AWS.

●Worked on building a dedicated transformation layer to collect data from multiple sources and implement complex transformation logic on large data sets using PySpark, SparkSQL and Kafka for better performance (see the sketches following this summary).

●Experienced in creating automation applications that handle multiple file formats such as Parquet, Avro and JSON, apply specific transformations to repurpose the data, and load it into Hive tables used for reporting.

●Designed and developed scalable and cost-effective architecture with AWS Big Data services for the data life cycle of collection, ingestion, storage, processing and visualization.

●Hands-on experience in building data pipelines to move and transform large datasets efficiently on AWS EMR.

●Worked on AWS services such as Glue, Step Functions and Lambda to revamp ETL pipelines and automate them to work on EMR.

●Expertise with Amazon Athena and QuickSight for data query and analysis.

●Experience in installing and setting up Hadoop environments in the cloud through Amazon Web Services (AWS) offerings such as EMR and EC2, which provide efficient processing of data, with development in Java and Python.

●Hands-on experience in building an end-to-end framework to source, ingest, validate, transform and load/write fact data into final tables using Python, SparkSQL, Impala.

●Sound knowledge of partitioning and bucketing concepts in Hive; designed both external and managed Hive tables to optimize performance (see the sketches following this summary).

●Experience in designing, developing, documenting and integrating applications using Hadoop, Hive and Pig on the Cloudera, Hortonworks and MapR Big Data platforms, as well as Snowflake and Apache Airflow, including 5 years of experience in Data Engineering and 3 years in Data Warehousing.

●Experienced in data ETL from SQL/NoSQL databases and worked on integrating data from various file formats into a combined analytical data model for reporting.

●Collaborated with a team of data engineers in building a common federated framework deployed across the entire cluster, including all sub-accounts/tenants, to create a single front-end platform for users to upload, query and validate data.

●Strong understanding of creating Sqoop jobs to ingest data from various sources into HDFS.
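
The sketches below illustrate two of the techniques referenced in this summary. They are minimal, hypothetical examples, not project code. The first assumes a Kafka topic named trades, a broker at broker:9092 and a staging Hive database, and requires the Spark Kafka connector package on the classpath; it shows a batch read from Kafka, JSON field extraction, and a SparkSQL aggregation written to an intermediate Hive table.

    # Minimal sketch: batch-read a Kafka topic, parse JSON fields, aggregate with SparkSQL.
    # Broker, topic, field and table names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder.appName("trade-transform")
             .enableHiveSupport().getOrCreate())

    raw = (spark.read.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
           .option("subscribe", "trades")                      # assumed topic
           .load())

    # Kafka delivers the payload as binary; cast to string and pull out JSON fields.
    trades = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.get_json_object("json", "$.trade_id").alias("trade_id"),
                      F.get_json_object("json", "$.amount").cast("double").alias("amount"),
                      F.get_json_object("json", "$.trade_date").alias("trade_date")))

    # Apply a SparkSQL transformation and persist to an intermediate Hive table.
    trades.createOrReplaceTempView("trades_raw")
    spark.sql("""
        SELECT trade_date, SUM(amount) AS daily_balance
        FROM trades_raw
        GROUP BY trade_date
    """).write.mode("overwrite").saveAsTable("staging.daily_balances")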
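
The second sketch shows a partitioned, bucketed external Hive table alongside a managed counterpart, as described above; the database, columns, bucket count and HDFS location are hypothetical.

    # Minimal sketch: external vs. managed Hive tables with partitioning and bucketing.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # External table: Hive owns only the metadata; the data stays at the given location.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS finance.balances (
            account_id STRING,
            amount     DOUBLE
        )
        PARTITIONED BY (business_date STRING)
        CLUSTERED BY (account_id) INTO 32 BUCKETS
        STORED AS PARQUET
        LOCATION 'hdfs:///data/finance/balances'
    """)

    # Managed table: dropping it also removes the underlying data files.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS finance.balances_snapshot
        STORED AS PARQUET
        AS SELECT * FROM finance.balances
    """)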

TECHNICAL SKILLS:

Operating Systems

Windows, Linux.

Big Data Technologies

HDFS, MapReduce/YARN, Spark, SparkSQL, Hive, Impala, Kafka, Oozie, Zookeeper, Sqoop and Jupyter Notebook.

Cloud Technologies

AWS Lambda, EMR, S3, SNS, Athena, Glue, Step Functions, IAM, Snowflake.

Languages

Scala, SQL, Python.

Databases

Oracle, HBase, Cassandra and MS SQL Server.

BI Tools

Tableau and Qlik.

Version Control Tools

TortoiseSVN, Git.

IDE

Eclipse, IntelliJ, PyCharm.

PROFESSIONAL EXPERIENCE:

Client: CREDIT SUISSE, NC

Role: Big Data Engineer/Data Engineer October 2017 – Present

The ViR Fast Fill project aims to populate the Core Finance Data area with Investment Banking trade and balance data sourced from various ledger accounts and balances stored across multiple micro-tenants, which define a physical segregation of data held on the Finance platform and each represent a separate application within it. The data is supported by core reference data from authoritative sources to serve future consumers of finance data, who are expected to migrate their reporting and processes to Finance.

Responsibilities:

●Working closely with stakeholders, business analysts, project managers and product architects to discuss and define technical requirements and create project documentation.

●Coordinating and working closely with teams across multiple time zones to make sure deliverables are completed within agreed timelines.

●Responsible for creating a framework using SparkSQL, Python and Kafka to efficiently ingest, transform and process large data sets into intermediate Hive tables.

●Worked on creating an intermediate layer to implement complex solutions to process adjusted balances for daily trades using a series of Spark Applications written in SparkSQL and Python.

●Implementing a series of mandatory validation checks using SQL and Impala to analyze, detect and log defects in the imported data.

●Creating data validation tools using Spark and Python which are leveraged to create a series of Tableau dashboards to further analyze the integrity of the data.

●Involved in building an end-to-end automated framework to handle different file formats from multiple sources using Python.

●Built a framework using shell scripts, PySpark and Python to migrate and validate data from HDFS to S3 for various clients.

●Worked on coding various assets using Python, AWS and a MySQL database.

●Created an AWS API Gateway endpoint that invokes a Lambda function and returns a status code based on the validations performed. When a user hits the endpoint with a valid request body, the Lambda function runs the code that fetches data from the database; all request validation is done with JSON Schema, including schema-level validations (see the sketches following these responsibilities).

●Worked on migrating existing ETL pipelines from Cloudera environment to EMR.

●Used AWS managed services like Glue, Step Functions and Lambda to revamp ETL pipelines and automate them to work on EMR.

●Configured S3 event triggers for AWS Lambda to automate pipeline runs (see the sketches following these responsibilities).

●Implemented CloudFormation templates to automate infrastructure provisioning and pipeline runs.

●Experienced in leveraging Kedro and Flow to successfully orchestrate, schedule and run pipelines.

●Hands-on experience with validating data in internal stage files before loading into fact tables.

●Experienced in handling large datasets using partitioning, Spark in-memory capabilities, distributed processing, broadcast variables and effective, efficient joins during the ingestion and transformation process (see the sketches following these responsibilities).

●Building views on top of transformed data to improve performance while querying data from Tableau or Qlik reporting tools.

●Experienced in loading data from the internal stage into Snowflake tables using SnowSQL (see the sketches following these responsibilities).

●Used import and export between Snowflake internal and external stages.

●Working with project stakeholders, business analysts and the QA team to perform testing and troubleshooting of the applications to provide the best possible output before productionizing the components.

●Creating and modifying XMLs to automate and execute a large number of jobs in batch mode via Control-M.

●Performed the duties of a scrum master to ensure the project deliverables are completed within agreed timelines.

●Integrated data storage solutions with Spark, especially AWS S3 object storage.

●Worked as a tester for different applications within the same platform under the organization's Carbon program, developing new skills and effectively performing all technical roles involved in the complete lifecycle of a project.
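
The sketches below illustrate a few of the patterns in these responsibilities; they are minimal examples with hypothetical names rather than production code. This first one shows an API Gateway proxy-style Lambda handler that validates the request body against a JSON schema before any data is fetched; the schema fields are assumptions, and the jsonschema package is assumed to be bundled with the function.

    # Minimal sketch: validate an API Gateway request body with JSON Schema in Lambda.
    import json
    import jsonschema

    # Hypothetical request schema; the real schema-level rules would live here.
    REQUEST_SCHEMA = {
        "type": "object",
        "properties": {
            "trade_id": {"type": "string"},
            "business_date": {"type": "string"},
        },
        "required": ["trade_id", "business_date"],
    }

    def handler(event, context):
        try:
            body = json.loads(event.get("body") or "{}")
            jsonschema.validate(instance=body, schema=REQUEST_SCHEMA)
        except (ValueError, jsonschema.ValidationError) as exc:
            # Invalid JSON or a schema violation returns a 400 with the reason.
            return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}

        # ... fetch the requested records from the backing database here ...
        return {"statusCode": 200, "body": json.dumps({"trade_id": body["trade_id"]})}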
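
This sketch shows an S3-triggered Lambda that submits a Spark step to a running EMR cluster, in the spirit of the Glue/Step Functions/Lambda automation above; the cluster id, bucket and script path are placeholders.

    # Minimal sketch: an S3 ObjectCreated event triggers a Spark step on an existing EMR cluster.
    import boto3

    emr = boto3.client("emr")

    def handler(event, context):
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        emr.add_job_flow_steps(
            JobFlowId="j-XXXXXXXXXXXX",            # placeholder cluster id
            Steps=[{
                "Name": f"process {key}",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",   # standard EMR command runner
                    "Args": ["spark-submit",
                             "s3://example-bucket/scripts/transform.py",  # placeholder script
                             f"s3://{bucket}/{key}"],
                },
            }],
        )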
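
This sketch shows a broadcast join between a large fact table and a small reference table, followed by a partitioned write, as described in the Spark tuning bullet above; table and column names are hypothetical.

    # Minimal sketch: broadcast the small reference table to avoid shuffling the large fact table.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    facts = spark.table("finance.daily_balances")      # large fact table (hypothetical)
    ref = spark.table("reference.ledger_accounts")     # small reference table (hypothetical)

    enriched = facts.join(broadcast(ref), on="account_id", how="left")

    # Repartition by the write key so output files per partition stay evenly sized.
    (enriched.repartition("business_date")
             .write.mode("overwrite")
             .partitionBy("business_date")
             .saveAsTable("finance.enriched_balances"))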
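
This sketch loads a staged file from a Snowflake table stage with COPY INTO; it runs the same statements that would be issued through SnowSQL, but via the Snowflake Python connector, and the account, credentials, file and table names are placeholders.

    # Minimal sketch: stage a local file and COPY it into a Snowflake table.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="example_account",    # placeholder
        user="etl_user",              # placeholder
        password="***",
        warehouse="ETL_WH",
        database="FINANCE",
        schema="PUBLIC",
    )
    cur = conn.cursor()

    # PUT uploads the file to the table's internal stage (@%BALANCES).
    cur.execute("PUT file:///tmp/balances.csv @%BALANCES")

    # COPY INTO loads it from the internal stage into the target table.
    cur.execute("""
        COPY INTO BALANCES
        FROM @%BALANCES
        FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1)
    """)

    cur.close()
    conn.close()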

Environment: Cloudera Data Platform (S.8), AWS, Apache Spark, Hive, Impala, HDFS, MapReduce/YARN, Sqoop, Kafka, HBase, Zookeeper, Control-M, Nolio, Tableau (Desktop and Server), Qlik, Snowflake, Oracle PL/SQL, IntelliJ, Jupyter, Python, SQL and Scala.

Client: NFC Solutions, IL.

Role: Hadoop Developer January 2017 - September 2017

Responsibilities:

●Involved in the development and design of a POC using an RFID-based attendance management system.

●Developed scripts in RStudio and Scala to analyze the data received by the RFID receivers.

●Designed sample MCM layout for AWR Microwave Office.

●Assisted design engineers in testing and tuning the RFID circuits.

●Worked with the AWS Glue ETL tool, using AWS Glue Crawlers, the AWS Glue Data Catalog and Connections.

●Performed RF measurements such as S-parameters, power sweeps (AM-AM and AM-PM) and noise figure using a VNA, and documented the data.

●Developed ETL data pipelines using PySpark on AWS EMR; also configured EMR clusters on AWS.

●Hands-on experience in soldering SMD components on PCB and MCM.

●Developed a Python module to read Excel sheets, convert them into DataFrames using pandas, and store them as Parquet files (see the sketch following these responsibilities).

●Created EMR clusters and AWS SNS topics, subscribed AWS Lambda to them to generate alerts when data reaches the lake, and polled messages from S3 events using AWS SQS.

●Worked in a Hadoop ecosystem to build transformations using PySpark.

●Responsible for developing and maintaining the visualization reports using Tableau.

●Performed sourcing of data from source systems to HDFS and imposing Hive Schema on the data to establish relations.

●Working with multiple individuals across different stages of the SDLC to coordinate, prioritize, schedule and track project goals.
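
The sketch below illustrates the Excel-to-Parquet conversion mentioned above; file paths and the helper name are hypothetical, reading .xlsx requires openpyxl, and writing Parquet requires pyarrow or fastparquet.

    # Minimal sketch: read every sheet of a workbook with pandas and store each as Parquet.
    import pandas as pd

    def excel_to_parquet(xlsx_path: str, out_dir: str) -> None:
        # sheet_name=None returns a dict of sheet name -> DataFrame.
        sheets = pd.read_excel(xlsx_path, sheet_name=None)
        for name, df in sheets.items():
            df.to_parquet(f"{out_dir}/{name}.parquet", index=False)

    excel_to_parquet("/data/input/attendance.xlsx", "/data/output")   # placeholder paths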

Environment: AWR Microwave Office, Vector Network Analyzer, AWS, Cloudera Hadoop distribution, RStudio, Scala, SQL, Python, Hive, Sqoop, MapReduce/YARN, HDFS, IntelliJ, Tableau.

Client: USF Health, FL.

Role: Data Engineer/Hadoop Developer August 2015 - December 2016

Responsibilities:

●Worked on Hadoop Ecosystem components like HDFS, YARN, Impala, Spark, Hive, Sqoop, Zookeeper with Cloudera distribution model.

●Responsible for importing data from various data sources, performing transformations using Impala and writing data into HDFS.

●Developed an automated test framework using Scala and Spark to test large data sets (see the sketch following these responsibilities).

●Good experience with Hive partitioning and bucketing concepts, performing various types of joins effectively for optimal performance, and implementing Hive SerDes for JSON, Parquet and Avro formats while writing data into HDFS.

●Responsible for writing Sqoop jobs to read data from Oracle DB and write into HDFS.

●Developed shell scripts to run Spark jobs and Sqoop import/export jobs, and leveraged Control-M to automate the jobs.

●Leveraged Spark over Cloudera Hadoop YARN to perform analytics on data in Hive.

●Developed Scala scripts and UDFs using DataFrames, Datasets, SQL and RDDs in Spark for data aggregation and queries.

●Worked on developing various health care metrics used by clients to analyze trends of patients.

●Packaging and building final components to deploy the scripts onto gateway nodes through automated processes such as Nolio and SPUDS.
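
The sketch below illustrates the kind of automated checks the test framework ran over large data sets; it is written in PySpark rather than the original Scala, and the table, column and function names are hypothetical.

    # Minimal sketch: simple reconciliation checks between a staging table and its target.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    def check_row_counts(source_table: str, target_table: str) -> None:
        # Fail if the transformation dropped or duplicated rows.
        src = spark.table(source_table).count()
        tgt = spark.table(target_table).count()
        assert src == tgt, f"row count mismatch: {source_table}={src}, {target_table}={tgt}"

    def check_no_null_keys(table: str, key_col: str) -> None:
        # Fail if any record is missing its business key.
        nulls = spark.table(table).filter(F.col(key_col).isNull()).count()
        assert nulls == 0, f"{nulls} null keys in {table}.{key_col}"

    check_row_counts("staging.patient_events", "marts.patient_events")   # placeholder tables
    check_no_null_keys("marts.patient_events", "patient_id")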

Environment: Cloudera Distribution of Hadoop (CDH), Apache Spark, Hive, Impala, HDFS, MapReduce/YARN, Sqoop, Zookeeper, Control-M, Nolio, Tableau (Desktop and Server), IntelliJ, SQL, Scala.

Client: Niantic Inc, Hyderabad.

Role: Jr. SQL Developer August 2014 - May 2015

Responsibilities:

●Involved in the complete software development lifecycle (SDLC).

●Coordinated with business users and analysts to gather requirements and create technical documents.

●Formulated tools and procedures/processes to load data received from upstream source systems via flat files into staging tables.

●Worked on creating database objects such as tables, stored procedures, views, joins, triggers and functions.

●Involved in normalization and de-normalization of existing tables for faster query retrievals.

●Created/modified complex SQL joins, sub-queries and procedures to analyze data and implement business rules.

●Created AFTER and INSTEAD OF triggers to enforce data consistency and integrity.

●Developed SSIS packages to ingest, analyze, transform and load data into final tables from various source systems.

Environment: SQL Server 2005/2008, SQL Query Analyzer, SQL Profiler, Oracle 9i/10g/11g, PL/SQL, Enterprise Manager, SSIS, MS Access, MS Excel.

EDUCATION:

●MS from University of South Florida, Tampa, FL.

●Bachelor of Technology from Jawaharlal Nehru Technological University, Hyderabad, India.


