
Hadoop Developer Data Analyst

Location:
New York, NY
Posted:
September 02, 2022


Resume:

Professional Summary:

*+ years of professional experience in the software industry, spanning data engineering and big data design, development, and deployment.

Experienced with cloud services such as AWS, Azure, and GCP.

Good experience with Microsoft Azure services such as Databricks, Azure Data Factory, ADLS (Gen1 & Gen2), Azure Key Vault, Event Hubs, Cosmos DB, and HDInsight.

Performed AWS cloud administration, managing EC2 instances and S3, SES, and SNS services.

Proactively monitored resources and applications using AWS CloudWatch, including creating alarms to monitor metrics for services such as EC2, S3, Glue, and Kinesis.

Experienced in migrating databases and legacy applications to the AWS cloud ecosystem using services such as VPC, EC2, S3, Glue, Kinesis, Lambda, Athena, Step Functions, EMR, SageMaker, and RDS for compute, big data analytics, and storage.

Worked on Google Cloud Platform (GCP) with Kubernetes, Dataflow, Pub/Sub, and BigQuery.

Good knowledge of cloud computing on Google Cloud Platform with technologies such as Dataflow, Pub/Sub, BigQuery, and related tools.

Good knowledge of designing and implementing star schemas in BigQuery.

Expertise with tools in the Hadoop ecosystem, including Hive, HDFS, MapReduce, Sqoop, Spark, Infoworks, YARN, Oozie, and ZooKeeper.

Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.

Experience in migrating data between HDFS and relational database systems using Sqoop.

Experienced in writing complex Spark programs that work with different file formats such as text, SequenceFile, XML, Parquet, and Avro.

Experience in designing and developing applications in Spark using Scala to compare the performance of Spark with Hive and SQL.

Experience working with NoSQL databases such as HBase.

Experience building applications using Spark Core, Spark SQL, DataFrames, and Spark Streaming.

Hands-on experience setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.

Education Details:

Master's from TAMUK University in 2016

Bachelor's from JNTU in 2014

Technical Skills:

Cloud Services: Microsoft Azure, AWS, GCP

Hadoop Ecosystem: HDFS, MapReduce, Hive, Sqoop, Oozie, Spark

Distributed Platforms: Cloudera, Hortonworks, HDInsight

Languages: C, Scala, SQL, Shell scripting, Python

Hadoop/Big Data Technologies: HDFS, MapReduce, Sqoop, Hive, Oozie, Spark, YARN, Kafka, ZooKeeper, Infoworks

Version Control: GitHub, Azure Repos

Build & Deployment Tools: Maven, Ant

Databases: Oracle, MS SQL Server, MySQL, HBase

Project Experience:

Client: AT&T - Plano, TX.

Role: Sr. Data Engineer

Duration: Feb 2021 – Present

Responsibilities:

Migrated existing on-premises applications to Azure Databricks and developed pipelines for new applications.

Developed Spark code to stream data from multiple Kafka servers and process JSON data per business requirements.
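
For illustration, a minimal PySpark Structured Streaming sketch of this kind of job; the broker addresses, topic name, schema, and output paths are hypothetical placeholders, not details from the actual project.

```python
# Minimal sketch: stream JSON records from Kafka with Spark Structured Streaming.
# Brokers, topic, schema, and paths below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("kafka-json-stream").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", LongType()),
    StructField("payload", StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder brokers
       .option("subscribe", "events_topic")                             # placeholder topic
       .option("startingOffsets", "latest")
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
             .select(from_json(col("json"), schema).alias("data"))
             .select("data.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "/mnt/datalake/events")              # placeholder sink path
         .option("checkpointLocation", "/mnt/checkpoints/events")
         .start())
query.awaitTermination()
```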

Worked on streaming data from Event Hubs.

Developed multiple reusable functions in Python and Scala, including custom logger functionality the team can use to emit logs in a specified format.

Implemented the AT&T-specified Databricks PostProcess step for all applications.

Tested code, performed peer reviews, validated data in the dev environment, and pushed code to prod using Code Cloud (repository).

Implemented optimization techniques while migrating applications to reduce the runtime of the former on-premises jobs.

Worked with Hive managed and external tables, partitioning and bucketing the data when storing it in ADLS Gen2.
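
A minimal sketch of that pattern, assuming an existing SparkSession `spark` and DataFrame `df`; the ADLS Gen2 path, database, table, and column names are placeholders rather than project specifics.

```python
# Minimal sketch: write a DataFrame to ADLS Gen2 partitioned by a date column,
# then expose it as a Hive external table. All names and paths are placeholders.
adls_path = "abfss://curated@mystorageacct.dfs.core.windows.net/sales"  # placeholder

(df.write
   .mode("overwrite")
   .partitionBy("load_date")          # placeholder partition column
   .format("parquet")
   .save(adls_path))

spark.sql(f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS curated.sales (
        order_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS PARQUET
    LOCATION '{adls_path}'
""")
spark.sql("MSCK REPAIR TABLE curated.sales")  # register partitions already on storage
```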

Validated record counts, data, and the sequence numbers of processed files for all applications to ensure data quality.

Documented detailed analysis of existing code, schedules, and validation reports before handing applications off to the Ops team.

Created Databricks jobs and scheduled a Spark job to extract data from files in ADLS Gen2.

Created Cosmos DB with the MongoDB API.

Worked on a Spark job to write data from ADLS to Cosmos DB through the MongoDB API.

Worked on a streaming job to extract data from Kafka using Spark Streaming.

Installed MongoDB drivers in Databricks and used inbound and outbound configuration to establish the connection between Databricks and Cosmos DB.
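
A minimal sketch of the ADLS-to-Cosmos DB write, assuming the MongoDB Spark connector (3.x-style options) is installed on the cluster; the connection URI, storage path, database, and collection names are placeholders.

```python
# Minimal sketch: read Parquet data from ADLS Gen2 and write it to Cosmos DB
# (MongoDB API) via the MongoDB Spark connector. All names are placeholders.
cosmos_uri = "mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/?ssl=true"

df = spark.read.parquet("abfss://curated@mystorageacct.dfs.core.windows.net/customers")

(df.write
   .format("mongo")                    # "mongodb" for the 10.x connector
   .mode("append")
   .option("uri", cosmos_uri)
   .option("database", "customers_db")
   .option("collection", "customers")
   .save())
```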

Environment: Databricks, Apache Spark, Python, Scala, Azure, Kafka, ADLS Gen2, Cosmos DB, MongoDB, ADF, NiFi.

Client: Microsoft, WA.

Role: Big Data / Hadoop Developer

Duration: June 2020 – Feb 2021

Responsibilities:

Evaluated business requirements and prepared detailed specifications, following project guidelines, for the programs to be developed.

Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.

Used Zeppelin and Jupyter notebooks to develop, test, and analyze Spark jobs before scheduling customized Spark jobs.

Undertook data analysis and collaborated with the downstream analytics team to shape the data according to their requirements.

Worked on the end-to-end pipeline of Msaas data from Event Hubs to ADLS.

Developed reusable scripts to manipulate the schema of service desk tickets.

Analyzed existing pipelines that were running slowly and optimized them.

Wrote reusable scripts for installing Event Hubs libraries on new automated clusters for our scheduled jobs; responsible for ingesting data from Blob storage to Kusto and maintaining the PPE and PROD pipelines.

Created HDInsight clusters and storage accounts with an end-to-end environment for running the jobs.

Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the Cosmos activity.

Worked on optimizing our scheduled jobs in both batch and streaming pipelines.

Used Spark SQL to process large volumes of data and implemented Spark RDD and DataFrame transformations.
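
For context, a small sketch of the DataFrame and Spark SQL styles used for such transformations; the table and column names are illustrative only and assume an existing SparkSession `spark`.

```python
# Minimal sketch: the same aggregation written with the DataFrame API and Spark SQL.
# Table and column names are illustrative placeholders.
from pyspark.sql import functions as F

tickets = spark.table("servicedesk.tickets")

# DataFrame API: daily closed-ticket counts by severity
daily_counts = (tickets
                .filter(F.col("status") == "CLOSED")
                .groupBy("severity", F.to_date("closed_at").alias("closed_day"))
                .agg(F.count("*").alias("ticket_count")))

# Same logic expressed in Spark SQL
tickets.createOrReplaceTempView("tickets_v")
daily_counts_sql = spark.sql("""
    SELECT severity, to_date(closed_at) AS closed_day, COUNT(*) AS ticket_count
    FROM tickets_v
    WHERE status = 'CLOSED'
    GROUP BY severity, to_date(closed_at)
""")
```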

Worked on schema mapping between historical and incremental data.

Worked on merging data from multiple storage accounts in ADLS.

Worked with file formats such as Parquet, ORC, and Delta.

Worked closely with our business and data science teams for requirement gathering.

Used Spark with Scala to transform data from service desk tickets.

Merged historical data (without struct format) and incremental data (in struct format) stored in different locations.

Supported failed jobs in the production environment.

Environment: Databricks, Apache Spark, Scala, Python, Kafka, ADLS Gen1 & Gen2, Azure, Event Hubs.

Client: PepsiCo, Plano, TX.

Role: Big Data / Hadoop Developer

Duration: July 2019 – June 2020

Responsibilities:

Worked on a POC to test the capabilities and limitations of Infoworks.

Created source configurations to establish connections to sources such as Teradata, SQL Server, mainframe, and Oracle.

Transformed data using Infoworks transformations, which run on the Spark execution engine.

Performed batch and real-time processing of data using Hadoop components such as Hive and Spark.

Used Spark Streaming with PySpark to process streaming data and analyze continuous datasets.

Used Apache Airflow to author workflows as directed acyclic graphs (DAGs) in Python and to visualize and manage batch and API-driven real-time data pipelines running in production.
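
A minimal Airflow DAG sketch of this kind of workflow, using Airflow 2-style imports; the DAG id, schedule, and task logic are placeholders rather than the production pipeline.

```python
# Minimal sketch of a daily Airflow DAG; ids, schedule, and task logic are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_load(**context):
    # Placeholder for the actual ingestion/transformation logic.
    print("running batch pipeline for", context["ds"])


with DAG(
    dag_id="daily_batch_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    run_pipeline = PythonOperator(
        task_id="extract_and_load",
        python_callable=extract_and_load,
    )
```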

Integrated new big data management technologies with Spark (Scala), Hadoop, and software engineering tools when designing the data architecture and data integration layers, with scheduling handled by Airflow scripts written in Python.

Worked with a variety of relational databases (RDBMS) such as PostgreSQL, MS SQL, MySQL, and Netezza, as well as NoSQL MongoDB.

Performed data modeling for the ETL process, consisting of data sourcing, transformation, mapping, conversion, and loading; developed complex mappings.

Used automated deployment tools such as Docker and Kubernetes for containerization, combining them with the workflows to keep them lightweight.

Created and maintained reporting infrastructure to facilitate visual representation of manufacturing data by executing Lambda across S3, Redshift, RDS, and MongoDB ecosystems.

Constructed product-usage SDK data and Siebel data aggregations using Python 3, AWS EMR, PySpark, Scala, Spark SQL, and Hive context in partitioned Hive external tables maintained in an AWS S3 location for reporting, data science dashboarding, and ad-hoc analyses.

Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design.

Experience in managing the MongoDB environment from availability, performance, and scalability perspectives.

Processed membership claims based on Epic source data using Type 1 and Type 2 functionality for the customer Unity Health Insurance.

Worked in an Agile methodology; interacted directly with the entire team, gave and took feedback on design, suggested and implemented optimal solutions, and tailored the application to meet business requirements while following standards.

Environment: Python, Apache Airflow, PySpark, Hadoop, MongoDB, MapReduce, PostgreSQL, HDFS, Sqoop, Oozie, WinSCP, UNIX Shell Scripting, Hive, Impala, Cloudera (Hadoop distribution), AWS, Docker, ETL, JIRA, etc.

Client: Liberty Mutual, Dover, NH.

Role: Big Data / Hadoop Developer

Duration: Feb 2017 – July 2019

Responsibilities:

Involved in the complete big data flow of the application, from ingesting data from upstream systems into HDFS to processing and analyzing the data in HDFS.

Developed Spark jobs to import data into HDFS from MySQL, SQL Server, and Oracle, and created Hive tables.

Developed Sqoop jobs to import data in Avro format from an Oracle database and created Hive tables on top of it.

Created partitioned and bucketed Hive tables in Parquet format with Snappy compression, then loaded data into them.
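
A minimal sketch of such a table definition, issued here through spark.sql (the same DDL could run in Hive directly); the database, table, columns, and bucket count are placeholders.

```python
# Minimal sketch: partitioned, bucketed Hive table stored as Parquet with Snappy.
# Database, table, column names, and bucket count are illustrative placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.orders (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
""")
```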

Involved in running all Hive scripts through Hive, Hive on Spark, and Spark SQL.

Involved in performance tuning of Hive from design, storage, and query perspectives.

Used security groups, network ACLs, internet gateways, and route tables to ensure a secure zone for the organization in the AWS public cloud.

Wrote Python scripts to run file system commands and make REST API calls.
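
A small sketch of that style of script; the HDFS path and API endpoint are hypothetical placeholders.

```python
# Minimal sketch: run a file system command and make a REST API call from Python.
# The HDFS path and API endpoint are illustrative placeholders.
import subprocess

import requests

# File system command: list files in an HDFS directory
result = subprocess.run(
    ["hdfs", "dfs", "-ls", "/data/incoming"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)

# REST API call: fetch job status from an internal service
resp = requests.get("https://example.internal/api/jobs/123/status", timeout=30)
resp.raise_for_status()
print(resp.json())
```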

Installed and configured Apache Airflow for workflow management and created workflows in Python.

Developed Spark programs using the Scala and Python APIs to compare the performance of Spark with Hive and SQL.

Developed a Flume ETL job to handle data from an HTTP source with HDFS as the sink.

Collected JSON data from the HTTP source and developed Spark jobs that perform inserts and updates in Hive tables.

Configured and implemented AWS tools such as CloudWatch and CloudTrail, and directed system logs to them for monitoring.

Developed Spark scripts to import large files from Amazon S3 buckets.
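
A minimal PySpark sketch of such an import; the bucket name, prefix, and file layout are placeholders, and credentials are assumed to come from the cluster or instance profile rather than code.

```python
# Minimal sketch: read large CSV files from S3 with Spark and persist them as Parquet.
# Bucket names and prefixes are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-import").getOrCreate()

events = (spark.read
          .option("header", "true")
          .csv("s3a://my-data-bucket/raw/events/*.csv"))   # placeholder bucket/prefix

events.write.mode("overwrite").parquet("s3a://my-data-bucket/curated/events/")
```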

Wrote Terraform scripts for AWS services such as S3, Glue, EC2, EMR, IAM, Lambda, and Redshift.

Developed Spark Core and Spark SQL scripts using Scala for faster data processing.

Developed Kafka consumers in Scala for consuming data from Kafka topics.

Experience using AWS technologies such as S3, EC2, CLI, EMR, Glue, Athena, Redshift, Lambda, Kinesis Firehose, CloudWatch, CloudFront, CloudTrail, and CloudFormation.

Involved in designing and developing HBase tables and storing aggregated data from Hive tables.

Integrated Hive tables with Cognos to generate and publish reports.

Developed shell scripts for running Hive scripts in Hive and Impala.

Orchestrated a number of Sqoop and Hive scripts using Oozie workflows and scheduled them with the Oozie coordinator.

Used UNIX shell scripts in crontab to automate tasks.

Used Jira for bug tracking and SVN to check in and check out code changes.

Continuously monitored and managed the Hadoop cluster through Cloudera Manager.

Environment: HDFS, YARN, MapReduce, Hive, Sqoop, Flume, Oozie, AWS, HBase, Kinesis, Kafka, Spark SQL, Spark Streaming, Eclipse, Oracle, Teradata, PL/SQL, UNIX Shell Scripting, Cloudera.

Client: iConcept Software Services, India

Role: Data Analyst

Duration: Jun 2013 - July 2015

Responsibilities:

Worked with various complex SQL queries involving joins, subqueries, and nested queries.

Coded complex SQL queries to retrieve data from the database as needed.

Used set operators in PL/SQL such as UNION, UNION ALL, INTERSECT, and MINUS.

Built various graphs for business decision-making using the Python matplotlib library.
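
A small sketch of this kind of chart; the data values and labels are illustrative placeholders rather than actual business figures.

```python
# Minimal sketch: a simple bar chart for business reporting with matplotlib.
# The months and revenue figures are illustrative placeholders.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120000, 135000, 128000, 150000]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(months, revenue, color="steelblue")
ax.set_title("Monthly Revenue")
ax.set_ylabel("Revenue (USD)")
fig.tight_layout()
fig.savefig("monthly_revenue.png")
```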

Updated and manipulated content and files using Python scripts.

Released the monthly reports through UNIX shell scripting, based on requirements.

Created and executed SQL queries in TOAD to perform data integrity testing on a Teradata database, validating and testing the data.

Worked with the data architects' team to make appropriate changes to the data models.

Created UNIX scripts for file transfer and file manipulation.

Generated ad-hoc and management-specific reports using Tableau and Excel.

Analyzed subscriber, provider, member, and claim data to continuously scan for and create authoritative master data.

Proficient with the Tableau framework for creating filters, histograms, parameters, and quick filters.

Prepared the data rules spreadsheet in MS Excel used to update allowed values, findings, and profiling results.

Built complex formulas in Tableau for various business calculations.

Developed geo/area maps in Tableau to show which states have more hospitalized patients.

Created bar charts compiled from data sets and added trend lines and forecasts of future financial performance.

Compiled interactive dashboards in Tableau Desktop and published them to Tableau Server with quick filters, making needed information available on demand with a click.

Environment: SQL, UNIX, MS SQL Server, MS Access, Excel, Tableau


