
Hadoop Developer Data Analyst

New York, NY
January 20, 2022



Professional Summary:

Around * years of professional experience in the software industry, comprising Big Data/Hadoop design, development, and deployment.

Expertise with tools in the Hadoop ecosystem, including Hive, HDFS, MapReduce, Sqoop, Spark, Infoworks, YARN, Oozie, and ZooKeeper.

Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.

Experience migrating data with Sqoop from HDFS to relational database systems and vice versa.

Hands-on experience working with Kusto.

Worked with Microsoft’s internal tools such as Cosmos, Kusto, and iScope, which are used for efficient ETL operations.

Performed AWS Cloud administration managing EC2 instances, S3, SES and SNS services.

Proactively monitored resources and applications using AWS CloudWatch, including creating alarms on metrics for services such as EC2, S3, Glue, and Kinesis.

Experienced in migrating database and legacy applications to the AWS cloud ecosystem using services such as VPC, EC2, S3, Glue, Kinesis, Lambda, EMR, and RDS for compute, big data analytics, and storage.

Created IAM roles, defined policies, and applied them to AWS services.

Experienced in writing complex Spark programs that work with different file formats such as Text, Sequence, XML, Parquet, and Avro.

Experience in designing and developing applications in Spark using Scala to compare the performance of Spark with Hive and SQL.

Experience working with NoSQL databases such as HBase.

Experience building applications using Spark Core, Spark SQL, DataFrames, and Spark Streaming.

Hands-on experience with Scala language features: language fundamentals, classes, objects, traits, collections, case classes, higher-order functions, pattern matching, extractors, etc.

Experience with the Oozie workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.

Exported and imported data to and from Amazon Simple Storage Service (S3).

Good knowledge of importing volumes, launching EC2 and RDS instances, and creating security groups, auto-scaling, and load balancers (ELBs) within the defined virtual private cloud (VPC).

Experience in database design using PL/SQL to write Stored Procedures, Functions, Triggers and strong experience in writing complex queries.

Experience with Data flow diagrams, Data dictionary, Database normalization techniques, Entity relation modeling and design techniques.

Expert in developing T-SQL (DDL, DML) statements, enforcing data integrity (constraints, rules, and validation), writing complicated joins and cursor loops, creating indexes for fast data retrieval, and creating views, triggers, stored procedures, and functions.

Good knowledge of AWS infrastructure services, including Amazon Simple Storage Service (Amazon S3), EMR, and Amazon Elastic Compute Cloud (Amazon EC2).

Technical Skills:

Hadoop Ecosystem: HDFS, MapReduce, Hive, Sqoop, Oozie, Spark

Distributed Platforms: Cloudera, Hortonworks, and HDInsight

Languages: C, Scala, SQL, shell script, Python

Hadoop/Big Data Technologies: HDFS, MapReduce, Sqoop, Hive, Oozie, Spark, YARN, Kafka, ZooKeeper, Cloudera Manager, Infoworks

Version Control: GitHub, Azure

Build & Deployment Tools: Maven, Ant

Databases: Oracle, MS SQL Server, MySQL, HBase

Project Experience:

Client: ATT - Plano, TX.

Role: Big Data / Hadoop Developer

Duration: Feb 2021 – Present


ATT provides multiple services and divides its data streams into applications based on the service provided. Our team creates data flow pipelines for new ATT applications and migrates existing on-prem applications to Databricks.


Tested NiFi, ADF, and AzCopy for moving on-prem files to Azure.

Scheduled NiFi jobs to copy files to multiple directories, triggered by files landing in on-prem directories.

Migrated existing on-prem applications to Azure Databricks and developed pipelines for new applications.

Developed Spark code to stream data from multiple Kafka servers and process JSON data per business requirements.

Developed multiple reusable functions in Python and Scala. Developed custom logger functionality that the team can use to produce logs in a specified format.

Implemented Databricks PostProcess (ATT Specified) for all applications.

Tested code, performed peer reviews, validated data in the dev environment, and pushed code to prod using Code Cloud (repository).

Implemented optimization techniques during application migration to reduce the time taken by the on-prem jobs.

Worked with Hive managed and external tables, and with partitioning and bucketing of data when storing data in ADLS Gen2.

Validated counts, data, and sequence numbers of processed files for all applications to ensure data quality.

Documented detailed analysis of existing code, schedules, and validation reports before handing off applications to the OPS team.

Set up Databricks and scheduled a Spark job to extract data from files in ADLS Gen2.

Created Cosmos DB with Mongo API.

Worked on a Spark job to write data from ADLS to Cosmos DB with Mongo API.

Worked on a streaming job to extract data from Kafka using Spark Streaming.

Installed Mongo drivers in Databricks and used inbound and outbound configuration to establish a connection between Databricks and Cosmos DB.

Environment: Databricks, Apache Spark, Python, Scala, Azure, Kafka, ADLS Gen2, Cosmos DB, MongoDB, ADF, NiFi.

Client: Microsoft, WA.

Role: Big Data / Hadoop Developer

Duration: April 2020 – Feb 2021


We receive data on supplier services through Event Hub and files, which are processed and stored in Azure Data Lake for the data science team to analyze.


Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.

Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and the write-back tool.

Used Zeppelin and Jupyter notebooks to develop, test, and analyze Spark jobs before scheduling customized Spark jobs.

Undertook data analysis and collaborated with the downstream analytics team to shape the data according to their requirements.

Worked on the end-to-end pipeline for Msaas data from EventHub to ADLS.

Developed reusable scripts to manipulate schema of service desk tickets.

Analyzed and optimized slow-running existing pipelines.

Developed reusable scripts for installing EventHub libraries on new automated clusters for our scheduled jobs.

Responsible for ingesting data from Blob to Kusto and maintaining the PPE and PROD pipelines.

Expertise in creating HDInsight clusters and storage accounts with an end-to-end environment for running jobs.

Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the Cosmos activity.

Worked on optimizing our scheduled jobs in both batch and streaming pipelines.

Used Spark SQL to process large amounts of data and implemented Spark RDD and DataFrame transformations.

Worked on schema mapping between historical and incremental data.

Worked on merging data from multiple storages in ADLS.

Worked with file formats such as Parquet, ORC, and Delta.

Worked closely with our business and data science teams for requirements gathering.

Used Spark-Scala to transform data from service desk tickets.

Merged historical data (without struct format) and incremental data (in struct format) stored in different locations.

Supported failed jobs in the production environment.

Environment: Databricks, Apache Spark, Scala, Python, Kafka, ADLS Gen1 & Gen2, Azure, EventHubs.

Client: PepsiCo, Plano, TX.

Role: Big Data / Hadoop Developer

Duration: July 2019 – June 2020


Data for Frito-Lay North America is received from multiple sources. Our team pulls data from these sources, processes it as required, and stores it in Azure Data Lake using a tool called Infoworks.


Worked on a POC to test the capabilities and limitations of Infoworks.

Created source configurations to establish connections to sources such as Teradata, SQL Server, Mainframe, and Oracle.

Transformed data using Infoworks transformations, which work with the Spark execution engine.

Created and scheduled workflows to trigger the ingestion process and move data to the Data Lake.

Used Spark SQL to process large amounts of data and implemented Spark RDD and DataFrame transformations and actions to migrate MapReduce algorithms.

Developed code using Scala and Spark SQL for faster testing and data processing.

Trained interns and junior developers on the newly implemented tool (Infoworks).

Scheduled dependent jobs using Control-M.

Worked on the inbound and outbound connection between Databricks and Cosmos to migrate data from on-prem MongoDB to Azure Cosmos DB with Mongo API using Databricks.

Developed scripts to clean files in the pre-ingestion process.

Developed generic scripts to import data from any structured source.

Implemented an alert mechanism in FLNA-1.5 to keep track of all jobs by gathering details of workflow variables (start time, end time, duration, etc.).

Imported and exported data into HDFS and Hive using Sqoop and Infoworks.

Created Azure Data Factories to copy data from Blob storage to Data Lake.

Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext and Spark SQL.

Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.

Worked on a POC to transform data from Teradata and export it to an S3 bucket.

Participated in stand-up meetings, status calls, and business owner meetings with stakeholders and risk management teams in an agile environment.

Supported code/design analysis, strategy development and project planning.

Environment: Hadoop, HDInsight, Apache Hive, Sqoop, Apache Spark, Shell Scripting, Azure.

Client: Liberty Mutual, Dover, NH.

Role: Big Data / Hadoop Developer

Duration: Feb 2017 – July 2019


We receive customer insurance policy and claims data, process it based on policy type, and deliver it to the RGM team to create dashboards.


Involved in the complete Big Data flow of the application, from ingesting upstream data into HDFS to processing and analyzing the data in HDFS.

Developed Spark code to import data into HDFS from MySQL, SQL Server, and Oracle, and created Hive tables.

Developed Sqoop jobs to import data in Avro file format from Oracle databases and created Hive tables on top of it.

Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression and loaded data into them.

Ran all Hive scripts through Hive, Hive on Spark, and Spark SQL.

Involved in performance tuning of Hive from design, storage and query perspectives.

Used security groups, network ACLs, internet gateways, and route tables to ensure a secure zone for the organization in the AWS public cloud.

Wrote Python scripts to run file system commands and make REST API calls.

Installed and configured Apache Airflow for workflow management and created workflows in Python.

Developed Spark programs using the Scala and Python APIs to compare the performance of Spark with Hive and SQL.

Developed a Flume ETL job to handle data from an HTTP source with HDFS as the sink.

Collected JSON data from the HTTP source and developed Spark APIs that perform inserts and updates in Hive tables.

Configured and implemented AWS tools such as CloudWatch and CloudTrail and directed system logs for monitoring.

Developed Spark scripts to import large files from Amazon S3 buckets.

Wrote Terraform scripts for AWS services such as S3, Glue, EC2, EMR, IAM, Lambda, and Redshift.

Developed Spark core and Spark SQL scripts using Scala for faster data processing.

Developed Kafka consumers in Scala for consuming data from Kafka topics.

Experience using the AWS technologies, such as S3, EC2, CLI, EMR, Glue, Athena, Redshift, Lambda, Kinesis Firehose, CloudWatch, CloudFront, CloudTrail and CloudFormation.

Involved in designing and developing tables in HBase and storing aggregated data from Hive tables.

Integrated Hive tables with Cognos to generate and publish reports.

Developed shell scripts for running Hive scripts in Hive and Impala.

Orchestrated a number of Sqoop and Hive scripts using Oozie workflows and scheduled them using the Oozie coordinator.

Used Unix shell scripts in crontab to automate tasks.

Used Jira for bug tracking and SVN to check-in and checkout code changes.

Continuously monitored and managed the Hadoop cluster through Cloudera Manager.


Environment: HDFS, YARN, MapReduce, Hive, Sqoop, Flume, Oozie, AWS, HBase, Kinesis, Kafka, Spark SQL, Spark Streaming, Eclipse, Oracle, Teradata, PL/SQL, UNIX Shell Scripting, Cloudera.

Client: iConcept software Services, India

Role: Data Analyst

Duration: Jan 2014 – July 2015


Wrote various complex SQL queries involving joins, subqueries, and nested queries.

Coded complex SQL queries to retrieve data from the database depending on the need.

Used set operators in PL/SQL such as UNION, UNION ALL, INTERSECT, and MINUS.

Built various graphs for business decision-making using the Python matplotlib library.

Updated and manipulated content and files using Python scripts.

Released reports monthly through UNIX shell scripting, based on requirements.

Created and executed SQL queries to perform data integrity testing on a Teradata database, validating and testing data using TOAD.

Worked with the data architects’ team to make appropriate changes to the data models.

Experience in creating UNIX scripts for file transfer and file manipulation.

Generated ad hoc or management-specific reports using Tableau and Excel.

Analyzed the subscriber, provider, members and claim data to continuously scan and create authoritative master data.

Proficient with the Tableau framework for creating filters, histograms, parameters, and quick filters.

Prepared the data rules spreadsheet in MS Excel, used to update allowed values, findings, and profiling results.

Built complex formulas in Tableau for various business calculations.

Developed geo/area maps in Tableau showing which states have more hospitalized patients.

Created bar charts compiled from data sets, with trend lines and forecasts of future financial performance trends.

Compiled interactive dashboards in Tableau Desktop and published them to Tableau Server with quick filters for on-demand information at the click of a button.

Environment: SQL, UNIX, MS SQL Server, MS Access, Excel, Tableau

Education Details

Master’s degree from TAMUK University, 2016

Bachelor’s degree from JNTU, 2014
