CR CHARAN
Senior Hadoop/Big Data Engineer
**********@*****.***
Professional Summary:
●Over 8 years of professional experience in software development, analysis, design, testing, documentation, deployment, and integration using Java, Python, SQL, and Big Data technologies, as well as AWS and Azure cloud services.
●Expertise in major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Pig, Sqoop, HBase, Spark, Spark SQL, Spark Streaming, Kafka, Oozie, and Ambari.
●Experience in architecting, designing, installing, configuring, and managing Apache Hadoop clusters on the Hortonworks and Cloudera distributions.
●Good understanding of Hadoop architecture and hands-on experience with components such as ResourceManager, NodeManager, NameNode, and DataNode, along with MapReduce concepts and the HDFS framework.
●Adept at configuring and installing Hadoop/Spark ecosystem components.
●Familiar with data architecture, including data ingestion pipeline design, Hadoop information architecture, data modeling, and advanced data processing.
●Expertise in writing MapReduce jobs in Java for processing large sets of structured, semi-structured, and unstructured data and ingesting it into various platforms.
●Expertise in data migration, data profiling, data ingestion, data cleansing, transformation, data import, and data export using ETL tools such as Informatica PowerCenter.
●Experience in importing and exporting data between HDFS and relational database systems using Sqoop and loading it into partitioned Hive tables.
●Worked on HQL for data extraction and join operations as required, with good experience optimizing Hive queries.
●Experience with partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
●Capable of processing large sets (Gigabytes) of structured, semi-structured or unstructured data.
●Good understanding and knowledge of NoSQL databases like HBase, MongoDB and Cassandra.
●Good knowledge of Oozie workflow scheduling and of implementing real-time stream processing.
●Developed Spark code using Scala, Python and Spark-SQL/Streaming for faster processing of data.
●Implemented Spark Streaming jobs in Scala by developing RDDs (Resilient Distributed Datasets) and used PySpark and spark-shell accordingly.
●Profound experience in creating real time data streaming solutions using Apache Spark /Spark Streaming, Kafka and Flume.
●Good knowledge of using Apache NiFi to automate the data movement between different Hadoop Systems.
●Good experience in handling messaging services using Apache Kafka.
●Excellent knowledge of job workflow scheduling and locking tools/services like Oozie and Zookeeper.
●Proficient with different databases such as Oracle, Microsoft SQL Server, Teradata, and IBM DB2.
●Experience writing advanced SQL programs for joining multiple tables, sorting data, creating SQL views, and creating indexes.
●Good understanding of Amazon Web Services (AWS) like EC2 for computing and S3 as storage mechanism and EMR, RedShift, DynamoDB.
●Experience in developing ETL data pipelines using PySpark and in creating reports and dashboards using Tableau.
●Experience in installing and setting up Hadoop environments in the cloud through Amazon Web Services (AWS) offerings such as EMR and EC2, which provide efficient processing of data.
●Good understanding and knowledge of Microsoft Azure services like HDInsight Clusters, BLOB, ADLS, Data Factory and Logic Apps.
●Strong in Data Warehousing concepts, Star schema and Snowflake schema methodologies, understanding Business process/requirements.
●Worked with various formats of files like delimited text files, JSON files, and XML files. Proficient in using columnar file formats like RC, ORC, and Parquet, with a good understanding of compression techniques used in Hadoop processing such as Gzip, Snappy, and LZO (a brief sketch follows this list).
●Used IDEs like Eclipse, IntelliJ IDE, PyCharm IDE, Notepad ++, and Visual Studio for development.
●Experienced in Software Development Lifecycle (SDLC) using SCRUM, Agile methodologies.
●Extensive experience in using version control tools like GitHub and Bitbucket.
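A minimal PySpark sketch of the file-format and compression handling referenced in this summary; the paths, delimiter, and partition column (txn_date) are hypothetical and shown for illustration only.

# Illustrative only: convert delimited text to Snappy-compressed Parquet.
# Paths, delimiter, and the txn_date column are assumed, not project specifics.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-conversion-sketch").getOrCreate()

# Read a pipe-delimited text file with a header row.
raw = (spark.read
       .option("header", True)
       .option("delimiter", "|")
       .csv("hdfs:///data/raw/transactions/"))

# Write out as columnar Parquet with Snappy compression, partitioned by date.
(raw.write
 .mode("overwrite")
 .option("compression", "snappy")
 .partitionBy("txn_date")
 .parquet("hdfs:///data/curated/transactions/"))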
Software Skill Set
Big Data Technologies / Hadoop Components: HDFS, Hue, MapReduce, YARN, Sqoop, Pig, Hive, HBase, Oozie, Kafka, Impala, Zookeeper, Flume, Cloudera Manager, Airflow
Spark: Spark SQL, Spark Streaming, DataFrames, YARN, Pair RDDs
Cloud Services: AWS (S3, EC2, EMR, Lambda, Redshift, Glue), Azure (Azure Data Factory / ETL / ELT / SSIS, Azure Data Lake Storage, Azure Databricks)
Programming Languages: SQL, PySpark, Python, Scala, Java
Databases: Oracle, MySQL, DB2, SQL Server, Teradata
NoSQL Databases: HBase, Cassandra, MongoDB
Web Technologies: HTML, JDBC, JavaScript, CSS
Version Control Tools: GitHub, Bitbucket
Server-Side Scripting: UNIX shell scripting, PowerShell scripting
IDEs: Eclipse, PyCharm, Notepad++, IntelliJ, Visual Studio
Operating Systems: Linux, Unix, Ubuntu, Windows, CentOS
Professional Experience:
Role: Senior Hadoop Developer / Data Engineer May 2020 – Present
Client: Truist Bank, Charlotte, NC
Responsibilities:
●Develop and add features to existing data analytics applications built with Spark and Hadoop on a Scala/Python development platform on top of AWS services.
●Programming in Python and Scala with the Hadoop framework, utilizing Cloudera Hadoop ecosystem projects (HDFS, Spark, Sqoop, Hive, HBase, Oozie, Impala, Zookeeper, etc.).
●Involved in developing Spark applications using Scala and Python for data transformation, cleansing, and validation using the Spark API.
●Developed several accelerators/tools as Spark jobs, Step Functions workflows, and UNIX scripts, saving significant manual effort.
●Developed ETL data pipelines using PySpark on AWS EMR and configured EMR clusters on AWS.
●Troubleshot and monitored Spark jobs with the help of the Spark UI.
●Set up continuous integration/deployment of Spark jobs to EMR clusters (using the AWS SDK/CLI).
●Integrated data storage solutions with Spark, especially AWS S3 object storage.
●Worked with all the Spark APIs (RDD, DataFrame, Data Source, and Dataset) to transform data.
●Worked on both batch and streaming data sources; used Spark Streaming and Kafka for streaming data processing.
●Developed a Spark Streaming script that consumes topics from Kafka and periodically pushes micro-batches to Spark for real-time processing (see the sketch after this list).
●Built data pipelines for reporting, alerting, and data mining. Experienced with table design and data management using HDFS, Hive, Impala, Sqoop, MySQL, and Kafka.
●Worked on Apache Nifi to automate the data movement between RDBMS and HDFS.
●Created shell scripts to handle various jobs like Map Reduce, Hive, Pig, Spark etc., based on the requirement.
●Used Hive techniques like Bucketing, Partitioning to create the tables.
●Experience with Spark SQL for processing large amounts of structured data.
●Experienced working with source formats including CSV, JSON, Avro, and Parquet.
●Worked on AWS to aggregate cleaned files in Amazon S3 and on Amazon EC2 clusters to deploy files into S3 buckets.
●Used AWS EMR clusters for creating Hadoop and Spark clusters. These clusters are used for submitting and executing Scala and Python applications in production.
●Responsible for developing data pipeline with Amazon AWS to extract the data from weblogs and store in HDFS.
●Migrated the data from AWS S3 to HDFS using Kafka.
●Integrated Kubernetes with networking, storage, and security to provide comprehensive infrastructure, and orchestrated Kubernetes containers across multiple hosts.
●Implemented Jenkins and built pipelines to drive all microservice builds out to the Docker registry and deploy them to Kubernetes.
●Experienced in loading and transforming large sets of structured and semi-structured data using the ingestion tool Talend.
●Worked with NoSQL databases like HBase and Cassandra to retrieve and load data for real-time processing using REST APIs.
●Worked on creating data models for Cassandra from the existing Oracle data model.
●Responsible for transforming and loading the large sets of structured, semi structured and unstructured data.
●Developed data ingestion modules using AWS Step Functions, AWS Glue and Python modules
●Worked on Lambda functions that aggregate data from incoming events and store the results in DynamoDB.
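Below is a minimal, hedged sketch of the Kafka-to-Spark streaming pattern referenced in this list; the broker address, topic name, event schema, and S3 paths are placeholders, and the job assumes the Spark Kafka connector package (spark-sql-kafka-0-10) is available on the cluster.

# Minimal sketch (assumed broker, topic, schema, and S3 paths) of a Structured
# Streaming job that consumes a Kafka topic and writes micro-batches to S3.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Placeholder event schema; real topics would carry the project's own payload.
schema = StructType().add("event_id", StringType()).add("payload", StringType())

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
          .option("subscribe", "events")                      # assumed topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/events/")      # assumed output path
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
         .trigger(processingTime="1 minute")
         .start())
query.awaitTermination()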
Environment: Hadoop, HDFS, Apache Hive, Apache Kafka, Apache Spark, Spark SQL, Spark Streaming, Zookeeper, Pig, Oozie, Java, Python, S3, EMR, EC2, Redshift, Cassandra, NiFi, Talend, HBase, Cloudera (CDH 5.x).
Role: Data Engineer / Hadoop Developer Nov 2018 – April 2020
Client: Parallon, Nashville, TN
Responsibilities:
●Involved in data warehouse implementations using Azure SQL Data warehouse, SQL Database, Azure Data Lake Storage (ADLS), Azure Data Factory v2
●Involved in creating specifications for ETL processes, finalized requirements and prepared specification documents
●Migrated data from on-premises SQL Database to Azure Synapse Analytics using Azure Data Factory, designed optimized database architecture
●Created Azure Data Factory pipelines for copying data from Azure Blob Storage to SQL Server.
●Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight/Databricks.
●Worked with Microsoft on-prem data platforms, specifically SQL Server, SSIS, SSRS, and SSAS.
●Created Reusable ADF pipelines to call REST APIs and consume Kafka Events.
●Performed Kafka data analysis, feature selection, and feature extraction using Apache Spark's machine learning library.
●Involved in pipelining data into AWS Glue, then used PySpark to perform complex transformations, standardizing the data and staging it into S3 buckets.
●Used AWS Redshift, S3, Spectrum, and Athena services to query large amounts of data stored on S3 and create a virtual data lake without going through a separate ETL process. Developed ETL processes to move data between RDBMS and NoSQL data storage.
●Used Control-M for scheduling DataStage jobs and used Logic Apps for scheduling ADF pipelines
●Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
●Responsible for writing Hive queries to analyze the data in the Hive warehouse using Hive Query Language (HQL); involved in developing Hive DDLs to create, drop, and alter tables.
●Extracted data from various sources like Oracle, Teradata, and SQL Server and loaded it into HDFS using Sqoop import.
●Created Hive staging tables and external tables and joined the tables as required.
●Implemented dynamic partitioning, static partitioning, and bucketing (see the sketch after this list).
●Installed and configured Hadoop Map Reduce, Hive, HDFS, Pig, Sqoop, Flume and Oozie on Hadoop cluster.
●Worked on Microsoft Azure services like HDInsight clusters, BLOB, ADLS, Data Factory, and Logic Apps, and completed a POC on Azure Databricks.
●Developed and configured build and release (CI/CD) processes using Azure DevOps, and managed application code in Azure Git repositories with the required security standards for .NET and Java applications.
●Implemented Sqoop jobs for data ingestion from the Oracle to Hive.
●Worked with various formats of files like delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files. Proficient in using columnar file formats like RC, ORC, and Parquet.
●Developed custom Unix/Bash shell scripts for pre- and post-validation of the master and slave nodes, before and after the configuration of the NameNode and DataNodes respectively.
●Developed job workflows in Oozie for automating the tasks of loading the data into HDFS.
●Implemented compact and efficient file storage of big data by using various file formats like Avro, Parquet, JSON and using compression methods like GZip, Snappy on top of the files.
●Worked with Spark to improve performance and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
●Worked on Spark using Python as well as Scala and Spark SQL for faster testing and processing of data.
●Developed a data warehouse model in Snowflake for over 100 datasets using WhereScape.
●Worked on various data modelling concepts like star schema, snowflake schema in the project.
●Extensively used Stash, Bitbucket, and GitHub for source control.
●Migrated MapReduce jobs to Spark jobs to achieve better performance.
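A minimal PySpark sketch of the Hive partitioning and bucketing pattern referenced in this list; the database, table, and column names are assumed for illustration.

# Minimal sketch: write a staged DataFrame as a Hive table partitioned by load
# date and bucketed by member id. Database, table, and columns are assumed.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-bucket-sketch")
         .enableHiveSupport()
         .getOrCreate())

staged = spark.table("staging.claims")          # assumed staging table

(staged.write
 .mode("overwrite")
 .partitionBy("load_date")                      # partition column (assumed)
 .bucketBy(32, "member_id")                     # bucket key and count (assumed)
 .sortBy("member_id")
 .saveAsTable("warehouse.claims"))              # assumed target table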
Environment: Hadoop, HDFS, Microsoft Azure services (HDInsight, BLOB, ADLS, Logic Apps, etc.), Hive, Sqoop, Snowflake, Apache Spark, Spark SQL, ETL, Maven, Oozie, Java, Python, Unix shell scripting.
Role: Data & Analytics Engineer February 2017 – Oct 2018
Client: Merck, Branchburg, NJ
Responsibilities:
●Worked on creating MapReduce programs to analyze the data for claim report generation and running the Jars in Hadoop.
●Extracted, transformed and loaded the data sets using Apache Sqoop.
●Used NiFi and Sqoop for moving data between HDFS and RDBMS.
●Involved in writing Hive queries to analyze ad-hoc data from structured as well as semi structured data.
●Created Hive tables and worked on them using HiveQL.
●Also assisted in exporting analyzed data to relational databases using Sqoop.
●Imported data from different sources into Spark RDD for processing.
●Imported and exported data into HDFS using Flume.
●Worked on Hadoop components such as Pig, Oozie, etc.
●Developed Oozie workflows for daily incremental loads, which pull data from Teradata and import it into Hive tables.
●Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
●Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and AWS cloud.
●Created data pipelines for ingestion and aggregation events, loading consumer response data from AWS S3 buckets into Hive external tables in HDFS to serve as a feed for Tableau dashboards.
●Created monitors, alarms, and notifications for EC2 hosts using CloudWatch, CloudTrail, and SNS.
●Built ETL data pipelines for data movement to S3 and then to Redshift.
●Scheduled different Snowflake jobs using NiFi.
●Worked on Snowflake Multi-Cluster Warehouses
●Wrote AWS Lambda code in Python for nested JSON files: converting, comparing, sorting, etc. (see the sketch after this list).
●Involved in agile development methodology and actively participated in daily scrum meetings.
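An illustrative sketch of the Lambda pattern referenced in this list for flattening and sorting nested JSON delivered through S3 events; the field names and sort key are hypothetical.

# Hypothetical Lambda handler: read a nested JSON object from S3 (via the
# triggering S3 event), flatten it into dotted keys, and sort by an assumed key.
import json
import boto3

s3 = boto3.client("s3")

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into dotted keys."""
    out = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))
        else:
            out[key] = v
    return out

def handler(event, context):
    # Bucket and object key come from the S3 event that triggered the function.
    record = event["Records"][0]["s3"]
    body = s3.get_object(Bucket=record["bucket"]["name"],
                         Key=record["object"]["key"])["Body"].read()
    rows = [flatten(r) for r in json.loads(body)]
    rows.sort(key=lambda r: r.get("event_id", ""))   # assumed sort key
    return {"count": len(rows)}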
Environment: Hadoop 2.6.2, HDFS 2.6.2, MapReduce 2.9.0, Hive 1.1.1, Sqoop 1.4.4, Apache Spark 2.0, NiFi, ETL, Pig, Oozie, Java 7, Python 3, Snowflake, Apache Airflow, Tableau.
Role: Hadoop Developer March 2014 – December 2015
Client: Couth Infotech Pvt. Ltd, Hyderabad, India
Responsibilities:
●Involved in loading data from the UNIX file system to HDFS. Imported and exported data into HDFS and Hive using Sqoop.
●Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
●Devised procedures that solve complex business problems with due consideration for hardware/software capacity and limitations, operating times, and desired results.
●Analyzed large data sets to determine the optimal way to aggregate and report on them. Provided quick responses to ad hoc internal and external client requests for data and experienced in creating ad hoc reports.
●Responsible for building scalable distributed data solutions using Hadoop. Worked hands-on with the ETL process.
●Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, managing and reviewing data backups, and managing and reviewing Hadoop log files.
●Handled importing of data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
●Extracted data from Teradata into HDFS using Sqoop. Analyzed the data by performing Hive queries and running Pig scripts to understand user behavior segments such as shopping enthusiasts, travelers, and music lovers (see the sketch after this list).
●Exported the analyzed patterns back into Teradata using Sqoop. Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
●Installed the Oozie workflow engine to run multiple Hive jobs. Developed Hive queries to process the data and generate data cubes for visualization.
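An illustrative HiveQL segmentation query of the kind referenced in this list, submitted through the hive CLI from a small Python wrapper; the table, columns, and thresholds are assumed, not taken from the project.

# Illustrative only: a HiveQL query that buckets users into behavior segments.
# Table name, columns, and thresholds are assumed; submitted via the hive CLI.
import subprocess

SEGMENT_QUERY = """
SELECT user_id,
       CASE
         WHEN SUM(IF(category = 'shopping', 1, 0)) >= 10 THEN 'shopping_enthusiast'
         WHEN SUM(IF(category = 'travel',   1, 0)) >= 10 THEN 'traveler'
         WHEN SUM(IF(category = 'music',    1, 0)) >= 10 THEN 'music_lover'
         ELSE 'other'
       END AS segment
FROM clickstream_events
GROUP BY user_id
"""

# Run the query with the hive CLI and capture the tab-separated result rows.
result = subprocess.run(["hive", "-e", SEGMENT_QUERY],
                        capture_output=True, text=True, check=True)
print(result.stdout)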
Environment: Hive, Pig, Apache Hadoop, Cassandra, Sqoop, Big Data, HBase, Zookeeper, Cloudera, CentOS, NoSQL, Sencha ExtJS, JavaScript, AJAX, Hibernate, JMS, WebLogic Application Server, Eclipse, web services, Azure, Project Server, Unix, Windows.
Role: Data Modeler / ETL Developer/ Data Engineer August 2012 – February 2014
Client: IBing Software Solutions Private Ltd, Hyderabad, India
Responsibilities:
●Reviewed/prepared technical design documents (TDD) and created source-to-target mappings as per the requirements.
●Designed and implemented ETL to load data from source to target databases, including facts and Slowly Changing Dimensions (SCD) Type 1 and Type 2 to capture changes for staging, dimensions, facts, and data marts (see the sketch after this list).
●Involved in data extraction for various databases and files using Talend Open Studio & Big Data Edition.
●Worked on Talend with Java as the backend language.
●Extensively used the tMap component, which performs lookup and joiner functions, along with tXML, tLogRow, and tLogBack components in many of my jobs; created and worked with over 50 components across my jobs.
●Worked on Joblets, which are used for reusable code.
●Experience in data ingestion techniques for batch and stream processing using AWS Batch, AWS Kinesis, and AWS Data Pipeline.
●Developed Spark scripts using Python in the PySpark shell during development.
●Wrote SQL scripts, stored procedures, and Kettle transformations.
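A minimal PySpark sketch of the SCD Type 2 pattern referenced in this list (expire changed current rows, append new versions); the table and column names are assumed, and change detection is simplified to a single tracked attribute.

# Illustrative SCD Type 2 sketch: expire current rows whose tracked attribute
# changed and append new current versions. Tables and columns are assumed.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("scd2-sketch")
         .enableHiveSupport()
         .getOrCreate())

dim = spark.table("dw.dim_customer").alias("d")   # existing dimension (assumed)
stg = spark.table("stg.customer").alias("s")      # incoming staged rows (assumed)

# Current dimension rows whose tracked attribute differs from the staged value.
changed = (dim.join(stg, F.col("d.customer_id") == F.col("s.customer_id"))
              .where(F.col("d.is_current") & (F.col("d.address") != F.col("s.address"))))

# Close out the old versions.
expired = (changed.select("d.*")
           .withColumn("is_current", F.lit(False))
           .withColumn("end_date", F.current_date()))

# Open new current versions from the staged rows.
new_rows = (changed.select("s.*")
            .withColumn("is_current", F.lit(True))
            .withColumn("start_date", F.current_date())
            .withColumn("end_date", F.lit(None).cast("date")))

scd2_delta = expired.unionByName(new_rows, allowMissingColumns=True)
scd2_delta.write.mode("append").saveAsTable("dw.dim_customer_delta")   # assumed target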
Environment: Python, PySpark, AWS Data Pipeline, Talend, SQL, MySQL, Eclipse, Java