Naveen Raju
Data Engineer
*************@*****.***
Professional Summary:
• Accomplished Big Data Engineer with 8 years of hands-on experience in designing, implementing, and optimizing data-intensive applications within the Hadoop Ecosystem and Amazon Web Services (AWS). Specialized in crafting robust solutions for Big Data Analytics, Cloud Data Engineering, Data Warehousing, Data Visualization, Reporting, and Data Quality assurance.
• Demonstrates a comprehensive understanding of Hadoop architecture and AWS cloud services, including YARN, HDFS, MapReduce, Spark, EMR, S3, Redshift, Glue, and Lambda, adeptly integrating them to meet diverse business needs.
• Proven expertise in developing scalable and efficient enterprise solutions, leveraging a wide array of Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.
• Proficient in data ingestion and processing methodologies, adept at performing complex transformations, enrichments, and aggregations while ensuring data integrity and quality. Possesses a strong foundation in distributed systems architecture, parallel processing, and the Spark execution framework.
• Skilled in fine-tuning and optimizing algorithms within the Hadoop ecosystem using Spark Context, Spark-SQL, DataFrames API, Spark Streaming, MLlib, and Pair RDDs, with proficiency in both PySpark and Scala programming languages.
• Experienced in architecting and implementing end-to-end data pipelines within AWS, ensuring seamless compatibility across diverse data sources and destinations. Proficient in managing complex data integration pipelines using Spark and AWS services like EMR and Glue for efficient data ingestion, transformation, and loading.
• Demonstrated proficiency in managing data ingestion from various sources into HDFS using tools like Sqoop and Flume, and executing transformations with Hive and MapReduce. Skilled in managing Sqoop jobs for incremental loads to populate Hive external tables.
• Adept in leveraging AWS ecosystem components and Spark for ETL processes using Spark Core, Spark-SQL, and real-time data processing with Spark Streaming. Proficient in integrating Kafka as middleware for real-time data pipelines.
• Skilled in developing custom User Defined Functions (UDFs) and seamlessly integrating them with Hive and Pig using Java. Experienced in creating, debugging, scheduling, and monitoring workflows using Airflow and Oozie in both Hadoop and AWS environments.
• Hands-on experience in managing SQL and NoSQL databases, including MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Proficient in database design, creation, migration, and transformation processes, ensuring optimal performance and data integrity.
Technical Skills:
Big Data Tools
• Hadoop/Big Data: HDFS, MapReduce, YARN, HBase, Pig, Hive, Sqoop, Flume, Oozie, Zookeeper, Splunk, Hortonworks, Cloudera
• AWS: EMR, S3, Redshift, Glue, Data Pipeline, Lambda
Programming Languages
• Hadoop/Big Data: SQL, Python, R, Scala, PySpark, Linux shell scripting
• AWS: SQL, Python, Scala
Databases
• Hadoop/Big Data: RDBMS (MySQL, DB2, MS SQL Server, Teradata, PostgreSQL), NoSQL (MongoDB, HBase, Cassandra), Snowflake Virtual Warehouse
• AWS: RDS, DynamoDB, Redshift, DocumentDB, Neptune
OLAP & ETL Tools
• Hadoop/Big Data: Tableau, Tableau Server, Power BI, Spyder, SSIS, Informatica, Spark, Pentaho, Talend
• AWS: Glue, Data Pipeline, Lambda
Data Modelling Tools
• Microsoft Visio, ER Studio, Erwin
Python and R Libraries
• Hadoop/Big Data: R (tidyr, tidyverse, dplyr, reshape, lubridate); Python (Beautiful Soup, NumPy, SciPy, Matplotlib, python-twitter, pandas, scikit-learn, Keras, boto3)
• AWS: NumPy, pandas, scikit-learn, boto3
Machine Learning
• Hadoop/Big Data: Regression, Clustering, MLlib, Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, KNN, K-Means, Random Forest, Gradient Boost & AdaBoost, Neural Networks, Time Series Analysis
• AWS: SageMaker, Comprehend, Forecast
Data Analysis Tools
• Hadoop/Big Data: Machine Learning, Deep Learning, Data Warehousing, Data Mining, Data Analysis, Big Data, Data Visualization, Data Munging, Data Modelling
• AWS: QuickSight, Athena, Glue, Kinesis
Cloud Computing Tools
• Hadoop/Big Data: Snowflake, SnowSQL, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)
• AWS: AWS Snow Family (Snowcone, Snowball, Snowmobile); AWS services (EMR, EC2, S3, RDS, CloudSearch, Redshift, Glue, Data Pipeline, Lambda)
Reporting Tools
• Hadoop/Big Data: JIRA, MS Excel, Tableau, Power BI, QlikView, Qlik Sense, D3, SSRS, SSIS
• AWS: QuickSight, Athena
IDEs
• Hadoop/Big Data: PyCharm, Anaconda, Jupyter Notebook, IntelliJ
• AWS: AWS Cloud9, SageMaker Studio, IntelliJ
Development Methodologies
• Agile, Waterfall
Professional Experience:
CIGNA, Morristown, NJ Sep 2022 – Present
Sr Data Engineer
Responsibilities:
• Proficient in Agile Scrum Development, collaborating across teams to efficiently meet client requirements.
• Leveraged Amazon Web Services (AWS) tools such as Amazon EMR, Amazon S3, Amazon RDS, Amazon Redshift, AWS Glue, and AWS Lambda to tackle complex analytical challenges with large datasets.
• Designed and executed data pipelines in Apache NiFi and Apache Spark, ensuring seamless compatibility across diverse data sources and destinations within the Hadoop ecosystem.
• Orchestrated complex data integration pipelines, utilizing Apache NiFi and Apache Spark for efficient data ingestion, transformation, and loading into target systems.
• Streamlined ETL workflows, integrating with AWS services like Amazon S3 and Amazon Redshift, resulting in significant time and cost savings.
• Automated deployment and monitoring of data pipelines using CI/CD pipelines and AWS CloudWatch, enhancing reliability and reducing manual intervention.
• Demonstrated proficiency in SQL and Python for database programming, ensuring functional code and conducting thorough unit testing.
• Accountable for end-to-end solution development, from analyzing business requirements to deploying solutions and obtaining stakeholder signoff.
• Conducted essential validations on data ingestion for both incremental and full data loads to ensure data accuracy and completeness.
• Expertise in parsing large XML data using Apache Spark operations on Hadoop clusters, optimizing performance through transformations and actions.
• Skilled in developing and fine-tuning complex SQL scripts to optimize performance as needed.
• Utilized Apache Airflow for workflow orchestration, creating modular data transformations, and scheduling Directed Acyclic Graphs (DAGs) for daily activities.
• Conducted proof of concept (POC) projects to evaluate various cloud offerings, including migrating projects from on-prem Hadoop systems to AWS.
• Compared self-hosted Hadoop with AWS EMR, exploring use cases and evaluating relative performance.
• Leveraged AWS CLI to configure services such as EMR, S3, and Redshift.
• Fine-tuned Spark applications to improve overall processing time for pipelines and enhance efficiency.
Vanguard, Malvern, PA Jul 2021 – Aug 2022
Data Engineer
Responsibilities:
• Developed ETL pipelines on S3 Parquet files in a data lake using AWS Glue.
• Programmed ETL functions to transfer data between Oracle and Amazon Redshift.
• Conducted data analytics on the data lake utilizing PySpark on the Databricks platform.
• Assessed and enhanced the quality of customer data.
• Utilized various AWS cloud services including EC2, S3, EMR, RDS, Athena, and Glue.
• Analyzed data quality issues through exploratory data analysis (EDA) using SQL, Python, and pandas.
• Created automation scripts using Python libraries to perform accuracy checks from diverse sources to target databases.
• Developed Python scripts to generate heatmaps for issue and root cause analysis of data quality report failures.
• Performed data analysis and predictive data modeling.
• Collaborated with stakeholders to deliver regulatory reports and recommend remediation strategies, building analytical dashboards using Excel and Python plotting libraries.
• Designed and implemented a REST API for accessing the Snowflake DB platform.
• Managed data warehouses in Snowflake and implemented star schemas.
• Participated in the code migration of a quality monitoring tool from AWS EC2 to AWS Lambda and developed logical datasets.
• Handled various data feeds such as JSON, CSV, XML, and DAT, implementing the data lake concept.
Environment: Python, Spark SQL, PySpark, pandas, NumPy, Excel, Power BI, AWS EC2, AWS S3, AWS Lambda, Athena, Glue, Linux shell scripting, Snowflake, Git, DynamoDB, Redshift.
Citi Bank, New York, NY Apr 2020 – Jun 2021
Data Engineer
Responsibilities:
• Extracted and analyzed extensive data sets exceeding 800k records from the Hadoop Distributed File System (HDFS) and Amazon S3 using SQL queries.
• Conducted exploratory data analysis (EDA) in Python, utilizing Seaborn and Matplotlib to evaluate data quality on both Hadoop clusters and AWS.
• Developed efficient Spark applications with PySpark and Spark SQL in Databricks and Amazon EMR, enabling seamless data extraction, transformation, and aggregation from diverse file formats across HDFS and Amazon S3.
• Played a key role in migrating data from on-premises SQL Servers to cloud databases such as Amazon Redshift and Amazon RDS, ensuring a smooth transition and data integrity.
• Established robust data ingestion pipelines by connecting with Amazon S3, facilitating end-to-end processing of raw files through Databricks and AWS Glue.
• Implemented advanced data cleaning techniques using pandas and NumPy in Jupyter Notebook, effectively handling missing values and enhancing data preparation workflows on both Hadoop and AWS platforms.
• Designed and implemented custom input adapters leveraging Spark, Hive, and Sqoop for seamless ingestion and analysis of data into HDFS and Amazon S3 from sources like Redshift and MySQL.
• Performed text analytics and processing using Apache Spark written in Scala, contributing to enhanced data insights and quality assurance in both Hadoop and AWS environments.
• Implemented Spark functionality using Python and Spark SQL, streamlining data processing tasks and ensuring efficient management of diverse data sources across Hadoop and AWS setups.
• Contributed to real-time data processing initiatives by implementing Kafka clusters for data ingestion, along with extracting and exporting data from Teradata into Amazon DynamoDB, enhancing data accessibility and utilization.
Fidelity, Boston, MA Nov 2018 – Mar 2020
Data Engineer
Responsibilities:
• Configured and managed Hadoop ecosystem components including Hive, Pig, HBase, and Sqoop on both Hadoop clusters and AWS EMR instances.
• Implemented real-time data processing by configuring Spark Streaming to ingest data from Apache Kafka and store it in HDFS using Scala in both Hadoop and AWS environments.
• Developed Sqoop scripts to enable seamless data transfer between Hive and Vertica database systems across Hadoop clusters and AWS infrastructure.
• Utilized MapReduce, Pig, and Hive for data analysis and processing, storing processed data in HDFS and Google Cloud Storage (GCS) for downstream applications on both Hadoop and cloud platforms.
• Established multi-terabyte data warehouse infrastructure on AWS Redshift and Google BigQuery, handling extensive data volumes daily.
• Led the migration of an on-premises application to AWS and Google Cloud Platform, configuring virtual data centers with services like EC2, S3, and Google Cloud Storage (GCS) to support Enterprise Data Warehouse hosting.
Unisys Global Services Pvt Ltd, India Jan 2016 – Jul 2017
Hadoop Engineer
Responsibilities:
• Developed highly optimized Spark applications for data cleansing, validation, and transformation.
• Implemented data pipelines consisting of Spark, Hive, and Sqoop, alongside custom-built input adapters.
• Created Spark and Hive jobs for data summarization and transformation purposes.
• Utilized Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases.
• Transformed Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
Smart Software Tech. Dev. Pvt. Ltd., India Apr 2014 – Dec 2015
Hadoop Engineer
Responsibilities:
• Collaborated with the business requirements and design teams, preparing low-level and high-level design documents.
• Provided comprehensive technical and business knowledge to ensure efficient design, programming, implementation, and ongoing support for applications.
• Identified potential avenues to enhance system efficiency.
• Executed logical implementations and interacted with HBase effectively.
• Developed MapReduce jobs to automate data transfer and efficiently store and retrieve data to and from HBase.
Education:
• Master of Computer Science from New York Institute of Technology, New York, NY, 2018
• Bachelor's in Computer Science from JNTUH, India, 2014