Naveen Raju
Data Engineer
********@*****.***
https://www.linkedin.com/in/naveen05/
Professional Summary:
• Accomplished Big Data Engineer with 9+ years of hands-on experience in designing, implementing, and optimizing data-intensive applications within the Hadoop Ecosystem and Amazon Web Services (AWS). Specialized in crafting robust solutions for Big Data Analytics, Cloud Data Engineering, Data Warehousing, Data Visualization, Reporting, and Data Quality assurance.
• Demonstrates a comprehensive understanding of Hadoop architecture and AWS cloud services, including YARN, HDFS, MapReduce, Spark, EMR, S3, Redshift, Glue, and Lambda, adeptly integrating them to meet diverse business needs.
• Proven expertise in developing scalable and efficient enterprise solutions, leveraging a wide array of Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.
• Proficient in data ingestion and processing methodologies, adept at performing complex transformations, enrichments, and aggregations while ensuring data integrity and quality. Possesses a strong foundation in distributed systems architecture, parallel processing, and the Spark execution framework.
• Skilled in fine-tuning and optimizing algorithms within the Hadoop ecosystem using Spark Context, Spark-SQL, DataFrames API, Spark Streaming, MLlib, and Pair RDDs, with proficiency in both PySpark and Scala programming languages.
• Experienced in architecting and implementing end-to-end data pipelines within AWS, ensuring seamless compatibility across diverse data sources and destinations. Proficient in managing complex data integration pipelines using Spark and AWS services like EMR and Glue for efficient data ingestion, transformation, and loading.
• Demonstrated proficiency in managing data ingestion from various sources into HDFS using tools like Sqoop and Flume, and in executing transformations using Hive and MapReduce. Skilled in managing Sqoop jobs for incremental loads to populate Hive external tables.
• Adept in leveraging AWS ecosystem components and Spark for ETL processes using Spark Core, Spark-SQL, and real-time data processing with Spark Streaming. Proficient in integrating Kafka as middleware for real-time data pipelines.
• Skilled in developing custom User Defined Functions (UDFs) and seamlessly integrating them with Hive and Pig using Java. Experienced in creating, debugging, scheduling, and monitoring workflows using Airflow and Oozie in both Hadoop and AWS environments.
• Hands-on experience in managing SQL and NoSQL databases, including MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Proficient in database design, creation, migration, and transformation processes, ensuring optimal performance and data integrity.
Category
Hadoop/Big Data Skills
AWS Skills
Big Data Tools
HDFS, MapReduce, YARN, HBase, Pig, Hive, Sqoop, Flume, Oozie, Zookeeper
EMR, S3, Redshift, Glue, Data Pipeline, Lambda
Splunk, Hortonworks, Cloudera
Programming Languages
SQL, Python, R, Scala
SQL, Python, Scala
PySpark, Linux Shell Scripts
Databases
RDBMS (MySQL, DB2, MS-SQL Server, Teradata, PostgreSQL)
RDS, DynamoDB, Redshift
NoSQL (MongoDB, HBase, Cassandra)
DocumentDB, Neptune
Snowflake Virtual Warehouse
OLAP & ETL Tools
Tableau, Tableau Server, Power BI, Spyder, SSIS, Informatica
Glue, Data Pipeline, Lambda
Spark, Pentaho, Talend
Data Modelling Tools
Microsoft Visio, ER Studio, Erwin
Python and R Libraries
R: tidyr, tidyverse, dplyr, reshape, lubridate; Python: Beautiful Soup, numpy, scipy, matplotlib, python-twitter, pandas, scikit-learn, keras, boto3
numpy, pandas, scikit-learn, boto3
Machine Learning
Regression, Clustering, MLlib, Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, KNN, K-Means, Random Forest, Gradient Boosting & AdaBoost, Neural Networks, Time Series Analysis
SageMaker, Comprehend, Forecast
Data Analysis Tools
Machine Learning, Deep Learning, Data Warehouse, Data Mining, Data Analysis, Big data, Visualizing, Data Munging, Data Modelling
QuickSight, Athena, Glue, Kinesis
Cloud Computing Tools
Snowflake, SnowSQL
AWS Snow Family (Snowball, Snowmobile)
Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)
AWS Services (EMR, EC2, S3, RDS, CloudSearch, Redshift, Glue, Data Pipeline, Lambda)
Reporting Tools
JIRA, MS Excel, Tableau, Power BI, QlikView, Qlik Sense, D3, SSRS, SSIS
QuickSight, Athena
IDEs
PyCharm, Anaconda, Jupyter Notebook, IntelliJ
AWS Cloud9, SageMaker Studio, IntelliJ
Development Methodologies
Agile, Waterfall
Professional Experience:
CIGNA, Morristown, NJ Sep 2022 – Present
Sr Data Engineer
Responsibilities:
• Proficient in Agile Scrum Development, collaborating across teams to efficiently meet client requirements.
• Leveraged Amazon Web Services (AWS) tools such as Amazon EMR, Amazon S3, Amazon RDS, Amazon Redshift, AWS Glue, and AWS Lambda to tackle complex analytical challenges with large datasets.
• Designed and executed data pipelines in Apache NiFi and Apache Spark, ensuring seamless compatibility across diverse data sources and destinations within the Hadoop ecosystem.
• Developed Infrastructure as Code (IaC) solutions using Terraform to provision and manage AWS resources, ensuring consistency and scalability across development, staging, and production environments.
• Orchestrated complex data integration pipelines, utilizing Apache NiFi and Apache Spark for efficient data ingestion, transformation, and loading into target systems.
• Streamlined ETL workflows, integrating with AWS services like Amazon S3 and Amazon Redshift, resulting in significant time and cost savings.
• Automated deployment and monitoring of data pipelines using CI/CD pipelines and AWS CloudWatch, enhancing reliability and reducing manual intervention.
• Engineered and optimized ETL workflows and mappings using Informatica PowerCenter and SnapLogic, managing sessions and workflows to ensure efficient data processing.
• Demonstrated proficiency in SQL and Python for database programming, ensuring functional code and conducting thorough unit testing.
• Accountable for end-to-end solution development, from analyzing business requirements to deploying solutions and obtaining stakeholder signoff.
• Conducted essential validations on data ingestion for both incremental and full data loads to ensure data accuracy and completeness.
• Expertise in parsing large XML data using Apache Spark operations on Hadoop clusters, optimizing performance through transformations and actions.
• Set up and automated data storage systems on AWS, including S3 for object storage, Amazon RDS for relational databases, and Redshift for data warehousing, using Terraform for scalability and efficient data handling.
• Skilled in developing and fine-tuning complex SQL scripts to optimize performance as needed.
• Utilized Terraform to create autoscaling groups for dynamic resource allocation, optimizing compute resources based on pipeline loads and improving cost-efficiency.
• Utilized Apache Airflow for workflow orchestration, creating modular data transformations, and scheduling Directed Acyclic Graphs (DAGs) for daily activities.
• Utilized advanced SQL and PL/SQL to develop and optimize complex queries and stored procedures for Oracle databases, improving data retrieval and manipulation efficiency.
• Conducted proof of concept (POC) projects to evaluate various cloud offerings, including migrating projects from on-prem Hadoop systems to AWS.
• Provided on-demand production support for ETL jobs, diagnosing and resolving issues to maintain operational stability and minimize downtime.
• Compared self-hosted Hadoop with Amazon EMR, exploring use cases and evaluating performance.
• Leveraged AWS CLI to configure services such as EMR, S3, and Redshift.
• Fine-tuned Spark applications to improve overall processing time for pipelines and enhance efficiency.
Vanguard, Malvern, PA Jul 2021 – Aug 2022
Data Engineer
Responsibilities:
Developed ETL pipelines on S3 Parquet files in a data lake using AWS Glue.
Programmed ETL functions to transfer data between Oracle and Amazon Redshift.
Conducted Data Analytics on the Data Lake utilizing PySpark on the Databricks platform.
Assessed and enhanced the quality of customer data.
Utilized various AWS cloud services including EC2, S3, EMR, RDS, Athena, and Glue.
Provisioned and scaled AWS EC2 instances, EMR clusters, and Lambda functions through Terraform to run data processing pipelines, supporting large-scale ETL and big data workflows.
Analyzed data quality issues through exploratory data analysis (EDA) using SQL, Python, and Pandas.
Created automation scripts using Python libraries to perform accuracy checks from diverse sources to target databases.
Developed Python scripts to generate heatmaps for issue and root cause analysis of data quality report failures.
Conducted in-depth troubleshooting of ETL jobs, utilizing log analysis and error tracking to prioritize and address issues swiftly.
Performed data analysis and predictive data modeling.
Collaborated with stakeholders to deliver regulatory reports and recommend remediation strategies, building analytical dashboards using Excel and Python plotting libraries.
Configured VPCs, security groups, and IAM roles using Terraform, ensuring secure data handling and access controls across AWS resources.
Designed and implemented a REST API for accessing the Snowflake DB platform.
Managed data warehouses in Snowflake and implemented star schemas.
Tuned and optimized data pipelines and ETL jobs, reducing processing times and increasing throughput through best practices in indexing and partitioning.
Participated in the code migration of a quality monitoring tool from AWS EC2 to AWS Lambda and developed logical datasets.
Handled various data feeds such as JSON, CSV, XML, and DAT, implementing the Data Lake concept.
Led the design and implementation of comprehensive test plans, test cases, and test scenarios for complex data-driven systems, covering functional, non-functional, and edge case scenarios to ensure thorough test coverage.
Orchestrated ETL pipelines using AWS Glue, Step Functions, and Lambda functions, automating data ingestion, transformation, and loading processes via Terraform.
Analyzed and mapped data discrepancies between source and target systems, implementing fixes and enhancements within strict timelines.
Developed and maintained automated testing frameworks and scripts using industry-standard tools such as Selenium, pytest, and JUnit, resulting in significant reductions in testing time and increased test coverage across multiple platforms.
Conducted performance testing and optimization efforts to identify bottlenecks and scalability issues in data processing workflows, collaborating with development teams to implement performance enhancements and improve system responsiveness.
Established and enforced quality standards and best practices across the development lifecycle, including code reviews, coding standards, and documentation guidelines, to ensure consistent and high-quality deliverables.
Tuned Informatica mappings and transformations for improved performance, including optimizing session parameters and minimizing data bottlenecks.
Participated in release management activities, including release planning, regression testing, and post-release validation, to guarantee the stability and reliability of production systems and minimize the impact of defects on end users.
Investigated and triaged reported defects and issues, utilizing debugging tools and techniques to isolate root causes and work with development teams to implement timely resolutions and prevent recurrence.
Implemented continuous integration and continuous deployment (CI/CD) pipelines to automate the build, test, and deployment processes, ensuring rapid and reliable delivery of software updates while maintaining quality standards.
Conducted risk assessments and impact analyses for proposed changes and enhancements to data systems, providing valuable insights to stakeholders and decision-makers to inform prioritization and resource allocation efforts.
Environment: Python, Spark SQL, PySpark, Pandas, NumPy, Excel, Power BI, AWS EC2, AWS S3, AWS Lambda, Athena, Glue, Linux Shell Scripting, SnowflakeDB, Git, DynamoDB, Redshift.
Citi Bank, New York, NY Apr 2020 – Jun 2021
Data Engineer
Responsibilities:
Proficiently extracted and analyzed extensive data sets exceeding 800k records from Hadoop Distributed File System (HDFS) and Amazon S3 using SQL queries.
Demonstrated expertise in conducting exploratory data analysis (EDA) in Python, utilizing Seaborn and Matplotlib to evaluate data quality on both Hadoop clusters and AWS.
Implemented comprehensive monitoring and logging frameworks (CloudWatch, X-Ray) for AWS-based data pipelines, provisioning resources through Terraform to ensure real-time performance tracking and issue resolution.
Developed efficient Spark applications with PySpark and Spark SQL in Databricks and Amazon EMR, enabling seamless data extraction, transformation, and aggregation from diverse file formats across HDFS and Amazon S3.
Played a key role in migrating data from on-premises SQL servers to cloud databases like Amazon Redshift and Amazon RDS, ensuring smooth transition and data integrity.
Established robust data ingestion pipelines by connecting with Amazon S3, facilitating end-to-end processing of raw files through Databricks and AWS Glue.
Designed and automated backup and disaster recovery solutions using Terraform to manage S3 data replication, RDS snapshots, and cross-region failover.
Implemented advanced data cleaning techniques using Pandas and NumPy in Jupyter Notebook, effectively handling missing values and enhancing data preparation workflows on both Hadoop and AWS platforms.
Designed and implemented custom input adapters leveraging Spark, Hive, and Sqoop for seamless ingestion and analysis of data into HDFS and Amazon S3 from sources like Redshift and MySQL.
Integrated AWS data services, including Amazon Redshift, Aurora, and S3, into ETL processes to enhance data scalability and processing efficiency.
Showcased expertise in text analytics and processing using Apache Spark written in Scala, contributing to enhanced data insights and quality assurance on both Hadoop and AWS environments.
Automated cost-saving strategies by provisioning spot instances for transient jobs, implementing S3 lifecycle policies, and using Terraform to control resource usage based on project needs.
Proficiently implemented Spark functionalities using Python and Spark SQL, streamlining data processing tasks and ensuring efficient management of diverse data sources across Hadoop and AWS setups.
Contributed to real-time data processing initiatives by implementing Kafka clusters for data ingestion, along with extracting and exporting data from Teradata into Amazon DynamoDB, enhancing data accessibility and utilization.
Fidelity, Boston, Massachusetts Nov 2018 – Mar 2020
Data Engineer
Responsibilities:
Configured and managed Hadoop ecosystem components including Hive, Pig, HBase, and Sqoop on both Hadoop clusters and AWS EMR instances.
Implemented real-time data processing by configuring Spark Streaming to ingest data from Apache Kafka and store it in HDFS using Scala on both Hadoop and AWS environments.
Developed Sqoop scripts to enable seamless data transfer between Hive and Vertica database systems across Hadoop clusters and AWS infrastructure.
Utilized MapReduce, Pig, and Hive for data analysis and processing, storing processed data in HDFS and Google Cloud Storage (GCS) for downstream applications on both Hadoop and AWS platforms.
Established multi-terabyte Data Warehouse infrastructure on AWS Redshift and Google BigQuery, handling extensive data volumes daily.
Led the migration of an on-premises application to AWS and Google Cloud Platform, configuring virtual data centers with services like EC2, S3, and Google Cloud Storage (GCS) to support Enterprise Data Warehouse hosting.
Unisys Global Services Pvt Ltd, India Jan 2016 – Jul 2017
Hadoop Engineer
Responsibilities:
Developed highly optimized Spark applications for data cleansing, validation, transformation, and
Implemented data pipelines consisting of Spark, Hive, and Sqoop, alongside custom-built Input
Created Spark and Hive jobs for data summarization and transformation purposes.
Utilized Spark for interactive queries, processing streaming data, and integrating with popular
Transformed Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
Smart Software Tech. Dev. Pvt. Ltd., India Apr 2014 – Dec 2015
Hadoop Engineer
Responsibilities:
Collaborated with the Business Requirements and design teams, preparing low-level and high-level design documents.
Provided comprehensive technical and business knowledge to ensure efficient design, programming, implementation, and ongoing support for applications.
Identified potential avenues to enhance system efficiency.
Executed logical implementations and interacted with HBase effectively.
Efficiently stored and retrieved data to/from HBase by developing MapReduce jobs.
Developed MapReduce jobs to automate data transfer to and from HBase.
Education:
• Master of Computer Science from New York Institute of Technology, New York, NY, 2018
• Bachelor's in Computer Science from JNTUH, India, 2014