Naveen Kumar
****************@*****.*** 937-***-****
PROFESSIONAL SUMMARY:
Data Engineer with around 5 years of IT experience in analysis, design, development, implementation, maintenance, and support. Expertise in deploying big data technologies to efficiently solve complex data processing challenges.
4+ years as a cloud professional, proficient in architecting and implementing highly distributed, global, cloud-based systems.
Extensive experience with GCP services, including BigQuery, Cloud Storage, Dataflow, Dataproc, Pub/Sub, and Cloud SQL, for data processing, storage, and analysis.
Skilled in migrating legacy applications to the GCP platform, managing GCP services such as Compute Engine, Cloud Storage, BigQuery, VPC, Stackdriver, and Load Balancing.
Strong background in data warehousing, ETL processes, and building data pipelines using tools like Airflow, DBT, and Informatica.
Experienced in Airflow and Cloud Composer for orchestrating complex data workflows, ensuring efficient and reliable data operations.
Proficient in working with Azure Data Lake, Azure SQL Database, Key Vault, and Integration Runtimes.
Familiar with CI/CD practices using Git and Azure DevOps for pipeline version control and deployment.
Integrated Databricks with cloud platforms such as AWS, leveraging their storage and compute services for scalable data solutions.
Proficient in designing and developing scalable data pipelines using Snowflake, supporting both batch and near real-time data workloads.
Strong command over Snowflake SQL, including writing complex queries, stored procedures, and UDFs for data transformation.
Managed Snowflake object versioning and deployment automation using GitHub, DBT, and CI/CD pipelines.
Experienced in setting up Role-Based Access Control (RBAC) and masking policies to ensure secure, compliant access to sensitive data.
Integrated Snowpark processes into the existing CI/CD pipeline to streamline testing and deployment of data workflows.
Utilized advanced SQL features in Teradata, such as OLAP functions, CASE expressions, CTEs, and recursive queries, to solve complex business problems.
Evaluated cloud-native tools against Teradata EDW performance to inform architectural decisions.
Hands-on experience with Snowpipe, Streams, and Tasks for automated and continuous data ingestion.
Worked with advanced Snowflake features such as Resource Monitors, Role-Based Access Control, Data Sharing, virtual warehouse sizing, query performance tuning, Snowpipe, Tasks, Streams, and zero-copy cloning.
Worked on Spark Streaming and Spark Structured Streaming with Kafka for real-time data processing (see the illustrative sketch at the end of this summary).
Successfully led multiple Maximo implementation projects, ensuring timely delivery and adherence to project scope.
Used Maximo for asset tracking, preventive maintenance scheduling, and work order management.
Worked with different file formats such as JSON, Avro, and Parquet, and compression techniques such as Snappy.
Proficient in containerization and orchestration technologies, including Docker and Kubernetes, across multiple cloud providers.
Experienced in creating Power BI reports and upgrading Power Pivot reports to Power BI.
Experienced in developing interactive data visualizations using Tableau and Power BI.
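Below is a minimal PySpark sketch of the kind of Spark Structured Streaming job with Kafka referenced above; the topic name, schema, broker address, and output paths are hypothetical placeholders rather than details of any specific engagement.

```python
# Illustrative sketch only: topic, schema, broker, and paths are hypothetical.
# Requires the spark-sql-kafka package on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = (SparkSession.builder
         .appName("kafka-structured-streaming-sketch")
         .getOrCreate())

# Expected JSON payload on the (hypothetical) 'orders' topic.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "orders")
       .load())

# Parse the Kafka value bytes into typed columns.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("data"))
          .select("data.*"))

# Write the parsed stream to Parquet with checkpointing for fault tolerance.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "/tmp/orders_parquet")          # placeholder path
         .option("checkpointLocation", "/tmp/orders_ckpt")
         .outputMode("append")
         .start())

query.awaitTermination()
```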
TECHNICAL SKILLS:
Big Data / Hadoop Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Flume, Sqoop, Impala, Oozie, Zookeeper, Spark, Ambari, Mahout, MongoDB, Cassandra, Avro, Storm, Parquet, Snappy
Languages: Python, Java, Scala, HTML, JavaScript, XML, C/C++
NoSQL Databases: Cassandra, MongoDB, HBase
Cloud Technologies: GCP, Azure, AWS
Development/Build Tools: DBT, Eclipse, Ant, Maven, JUnit
Frameworks: Struts, Spring, Hibernate
Development Methodologies: Agile, Waterfall
Databases & Query Languages: MySQL, PL/SQL, PostgreSQL, Oracle
PROFESSIONAL EXPERIENCE:
Geisinger Health, Remote, PA / APEX Systems
Database Engineer / Data Admin
September 2024 – Present
Developed Python scripts for automated data validation, error logging, and metadata tracking across pipelines.
Built reusable Python modules for API interaction and data ingestion workflows.
Developed and maintained complex SQL scripts and stored procedures for data extraction, transformation, and loading (ETL) between Clarity, Caboodle, and enterprise data warehouse platforms.
Implemented custom alerts and dashboards in Foglight to proactively detect slow-running queries, deadlocks, and storage bottlenecks in production systems.
Developed and optimized ETL workflows in Databricks, utilizing Delta Lake for improved data reliability and performance.
Developed reusable Databricks notebooks for standard ETL processes, enhancing the efficiency and consistency of data engineering tasks.
Created custom reports and dashboards using Epic Clarity and Caboodle to support clinical operations, finance, and quality improvement initiatives.
Conducted root cause analysis and resolved data discrepancies between Epic front-end workflows and backend Clarity/Caboodle databases.
Ensured HIPAA compliance and patient data integrity while handling sensitive healthcare information within Epic systems.
Applied encryption, token-based authentication (OAuth2), and secure API gateway configurations for FHIR endpoints to safeguard patient data.
Coordinated with Epic TS and internal IT teams to troubleshoot production issues, apply patches, and conduct root cause analysis for system incidents.
Created multiple workspaces and clusters in Databricks.
Whitelisted multiple IP addresses for the workspaces.
Utilized ServiceNow to track, manage, and resolve incidents related to Epic Clarity/Caboodle, SQL Server Agent jobs, and enterprise data systems.
Logged and triaged infrastructure, database, and backup-related incidents, ensuring timely resolution and adherence to SLAs.
Strong understanding of data warehouse concepts and experience testing data loads into data warehouse systems.
Worked in Chronicles and other healthcare platforms to ensure seamless integration and reporting from diverse patient care sources.
Created stored procedures and databases based on project demands.
Configured and maintained Qlik Replicate tasks to enable near real-time data replication from Epic Clarity/Caboodle SQL Server databases to downstream analytics platforms.
Documented project-related information in Confluence and internal team learnings in OneNote.
Provided on-call support and maintained documentation for all critical Clarity and Caboodle administrative processes.
Designed, scheduled, and maintained SQL Server Agent Jobs to automate ETL processes, database maintenance, backups, and data refreshes for Epic Clarity and Caboodle environments.
Created PowerShell scripts for automation.
Managed enterprise-level database backups and restores using Commvault for Epic Clarity, Caboodle, and other mission-critical SQL Server environments.
Used Postman to retrieve data from the Microsoft Graph API and loaded it into SQL Server via SSMS on an incremental basis (illustrative sketch below).
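A minimal Python sketch of the kind of paginated Graph API load described above, assuming an OAuth2 token obtained elsewhere; the endpoint selection, staging table dbo.graph_users, and connection string are hypothetical placeholders.

```python
# Illustrative sketch only: endpoint, table, and connection details are placeholders.
import requests
import pyodbc

GRAPH_URL = "https://graph.microsoft.com/v1.0/users?$select=id,displayName,mail"
ACCESS_TOKEN = "<token acquired via OAuth2 client-credentials flow>"  # assumed to exist

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=staging;"
    "Trusted_Connection=yes"  # placeholder connection string
)
cursor = conn.cursor()

url = GRAPH_URL
while url:
    resp = requests.get(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"}, timeout=30)
    resp.raise_for_status()
    payload = resp.json()

    # Upsert each record into a hypothetical staging table using MERGE semantics,
    # so reruns stay incremental instead of duplicating rows.
    for user in payload.get("value", []):
        cursor.execute(
            """
            MERGE dbo.graph_users AS tgt
            USING (SELECT ? AS id, ? AS display_name, ? AS mail) AS src
            ON tgt.id = src.id
            WHEN MATCHED THEN UPDATE SET display_name = src.display_name, mail = src.mail
            WHEN NOT MATCHED THEN INSERT (id, display_name, mail)
                 VALUES (src.id, src.display_name, src.mail);
            """,
            user["id"], user.get("displayName"), user.get("mail"),
        )

    conn.commit()
    # Graph returns @odata.nextLink while more pages remain; loop until exhausted.
    url = payload.get("@odata.nextLink")

cursor.close()
conn.close()
```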
First Citizens Bank, Remote, NC
Data Engineer
January 2023 – May 2024
Installed, configured, and maintained data pipelines, leveraging GCP services such as Dataproc, Cloud Storage, and BigQuery.
Worked on ETL (Extract, Transform, Load) testing methodologies and processes.
Built Python utilities to automate data archival, logging, and performance monitoring.
Monitored data engines to define data requirements and data acquisitions from both relational and non-relational databases.
Experienced in deploying and managing data processing clusters with Google Dataproc, leveraging its scalability and automation features for large-scale data analysis.
Developed serverless event-driven workflows on Google Cloud Functions, streamlining data processing and reducing infrastructure complexity.
Developed and deployed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation.
Integrated BTEQ scripts into enterprise ETL workflows to handle pre-load/post-load validation and data quality checks.
Implemented access controls and role-based privileges in Teradata to ensure secure access to sensitive datasets.
Designed and managed ELT pipelines using Airflow, Python, DBT, and Stitch Data, ensuring efficient data processing (see the sketch at the end of this role).
Built real-time messaging systems with Pub/Sub for event-driven architectures, optimizing asynchronous communication.
Designed and developed complex stored procedures in SQL to automate data processing tasks.
Implemented stored procedures to handle ETL processes, ensuring data integrity and consistency across multiple databases.
Monitored system performance and troubleshot issues using shell scripting solutions.
Automated data processing workflows using Oozie and developed custom UDFs in Python for complex data transformations.
Environment: GCP (Dataproc, BigQuery, Cloud Storage, Pub/Sub), PySpark, Airflow, Oozie, Python, SQL Server, Tableau
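A minimal Airflow sketch of the kind of ELT orchestration described in this role; the DAG id, schedule, dbt paths, and validation step are hypothetical placeholders rather than details of the actual pipelines.

```python
# Illustrative sketch only: DAG id, schedule, project paths, and checks are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def validate_row_counts(**_):
    # Placeholder validation hook: a real pipeline would query the warehouse
    # (e.g. BigQuery or Snowflake) and compare row counts against the source.
    print("row-count validation passed")


with DAG(
    dag_id="elt_dbt_pipeline_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_load = BashOperator(
        task_id="stitch_sync_trigger",
        bash_command="echo 'trigger extract/load job here'",  # placeholder command
    )

    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/analytics --profiles-dir /opt/dbt",
    )

    validate = PythonOperator(
        task_id="validate_row_counts",
        python_callable=validate_row_counts,
    )

    # Extract/load, then transform with dbt, then validate the results.
    extract_load >> dbt_run >> validate
```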
Model N India Pvt Ltd
Data Engineer
October 2021 – July 2022
Migrated on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Dataproc, Google Cloud Storage, and Composer.
Managed real-time data ingestion from on-premises systems to GCP using Cloud Pub/Sub and Cloud SQL, and extended similar ingestion pipelines into Snowflake.
Designed and optimized Teradata BTEQ scripts for batch processing and data transformation jobs.
Wrote parameterized BTEQ scripts to dynamically generate SQL queries for flexible and reusable job executions.
Gained hands-on experience managing the Python and Spark contexts when writing PySpark programs for ETL.
Developed Python-based test harnesses for unit testing and data pipeline validation.
Automated data engineering tasks in Databricks using Python and SQL, reducing manual effort and improving operational efficiency.
Conducted performance tuning and troubleshooting of Databricks clusters, ensuring optimal resource utilization and minimal downtime.
Carried out data transformation and cleansing using SQL queries, Python, and PySpark (illustrative sketch at the end of this role).
Developed data ingestion processes for maintaining the Global Data Lake on GCP and BigQuery.
Led the migration of legacy applications to GCP, deploying artifacts on the GCP platform.
Secured stored procedures with appropriate access controls, preventing unauthorized access to sensitive data and ensuring compliance with security standards.
Implemented modular data transformation workflows using DBT in both GCP and Snowflake environments, reducing pipeline complexity.
Implemented RBAC (role-based access control) and column-level masking in Snowflake to restrict access to customer financial data, adhering to GLBA and internal security protocols.
Orchestrated end-to-end data pipelines using Apache Airflow in GCP Composer and integrated them with Snowflake tasks for scheduling.
Developed and maintained Scala applications executed on the Google Cloud Platform.
Designed and modeled datasets with Power BI Desktop based on the measures and dimensions requested by customers and dashboard needs.
Environment: GCP (BigQuery, Cloud Storage, Pub/Sub, Composer), Databricks, PySpark, DBT, SQL Server, Apache Airflow, Snowflake.
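A minimal PySpark sketch of the kind of GCS-to-BigQuery transformation described in this role, assuming the spark-bigquery connector is available on the Dataproc cluster; bucket, dataset, and column names are hypothetical.

```python
# Illustrative sketch only: bucket, dataset, and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim, to_date

spark = (SparkSession.builder
         .appName("gcs-to-bigquery-etl-sketch")
         .getOrCreate())

# Read raw CSV files landed in a (hypothetical) GCS bucket.
raw = (spark.read
       .option("header", "true")
       .csv("gs://example-landing-bucket/sales/*.csv"))

# Basic cleansing: trim strings, cast dates, drop records missing required keys.
cleaned = (raw
           .withColumn("customer_id", trim(col("customer_id")))
           .withColumn("sale_date", to_date(col("sale_date"), "yyyy-MM-dd"))
           .dropna(subset=["customer_id", "sale_date"]))

# Write to BigQuery via the spark-bigquery connector, staging through a temp GCS bucket.
(cleaned.write
 .format("bigquery")
 .option("table", "example_project.analytics.sales_cleaned")
 .option("temporaryGcsBucket", "example-temp-bucket")
 .mode("overwrite")
 .save())
```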
HSBC, Hyderabad, India
Intern Member / Junior Data Engineer
Jan 2020 – August 2021
Collaborated with senior data engineers to assist in the development and testing of SQL queries for financial data extraction and reporting.
Designed and implemented end-to-end ETL pipelines using Azure Data Factory (ADF) to extract data from diverse sources including on-prem SQL, Azure Blob Storage, REST APIs, and SaaS platforms.
Migrated legacy ETL workflows from SSIS/Informatica to Azure Data Factory, reducing infrastructure cost and improving scalability.
Integrated ADF with Azure SQL DB, Azure Synapse Analytics, and Azure Data Lake Gen2 to enable modern data warehouse solutions.
Participated in data cleaning and preprocessing activities using Python and Excel to ensure accuracy and consistency across internal reports (illustrative sketch at the end of this role).
Secured pipeline configurations and connections using Azure Key Vault to store and manage secrets.
Assisted in building and maintaining data pipelines for daily ingestion tasks using Azure Data Factory (ADF), Apache Airflow, and SQL Server in a cloud-native Azure environment.
Worked with data visualization tools like Power BI to create dashboards that supported business unit KPIs.
Learned and applied Snowflake basics, including writing simple queries, exploring data schemas, and understanding access controls.
Designed optimized star/snowflake schema models to support analytics and reporting use cases.
Designed and modeled datasets using Power BI Desktop based on business KPIs and dashboard requirements, sourcing data from GCP and Snowflake.
Conducted root cause analysis on pipeline failures using Snowflake query profiles, logs, and metadata tables.
Documented project learnings, data workflows, and best practices using Confluence for knowledge sharing within the team.
Participated in weekly Agile standups and sprint reviews, contributing to the team's overall planning and progress tracking.
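A minimal Python (pandas) sketch of the kind of data cleaning and preprocessing described in this role; the file names and column names are hypothetical placeholders.

```python
# Illustrative sketch only: file paths and column names are hypothetical.
import pandas as pd

# Load a raw extract as delivered for internal reporting.
df = pd.read_csv("raw_transactions.csv")

# Standardize column names and trim whitespace in string fields.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
str_cols = df.select_dtypes(include="object").columns
df[str_cols] = df[str_cols].apply(lambda s: s.str.strip())

# Coerce dates and amounts, routing rows that fail conversion to a reject file
# instead of dropping them silently.
df["txn_date"] = pd.to_datetime(df["txn_date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
bad_rows = df[df["txn_date"].isna() | df["amount"].isna()]
bad_rows.to_csv("rejected_rows.csv", index=False)

# Keep clean records and remove exact duplicates before publishing to reporting.
clean = df.dropna(subset=["txn_date", "amount"]).drop_duplicates()
clean.to_csv("clean_transactions.csv", index=False)
```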
EDUCATION:
Master’s in Computing and Information Systems (Youngstown State University), Aug 2022 – Dec 2023
Bachelor’s in Electronics and Communication Engineering (Lovely Professional University), Aug 2017 – May 2021