Puneeth Cherukuri — Senior Data Engineer
929-***-**** ******************@*****.***
PROFESSIONAL SUMMARY:
Around 5 years of experience in Data Engineering and Data Warehousing, specializing in robust data solutions and architecture.
Expertly implement, configure, and manage complex Linux-based processes and infrastructure for critical enterprise data warehousing operations.
Proven ability to identify and implement strategic system and architecture improvements, enhancing overall data ecosystem performance and reliability.
Highly skilled in enhancing various Linux-based toolsets, shell scripts, scheduled jobs, and processes for optimized data flows and automation.
Adept at developing and enhancing ETL and database load/extract processes, ensuring efficient data movement and transformation for large datasets.
Hands-on experience setting up Linux environments, writing extensive shell scripts, and managing Unix file systems efficiently.
Proficient in Python for advanced data transformation and automation, with practical knowledge of Perl for legacy system integrations and scripting.
Extensive experience with relational databases, including Oracle Exadata, for designing and optimizing high-performance data warehouses.
Strong understanding and application of Agile methodology for iterative development and continuous delivery of data warehouse solutions.
Passionate about automation and driving continual process improvement to streamline data operations, reduce manual effort, and enhance data quality.
Skilled in orchestration using tools like Apache Airflow and experienced with enterprise ETL tools, including Informatica for complex data integration.
Committed to delivering high-quality, scalable data warehouse solutions with excellent written and oral communication skills for stakeholder collaboration.
WORK EXPERIENCE:
Data Engineer @ Corewell Health, Grand Rapids, MI | Sep 2023 – Present
Implemented and managed robust Linux-based processes and infrastructure, specifically configuring instances within AWS for data warehousing workloads.
Designed and optimized end-to-end data pipelines leveraging AWS S3, Glue, EMR, and Redshift, focusing on enhanced ETL/database load processes.
Developed complex PySpark jobs on EMR for processing large-scale clinical and claims datasets, ensuring high performance and scalability.
Enhanced various Linux-based toolsets, shell scripts, and jobs to automate data ingestion from diverse sources into the AWS data lake architecture.
Performed advanced data transformations and aggregations using Spark SQL and AWS Glue, optimizing data models for analytical consumption.
Migrated critical healthcare data from on-prem Oracle databases to AWS S3, implementing efficient shell scripts for data extraction and loading.
Implemented stringent data quality checks and validation frameworks using Python, ensuring data integrity across the data warehouse.
Collaborated with cross-functional teams to identify and implement system/architecture improvements, enhancing overall data flow and reporting capabilities.
Technologies Used: AWS S3, AWS Glue, AWS EMR, AWS Redshift, PySpark, Python, Shell Scripting, Oracle, Linux, SQL, CloudWatch, Agile
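The Python data-quality checks mentioned above can be illustrated with a minimal sketch; the field names, sample rows, and error format here are hypothetical placeholders, not the actual validation framework:

```python
# Minimal sketch of a record-level data-quality check, assuming rows arrive
# as dictionaries (e.g. parsed from a database extract or S3 file).
# The required fields below are illustrative for clinical/claims data.
REQUIRED_FIELDS = ["patient_id", "claim_id", "service_date"]

def validate_rows(rows):
    """Return (valid_rows, errors); each error is (row_index, field, reason)."""
    valid, errors = [], []
    for i, row in enumerate(rows):
        row_errors = [
            (i, field, "missing or null")
            for field in REQUIRED_FIELDS
            if row.get(field) in (None, "")
        ]
        if row_errors:
            errors.extend(row_errors)  # quarantine the row, keep the reasons
        else:
            valid.append(row)
    return valid, errors

rows = [
    {"patient_id": "P1", "claim_id": "C1", "service_date": "2024-01-05"},
    {"patient_id": "P2", "claim_id": None, "service_date": "2024-01-06"},
]
valid, errors = validate_rows(rows)
```

In a pipeline, checks like this typically run before the warehouse load so that failing records can be quarantined and reported rather than silently dropped.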
Data Engineer @ Huntington National Bank, Columbus, OH | Oct 2021 – Aug 2023
Developed scalable ETL pipelines using Azure Data Factory, integrating Linux-based systems for robust financial transaction data processing.
Processed structured and semi-structured data using Azure Databricks (PySpark), implementing enhanced Linux-based toolsets and scripts for data cleansing.
Designed and implemented data models in Azure Synapse Analytics for analytical workloads, focusing on data warehousing best practices.
Enhanced ETL and database load/extract processes, integrating data from SQL Server, external APIs, and on-prem Oracle sources into Azure Data Lake.
Executed complex transformations using Spark SQL and Python, optimizing shell scripts for efficient data flow and performance within the Azure ecosystem.
Utilized Informatica PowerCenter for critical data integration tasks, ensuring seamless data migration and synchronization across platforms.
Supported stringent data reconciliation and audit requirements for regulatory compliance, leveraging automation through shell scripting and data quality checks.
Participated actively in Agile ceremonies, contributing to system/architecture improvements and documentation using Confluence for data warehousing initiatives.
Technologies Used: Azure Data Factory, Azure Databricks, Azure Synapse Analytics, PySpark, Python, Shell Scripting, SQL Server, Oracle, Informatica, Linux, Power BI, Jenkins, GitHub, Agile
Junior Data Engineer @ Caterpillar, Irving, TX | Jun 2020 – Sep 2021
Designed and implemented batch data pipelines using Google Cloud Storage and BigQuery, focusing on efficient data warehousing principles.
Developed robust data processing jobs using PySpark on Dataproc, incorporating Linux-based processes and scripts for automation.
Ingested manufacturing and sensor data from diverse sources, performing data cleansing and transformations using Spark and Python.
Implemented data migration from on-prem SQL Server and Oracle databases to BigQuery, leveraging shell scripts for data extraction.
Built partitioned and clustered BigQuery tables for performance optimization, enhancing ETL and database load/extract processes effectively.
Utilized Apache Airflow (Composer) with Python scripting for advanced workflow orchestration, ensuring reliable and automated data flows.
Supported analytics use cases for supply chain and operations teams, optimizing SQL queries and data structures within the data warehouse.
Maintained version control for all code using GitHub, following standard SDLC practices for development and deployment in a Linux environment.
Technologies Used: Google Cloud Storage, Google BigQuery, Apache Spark, PySpark, Python, Shell Scripting, Apache Airflow, Oracle, SQL Server, Linux, Dataproc, GitHub, SDLC
TECHNICAL SKILLS:
Programming & Scripting: Python, Shell Scripting, Perl, SQL
Data Warehousing & Databases: Oracle Exadata, Oracle, Snowflake, PostgreSQL, MS SQL Server, Azure Synapse Analytics, AWS Redshift, Google BigQuery
ETL & Data Processing: Informatica, Azure Data Factory, Apache Airflow, AWS Glue, Apache Spark, Databricks, AWS EMR, Kafka
Cloud Platforms: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)
Operating Systems & DevOps: Linux, Unix, Docker, Jenkins, GitHub, CI/CD
Data Visualization & Methodologies: Power BI, Tableau, Agile SDLC
EDUCATION:
Master's in Data Science @ Saint Peter’s University