
Data Engineer North Carolina

Location: Charlotte, NC
Salary: 90,000/year
Posted: March 24, 2025


Resume:

RAM SAI NIMMAGADDA

Phone: 704-***-**** | Email | LinkedIn

PROFESSIONAL SUMMARY

Experienced Data Engineer with over 5 years in ETL development, cloud technologies, and data processing. Proficient in Azure Databricks, Hadoop, Hive, and AWS for managing large datasets, optimizing pipelines, and ensuring ACID compliance. Strong in performance tuning for Spark and SQL, with expertise in Medallion Architecture and Delta Lake. Skilled in data visualization using Tableau, Power BI, and D3.js to provide actionable insights. Proven ability to work in Agile environments, automate CI/CD pipelines, and enhance team collaboration.

PROFESSIONAL EXPERIENCE

University of North Carolina at Charlotte - Charlotte, North Carolina | January 2024 - December 2024
Graduate Data Engineer | Hadoop, Databricks, Spark Structured Streaming, Delta Lake, Unity Catalog, Azure Synapse Analytics

Built an ETL pipeline using Azure Databricks to migrate student enrollment data from on-premises to the cloud, incorporating deduplication and implementing validation checks (e.g., graduation year) to ensure data quality during movement.

Implemented a data migration pipeline using Azure Data Factory, leveraging Data Flows and Databricks for efficient incremental data ingestion from on-premises to the cloud.

Performed Spark performance tuning using Adaptive Query Execution (AQE), memory management, join optimizations for large datasets, caching, sort and hash aggregation strategies, and dynamic resource allocation.
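
A rough illustration of the kind of tuning settings described above; the configuration values, paths, and table names are assumptions for the sketch, not the actual project configuration:

    from pyspark.sql import SparkSession

    # Assumes a Databricks/Spark environment; on Databricks `spark` already exists
    spark = SparkSession.builder.appName("enrollment-etl").getOrCreate()

    # Adaptive Query Execution: coalesce shuffle partitions and handle skewed joins at runtime
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    # Broadcast small dimension tables to avoid shuffling the large side (threshold is illustrative)
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))

    # Cache a frequently reused dataset instead of recomputing it in every action
    enrollments = spark.read.format("delta").load("/mnt/silver/enrollments")  # hypothetical path
    enrollments.cache()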

Utilized Delta Lake to implement ACID properties and time travel capabilities, enabling seamless integration of new student data with historical records.

Designed and implemented Medallion Architecture by integrating data from multiple sources, including Hadoop, MySQL, and S3, into Azure Data Lake Storage (ADLS Gen2). Utilized Delta Lake to establish a robust processing layer, efficiently merging incremental updates with existing data to ensure high-quality, ready-to-serve datasets for BI and analytics use cases.
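
A minimal sketch of the incremental-merge step in such a silver layer, assuming an active SparkSession named spark; storage paths, table keys, and column names are hypothetical:

    from delta.tables import DeltaTable

    # New batch landed in the bronze layer (e.g., ingested from Hadoop, MySQL, or S3)
    updates = spark.read.format("delta").load("abfss://bronze@storageacct.dfs.core.windows.net/students")

    silver = DeltaTable.forPath(
        spark, "abfss://silver@storageacct.dfs.core.windows.net/students"
    )

    # ACID upsert: merge incremental updates into the existing silver table
    (silver.alias("tgt")
        .merge(updates.alias("src"), "tgt.student_id = src.student_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Delta time travel: read an earlier version of the table for auditing or backfill
    previous = (spark.read.format("delta")
        .option("versionAsOf", 0)
        .load("abfss://silver@storageacct.dfs.core.windows.net/students"))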

Experienced in using Unity Catalog for metastore creation across different notebooks and managing user provisioning through Azure Active Directory.

Utilized Hive as a metadata catalog in Databricks, with hands-on experience configuring cluster policies to set resource limitations for different teams during cluster creation.

Extensively used Azure Key Vault for storing storage-related secrets, and used it to create secret scopes in Azure Databricks as well as credentials in Azure Data Factory during dataset creation.
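
For reference, reading a Key Vault-backed secret from a Databricks secret scope looks roughly like this; the scope, key, and storage account names are made up for illustration (dbutils is available in Databricks notebooks):

    # Secret scope "kv-scope" is backed by Azure Key Vault; names are hypothetical
    storage_key = dbutils.secrets.get(scope="kv-scope", key="adls-access-key")

    spark.conf.set(
        "fs.azure.account.key.storageacct.dfs.core.windows.net",  # illustrative account
        storage_key,
    )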

Experienced in Azure Synapse Analytics, using PolyBase to efficiently ingest and query large datasets. Generated interactive graphs and visualizations to analyze student enrollment trends across different courses and streams, improving insights into enrollment patterns by 30%.

Worked in an Agile environment, used GitHub for version control, and used Jenkins as a CI/CD tool for code deployment.

Experienced in implementing star schema for student enrollment data, creating central tables for different departments and performing joins to efficiently retrieve student information by department.

Used Confluent Kafka for real-time ingestion of student course registration data from JSON files, streaming it to Databricks for data segregation and staff allocation for various coursework.

Optimized and built efficient streaming pipelines by ingesting data from Kafka as a source, leveraging triggers and watermarking techniques to enhance checkpoint utilization and ensure reliable data processing.
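
A condensed sketch of such a streaming pipeline; the broker, topic, schema, and paths are assumptions for illustration:

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    schema = StructType([
        StructField("student_id", StringType()),
        StructField("course_id", StringType()),
        StructField("event_time", TimestampType()),
    ])

    raw = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # illustrative broker
        .option("subscribe", "course-registrations")        # illustrative topic
        .option("startingOffsets", "latest")
        .load())

    registrations = (raw
        .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
        .select("r.*")
        .withWatermark("event_time", "10 minutes"))          # bound state for late data

    query = (registrations.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/registrations")  # hypothetical path
        .trigger(processingTime="1 minute")                  # micro-batch trigger
        .outputMode("append")
        .start("/mnt/silver/registrations"))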

Worked collaboratively with cross-functional teams, Data Scientists, software developers, and stakeholders, leveraging Agile, Scrum, and Kanban methodologies for user-centered designs, code reviews, and process improvements.

Cognizant Technology Solutions (Client: United Health Care) – Hyderabad, India | Jan 2022 – July 2023
Data Engineer | Tableau, Snowflake, Hive, PySpark, Hadoop, SQL, Airflow, GitHub, Jira, Jenkins

Managed an on-premises Spark cluster with a capacity of 2 TB, optimized to handle up to 5 TB of data, specifically for processing and analyzing insurance data, ensuring efficient data processing and scalability.

Managed task allocation using YARN as the resource manager and optimized Spark job performance by analyzing DAG creation and fine-tuning configurations through Spark UI, ensuring efficient resource utilization and improved job execution

Skilled in optimizing Spark jobs by addressing partition skew and fine-tuning configurations for improved performance and efficient handling of large-scale data workloads.

Worked in a Hadoop environment for data curation, utilizing Hadoop commands for efficient data retrieval. Used Hadoop as a data lake to collect and store large volumes of insurance data, ensuring streamlined data management and accessibility.

Proficient in using Hive's command-line interface (CLI) for efficient data crawling and management within a Hadoop cluster, ensuring seamless data retrieval and processing.

Utilized Hive as a transactional system for data manipulation operations on insurance data, implementing ACID compliance to ensure data integrity and consistency during processing.

Experienced in optimizing Hive jobs for large data workloads through effective partitioning, bucketing, and implementing join optimization techniques to improve query performance and resource utilization
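
As a rough illustration of the partitioning, bucketing, and join-optimization approach; the database, table, column names, and bucket count are assumptions:

    from pyspark.sql import functions as F

    claims_df = spark.read.table("raw.claims")          # hypothetical source table
    providers_df = spark.read.table("raw.providers")    # hypothetical dimension table

    # Partition by month and bucket by member_id so partition pruning and
    # bucketed joins can be used by downstream queries
    (claims_df.write
        .partitionBy("claim_month")
        .bucketBy(32, "member_id")
        .sortBy("member_id")
        .format("parquet")
        .mode("overwrite")
        .saveAsTable("curated.claims"))

    # Join optimization: broadcast the small dimension table into the large fact scan
    enriched = claims_df.join(F.broadcast(providers_df), "provider_id")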

Experienced with Boyce-Codd Normal Form (BCNF) for efficient database design, and with partitioning and indexing strategies for database optimization.

Experienced with Tableau Prep for joining data from different sources and preparing it for Tableau visualizations.

Performed data operations on insurance data at United Healthcare using Tableau, generating various visualizations such as bar charts, pie charts, line graphs, and heat maps to derive insights for key business use cases and support data-driven decision-making.

Administered and maintained MySQL and other relational database systems, ensuring high availability and performance.

Utilized PL/SQL and Shell Scripting to automate routine database management tasks

Proficient in SQL window functions including RANK, DENSE_RANK, SUM, AVG, and ROW_NUMBER for efficient analytical computations, ranking, and aggregations over partitioned datasets.
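
For example, equivalent window-function logic expressed in PySpark over a hypothetical claims dataset (table and column names are assumptions):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    claims = spark.read.table("curated.claims")  # hypothetical table

    by_member = Window.partitionBy("member_id").orderBy(F.col("claim_amount").desc())
    member_all = Window.partitionBy("member_id")

    ranked = (claims
        .withColumn("rank", F.rank().over(by_member))
        .withColumn("dense_rank", F.dense_rank().over(by_member))
        .withColumn("row_number", F.row_number().over(by_member))
        .withColumn("member_total", F.sum("claim_amount").over(member_all))
        .withColumn("member_avg", F.avg("claim_amount").over(member_all)))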

Worked collaboratively with cross-functional teams, Data Scientists, software developers, and stakeholders, leveraging Agile, Scrum, and Kanban methodologies for user-centered designs, code reviews, and process improvements.

Defined and implemented key performance indicators (KPIs) and data-flow success metrics to measure data pipeline efficiency, reducing ETL failures by 30%.

Utilized Power BI to integrate data from various sources, transforming it into actionable insights through custom visualizations, including bar charts, pie charts, and heatmaps.

Designed and implemented dynamic dashboards in Power BI to track key performance indicators (KPIs), providing real-time insights into business operations and improving decision-making processes.

Cognizant Technology Solutions (Client: Discover Banking and Financial Services) – Hyderabad | Jan 2020 – December 2021
Data Engineer | S3, EMR, Redshift, Athena, Glue, Hive, PySpark, Hadoop, SQL, Confluent Kafka, Airflow, GitHub, Jira

Experienced with AWS Athena for serverless querying; used AWS Glue to perform ETL operations and metadata crawling.

Used AWS Glue as a metadata crawler to optimize querying in AWS Athena. Built scalable ETL pipelines using both Visual ETL and custom PySpark code in Glue notebooks for efficient data transformation.

Used GitHub as a version control system and worked in an Agile environment with Jira as a ticketing tool for sprint planning.

Utilized AWS Athena to query data stored in S3, which was ingested through Airflow DAGs. Optimized data by partitioning it for faster querying based on business use cases and downstream team requirements.

Implemented validation and data quality checks in ETL pipelines in AWS EMR, reducing bad data ingestion by 30% and improving data accuracy by 25%.
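
A simplified version of such a validation step; the bucket names, columns, and quality rules are assumptions for the sketch:

    from pyspark.sql import functions as F

    transactions = spark.read.parquet("s3://raw-bucket/transactions/")  # hypothetical bucket

    # Basic quality rules: required keys present, amounts non-negative, dates parseable
    valid_rule = (
        F.col("transaction_id").isNotNull()
        & F.col("customer_id").isNotNull()
        & (F.col("amount") >= 0)
        & F.to_date("transaction_date", "yyyy-MM-dd").isNotNull()
    )

    good = transactions.filter(valid_rule)
    bad = transactions.filter(~valid_rule)

    # Quarantine bad records instead of loading them downstream
    bad.write.mode("append").parquet("s3://quarantine-bucket/transactions/")
    good.write.mode("append").parquet("s3://clean-bucket/transactions/")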

Designed test cases for ETL pipelines, involving complex aggregations to separate data for rapidly growing verticals like retail, based on the fastest-growing regions of customer spending.

During my tenure with the Cards business, designed an ETL pipeline to track newly enrolled customer data, partitioning it by region and customer spending, which enabled personalized offers to be rolled out efficiently.

Utilized Apache Airflow to schedule and automate ETL pipelines, including AWS Glue jobs for data transformation and metadata crawling. Orchestrated workflows that integrated with AWS Athena for querying data and optimized the data pipeline process across different stages, ensuring seamless data movement and processing.
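
A bare-bones sketch of such an orchestration DAG, using operators from the Airflow Amazon provider package; the DAG id, schedule, Glue job name, database, and buckets are hypothetical:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
    from airflow.providers.amazon.aws.operators.athena import AthenaOperator

    with DAG(
        dag_id="card_enrollments_etl",           # illustrative DAG id and schedule
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        transform = GlueJobOperator(
            task_id="run_glue_transform",
            job_name="enrollments-transform",    # hypothetical Glue job
        )

        validate = AthenaOperator(
            task_id="validate_partitions",
            query="SELECT count(*) FROM analytics.enrollments WHERE load_date = '{{ ds }}'",
            database="analytics",
            output_location="s3://athena-results-bucket/",   # hypothetical bucket
        )

        transform >> validate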

Optimized data querying in AWS Athena by leveraging partitioning and bucketing to efficiently query data from AWS S3, Hadoop, and Azure Blob Storage, reducing scan time by 75% and costs by 80%.
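
To show the idea behind that optimization, a partition-pruned Athena query can be issued via boto3 roughly as follows; the database, table, partition columns, and buckets are illustrative assumptions:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Filtering on partition columns (region, txn_date) limits the S3 data Athena scans
    response = athena.start_query_execution(
        QueryString="""
            SELECT region, SUM(amount) AS total_spend
            FROM analytics.customer_spend
            WHERE region = 'midwest' AND txn_date = DATE '2021-06-01'
            GROUP BY region
        """,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/"},  # hypothetical
    )
    query_id = response["QueryExecutionId"]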

Experienced in AWS Redshift for efficient data querying, utilizing sort keys and distribution keys to optimize performance and enhance query speed.

Integrated Tableau with Amazon Redshift, reducing data reporting time by 40% and enabling data-driven decisions and delivering actionable insights to technical and non-technical stakeholders.

Developed AWS Lambda functions for serverless data processing and event-driven automation, configuring Amazon Redshift clusters for analysis and reporting on large-scale data sets.
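
A bare-bones sketch of such an event-driven Lambda; the trigger (an S3 put event), bucket names, and downstream action are assumptions:

    import json
    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        """Triggered by an S3 put event; copies the new object into a staging prefix
        so downstream Redshift COPY / analysis jobs can pick it up."""
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            s3.copy_object(
                Bucket="staging-bucket",                      # hypothetical target bucket
                Key=f"incoming/{key}",
                CopySource={"Bucket": bucket, "Key": key},
            )

        return {"statusCode": 200, "body": json.dumps("processed")}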

Automated workflows of CI/CD pipelines with Jenkins, GitLab CI, and AWS Code Pipeline, improving deployment efficiency by 60% and ensuring seamless integration of data engineering solutions.

Assisted in the development of data governance best practices, classifications, and metadata management, ensuring compliance with security and data protection policies and improving data quality across manufacturing data solutions.

EDUCATION

Master of Science, Information Technology | August 2023 - December 2024
University of North Carolina at Charlotte - Charlotte, North Carolina | GPA: 3.86/4.00
Coursework: Database Systems, Applied Databases, Knowledge Discovery in Databases, Visual Analytics, Big Data Analytics

SKILLS

Programming Languages: C, Python, Java, SQL (intermediate)

Big Data Technologies: Spark, Hadoop, Hive, Spark Structured Streaming, Confluent Kafka

Database Engines: MySQL, NoSQL, MSSQL Server (T-SQL), Oracle (PL/SQL), MongoDB, SQLite

Cloud and other tools: ADLS Gen2, Azure Blob Storage, Azure Databricks, Azure Data Factory, Unity Catalog, Auto Loader, Delta Live Tables, Azure SQL, Azure Key Vault, Azure Synapse Analytics, Tableau, Snowflake, Redshift, Redshift Spectrum, EMR, Athena, Glue, Lambda, S3, Star Schema, Snowflake Schema

PROJECTS

Project Title: Analyzing Greenhouse Gas Emissions in the Agri-Food Sector

This project explores the environmental impact of the agri-food industry by analyzing CO2 emissions using data from the FAO and IPCC. With agri-food activities responsible for about 62% of global emissions, we focused on two key questions: (1) How do agricultural activities impact emissions? (2) What is the relationship between population demographics and emissions? We collected, cleaned, and analyzed the data, creating visualizations to highlight emission trends. While we faced challenges with data filtering and building specific charts in Tableau, our team is refining these methods to deliver clear insights. The outcome will include interactive visualizations and a comprehensive report on our findings.

Tableau visualization: https://public.tableau.com/app/profile/joe.sunshine/viz/shared/2N3HYPPWJ

Developed interactive visualizations in D3.js, replicating the analytical insights from the Tableau dashboards and ensuring a dynamic and customizable representation of greenhouse gas emissions data.

GitHub repository: https://github.com/JoeSunshine02/TeamFourFinal

Certifications

Databricks Certified Data Engineer Associate


