Praveen Kumar Akkapelly
Data Engineer
Texas, USA | 640-***-**** | *************@*****.*** | LinkedIn
SUMMARY
Data Engineer with 5+ years of experience in healthcare and finance sectors, proficient in building robust data pipelines using Python and SQL to manage and transform complex datasets, ensuring seamless data flow across systems.
Skilled in cloud platforms, including AWS and Azure, with hands-on experience in services such as AWS EMR, S3, Redshift, Azure Data Lake, and Azure Data Factory to design and optimize scalable data solutions.
Experienced with big data frameworks such as Apache Spark and Hadoop, enabling efficient handling of large data volumes and advanced analytics.
Strong knowledge of ETL processes and tools like Apache Airflow, Talend, and Informatica, ensuring data quality, reliability, and consistency across multiple sources.
Proficient in data warehousing and database management systems, including Redshift, Snowflake, and MySQL, to support secure storage and retrieval of structured and unstructured data.
Expertise in data governance, data security, and regulatory compliance standards such as HIPAA for healthcare data and SOX for finance, ensuring privacy and compliance.
Hands-on experience with data visualization tools like Tableau and Power BI to create insightful dashboards and reports for stakeholders, supporting data-driven decision-making.
Adept at performance tuning and optimizing query performance for large datasets, ensuring high efficiency and scalability in data processing and retrieval operations.
SKILLS
Methodologies: SDLC, Agile, Waterfall
Programming Languages: Python, SQL, R
Packages: Scikit-Learn, ggplot2, Pandas, NumPy, Matplotlib, SciPy, Seaborn
Visualization Tools: Tableau, Power BI, Advanced Excel (Pivot Tables, VLOOKUP)
IDEs: Visual Studio Code, PyCharm, Jupyter Notebook, IntelliJ
Databases: MySQL, MongoDB, PostgreSQL, Oracle, Teradata
Cloud Technologies: Google Cloud Platform, AWS, Azure
Other Technical Skills: Google Analytics, DAX, SAS, JIRA, SAP, SSIS, SSRS, Machine Learning Algorithms, Mathematics, Probability Distributions, Confidence Intervals, Hypothesis Testing, Regression Analysis, Linear Algebra, Advanced Analytics, Data Mining, Data Visualization, Data Warehousing, Data Transformation, Data Storytelling, Association Rules, Clustering, Classification, Regression, A/B Testing, Forecasting & Modeling, Data Cleaning, Data Wrangling, Process Mapping, Solution Oriented, Ad Hoc Analysis, Project Management, Data Presentation, Requirement Gathering, Root Cause Analysis, Data Sets, Data Modules, Quantitative Analytics, Apache Spark, Apache Hadoop, Apache Kafka, Apache Beam, ETL/ELT, PySQL, PySpark, Docker, dbt, Big Data & AI Integration, PCI DSS, Data Lineage & Masking, RBAC, CI/CD Pipelines
Version Control Tools: Git, GitHub
Operating Systems: Windows, Linux, macOS
WORK EXPERIENCE
Data Engineer | Cerner Corporation, TX, USA | August 2022 - Present
Developed ETL pipelines using Apache Spark and Python to process and transform large volumes of healthcare data (patient records, medical histories, treatment plans, lab results) from multiple sources into a unified format, improving data consistency and accuracy for better patient care and operational efficiency.
Used SQL and Hive to design optimized queries and data models for healthcare data analysis, enabling efficient retrieval and integration of large-scale data from various Electronic Health Record (EHR) systems and medical devices. Streamlined data access across platforms, enhancing reporting efficiency and decision-making for clinical teams, hospitals, and healthcare providers.
Reduced patient care discrepancies by 15% by leveraging data-driven predictions to optimize patient management and resource allocation, leading to improved healthcare outcomes and more effective treatment planning.
Optimized healthcare data processing workflows by implementing automated data pipelines that streamlined patient data flow, ensuring timely delivery of actionable insights to healthcare professionals, risk management teams, and hospital operations.
Developed predictive models using TensorFlow and Scikit-learn to forecast patient health outcomes, hospital admission rates, and disease progression, helping healthcare providers optimize treatment strategies and improve patient care.
Designed an ER diagram to map out key entities such as Patient, Treatment, Appointment, Medical_Record, and Diagnosis, ensuring clear relationships (one-to-many between Patient and Treatment). This improved data structure efficiency by 30%, enabling faster query processing and smoother integration of healthcare data across systems.
Utilized Azure Data Factory to design and orchestrate ETL (Extract, Transform, Load) pipelines for processing and transforming large volumes of healthcare data. This helped to integrate patient records, medical histories, treatment plans, and lab results from multiple sources, ensuring data consistency and accuracy across systems.
Employed Airflow’s scheduling capabilities to execute periodic healthcare data pipeline runs, ensuring up-to-date data processing, especially for time-sensitive tasks such as patient health monitoring, hospital bed management, and treatment outcome forecasting.
Created interactive Power BI dashboards to visualize healthcare data, patient trends, and predictive model results, enabling medical teams to easily monitor key metrics such as patient performance, hospital utilization, and health outcomes.
Took advantage of Snowflake’s automatic scaling to handle fluctuating data processing workloads, enabling on-demand resource allocation for peak times, such as during health crises or high patient volume periods.
Leveraged Azure Databricks to create scalable data pipelines capable of processing large-scale healthcare data from multiple sources, ensuring timely and accurate delivery of actionable insights to clinical teams and hospital operations.
Executed role-based access control (RBAC) to restrict access to sensitive healthcare data based on user roles, ensuring that only authorized medical and administrative personnel could access patient records, in line with compliance regulations such as HIPAA.
Applied Agile Scrum methodology by organizing the project into 2-week sprints, delivering incremental updates on data pipeline development, predictive model improvements, and integration of healthcare data systems for Cerner clients.
Data Engineer | Wipro, India | Feb 2019 - Aug 2021
Designed the process flow and data flow for the existing data ingestion system.
Analyzed conversion data, including mapping from source to target database schemas, defining data elements, and building data extraction scripts for data conversion in test and production environments.
Developed production-ready Spark applications employing the Spark RDD, DataFrame, Spark SQL, Spark ML, and Spark Streaming APIs.
Utilized Sqoop to import data from MySQL to HDFS on a regular basis.
Designed functional and technical documents, report templates, and reporting standards, and created SSIS packages connecting Oracle and SQL Server instances in SSIS 2008.
Performed aggregations on large amounts of data using Apache Spark and Scala and stored the data in the Hive warehouse for further analysis.
Leveraged AWS Glue to build serverless ETL pipelines for automating the extraction, transformation, and loading of data from various sources into Amazon S3, Redshift, or RDS. This service enabled the creation of ETL jobs that transformed the data to be used in analytics, reporting, and machine learning applications.
Developed real-time analytics, data pipelines (data ingestion, cleaning, enrichment), and a prediction engine using Scala, Apache Kafka, and Spark Streaming.
Built an ETL framework using Sqoop, Pig, and Hive to regularly ingest data from source systems and make it available for consumption.
Processed HDFS data, created external Hive tables, and developed reusable scripts to ingest and repair tables.
Analyzed source data and handled it efficiently by modifying data types; used Excel sheets, flat files, and CSV files to generate ad-hoc Power BI reports.
Deployed SSIS packages and Reports to Production Servers.
Hands-on experience in Apache Hive, Apache Pig, HBase, Apache Spark, Zookeeper, Flume, Kafka, Impala, Drill, and Sqoop.
Extensively worked on combiners, partitioning, and distributed cache to improve the performance of MapReduce jobs.
Utilized JIRA to manage project issues and workflow.
Utilized AWS Data Pipeline to schedule and automate data movement between AWS storage, compute, and analytic services. Data pipeline tasks included running Spark jobs for batch data processing, managing dependencies, and ensuring timely data flow from source systems to target destinations.
Responsible for logical and physical data modeling, database design, star schema, data analysis, programming, documentation, and implementation.
EDUCATION
Master of Science in Computer and Information Systems - Southern Arkansas University, Arkansas, USA
Bachelor of Technology in Electrical and Electronics Engineering - Jawaharlal Nehru Technological University, Telangana, India
CERTIFICATIONS
Databricks Lakehouse Fundamentals
Databricks Certified Data Engineer Professional
Microsoft Certified: Power BI Data Analyst Associate
Databricks Accredited Generative AI Fundamentals
SnowPro Core Certification