Jan 23, 2025
Abstract
Healthcare systems continuously generate vast amounts of electronic health records (EHRs), commonly stored in the Fast Healthcare Interoperability Resources (FHIR) standard. Despite the wealth of information in these records, their complexity and volume make it difficult for users to retrieve and interpret crucial health insights. Recent advances in Large Language Models (LLMs) offer a solution, enabling semantic question answering (QA) over medical data, allowing users to interact with their health records more effectively.
However, ensuring privacy and compliance requires edge and private deployments of LLMs. This paper proposes a novel approach to semantic QA over EHRs by first identifying the most relevant FHIR resources for a user query (Task 1) and subsequently answering the query based on these resources (Task 2). We explore the performance of privately hosted, fine-tuned LLMs, evaluating them against benchmark models such as GPT-4 and GPT-4o. Our results demonstrate that fine-tuned LLMs, while 250x smaller in size, outperform GPT-4 family models by 0.55% in F1 score on Task 1 and by 42% in Meteor score on Task 2. Additionally, we examine advanced aspects of LLM usage, including sequential fine-tuning, model self-evaluation (narcissistic evaluation), and the impact of training data size on performance. The models and datasets are available here: https://huggingface.co/genloop
Keywords LLMs ⋅ Private ⋅ Edge ⋅ Fine-Tuning ⋅ FHIR ⋅ Question Answering ⋅ Medical
Introduction
The Fast Healthcare Interoperability Resources (FHIR) standard, developed by HL7, provides a consistent framework for exchanging electronic health records (EHRs) across healthcare systems, enabling improved interoperability. FHIR’s structured, machine-readable format supports the seamless transfer of complex healthcare data, making it central to modern health information exchange.
Recent mandates requiring the use of HL7 FHIR Release 4 to support API-enabled services reflect a shift toward greater data transparency and patient autonomy. Regulatory changes, particularly the Anti-Information Blocking provisions of the 21st Century Cures Act, also emphasize the need for patient-centric access to health data. Platforms like Apple Health now integrate FHIR data, giving patients unprecedented access to their medical records and creating new opportunities to transform complex data into actionable health insights.
Despite these advances in interoperability, patients often face significant challenges when interacting with their own health data. Complex medical terminology, language barriers, and limited health literacy make it difficult for many individuals to understand their conditions, diagnoses, or treatment plans. These challenges are not just inconveniences—they can lead to delays in care, misunderstandings of critical information, and ultimately, poorer health outcomes.
Large Language Models (LLMs) offer a means of making these health records human-readable. LLMs, characterized by their ability to process vast amounts of unstructured data and generate human-like text, are transforming the landscape of healthcare informatics. Their inherent capacity for natural language understanding (NLU) and generation enables them to extract, summarize, and interpret complex medical information.
In the context of this paper, LLMs serve as a bridge between raw FHIR data and end users, translating intricate clinical terminology into plain language that patients and healthcare providers can easily comprehend. By leveraging LLMs, healthcare providers have the opportunity to simplify and personalize health data. Such systems can empower patients, making their medical information more accessible and helping them better understand their health, which in turn may enhance patient engagement and adherence to treatment plans. This approach also has the potential to reduce healthcare inefficiencies by offering quicker, more intuitive access to relevant medical records.
However, the application of LLMs for such tasks faces significant barriers, particularly with respect to data privacy and security. Sharing personal health information (PHI) with cloud-hosted, general-purpose LLMs risks violating key privacy regulations, such as HIPAA, the California Consumer Privacy Act (CCPA), and the Biometric Information Privacy Act (BIPA, Illinois). In the healthcare domain, ensuring the confidentiality of sensitive patient data is paramount. For this reason, edge-deployed, privately hosted LLMs present a more viable solution, allowing healthcare systems to maintain control over patient data while still benefiting from the powerful capabilities of LLMs.
Thanks to open-source models like Meta’s Llama series and Mistral’s open-weight model series, it is now feasible to self-host LLMs for specific applications. However, these smaller models often fall short of the accuracy required for sophisticated healthcare tasks such as semantic question answering. Fine-tuning with parameter-efficient techniques like LoRA emerges as a critical step to close this gap at modest computational cost. By adapting smaller, domain-specific models to the nuances of the target task, fine-tuning improves both their accuracy and performance. This adaptation not only makes them suitable for demanding healthcare applications but also keeps them efficient and privacy-compliant, aligning with the requirements of edge deployment in sensitive domains like healthcare.
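As a concrete illustration, a LoRA adaptation of an open-weight model might be configured as follows with Hugging Face’s peft library; the base model name and hyperparameters are illustrative placeholders, not the exact values used in our experiments.

```python
# Minimal LoRA setup with Hugging Face peft (illustrative values only;
# not the exact configuration used in our experiments).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open-weight model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,                        # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # typically <1% of base weights
```

Because only the low-rank adapter matrices are trained, this style of adaptation fits on modest hardware, which is what makes per-task fine-tuning practical for edge deployment.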
In this paper, we break down the task of semantic question answering over medical records into two stages: (1) retrieving the most relevant FHIR resources given a user’s medical query, and (2) answering the query based on the retrieved resources. We fine-tune smaller, open-source models for each stage, ensuring that they are well-suited to the unique challenges of medical data processing. To facilitate this process, we generate synthetic patient data and utilize larger general-purpose models to create task-specific synthetic datasets (data collection). These datasets are then refined (data refinement), followed by training multiple models to identify the best-performing configurations (training), and finally, evaluating their performance against established benchmarks (evaluation). The resulting models are deployed in a privacy-compliant setup.
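To make the data-collection step concrete, the sketch below shows how a large general-purpose model could draft a (query, answer) pair for a synthetic FHIR resource. The OpenAI client, model name, and prompt wording here are assumptions for illustration, not our exact generation setup.

```python
# Sketch of the data-collection step: a large general-purpose model drafts
# a (query, answer) pair for each synthetic FHIR resource. The client,
# model name, and prompt are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def draft_example(fhir_resource: dict) -> dict:
    prompt = (
        "You are creating training data for a health-records QA system.\n"
        "Given the FHIR resource below, write one natural-language patient "
        "question it can answer, plus a short answer.\n\n"
        f"FHIR resource:\n{json.dumps(fhir_resource)}\n\n"
        'Reply as JSON: {"query": "...", "answer": "..."}'
    )
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(reply.choices[0].message.content)
```

Drafted examples of this kind then pass through the refinement step before being used for training.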
Our experiments reveal that fine-tuned smaller models significantly outperform larger general models like GPT-4, particularly in terms of accuracy, efficiency, and privacy compliance. This work demonstrates that edge-deployed, fine-tuned LLMs can offer a practical solution for healthcare providers seeking to implement patient-centric semantic question answering without sacrificing data privacy.
Beyond the core tasks, this study explores several important aspects of LLM behavior in healthcare contexts:
The impact of sequential fine-tuning on multi-task performance.
The tendency of LLMs to exhibit narcissistic behavior by favoring their own outputs.
The influence of dataset size on fine-tuning effectiveness, with implications for resource-constrained environments.
We discuss these results in Section 5 (Results and Discussion) and highlight key insights for future research in Section 6 (Conclusion, Limitations, and Future Work).
Related Work
The application of Large Language Models (LLMs) to patient medical data processing has garnered significant attention in recent research. Notably, one line of work demonstrated the efficacy of leveraging LLMs to convert clinical texts into FHIR resources with an accuracy of over 90%, thereby streamlining the processing and calibration of healthcare data and enhancing interoperability. Another introduced a multi-agent workflow built on the Reflexion framework, which used ICD-10 codes from medical reports to generate patient-friendly letters with an accuracy of 94.94%. These studies did not involve direct querying of Electronic Medical Records (EMRs).
More recently, "LLM on FHIR" was introduced: an open-source mobile application that leverages GPT-4 to let users interact with their health records, demonstrating strong accuracy and relevance in delivering comprehensible health information to patients. Follow-up work built on this application to replace GPT-4 with fine-tuned LLMs, dividing the FHIR querying process into three tasks: filtering, summarization, and response. The authors compared their models against Meditron, a family of medically adapted Llama 2 models, and against Llama 2 base models. Their approach performed better than the Llama 2 base models but underperformed Meditron.
Approach
To approach this task, we break query processing into two stages, similar to how retrieval-augmented generation (RAG) is performed.
Task 1: Identifying the FHIR resources from the patient’s medical record that are relevant to a given natural language patient query. Each patient will have numerous FHIR resources in their record, only some of which are relevant to the query. We formulate this as a binary classification problem: given a query q and a FHIR resource r, learn a function F(q, r) → {0, 1}, where 1 indicates that r is relevant to q.
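A minimal sketch of this formulation as an LLM call is shown below; the prompt template and the generate() interface are hypothetical stand-ins for the fine-tuned model’s actual serving API.

```python
# Task 1 as binary classification: the model emits a single relevance
# label for each (query, resource) pair. The prompt template and the
# llm.generate() interface are hypothetical stand-ins.
import json

RELEVANCE_PROMPT = (
    "Patient question: {query}\n"
    "FHIR resource: {resource}\n"
    "Is this resource relevant to the question? Answer 1 or 0."
)

def is_relevant(llm, query: str, resource: dict) -> bool:
    prompt = RELEVANCE_PROMPT.format(query=query, resource=json.dumps(resource))
    return llm.generate(prompt).strip() == "1"  # F(q, r) -> {0, 1}
```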
Task 2: Answering the patient’s medical query using the FHIR resources identified as relevant, i.e., generating the response from (query, list of relevant resources) pairs.
Figure 1 outlines the approach. The main inputs are the user query and the EHR records in FHIR format. The records received may be a mix of relevant and irrelevant resources. Task 1 classifies the relevant FHIR resources, letting us pick the records needed to generate the answer to the user’s medical question in Task 2.
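Putting the two stages together, the flow of Figure 1 can be sketched as follows, reusing the hypothetical is_relevant() helper from the Task 1 sketch above; answer_llm is a stand-in handle to the fine-tuned Task 2 model.

```python
# End-to-end flow of Figure 1: Task 1 filters the patient's FHIR record,
# Task 2 answers over the surviving resources. is_relevant() is the
# hypothetical Task 1 helper above; answer_llm is a stand-in for the
# fine-tuned Task 2 model.
import json

def answer_patient_query(filter_llm, answer_llm, query: str, record: list[dict]) -> str:
    relevant = [r for r in record if is_relevant(filter_llm, query, r)]  # Task 1
    context = "\n".join(json.dumps(r) for r in relevant)
    prompt = f"Patient question: {query}\nRelevant records:\n{context}\nAnswer:"
    return answer_llm.generate(prompt)                                   # Task 2
```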
LLMs are the intelligence modules for Task 1 and Task 2. In our approach, we fine-tune the top-performing models available at the time of experimentation (Llama-3.1-8B and Mistral-NeMo) for each of these tasks. We also compare the results obtained by swapping in GPT-4 (the SOTA model at the time) and Meditron-7b (the SOTA medical-domain model at the time). More details on the experiments, including dataset generation, are covered in Section 4 (Experiment Details). The results are discussed in Section 5 (Results and Discussion).
