Radiology Report Generation with Large Language Models
Radiologists write free-text reports that are ambiguous, inconsistent, and hard to evaluate. Structured reports fix this but require massive annotation effort. This project built an LLM-powered pipeline to automatically convert thousands of free-text radiology reports into structured format using a 718-question hierarchical template, creating a large-scale ground truth dataset for the Rad-ReStruct benchmark.
The Problem
When a radiologist reads a chest X-ray, they write a free-text report describing what they see. These narrative reports are the standard in clinical practice, but they come with serious limitations: ambiguous phrasing, inconsistent terminology, variable structure, and no standardized way to evaluate whether a report is clinically correct or complete.
Structured reports solve these problems. Instead of free-form narrative, they use standardized templates with predefined categories: anatomical sections, specific findings, and measurable attributes. Every field is explicit, every answer comes from a controlled vocabulary. But creating large-scale structured report datasets is prohibitively expensive. Manual annotation by trained radiologists does not scale.
The goal: use Large Language Models to automatically convert free-text radiology reports into structured format, generating a dataset large enough to train future automated structured report generation models.
The Approach
The foundation is a hierarchical question-answering template covering nine anatomical sections: Respiratory System (Lung, Pleura, Trachea), Cardiovascular System, Skeletal System, Breast, Abdomen, Thorax, Mediastinum, Lymph Nodes, and Foreign Objects. The template encodes clinical knowledge as a decision tree with three levels of progressively specific questions.
Level 1 asks about topic existence: "Are there any signs or diseases in the respiratory system?" These 25 binary questions act as gates. Level 2 asks about element existence: "Is there pneumonia in the lung?", "Are there stents?" These 216 binary questions drill into specific findings. Level 3 captures attributes: body region, laterality, degree, severity. These 477 questions offer single-choice or multi-choice answers from up to 94 options drawn from MeSH and RadLex medical ontologies.
The gating structure is key. If Level 1 finds no respiratory issues, the 80+ Level 2 and Level 3 questions for that section are skipped entirely. For a typical report where most sections have no findings, this dramatically reduces the effective question count from 718 to a manageable subset.
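As a sketch of this gated traversal, assuming the template is stored as a nested list of dicts and that the LLM is wrapped in a callable `ask(report_text, question)` (both the data shape and the `ask` interface are illustrative assumptions, not the project's actual format):

```python
def traverse(report_text, template, ask):
    """Walk the three-level hierarchy, skipping gated subtrees.

    `ask` is any callable (report_text, question) -> answer string,
    e.g. a wrapper around an LLM call.
    """
    answers = {}
    for topic in template:                      # Level 1: topic existence gates
        ans = ask(report_text, topic["question"])
        answers[topic["id"]] = ans
        if ans != "yes":
            continue                            # gate closed: skip the whole subtree
        for element in topic["elements"]:       # Level 2: element existence
            ans = ask(report_text, element["question"])
            answers[element["id"]] = ans
            if ans != "yes":
                continue
            for attr in element["attributes"]:  # Level 3: attribute questions
                answers[attr["id"]] = ask(report_text, attr["question"])
    return answers
```

A "no" at Level 1 prunes every Level 2 and Level 3 question beneath it, which is exactly how the 718-question template collapses to a small subset for a mostly normal report.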
The pipeline feeds each free-text radiology report to an LLM alongside questions from the structured template. For every report, the model receives the full text, a question from the hierarchy, and the available answer choices. The LLM extracts the answer, which is parsed into the structured format. This repeats for all applicable questions, producing a complete structured report.
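A minimal builder for one such per-question prompt could look like the following; the wording is illustrative, not the project's exact template:

```python
def build_prompt(report_text, question, choices):
    """Assemble a single extraction prompt from a report, one
    template question, and its controlled-vocabulary choices."""
    return (
        "You are extracting findings from a chest X-ray report.\n"
        f"Report: {report_text}\n"
        f"Question: {question}\n"
        f"Answer with exactly one of: {', '.join(choices)}.\n"
        "Answer:"
    )
```

Restricting the answer to the listed choices keeps the LLM output parseable into the controlled vocabulary of the structured template.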
Multiple LLMs were evaluated: Vicuna-13B, Alpaca, gpt4all, PubMedGPT, OpenAssistant, and ChatGPT. Vicuna-13B was selected as the primary model for its combination of strong medical text understanding, local hosting (no API costs), and data privacy compliance, which is critical when working with clinical text from MIMIC-III/IV.
Prompting strategies were designed to maximize extraction accuracy: zero-shot prompting with medical context, few-shot with example report-answer pairs, and chain-of-thought prompting for multi-step reasoning about complex findings. The hierarchical template itself served as implicit chain-of-thought guidance, breaking the structuring task into a sequence of progressively specific questions.
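A few-shot variant might prepend worked report/question/answer pairs before the query; the demonstration pairs below are hypothetical, not the ones actually used:

```python
# Hypothetical demonstration pairs for few-shot prompting.
SHOTS = [
    ("The heart is enlarged.", "Is there cardiomegaly?", "yes"),
    ("Lungs are clear without consolidation.",
     "Is there pneumonia in the lung?", "no"),
]

def few_shot_prompt(report_text, question, choices, shots=SHOTS):
    """Prepend worked report/question/answer examples to the query."""
    blocks = [f"Report: {r}\nQuestion: {q}\nAnswer: {a}" for r, q, a in shots]
    blocks.append(
        f"Report: {report_text}\n"
        f"Question: {question}\n"
        f"Answer with one of: {', '.join(choices)}.\n"
        "Answer:"
    )
    return "\n\n".join(blocks)
```

The demonstrations show the model the expected terse answer format, which is the main reason few-shot prompting tends to beat zero-shot on this kind of extraction task.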
The pipeline was applied to thousands of chest X-ray reports from the MIMIC-III/IV clinical database, one of the largest publicly available collections of de-identified clinical data. Each free-text report was processed through the full question hierarchy, generating a structured JSON output with nested section, finding, and attribute data.
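The nested output might look roughly like the dict below; the field names are illustrative, not the actual Rad-ReStruct schema:

```python
import json

# Hypothetical shape of one structured report: sections contain
# findings, findings carry an existence answer plus attributes.
structured_report = {
    "respiratory_system": {
        "answer": "yes",
        "lung": {
            "pneumonia": {
                "answer": "yes",
                "attributes": {
                    "body_region": ["lower lobe"],
                    "laterality": ["right"],
                },
            }
        },
    },
    "cardiovascular_system": {"answer": "no"},
}

print(json.dumps(structured_report, indent=2))
```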
An interpolation system handled LLM output inconsistencies: malformed answers, ambiguous responses, and cases where the model hedged rather than committing to a controlled vocabulary term. Both raw and interpolated versions were generated. The resulting dataset was contributed to the Rad-ReStruct benchmark, providing the first large-scale structured radiology report dataset paired with chest X-ray images.
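One way to sketch such interpolation is fuzzy matching of the raw LLM answer against the controlled vocabulary, e.g. with Python's standard `difflib`; this illustrates the idea and is not the project's actual implementation:

```python
import difflib
import re

def interpolate_answer(raw, vocabulary, cutoff=0.6):
    """Map a free-form LLM answer onto a controlled-vocabulary term.
    Returns the matched term, or None if nothing is close enough."""
    cleaned = re.sub(r"[^a-z\s]", " ", raw.lower()).strip()
    # Exact containment first, so a hedged sentence like
    # "there appears to be a pleural effusion" still resolves.
    for term in vocabulary:
        if term.lower() in cleaned:
            return term
    # Otherwise fall back to the closest string match (handles typos
    # and slightly malformed answers).
    match = difflib.get_close_matches(
        cleaned, [t.lower() for t in vocabulary], n=1, cutoff=cutoff)
    if match:
        for term in vocabulary:
            if term.lower() == match[0]:
                return term
    return None
```

Answers that fail both checks stay unmapped, which is what distinguishes the raw from the interpolated dataset versions.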
Results
Template coverage by anatomical section
| Section | Topics | Example Findings |
|---|---|---|
| Respiratory System | Lung, Pleura, Trachea | Pneumonia, effusion, atelectasis |
| Cardiovascular System | Heart, vessels | Cardiomegaly, congestion |
| Skeletal System | Bones, joints | Fractures, degenerative changes |
| Foreign Objects | Devices, implants | Catheters, stents, tubes |
| Mediastinum | Mediastinal structures | Widening, masses, shift |
| Thorax | Thoracic wall, diaphragm | Hernia, elevation |
| Breast | Breast tissue | Calcifications, masses |
| Abdomen | Abdominal structures | Free air, distension |
| Lymph Nodes | Lymphatic system | Enlargement, calcification |
9 sections, 178 controlled vocabulary terms from MeSH and RadLex ontologies.
Question distribution across hierarchy levels

| Level | Role | Questions |
|---|---|---|
| Level 1 | Topic existence (binary gates) | 25 |
| Level 2 | Element existence (binary) | 216 |
| Level 3 | Attributes (single/multi-choice) | 477 |

718 questions in total.
LLM comparison
| Model | Type | Key Characteristic |
|---|---|---|
| Vicuna-13B | Open-source, 13B | Primary model, best medical understanding |
| ChatGPT | API-based, closed | High quality but API cost and privacy |
| PubMedGPT | Domain-specific | Medical vocabulary, limited generation |
| Alpaca | Open-source, 7B | Fast but lower accuracy |
| gpt4all | Open-source | Broad capability, variable quality |
| OpenAssistant | Open-source | Conversational focus |
Vicuna-13B was selected for its medical text understanding, local hosting, and data privacy compliance.
The three-level hierarchy proved essential for managing scale. A chest X-ray with findings only in the respiratory and cardiovascular sections might answer 25 Level 1 questions, then branch into roughly 60 Level 2 and Level 3 questions for those two sections, rather than all 718. For the MIMIC dataset, where many reports describe only one or two abnormalities, this gating reduced average processing time per report substantially.
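The arithmetic behind that claim, using the assumed ~60-question subtree size for the two open sections from the text:

```python
# Back-of-the-envelope gating effect for a two-section report.
level1_gates = 25      # Level 1 questions, always asked
open_subtrees = 60     # assumed Level 2/3 questions for the two open sections
effective = level1_gates + open_subtrees
total = 718
skipped_fraction = 1 - effective / total
print(effective, f"{skipped_fraction:.0%}")
```

Under these assumptions roughly 85 of 718 questions are actually asked, i.e. close to 90% of the template is skipped.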
Key Findings
- A hierarchical question-answering template with 718 questions across three levels can capture the full clinical detail of a chest X-ray report, from high-level section findings down to specific attributes like body region, laterality, and severity.
- Vicuna-13B proved the most effective open-source model for medical text structuring, offering strong medical vocabulary understanding without API costs or data privacy concerns inherent to cloud-based models.
- Structured prompting with few-shot medical examples significantly outperformed zero-shot approaches. The question hierarchy itself served as implicit chain-of-thought guidance, breaking complex extraction into manageable steps.
- The three-level gating structure dramatically reduces the effective question count per report. Since most anatomical sections have no findings in a typical X-ray, Level 1 gates eliminate the majority of downstream questions.
- The generated dataset contributed to the Rad-ReStruct benchmark, providing the first large-scale structured radiology report dataset paired with chest X-ray images for training automated structured report generation models.