Radiology Report Generation with Large Language Models
Radiologists write free-text reports that are ambiguous, inconsistent, and hard to evaluate. Structured reports fix this but require massive annotation effort. This project built an LLM-powered pipeline to automatically convert thousands of free-text radiology reports into structured format using a 718-question hierarchical template, creating a large-scale ground truth dataset for the Rad-ReStruct benchmark.
The Problem
When a radiologist reads a chest X-ray, they write a free-text report describing what they see. These narrative reports are the standard in clinical practice, but they come with serious limitations: ambiguous phrasing, inconsistent terminology, variable structure, and no standardized way to evaluate whether a report is clinically correct or complete.
Structured reports solve these problems. Instead of free-form narrative, they use standardized templates with predefined categories: anatomical sections, specific findings, and measurable attributes. Every field is explicit, every answer comes from a controlled vocabulary. But creating large-scale structured report datasets is prohibitively expensive. Manual annotation by trained radiologists does not scale.
The goal: use Large Language Models to automatically convert free-text radiology reports into structured format, generating a dataset large enough to train future automated structured report generation models.
The Approach
The foundation is a hierarchical question-answering template covering nine anatomical sections: Respiratory System (Lung, Pleura, Trachea), Cardiovascular System, Skeletal System, Breast, Abdomen, Thorax, Mediastinum, Lymph Nodes, and Foreign Objects. The template encodes clinical knowledge as a decision tree with three levels of progressively specific questions.
Level 1 asks about topic existence: "Are there any signs or diseases in the respiratory system?" These 25 binary questions act as gates. Level 2 asks about element existence: "Is there pneumonia in the lung?", "Are there stents?" These 216 binary questions drill into specific findings. Level 3 captures attributes: body region, laterality, degree, severity. These 477 questions offer single-choice or multi-choice answers from up to 94 options drawn from MeSH and RadLex medical ontologies.
The gating structure is key. If Level 1 finds no respiratory issues, the 80+ Level 2 and Level 3 questions for that section are skipped entirely. For a typical report where most sections have no findings, this dramatically reduces the effective question count from 718 to a manageable subset.
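As a sketch of this gated traversal, assuming the template is stored as a nested list of dicts and that the LLM is wrapped in a callable `ask(report_text, question)` (both the data shape and the `ask` interface are illustrative assumptions, not the project's actual format):

```python
def traverse(report_text, template, ask):
    """Walk the three-level hierarchy, skipping gated subtrees.

    `ask` is any callable (report_text, question) -> answer string,
    e.g. a wrapper around an LLM call.
    """
    answers = {}
    for topic in template:                      # Level 1: topic existence gates
        ans = ask(report_text, topic["question"])
        answers[topic["id"]] = ans
        if ans != "yes":
            continue                            # gate closed: skip the whole subtree
        for element in topic["elements"]:       # Level 2: element existence
            ans = ask(report_text, element["question"])
            answers[element["id"]] = ans
            if ans != "yes":
                continue
            for attr in element["attributes"]:  # Level 3: attribute questions
                answers[attr["id"]] = ask(report_text, attr["question"])
    return answers
```

A "no" at Level 1 prunes every Level 2 and Level 3 question beneath it, which is exactly how the 718-question template collapses to a small subset for a mostly normal report.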
The pipeline feeds each free-text radiology report to an LLM alongside questions from the structured template. For every report, the model receives the full text, a question from the hierarchy, and the available answer choices. The LLM extracts the answer, which is parsed into the structured format. This repeats for all applicable questions, producing a complete structured report.
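A minimal builder for one such per-question prompt could look like the following; the wording is illustrative, not the project's exact template:

```python
def build_prompt(report_text, question, choices):
    """Assemble a single extraction prompt from a report, one
    template question, and its controlled-vocabulary choices."""
    return (
        "You are extracting findings from a chest X-ray report.\n"
        f"Report: {report_text}\n"
        f"Question: {question}\n"
        f"Answer with exactly one of: {', '.join(choices)}.\n"
        "Answer:"
    )
```

Restricting the answer to the listed choices keeps the LLM output parseable into the controlled vocabulary of the structured template.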
Multiple LLMs were evaluated: Vicuna-13B, Alpaca, gpt4all, PubMedGPT, OpenAssistant, and ChatGPT. Vicuna-13B was selected as the primary model for its combination of strong medical text understanding, local hosting (no API costs), and data privacy compliance, which is critical when working with clinical text from MIMIC-III/IV.
Prompting strategies were designed to maximize extraction accuracy: zero-shot prompting with medical context, few-shot with example report-answer pairs, and chain-of-thought prompting for multi-step reasoning about complex findings. The hierarchical template itself served as implicit chain-of-thought guidance, breaking the structuring task into a sequence of progressively specific questions.
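A few-shot variant might prepend worked report/question/answer pairs before the query; the demonstration pairs below are hypothetical, not the ones actually used:

```python
# Hypothetical demonstration pairs for few-shot prompting.
SHOTS = [
    ("The heart is enlarged.", "Is there cardiomegaly?", "yes"),
    ("Lungs are clear without consolidation.",
     "Is there pneumonia in the lung?", "no"),
]

def few_shot_prompt(report_text, question, choices, shots=SHOTS):
    """Prepend worked report/question/answer examples to the query."""
    blocks = [f"Report: {r}\nQuestion: {q}\nAnswer: {a}" for r, q, a in shots]
    blocks.append(
        f"Report: {report_text}\n"
        f"Question: {question}\n"
        f"Answer with one of: {', '.join(choices)}.\n"
        "Answer:"
    )
    return "\n\n".join(blocks)
```

The demonstrations show the model the expected terse answer format, which is the main reason few-shot prompting tends to beat zero-shot on this kind of extraction task.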
The pipeline was applied to thousands of chest X-ray reports from the MIMIC-III/IV clinical database, one of the largest publicly available collections of de-identified clinical data. Each free-text report was processed through the full question hierarchy, generating a structured JSON output with nested section, finding, and attribute data.
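The nested output might look roughly like the dict below; the field names are illustrative, not the actual Rad-ReStruct schema:

```python
import json

# Hypothetical shape of one structured report: sections contain
# findings, findings carry an existence answer plus attributes.
structured_report = {
    "respiratory_system": {
        "answer": "yes",
        "lung": {
            "pneumonia": {
                "answer": "yes",
                "attributes": {
                    "body_region": ["lower lobe"],
                    "laterality": ["right"],
                },
            }
        },
    },
    "cardiovascular_system": {"answer": "no"},
}

print(json.dumps(structured_report, indent=2))
```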
An interpolation system handled LLM output inconsistencies: malformed answers, ambiguous responses, and cases where the model hedged rather than committing to a controlled vocabulary term. Both raw and interpolated versions were generated. The resulting dataset was contributed to the Rad-ReStruct benchmark, providing the first large-scale structured radiology report dataset paired with chest X-ray images.
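One way to sketch such interpolation is fuzzy matching of the raw LLM answer against the controlled vocabulary, e.g. with Python's standard `difflib`; this illustrates the idea and is not the project's actual implementation:

```python
import difflib
import re

def interpolate_answer(raw, vocabulary, cutoff=0.6):
    """Map a free-form LLM answer onto a controlled-vocabulary term.
    Returns the matched term, or None if nothing is close enough."""
    cleaned = re.sub(r"[^a-z\s]", " ", raw.lower()).strip()
    # Exact containment first, so a hedged sentence like
    # "there appears to be a pleural effusion" still resolves.
    for term in vocabulary:
        if term.lower() in cleaned:
            return term
    # Otherwise fall back to the closest string match (handles typos
    # and slightly malformed answers).
    match = difflib.get_close_matches(
        cleaned, [t.lower() for t in vocabulary], n=1, cutoff=cutoff)
    if match:
        for term in vocabulary:
            if term.lower() == match[0]:
                return term
    return None
```

Answers that fail both checks stay unmapped, which is what distinguishes the raw from the interpolated dataset versions.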
Results
Template coverage by anatomical section
| Section | Topics | Example Findings |
|---|---|---|
| Respiratory System | Lung, Pleura, Trachea | Pneumonia, effusion, atelectasis |
| Cardiovascular System | Heart, vessels | Cardiomegaly, congestion |
| Skeletal System | Bones, joints | Fractures, degenerative changes |
| Foreign Objects | Devices, implants | Catheters, stents, tubes |
| Mediastinum | Mediastinal structures | Widening, masses, shift |
| Thorax | Thoracic wall, diaphragm | Hernia, elevation |
| Breast | Breast tissue | Calcifications, masses |
| Abdomen | Abdominal structures | Free air, distension |
| Lymph Nodes | Lymphatic system | Enlargement, calcification |
9 sections, 178 controlled vocabulary terms from MeSH and RadLex ontologies.
Question distribution across hierarchy levels

| Level | Role | Questions |
|---|---|---|
| Level 1 | Topic existence (binary gates) | 25 |
| Level 2 | Element existence (binary) | 216 |
| Level 3 | Attributes (single/multi-choice) | 477 |

718 questions in total.
LLM comparison
| Model | Type | Key Characteristic |
|---|---|---|
| Vicuna-13B | Open-source, 13B | Primary model, best medical understanding |
| ChatGPT | API-based, closed | High quality but API cost and privacy |
| PubMedGPT | Domain-specific | Medical vocabulary, limited generation |
| Alpaca | Open-source, 7B | Fast but lower accuracy |
| gpt4all | Open-source | Broad capability, variable quality |
| OpenAssistant | Open-source | Conversational focus |
Vicuna-13B was selected for its medical text understanding, local hosting, and data privacy compliance.
The three-level hierarchy proved essential for managing scale. A chest X-ray with findings only in the respiratory and cardiovascular sections might answer 25 Level 1 questions, then branch into roughly 60 Level 2 and Level 3 questions for those two sections, rather than all 718. For the MIMIC dataset, where many reports describe only one or two abnormalities, this gating reduced average processing time per report substantially.
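The arithmetic behind that claim, using the assumed ~60-question subtree size for the two open sections from the text:

```python
# Back-of-the-envelope gating effect for a two-section report.
level1_gates = 25      # Level 1 questions, always asked
open_subtrees = 60     # assumed Level 2/3 questions for the two open sections
effective = level1_gates + open_subtrees
total = 718
skipped_fraction = 1 - effective / total
print(effective, f"{skipped_fraction:.0%}")
```

Under these assumptions roughly 85 of 718 questions are actually asked, i.e. close to 90% of the template is skipped.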
Key Findings
- A hierarchical question-answering template with 718 questions across three levels can capture the full clinical detail of a chest X-ray report, from high-level section findings down to specific attributes like body region, laterality, and severity.
- Vicuna-13B proved the most effective open-source model for medical text structuring, offering strong medical vocabulary understanding without API costs or data privacy concerns inherent to cloud-based models.
- Structured prompting with few-shot medical examples significantly outperformed zero-shot approaches. The question hierarchy itself served as implicit chain-of-thought guidance, breaking complex extraction into manageable steps.
- The three-level gating structure dramatically reduces the effective question count per report. Since most anatomical sections have no findings in a typical X-ray, Level 1 gates eliminate the majority of downstream questions.
- The generated dataset contributed to the Rad-ReStruct benchmark, providing the first large-scale structured radiology report dataset paired with chest X-ray images for training automated structured report generation models.