
Radiology Report Generation with Large Language Models

Research Project · Technical University of Munich · 2023

Radiologists write free-text reports that are ambiguous, inconsistent, and hard to evaluate. Structured reports fix this but require massive annotation effort. This project built an LLM-powered pipeline to automatically convert thousands of free-text radiology reports into structured format using a 718-question hierarchical template, creating a large-scale ground truth dataset for the Rad-ReStruct benchmark.


The Problem

When a radiologist reads a chest X-ray, they write a free-text report describing what they see. These narrative reports are the standard in clinical practice, but they come with serious limitations: ambiguous phrasing, inconsistent terminology, variable structure, and no standardized way to evaluate whether a report is clinically correct or complete.

Structured reports solve these problems. Instead of free-form narrative, they use standardized templates with predefined categories: anatomical sections, specific findings, and measurable attributes. Every field is explicit, every answer comes from a controlled vocabulary. But creating large-scale structured report datasets is prohibitively expensive. Manual annotation by trained radiologists does not scale.

The goal: use Large Language Models to automatically convert free-text radiology reports into structured format, generating a dataset large enough to train future automated structured report generation models.


The Approach

Structured Report Template: The Schema

The foundation is a hierarchical question-answering template covering nine anatomical sections: Respiratory System (Lung, Pleura, Trachea), Cardiovascular System, Skeletal System, Breast, Abdomen, Thorax, Mediastinum, Lymph Nodes, and Foreign Objects. The template encodes clinical knowledge as a decision tree with three levels of progressively specific questions.

Level 1 asks about topic existence: "Are there any signs or diseases in the respiratory system?" These 25 binary questions act as gates. Level 2 asks about element existence: "Is there pneumonia in the lung?", "Are there stents?" These 216 binary questions drill into specific findings. Level 3 captures attributes: body region, laterality, degree, severity. These 477 questions offer single-choice or multi-choice answers from up to 94 options drawn from MeSH and RadLex medical ontologies.

The gating structure is key. If Level 1 finds no respiratory issues, the 80+ Level 2 and Level 3 questions for that section are skipped entirely. For a typical report where most sections have no findings, this dramatically reduces the effective question count from 718 to a manageable subset.
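The gating traversal can be sketched as a short tree walk over a nested template of topics, elements, and attributes. This is a minimal illustration; the field names and "yes"/"no" convention are assumptions, not the project's actual data format:

```python
# Sketch of the three-level gating traversal (hypothetical field names;
# the real Rad-ReStruct template format may differ).

def applicable_questions(template, answer_fn):
    """Walk the hierarchy, pruning subtrees whose gate question answers 'no'."""
    asked = []
    for topic in template["topics"]:                   # Level 1: topic gates
        asked.append(topic["question"])
        if answer_fn(topic["question"]) != "yes":
            continue                                   # skip the whole section
        for element in topic["elements"]:              # Level 2: element gates
            asked.append(element["question"])
            if answer_fn(element["question"]) != "yes":
                continue                               # skip attribute questions
            for attr in element["attributes"]:         # Level 3: attributes
                asked.append(attr["question"])
    return asked
```

Pruning at Level 1 is what keeps the effective question count far below 718 for a typical report.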

Level 1: Topic Existence (25 binary questions) → gates → Level 2: Element Existence (216 binary questions) → gates → Level 3: Attributes (477 multi-choice questions) → Structured Report · 9 anatomical sections · 718 total questions · 178 controlled vocabulary terms
LLM-based QA Pipeline: The Engine

The pipeline feeds each free-text radiology report to an LLM alongside questions from the structured template. For every report, the model receives the full text, a question from the hierarchy, and the available answer choices. The LLM extracts the answer, which is parsed into the structured format. This repeats for all applicable questions, producing a complete structured report.
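In code, one QA step reduces to building a prompt from the report, question, and answer choices, then mapping the model's free-form reply back onto the controlled vocabulary. A minimal sketch with invented prompt wording and no actual model call:

```python
# One QA step of the pipeline, sketched. The prompt template and matching
# rule are illustrative assumptions, not the project's exact implementation.

def build_prompt(report: str, question: str, choices: list[str]) -> str:
    """Assemble the per-question prompt given to the LLM."""
    return (
        "You are a radiology assistant. Answer using only the given options.\n"
        f"Report:\n{report}\n\n"
        f"Question: {question}\n"
        f"Options: {', '.join(choices)}\n"
        "Answer:"
    )

def extract_answer(reply: str, choices: list[str]):
    """Return the first answer choice found in the model's reply, else None."""
    reply = reply.strip().lower()
    for choice in choices:
        if choice.lower() in reply:
            return choice
    return None
```

The pipeline repeats this pair of calls for every applicable question in the hierarchy, accumulating the parsed answers into the structured report.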

Multiple LLMs were evaluated: Vicuna-13B, Alpaca, gpt4all, PubMedGPT, OpenAssistant, and ChatGPT. Vicuna-13B was selected as the primary model for its combination of strong medical text understanding, local hosting (no API costs), and data privacy compliance, which is critical when working with clinical text from MIMIC-III/IV.

Prompting strategies were designed to maximize extraction accuracy: zero-shot prompting with medical context, few-shot with example report-answer pairs, and chain-of-thought prompting for multi-step reasoning about complex findings. The hierarchical template itself served as implicit chain-of-thought guidance, breaking the structuring task into a sequence of progressively specific questions.
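Concretely, few-shot prompting just prepends worked report-answer pairs to the target question. The example pair and wording below are invented for illustration, not taken from the project's prompts:

```python
# Illustrative few-shot prompt: one worked example precedes the real question.
FEW_SHOT_EXAMPLE = (
    "Report: Small right pleural effusion. No pneumothorax.\n"
    "Question: Is there a pleural effusion?\n"
    "Options: yes, no\n"
    "Answer: yes\n\n"
)

def few_shot_prompt(report: str, question: str, choices: list[str]) -> str:
    """Prepend the worked example, then pose the target question."""
    return FEW_SHOT_EXAMPLE + (
        f"Report: {report}\n"
        f"Question: {question}\n"
        f"Options: {', '.join(choices)}\n"
        "Answer:"
    )
```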

Free-text Report → Vicuna-13B (question template + few-shot prompts) → Answer Parsing (+ interpolation) → Structured Report (JSON hierarchy) · repeated for all 718 questions per report
Dataset Generation: The Output

The pipeline was applied to thousands of chest X-ray reports from the MIMIC-III/IV clinical database, one of the largest publicly available collections of de-identified clinical data. Each free-text report was processed through the full question hierarchy, generating a structured JSON output with nested section, finding, and attribute data.
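The nested output could take a shape like the following. The field names here are illustrative assumptions; the actual Rad-ReStruct JSON schema may differ:

```python
# Illustrative shape of one structured record: sections contain findings,
# findings carry controlled-vocabulary attributes. Field names are assumed.
import json

record = {
    "respiratory_system": {
        "present": True,
        "findings": [
            {"name": "pneumonia",
             "attributes": {"laterality": "right", "degree": "mild"}},
        ],
    },
    "cardiovascular_system": {"present": False, "findings": []},
}

print(json.dumps(record, indent=2))
```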

An interpolation system handled LLM output inconsistencies: malformed answers, ambiguous responses, and cases where the model hedged rather than committing to a controlled vocabulary term. Both raw and interpolated versions were generated. The resulting dataset was contributed to the Rad-ReStruct benchmark, providing the first large-scale structured radiology report dataset paired with chest X-ray images.
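Such normalization might look like the sketch below. The hedge list, matching rule, and fallback behavior are all assumptions for illustration, not the project's actual heuristics:

```python
# Hypothetical interpolation step: snap a free-form LLM reply onto the
# controlled vocabulary, with a fallback for hedged or malformed output.
HEDGES = {"possibly", "probably", "likely", "cannot exclude"}

def interpolate(raw: str, vocabulary: list[str], default: str = "no") -> str:
    text = raw.strip().lower()
    for term in vocabulary:
        if term.lower() in text:
            return term                 # direct vocabulary hit wins
    if any(h in text for h in HEDGES):
        # hedged reply: commit to the affirmative option if one exists
        return "yes" if "yes" in vocabulary else default
    return default                      # malformed or ambiguous reply
```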


Results

Template coverage by anatomical section

Section Topics Example Findings
Respiratory System Lung, Pleura, Trachea Pneumonia, effusion, atelectasis
Cardiovascular System Heart, vessels Cardiomegaly, congestion
Skeletal System Bones, joints Fractures, degenerative changes
Foreign Objects Devices, implants Catheters, stents, tubes
Mediastinum Mediastinal structures Widening, masses, shift
Thorax Thoracic wall, diaphragm Hernia, elevation
Breast Breast tissue Calcifications, masses
Abdomen Abdominal structures Free air, distension
Lymph Nodes Lymphatic system Enlargement, calcification

9 sections, 178 controlled vocabulary terms from MeSH and RadLex ontologies.

Question distribution across hierarchy levels

Level 1: Topics · 25 questions
Level 2: Elements · 216 questions
Level 3: Attributes · 477 questions

718 total questions per report. Level 1 binary gates reduce effective count: most sections have no findings.

LLM comparison

Model Type Key Characteristic
Vicuna-13B Open-source, 13B Primary model, best medical understanding
ChatGPT API-based, closed High quality but API cost and privacy
PubMedGPT Domain-specific Medical vocabulary, limited generation
Alpaca Open-source, 7B Fast but lower accuracy
gpt4all Open-source Broad capability, variable quality
OpenAssistant Open-source Conversational focus

Vicuna-13B was selected for medical text understanding, local hosting, and data privacy compliance.

The three-level hierarchy proved essential for managing scale. A chest X-ray with findings only in the respiratory and cardiovascular sections might answer 25 Level 1 questions, then branch into roughly 60 Level 2 and Level 3 questions for those two sections, rather than all 718. For the MIMIC dataset, where many reports describe only one or two abnormalities, this gating reduced average processing time per report substantially.


Key Findings

  1. A hierarchical question-answering template with 718 questions across three levels can capture the full clinical detail of a chest X-ray report, from high-level section findings down to specific attributes like body region, laterality, and severity.
  2. Vicuna-13B proved the most effective open-source model for medical text structuring, offering strong medical vocabulary understanding without API costs or data privacy concerns inherent to cloud-based models.
  3. Structured prompting with few-shot medical examples significantly outperformed zero-shot approaches. The question hierarchy itself served as implicit chain-of-thought guidance, breaking complex extraction into manageable steps.
  4. The three-level gating structure dramatically reduces the effective question count per report. Since most anatomical sections have no findings in a typical X-ray, Level 1 gates eliminate the majority of downstream questions.
  5. The generated dataset contributed to the Rad-ReStruct benchmark, providing the first large-scale structured radiology report dataset paired with chest X-ray images for training automated structured report generation models.