Introduction
Expertise Style Transfer (EST) is a new text style transfer task between expert language and layman language. It tackles the discrepancy between an expert's advice and a layman's understanding of it, a problem usually caused by the cognitive bias that is pervasive across all domains. To promote research into this task, we contribute MSD, a manually annotated dataset in the medical domain.
Characteristics
Our dataset is challenging from two aspects:
Knowledge Gap: Domain knowledge is the key factor that influences the expertise level of text, and it is also what distinguishes expertise from conventional styles. We identify two major types of knowledge gaps in MSD: terminology, e.g., dyspnea; and empirical evidence, i.e., doctors prefer to cite statistics ("about 1/1000"), while laymen do not ("quite small").
Lexical & Structural Modification: Most style transfer (ST) models only perform lexical modification and leave sentence structure unchanged. However, syntactic structure plays a significant role in language style, especially regarding complexity or simplicity. As the Statistics section shows, the sentences in MSD involve more lexical and structural modifications than those in other datasets.
Construction
The dataset is derived from The Merck Manuals, also known as the MSD Manuals, one of the world's most trusted health references. Each article is written through a collaboration between hundreds of medical experts and comes in two versions: one tailored for consumers and the other for professionals.
Our MSD dataset is constructed in three steps:
Data Preprocessing: We collect 2601 professional and 2487 consumer documents with 1185 internal links from the MSD website. We then find 2551 possible parallel groups of sentences by matching their document titles and subsection titles, which denote medical PICO elements such as Diagnosis and Symptoms. For example, all sentences for Atherosclerosis.Symptoms in the professional MSD are aligned with those for Atherosclerosis.Signs in the consumer MSD.
Expert Annotation: Given the aligned groups of sentences in the professional and consumer MSD, we hire doctors to select sentences from each version that have the same meaning but are written in different styles. The annotated sentence pairs can be used for evaluation.
Knowledge Incorporation: To facilitate knowledge-aware analysis, we use QuickUMLS to automatically link entity mentions, i.e., medical concepts, to the Unified Medical Language System (UMLS). Each mention is linked to one or more concepts, e.g., dyspnea is linked to concept C0013404.
After the three steps, we obtain a large set of (non-parallel) training sentences in each style, and a small set of parallel sentences for evaluation.
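The title-matching part of the preprocessing step can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the input format of (document title, subsection title, sentence) triples, and the SECTION_ALIASES table are all assumptions made for illustration. Only the Symptoms/Signs pairing comes from the example above.

```python
from collections import defaultdict

# Hypothetical alias table: maps subsection titles that name the same
# PICO element to one shared key. Only the Symptoms/Signs pairing is
# taken from the text; any further entries would be assumptions.
SECTION_ALIASES = {"symptoms": "symptoms", "signs": "symptoms"}


def section_key(doc_title, section_title):
    """Build a normalized (document, subsection) key for matching."""
    section = section_title.strip().lower()
    return (doc_title.strip().lower(), SECTION_ALIASES.get(section, section))


def align_groups(professional, consumer):
    """Group sentences by key and keep keys found in both versions.

    Each argument is an iterable of (doc_title, section_title, sentence).
    Returns {key: (professional_sentences, consumer_sentences)}.
    """
    pro_groups = defaultdict(list)
    con_groups = defaultdict(list)
    for title, section, sentence in professional:
        pro_groups[section_key(title, section)].append(sentence)
    for title, section, sentence in consumer:
        con_groups[section_key(title, section)].append(sentence)
    # A group is a candidate parallel group only if both versions have it.
    return {
        key: (pro_groups[key], con_groups[key])
        for key in pro_groups
        if key in con_groups
    }
```

Each returned group holds candidate sentences only. Picking the sentence pairs that actually share a meaning is left to the expert annotation step.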
Statistics
This section gives some statistics of the MSD dataset and compares it with other Style Transfer (ST) and Text Simplification (TS) datasets. Figure 2 shows the distribution of medical topics in the MSD Manuals, and Figure 3 shows the distribution of PICO elements in the test set. Table 1 compares the BLEU and edit distance (ED) scores between parallel sentences in existing TS and ST datasets.
Table 1. BLEU (4-gram) and edit distance (ED for short) scores between parallel sentences.
Figure 2: Distribution of dataset based on topics.
Figure 3: Distribution of testing set based on PICO.
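For readers unfamiliar with the edit distance metric reported in Table 1, the following is a minimal sketch of Levenshtein distance between two sentences. It assumes whitespace tokenization and token-level operations. The page does not state whether the reported scores are computed over tokens or characters, so this is an illustration of the metric rather than the authors' evaluation script.

```python
def edit_distance(source, target):
    """Levenshtein distance between two sequences.

    Counts the minimum number of insertions, deletions, and
    substitutions needed to turn `source` into `target`. Works on
    lists of tokens or on plain strings (character level).
    """
    # prev[j] is the distance between the processed prefix of source
    # and the first j items of target.
    prev = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        curr = [i]
        for j, t_item in enumerate(target, start=1):
            cost = 0 if s_item == t_item else 1
            curr.append(min(
                prev[j] + 1,         # delete s_item
                curr[j - 1] + 1,     # insert t_item
                prev[j - 1] + cost,  # substitute or keep
            ))
        prev = curr
    return prev[-1]


# The expert/layman phrases quoted earlier share no tokens, so both
# tokens must be substituted.
expert = "About 1/1000".split()
layman = "quite small".split()
print(edit_distance(expert, layman))  # 2
```

A higher distance between the two sentences of a parallel pair means that more of the sentence was rewritten.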
Disclaimer
Under the Data Management Policy of the National University of Singapore (NUS), effective from 1 May 2020, the dataset cannot be published before it is approved by NUS. We will release the dataset to the public as soon as possible. Until then, you can contact shuirh2019@gmail.com to obtain the dataset.
Citation
@inproceedings{cao2020expertise,
title = {Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen},
author = {Cao, Yixin and Shui, Ruihao and Pan, Liangming and Kan, Min-Yen and Liu, Zhiyuan and Chua, Tat-Seng},
booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year={2020}
}