Introduction

Expertise Style Transfer (EST) is a new text style transfer task between expert language and layman language. It aims to tackle the discrepancies between an expert's advice and a layman's understanding of it, a problem that is usually caused by a cognitive bias pervasive across all domains. To promote research into this task, we contribute MSD, a manually annotated dataset in the medical domain.

Characteristics

Our dataset is challenging in two respects:

Knowledge Gap: Domain knowledge is the key factor that influences the expertise level of text, and it is also a key difference from conventional styles. We identify two major types of knowledge gaps in MSD: terminology, e.g., dyspnea (the clinical term for shortness of breath); and empirical evidence, where doctors prefer to use statistics ("About 1/1000") while laymen do not ("quite small").
Lexical & Structural Modification: Most style transfer models only perform lexical modification and leave structures unchanged. However, syntactic structures play a significant role in language styles, especially regarding complexity or simplicity. The Statistics section shows that parallel sentences in MSD involve more lexical and structural modifications than those in the other datasets compared there.
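
As a rough illustration of the two knowledge-gap types listed above, the Python sketch below flags surface cues for each: words found in a term list, and numeric expressions. It is a minimal heuristic under stated assumptions, not the procedure used to build MSD. The one-entry term list `EXPERT_TERMS`, the pattern `STATISTIC`, and the function `knowledge_gap_cues` are hypothetical names introduced here, and the inputs are the fragments quoted in the Knowledge Gap item.

```python
import re

# Hypothetical one-entry term list, for illustration only. A realistic list
# would come from a medical vocabulary rather than being written by hand.
EXPERT_TERMS = {"dyspnea"}

# Assumed pattern for numeric evidence such as "1/1000", "0.1%" or "1 in 1000".
STATISTIC = re.compile(r"\d+(?:\.\d+)?\s*(?:%|/\s*\d+|in\s+\d+)")


def knowledge_gap_cues(text: str) -> dict[str, list[str]]:
    """Return surface cues for terminology and empirical evidence in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    return {
        "terminology": sorted(set(words) & EXPERT_TERMS),
        "statistics": [match.group(0) for match in STATISTIC.finditer(text)],
    }


print(knowledge_gap_cues("About 1/1000"))  # → {'terminology': [], 'statistics': ['1/1000']}
print(knowledge_gap_cues("quite small"))   # → {'terminology': [], 'statistics': []}
print(knowledge_gap_cues("dyspnea"))       # → {'terminology': ['dyspnea'], 'statistics': []}
```

A single hand-written term is obviously not a detector; the point is only that expert text tends to trigger such cues while the corresponding layman text does not.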

Construction

The dataset is derived from The Merck Manuals, also known as the MSD Manuals, one of the world's most trusted health references. The manuals are written through a collaboration between hundreds of medical experts, and each article comes in two versions: one tailored for consumers and the other for professionals.

Our MSD dataset is constructed in three steps:

Data Preprocessing: We collect 2601 professional and 2487 consumer documents with 1185 internal links from the MSD website. We then find 2551 possible parallel groups of sentences by matching document titles and subsection titles, which denote medical PICO elements such as Diagnosis and Symptoms. For example, all sentences for Atherosclerosis.Symptoms in the professional MSD are aligned with those for Atherosclerosis.Signs in the consumer MSD.
Expert Annotation: Given the aligned groups of sentences in the professional and consumer MSD, hired doctors select sentences from each version that have the same meaning but are written in different styles. The annotated sentence pairs can be used for evaluation.
Knowledge Incorporation: To facilitate knowledge-aware analysis, we use QuickUMLS to automatically link entity mentions of medical concepts to the Unified Medical Language System (UMLS). A mention may refer to multiple concepts. As an example of a link, dyspnea is linked to concept C0013404.
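
The Data Preprocessing step can be sketched roughly as follows. This is a minimal illustration, assuming each document has already been split into titled subsections and sentences. The `Section` type, the `ELEMENT_ALIASES` table, and the function names are hypothetical, and the document does not describe how differing subsection titles (such as Symptoms and Signs) are actually reconciled, so the alias table is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class Section(NamedTuple):
    doc_title: str        # e.g. "Atherosclerosis"
    subsection: str       # e.g. "Symptoms" (professional) or "Signs" (consumer)
    sentences: list[str]  # sentences of this subsection, already split


# Hypothetical normalisation table mapping subsection titles to a shared
# element label. Only these titles are named in this document.
ELEMENT_ALIASES = {"symptoms": "symptoms", "signs": "symptoms", "diagnosis": "diagnosis"}

GroupKey = tuple[str, str]  # (normalised document title, element label)


def index_sections(sections: list[Section]) -> dict[GroupKey, list[str]]:
    """Collect sentences under (document title, element) keys."""
    index: dict[GroupKey, list[str]] = defaultdict(list)
    for section in sections:
        element = ELEMENT_ALIASES.get(section.subsection.strip().lower())
        if element is not None:  # skip subsections with no known element
            index[(section.doc_title.strip().lower(), element)].extend(section.sentences)
    return index


def parallel_groups(professional: list[Section], consumer: list[Section]):
    """Pair up sentence groups that share a key in both versions."""
    pro_index = index_sections(professional)
    con_index = index_sections(consumer)
    shared = sorted(pro_index.keys() & con_index.keys())
    return {key: (pro_index[key], con_index[key]) for key in shared}


# Placeholder strings stand in for real sentences.
professional = [Section("Atherosclerosis", "Symptoms", ["<professional sentence>"])]
consumer = [Section("Atherosclerosis", "Signs", ["<consumer sentence>"])]
print(list(parallel_groups(professional, consumer)))  # → [('atherosclerosis', 'symptoms')]
```

Each resulting group is only a candidate: the sentences inside it are not yet paired one to one, which is what the Expert Annotation step addresses. The internal links mentioned in the step are not modelled here.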

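The Knowledge Incorporation step might look like the sketch below. It assumes the `quickumls` Python package with a locally built index, which in turn requires licensed UMLS data. `LinkedMention`, `select_concepts`, `build_matcher`, and `link_mentions` are hypothetical helpers, the threshold value is an assumption, and keeping the highest-similarity candidate per mention is also an assumption, since this section does not say how multiple candidate concepts are resolved.

```python
from typing import Any, NamedTuple


class LinkedMention(NamedTuple):
    mention: str       # surface text matched in the sentence
    start: int         # character offsets of the match
    end: int
    cui: str           # UMLS concept identifier, e.g. "C0013404"
    similarity: float  # QuickUMLS similarity score of the chosen candidate


def select_concepts(matches: list[list[dict[str, Any]]]) -> list[LinkedMention]:
    """Reduce raw QuickUMLS output to one concept per matched span.

    QuickUMLS returns one list of candidate concepts per span. Assumption:
    keep the candidate with the highest similarity score.
    """
    linked = []
    for candidates in matches:
        if not candidates:
            continue
        best = max(candidates, key=lambda c: c["similarity"])
        linked.append(
            LinkedMention(best["ngram"], best["start"], best["end"], best["cui"], best["similarity"])
        )
    return sorted(linked, key=lambda m: m.start)


def build_matcher(index_path: str, threshold: float = 0.7) -> Any:
    """Create a matcher from a locally built QuickUMLS index.

    The import is deferred so that the rest of this sketch can be read and
    tested without QuickUMLS or UMLS data installed.
    """
    from quickumls import QuickUMLS

    return QuickUMLS(index_path, threshold=threshold)


def link_mentions(matcher: Any, sentence: str) -> list[LinkedMention]:
    """Run the matcher on one sentence and keep one concept per mention."""
    return select_concepts(matcher.match(sentence, best_match=True, ignore_syntax=False))
```

With an index in place, `link_mentions(build_matcher("/path/to/quickumls_index"), sentence)` returns one `LinkedMention` per matched span; the path is a placeholder. The exact matches depend on the UMLS release and the threshold, so no output is shown.
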
After the three steps, we obtain a large set of (non-parallel) training sentences in each style, and a small set of parallel sentences for evaluation.
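
The resulting data can be pictured with the following sketch, which makes explicit that the training sentences are unaligned while the evaluation sentences come in pairs. The class and field names are hypothetical and do not reflect the format of the released files.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ParallelPair:
    """One annotated pair: same meaning, different styles."""
    expert: str
    layman: str


@dataclass
class ExpertiseCorpus:
    # Non-parallel training data: the two lists are independent, so the i-th
    # expert sentence does not correspond to the i-th layman sentence.
    train_expert: list[str] = field(default_factory=list)
    train_layman: list[str] = field(default_factory=list)
    # Small annotated set of parallel sentences, used for evaluation.
    test_pairs: list[ParallelPair] = field(default_factory=list)

    def evaluation_inputs(self, source_style: str) -> tuple[list[str], list[str]]:
        """Return (sources, references) for one transfer direction."""
        if source_style not in ("expert", "layman"):
            raise ValueError("source_style must be 'expert' or 'layman'")
        target_style = "layman" if source_style == "expert" else "expert"
        sources = [getattr(pair, source_style) for pair in self.test_pairs]
        references = [getattr(pair, target_style) for pair in self.test_pairs]
        return sources, references
```

Calling `evaluation_inputs("expert")` yields the expert sentences as inputs and their layman counterparts as references; `"layman"` gives the opposite direction.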

Statistics

This section reports statistics of the MSD dataset and compares it with other style transfer (ST) and text simplification (TS) datasets.

Figure 2 shows the distribution of medical topics in the MSD Manuals, and Figure 3 shows the distribution of PICO elements in the test set. Table 1 compares the BLEU and edit distance (ED) scores between parallel sentences in MSD and in existing TS and ST datasets. We see that:

1. MSD has the lowest BLEU and the highest ED. This implies that MSD is very challenging, as it requires both lexical and structural modifications.
2. TS datasets reflect more structural differences (higher ED values) than ST datasets, meaning that TS datasets are more complex to transfer.
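
To make the two scores concrete, here is a minimal, dependency-free sketch of edit distance and 4-gram BLEU over sentence pairs. It is an illustration under assumptions, not the authors' evaluation script: the document does not state the tokenization, whether ED is counted over characters or tokens, which side serves as the reference, or which BLEU implementation was used. Accordingly, `edit_distance` accepts either strings or token lists, and `corpus_bleu` is an unsmoothed, single-reference BLEU written for this example.

```python
import math
from collections import Counter
from collections.abc import Iterable, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(pairs: Iterable[tuple[Sequence[str], Sequence[str]]], max_n: int = 4) -> float:
    """Corpus-level BLEU on a 0-100 scale for (hypothesis, reference) token pairs."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in pairs:
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            # Clipping: an n-gram is credited at most as often as the reference has it.
            matches[n - 1] += sum((hyp_counts & ngram_counts(ref, n)).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


# The fragments quoted under Knowledge Gap, compared at token level.
print(edit_distance("About 1/1000".split(), "quite small".split()))  # → 2
```

Low BLEU indicates little n-gram overlap between the two styles, while high ED indicates that many insertions, deletions, and substitutions separate them, which is why the two scores are read together.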

Datasets	BLEU	ED	Task
GYAFC_E&M	16.22	28.53	ST
GYAFC_F&R	16.95	29.53	ST
Yelp	24.76	22.20	ST
Amazon	44.52	19.75	ST
Authorship	19.43	36.70	ST
SimpWiki	49.98	64.16	TS
MSD	14.01	139.73	ST&TS

Table 1. BLEU (4-gram) and edit distance (ED for short) scores between parallel sentences.
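
The two observations can be checked directly against the values in Table 1. The short script below copies the table into a dictionary; the names `TABLE_1`, `extreme`, and `ed_values` are introduced here for illustration. For the second observation it compares datasets whose Task column includes TS against datasets labelled ST only, which is an assumption about how to read the ST&TS label of MSD.

```python
# Values copied from Table 1: dataset -> (BLEU, ED, task).
TABLE_1 = {
    "GYAFC_E&M": (16.22, 28.53, "ST"),
    "GYAFC_F&R": (16.95, 29.53, "ST"),
    "Yelp": (24.76, 22.20, "ST"),
    "Amazon": (44.52, 19.75, "ST"),
    "Authorship": (19.43, 36.70, "ST"),
    "SimpWiki": (49.98, 64.16, "TS"),
    "MSD": (14.01, 139.73, "ST&TS"),
}
BLEU, ED = 0, 1  # column positions inside each tuple


def extreme(column: int, pick=min) -> str:
    """Name of the dataset with the lowest (or highest) value in a column."""
    return pick(TABLE_1, key=lambda name: TABLE_1[name][column])


def ed_values(task: str, exclusive: bool = False) -> list[float]:
    """ED values of datasets whose Task column includes `task`.

    With exclusive=True, only datasets labelled with `task` alone are kept.
    """
    values = []
    for _bleu, ed, tasks in TABLE_1.values():
        labels = tasks.split("&")
        if task in labels and (not exclusive or labels == [task]):
            values.append(ed)
    return values


# Observation 1: MSD has the lowest BLEU and the highest ED.
print(extreme(BLEU, min))  # → MSD
print(extreme(ED, max))    # → MSD

# Observation 2: every TS dataset has a higher ED than every ST-only dataset.
print(min(ed_values("TS")) > max(ed_values("ST", exclusive=True)))  # → True
```

Note that SimpWiki is the only TS-only dataset in the table, so the second comparison rests on its ED of 64.16 against the highest ST-only value of 36.70.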

Figure 2: Distribution of dataset based on topics.

Figure 3: Distribution of testing set based on PICO.

Disclaimer

According to the Data Management Policy of the National University of Singapore (NUS), which takes effect on 1 May 2020, the dataset cannot be published until NUS approves its release. We will release the dataset to the public as soon as possible. Until then, you can contact shuirh2019@gmail.com to obtain the dataset.

Citation

@inproceedings{cao2020expertise,
title = {Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen},
author = {Cao, Yixin and Shui, Ruihao and Pan, Liangming and Kan, Min-Yen and Liu, Zhiyuan and Chua, Tat-Seng},
booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
year = {2020}
}