Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen


Expertise Style Transfer (EST) is a new text style transfer task between expert language and layman language. It aims to tackle the discrepancy between an expert's advice and a layman's understanding of it, a problem usually caused by cognitive bias that is pervasive across all domains. To promote research on this task, we contribute MSD, a manually annotated dataset in the medical domain.



Our dataset is challenging from two aspects:


The dataset is derived from The Merck Manuals (also known as the MSD Manuals), one of the world's most trusted health references. Each article is written through a collaboration between hundreds of medical experts and comes in two versions: one tailored for consumers and the other for professionals.

Our MSD dataset is constructed in three steps:

After the three steps, we obtain a large set of (non-parallel) training sentences in each style, and a small set of parallel sentences for evaluation.


This part gives some statistics of the MSD dataset and compares it with other Style Transfer (ST) and Text Simplification (TS) datasets.

Figure 2 shows the distribution of medical topics of MSD Manuals, and Figure 3 displays the PICO elements' distribution in the test set. Table 1 compares the BLEU and edit distance (ED for short) scores of parallel sentences of existing TS and ST datasets. We see that:

  1. MSD has the lowest BLEU and the highest ED. This implies that MSD is very challenging, requiring both lexical and structural modifications.

  2. TS datasets reflect more structural differences (higher ED values) than ST datasets, meaning that TS datasets are more complex to transfer.
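
To make the two metrics concrete, here is a minimal Python sketch of how BLEU (up to 4-grams) and edit distance could be computed for one parallel sentence pair. It is not the script used to produce Table 1: whitespace tokenization, token-level Levenshtein distance, unsmoothed single-reference BLEU, and the function names `edit_distance` and `bleu` are all assumptions made for illustration, and the example sentences are invented rather than taken from MSD.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two token sequences (assumed token-level)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ngrams(tokens, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty for candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


# Invented sentence pair for illustration only (not from MSD).
expert = "the patient presented with acute dyspnea".split()
layman = "the person suddenly had trouble breathing".split()

print(edit_distance(expert, layman))  # number of token edits
print(bleu(layman, expert))           # n-gram overlap score in [0, 1]
```

A pair that shares few n-grams and needs many token edits, like the one above, scores low on BLEU and high on ED, which is the pattern the comparison attributes to MSD.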

Table 1. BLEU (4-gram) and edit distance (ED for short) scores between parallel sentences.
Figure 2: Distribution of dataset based on topics.
Figure 3: Distribution of testing set based on PICO.


Under the Data Management Policy of the National University of Singapore (NUS), in effect from 1 May 2020, the dataset cannot be published until NUS approves it. We will release the dataset to the public as soon as possible. Until then, you can contact the authors to obtain the dataset.


title = {Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen},
author = {Cao, Yixin and Shui, Ruihao and Pan, Liangming and Kan, Min-Yen and Liu, Zhiyuan and Chua, Tat-Seng},
booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
