Building a Dataset for Evidence Extraction to Support the Evaluation of Health News Quality | 2022 UWM Undergraduate Research Symposium

Amna Ataullah, “Building a Dataset for Evidence Extraction to Support the Evaluation of Health News Quality”
Mentors: Susan McRoy and Xiaoyu Liu, Computer Science
Poster #3

As new information becomes more accessible to people, misinformation is also increasing, with harmful effects on families and society. Unlike other types of misinformation, misleading health information, especially information that describes health interventions, can cause actual harm to real people. To facilitate work on the automatic evaluation of health news quality, we aim to build a new dataset of sentence-level evidence based on ten established criteria. The annotation guidelines fully adopt the criteria definitions from HealthNewsReview.org. Two people manually annotated 100 health news articles, extracting evidence for each criterion. An independent reviewer was invited to resolve annotation disagreements. We used inter-annotator agreement to measure the quality of the annotation work. Agreement is assessed using both simple counts and the percentage of the final quantity of evidence among the total extracted items. The annotation for four of the ten criteria is complete. For health news that contains claims of efficacy about health interventions, the four criteria assess whether the news explains or quantifies benefits of the intervention (the Benefit criterion); explains or quantifies harms of the intervention (the Harm criterion); mentions the availability of the intervention (the Availability criterion); and mentions costs of the intervention (the Cost criterion). The inter-annotator agreement rates on the evidence extraction for the four criteria are 72% (the Harm criterion), 58% (the Availability criterion), 72% (the Cost criterion), and 87% (the Conflict criterion). In the same order, the extraction yields 318 sentences, 274 sentences, 201 sentences, and 777 sentences, respectively. We also observed that different groups of specific keywords appear repeatedly when annotating for different criteria.
This study presents a dataset to support building automatic evaluation of health news quality, along with the annotation interface, annotator guidelines, and agreement studies.
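
The agreement measure described above (simple counts plus the percentage of final evidence among all extracted items) can be sketched as follows. This is only an illustration under an assumed reading of the abstract: "total extracted items" is taken to mean the union of sentences extracted by either annotator, and "final quantity" the sentences kept after the reviewer resolves disagreements. The function name `agreement_rate` and the toy sentence IDs are hypothetical and do not come from the study.

```python
def agreement_rate(annotator_a, annotator_b, final_evidence):
    """Return (final_count, total_extracted, percentage) for one criterion.

    Assumption: the percentage is the number of final evidence sentences
    divided by the number of distinct sentences extracted by either annotator.
    """
    # Every sentence ID extracted by at least one annotator.
    total_extracted = set(annotator_a) | set(annotator_b)
    # Only count final evidence that was actually extracted by someone.
    final = set(final_evidence) & total_extracted
    if not total_extracted:
        return 0, 0, 0.0
    percentage = 100.0 * len(final) / len(total_extracted)
    return len(final), len(total_extracted), percentage


# Toy example with made-up sentence IDs for a single criterion.
a = {1, 2, 3, 5}
b = {2, 3, 4}
final = {2, 3, 5}
count, total, pct = agreement_rate(a, b, final)
print(f"{count} of {total} extracted sentences kept ({pct:.0f}%)")
```

Other agreement definitions, such as overlap between the two annotators before resolution, would give different numbers; the abstract does not specify which variant produced the reported rates.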