Towards Massively Multi-domain Multilingual Readability Assessment

Abstract

We present ReadMe++, a massively multi-domain, multilingual dataset for automatic readability assessment. Prior work on readability assessment has been mostly restricted to the English language and to one or two text domains. Additionally, the readability levels of sentences in many previous datasets are assumed at the document level rather than the sentence level, which raises doubt about the quality of previous evaluations. We address these gaps by providing an annotated dataset of 9,757 sentences in Arabic, English, Hindi, French, and Russian collected from 112 different data sources. Unlike previous datasets, ReadMe++ offers greater domain and language diversity and is manually annotated at the sentence level using the Common European Framework of Reference for Languages (CEFR), through a Rank-and-Rate annotation framework that reduces subjectivity in annotation. Our experiments demonstrate that models fine-tuned on ReadMe++ achieve strong cross-lingual transfer and generalization to unseen domains. ReadMe++ will be made publicly available to the research community.

Publication
2024 Conference on Empirical Methods in Natural Language Processing (Main Conference)
Michael J. Ryan
Master's student in NLP

My research interests include human-centered NLP and Language Model Programming with DSPy.