Bibliographic Details
Title: SMOL: Professionally translated parallel data for 115 under-represented languages
Authors: Caswell, Isaac; Nielsen, Elizabeth; Luo, Jiaming; Cherry, Colin; Kovacs, Geza; Shemtov, Hadar; Talukdar, Partha; Tewari, Dinesh; Diane, Baba Mamadi; Doumbouya, Koulako Moussa; Diane, Djibrila; Cissé, Solo Farabado
Publication Year: 2025
Collection: Computer Science
Subject Terms: Computer Science - Computation and Language
More Details: We open-source SMOL (Set of Maximal Overall Leverage), a suite of training data to unlock translation for low-resource languages (LRLs). SMOL has been translated into 115 under-resourced languages, including many for which no previous public resources exist, for a total of 6.1M translated tokens. SMOL comprises two sub-datasets, each carefully chosen for maximum impact given its size: SMOL-Sent, a set of sentences chosen for broad unique token coverage, and SMOL-Doc, a document-level source focusing on broad topic coverage. They join the already released GATITOS for a trifecta of paragraph-, sentence-, and token-level content. We demonstrate that using SMOL to prompt or fine-tune Large Language Models yields robust ChrF improvements. In addition to translation, we provide factuality ratings and rationales for all documents in SMOL-Doc, yielding the first factuality datasets for most of these languages. Comment: ~10 pages with appendices
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2502.12301
Accession Number: edsarx.2502.12301
Database: arXiv