CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents

Bibliographic Details
Title: CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents
Authors: Valentini, Francisco; Kozlowski, Diego; Larivière, Vincent
Publication Year: 2025
Collection: Computer Science
Subject Terms: Computer Science - Information Retrieval, Computer Science - Computation and Language
More Details: Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2504.16264
Accession Number: edsarx.2504.16264
Database: arXiv