Data leakage inflates prediction performance in connectome-based machine learning models.

Bibliographic Details
Title: Data leakage inflates prediction performance in connectome-based machine learning models.
Authors: Rosenblatt, Matthew; Tejavibulya, Link; Jiang, Rongtao; Noble, Stephanie; Scheinost, Dustin
Source: Nature Communications; 2/28/2024, Vol. 15 Issue 1, p1-15, 15p
Subject Terms: MACHINE learning, MACHINE performance, PREDICTIVE validity, FEATURE selection, LEAKAGE, PREDICTION models
Abstract: Predictive modeling is a central technique in neuroimaging to identify brain-behavior relationships and test their generalizability to unseen data. However, data leakage undermines the validity of predictive models by breaching the separation between training and test data. Leakage is always an incorrect practice but still pervasive in machine learning. Understanding its effects on neuroimaging predictive models can inform how leakage affects existing literature. Here, we investigate the effects of five forms of leakage–involving feature selection, covariate correction, and dependence between subjects–on functional and structural connectome-based machine learning models across four datasets and three phenotypes. Leakage via feature selection and repeated subjects drastically inflates prediction performance, whereas other forms of leakage have minor effects. Furthermore, small datasets exacerbate the effects of leakage. Overall, our results illustrate the variable effects of leakage and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling. The effects of data leakage on predictive models in neuroimaging studies are not well understood. Here, the authors show that data leakage via feature selection and repeated subjects drastically inflates prediction performance, whereas other forms of leakage have more minor effects. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
ISSN: 2041-1723
DOI: 10.1038/s41467-024-46150-w
Published in: Nature Communications
Language: English