Interpretable machine learning for allergic rhinitis prediction among preschool children in Urumqi, China

Bibliographic Details
Title: Interpretable machine learning for allergic rhinitis prediction among preschool children in Urumqi, China
Authors: Jinyang Wang, Ye Yang, Xueli Gong
Source: Scientific Reports, Vol 14, Iss 1, Pp 1-11 (2024)
Publisher Information: Nature Portfolio, 2024.
Publication Year: 2024
Collection: LCC:Medicine, LCC:Science
Subject Terms: Allergic rhinitis, Preschool children, Machine learning, Model interpretability, Prediction model, Optimal cut-off value, Medicine, Science
More Details: Abstract: This study aimed to investigate the advantages and applications of machine learning models in predicting the risk of allergic rhinitis (AR) in children aged 2–8, compared to traditional logistic regression. The study analyzed questionnaire data from 7131 children aged 2–8, which were randomly divided into training, validation, and testing sets in a ratio of 55:15:30, with the division repeated 100 times. Predictor variables included parental allergy, medical history during the child’s first year (cfy), and early-life environmental factors. The time of first onset of AR was restricted to after the age of 1 year to establish a clear temporal relationship between the predictor variables and the outcome. Feature engineering used the chi-square test and the Boruta algorithm to refine the dataset for analysis. The models constructed were Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting Tree (XGBoost). Model performance was evaluated using the area under the receiver operating characteristic curve (AUROC); the optimal decision threshold was determined by weighing multiple metrics on the validation sets, and results were reported on the testing set. Additionally, the strengths and limitations of the different models were analyzed by stratifying by gender, mode of birth, and age subgroup, as well as by varying the number of predictor variables. Furthermore, Shapley additive explanations (SHAP) and the purity of node partitions in Random Forest were used to assess feature importance, and model stability was explored by altering the number of features. Of the 7131 children analyzed, 524 (7.35%) were diagnosed with AR, with an onset age ranging from 2 to 8 years.
Optimal parameters were refined using the validation set, and 100 random divisions with repeated training ensured robust evaluation of the models on the testing set. Model construction incorporated fourteen variables, including the history of allergy-related diseases during the child’s first year, familial genetic factors, and early-life indoor environmental factors. The AUROC of LR, SVM, RF, and XGBoost on the unstratified test set was 0.715 (standard deviation = 0.023), 0.723 (0.022), 0.747 (0.015), and 0.733 (0.019), respectively; the performance of each model was stable on the stratified data, and the RF performance was significantly better than that of LR (paired samples t-test: p
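Illustrative sketch (not the authors' code): the evaluation protocol described in the abstract can be approximated with the Python standard library alone. The snippet shows a 55:15:30 random split, AUROC computed from ranks, a cut-off chosen on validation data, and a paired-samples t statistic for comparing two models across repeated splits. All function names are hypothetical. Youden's J is used here only as a stand-in for the abstract's "weighing multiple metrics", which this record does not specify. The demo values are toy data, not study results.

```python
import random
import statistics


def split_indices(n, ratios=(0.55, 0.15, 0.30), seed=0):
    """Shuffle 0..n-1 and cut it into train/validation/test index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    # The test set takes whatever remains after train and validation.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def auroc(labels, scores):
    """AUROC via the Mann-Whitney rank formulation; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def youden_threshold(labels, scores):
    """Return (cut-off, J) maximizing sensitivity + specificity - 1.

    Assumption: Youden's J stands in for the unspecified multi-metric rule.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def paired_t_statistic(a, b):
    """t statistic for paired samples, e.g. per-split AUROCs of two models."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / len(diffs) ** 0.5)


if __name__ == "__main__":
    # Toy data only: four made-up labels and scores, not study data.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    train, val, test = split_indices(7131)
    print(len(train), len(val), len(test))
    print(auroc(toy_labels, toy_scores))
    print(youden_threshold(toy_labels, toy_scores))
```

In a full replication, one would call split_indices with 100 different seeds, fit each model on the training indices, choose the cut-off on the validation indices, and pass the 100 per-split test AUROCs of two models to paired_t_statistic.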
Document Type: article
File Description: electronic resource
Language: English
ISSN: 2045-2322
Relation: https://doaj.org/toc/2045-2322
DOI: 10.1038/s41598-024-73733-w
Access URL: https://doaj.org/article/94e2c295ef3b44499a4fec3f2bdf7c0b
Accession Number: edsdoj.94e2c295ef3b44499a4fec3f2bdf7c0b
Database: Directory of Open Access Journals