Academic Journal
Interpretable machine learning for allergic rhinitis prediction among preschool children in Urumqi, China
Title: | Interpretable machine learning for allergic rhinitis prediction among preschool children in Urumqi, China |
---|---|
Authors: | Jinyang Wang, Ye Yang, Xueli Gong |
Source: | Scientific Reports, Vol 14, Iss 1, Pp 1-11 (2024) |
Publisher Information: | Nature Portfolio, 2024. |
Publication Year: | 2024 |
Collection: | LCC:Medicine LCC:Science |
Subject Terms: | Allergic rhinitis, Preschool children, Machine learning, Model interpretability, Prediction model, Optimal cut-off value, Medicine, Science |
More Details: | Abstract: This study investigated the advantages and applications of machine learning models, compared with traditional logistic regression, in predicting the risk of allergic rhinitis (AR) in children aged 2–8. Questionnaire data from 7131 children aged 2–8 were randomly divided into training, validation, and testing sets in a ratio of 55:15:30, and the division was repeated 100 times. Predictor variables included parental allergy, medical history during the child’s first year (cfy), and early-life environmental factors. The first onset of AR was restricted to after the age of 1 year to establish a clear temporal relationship between the predictor variables and the outcome. Feature engineering used the chi-square test and the Boruta algorithm to refine the dataset for analysis. Four models were constructed: Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting Tree (XGBoost). Model performance was evaluated using the area under the receiver operating characteristic curve (AUROC); the optimal decision threshold was determined by weighing multiple metrics on the validation sets, and results were reported on the testing set. The strengths and limitations of the different models were analyzed by stratifying by gender, mode of birth, and age subgroup, as well as by varying the number of predictor variables. Shapley additive explanations (SHAP) and the purity of node partitions in Random Forest were used to assess feature importance, and model stability was explored by altering the number of features. Of the 7131 children analyzed, 524 (7.35%) were diagnosed with AR, with an onset age ranging from 2 to 8 years. Optimal parameters were refined using the validation set, and 100 random divisions with repeated training ensured robust evaluation of the models on the testing set. Fourteen variables were incorporated into the models, including the history of allergy-related diseases during the child’s first year, familial genetic factors, and early-life indoor environmental factors. The performance of LR, SVM, RF, and XGBoost on the unstratified test set was 0.715 (standard deviation = 0.023), 0.723 (0.022), 0.747 (0.015), and 0.733 (0.019), respectively; the performance of each model was stable on the stratified data, and RF performed significantly better than LR (paired samples t-test: p |
Document Type: | article |
File Description: | electronic resource |
Language: | English |
ISSN: | 2045-2322 |
Relation: | https://doaj.org/toc/2045-2322 |
DOI: | 10.1038/s41598-024-73733-w |
Access URL: | https://doaj.org/article/94e2c295ef3b44499a4fec3f2bdf7c0b |
Accession Number: | edsdoj.94e2c295ef3b44499a4fec3f2bdf7c0b |
Database: | Directory of Open Access Journals |
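
The evaluation protocol summarized in the abstract (repeated random 55:15:30 splits, a decision threshold chosen on the validation set, AUROC reported on the testing set) can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration and is not the authors' code: the synthetic data, the three hypothetical binary risk factors, the frequency-table "model" standing in for LR/SVM/RF/XGBoost, and the use of Youden's J as the threshold criterion (the abstract says only that multiple metrics were weighed).

```python
import random
import statistics


def split_indices(n, rng, ratios=(0.55, 0.15, 0.30)):
    """Shuffle 0..n-1 and cut into train/validation/test index lists."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def auroc(labels, scores):
    """AUROC as P(positive outscores negative), counting ties as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def youden_threshold(labels, scores):
    """Pick the cut-off maximizing sensitivity + specificity - 1.

    Youden's J is an assumed criterion, swapped in for illustration.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t


def fit_rate_table(X, y):
    """Toy stand-in model: empirical positive rate per feature combination."""
    counts = {}
    for row, label in zip(X, y):
        pos, tot = counts.get(row, (0, 0))
        counts[row] = (pos + label, tot + 1)
    base = sum(y) / len(y)  # fallback for combinations unseen in training
    return lambda row: counts[row][0] / counts[row][1] if row in counts else base


def make_synthetic(n=600, seed=1):
    """Hypothetical data: three binary risk factors that raise outcome risk."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = tuple(int(rng.random() < 0.3) for _ in range(3))
        X.append(row)
        y.append(int(rng.random() < 0.05 + 0.15 * sum(row)))
    return X, y


def evaluate(X, y, repeats=20, seed=0):
    """Repeat split/fit/threshold/test and collect per-repeat results."""
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        tr, va, te = split_indices(len(y), rng)
        model = fit_rate_table([X[i] for i in tr], [y[i] for i in tr])
        thr = youden_threshold([y[i] for i in va], [model(X[i]) for i in va])
        auc = auroc([y[i] for i in te], [model(X[i]) for i in te])
        results.append({"threshold": thr, "test_auroc": auc})
    return results


if __name__ == "__main__":
    X, y = make_synthetic()
    res = evaluate(X, y)
    aucs = [r["test_auroc"] for r in res]
    print(f"test AUROC: mean={statistics.mean(aucs):.3f}, "
          f"sd={statistics.stdev(aucs):.3f}")
```

Reporting the mean and standard deviation over repeats mirrors the "0.715 (standard deviation = 0.023)" style of the abstract. With per-repeat AUROC values for two models on the same splits, a paired comparison such as the paired samples t-test mentioned in the abstract could be computed on the two lists.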