Artificial intelligence versus human touch: can artificial intelligence accurately generate a literature review on laser technologies?

Bibliographic Details
Title: Artificial intelligence versus human touch: can artificial intelligence accurately generate a literature review on laser technologies?
Authors: Panthier, Frédéric; Crawford-Smith, Hugh; Alvarez, Eduarda; Melchionna, Alberto; Velinova, Daniela; Mohamed, Ikran; Price, Siobhan; Choong, Simon; Arumuham, Vimoshan; Allen, Sian; Traxer, Olivier; Smith, Daron
Source: World Journal of Urology; 10/28/2024, Vol. 42 Issue 1, p1-10, 10p
Subject Terms: Language models, Nurse practitioners, Artificial intelligence, Laser lithotripsy, Open source intelligence
Abstract:
Purpose: To compare the accuracy of open-source Artificial Intelligence (AI) Large Language Models (LLMs) against human authors in generating a systematic review (SR) on the new pulsed-Thulium:YAG (p-Tm:YAG) laser.
Methods: Five manuscripts were compared. The Human-SR on p-Tm:YAG (considered the "ground truth") was written by independent certified endourologists with expertise in lasers and accepted in a peer-reviewed, PubMed-indexed journal (but not yet available online, and therefore not accessible to the LLMs). The query "write a systematic review on pulsed-Thulium:YAG laser for lithotripsy" was submitted to four LLMs (ChatGPT3.5/Vercel/Claude/Mistral-7b). The LLM-SRs were standardized and the Human-SR reformatted to match the general output appearance, to ensure blinding. Nine participants with varying levels of endourological expertise (three each of Clinical Nurse Specialists, Urology Trainees and Consultants) objectively assessed the accuracy of the five SRs using a bespoke 10-checkpoint proforma. A subjective assessment was recorded using a composite score combining quality (0–10), clarity (0–10) and overall manuscript rank (1–5).
Results: The Human-SR was objectively and subjectively more accurate than the LLM-SRs (96 ± 7% and 86.8 ± 8.2%, respectively; p < 0.001). The LLM-SRs did not differ significantly from one another, but ChatGPT3.5 achieved the highest subjective and objective accuracy scores (62.4 ± 15% and 29 ± 28%, respectively; p > 0.05). Quality and clarity assessments were significantly affected by SR type but not by expertise level (p < 0.001 and p > 0.05, respectively).
Conclusions: LLM-generated content on highly technical topics is less accurate than that produced by Key Opinion Leaders. With human supervision, LLMs, especially ChatGPT3.5, could improve our practice. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
More Details
ISSN: 0724-4983
DOI: 10.1007/s00345-024-05311-8
Published in: World Journal of Urology
Language: English