Improving Text Classification with Large Language Model-Based Data Augmentation.

Bibliographic Details
Title:	Improving Text Classification with Large Language Model-Based Data Augmentation.
Authors:	Zhao, Huanhuan; Chen, Haihua; Ruggles, Thomas A.; Feng, Yunhe; Singh, Debjani; Yoon, Hong-Jun
Source:	Electronics (2079-9292); Jul2024, Vol. 13 Issue 13, p2535, 14p
Subject Terms:	DATA augmentation, LANGUAGE models, CHATGPT, NATURAL language processing, CLASSIFICATION
Company/Entity:	REUTERS Ltd.
Abstract:	Large Language Models (LLMs) such as ChatGPT possess advanced capabilities in understanding and generating text. These capabilities enable ChatGPT to create text based on specific instructions, which can serve as augmented data for text classification tasks. Previous studies have approached data augmentation (DA) by either rewriting the existing dataset with ChatGPT or generating entirely new data from scratch. However, without a direct comparison of their effectiveness, it is unclear which method is better. This study investigates the application of both methods to two datasets: a general-topic dataset (Reuters news data) and a domain-specific dataset (Mitigation dataset). Our findings indicate that: 1. New data generated by ChatGPT consistently enhanced the model's classification results for both datasets. 2. Generating new data generally outperforms rewriting existing data, though crafting the prompts carefully is crucial to extract the most valuable information from ChatGPT, particularly for domain-specific data. 3. The augmentation data size affects the effectiveness of DA; however, we observed a plateau after incorporating 10 samples. 4. Combining rewritten samples with newly generated samples can potentially further improve the model's performance. [ABSTRACT FROM AUTHOR]
Database:	Complementary Index

More Details
ISSN:	2079-9292
DOI:	10.3390/electronics13132535
Published in:	Electronics (2079-9292)
Language:	English