Academic Journal
On the improvement of reinforcement active learning with the involvement of cross entropy to address one-shot learning problem.
Title: | On the improvement of reinforcement active learning with the involvement of cross entropy to address one-shot learning problem. |
---|---|
Authors: | Honglan Huang, Jincai Huang, Yanghe Feng, Jiarui Zhang, Zhong Liu, Qi Wang, Li Chen |
Source: | PLoS ONE, Vol 14, Iss 6, p e0217408 (2019) |
Publisher Information: | Public Library of Science (PLoS), 2019. |
Publication Year: | 2019 |
Collection: | LCC:Medicine LCC:Science |
Subject Terms: | Medicine, Science |
More Details: | As a promising research direction in recent decades, active learning allows an oracle to assign labels to typical examples for performance improvement in learning systems. Existing works mainly focus on designing criteria for screening examples of high value to be labeled in a handcrafted manner. Instead of manually developing strategies of querying the user to access labels for the desired examples, we utilized the reinforcement learning algorithm parameterized with the neural network to automatically explore query strategies in active learning when addressing stream-based one-shot classification problems. With the involvement of cross-entropy in the loss function of Q-learning, an efficient policy to decide when and where to predict or query an instance is learned through the developed framework. Compared with a former influential work, the advantages of our method are demonstrated experimentally with two image classification tasks, and it exhibited better performance, quick convergence, relatively good stability and fewer requests for labels. |
Document Type: | article |
File Description: | electronic resource |
Language: | English |
ISSN: | 1932-6203 |
Relation: | https://doaj.org/toc/1932-6203 |
DOI: | 10.1371/journal.pone.0217408 |
Access URL: | https://doaj.org/article/38c11e8c6f284674b3e4c329e9955dee |
Accession Number: | edsdoj.38c11e8c6f284674b3e4c329e9955dee |
Database: | Directory of Open Access Journals |
Full Text:

On the improvement of reinforcement active learning with the involvement of cross entropy to address one-shot learning problem

As a promising research direction in recent decades, active learning allows an oracle to assign labels to typical examples for performance improvement in learning systems. Existing works mainly focus on designing criteria for screening examples of high value to be labeled in a handcrafted manner. Instead of manually developing strategies of querying the user to access labels for the desired examples, we utilized the reinforcement learning algorithm parameterized with the neural network to automatically explore query strategies in active learning when addressing stream-based one-shot classification problems. With the involvement of cross-entropy in the loss function of Q-learning, an efficient policy to decide when and where to predict or query an instance is learned through the developed framework. Compared with a former influential work, the advantages of our method are demonstrated experimentally with two image classification tasks, and it exhibited better performance, quick convergence, relatively good stability and fewer requests for labels.

Keywords: Research Article; Biology and life sciences; Neuroscience; Cognitive science; Cognitive psychology; Learning; Human learning; Psychology; Social sciences; Learning and memory; Computer and information sciences; Artificial intelligence; Machine learning; Physical sciences; Mathematics; Applied mathematics; Algorithms; Research and analysis methods; Simulation and modeling; Physics; Thermodynamics; Entropy; Machine learning algorithms; Probability theory; Probability distribution; Decision making; Cognition

Introduction

In recent decades, machine learning has attracted increasing attention from both industry and academia and shown its great power in universal applications, such as pattern analysis [1], knowledge discovery and discipline prediction. As acknowledged in this domain, data resources are crucial in learning tasks. A direct strategy to process data and incorporate human experience is to formulate labels for examples. In small-scale datasets, precise annotation based on expert knowledge is acceptable. However, when large-scale datasets are used for complicated tasks, complete and perfect annotations are no longer viable, because the labeling process for these datasets is labor-intensive, costly in terms of time and money, and dependent on domain experience. As dataset volume increases, the learning system tends to generalize better, but the cost of annotation dramatically increases [2].
Meanwhile, former studies have revealed that obtaining the ground truth labels of a dataset not only requires the participation of a large number of experts in the field, but can also take more than 10 times longer than collecting the instances themselves [3]. In contrast, accessing a massive number of unlabeled instances is relatively easy. The availability of massive numbers of unlabeled examples, together with the potential task-beneficial information buried in them, has inspired several effective paradigms in the learning domain, including semi-supervised learning and active learning. The goals of these emerging paradigms are to take advantage of unlabeled datasets to improve performance and to reduce the workload of human experts. Semi-supervised learning has developed quickly in recent years, exploiting statistical or geometrical information in unlabeled examples to enhance generalization. Notably, however, the involvement of unlabeled examples in a semi-supervised framework may be inappropriate and degrade the original accuracy in certain scenarios. Another powerful learning paradigm, active learning, is significantly distinct from semi-supervised learning in theory and practice. The difference is that the active learning algorithm simulates the human learning process to some extent: it selects a subset of instances to label and add to the training set, and iteratively improves the generalization performance of the classifier. Therefore, this paradigm has been widely used in information retrieval [4], image and speech recognition [5–11], and text analysis [12–14] in recent years.

The core of traditional active learning methods is to formulate criteria for selecting samples, and commonly used methods include uncertainty sampling [15], query-by-committee [16], margin-based sampling [17], and representative and diversity-based sampling [18]. However, determining which approach is better is difficult since each approach starts from a reasonable, meaningful, and completely different motivation. To the best of our knowledge, no universal method that performs best on all datasets currently exists. These limitations drive us to explore new frameworks to address the sample-selection problem. Observing that human beings can learn new concepts from a single example [19], we sought to design an artificial intelligence agent that can inherit a similar capability and pose fewer requests for labeling new examples during the training process [20]. An ideal case in active learning is one in which labeling of critical examples is still required, but the frequency can be minimized. We preferred a model that learns active learning algorithms via reinforcement learning [21, 22], rather than a hand-designed criterion.
More specifically, the selection or design of a labeling strategy for new examples can be performed automatically.

Therefore, we propose a novel learning method that can not only learn to classify instances with little supervision but also capture a relatively optimal label query strategy. Our method is mostly inspired by the work of Mark Woodward et al. [23] and can be viewed as a practical extension of that work. Our model falls into the class of stream-based active learners, which is based on the online setting of active learning. The use of reinforcement learning by an active learner to solve a sequential decision problem is a natural fit, since each query action affects the next decision (when and which instance to query based on the state of the basic learner). Accordingly, an active query system trained by reinforcement learning can learn a cogent, non-myopic strategy [24] and make effective decisions with little supervision.

Our primary contribution in this work is an improvement of the influential active one-shot learning (AOL) model introduced by Mark Woodward et al. [23], which was the first to apply reinforcement learning with deep recurrent models to the task of active learning. With an additional cross-entropy term involved in the loss function of Q-learning, we significantly accelerate convergence, avoid the gradient vanishing problem, improve stability, reduce the number of requested labels, and improve accuracy in comparison with that former work [23]. Meanwhile, we evaluate the model on Omniglot [19, 25, 26] ("active" variants of existing one-shot learning tasks [27]), and the experimental results show the efficiency of our model in exploring label querying strategies. We empirically demonstrate that our model can achieve better performance with fewer iterations and learn a query strategy based on the uncertainty [28] of instances in an end-to-end fashion. Accordingly, the workload of human experts can be partially reduced during the learning process.

Related work

The setting of active learning is mainly based on three scenarios: (i) membership query synthesis, (ii) pool-based sampling, and (iii) stream-based selective sampling [29]. In the membership query synthesis scenario, the learner can select a new instance to label from the input space, or it can generate a new instance. In the pool-based scenario, the learner can request labels for any instance from a large amount of historical data. Finally, in the stream-based active learning scenario, instances are continually obtained from the data stream and presented in an exogenously determined order. The learner must instantly decide whether to request a label for the new instance [30].
Various practical scenarios have benefited from the idea of active learning, including movie recommendation [31–33], medical image classification [34], and natural language processing.

In recent years, reinforcement learning has gained considerable attention. Due to its capability of interacting with the environment and providing a good approximation of the objective value based on the relevant feedback, this method is theoretically suitable for online, real-time forecasting and decision-making. Particularly for complex tasks in unknown environments, reinforcement learning can learn an optimal strategy through exploration and exploitation. This learning framework has also been successfully applied to solve complex predictive and control problems in virtual environments [21].

In this article, we mainly consider the setting of the third scenario, single-pass stream-based online active learning. Many studies have focused on active learning based on data streams [35–37], and a common opinion is that the choice of a proper instance to label should be based on maximizing the expected informativeness of the labeled instances [30]. In general, most of these methods rely strongly on heuristics, such as similarity measures between former instances and current instances [38] or the extent of uncertainty in label prediction [36, 38, 39]. To move away from engineered selection heuristics, we introduce a model that learns an active learning algorithm end-to-end via reinforcement learning. The premise of active learning is that costs associated with requesting labels and making false predictions exist [23]. Reinforcement learning can optimize these costs by setting them explicitly and directly identifying an action strategy. Therefore, we believe that combining reinforcement learning with active learning is a reasonable and appealing approach. Some recent studies have been based on a similar inspiration. Woodward and Finn [23] first applied reinforcement learning with deep recurrent models to the task of active learning. Bachman et al. [27] and Pang et al. [24] investigated pool-based active learning algorithms via meta-learning. The same idea emerged in the artificial intelligence classification systems developed by Puzanov and Cohen [20]. Recent approaches, such as meta-learning and one-shot learning, are closely related to our model. Santoro et al. [25] proposed a supervised learning model using meta-learning with memory-augmented neural networks, which approached the same task as ours.
The practical applications of these methods show that they are good solutions to the cold-start problem [31, 40–42]. In our work, a deep recurrent neural network [43] function approximator is used to represent the action-value function, and a cross-entropy [44] term is introduced into the loss function to improve the performance of the algorithm.

Model description

In this section, we present a novel model based on the reinforcement one-shot active learning (ROAL) framework, which can monitor a stream of instances and select an appropriate action (classify or query the label) for each arriving instance. Our model meta-learns a query strategy that intelligently captures when, and for which instances, a query is worthwhile. In the present study, a long short-term memory (LSTM) network connected to a linear output layer is used to approximate the action-value function.

Task description

In the stream-based online active learning scenario, obtaining the ground truth label of a data instance is costly; therefore, an algorithm is required to judiciously determine which instances to label [29, 45]. In this setting [29, 46], the algorithm takes an action and chooses whether or not to request the ground truth at the time that the instance arrives. The classification task that we focus on is a stream of images, in which a decision must be made to either query or predict the label. Similar to works on one-shot learning [25, 26], the behavior of our model is refined over short training episodes with a small number of examples per class, so as to maximize performance in test episodes on instances that are not encountered in training. The structure of our active learning task is shown in Fig 1. At each time step of the episode, the model receives an instance x_t and must decide which action to execute. Assume that in each episode, up to M possible classes exist. Let a_t be the action at time step t; then, the action space is defined as follows:

$$\mathcal{A} \triangleq \{c_1, \ldots, c_M, a_{req}\} \tag{1}$$

Action a_t = c_i is taken when the model classifies the instance under category i without requiring the true label at time t. Action a_t = a_req is taken when the model requests the true label y. Here we encode the action a_t as a one-hot vector consisting of the optionally predicted label ŷ followed by a bit for requesting the label. Since only one bit can be 1, the model can either make a label prediction or request the label. If the model requests the label of instance x_t, then no prediction is made, and the true label y_t of the instance is fed into the model in the next observation o_{t+1} along with a new instance x_{t+1}. If the model decides to predict, then no request is made and a zero vector is included in the next observation instead of the true label.

r_t is the reward received after action a_t in state s_t, and γ represents the discount factor for future rewards. At each time step, one of three rewards is given depending on the chosen action: R_cor for correctly predicting the label, R_inc for incorrectly predicting the label, or R_req for requesting the label. The aim is to maximize the sum of the rewards received in the episode:

$$r_t = \begin{cases} R_{cor}, & \text{if predicting and } \hat{y}_t = y_t \\ R_{inc}, & \text{if predicting and } \hat{y}_t \neq y_t \\ R_{req}, & \text{if a label is requested} \end{cases} \tag{2}$$
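To make the action encoding and reward assignment concrete, the following minimal sketch (not the authors' code; the function and constant names such as step_episode and R_COR are hypothetical) applies one action vector of length M + 1, returns the reward from Eq (2), and builds the next observation so that the true label is revealed only after a request.

```python
import numpy as np

M = 3                                      # classes per episode
R_COR, R_INC, R_REQ = +1.0, -1.0, -0.05    # reward values from Eq (2)

def step_episode(action_vec, x_next, y_true_onehot):
    """Apply one action a_t (one-hot of length M+1) and build o_{t+1}.

    action_vec[:M] -> predicted class bits; action_vec[M] -> request-label bit.
    Returns (reward, next_observation).
    """
    if action_vec[M] == 1:                 # label requested
        reward = R_REQ
        label_feedback = y_true_onehot     # true label revealed at t+1
    else:                                  # prediction made
        correct = np.argmax(action_vec[:M]) == np.argmax(y_true_onehot)
        reward = R_COR if correct else R_INC
        label_feedback = np.zeros(M)       # zero vector instead of the label
    return reward, (x_next, label_feedback)
```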
Methodology

Reinforcement learning aims at seeking practical and superior strategies in complicated control and prediction tasks by interacting with the environment. Through exploration and exploitation, it can estimate the goodness of a policy and improve it based on experience. The basic structure of reinforcement learning is shown in Fig 2. An efficient model-free reinforcement learning method, Q-learning, is employed in this paper to learn an optimal strategy that maximizes the expected sum of discounted future rewards. Q-learning has been widely used in a variety of decision-making problems [47], mainly because it can estimate the expected utility of the available actions and adapt to stochastic transitions without prior knowledge of the system model [48].

Reinforcement learning requires the definition of an objective function that captures the long-run benefit of an action. The idea of Q-learning is not to estimate a model of the environment, but to optimize a Q function that can be computed directly. The Q function reflects the gain obtained after performing action a_t in state s_t, accumulating the discounted reward of the best action sequence performed afterwards:

$$Q(s_t, a_t) = r_t + \gamma \max_{a_{t+1} \in \mathcal{A}} Q(s_{t+1}, a_{t+1}) \tag{3}$$

Let π(s_t) be a policy that is followed in state s_t and outputs an action a_t at time t. A policy that is better than or equal to all other policies always exists, and this policy is called the optimal policy π*(s_t). The optimal policy is the strategy that maximizes the optimal action-value function Q*(s_t, a_t). In other words, the action that the model selects is given by the optimal policy π*, which is obtained by maximizing the optimal action-value function:

$$a_t = \pi^*(s_t) = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q^*(s_t, a) \tag{4}$$

According to the Bellman equation, the optimal action-value function can be written as follows:

$$Q^*(s_t, a_t) = \mathbb{E}_{s_{t+1}}\!\left[ r_t + \gamma \max_{a_{t+1} \in \mathcal{A}} Q^*(s_{t+1}, a_{t+1}) \right] \tag{5}$$
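As a purely didactic illustration of the update behind Eqs (3)–(5), the tabular sketch below (not part of the ROAL model, which replaces the table with an LSTM approximator; all names and constants are hypothetical) moves Q(s, a) toward the one-step target and acts greedily as in Eq (4).

```python
import numpy as np

n_states, n_actions = 5, 4
gamma, alpha = 0.5, 0.1                  # discount and step size (assumed values)
Q = np.zeros((n_states, n_actions))      # tabular action-value estimates

def q_update(s, a, r, s_next):
    """One-step Q-learning: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def greedy_policy(s):
    """pi*(s) = argmax_a Q(s, a), as in Eq (4)."""
    return int(np.argmax(Q[s]))
```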
Normally, a function approximator is used to represent Q(s_t, a_t), and its parameters are optimized by minimizing the Bellman error. Woodward et al. [23] derived the loss function as follows:

$$L(\theta) := \sum_t \left[ Q_\theta(o_t, a_t) - \left( r_t + \gamma \max_{a_{t+1} \in \mathcal{A}} Q^*(s_{t+1}, a_{t+1}) \right) \right]^2 \tag{6}$$

where θ represents the parameters of the function approximator and o_t represents the observations, such as images, that the agent receives.

However, in the early stages of training, this loss function tends to be inefficient and prone to the gradient vanishing problem, because it only considers the maximum value of Q. To avoid these shortcomings and accelerate training, we introduce the cross-entropy between the Q values and the labels into the loss function. Cross-entropy is an important concept in Shannon's information theory that is mainly used to measure the difference between two probability distributions. The intuition is that we want the label prediction distribution output by the model to become more similar to the distribution of the real label. This idea has been applied in many fields of machine learning. Inspired by it, we design our loss function as follows:

$$L(\theta) := \begin{cases} \sum_t \left( \left[ Q_\theta(o_t, a_t) - \left( r_t + \gamma \max_{a_{t+1} \in \mathcal{A}} Q^*(s_{t+1}, a_{t+1}) \right) \right]^2 - p(Q(o_t, a_t)) \log q(\mathrm{label}(t)) \right), & \text{if predicting} \\ \sum_t \left[ Q_\theta(o_t, a_t) - \left( r_t + \gamma \max_{a_{t+1} \in \mathcal{A}} Q^*(s_{t+1}, a_{t+1}) \right) \right]^2, & \text{if a label is requested} \end{cases} \tag{7}$$

where p(Q(o_t, a_t)) represents the probability distribution derived from Q(o_t, a_t) and q(label(t)) represents the probability distribution of the true label at time step t.
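A hedged sketch of one way to implement the augmented loss in Eq (7) is given below. It is one plausible reading rather than the authors' released code: the Q-values over the M class actions are softmax-normalized and compared with the true label via cross-entropy, and that term is added only on steps where the agent predicted; function and argument names (roal_loss, predicted_mask, and so on) are hypothetical.

```python
import torch
import torch.nn.functional as F

def roal_loss(q_values, actions, targets, true_labels, predicted_mask):
    """Cross-entropy-augmented Q-learning loss, in the spirit of Eq (7).

    q_values       : (T, M+1) Q(o_t, .) from the recurrent Q-network
    actions        : (T,) long, index of the action actually taken
    targets        : (T,) detached Bellman targets r_t + gamma * max_a' Q(o_{t+1}, a')
    true_labels    : (T,) long, ground-truth class indices
    predicted_mask : (T,) bool, True where the agent predicted instead of querying
    """
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    bellman = (q_taken - targets) ** 2                 # squared Bellman error, Eq (6)

    class_logits = q_values[:, :-1]                    # drop the "request label" action
    ce = F.cross_entropy(class_logits, true_labels, reduction="none")

    # Cross-entropy term only on prediction steps (first case of Eq 7).
    return (bellman + predicted_mask.float() * ce).sum()
```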
We use an LSTM network [43] connected to a linear output layer to implement the action-value function Q(o_t, a_t) in Q-learning, as shown in Fig 3. Q(o_t) outputs a vector in which each element corresponds to an action:

$$Q(o_t, a_t) = Q(o_t) \cdot a_t \tag{8}$$

$$Q(o_t) = W_{hq} h_t + b^q \tag{9}$$

where b^q is the action-value bias, h_t is the output of the LSTM, and W_{hq} represents the weights mapping from the LSTM output to the action-values. A basic LSTM is used in our model, with the following equations:

$$\hat{g}^f, \hat{g}^i, \hat{g}^o, \hat{c}_t = W^o o_t + W^h h_{t-1} + b \tag{10}$$

$$g^f = \sigma(\hat{g}^f) \tag{11}$$

$$g^i = \sigma(\hat{g}^i) \tag{12}$$

$$g^o = \sigma(\hat{g}^o) \tag{13}$$

$$c_t = g^f \odot c_{t-1} + g^i \odot \tanh(\hat{c}_t) \tag{14}$$

$$h_t = g^o \odot \tanh(c_t) \tag{15}$$

Here, ĝ^f, ĝ^i, and ĝ^o are the pre-activations of the forget, input, and output gates, respectively; ĉ_t is the candidate cell state, and c_t is the new LSTM cell state. W^o and W^h represent the weights mapping from the observation and from the previous hidden state, respectively, to the gates and candidate cell state, and b is the bias vector. σ(·) is the sigmoid function, ⊙ denotes element-wise multiplication, and tanh(·) is the hyperbolic tangent function.
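As a rough sketch of the architecture in Fig 3 and Eqs (8)–(15), written under assumptions rather than taken from the paper's code, the snippet below wires a single LSTM to a linear head that emits one Q-value per action; the class name QNetwork and the input layout (flattened image concatenated with the one-hot label feedback) are assumptions.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, image_dim=28 * 28, num_classes=3, hidden=200):
        super().__init__()
        obs_dim = image_dim + num_classes                        # x_t plus label feedback
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)   # Eqs (10)-(15)
        self.q_head = nn.Linear(hidden, num_classes + 1)         # Eq (9): W_hq h_t + b^q

    def forward(self, observations, state=None):
        # observations: (batch, time, obs_dim); returns Q(o_t) for every time step.
        h, state = self.lstm(observations, state)
        return self.q_head(h), state

# Example shapes: a batch of 50 episodes of length 30, as in the Omniglot set-up.
q_net = QNetwork()
q_values, _ = q_net(torch.zeros(50, 30, 28 * 28 + 3))   # -> (50, 30, 4)
```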
Experiments

We examined our proposed ROAL model under an AOL set-up for two image classification tasks and compared the experimental results of the present study with those of the previous study. Our goal is to further study the following points through experiments: 1) whether the proposed model can learn a practical strategy that knows how to label instances and when to instead request a label, and 2) whether the model effectively uses its uncertainty about instances to make decisions.

Omniglot

Setup

We performed our first experiments on the Omniglot dataset [19], consisting of 1623 classes of characters from 50 different alphabets, each hand-written by 20 different people, for a total of 32,460 instances. Following Woodward et al. [23], we randomly divided the dataset into 1200 characters for training and kept the remaining 423 characters for testing. Our model interacted with classes of characters that it did not encounter during training to measure its test performance. To reduce the computational time of our experiments, images were downscaled to 28×28 pixels, and the pixel values were normalized between 0.0 and 1.0.

In each episode, 30 Omniglot images were randomly selected from 3 randomly sampled classes, without replacement. Here, the number of samples from each class may not have been balanced. Each selected class in the episode was assigned a random label, represented by a slot in a one-hot vector of length 3, giving y_t. To reduce the risk of overfitting, we performed data augmentation for each class in the episode by randomly rotating all samples from that class by an angle in {0°, 90°, 180°, 270°}. An LSTM with 200 hidden units was used here. We optimized the parameters of our model using Adam with the default parameters [49]. A grid search was performed over the following parameters, and the values used for the results reported in this article are listed below. During the training process, epsilon-greedy exploration with ϵ = 0.23 was used for action selection. The discount factor γ was set to 0.5. Unless otherwise stated, each training and testing step consisted of a batch of 50 episodes, and the reward values were set as R_cor = +1, R_inc = −1, and R_req = −0.05. Every 1000 episodes, we calculated the average accuracy, request rate, and precision rate. Notably, to achieve better convergence, the learning rate of the model needs to be adjusted according to the change of the reward values; the initial learning rate was set to 0.001. Training was carried out for 100,000 episodes. After that, 200 testing steps were conducted for evaluation.
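For concreteness, the sketch below illustrates one way to build such an episode (3 classes, 30 images drawn without replacement, a shuffled class-to-slot assignment, and per-class 90° rotations). It is an assumption-laden illustration, not the authors' data pipeline; dataset is a hypothetical mapping from a character class to an array of its images.

```python
import numpy as np

def sample_episode(dataset, num_classes=3, episode_len=30, rng=np.random):
    classes = rng.choice(list(dataset.keys()), size=num_classes, replace=False)
    slots = rng.permutation(num_classes)          # random class -> one-hot slot
    rotations = rng.choice(4, size=num_classes)   # one 90-degree rotation per class

    # Pool of (class index, image index) pairs; pick 30 without replacement.
    pool = [(k, i) for k in range(num_classes)
            for i in range(len(dataset[classes[k]]))]
    picks = rng.choice(len(pool), size=episode_len, replace=False)

    images, labels = [], []
    for p in picks:
        k, i = pool[p]
        images.append(np.rot90(dataset[classes[k]][i], k=rotations[k]))
        one_hot = np.zeros(num_classes)
        one_hot[slots[k]] = 1.0                   # y_t as a one-hot vector of length 3
        labels.append(one_hot)
    return np.stack(images), np.stack(labels)
```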
Results and discussion

This section presents the results of the two experiments with our model. In the first experiment, we ran both the active one-shot learning (AOL) model, with the default parameters from Ref. [23], and our ROAL model on the task in Fig 1. During training, the 1st, 2nd, 5th, and 10th instances of all classes in each episode were tracked. Notably, in this analysis, label requests were treated as incorrect label predictions when calculating the accuracy. After training on 100,000 episodes, training was stopped. The model was then given 10,000 more test episodes. In these episodes, no further updates occurred, and the model ran on previously unencountered classes drawn from a disjoint test set. We report the results in Figs 4 and 5.

As shown in Fig 4, first-instance accuracy is poor, since the ROAL model that we propose learns to query the label for early instances of a class. We can also conclude that ROAL makes more predictions for later instances, since the label request rates for later instances decrease sharply. At the same time, the accuracy of the model improves on later instances of a class, approaching 90%. Fig 5 shows the average results of 10 repeated experiments. As shown in Fig 5, compared with AOL, ROAL converges faster, reaches higher and more stable classification accuracy, and has a lower request rate. To evaluate the statistical significance of the comparison between ROAL and AOL, Student's paired two-tailed t-test was conducted. When the p-value in the hypothesis test was less than 0.05, the result was considered significant. The p-values for accuracy and prediction being better for ROAL than for AOL were substantially less than 0.05, suggesting that the results of ROAL are significantly superior to those of AOL. These data indicate that, by introducing cross entropy into the loss function, ROAL greatly accelerates training and effectively avoids the low efficiency and gradient vanishing problems of the early training stage, thus saving considerable time and computing resources.

To further compare the performance of the proposed ROAL method with the AOL method, Fig 6 shows the results of the receiver operating characteristic (ROC) curve analyses in our multi-class classification task. The ROC curve, which is a plot of the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings, can clearly illustrate the diagnostic ability of a classifier system. In the ROC plane, the axes range from 0 to 1, with FPR plotted on the X-axis and TPR plotted on the Y-axis. The diagonal dotted straight line connecting (0, 0) to (1, 1) represents the performance of a random classifier. Any classifier that appears in the upper left triangle performs better than random guessing, while curves in the lower right of the ROC plot correspond to worse classification performance. Since we are faced with a multi-class problem, we present not only the ROC curves of the two algorithms for each class but also the macro-average ROC curves, reflecting the overall classification effect of both algorithms. As shown in Fig 6, the ROC curves of the ROAL method lie closer to the upper left corner than those of the AOL method.

The area under the curve (AUC) of the ROC plot was also computed to quantitatively evaluate classification performance. The AUC can be calculated from the trapezoidal areas created between each pair of ROC points. The AUC value lies between 0 and 1, with a higher AUC value indicating better classification performance. As shown in Fig 6, the ROAL method had a higher macro-average AUC of 0.90 and higher AUC values for each class, while the AOL method had a macro-average AUC of 0.88. As a result, the ROC-AUC analyses show that the ROAL algorithm effectively improves the classification performance compared with the AOL algorithm.
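The per-class and macro-average AUC values reported for Fig 6 can be computed as in the sketch below, which assumes one-vs-rest binarized labels and per-class scores (for example, softmax-normalized Q-values over the class actions); it mirrors a standard multi-class ROC analysis rather than code released with the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def macro_roc_auc(y_true_onehot, y_score):
    """y_true_onehot, y_score: arrays of shape (n_samples, n_classes)."""
    n_classes = y_true_onehot.shape[1]
    per_class = []
    for c in range(n_classes):
        fpr, tpr, _ = roc_curve(y_true_onehot[:, c], y_score[:, c])
        per_class.append(auc(fpr, tpr))    # trapezoidal area under each ROC curve
    return float(np.mean(per_class)), per_class   # macro-average and per-class AUCs
```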
In reinforcement learning, the setting of the reward function has a great influence on the convergence speed and the performance of the algorithm. To explore this, we further trained models using different reward values. Notably, when training the models with R_inc = −10 and R_inc = −20, a batch size of 100 was used for consistency of convergence. Our experimental setup was the same as Woodward's, and we used the default parameters from Woodward's work to reproduce the results of AOL. At the same time, we show the best results we reproduced with the default parameters of the AOL model in Ref [23] on the same problem. Importantly, based on the previous work, we further explored the impact of different R_req settings on the accuracy and request rate of the model. As shown in Table 1, our model obtains higher accuracy and a lower request rate with the same reward settings. The experimental results also verify that the ROAL model can trade off between high prediction accuracy with many label requests and few label requests with lower prediction accuracy. Higher prediction accuracy can be achieved by increasing the penalty for wrongly predicted labels. Similarly, the label request rate can be reduced by increasing the penalty for the request action, at the cost of accuracy. The results also indicate that, if the reward values are set improperly, either no labels are requested and predictions become random, or all labels are requested without any prediction. Therefore, proper setting of the reward values has an important influence on the learning effect of the model.

Table 1: Test set classification accuracies and percentage of label requests per episode. Each cell reports the row's accuracy measure followed by the percentage of label requests (accuracy / requests, in %); all ROAL columns and the reproduced AOL column use R_cor = 1.

| | AOL (results in Ref [23], R_req = −0.05) | AOL (reproduced, R_req = −0.05) | ROAL (R_req = −0.05) | ROAL (R_req = −0.1) | ROAL (R_req = −1) | ROAL (R_req = −3) | ROAL (R_req = −4) |
|---|---|---|---|---|---|---|---|
| Accuracy | 75.9 / 7.2 | 75.3 / 7.9 | 78.8 / 7.9 | 76.6 / 7.1 | 33.7 / 0 | 33.2 / 0 | 33.8 / 0 |
| Prediction | 81.8 / 7.2 | 81.3 / 7.9 | 85.9 / 7.9 | 82.5 / 7.1 | 33.7 / 0 | 33.2 / 0 | 33.8 / 0 |
| R_inc = −5, prediction | 86.4 / 31.8 | 92.9 / 67.9 | 97.0 / 48.2 | 96.4 / 44.9 | 91.5 / 20.6 | 57.0 / 4.2 | 33.5 / 0 |
| R_inc = −10, prediction | 89.3 / 45.6 | 97.1 / 81.7 | 99.2 / 71.5 | 99.1 / 65.5 | 97.8 / 42.9 | 91.1 / 16.2 | 86.4 / 9.9 |
| R_inc = −20, prediction | 92.8 / 60.6 | 0 / 100 | 99.2 / 89.0 | 99.1 / 93.7 | 97.4 / 81.9 | 93.8 / 69.3 | 92.8 / 52 |

Finally, we performed another experiment to explore whether the model was effectively reasoning about its own uncertainty. In the previous experiments, samples were randomly arranged in each episode. In this experiment, we fixed the order of the sample arrangement to probe the action strategy of the model. Experiments were carried out on the trained model, and two random test classes were selected for each episode. Our experiment was divided into two groups. In both groups, we ran 1000 episodes without learning and recorded the request percentage for each time step. In the first group, we assigned two instances from different classes to the model at the beginning of each episode. Then, two instances from each class were given. As shown in Fig 7(a), the request rate for later instances of the same class was greatly reduced after the model had seen an instance of that class. This result is consistent with the original intention of active learning: if representative samples can be effectively selected for labeling, the cost of manual labeling can be greatly reduced. However, this experiment alone cannot prove whether the model selects actions based on the uncertainty of instances, because the model might have learned only a naive strategy that always requests labels in the first few steps. For further verification, we set up the second group of experiments as follows: 4 instances from the first class were presented, followed by 2 instances from the second class.
The results are shown in Fig 7(b). The label request rate at time step 2 was greatly reduced, and the label request rate at time step 5 was greatly increased. The difference in the request rate at these two time steps, and the similarity between the percentages of label requests for the two classes, show that the model selects actions based on the uncertainty of instances, because it increases the label request rate when a new class appears.

In Woodward's paper [23], a supervised method was also run on the same task. Compared with supervised learning, in which the label request rate is 100%, our model can achieve higher accuracy while using fewer labels.

Handwritten alphanumeric characters

Setup

The second dataset included handwritten alphanumeric characters and consisted of 36 classes of characters, corresponding to the digits 0 to 9 and the letters A to Z, with each class consisting of 39 instances. The input corresponds to a 20×20-pixel image in binary format. We randomly divided the dataset into 28 characters for training and kept the remaining 8 characters for testing.

Similar to the Omniglot set-up, 30 images were randomly selected from several randomly sampled classes in each episode, without replacement. Data augmentation for each class in the episode was also performed. An LSTM with 200 hidden units was used, and Adam with the default parameters [49] was used to optimize our model. A grid search was performed over the following parameters, and the values used for the results reported in this article are listed below. Epsilon-greedy exploration with ϵ = 0.4 was used. The discount factor γ was 0.6. The initial batch size was set to 50, and the reward values were set as R_cor = +1, R_inc = −1, and R_req = −0.3. The initial learning rate was set to 0.002. The method of training and evaluation is the same as that for the Omniglot dataset.

Results and discussion

In this section, we compare our ROAL model to AOL and a supervised learning model on the handwritten alphanumeric character recognition task. As introduced in Santoro et al. [25], the loss in the supervised learning model is the cross entropy between the true and predicted labels, and the true label is always presented at the following time step. The same LSTM model was used in this supervised task for consistency, with a softmax applied to the output and no extra bit for the "request label" action.
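A rough sketch of such a supervised baseline, under assumptions rather than as the authors' implementation, is given below: the same LSTM body, a softmax classification head with no request bit, and the true label of x_t appended to the observation at the next time step; the class name SupervisedBaseline and the tensor layout are assumptions.

```python
import torch
import torch.nn as nn

class SupervisedBaseline(nn.Module):
    def __init__(self, image_dim=20 * 20, num_classes=3, hidden=200):
        super().__init__()
        self.lstm = nn.LSTM(image_dim + num_classes, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)   # no "request label" bit

    def forward(self, images, labels_onehot):
        # Delay labels by one step so that y_t is only seen at time t+1.
        delayed = torch.cat([torch.zeros_like(labels_onehot[:, :1]),
                             labels_onehot[:, :-1]], dim=1)
        h, _ = self.lstm(torch.cat([images, delayed], dim=-1))
        return self.classifier(h)                          # logits per time step

# Example: 50 episodes of 30 steps, 20x20 binary images, 3 classes.
model = SupervisedBaseline()
logits = model(torch.zeros(50, 30, 400), torch.zeros(50, 30, 3))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 3),
                             torch.zeros(50 * 30, dtype=torch.long))
```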
We expanded the experiments by increasing the number of classes per episode, and we report the prediction accuracy and request rate on the test sets in Table 2. For consistency of convergence, when training the models with 8 classes, a batch size of 100 was used, and the number of instances in each episode was changed to 80 in all three models.

Table 2: Results for ROAL and baselines on the handwritten alphanumeric character classification task. Results with statistical significance at the 0.05 level with respect to Student's paired t-test are marked with *.

| (%) | 3 classes: Accuracy | 3 classes: Requests | 5 classes: Accuracy | 5 classes: Requests | 8 classes: Accuracy | 8 classes: Requests |
|---|---|---|---|---|---|---|
| Supervised | 89.1 | 100 | 78.9 | 100 | 76.2 | 100 |
| AOL, prediction | 86.78 | 8.02 | 78.05 | 14.35* | 72.23 | 11.06* |
| ROAL, prediction | 89.58* | 6.8* | 79.17 | 15.15 | 79.05* | 15.36 |

According to Table 2, the ROAL model also exhibits better performance than the AOL model on the handwritten alphanumeric characters dataset. At the same time, compared with the supervised learning model, the ROAL model significantly reduces the number of label requests while achieving the same or even higher accuracy. By increasing the number of classes per episode, we further demonstrate the ability of the ROAL algorithm to handle more complex tasks. We may conclude that the ROAL model has broad application prospects.

Conclusions

We introduced a model that learns active learning via reinforcement learning and evaluated it on one-shot learning tasks. The results show that our model can move from engineered heuristics for selecting samples to learning query strategies from data. Compared with previous work [23], we substantially accelerated convergence, avoided the gradient vanishing problem, improved stability, reduced the number of requested labels, and improved the accuracy of the model. The proposed model may be a good solution to practical problems such as movie recommendation [50] and network traffic analysis [20] due to its ability to learn and generalize new concepts in a short time.

In future work, we plan to evaluate our model on practical problems.
For this, we may need a more sophisticated learning approach. Due to time and resource limitations, the parameters of our experiments may not be optimal; they could be tuned further to improve the performance of the algorithm.

Supporting information

S1 Table. Statistical test results of test episodes on the Omniglot dataset (3 classes with R_cor = +1, R_inc = −1, and R_req = −0.05). (DOCX)

S2 Table. Statistical test results of test episodes on the handwritten alphanumeric characters dataset. (DOCX)

Fig 1: Task structure diagram. For images in the dataset, the classes, their labels, and the specific samples are shuffled and randomly presented in each episode. At each time step, the input of the model is an image along with a vector that depends on the output for the previous instance. The output of the model is a one-hot vector of length k + 1, where k is the number of classes per episode. If the model requests the label of x_t, it sets the final bit of the output vector to 1, and the reward for this label request action is R_req. The true label y_t of image x_t is then provided at the next time step along with the next image x_{t+1}. Alternatively, if the model makes a prediction for x_t, it sets one of the first k bits of the output vector, representing ŷ_t. The reward for this action is R_cor if the prediction is correct or R_inc if not. If a prediction is made at time step t, then no information regarding the true label y_t is supplied at the next time step t + 1.

Fig 2: Basic reinforcement learning model. When the Agent performs an action, the state of the environment changes, and a reward signal is fed back to the Agent. The Agent selects the next action according to the reward signal and the current state of the environment, and the selection principle is to increase the probability of receiving positive reinforcement (maximizing rewards). The selected actions affect not only the immediate rewards but also the subsequent states of the environment and the final return.

Fig 3: Model structure. A basic LSTM connected to a linear output layer is used to implement the proposed reinforcement one-shot active learning (ROAL) model.

Fig 4: (a) ROAL accuracies and (b) ROAL label requests per episode for the 1st, 2nd, 5th, and 10th instances of all classes. ROAL attains higher accuracy while requesting fewer labels on later instances of each class, indicating that ROAL makes "educated guesses" for new instances based on the instances it has already seen. At episode 100,000, training stops and the data switch to test classes withheld from the training set.

Fig 5: Comparison of overall (a) accuracy and (b) label request results between ROAL and AOL. Compared with AOL, ROAL achieves higher accuracy and a lower request rate in fewer iterations. After 100,000 episodes, the data switch to the test set without further learning.

Fig 6: ROC plot with AUC values for AOL and ROAL.

Fig 7: Results of the second experiment with the trained model. In this task, two random test classes were chosen for each episode. (a) At the beginning of each episode, we assigned two instances from different classes to the model. After that, two instances from each class were given, respectively.
It shows that the request rate for later instances of the same class has been greatly reduced after the model saw an instance of that class. (b) 4 instances from the first class were presented, followed by 2 instances from the second class. The label request rate at time step 2 is greatly reduced, and the label request rate at time step 5 is greatly increased.</p> <p>The authors would like to thank Mark Woodward for excellent technical support.</p> <ref id="AN0137066134-16"> <title> References </title> <blist> <bibl id="bib1" idref="ref1" type="bt">1</bibl> <bibtext> Huang H , Huang J , Feng Y , Liu Z , Wang T , Chen L , et al . Aircraft Type Recognition Based on Target Track . 2018 .</bibtext> </blist> <blist> <bibl id="bib2" idref="ref2" type="bt">2</bibl> <bibtext> Wang Q , Zhao X , Huang J , Feng Y , Liu Z , Su J , et al . Addressing Complexities of Machine Learning in Big Data: Principles, Trends and Challenges from Systematical Perspectives . 2017 .</bibtext> </blist> <blist> <bibl id="bib3" idref="ref3" type="bt">3</bibl> <bibtext> Xiaojin Z . Semi-Supervised Learning Literature Survey . 2005 ; 37 ( 1 ): 63 – 77 .</bibtext> </blist> <blist> <bibl id="bib4" idref="ref4" type="bt">4</bibl> <bibtext> Tian A, Lease M. Active learning to maximize accuracy vs. effort in interactive information retrieval. International Acm Sigir Conference on Research &amp; Development in Information Retrieval. 2011.</bibtext> </blist> <blist> <bibl id="bib5" idref="ref5" type="bt">5</bibl> <bibtext> Yu D , Varadarajan B , Deng L , Acero A . Active learning and semi-supervised learning for speech recognition: A unified framework using the global entropy reduction maximization criterion . Computer Speech &amp; Language . 2010 ; 24 ( 3 ): 433 – 44 .</bibtext> </blist> <blist> <bibl id="bib6" idref="ref64" type="bt">6</bibl> <bibtext> Vijayanarasimhan S, Jain P, Grauman K, editors. Far-sighted active learning on a budget for image and video recognition. Computer Vision &amp; Pattern Recognition; 2010.</bibtext> </blist> <blist> <bibl id="bib7" idref="ref65" type="bt">7</bibl> <bibtext> Riccardi G , Hakkani-Tur D . Active learning: theory and applications to automatic speech recognition . IEEE Transactions on Speech &amp; Audio Processing . 2005 ; 13 ( 4 ): 504 – 11 .</bibtext> </blist> <blist> <bibl id="bib8" idref="ref67" type="bt">8</bibl> <bibtext> Nallasamy U, Metze F, Schultz T, editors. Active learning for accent adaptation in Automatic Speech Recognition. Spoken Language Technology Workshop; 2013.</bibtext> </blist> <blist> <bibl id="bib9" idref="ref68" type="bt">9</bibl> <bibtext> Minakawa M, Raytchev B, Tamaki T, Kaneda K, editors. Image Sequence Recognition with Active Learning Using Uncertainty Sampling. IEEE International Joint Conference on Neural Networks; 2013.</bibtext> </blist> <blist> <bibtext> Joshi AJ, Porikli F, Papanikolopoulos N, editors. Multi-class active learning for image classification. IEEE Conference on Computer Vision &amp; Pattern Recognition; 2009.</bibtext> </blist> <blist> <bibtext> Hakkani-Tür D, Riccardi G, Gorin A, editors. Active learning for automatic speech recognition. International Conference on Acoustics; 2002.</bibtext> </blist> <blist> <bibtext> Rong H , Namee BM , Delany SJ . Active learning for text classification with reusability . Expert Systems with Applications . 2016 ; 45 ( C ): 438 – 49 .</bibtext> </blist> <blist> <bibtext> Davy M, Luz S, editors. 
Author contributions: Honglan Huang, Writing – review & editing; Jincai Huang, Validation; Yanghe Feng, Validation; Jiarui Zhang, Writing – review & editing; Zhong Liu, Resources; Qi Wang, Writing – review & editing; Li Chen, Software.