Academic Journal
A Robust and Efficient Method for Effective Facial Keypoint Detection
Title: A Robust and Efficient Method for Effective Facial Keypoint Detection
Authors: Yonghui Huang, Yu Chen, Junhao Wang, Pengcheng Zhou, Jiaming Lai, Quanhai Wang
Source: Applied Sciences, Vol. 14, Iss. 16, p. 7153 (2024)
Publisher Information: MDPI AG, 2024
Publication Year: 2024
Collection: LCC: Technology; Engineering (General). Civil engineering (General); Biology (General); Physics; Chemistry
Subject Terms: facial recognition; landmark detection; model optimization; Technology; Engineering (General). Civil engineering (General), TA1-2040; Biology (General), QH301-705.5; Physics, QC1-999; Chemistry, QD1-999
Abstract: Facial keypoint detection technology faces significant challenges under conditions such as occlusion, extreme angles, and other demanding environments. Previous research has largely relied on deep learning regression methods using the face's overall global template. However, these methods lack robustness in difficult conditions, leading to instability in detecting facial keypoints. To address this challenge, we propose a joint optimization approach that combines regression with heatmaps, emphasizing the importance of local apparent features. Furthermore, to mitigate the reduced learning capacity resulting from model pruning, we integrate external supervision signals through knowledge distillation into our method. This strategy fosters the development of efficient, effective, and lightweight facial keypoint detection technology. Experimental results on the CelebA, 300W, and AFLW datasets demonstrate that our proposed method significantly improves the robustness of facial keypoint detection.
Document Type: article
File Description: electronic resource
Language: English
ISSN: 2076-3417
Relation: https://www.mdpi.com/2076-3417/14/16/7153; https://doaj.org/toc/2076-3417
DOI: 10.3390/app14167153
Access URL: https://doaj.org/article/dcd69ed481b74717810b1f91ca3086e7
Accession Number: edsdoj.69ed481b74717810b1f91ca3086e7
Database: Directory of Open Access Journals
Full Text:

Keywords: facial recognition; landmark detection; model optimization

1. Introduction

Face keypoint detection is a vital task in computer vision, aimed at locating specific keypoints on the face both effectively and efficiently. It is essential for applications such as face recognition, expression analysis, and 3D face reconstruction, and holds significant potential for broader use. Current methods for face keypoint detection fall into two main categories: traditional hand-crafted feature extraction techniques and more recent deep learning-based approaches. The prevailing research trend centers on end-to-end regression-based deep learning methods. To improve efficiency, network compression algorithms such as model pruning are commonly applied; however, existing methods typically rely on task-independent pruning techniques to speed up detection.

Despite their effectiveness under general conditions, current face keypoint detection methods struggle with challenges such as large angles, occlusion, side views, partial occlusion, uneven lighting, and drastic expression changes, and their detections become unstable in such scenarios. Efficiency is also a concern: the accuracy of pruning-based model compression typically decreases as the pruning ratio grows, making it difficult to balance efficiency and accuracy. Regression-based deep learning methods rely on the face's inherent global template and lack robustness in extreme cases. Inspired by the heatmap approach used in human keypoint detection, we propose to enhance robustness by focusing on local appearance features. To address the diminished learning ability caused by pruning, external supervision signals, such as those provided by a teacher model through knowledge distillation, can be introduced.
This approach aims to maintain accuracy while benefiting from the combination of pruning and distillation.

This paper proposes an effective, efficient, and lightweight method for face keypoint detection to address unstable detection and accuracy loss in extreme conditions. Figure 1 illustrates the model's approach. The approach combines regression with a heatmap for joint training, leveraging both global and local features for joint optimization. Additionally, we introduce a method that combines pruning and distillation: after each pruning step, a distillation model is used for fine-tuning to retain as much of the pre-pruning detection performance as possible and to minimize accuracy loss. Extensive experimental results demonstrate that integrating a heatmap with regression improves accuracy, while the combination of distillation and pruning maintains performance with minimal accuracy loss. In Section 2, we review effective and efficient methods related to face keypoint detection. Section 3 presents our proposed method for combining distillation with pruning and regression with a heatmap. Section 4 details the experimental results, and Section 5 provides a summary and conclusions.

2. Related Work

The study of face keypoint detection has focused on both effectiveness and efficiency. In 2019, Guo et al. [1] proposed the lightweight and high-precision PFLD model, which significantly reduced computational costs while maintaining high accuracy. Wang et al. [2] introduced a new adaptive loss function, AWing, which better handles difficult samples and significantly enhances the performance of heatmap regression methods. In 2020, Wang et al. [3] proposed the HRNet network, which performed exceptionally well in multiple keypoint detection tasks, demonstrating the potential of high-resolution representation learning. Xu et al. [4] proposed a method for the joint detection and alignment of facial points, significantly improving both the accuracy and efficiency of detection and alignment. Browatzki et al. [5] addressed small-sample learning scenarios by proposing a method for fast face alignment through reconstruction, effectively solving the keypoint detection problem in small-sample cases. In 2021, Liu et al. [6] enhanced keypoint positioning accuracy with the attention-guided deformable convolutional network ADNet, improving the model's performance in complex scenarios. In 2022, Li et al. [7] tackled the keypoint positioning problem from a coordinate classification perspective, achieving outstanding performance in facial keypoint detection. In 2023, Bai et al. [8] introduced CoKe, a robust keypoint detection method based on contrastive learning, which improved the model's adaptability to various deformations and occlusions. In the same year, Wan et al. [9] proposed a method for accurately locating facial keypoints using a heatmap transformer. In 2024, Yu et al. [10] presented YOLO-FaceV2, focusing on scale variation and occlusion issues.
Rangayya et al. [11] proposed the SVM-MRF method, which combines the KCM segmentation strategy of KTBD and significantly improves recognition accuracy. Inspired by MTCNN, Khan et al. [12] introduced an improved CNN-based face detection algorithm, MTCNN++, optimized for both detection accuracy and speed.

Regarding pruning, in 2020, Blalock et al. [13] surveyed the state of neural network pruning, evaluating the effectiveness of various methods in reducing model parameters and computational costs, and analyzing the retention and stability of model performance after pruning. In 2022, Vadera et al. [14] provided a detailed examination of several pruning techniques, including weight pruning, structural pruning, and layer pruning; the study compared the strengths and weaknesses of each method and offered practical recommendations for selecting appropriate techniques based on specific application needs. In 2023, Fang et al. [15] introduced DepGraph, a structural pruning method that uses graph theory to increase the flexibility of the pruning process, significantly improving parameter compression while maintaining model performance. Sun et al. [16] presented a straightforward and effective pruning method tailored to large language models, which reduces computational resource requirements while preserving reasoning capability and demonstrates promising pruning results.

In the area of distillation, in 2021, Ji et al. [17] introduced a distillation method that uses attention mechanisms to improve the student model's performance by aligning its feature maps with those of the teacher model. Yao et al. [18] introduced the Adapt-and-Distill method, in which a pre-trained model is first adapted to a specific domain and then distilled into a compact, efficient version, balancing domain adaptability with efficiency. In 2022, Beyer et al. [19] emphasized the importance of the teacher model's behavior during distillation, advocating a patient and consistent teacher and highlighting that managing the complexity and stability of the teacher's output is crucial for successful distillation. Park et al. [20] proposed pruning the teacher model before distillation, effectively reducing its complexity while maintaining accuracy and leading to a smaller and faster student model. In 2024, Waheed et al. [21] questioned the robustness of knowledge distillation, particularly when student models are exposed to out-of-distribution data, and explored the challenges that can arise under such conditions.

3. Methods

3.1. Basic Concepts

In this section, we introduce the fundamental definitions and notation for Convolutional Neural Networks (CNNs) that are relevant to our proposed method. A CNN comprises two primary components: data representation layers and parameter layers.
Specifically, feature maps and latent vectors represent the data, whereas convolutional layers and fully connected layers constitute the parameter layers responsible for learning and optimizing the model parameters.

In a CNN, a feature map is a tensor $\mathbf{M} \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the number of channels, the height, and the width, respectively. For example, an RGB input image has dimensionality $3 \times H \times W$. In this paper, we use the notation $\mathbf{M}_t \in \mathbb{R}^{C_t \times H_t \times W_t}$ for the feature map of the $t$-th layer.

Convolution operation. This fundamental mechanism of CNNs extracts local patterns from the input data.
For layer $t$, convolution is performed with a set of filters $\mathbf{Z}_t \in \mathbb{R}^{C_t \times C_{t-1} \times s_t \times s_t}$, where $C_{t-1} \times s_t \times s_t$ is the size of each individual filter and $C_t$ is the number of filters. Critically, $C_{t-1}$ must match the channel count of the preceding feature map $\mathbf{M}_{t-1}$ to ensure compatibility during convolution. Each filter slides densely over $\mathbf{M}_{t-1}$, applying itself to local neighborhoods and producing a corresponding output map.
These per-filter output maps are concatenated along the channel dimension to produce the output feature map $\mathbf{M}_t \in \mathbb{R}^{C_t \times H_t \times W_t}$.

Fully connected layer. This layer is a special case of the convolution operation, tailored to the global integration of information. Consider the feature map $\mathbf{M}_{t+1} \in \mathbb{R}^{C_{t+1} \times H_{t+1} \times W_{t+1}}$ as the input to this layer. Instead of conventional convolutional filters, it employs a set of filters $\mathbf{Z} \in \mathbb{R}^{D \times C_{t+1} \times H_{t+1} \times W_{t+1}}$, where each filter spans the entire spatial extent of the input feature map. In contrast to standard convolution, the layer outputs a condensed latent embedding whose height and width are reduced to one, collapsing the spatial information into a dense representation of the input data.
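These shape conventions can be checked with a short PyTorch sketch (a minimal illustration with assumed sizes $C_{t-1} = 3$, $C_t = 16$, $s_t = 3$, and $D = 10$; the layer names are ours, not from the paper):

```python
import torch
import torch.nn as nn

# Feature map M_{t-1}: a batch of RGB images, C x H x W = 3 x 128 x 128 (sizes assumed for illustration).
x = torch.randn(1, 3, 128, 128)

# Convolution: filters Z_t of shape C_t x C_{t-1} x s_t x s_t = 16 x 3 x 3 x 3.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
m_t = conv(x)                      # M_t has shape (1, 16, 128, 128)
print(conv.weight.shape, m_t.shape)

# Fully connected layer viewed as a convolution whose kernel covers the whole spatial extent:
# filters Z of shape D x C x H x W collapse H and W to 1, giving a latent embedding of size D.
fc_as_conv = nn.Conv2d(in_channels=16, out_channels=10, kernel_size=m_t.shape[-2:])
latent = fc_as_conv(m_t)           # shape (1, 10, 1, 1)
print(latent.shape)
```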
3.2. Keypoint Detection

Given a dataset $\mathbf{S}$ of $L$ images, $\mathbf{S} = \{s_l\}_{l=1}^{L}$, and a corresponding label set $\mathbf{E} = \{e_l\}_{l=1}^{L}$, we define the network input as $\mathbf{s}_l \in \mathbb{R}^{B \times C_0 \times H_0 \times W_0}$, where $B$ is the batch size, $C_0$ is the number of channels, and $H_0$ and $W_0$ are the height and width of the image, respectively. We set $H_0 = W_0 = 128$. This batch of images is fed into the backbone network, such as ResNet-50 [22] or MobileNet [24,26].
For layer $L$, the output is the feature map $\mathbf{M}_L \in \mathbb{R}^{B \times C_L \times H_L \times W_L}$, where $C_L$ is the number of channels and $H_L$ and $W_L$ are the height and width of the feature map.

3.3. Regression

In the regression task, the label $e_l$ is typically composed of coordinate vectors $\mathbf{x}$ and $\mathbf{y}$. We define the label set $\mathbf{E}$ for the $L$ images as $\{(x_1, y_1), (x_2, y_2), \ldots, (x_L, y_L)\}$. To regress from the feature map to the label set, the feature map must first be converted into vector form.
Global average pooling is applied to transform the feature map $\mathbf{M}_L \in \mathbb{R}^{B \times C_L \times H_L \times W_L}$ into $\mathbf{G}_L \in \mathbb{R}^{B \times C_L}$, retaining the important feature information of the image.

Assuming the label contains $N$ coordinate points, the dimension of $y_l$ is $reg\_D = N \times 2$. To match the $C_L$ dimension of the vector $\mathbf{G}_L \in \mathbb{R}^{B \times C_L}$ with $reg\_D$, i.e., $C_L = reg\_D$, we build a regression head on top of the backbone output feature map $\mathbf{M}_L$.
We implement the transformation from $\mathbf{G}_L \in \mathbb{R}^{B \times C_L}$ to $\mathbf{G}_L' \in \mathbb{R}^{B \times reg\_D}$ with a multi-layer perceptron composed of FC-BN-ReLU layers, which yields the prediction vector for the target positions. Since the keypoint labels are usually normalized to lie between 0 and 1, we apply a sigmoid operation to the perceptron output $\mathbf{G}_L' \in \mathbb{R}^{B \times reg\_D}$ to stabilize the training gradient and keep the output close to the real keypoint positions, resulting in $\mathbf{G}_s \in \mathbb{R}^{B \times reg\_D}$.
Finally, in the regression head, we use the mean squared error (MSE) as the loss function and compute the squared differences between the corresponding elements of each prediction vector $\mathbf{g}'$ in $\mathbf{G}_s$ and its label vector $\mathbf{e}'$ in the label set $\mathbf{E}$:

(1) $loss_1 = \min \frac{1}{B \times reg\_D} \sum_{i=1}^{B} \sum_{j=1}^{reg\_D} (g_{ij} - e_{ij})^2$
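As an illustration, the regression head described above could be realized roughly as follows (a minimal PyTorch sketch with assumed layer sizes; the module and variable names are ours and not taken from the paper):

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Global average pooling -> FC-BN-ReLU MLP -> sigmoid, as in Section 3.3 (sizes assumed)."""
    def __init__(self, c_l: int, num_points: int):
        super().__init__()
        reg_d = num_points * 2                      # reg_D = N x 2
        self.pool = nn.AdaptiveAvgPool2d(1)         # M_L (B, C_L, H_L, W_L) -> G_L (B, C_L)
        self.mlp = nn.Sequential(
            nn.Linear(c_l, c_l), nn.BatchNorm1d(c_l), nn.ReLU(inplace=True),
            nn.Linear(c_l, reg_d),
        )

    def forward(self, m_l: torch.Tensor) -> torch.Tensor:
        g_l = self.pool(m_l).flatten(1)             # (B, C_L)
        return torch.sigmoid(self.mlp(g_l))         # G_s in (0, 1), shape (B, reg_D)

# Usage: loss_1 is the MSE of Equation (1) between predictions and normalized labels.
head = RegressionHead(c_l=512, num_points=68)       # 68 keypoints assumed for illustration
m_l = torch.randn(4, 512, 4, 4)                     # dummy backbone feature map
labels = torch.rand(4, 68 * 2)                      # normalized ground-truth coordinates
loss_1 = nn.functional.mse_loss(head(m_l), labels)
```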
3.4. Heatmap

To ensure the robustness of face keypoint detection in extreme cases, we add a heatmap head network in addition to the backbone. The heatmap prediction output has the same dimensions as the backbone output, eliminating the need for global pooling. Since $\mathbf{M}_L$ lacks the detailed image features necessary for fine keypoint detection, we perform an upsampling operation on $\mathbf{M}_L$ to recover and enrich these details; the new output $\mathbf{M}_s \in \mathbb{R}^{B \times C_s \times H_s \times W_s}$ is obtained with bilinear interpolation.

To align the channel dimension of the upsampled output $\mathbf{M}_s$ with the number of keypoints $N$ in the label set $\mathbf{E}$, we apply a nonlinear multi-layer convolutional network composed of Conv-BN-ReLU operations to $\mathbf{M}_s$, resulting in $\mathbf{M}_s' \in \mathbb{R}^{B \times D \times H_s \times W_s}$, where $D = N$.
To ensure that the output values approximate the real keypoint positions as probabilities, we apply a sigmoid operation to $\mathbf{M}_s'$, limiting the result $\mathbf{M}_{mod} \in \mathbb{R}^{B \times D \times H_s \times W_s}$ to the range $[0, 1]$.
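One possible realization of this heatmap head (bilinear upsampling followed by Conv-BN-ReLU blocks and a sigmoid) is sketched below; the upsampling factor, channel widths, and names are our own assumptions for illustration:

```python
import torch
import torch.nn as nn

class HeatmapHead(nn.Module):
    """Bilinear upsampling -> Conv-BN-ReLU -> 1x1 conv to N channels -> sigmoid (Section 3.4, sizes assumed)."""
    def __init__(self, c_l: int, num_points: int, scale: int = 4):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(c_l, 256, kernel_size=3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_points, kernel_size=1),     # D = N output channels
        )

    def forward(self, m_l: torch.Tensor) -> torch.Tensor:
        m_s = self.up(m_l)                                 # M_s: recovered spatial detail
        return torch.sigmoid(self.conv(m_s))               # M_mod in [0, 1], shape (B, N, H_s, W_s)

# Example: a 4x4 backbone feature map becomes a 16x16 per-keypoint probability map.
head = HeatmapHead(c_l=512, num_points=68)
m_mod = head(torch.randn(4, 512, 4, 4))                    # (4, 68, 16, 16)
```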
A Gaussian distribution with standard deviation $\sigma$ is used to transform the label set $\mathbf{E}$ into the heatmap $\mathbf{K}$. For each pixel value $M(x, y)$ in the image, $x$ and $y$ are the pixel coordinates, $(x_i, y_i)$ are the coordinates of the labeled points, and $\sigma$ controls the range of influence. The output is a tensor $\mathbf{K} \in \mathbb{R}^{B \times D \times H_K \times W_K}$, representing $B \times D$ heatmap images. Each pixel value is the sum of the Gaussian contributions of all points at that location:

(2) $M(x, y) = \sum_{i=1}^{L} \exp\left(-\frac{(x - x_i)^2 + (y - y_i)^2}{2\sigma^2}\right)$
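The Gaussian label construction of Equation (2) can be sketched as follows (one Gaussian per keypoint channel is assumed here, which is the common convention; the variable names and grid size are ours):

```python
import torch

def gaussian_heatmaps(points: torch.Tensor, height: int, width: int, sigma: float = 2.0) -> torch.Tensor:
    """Turn normalized keypoints (B, N, 2) in [0, 1] into heatmaps K of shape (B, N, H, W), per Equation (2)."""
    b, n, _ = points.shape
    ys = torch.arange(height, dtype=torch.float32).view(1, 1, height, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, 1, 1, width)
    # Scale normalized coordinates to pixel coordinates on the heatmap grid.
    px = (points[..., 0] * (width - 1)).view(b, n, 1, 1)
    py = (points[..., 1] * (height - 1)).view(b, n, 1, 1)
    d2 = (xs - px) ** 2 + (ys - py) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))

# Example: targets for 68 assumed keypoints on a 16x16 heatmap grid.
k = gaussian_heatmaps(torch.rand(4, 68, 2), height=16, width=16)   # (4, 68, 16, 16)
```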
Finally, we compute the mean squared error (MSE) between the heatmap label $\mathbf{K}$ and the output $\mathbf{M}_{mod}$ to obtain the scalar $loss_2$, which measures the discrepancy between corresponding elements of the two tensors at every feature map location. A smaller $loss_2$ indicates that the predicted keypoints are closer to their true positions.

(3) $loss_2 = \frac{1}{B \times D \times H_s \times W_s} \sum_{b=1}^{B} \sum_{d=1}^{D} \sum_{h=1}^{H_s} \sum_{w=1}^{W_s} (\mathbf{M}_{mod} - \mathbf{K})^2$

Our proposed face keypoint detection method combines the regression and heatmap approaches by integrating $loss_1$ (for the regression) and $loss_2$ (for the heatmap) into a unified loss function that guides model learning: $LOSS = loss_1 + loss_2$.
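Putting the two heads together, one joint training step could look like the following sketch, reusing the illustrative RegressionHead, HeatmapHead, and gaussian_heatmaps helpers defined above (all assumed names, not the authors' code):

```python
import torch
import torch.nn.functional as F

def joint_loss(m_l, coord_labels, heatmap_labels, reg_head, hm_head):
    """LOSS = loss_1 (coordinate regression) + loss_2 (heatmap regression)."""
    loss_1 = F.mse_loss(reg_head(m_l), coord_labels.flatten(1))   # Equation (1)
    loss_2 = F.mse_loss(hm_head(m_l), heatmap_labels)             # Equation (3)
    return loss_1 + loss_2

# Example shapes: backbone features m_l (B, 512, 4, 4), coord_labels (B, 68, 2) normalized to [0, 1],
# heatmap_labels produced by gaussian_heatmaps(coord_labels, 16, 16).
```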
3.5. Model Compression

We introduce a model compression approach that distills a large Convolutional Neural Network (CNN) into a compact yet performant counterpart, ensuring efficient execution while maintaining most of its accuracy. As depicted in Figure 2, our methodology proceeds in three stages: pre-training of the original network, transfer of knowledge to an intermediary representation, and a final model cutting step that achieves the targeted compression level.

Pre-training of the network. Before the refinement and compression stages, the CNN is first trained on a comprehensive dataset, which serves as a foundation for subsequent learning on the target dataset. This pre-training strategy has two main benefits. First, it acts as a strong regularizer that improves the model's ability to generalize beyond its training examples: exposing the CNN to a large corpus of data optimizes its parameters to capture the underlying data distribution more broadly, reducing the risk of overfitting to a limited subset of samples. Second, pre-training improves training stability and accelerates convergence. Convolutional networks, particularly those with intricate architectures and many parameters, can be unstable when their parameters are initialized randomly; a supervised initialization from pre-training gives the model a robust starting point for stable and efficient training.

We use the ImageNet dataset [27] for pre-training. Before training on the specific image dataset, the CNN is pre-trained on ImageNet, a well-known image recognition benchmark comprising 1.2 M images across 1000 common categories, which provides a rich and diverse set of visual patterns for the CNN's initial learning. This pre-training is universally applicable, benefiting both large-scale and more compact CNN architectures by imparting a robust foundational knowledge base for further refinement.

We denote the pre-trained ResNet model by Re and the MobileNet model by Mo; these are the two CNN architectures used in our method. Both models are then fine-tuned on the keypoint detection dataset with the LOSS function in a mini-batch manner, so that the teacher model (Re) and the student model (Mo) are each trained with the joint regression-and-heatmap objective. During this step, the two models are trained separately, with no interaction between them. With sufficient training, both models learn to detect face keypoints effectively:

(4) $Lo(teacher) = LOSS$

(5) $Lo(student) = LOSS$
3.5.1. Knowledge Transfer

Following the above step, the teacher CNN (Re) is frozen, and all subsequent optimization focuses exclusively on the small CNN Mo. Because of its reduced parameter count, Mo has a more constrained capacity for keypoint detection, which translates into comparatively lower accuracy. To close this performance gap, we introduce a mechanism that transfers keypoint detection knowledge to the smaller model, enabling Mo to achieve a level of precision closer to that of Re.

Research has shown that, given identical input images, the similarity of the feature maps of a small and a large CNN correlates strongly with the similarity of their predictions, implying comparable capabilities. Leveraging this insight, we devise a transfer method that aligns the feature maps of both networks at a designated layer.
Specifically, the teacher model's backbone produces a feature map $\mathbf{M}_r \in \mathbb{R}^{B \times C_{L_r} \times H_{L_r} \times W_{L_r}}$, whereas the student model's backbone yields $\mathbf{M}_m \in \mathbb{R}^{B \times C_{L_m} \times H_{L_m} \times W_{L_m}}$. The disparity between these dimensions makes direct knowledge transfer difficult.

To overcome this obstacle, we apply average pooling to both $\mathbf{M}_r$ and $\mathbf{M}_m$, yielding the reduced representations $\mathbf{M}_r' \in \mathbb{R}^{B \times C_{L_r}}$ and $\mathbf{M}_m' \in \mathbb{R}^{B \times C_{L_m}}$, respectively.
mathvariant="double-struck"&gt;R&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;mo&gt;&amp;#215;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;mi&gt;m&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> , respectively. Furthermore, we utilized 1 × 1 convolutional layers to ensure that the vectors <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi mathvariant="bold"&gt;M&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#8712;&lt;/mo&gt;&lt;msup&gt;&lt;mi mathvariant="double-struck"&gt;R&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;mo&gt;&amp;#215;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;mi&gt;a&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> and <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi mathvariant="bold"&gt;A&lt;/mi&gt;&lt;mi&gt;m&lt;/mi&gt;&lt;/msub&gt;&lt;mo&gt;&amp;#8712;&lt;/mo&gt;&lt;msup&gt;&lt;mi mathvariant="double-struck"&gt;R&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;B&lt;/mi&gt;&lt;mo&gt;&amp;#215;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;L&lt;/mi&gt;&lt;mi&gt;m&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> , derived from these reduced feature maps, share a common vector space dimensionality.</p> <p>Finally, we adopted mean squared error (MSE) as the loss function, denoted as <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mi&gt;loss&lt;/mi&gt;&lt;mn&gt;3&lt;/mn&gt;&lt;/msub&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> , to quantify the discrepancy between corresponding elements in vectors <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mi mathvariant="bold"&gt;a&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;/msub&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> and <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mi mathvariant="bold"&gt;a&lt;/mi&gt;&lt;mi&gt;m&lt;/mi&gt;&lt;/msub&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> belonging to <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mi mathvariant="bold"&gt;A&lt;/mi&gt;&lt;mi&gt;r&lt;/mi&gt;&lt;/msub&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> and <ephtml> &lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;msub&gt;&lt;mi mathvariant="bold"&gt;A&lt;/mi&gt;&lt;mi&gt;m&lt;/mi&gt;&lt;/msub&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> . 
Finally, we adopt the mean squared error (MSE), denoted $loss_3$, to quantify the discrepancy between corresponding elements of the vectors $\mathbf{a}_r$ and $\mathbf{a}_m$ belonging to $\mathbf{A}_r$ and $\mathbf{A}_m$. Minimizing these squared differences drives the student's feature representation toward that of the teacher.

(6) $loss_3 = \min \dfrac{1}{B \times C_{L_m}} \sum_{i=1}^{B} \sum_{j=1}^{C_{L_m}} \left( a_{r_{ij}} - a_{m_{ij}} \right)^2$
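A minimal PyTorch sketch of this feature-alignment loss follows; the module name, the use of global average pooling via `adaptive_avg_pool2d`, and the example channel sizes are our own illustrative assumptions rather than details taken from the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignLoss(nn.Module):
    """Align pooled teacher/student feature maps and penalize their MSE (cf. Eq. (6))."""

    def __init__(self, c_teacher: int, c_student: int, c_common: int):
        super().__init__()
        # 1x1 convolutions project both pooled maps into a common channel dimension.
        self.proj_t = nn.Conv2d(c_teacher, c_common, kernel_size=1)
        self.proj_s = nn.Conv2d(c_student, c_common, kernel_size=1)

    def forward(self, feat_t: torch.Tensor, feat_s: torch.Tensor) -> torch.Tensor:
        # Global average pooling removes the mismatched spatial sizes: (B, C, H, W) -> (B, C, 1, 1).
        pooled_t = F.adaptive_avg_pool2d(feat_t, 1)
        pooled_s = F.adaptive_avg_pool2d(feat_s, 1)
        # Project into a shared vector space and flatten to (B, c_common).
        a_t = self.proj_t(pooled_t).flatten(1)
        a_s = self.proj_s(pooled_s).flatten(1)
        # The teacher is frozen, so no gradients flow back into its features.
        return F.mse_loss(a_s, a_t.detach())

# Example: ResNet-50 stage-4 features (2048 ch) vs. a MobileNet-style head (1280 ch, illustrative).
loss_fn = FeatureAlignLoss(c_teacher=2048, c_student=1280, c_common=256)
loss3 = loss_fn(torch.randn(32, 2048, 7, 7), torch.randn(32, 1280, 7, 7))
```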
3.5.2. Model Cutting

The overarching goal of this step is to improve the network's efficiency by reducing its parameter count and, with it, its computational complexity. In knowledge distillation, the parameter gap between the teacher and student CNNs must remain within a manageable range: because of the smaller CNN's limited learning capacity, an excessively wide gap hinders effective knowledge transfer. The student CNN therefore cannot be made arbitrarily small, which creates a tension between maximizing efficiency and preserving accuracy. To reconcile this, we introduce a model pruning ("cutting") strategy that identifies and removes redundant or low-impact parameters, allowing the small CNN to reach a highly efficient footprint while retaining enough representational power to deliver acceptable prediction accuracy.

We take layer $t$ of the network as an example. Parameter reduction can be viewed as the elimination of insignificant filters from the tensor $\mathbf{Z}_t \in \mathbb{R}^{C_t \times C_{t-1} \times s \times s}$, i.e., as reducing the number of filters in this layer. A natural way to quantify the significance of a filter is to assess its impact on the subsequent feature maps: less influential filters have smaller effects. Specifically, if one filter is hypothetically removed, the dimensionality of $\mathbf{Z}_t$ becomes $\mathbb{R}^{(C_t - 1) \times C_{t-1} \times s \times s}$, with corresponding changes to the feature map $\mathbf{M}_t \in \mathbb{R}^{(C_t - 1) \times H \times W}$ and to the subsequent convolutional kernel $\mathbf{Z}_{t+1} \in \mathbb{R}^{C_{t+1} \times (C_t - 1) \times s \times s}$, which in turn yields an altered feature map $\mathbf{M}_{t+1} \in \mathbb{R}^{C_{t+1} \times H \times W}$.

To evaluate the importance of each filter objectively, we use a subset of images and take the adjacent, unmodified feature map $\mathbf{M}_{t+1}$ as a reference. By removing each filter in turn and computing the resulting reconstruction error against this reference, we can infer the significance of each filter.
This allows us to identify and prune the filters that contribute least to the overall representation, improving the model's efficiency without unduly compromising its accuracy.

(7) $\mathit{Score}^{C_i} = \dfrac{\sum_{l=1}^{L} \left\| \mathbf{M}_{t+1}^{C_i}(s_l) - \mathbf{M}_{t+1}(s_l) \right\|_2^2}{L}$

Here, $\{ s_l \mid l = 1, \dots, L,\ L \ll N \}$ is a sampled subset whose size $L$ is far smaller than the number of training images $N$, and $\mathbf{M}_{t+1}^{C_i}(s_l)$ denotes the feature map obtained after removing the $i$-th filter from the filters of layer $t$. The significance score therefore measures the impact of eliminating the $i$-th filter from $\mathbf{Z}_t$.
Ranking these significance scores in descending order, we prune the filters with the lowest scores according to a predetermined pruning ratio $\alpha$: we discard $C_t \times \alpha$ filters and retain $C_t \times (1 - \alpha)$ filters, producing a refined filter set for the $t$-th layer, $\mathbf{Z}_t \in \mathbb{R}^{C_t (1 - \alpha) \times C_{t-1} \times s \times s}$. This yields a filter configuration at each layer that balances efficiency and performance.
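The scoring and pruning of one layer can be sketched as follows (PyTorch); the helper names and the zeroing-out approximation of filter removal are our own simplifications of the procedure around Equation (7), not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def filter_scores(conv_t: nn.Conv2d, conv_t1: nn.Conv2d, images: torch.Tensor) -> torch.Tensor:
    """Score each filter of conv_t by the reconstruction error it causes in M_{t+1} (cf. Eq. (7)).

    Removing filter i is approximated by zeroing its output channel before conv_t1.
    `images` is a small sampled subset {s_l}, with L << N.
    """
    m_t = F.relu(conv_t(images))          # feature map M_t: (L, C_t, H, W)
    reference = conv_t1(m_t)              # unmodified reference M_{t+1}
    scores = torch.empty(conv_t.out_channels)
    for i in range(conv_t.out_channels):
        m_t_pruned = m_t.clone()
        m_t_pruned[:, i] = 0.0            # drop the contribution of filter i
        altered = conv_t1(m_t_pruned)     # altered M_{t+1} after the hypothetical removal
        scores[i] = (altered - reference).pow(2).sum(dim=(1, 2, 3)).mean()
    return scores

def keep_indices(scores: torch.Tensor, alpha: float) -> torch.Tensor:
    """Return the indices of the C_t * (1 - alpha) most important filters."""
    n_keep = int(round(scores.numel() * (1.0 - alpha)))
    return torch.argsort(scores, descending=True)[:n_keep]

# Example: layer t has 64 filters; prune 50% of them using 16 sampled images.
conv_t, conv_t1 = nn.Conv2d(32, 64, 3, padding=1), nn.Conv2d(64, 128, 3, padding=1)
kept = keep_indices(filter_scores(conv_t, conv_t1, torch.randn(16, 32, 56, 56)), alpha=0.5)
```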
While the contribution of the less significant filters is modest, they still encode image representations that can contribute positively to the model's performance, so simply removing them can degrade the network's detection capability. To counteract this loss, a widely adopted strategy is to retrain the pruned network so that the remaining filters adapt and learn the required image representations more fully. Accordingly, we retrained the downsized CNN on the entire dataset using the same training paradigm outlined in Equation (8), helping the network maintain, or even improve, its detection accuracy.

(8) $loss_t = \mathrm{LOSS} + loss_3$

Ultimately, we used a layer-wise pruning methodology in which layers are pruned successively, starting from layer $T$ and proceeding down to layer 1 of the transferred student model $Stu^K$. For each intermediate layer $t$, the pruning procedure mirrors the steps above. The pruning ratio $\alpha$ is a tunable parameter, allowing the model's efficiency to be tailored to different requirements; however, a careful balance must be struck between efficiency and accuracy, since changes to $\alpha$ affect both. After this top-down iterative pruning has been applied to all layers, the resulting $Stu$ is the final compressed student model, optimized for both efficiency and performance.

4. Experimental Results

In this section, we evaluate the proposed methods. We first describe the experimental setup in detail and then present the main results of regression combined with a heatmap network and of pruning combined with distillation supervision.

4.1. Detailed Setup

Database: for this experiment, we used three datasets: 300W [28], AFLW [29], and CelebA. The 300W and AFLW datasets were used for training and testing, while the CelebA dataset was used for ablation studies.

300W: this dataset provides 68-landmark annotations for five face datasets—LFPW, AFW, HELEN, XM2VTS, and IBUG. It includes 3148 images for training and 689 images for testing [30,32].
The test images are divided into a common subset and a challenging subset: the common subset consists of 554 images from LFPW and HELEN, and the challenging subset comprises 135 images from IBUG.

AFLW: this dataset contains 24,386 in-the-wild faces sourced from Flickr, featuring extreme poses, expressions, and occlusions. Head poses span up to 120° in yaw and up to 90° in pitch and roll. AFLW provides up to 21 landmarks per face. We used 20,000 images for training and 4386 images for testing.

CelebA: for the ablation and hyperparameter experiments, we used the CelebA dataset, which contains 10,177 identities and 202,599 face images. Each image is annotated with a face bounding box, five facial keypoint coordinates, and 40 attribute labels. We split the dataset into 80% for training and 20% for validation.

Backbone network: for the teacher model, we used the popular ResNet-50 network, which is sufficiently deep for our tasks. Its core idea is that residual blocks let the input bypass one or more convolutional layers and be added directly to the output; this identity mapping alleviates the vanishing-gradient problem in deep network training. For the student model, we used the MobileNet network. Its core feature is the depthwise separable convolution, which factorizes a standard convolution into two independent operations: a depthwise convolution and a $1 \times 1$ pointwise convolution. This greatly reduces parameters and computation, improving training efficiency.

Data processing: the face images in the original CelebA dataset were cropped so that the model could focus on learning facial features and keypoints. To improve generalization, we augmented the CelebA data by randomly flipping some images horizontally to simulate different shooting angles and by cropping face images in different regions to increase data diversity and model robustness.

Training strategy: we set the epochs, batch size, and learning rate as follows. The batch size was fixed at 32, both for the combined training of regression and heatmap and for the combined optimization of distillation and pruning. Starting from an ImageNet-pretrained model, the learning rate was decayed from 0.01 to 0.0001 in steps every 20 epochs, for a total of 60 epochs.

Evaluation criteria: the evaluation metrics differ slightly across the 300W, AFLW, and CelebA datasets. For 300W, results are reported with two normalizing factors: one uses the eye-center distance (inter-pupil normalization), and the other uses the inter-ocular distance, measured between the outer corners of the eyes. For AFLW, which contains many profile faces, the error is normalized by the ground-truth bounding box size and averaged over all visible landmarks. For CelebA, the mean squared error (MSE) is used as the evaluation metric. We compared the MSE results for regression combined with the heatmap under different weight ratios, the MSE results for pruning combined with distillation under various weight ratios, and, additionally, the MSE results across different pruning rates.
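For concreteness, a small NumPy sketch of the normalized mean error (NME) used for 300W and AFLW is shown below; the function and argument names are our own, and the normalizing factor (inter-pupil or inter-ocular distance, or bounding-box size) is passed in explicitly.

```python
import numpy as np

def nme(pred, gt, norm_factor, visible=None):
    """Normalized mean error over the visible landmarks of one face.

    pred, gt: (K, 2) arrays of predicted and ground-truth landmark coordinates.
    norm_factor: inter-pupil / inter-ocular distance (300W) or a box-size term (AFLW).
    visible: optional boolean mask of length K (AFLW annotates only visible points).
    """
    errors = np.linalg.norm(pred - gt, axis=1)
    if visible is not None:
        errors = errors[visible]
    return float(errors.mean() / norm_factor)

# Example with 68 landmarks, normalized by the inter-ocular distance
# (outer eye corners are indices 36 and 45 in the 68-point scheme).
gt = np.random.rand(68, 2) * 256
pred = gt + np.random.randn(68, 2)
inter_ocular = np.linalg.norm(gt[36] - gt[45])
print(nme(pred, gt, inter_ocular))
```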
4.2. Main Results

This section begins with a detailed comparison of the proposed method against existing facial keypoint detection methods. We then present the evaluation of regression combined with the heatmap and of distillation combined with pruning. Finally, we discuss the impact of different loss weights on regression combined with the heatmap and on distillation combined with pruning, as well as the effect of varying pruning rates.

4.2.1. Comparison with Existing Methods

We first compared the mobile–distill–prune model and ResNet-50, both using the regression combined with heatmap approach, against state-of-the-art face landmark detection methods on the 300W dataset. The results are shown in Table 1. To evaluate model performance comprehensively, we report two versions of the mobile–distill–prune model: mobile–distill–prune-0.25X, which compresses the model by setting the pruning rate to 75%, and mobile–distill–prune-1X, the full model without pruning. Both were trained using only the 300W training data. We also include the performance of ResNet-50 for comparison.

Although mobile–distill–prune-0.25X performs slightly below mobile–distill–prune-1X, it still outperforms many competitors, demonstrating excellent detection capability. This comparison shows that mobile–distill–prune-0.25X strikes a good balance between model compression and detection accuracy: despite the significant reduction in model size through pruning, detection accuracy does not drop substantially, making it an attractive compromise for practical applications.

ResNet-50 consistently shows the best overall performance among all tested models, especially on challenging facial data, indicating strong feature extraction capability and robustness. It also highlights room for improvement in the mobile–distill–prune model: by further optimizing the balance between pruning and distillation, its performance may be brought closer to ResNet-50's detection accuracy.

We further evaluated the different methods on the AFLW dataset; their NME results are reported in Table 2. As the table shows, TSR, CPM, SAN, PFLD, our mobile–distill–prune series, and ResNet-50 significantly outperform the other competing methods. Among these, our ResNet-50 achieved the best accuracy (NME 1.84), followed by our mobile–distill–prune-1X (NME 1.86); PFLD 1X ranked third with an NME of 1.88.

In summary, the experiments reveal the strengths and weaknesses of the different models on the 300W dataset under the combination of regression with a heatmap and the joint application of distillation and pruning. Although mobile–distill–prune-0.25X is slightly inferior to mobile–distill–prune-1X, it achieves a satisfactory balance between model size and detection accuracy, while ResNet-50 leads in overall performance, particularly on complex facial data.
The experimental results verify the effectiveness of the proposed method and offer new insights and methodology for facial keypoint detection. By combining model distillation, pruning, and efficient network architecture design, the model's computational efficiency and applicability can be substantially improved while maintaining high accuracy, providing a useful reference for deploying efficient and accurate facial landmark detection models in practical applications.

4.2.2. Regression Combined with Heatmap Results

To demonstrate the effectiveness of heatmaps in face keypoint detection, we conducted comparative experiments evaluating detection performance with the regression model alone and with heatmaps added. Figure 3 presents facial keypoint detection results across various environments.

The base regression model, which directly predicts keypoint coordinates, obtained an MSE loss of 0.0556. We then added heatmaps as an intermediate representation for keypoint detection: by generating a heatmap from the label image, we capture and exploit the heat value at each keypoint location. The results in Table 3 show that the MSE loss decreased significantly to 0.0462 after adding the heatmap, an improvement of 16.90%. To evaluate the generalization ability of the regression algorithm and of regression combined with the heatmap, we used three-fold cross-validation with pairwise combinations of three different test sets, yielding three validation values and corresponding p-values per experiment. Performance metrics were collected for each fold to ensure a robust evaluation. We applied t-tests to examine the mean differences in performance metrics across folds and the Levene test to check homogeneity of variances. Large p-values from both tests indicate that the mean performance differences were not statistically significant and that the variances were homogeneous.

The improvement arises because heatmaps provide more detailed local information, making the model more robust and accurate when processing keypoints in specific regions of the image. Heatmaps are particularly effective at localizing keypoints in challenging scenarios involving large angles, occlusion, or complex lighting, thereby improving the overall performance of the model.
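As an illustration of how such heatmap supervision is commonly constructed (the paper does not specify its kernel or σ, so the values below are assumptions), a Gaussian target map can be rendered per keypoint:

```python
import numpy as np

def gaussian_heatmap(size_hw, keypoint_xy, sigma=2.0):
    """Render one keypoint as a 2D Gaussian 'heat' map of shape (H, W).

    size_hw: (H, W) of the target map; keypoint_xy: (x, y) in the same coordinate frame.
    """
    h, w = size_hw
    xs = np.arange(w)[None, :]
    ys = np.arange(h)[:, None]
    x0, y0 = keypoint_xy
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))

# One map per landmark; the heatmap head regresses these maps,
# while the regression head still predicts the raw (x, y) coordinates.
keypoints = [(20.0, 30.0), (44.0, 30.0), (32.0, 40.0), (24.0, 52.0), (40.0, 52.0)]  # 5 CelebA-style points
targets = np.stack([gaussian_heatmap((64, 64), kp) for kp in keypoints])            # shape (5, 64, 64)
```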
4.2.3. Distillation Combined with Pruning Results

To show the effectiveness of combining distillation with pruning for face keypoint detection, we ran multiple experiments and compared the detection results of several model combinations: regression + heatmap, regression + heatmap + distillation, and regression + heatmap + distillation + pruning.

As shown in Table 4, the teacher model trained with the regression + heatmap approach achieved an MSE loss of 0.0457, while the student model trained with the same approach reached an MSE loss of 0.0462.

To improve the student model's accuracy, we added a distillation step to the regression + heatmap method, transferring knowledge from the teacher to the student. This lowered the student's MSE loss to 0.0434: knowledge distillation enables the lightweight student model to learn the teacher's deep features, improving its detection accuracy. Building on this, we added a pruning step, removing 50% of the student model's redundant parameters. The pruned student model obtained an MSE loss of 0.0445. Although this is a slight increase over the unpruned student model, the pruned model substantially reduces computation and model size while maintaining high precision, thereby improving efficiency. This behavior is attributed to the combination of distillation and pruning, which allows the network to be trained and optimized with minimal accuracy loss: distillation helps the lightweight model learn more powerful representations, while pruning streamlines the model structure, making it more efficient and lightweight.

To assess the generalization capability of the various algorithms—regression + heatmap on ResNet-50, regression + heatmap on MobileNet, regression + heatmap + distillation on MobileNet, and regression + heatmap + distillation + pruning (50%) on MobileNet—we again conducted three-fold cross-validation with pairwise combinations of three different test sets, yielding three validation values and corresponding p-values per experiment. Performance metrics were collected for each fold to ensure a thorough evaluation. We applied t-tests to examine the mean differences across folds and the Levene test to check homogeneity of variances; the large p-values from both tests indicate that the mean performance differences were not statistically significant and the variances were homogeneous.
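The statistical checks described above can be reproduced with standard SciPy routines; the per-fold scores below are placeholders, not the paper's measurements.

```python
from scipy import stats

# Per-fold performance metrics for one algorithm under two different test-set combinations
# (illustrative placeholder values only).
fold_a = [0.0461, 0.0458, 0.0465]
fold_b = [0.0463, 0.0460, 0.0459]

t_stat, p_mean = stats.ttest_ind(fold_a, fold_b)   # do the fold means differ?
_, p_var = stats.levene(fold_a, fold_b)            # are the fold variances homogeneous?
# Large p-values on both tests support the claim that performance is stable across folds.
print(f"t-test p = {p_mean:.3f}, Levene p = {p_var:.3f}")
```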
4.2.4. Regression and Heatmap Loss Weight Ratio Results

To evaluate how different combinations of the regression and heatmap loss weights affect model performance, we adjusted the weight ratio of the two losses to find the best balance between regression accuracy and heatmap quality. The weight ratio of regression to heatmap is denoted $\lambda_{rh}$.

(9) $\lambda_{rh} \in \{0.2, 0.5, 1, 2, 5\}$

The comparative results are shown in Figure 4: the minimum MSE, 0.0462, is obtained when the weight ratio is 1, and the MSE increases as the ratio moves away from this optimum, degrading performance. When the ratio is close to 1, the model balances global keypoint-coordinate prediction against local heatmap detail, enabling accurate keypoint prediction and fine feature refinement while also preventing large imbalances during gradient updates that could lead to exploding or vanishing gradients. When the ratio exceeds 1, the model prioritizes global coordinate prediction but under-weights local details, which hampers keypoint refinement, especially under severe occlusion. When the ratio is below 1, the model over-emphasizes local heatmap features at the expense of global keypoint accuracy, capturing global structure poorly and overfitting to local details.
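A minimal sketch of such a weighted joint objective and the sweep over Equation (9)'s grid; the loss terms and the weighting form are illustrative assumptions, since the paper specifies only the ratio values.

```python
import torch
import torch.nn.functional as F

def joint_loss(pred_coords, gt_coords, pred_heatmaps, gt_heatmaps, ratio_rh: float):
    """Weighted combination of the coordinate-regression and heatmap terms.

    ratio_rh is the regression-to-heatmap weight ratio; ratio_rh = 1 weights both terms equally.
    """
    loss_reg = F.mse_loss(pred_coords, gt_coords)
    loss_heat = F.mse_loss(pred_heatmaps, gt_heatmaps)
    return ratio_rh * loss_reg + loss_heat

for ratio in (0.2, 0.5, 1, 2, 5):   # the grid of Equation (9)
    # ...train or validate the model with this ratio and record the validation MSE...
    pass
```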
4.2.5. Distillation and Pruning Loss Weight Ratio Results

Figure 5 presents the experiments on the effect of the loss weights when distillation is combined with pruning. We studied how different distillation and pruning loss weights affect model performance: by adjusting their weight ratio $\lambda_{dp}$, we aimed to minimize model size and computational cost while maintaining accuracy.

(10) $\lambda_{dp} \in \{0.2, 0.5, 1, 2, 5\}$

The results show that, with a weight ratio of 1 and a pruning rate of 50%, the best performance is achieved—an MSE of 0.0445—when the distillation and pruning loss weights are equal. This balance lets the model achieve compression while maintaining accuracy. When the distillation weight is too small, the MSE rises to 0.0486, indicating that the student model cannot learn effectively from the teacher. Conversely, when the distillation weight is too large, the MSE is 0.0472, as over-emphasizing the distillation loss neglects the role of pruning and fails to effectively reduce the model size and computational cost.

4.2.6. Experimental Results of Different Pruning Rates

The pruning rate is the proportion of parameters removed from the model. Pruning reduces the computational load and parameter count, improving the model's operational efficiency. To study the effect of pruning on model performance, we ran comparative experiments with different pruning rates r, evaluating the model on the validation set under each rate to assess the impact on accuracy and computational efficiency and to identify the optimal rate.

(11) $r \in \{10\%, 30\%, 50\%, 70\%, 90\%\}$

At a pruning rate of $r = 10\%$, the model's MSE on the validation set was 0.0438. As the pruning rate increased, computational efficiency improved but accuracy decreased. At $r = 90\%$, the MSE rose to 0.0478, a precision loss of 9.13% relative to the unpruned model. The results in Figure 6 indicate that a low pruning rate (e.g., 10%) has minimal impact on accuracy but offers only a limited gain in computational efficiency.
Conversely, a high pruning rate (e.g., 90%) significantly reduces the computational load but also causes a substantial loss in accuracy.

4.2.7. Performance Analysis of Model Size, Computational Speed, and Running on Mobile Devices

Model size: as shown in Table 5, our mobile–distill–prune-0.25X model is significantly smaller, at 2.5 Mb, saving over 5 Mb compared with mobile–distill–prune-1X. This makes mobile–distill–prune-0.25X far more compact than most models, such as SDM (10.1 Mb), LAB (50.7 Mb), and SAN, which is around 800 Mb with its two VGG-based subnets measuring 270.5 Mb and 528 Mb.

Processing speed: we measure the efficiency of each algorithm on an Intel i7-6700K CPU (Intel Corporation, Santa Clara, CA, USA; denoted as C) and an NVIDIA GTX 1080Ti GPU (NVIDIA, Santa Clara, CA, USA; denoted as G), unless stated otherwise. Since only the CPU version of SDM [36] and the GPU version of SAN [30] are publicly available, their times are reported accordingly. For LAB [31], only the CPU version is available for download, but its authors report that the algorithm runs in approximately 60 ms on a TITAN X GPU (denoted as G*). Our results show that mobile–distill–prune-0.25X and mobile–distill–prune-1X outperform most algorithms in speed on both CPU and GPU. Notably, LAB's CPU time is reported in seconds rather than milliseconds. Mobile–distill–prune-1X required 2.25 times the CPU time and 1.89 times the GPU time of mobile–distill–prune-0.25X. Despite this, PFLD 1X remained much faster than the other models. Additionally, for PFLD 0.25X and PFLD 1X, we conducted tests on a Qualcomm ARM 845 processor (denoted as A); here, mobile–distill–prune-0.25X processed a face in 12 ms (over 83 fps), while mobile–distill–prune-1X processed a face in 18.7 ms (over 53 fps).

Mobile terminal evaluation: Table 6 compares the different models on three mobile devices (HUAWEI Mate9 Pro, HUAWEI Technologies Co., Ltd., Shenzhen, China; Xiaomi 9, Xiaomi Corporation, Beijing, China; and HUAWEI P30 Pro, HUAWEI Technologies Co., Ltd., Shenzhen, China). The comparison covers CPU specifications, RAM, and processing speed (in milliseconds).

On all devices, the mobile–distill–prune-1X and mobile–distill–prune-0.25X models are significantly faster, with mobile–distill–prune-0.25X the fastest, reaching a processing speed of 11 ms on the HUAWEI P30 Pro. While maintaining high accuracy, it greatly reduces the model's computational complexity and storage requirements, allowing inference in a very short time. This is of practical importance for real-time applications such as facial recognition, augmented reality, and intelligent monitoring.

5. Conclusions

This paper addressed the robustness of face keypoint detection, focusing on effective and efficient detection in challenging environments such as occlusion and inversion. Our key contributions are as follows. Effectiveness enhancement: we proposed a novel approach that combines regression and heatmap training, inspired by human keypoint detection.
This method significantly improves the accuracy and reliability of face keypoint detection, even under extreme conditions. Efficiency improvement: to reduce the model's computational load, we introduced a compression method that integrates distillation and pruning, effectively minimizing model size while maintaining high performance and making it suitable for deployment on resource-constrained devices. Integrated technique: we combined regression, heatmap, pruning, and distillation techniques into a cohesive framework; comprehensive experiments demonstrated the robustness and superiority of our method over existing approaches, especially in speed and accuracy. Experimental validation: through detailed experiments, we compared our method with existing deep learning approaches and used statistical significance analyses to demonstrate the robustness and generalization ability of our model. In ablation and hyperparameter studies, we examined how combining regression with heatmap training, and pruning with distillation, affects the algorithm's accuracy. Finally, we conducted a comparative study of model size, computational speed, and detection on mobile devices. The results validate the efficiency, effectiveness, and lightweight nature of the proposed method.

Our findings have important practical implications for real-time applications such as facial recognition, augmented reality, and intelligent monitoring. The ability to detect face keypoints accurately and efficiently, even in extreme conditions, enhances the usability and reliability of these applications, and the model's reduced computational complexity and storage requirements make it well suited to mobile devices and other resource-limited platforms. While our approach is robust to occlusion and inversion, there may still be edge cases where performance could be improved. Future research should focus on handling more complex and varied occlusions, and further exploration of the synergy between techniques, such as advanced pruning methods and more sophisticated distillation strategies, could yield even more efficient models.

Figures and Tables

Figure 1. An illustration of the joint training of regression and heatmap with the joint model compression of pruning and distillation. The purple box (Teacher Backbone) represents the original, larger, and more complex neural network model. The blue box, comprising the Regression Head and Heatmap Head, represents the different output heads of the network: the Regression Head predicts specific values indicating the precise locations of features such as the eyes, nose, and mouth, while the Heatmap Head generates heatmaps representing the probability distribution of where these features may appear.
The green box (Student Backbone) signifies the smaller, more efficient neural network model after compression.

Figure 2. The three stages of the model pruning network architecture combined with distillation: (a) in the model initialization stage, the input image is processed through the three convolutional layers of the initial network structure (large, medium, and small purple blocks) to extract features and output the prediction results (thin green bars); (b) in the model distillation stage, the input image is processed through the three convolutional layers after distillation (large, medium, and small blue blocks) and the prediction results are output, with the outputs of the initial network and the distillation network connected by a red dotted line to compute the distillation loss; (c) in the model pruning stage combined with distillation, the input image is processed through the three convolutional layers pruned layer by layer (large, medium, and small yellow blocks) and the prediction results are output, with the outputs of the distillation network and the pruned network connected by a red dotted line to compute the pruning loss and guide the pruning process.

Figure 3. Results of facial keypoint detection for face images in various environments. The red dots represent the detected facial keypoints, including the corners of the eyes, the tip of the nose, and the corners of the mouth; these keypoints are used to identify and track specific facial features.

Figure 4. Results of regression and heatmap under different weight ratios.

Figure 5. Distillation and pruning loss weight ratio results.

Figure 6. Experimental results under different pruning rates.

Table 1. Normalized mean error comparison on the 300W dataset's common subset, challenging subset, and full set.
The bold entries represent the model proposed in this paper.</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;Method&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;Common&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;Challenging&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;Fullset&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td colspan="4" align="center" valign="middle" style="border-bottom:solid thin"&gt;Inter-pupil Normalization (IPN)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;RCPR [&lt;xref ref-type="bibr" rid="bibr33"&gt;33&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;6.18&lt;/td&gt;&lt;td align="center" valign="middle"&gt;17.26&lt;/td&gt;&lt;td align="center" valign="middle"&gt;8.35&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;CFAN [&lt;xref ref-type="bibr" rid="bibr34"&gt;34&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;5.50&lt;/td&gt;&lt;td align="center" valign="middle"&gt;16.78&lt;/td&gt;&lt;td align="center" valign="middle"&gt;7.69&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;ESR [&lt;xref ref-type="bibr" rid="bibr35"&gt;35&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;5.28&lt;/td&gt;&lt;td align="center" valign="middle"&gt;17.00&lt;/td&gt;&lt;td align="center" valign="middle"&gt;7.58&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;SDM [&lt;xref ref-type="bibr" rid="bibr36"&gt;36&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;5.57&lt;/td&gt;&lt;td align="center" valign="middle"&gt;15.40&lt;/td&gt;&lt;td align="center" valign="middle"&gt;7.50&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;LBF [&lt;xref ref-type="bibr" rid="bibr37"&gt;37&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.95&lt;/td&gt;&lt;td align="center" valign="middle"&gt;11.98&lt;/td&gt;&lt;td align="center" valign="middle"&gt;6.32&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;CFSS [&lt;xref ref-type="bibr" rid="bibr38"&gt;38&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.73&lt;/td&gt;&lt;td align="center" valign="middle"&gt;9.98&lt;/td&gt;&lt;td align="center" valign="middle"&gt;5.76&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;3DDFA [&lt;xref ref-type="bibr" rid="bibr39"&gt;39&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;6.15&lt;/td&gt;&lt;td align="center" valign="middle"&gt;10.59&lt;/td&gt;&lt;td align="center" valign="middle"&gt;7.01&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;TCDCN [&lt;xref ref-type="bibr" rid="bibr40"&gt;40&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.80&lt;/td&gt;&lt;td align="center" valign="middle"&gt;8.60&lt;/td&gt;&lt;td align="center" valign="middle"&gt;5.54&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;MDM [&lt;xref ref-type="bibr" rid="bibr41"&gt;41&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.83&lt;/td&gt;&lt;td align="center" valign="middle"&gt;10.14&lt;/td&gt;&lt;td align="center" valign="middle"&gt;5.88&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;SeqMT [&lt;xref ref-type="bibr" 
rid="bibr42"&gt;42&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.84&lt;/td&gt;&lt;td align="center" valign="middle"&gt;9.93&lt;/td&gt;&lt;td align="center" valign="middle"&gt;5.74&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;RAR [&lt;xref ref-type="bibr" rid="bibr43"&gt;43&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.12&lt;/td&gt;&lt;td align="center" valign="middle"&gt;8.35&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.94&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;DVLN [&lt;xref ref-type="bibr" rid="bibr44"&gt;44&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;3.94&lt;/td&gt;&lt;td align="center" valign="middle"&gt;7.62&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.66&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;CPM [&lt;xref ref-type="bibr" rid="bibr45"&gt;45&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;3.39&lt;/td&gt;&lt;td align="center" valign="middle"&gt;8.14&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.36&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;DCFE [&lt;xref ref-type="bibr" rid="bibr46"&gt;46&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;3.83&lt;/td&gt;&lt;td align="center" valign="middle"&gt;7.54&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.55&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;TSR [&lt;xref ref-type="bibr" rid="bibr47"&gt;47&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.36&lt;/td&gt;&lt;td align="center" valign="middle"&gt;7.56&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.99&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;LAB [&lt;xref ref-type="bibr" rid="bibr31"&gt;31&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;3.42&lt;/td&gt;&lt;td align="center" valign="middle"&gt;6.98&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.12&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;PFLD 0.25X [&lt;xref ref-type="bibr" rid="bibr1"&gt;1&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;3.38&lt;/td&gt;&lt;td align="center" valign="middle"&gt;6.83&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.02&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;PFLD 1X [&lt;xref ref-type="bibr" rid="bibr1"&gt;1&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;3.32&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;6.56&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;3.95&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;Mobile&amp;#8211;distill&amp;#8211;prune-0.25X&lt;/bold&gt; (vs. PFLD 0.25X [&lt;xref ref-type="bibr" rid="bibr1"&gt;1&lt;/xref&gt;])&lt;/td&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;3.36&lt;/bold&gt; (+0.59%)&lt;/td&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;6.78&lt;/bold&gt; (+0.73%)&lt;/td&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;3.98&lt;/bold&gt; (+0.99%)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;Mobile&amp;#8211;distill&amp;#8211;prune-1X&lt;/bold&gt; (vs. 
PFLD 1X [&lt;xref ref-type="bibr" rid="bibr1"&gt;1&lt;/xref&gt;])&lt;/td&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;3.24&lt;/bold&gt; (+2.4%)&lt;/td&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;6.54&lt;/bold&gt; (+0.3%)&lt;/td&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;3.89&lt;/bold&gt; (+1.5%)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;ResNet-50&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;3.20&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;6.48&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;3.87&lt;/bold&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td colspan="4" align="center" valign="middle" style="border-bottom:solid thin"&gt;Inter-ocular Normalization (ION)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;PIFA-CNN [&lt;xref ref-type="bibr" rid="bibr48"&gt;48&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;5.43&lt;/td&gt;&lt;td align="center" valign="middle"&gt;9.88&lt;/td&gt;&lt;td align="center" valign="middle"&gt;6.30&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;RDR [&lt;xref ref-type="bibr" rid="bibr49"&gt;49&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;5.03&lt;/td&gt;&lt;td align="center" valign="middle"&gt;8.95&lt;/td&gt;&lt;td align="center" valign="middle"&gt;5.80&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;PCD-CNN [&lt;xref ref-type="bibr" rid="bibr32"&gt;32&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;3.67&lt;/td&gt;&lt;td align="center" valign="middle"&gt;7.62&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4.44&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;SAN [&lt;xref ref-type="bibr" rid="bibr30"&gt;30&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;3.34&lt;/td&gt;&lt;td align="center" valign="middle"&gt;6.60&lt;/td&gt;&lt;td align="center" valign="middle"&gt;3.98&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;PFLD 0.25X [&lt;xref ref-type="bibr" rid="bibr1"&gt;1&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle"&gt;3.03&lt;/td&gt;&lt;td align="center" valign="middle"&gt;5.15&lt;/td&gt;&lt;td align="center" valign="middle"&gt;3.45&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;PFLD 1X [&lt;xref ref-type="bibr" rid="bibr1"&gt;1&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;3.01&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;5.08&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;3.40&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;Mobile&amp;#8211;distill&amp;#8211;prune-0.25X&lt;/bold&gt; (vs. 
PFLD 0.25X [&lt;xref ref-type="bibr" rid="bibr1"&gt;1&lt;/xref&gt;])&lt;/td&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;3.02&lt;/bold&gt; (+0.3%)&lt;/td&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;5.14&lt;/bold&gt; (+0.19%)&lt;/td&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;3.44&lt;/bold&gt; (+0.28%)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;Mobile&amp;#8211;distill&amp;#8211;prune-1X&lt;/bold&gt; (vs. PFLD 1X [&lt;xref ref-type="bibr" rid="bibr1"&gt;1&lt;/xref&gt;])&lt;/td&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;2.98&lt;/bold&gt; (+0.99%)&lt;/td&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;5.04&lt;/bold&gt; (+0.78%)&lt;/td&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;3.38&lt;/bold&gt; (+0.58%)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;ResNet-50&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;2.96&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;4.96&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;3.34&lt;/bold&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>Table 2 Normalized mean error comparison on the AFLW–full dataset. The bold entries represent the model proposed in this paper.</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;Method&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;RCPR [&lt;xref ref-type="bibr" rid="bibr33"&gt;33&lt;/xref&gt;]&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;CDM [&lt;xref ref-type="bibr" rid="bibr50"&gt;50&lt;/xref&gt;]&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;SDM [&lt;xref ref-type="bibr" rid="bibr36"&gt;36&lt;/xref&gt;]&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;ERT [&lt;xref ref-type="bibr" rid="bibr51"&gt;51&lt;/xref&gt;]&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;LBF [&lt;xref ref-type="bibr" rid="bibr37"&gt;37&lt;/xref&gt;]&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;CFSS [&lt;xref ref-type="bibr" rid="bibr38"&gt;38&lt;/xref&gt;]&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;CCL [&lt;xref ref-type="bibr" rid="bibr52"&gt;52&lt;/xref&gt;]&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;Binary-CNN [&lt;xref ref-type="bibr" rid="bibr53"&gt;53&lt;/xref&gt;]&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;PCD-CNN [&lt;xref ref-type="bibr" rid="bibr32"&gt;32&lt;/xref&gt;]&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;AFLW&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;5.43&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;3.73&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;4.05&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid 
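<p>For reference, the NME values above are mean point-to-point landmark errors normalized by a fixed facial distance (inter-pupil or inter-ocular). The following is a minimal NumPy sketch of how a per-image NME is typically computed; it is an illustration rather than the authors' evaluation code, and the default landmark indices used for the normalizing distance are placeholder values for a 68-point annotation scheme.</p>

```python
import numpy as np

def nme(pred, gt, norm_idx=(36, 45)):
    """Normalized mean error for one face, reported as a percentage.

    pred, gt : (N, 2) arrays of predicted / ground-truth landmark coordinates.
    norm_idx : indices of the two landmarks whose distance normalizes the error
               (outer eye corners for inter-ocular, pupil centers for inter-pupil);
               (36, 45) is an illustrative choice for 68-point annotations.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    per_point_err = np.linalg.norm(pred - gt, axis=1).mean()    # mean Euclidean error
    d_norm = np.linalg.norm(gt[norm_idx[0]] - gt[norm_idx[1]])  # normalizing distance
    return 100.0 * per_point_err / d_norm

# Toy usage with synthetic landmarks.
rng = np.random.default_rng(0)
gt = rng.uniform(0, 112, size=(68, 2))
pred = gt + rng.normal(scale=1.0, size=(68, 2))
print(f"NME: {nme(pred, gt):.2f}%")
```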
thin"&gt;4.35&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;4.25&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;3.92&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;2.72&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;2.85&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;2.40&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;Method&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;TSR [&lt;xref ref-type="bibr" rid="bibr47"&gt;47&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;CPM [&lt;xref ref-type="bibr" rid="bibr45"&gt;45&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;SAN [&lt;xref ref-type="bibr" rid="bibr30"&gt;30&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;PFLD 0.25X [&lt;xref ref-type="bibr" rid="bibr1"&gt;1&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;PFLD 1X [&lt;xref ref-type="bibr" rid="bibr1"&gt;1&lt;/xref&gt;]&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;mobile&amp;#8211;distill&amp;#8211;prune-0.25X&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;mobile&amp;#8211;distill&amp;#8211;prune-1X&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;ResNet-50&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin" /&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;AFLW&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;2.17&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;2.33&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;1.91&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;2.07&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;1.88&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;2.06&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;1.86&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;1.84&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin" /&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>Table 3 Comparison of the performance between regression alone and regression combined with heatmap.</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;Backbone&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;Algorithm&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;MSE&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;&lt;italic&gt;t&lt;/italic&gt;-Test/&lt;italic&gt;p&lt;/italic&gt;-Value&lt;/th&gt;&lt;th align="center" 
style="border-bottom:solid thin;border-top:solid thin"&gt;Levene-Test/&lt;break /&gt;&lt;italic&gt;p&lt;/italic&gt;-Value&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td rowspan="2" align="center" valign="middle"&gt;MobileNet&lt;/td&gt;&lt;td rowspan="2" align="center" valign="middle"&gt;regression&lt;/td&gt;&lt;td rowspan="2" align="center" valign="middle"&gt;0.0556&lt;/td&gt;&lt;td align="center" valign="middle"&gt;T: 0.52, 0.97, 0.44&lt;/td&gt;&lt;td align="center" valign="middle"&gt;L: 0.73, 1.23, 0.06&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;P: 0.59, 0.32, 0.65&lt;/td&gt;&lt;td align="center" valign="middle"&gt;P: 0.39, 0.26, 0.80&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td rowspan="2" align="center" valign="middle" style="border-bottom:solid thin"&gt;MobileNet&lt;/td&gt;&lt;td rowspan="2" align="center" valign="middle" style="border-bottom:solid thin"&gt;regression + heatmap&lt;/td&gt;&lt;td rowspan="2" align="center" valign="middle" style="border-bottom:solid thin"&gt;0.0462&lt;/td&gt;&lt;td align="center" valign="middle"&gt;T: 0.64, 0.93, 0.29&lt;/td&gt;&lt;td align="center" valign="middle"&gt;L: 0.03, 1.87, 1.28&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;P: 0.52, 0.34, 0.76&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;P: 0.84, 0.34, 0.25&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <p>Table 4 Performance comparison of distillation combined with pruning based on regression and heatmap.</p> <p> <ephtml> &lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;Backbone&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;Algorithm&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;MSE&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;&lt;italic&gt;t&lt;/italic&gt;-Test/&lt;italic&gt;p&lt;/italic&gt;-Value&lt;/th&gt;&lt;th align="center" style="border-bottom:solid thin;border-top:solid thin"&gt;Levene-Test/&lt;italic&gt;p&lt;/italic&gt;-Value&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td rowspan="2" align="center" valign="middle"&gt;ResNet-50&lt;/td&gt;&lt;td rowspan="2" align="center" valign="middle"&gt;regression + heatmap&lt;/td&gt;&lt;td rowspan="2" align="center" valign="middle"&gt;0.0457&lt;/td&gt;&lt;td align="center" valign="middle"&gt;T: 0.13, 0.48, 0.35&lt;/td&gt;&lt;td align="center" valign="middle"&gt;L: 1.21, 1.55, 0.02&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;P: 0.89, 0.62, 0.72&lt;/td&gt;&lt;td align="center" valign="middle"&gt;P: 0.26, 0.21, 0.88&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td rowspan="2" align="center" valign="middle"&gt;MobileNet&lt;/td&gt;&lt;td rowspan="2" align="center" valign="middle"&gt;regression + heatmap&lt;/td&gt;&lt;td rowspan="2" align="center" valign="middle"&gt;0.0462&lt;/td&gt;&lt;td align="center" valign="middle"&gt;T: 0.64, 0.93, 0.29&lt;/td&gt;&lt;td align="center" valign="middle"&gt;L: 0.03, 1.87, 1.28&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;P: 0.52, 0.34, 0.76&lt;/td&gt;&lt;td align="center" valign="middle"&gt;P: 0.84, 0.34, 0.25&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td rowspan="2" align="center" valign="middle"&gt;MobileNet&lt;/td&gt;&lt;td rowspan="2" align="center" valign="middle"&gt;regression + heatmap + &lt;break /&gt; 
<p>Table 5 Comparison of model size and processing speed with existing methods. The bold entries represent the model proposed in this paper.</p>

| Model | Size (MB) | Speed |
|---|---|---|
| SDM [36] | 10.1 | 16 ms (C) |
| SAN [30] | 270.5 + 528 | 343 ms (G) |
| LAB [31] | 50.7 | 2.6 s (C), 60 ms (G*) |
| PFLD 0.25X [1] | 2.1 | 1.2 ms (C), 1.2 ms (G), 7 ms (A) |
| PFLD 1X [1] | 12.5 | 6.1 ms (C), 3.5 ms (G), 26.4 ms (A) |
| **mobile–distill–prune-0.25x** | **2.5** | **2 ms (C), 1.9 ms (G), 12 ms (A)** |
| **mobile–distill–prune-1x** | **7.8** | **4.5 ms (C), 3.6 ms (G), 18.7 ms (A)** |

<p>Table 6 Comparing model performance on three different mobile devices. The bold entries represent the model proposed in this paper.</p>

| Mobile | HUAWEI Mate9 Pro | Xiaomi MI 9 | HUAWEI P30 Pro |
|---|---|---|---|
| CPU | Kirin960 | Snapdragon855 | Kirin980 |
| Frequency | 4 × 2.4 GHz + 4 × 1.8 GHz | 1 × 2.84 GHz + 3 × 2.42 GHz + 4 × 1.78 GHz | 2 × 2.6 GHz + 2 × 1.92 GHz + 4 × 1.8 GHz |
| RAM | 4 GB | 8 GB | 8 GB |
| ResNet50 | 87 ms | 63 ms | 49 ms |
| **Mobile–distill–prune-1x** | 31 ms | 18 ms | 14 ms |
| **Mobile–distill–prune-0.25x** | 19 ms | 13 ms | 11 ms |
xmlns=""&gt;&lt;mrow&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mo&gt;&amp;#215;&lt;/mo&gt;&lt;mn&gt;2.6&lt;/mn&gt;&lt;/mrow&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/p&gt; GHz +&lt;break /&gt;&lt;p&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics xmlns=""&gt;&lt;mrow&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mo&gt;&amp;#215;&lt;/mo&gt;&lt;mn&gt;1.92&lt;/mn&gt;&lt;/mrow&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/p&gt; GHz + &lt;p&gt;&lt;math xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics xmlns=""&gt;&lt;mrow&gt;&lt;mn&gt;4&lt;/mn&gt;&lt;mo&gt;&amp;#215;&lt;/mo&gt;&lt;mn&gt;1.8&lt;/mn&gt;&lt;/mrow&gt;&lt;/semantics&gt;&lt;/math&gt;&lt;/p&gt; GHz&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;RAM&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle"&gt;4 GB&lt;/td&gt;&lt;td align="center" valign="middle"&gt;8 GB&lt;/td&gt;&lt;td align="center" valign="middle"&gt;8 GB&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;ResNet50&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle"&gt;87 ms&lt;/td&gt;&lt;td align="center" valign="middle"&gt;63 ms&lt;/td&gt;&lt;td align="center" valign="middle"&gt;49 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle"&gt;&lt;bold&gt;Mobile&amp;#8211;distill&amp;#8211;prune-1x&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle"&gt;31 ms&lt;/td&gt;&lt;td align="center" valign="middle"&gt;18 ms&lt;/td&gt;&lt;td align="center" valign="middle"&gt;14 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;&lt;bold&gt;Mobile&amp;#8211;distill&amp;#8211;prune-0.25x&lt;/bold&gt;&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;19 ms&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;13 ms&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;11 ms&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <hd id="AN0179351181-24">Author Contributions</hd> <p>Conceptualization, Y.H., Y.C., J.W. and Q.W.; Methodology, Y.H., P.Z. and J.L.; Software, Y.H., Y.C. and J.W.; Validation, Y.H. and P.Z.; Formal analysis, Y.C., J.W. and J.L.; Investigation, J.W. and Q.W.; Resources, P.Z.; Data curation, Y.C.; Writing—original draft, Y.H., Y.C., P.Z. and J.L.; Writing—review &amp; editing, Y.H., J.W. and Q.W. All authors have read and agreed to the published version of the manuscript.</p> <hd id="AN0179351181-25">Institutional Review Board Statement</hd> <p>Not applicable.</p> <hd id="AN0179351181-26">Informed Consent Statement</hd> <p>Not applicable.</p> <hd id="AN0179351181-27">Data Availability Statement</hd> <p>300W Dataset: the data presented in this study are openly available in FigShare at 10.6084/m9.figshare.2008501.v1, reference number 2008501. AFLW Dataset: the data presented in this study are openly available in FigShare at 10.6084/m9.figshare.1270890.v2, reference number 1270890. 
CelebA Dataset: the data presented in this study are openly available in FigShare at 10.6084/m9.figshare.2139357.v1, reference number 2139357.</p> <hd id="AN0179351181-28">Conflicts of Interest</hd> <p>The authors declare no conflicts of interest.</p> <ref id="AN0179351181-29"> <title> Footnotes </title> <blist> <bibl id="bib1" idref="ref1" type="bt">1</bibl> <bibtext> Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.</bibtext> </blist> </ref> <ref id="AN0179351181-30"> <title> References </title> <blist> <bibtext> Guo X., Li S., Yu J., Zhang J., Ma J., Ma L., Liu W., Ling H. PFLD: A practical facial landmark detector. arXiv. 2019. 1902.10859</bibtext> </blist> <blist> <bibl id="bib2" idref="ref2" type="bt">2</bibl> <bibtext> Wang X., Bo L., Fuxin L. Adaptive wing loss for robust face alignment via heatmap regression. Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, Republic of Korea. 27 October–2 November 2019: 6971-6981</bibtext> </blist> <blist> <bibl id="bib3" idref="ref3" type="bt">3</bibl> <bibtext> Wang J., Sun K., Cheng T., Jiang B., Deng C., Zhao Y., Liu D., Mu Y., Tan M., Wang X. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020; 43: 3349-3364. 10.1109/TPAMI.2020.2983686. 32248092</bibtext> </blist> <blist> <bibl id="bib4" idref="ref4" type="bt">4</bibl> <bibtext> Xu Y., Yan W., Yang G., Luo J., Li T., He J. CenterFace: Joint face detection and alignment using face as point. Sci. Program. 2020; 2020: 7845384. 10.1155/2020/7845384</bibtext> </blist> <blist> <bibl id="bib5" idref="ref5" type="bt">5</bibl> <bibtext> Browatzki B., Wallraven C. 3FabRec: Fast few-shot face alignment by reconstruction. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA. 13–19 June 2020: 6110-6120</bibtext> </blist> <blist> <bibl id="bib6" idref="ref6" type="bt">6</bibl> <bibtext> Liu Z., Lin W., Li X., Rao Q., Jiang T., Han M., Fan H., Sun J., Liu S. ADNet: Attention-guided deformable convolutional network for high dynamic range imaging. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, TN, USA. 20–25 June 2021: 463-470</bibtext> </blist> <blist> <bibl id="bib7" idref="ref7" type="bt">7</bibl> <bibtext> Li Y., Yang S., Liu P., Zhang S., Wang Y., Wang Z., Yang W., Xia S.T. Simcc: A simple coordinate classification perspective for human pose estimation. Proceedings of the European Conference on Computer Vision. Tel Aviv, Israel. 23–27 October 2022: 89-106</bibtext> </blist> <blist> <bibl id="bib8" idref="ref8" type="bt">8</bibl> <bibtext> Bai Y., Wang A., Kortylewski A., Yuille A. Coke: Contrastive learning for robust keypoint detection. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, HI, USA. 2–7 January 2023: 65-74</bibtext> </blist> <blist> <bibl id="bib9" idref="ref9" type="bt">9</bibl> <bibtext> Wan J., Liu J., Zhou J., Lai Z., Shen L., Sun H., Xiong P., Min W. Precise facial landmark detection by reference heatmap transformer. IEEE Trans. Image Process. 2023; 32: 1966-1977. 10.1109/TIP.2023.3261749. 
37030695</bibtext> </blist> <blist> <bibtext> Yu Z., Huang H., Chen W., Su Y., Liu Y., Wang X. Yolo-facev2: A scale and occlusion aware face detector. Pattern Recognit. 2024; 155: 110714. 10.1016/j.patcog.2024.110714</bibtext> </blist> <blist> <bibtext> Rangayya, Virupakshappa, Patil N. Improved face recognition method using SVM-MRF with KTBD based KCM segmentation approach. Int. J. Syst. Assur. Eng. Manag. 2024; 15: 1-12. 10.1007/s13198-021-01483-3</bibtext> </blist> <blist> <bibtext> Khan S.S., Sengupta D., Ghosh A., Chaudhuri A. MTCNN++: A CNN-based face detection algorithm inspired by MTCNN. Vis. Comput. 2024; 40: 899-917. 10.1007/s00371-023-02822-0</bibtext> </blist> <blist> <bibtext> Blalock D., Gonzalez Ortiz J.J., Frankle J., Guttag J. What is the state of neural network pruning?. Proc. Mach. Learn. Syst. 2020; 2: 129-146</bibtext> </blist> <blist> <bibtext> Vadera S., Ameen S. Methods for pruning deep neural networks. IEEE Access. 2022; 10: 63280-63300. 10.1109/ACCESS.2022.3182659</bibtext> </blist> <blist> <bibtext> Fang G., Ma X., Song M., Mi M.B., Wang X. Depgraph: Towards any structural pruning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, BC, Canada. 17–24 June 2023: 16091-16101</bibtext> </blist> <blist> <bibtext> Sun M., Liu Z., Bair A., Kolter J.Z. A simple and effective pruning approach for large language models. arXiv. 2023. 2306.11695</bibtext> </blist> <blist> <bibtext> Ji M., Heo B., Park S. Show, attend and distill: Knowledge distillation via attention-based feature matching. Proceedings of the AAAI Conference on Artificial Intelligence. Online. 2–9 February 2021; Volume 35: 7945-7952</bibtext> </blist> <blist> <bibtext> Yao Y., Huang S., Wang W., Dong L., Wei F. Adapt-and-distill: Developing small, fast and effective pretrained language models for domains. arXiv. 2021. 2106.13474</bibtext> </blist> <blist> <bibtext> Beyer L., Zhai X., Royer A., Markeeva L., Anil R., Kolesnikov A. Knowledge distillation: A good teacher is patient and consistent. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, LA, USA. 18–24 June 2022: 10925-10934</bibtext> </blist> <blist> <bibtext> Park J., No A. Prune your model before distill it. Proceedings of the European Conference on Computer Vision. Glasgow, UK. 23–27 October 2022: 120-136</bibtext> </blist> <blist> <bibtext> Waheed A., Kadaoui K., Abdul-Mageed M. To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation. arXiv. 2024. 2406.04512</bibtext> </blist> <blist> <bibtext> He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 27–30 June 2016: 770-778</bibtext> </blist> <blist> <bibtext> Xie S., Girshick R., Dollár P., Tu Z., He K. Aggregated residual transformations for deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA. 21–26 July 2017: 1492-1500</bibtext> </blist> <blist> <bibtext> Howard A.G., Zhu M., Chen B., Kalenichenko D., Wang W., Weyand T., Andreetto M., Adam H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv. 2017. 1704.04861</bibtext> </blist> <blist> <bibtext> Sandler M., Howard A., Zhu M., Zhmoginov A., Chen L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA. 
18–23 June 2018: 4510-4520</bibtext> </blist> <blist> <bibtext> Howard A., Sandler M., Chu G., Chen L.C., Chen B., Tan M., Wang W., Zhu Y., Pang R., Vasudevan V. Searching for mobilenetv3. Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul, Republic of Korea. 27 October–2 November 2019: 1314-1324</bibtext> </blist> <blist> <bibtext> Deng J., Dong W., Socher R., Li L.J., Li K., Fei-Fei L. Imagenet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA. 20–25 June 2009: 248-255</bibtext> </blist> <blist> <bibtext> Sagonas C., Tzimiropoulos G., Zafeiriou S., Pantic M. 300 faces in-the-wild challenge: The first facial landmark localization challenge. Proceedings of the IEEE International Conference on Computer Vision Workshops. Sydney, NSW, Australia. 1–8 December 2013: 397-403</bibtext> </blist> <blist> <bibtext> Koestinger M., Wohlhart P., Roth P.M., Bischof H. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). Barcelona, Spain. 6–13 November 2011: 2144-2151</bibtext> </blist> <blist> <bibtext> Dong X., Yan Y., Ouyang W., Yang Y. Style aggregated network for facial landmark detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA. 18–23 June 2018: 379-388</bibtext> </blist> <blist> <bibtext> Wu W., Qian C., Yang S., Wang Q., Cai Y., Zhou Q. Look at boundary: A boundary-aware face alignment algorithm. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA. 18–23 June 2018: 2129-2138</bibtext> </blist> <blist> <bibtext> Kumar A., Chellappa R. Disentangling 3D pose in a dendritic CNN for unconstrained 2D face alignment. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA. 18–23 June 2018: 430-439</bibtext> </blist> <blist> <bibtext> Burgos-Artizzu X.P., Perona P., Dollár P. Robust face landmark estimation under occlusion. Proceedings of the IEEE International Conference on Computer Vision. Sydney, NSW, Australia. 1–8 December 2013: 1513-1520</bibtext> </blist> <blist> <bibtext> Zhang J., Shan S., Kan M., Chen X. Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment. Proceedings of the Computer Vision—ECCV 2014: 13th European Conference. Zurich, Switzerland. 6–12 September 2014: 1-16</bibtext> </blist> <blist> <bibtext> Cao X., Wei Y., Wen F., Sun J. Face alignment by explicit shape regression. Int. J. Comput. Vis. 2014; 107: 177-190. 10.1007/s11263-013-0667-3</bibtext> </blist> <blist> <bibtext> Xiong X., De la Torre F. Supervised descent method and its applications to face alignment. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA. 23–28 June 2013: 532-539</bibtext> </blist> <blist> <bibtext> Ren S., Cao X., Wei Y., Sun J. Face alignment at 3000 fps via regressing local binary features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus, Ohio, USA. 23–28 June 2014: 1685-1692</bibtext> </blist> <blist> <bibtext> Zhu S., Li C., Change Loy C., Tang X. Face alignment by coarse-to-fine shape searching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA. 
7–12 June 2015: 4998-5006</bibtext> </blist> <blist> <bibtext> Zhu X., Lei Z., Liu X., Shi H., Li S.Z. Face alignment across large poses: A 3D solution. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 26 June–1 July 2016: 146-155</bibtext> </blist> <blist> <bibtext> Zhang Z., Luo P., Loy C.C., Tang X. Facial landmark detection by deep multi-task learning. Proceedings of the Computer Vision—ECCV 2014: 13th European Conference. Zurich, Switzerland. 6–12 September 2014: 94-108</bibtext> </blist> <blist> <bibtext> Trigeorgis G., Snape P., Nicolaou M.A., Antonakos E., Zafeiriou S. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 27–30 June 2016: 4177-4187</bibtext> </blist> <blist> <bibtext> Honari S., Molchanov P., Tyree S., Vincent P., Pal C., Kautz J. Improving landmark localization with semi-supervised learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA. 18–23 June 2018: 1546-1555</bibtext> </blist> <blist> <bibtext> Xiao S., Feng J., Xing J., Lai H., Yan S., Kassim A. Robust facial landmark detection via recurrent attentive-refinement networks. Proceedings of the Computer Vision–ECCV 2016: 14th European Conference. Amsterdam, The Netherlands. 11–14 October 2016: 57-72</bibtext> </blist> <blist> <bibtext> Wu W., Yang S. Leveraging intra and inter-dataset variations for robust face alignment. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, HI, USA. 21–26 July 2017: 150-159</bibtext> </blist> <blist> <bibtext> Wei S.E., Ramakrishna V., Kanade T., Sheikh Y. Convolutional pose machines. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 27–30 June 2016: 4724-4732</bibtext> </blist> <blist> <bibtext> Valle R., Buenaposada J.M., Valdes A., Baumela L. A deeply-initialized coarse-to-fine ensemble of regression trees for face alignment. Proceedings of the European Conference on Computer Vision (ECCV). Munich, Germany. 8–14 September 2018: 585-601</bibtext> </blist> <blist> <bibtext> Lv J., Shao X., Xing J., Cheng C., Zhou X. A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA. 21–26 July 2017: 3317-3326</bibtext> </blist> <blist> <bibtext> Jourabloo A., Ye M., Liu X., Ren L. Pose-invariant face alignment with a single CNN. Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy. 22–29 October 2017: 3200-3209</bibtext> </blist> <blist> <bibtext> Xiao S., Feng J., Liu L., Nie X., Wang W., Yan S., Kassim A. Recurrent 3D-2D dual learning for large-pose facial landmark detection. Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy. 22–29 October 2017: 1633-1642</bibtext> </blist> <blist> <bibtext> Yu X., Huang J., Zhang S., Yan W., Metaxas D.N. Pose-free facial landmark fitting via optimized part mixtures and cascaded deformable shape model. Proceedings of the IEEE International Conference on Computer Vision. Sydney, NSW, Australia. 1–8 December 2013: 1944-1951</bibtext> </blist> <blist> <bibtext> Kazemi V., Sullivan J. One millisecond face alignment with an ensemble of regression trees. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA. 23–28 June 2014: 1867-1874</bibtext> </blist> <blist> <bibtext> Zhu S., Li C., Loy C.C., Tang X. Unconstrained face alignment via cascaded compositional learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 27–30 June 2016: 3409-3417</bibtext> </blist> <blist> <bibtext> Bulat A., Tzimiropoulos G. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy. 22–29 October 2017: 3706-3714</bibtext> </blist> </ref> <aug> <p>By Yonghui Huang; Yu Chen; Junhao Wang; Pengcheng Zhou; Jiaming Lai and Quanhai Wang</p> </aug>