Academic Journal
YOLOv3-Based Matching Approach for Roof Region Detection from Drone Images
Title: | YOLOv3-Based Matching Approach for Roof Region Detection from Drone Images |
---|---|
Authors: | Chia-Cheng Yeh, Yang-Lang Chang, Mohammad Alkhaleefah, Pai-Hui Hsu, Weiyong Eng, Voon-Chet Koo, Bormin Huang, Lena Chang |
Source: | Remote Sensing, Vol 13, Iss 1, p 127 (2021) |
Publisher Information: | MDPI AG, 2021. |
Publication Year: | 2021 |
Collection: | LCC:Science |
Subject Terms: | image matching, deep learning, YOLOv3, roof region detection, drone images, high-performance computing, Science |
More Details: | Due to the large data volume, UAV image stitching and matching suffer from a high computational cost. Traditional feature extraction algorithms, such as Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), and Oriented FAST and Rotated BRIEF (ORB), require heavy computation to extract and describe features in high-resolution UAV images. To overcome this issue, You Only Look Once version 3 (YOLOv3) is combined with traditional feature point matching algorithms to extract descriptive features from a drone dataset of residential areas for roof detection. Unlike the traditional feature extraction algorithms, YOLOv3 performs feature extraction only on the proposed candidate regions instead of the entire image, so the complexity of the image matching is reduced significantly. All the extracted features are then fed into the Structural Similarity Index Measure (SSIM) to identify the corresponding roof region pair between consecutive image sequences. In addition, the candidate corresponding roof pair produced by our architecture serves as a coarse matching region pair and limits the search range of feature matching to the detected roof region only. This further improves feature matching consistency and reduces the chance of wrong feature matching. Analytical results show that the proposed method is 13× faster than traditional image matching methods, with comparable performance. |
Document Type: | article |
File Description: | electronic resource |
Language: | English |
ISSN: | 2072-4292 |
Relation: | https://www.mdpi.com/2072-4292/13/1/127; https://doaj.org/toc/2072-4292 |
DOI: | 10.3390/rs13010127 |
Access URL: | https://doaj.org/article/6477b4f6972b477fa8b3e05c4e7d67a2 |
Accession Number: | edsdoj.6477b4f6972b477fa8b3e05c4e7d67a2 |
Database: | Directory of Open Access Journals |
Full Text:

YOLOv3-Based Matching Approach for Roof Region Detection from Drone Images

Due to the large data volume, UAV image stitching and matching suffer from a high computational cost. Traditional feature extraction algorithms, such as Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), and Oriented FAST and Rotated BRIEF (ORB), require heavy computation to extract and describe features in high-resolution UAV images. To overcome this issue, You Only Look Once version 3 (YOLOv3) is combined with traditional feature point matching algorithms to extract descriptive features from a drone dataset of residential areas for roof detection. Unlike the traditional feature extraction algorithms, YOLOv3 performs feature extraction only on the proposed candidate regions instead of the entire image, so the complexity of the image matching is reduced significantly. All the extracted features are then fed into the Structural Similarity Index Measure (SSIM) to identify the corresponding roof region pair between consecutive image sequences. In addition, the candidate corresponding roof pair produced by our architecture serves as a coarse matching region pair and limits the search range of feature matching to the detected roof region only. This further improves feature matching consistency and reduces the chance of wrong feature matching. Analytical results show that the proposed method is 13× faster than traditional image matching methods, with comparable performance.

Keywords: image matching; deep learning; YOLOv3; roof region detection; drone images; high-performance computing

1. Introduction

Image registration is a traditional computer vision problem with applications in domains ranging from military, medical, surveillance, and robotics to remote sensing [1]. With advances in robotics, cameras can be effortlessly mounted on a UAV to capture ground images from a top view. A UAV is often operated in a lawn-mower scanning pattern to cover a region of interest (ROI). The captured ROI images are then stitched together to provide an overview of the entire region. Drones are relatively low-cost and can be operated in remote areas.

The process of image stitching is useful in a number of tasks, such as disaster prevention, environmental change detection, road surveillance, land monitoring, and land measurement. The task of image matching can be divided into two sub-tasks: feature detection and feature description. Researchers have extensively used advanced handcrafted feature descriptor algorithms such as SIFT [2], SURF [4], and ORB [6].
In the feature detection task, distinctive and repetitive features are first detected and fed into a non-ambiguous matching algorithm [7]. These features are then summarized by region descriptor algorithms such as SIFT, SURF, or ORB, which work by summarizing the histogram of gradients in the region surrounding the feature. SIFT pioneered descriptor handcrafting that is robust to scale and orientation changes, and SURF and ORB are approximate, faster versions of SIFT. Features are then matched using measures such as brute-force matching or FLANN-based matching, which rely on the nearest descriptor distance and keep only the matches that satisfy the ratio test suggested by Lowe [2]. As the raw matches based on these measures often contain outliers, Random Sample Consensus (RANSAC) [9] is often adopted to perform a match consistency check that filters the outliers. Drone image motion is generally caused by the movement of the camera, so the camera motion can be modeled as a global motion in which every pixel in the image shares a single motion. The global motion is generally modeled as a transformation matrix, which can be estimated from as few as four matching pairs.

Recent advances in deep learning and convolutional neural networks have been applied in various fields such as natural language processing and, subsequently, computer vision, especially in the tasks of object detection and object classification [10]. The concept of the convolutional neural network was first introduced in LeNet [12], and AlexNet [13] made it well known after winning the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [14]. Various studies have shown that training a deep network on a large dataset can lead to better testing accuracy [13,15]. Advances in hardware, such as the graphics processing unit (GPU), have made it possible to process more data in a shorter time. Recent deep learning methods, specifically YOLOv3 [17], have shown consistently good results for object detection and classification.

The most straightforward way to reduce the computational time of drone image registration is to use a high-performance computing (HPC) approach. This study introduces a novel method that integrates a GPU-based deep learning algorithm into traditional image matching methods. The use of a GPU is a significant recent advance that makes the training stage of deep network methods more practical [18,20]. The proposed method generates robust candidate regions by adopting YOLOv3 [17] and performs traditional image matching only on those candidate regions. Similar to Fast R-CNN [22], candidate regions are used, but here for the image matching task instead of image classification.
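For reference, the traditional pipeline summarized above (feature detection, description, ratio-test matching, and a RANSAC consistency check) can be sketched with OpenCV as follows. This is a minimal illustration, not the authors' code: the choice of SIFT, the 0.75 ratio, and the 5-pixel RANSAC reprojection error are common defaults assumed here for completeness.

```python
import cv2
import numpy as np

def match_pair(img1_gray, img2_gray, ratio=0.75):
    """Detect keypoints, match descriptors with Lowe's ratio test, and
    estimate a global transformation (homography) with RANSAC."""
    sift = cv2.SIFT_create()                        # detector + descriptor
    kp1, des1 = sift.detectAndCompute(img1_gray, None)
    kp2, des2 = sift.detectAndCompute(img2_gray, None)

    # Brute-force matching followed by the ratio test suggested by Lowe
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < ratio * n.distance]

    # RANSAC consistency check; at least four matching pairs are required
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H, good, inlier_mask
```

The returned H plays the role of the transformation matrix that models the global motion between the two frames.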
Structural similarity (SSIM) is then adopted to determine the similarity of the candidate regions. Mismatched regions are filtered out, and the overlapping regions are matched to confirm the correspondence between the overlapping regions of two adjacent images. The traditional feature extraction algorithm is then run to extract and match features only within the matched regions. The search region is thus limited to a very small area of the image, which reduces matching errors. In urban areas, the roof is an important element of the information infrastructure [20]. The proposed approach therefore leads to a significant reduction in computational requirements, as image matching is performed only on the candidate roof regions, which makes it well suited for real-time image registration applications. In this paper, it is shown that the proposed method is 13× faster than the traditional methods based on SIFT, SURF, and ORB.
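As an illustration of the SSIM-based comparison of candidate regions described above, the sketch below scores every candidate roof crop from one frame against every crop from the next frame and keeps the best-scoring pair. The use of scikit-image's structural_similarity and the 128 × 128 comparison size are assumptions made for illustration; the paper does not state which SSIM implementation or crop size it uses.

```python
import cv2
from skimage.metrics import structural_similarity as ssim

def best_roof_pair(crops_a, crops_b, size=(128, 128)):
    """Compare candidate roof crops (grayscale numpy arrays) from two
    consecutive frames with SSIM and return the best-scoring pair."""
    best = (None, None, -1.0)
    for i, a in enumerate(crops_a):
        a_r = cv2.resize(a, size)                   # SSIM needs equal shapes
        for j, b in enumerate(crops_b):
            score = ssim(a_r, cv2.resize(b, size))
            if score > best[2]:
                best = (i, j, score)
    return best                                     # (index_a, index_b, ssim score)
```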
2. Traditional Image Stitching Methods and Deep Learning

Image stitching has long been studied in the fields of computer vision and remote sensing. Traditional image matching methods involve handcrafting descriptors that are robust to photometric and geometric variations at distinctive, repetitive feature locations. The computational cost of the image stitching process rises linearly with the image size, as more features are detected and matched. Recent advances in convolutional neural networks and deep learning have shown remarkable results in language processing and image processing. Deep learning has revolutionized high-level computer vision tasks such as object detection and classification; however, further research is needed on adapting deep learning methods to low-level computer vision tasks such as image matching.

2.1. Traditional Image Matching

Traditional image matching methods can be classified as feature-based or pixel-based. For drone image registration, the motion is caused only by the movement of the drone, so it can be approximated by a single global motion shared by all the pixels in the image. Hence, feature-based matching is popular in drone image registration. Moreover, feature-based matching is robust to photometric and geometric variations: only a few distinctive, repetitive feature points are detected, and their descriptors are matched. Well-known feature detection methods include the Harris corner detector [7], the Hessian affine region detector [24], and the Shi-Tomasi feature detector [8]. Feature descriptors such as SIFT [2], SURF [4], and ORB [6] are handcrafted and based on the histogram of gradients (HOG) of a local region surrounding a keypoint location, together with the pixel gradients. SIFT [2] is a pioneering feature descriptor and is the basis for the faster, approximate variants SURF [4] and ORB [6].

2.1.1. Scale-Invariant Feature Transform (SIFT)

David Lowe presented the Scale-Invariant Feature Transform (SIFT) algorithm in 1999 [2]. SIFT is perhaps one of the earliest works to provide a comprehensive keypoint detection and feature descriptor extraction technique. The SIFT algorithm has four basic steps.

First, a multi-resolution pyramid is built over the input image, and the difference of Gaussians (DoG) is applied, as shown in Equation (1):

(1) $D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma)$

In Equation (1), $k$ denotes a constant multiplicative factor, $k = \sqrt{2}$.

Second, keypoint localization is performed: the keypoint candidates are localized and refined by eliminating low-contrast points.

Third, to characterize the image at each keypoint, the Gaussian-smoothed image L at the pyramid level with the closest scale is used, so that all computations are performed in a scale-invariant manner. At each pixel L(x, y), the gradient magnitude $m(x, y)$ and orientation $\theta(x, y)$ of the feature points can be calculated as shown in Equations (2) and (3):

(2) $m(x, y) = \sqrt{\left(\dfrac{\partial L}{\partial x}\right)^2 + \left(\dfrac{\partial L}{\partial y}\right)^2}$

(3) $\theta(x, y) = \tan^{-1}\left(\dfrac{\partial L}{\partial y} \Big/ \dfrac{\partial L}{\partial x}\right)$

The final step of the SIFT algorithm builds the local image descriptors, for which the location, scale, and orientation are determined for each keypoint.
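The following NumPy/SciPy sketch illustrates Equations (1)-(3): one difference-of-Gaussians level and the per-pixel gradient magnitude and orientation of the smoothed image. k = √2 follows the text; the base σ = 1.6 and the use of scipy.ndimage are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_and_gradients(image, sigma=1.6, k=np.sqrt(2)):
    """Equation (1): difference of two Gaussian-smoothed copies of the image.
    Equations (2)-(3): gradient magnitude and orientation of L(x, y, sigma)."""
    img = image.astype(float)
    L_sigma = gaussian_filter(img, sigma)
    L_ksigma = gaussian_filter(img, k * sigma)
    D = L_ksigma - L_sigma                       # D(x, y, sigma)

    dLdy, dLdx = np.gradient(L_sigma)            # partial derivatives of L
    m = np.sqrt(dLdx ** 2 + dLdy ** 2)           # Equation (2)
    theta = np.arctan2(dLdy, dLdx)               # Equation (3), in radians
    return D, m, theta
```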
2.1.2. Speeded Up Robust Features (SURF)

Herbert Bay et al. presented a novel image feature detection and extraction algorithm called Speeded Up Robust Features (SURF) [4]. SURF is based on the Hessian matrix, which is used to find feature points [4]. The Hessian matrix is a square matrix of second-order partial derivatives of a scalar-valued function; it describes the local curvature of a function of many variables.
The Hessian matrix measures the local change around each point, and points are chosen where its determinant is maximal. Given a point $X = (x, y)$ in image I, the Hessian matrix $H(X, \sigma)$ at point $X$ and scale $\sigma$ is defined as

(4) $H(X, \sigma) = \begin{bmatrix} L_{xx}(X, \sigma) & L_{xy}(X, \sigma) \\ L_{xy}(X, \sigma) & L_{yy}(X, \sigma) \end{bmatrix}$

where $L_{xx}(X, \sigma)$ denotes the convolution of the Gaussian second-order derivative $\partial^2 g(\sigma)/\partial x^2$ with the image I at point $X$, and $L_{xy}(X, \sigma)$ and $L_{yy}(X, \sigma)$ are defined similarly. For orientation assignment, SURF uses wavelet responses in both the horizontal and vertical directions, weighted with an adequate Gaussian. The feature descriptor also uses wavelet responses: a neighborhood around the keypoint is selected and divided into subregions, and the wavelet responses of each subregion are summarized to form the SURF feature descriptor. The sign of the Laplacian, which is already computed during detection, is used to characterize the underlying interest points; it distinguishes bright blobs on dark backgrounds from the reverse case. During matching, features are compared only if they have the same type of contrast (based on the sign), which allows faster matching [5].
2.1.3. Oriented FAST and Rotated BRIEF (ORB)

Oriented FAST and Rotated BRIEF (ORB) is a feature extractor and descriptor algorithm presented by Ethan Rublee et al. [6]. ORB is a fusion of the FAST keypoint detector and the BRIEF descriptor with some modifications. Initially, to determine the keypoints, it uses FAST, as shown in Equation (5):

(5) $CRF = \begin{cases} 1, & \text{if } I_P - I_k > t \\ 0, & \text{else} \end{cases}$

The FAST corner detector uses a circle of 16 pixels to classify whether a candidate point p is actually a corner. Each pixel on the circle is labeled clockwise from 1 to 16. $I_P$ is the intensity of the candidate pixel p, and $I_k$ is the intensity of circle pixel k (k = 1, ..., 16). The Corner Response Function (CRF) gives a numerical value for the corner strength at a pixel location based on the image intensity in the local neighborhood, and $t$ is an intensity threshold. A Harris corner measure is then applied to select the top N points. FAST does not compute an orientation and is therefore rotation variant, so ORB computes the intensity-weighted centroid of the patch with the located corner at its center; the direction of the vector from the corner point to the centroid gives the orientation, and moments are computed to improve rotation invariance. The BRIEF descriptor performs poorly under in-plane rotation, so in ORB a rotation matrix is computed from the patch orientation and the BRIEF descriptors are steered according to that orientation [6].
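A compact sketch of the FAST test behind Equation (5) for a single candidate pixel: the 16 pixels on the surrounding radius-3 circle are compared against the centre intensity with threshold t. The contiguity requirement of the full FAST detector is simplified to a plain count here, and t = 20 and n = 12 are illustrative values; in practice ORB keypoints come from an optimized implementation such as OpenCV's cv2.ORB_create().

```python
import numpy as np

# Offsets (dx, dy) of the 16-pixel Bresenham circle of radius 3 used by FAST
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
          (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

def fast_crf(image, y, x, t=20, n=12):
    """Equation (5), simplified: the candidate pixel p at (y, x) is flagged as a
    corner (1) if at least n of the 16 circle pixels differ from I_P by more than t."""
    Ip = int(image[y, x])
    diffs = [int(image[y + dy, x + dx]) - Ip for dx, dy in CIRCLE]
    brighter = sum(d > t for d in diffs)
    darker = sum(d < -t for d in diffs)
    return 1 if max(brighter, darker) >= n else 0
```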
2.1.4. RANdom SAmple Consensus (RANSAC)

The descriptors of an image pair are matched against each other to identify the best match with the minimum distance using a brute-force method. As the matches often contain outliers, a consistency check such as RANSAC [9] is typically used to remove inconsistent matches. Figure 1 shows the match points of an input image pair after applying the RANSAC algorithm [9]. The consistent matches are then used to estimate a transformation matrix that models a global motion for every pixel.

2.2. Deep Learning Algorithms

The development of neural network-based systems has increased drastically, and these systems have demonstrated extraordinary performance [22]. Neural network-based methods have recently emerged as potential alternatives to traditional methods [24]. The recent success of deep learning in computer vision has led to the adoption of the convolutional neural network (CNN) in low-level computer vision tasks such as image matching. Hardware advances such as the GPU enable the training of very deep CNNs that incorporate hundreds of layers [11].

Object Detection Network

Most current object detection frameworks are either one-stage or two-stage. Regions with convolutional neural network features (R-CNN) [26], Fast R-CNN [22], and Faster R-CNN [27] are two-stage object detection frameworks; they often achieve high detection accuracy at a high computational cost. One-stage object detectors, including the Single Shot MultiBox Detector (SSD) [29] and YOLOv3 [17], formulate object detection as a regression problem that outputs class probabilities as well as bounding box coordinates. One-stage detectors have recently gained popularity, as they achieve comparable detection accuracy at better speed than two-stage detectors. Specifically, YOLOv3 [17] has been reported to achieve consistently high accuracy in object detection: on a Pascal Titan X, YOLOv3 [17] runs in real time at 30 FPS and reaches an mAP-50 of 57.9% on COCO test-dev.

In this paper, we construct a YOLOv3-based [17] end-to-end trainable convolutional neural network to detect the class "roof". YOLOv3 [17] uses a single neural network to directly predict the bounding boxes and class probabilities. Detailed information about YOLOv3 object detection is given in the next section.

3. Proposed Method

This study presents a novel method that uses YOLOv3 [17] object detection on an NVIDIA TITAN Xp to generate a few plausible candidate regions for two subsequent drone images.
The proposed method performs traditional image matching procedures, such as feature extraction and description, only in the candidate roof regions, thus significantly reducing complexity compared with conventional methods such as SIFT [2], SURF [4], and ORB [6]. Figure 2 shows the complete flow chart of the algorithm.

All the default YOLOv3 [17] parameter settings were applied, except that the network was trained for only a single class, "roof". The image is divided into S × S grid cells of 13 × 13, 26 × 26, and 52 × 52 for detection at the corresponding scales. Each grid cell is responsible for outputting three bounding boxes (B = 3). Each bounding box outputs five parameters, x, y, w, h, and confidence (see Equation (6)), which define the bounding box location as well as a confidence score indicating the likelihood that the bounding box contains an object.

(6) $\text{Box Confidence} = P_r(\text{Object}) * IOU_{\text{predict}}^{\text{truth}}$

$P_r(\text{Object})$ denotes the probability that the box contains an object. If a cell contains no object, the confidence score should be 0; otherwise, the confidence score should equal the intersection over union (IOU) between the predicted box and the ground truth. The IOU is the ratio between the intersection and the union of the predicted box and the ground truth box; when the IOU exceeds the threshold, the bounding box is considered correct, as shown in Equation (7).
This measure quantifies the correlation between the ground truth, $box_{\text{truth}}$, and the prediction, $box_{\text{predict}}$; a higher value represents a higher correlation.

(7) $IOU_{\text{predict}}^{\text{truth}} = \dfrac{box_{\text{predict}} \cap box_{\text{truth}}}{box_{\text{predict}} \cup box_{\text{truth}}}$

IOU is frequently adopted as an evaluation metric to measure the accuracy of an object detector. Its importance is not limited to assigning anchor boxes during preparation of the training dataset; it is also very useful in the non-maximum suppression algorithm for cleaning up cases where multiple boxes are predicted for the same object. The IOU threshold is set to 0.5 (the usual default), which means that at least half of the ground truth and the predicted box must cover the same region. When the IOU is greater than the 50% threshold, the test case is predicted as containing an object.
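Equation (7) translates directly into code. The sketch below assumes axis-aligned boxes in (x1, y1, x2, y2) corner format; YOLOv3 itself predicts centre coordinates and width/height, which would have to be converted to corners first.

```python
def iou(box_a, box_b):
    """Equation (7): intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box is accepted when iou(box_predict, box_truth) exceeds the 0.5 threshold.
```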
Each grid cell is assigned one conditional class probability, $P_r(\text{class} \mid \text{object})$, which is the probability that the object belongs to the class "roof" given that an object is present. The class confidence score for each prediction box is then calculated as in Equation (8), which captures both the classification confidence and the localization confidence.

(8) $\text{Class Confidence} = \text{Box Confidence} \times P_r(\text{class} \mid \text{object})$

The detection output tensor is of size S × S × B × (5 + C), where the value 5 accounts for the four bounding box attributes and one confidence score. Figure 3 shows the detection process using YOLOv3 [17], and Figure 4 shows the backbone network adopted in YOLOv3 [17] for multiscale object detection. This study adopts the network model for a single object class, "roof", whose detections become our candidate regions.
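To make the output layout concrete: with a single class (C = 1) and B = 3 boxes per cell, the 13 × 13 scale yields a 13 × 13 × 3 × 6 tensor, and Equation (8) multiplies the box confidence by the conditional class probability. The decoding below is only an illustrative sketch of that layout, not the authors' post-processing code.

```python
import numpy as np

S, B, C = 13, 3, 1                        # grid size, boxes per cell, classes ("roof")
prediction = np.zeros((S, S, B, 5 + C))   # x, y, w, h, box confidence, class prob.

def class_confidence(box_pred):
    """Equation (8) for one bounding box of one grid cell."""
    box_conf = box_pred[4]                # Pr(Object) * IOU
    class_prob = box_pred[5:]             # Pr(class | object), one value per class
    return box_conf * class_prob

roof_score = class_confidence(prediction[0, 0, 0])   # score for the "roof" class
```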
3.1. Dataset and Training Process

3.1.1. Experiment Environment

The experiment environment includes an Intel(R) Core(TM) i7-8770 CPU @ 3.2 GHz with 24 GB of memory and an NVIDIA GeForce TITAN Xp GPU with 24 GB of memory, using CUDA 9.0. Table 1 shows the hardware and software configurations for the training process.

3.1.2. The Datasets

To evaluate the effectiveness of the proposed method, we used a set of real images acquired by a UAV equipped with imaging sensors spanning the visible range. The camera is a SONY a7R, characterized by an Exmor R full-frame CMOS sensor with 36.4 megapixels. All images were acquired from the National Science and Technology Center for Disaster Reduction, New Taipei, on 13 October 2016, at 10:00 a.m. The images have three channels (RGB) with 8 bits of radiometric resolution and a spatial resolution of 25 cm ground sample distance (GSD). Table 2 shows the UAV platform and sensor characteristics.

In this study, the dataset comprises 99 drone images of 6000 × 4000 pixels, captured in the Xizhi District, New Taipei City, Taiwan. As YOLOv3 [17] is designed to train and test on images of 416 × 416 pixels, the original images were cropped into 1000 × 1000 pixel patches with a 70% overlap between subsequent crops. The cropped images were then randomly split into training and testing data at a ratio of 9:1.

In order to train the network to output object locations, all the ground truth objects in the images must first be labeled. We used LabelImg [29], an open-source annotation tool hosted on GitHub (tzutalin) and currently one of the most widely used labeling tools, to create the ground truth bounding boxes for the object detection task. Figure 5 shows a screenshot of the process of creating ground truth bounding boxes with LabelImg. As the drone images mostly cover residential areas, only a single object class, "roof", was labeled. The annotations of the training images, in XML format, were used directly in the YOLOv3 end-to-end training network.
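The cropping step described above can be sketched as follows: a 6000 × 4000 image is tiled into 1000 × 1000 patches with 70% overlap between subsequent crops, which corresponds to a stride of 300 pixels. Border handling is omitted for brevity.

```python
def tile_image(image, tile=1000, overlap=0.7):
    """Split a large drone image (H x W [x 3] numpy array) into overlapping tiles."""
    stride = int(tile * (1 - overlap))           # 300 px for a 70% overlap
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            tiles.append(((y, x), image[y:y + tile, x:x + tile]))
    return tiles                                 # [((row, col), 1000 x 1000 crop), ...]
```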
3.2. Evaluation Methods

Precision is the ratio of true positives (correct predictions) to the total number of predicted positives:

(9) $\text{Precision} = \dfrac{TP}{TP + FP}$

TP denotes the number of true positives, FP the number of false positives, and FN the number of false negatives. Recall is the ratio of true positives to the total number of ground truth positives:

(10) $\text{Recall} = \dfrac{TP}{TP + FN}$

The average precision (AP) is the area under the precision-recall curve, where $p(k)$ denotes the precision value at recall $= k$:

(11) $\text{AvgPrecision} = \sum_{k=1}^{N} p(k)\,\Delta k$
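A small sketch of Equations (9)-(11): precision and recall from detection counts, and the average precision as the summation over the sampled precision-recall curve. Padding the recall axis with 0 before taking the increments is an implementation assumption.

```python
import numpy as np

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0          # Equation (9)

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0          # Equation (10)

def average_precision(precisions, recalls):
    """Equation (11): sum of p(k) * (recall increment) over the sampled PR curve.
    `precisions` and `recalls` are sequences sampled at increasing recall."""
    p = np.asarray(precisions, dtype=float)
    r = np.concatenate(([0.0], np.asarray(recalls, dtype=float)))
    return float(np.sum(p * np.diff(r)))                 # area under the PR curve
```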
A loss function maps an event or the values of one or more variables onto a real number that intuitively represents some 'cost' associated with the event; the performance of the training model can therefore be measured by evaluating the loss function.

YOLOv3 uses multiple independent logistic classifiers instead of Softmax to classify each box, since Softmax is not suitable for multi-label classification, and using independent logistic classifiers does not decrease the classification accuracy. Therefore, the optimization loss function can be expressed as shown in Equation (12):

(12) $\mathrm{loss}_{object} = \lambda_{coord} \sum_{i=0}^{k \times k} \sum_{j=0}^{M} I_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{coord} \sum_{i=0}^{k \times k} \sum_{j=0}^{M} I_{ij}^{obj} \, (2 - w_i \times h_i) \left[ (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right] - \sum_{i=0}^{k \times k} \sum_{j=0}^{M} I_{ij}^{obj} \left[ \hat{C}_i \log C_i + (1 - \hat{C}_i) \log (1 - C_i) \right] - \lambda_{noobj} \sum_{i=0}^{k \times k} \sum_{j=0}^{M} I_{ij}^{noobj} \left[ \hat{C}_i \log C_i + (1 - \hat{C}_i) \log (1 - C_i) \right] - \sum_{i=0}^{k \times k} I_{ij}^{obj} \sum_{c \in classes} \left[ \hat{p}_i(c) \log (p_i(c)) + (1 - \hat{p}_i(c)) \log (1 - p_i(c)) \right]$

In Equation (12), term1 ($\lambda_{coord} \sum_{i=0}^{k \times k} \sum_{j=0}^{M} I_{ij}^{obj} [(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2]$) calculates the loss related to the predicted bounding box position (x, y). term2 ($\lambda_{coord} \sum_{i=0}^{k \times k} \sum_{j=0}^{M} I_{ij}^{obj} (2 - w_i \times h_i) [(w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2]$) calculates the loss related to the predicted box width and height (w, h). Terms term3 ($\sum_{i=0}^{k \times k} \sum_{j=0}^{M} I_{ij}^{obj} [\hat{C}_i \log C_i + (1 - \hat{C}_i) \log (1 - C_i)]$) and term4 ($\lambda_{noobj} \sum_{i=0}^{k \times k} \sum_{j=0}^{M} I_{ij}^{noobj} [\hat{C}_i \log C_i + (1 - \hat{C}_i) \log (1 - C_i)]$) calculate the confidence loss for grid cells that contain an object and for grid cells that contain no object, respectively.
xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi mathvariant="sans-serif"&gt;&amp;#955;&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;b&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;munderover&gt;&lt;mstyle mathsize="80%" displaystyle="true"&gt;&lt;mo&gt;&amp;#8721;&lt;/mo&gt;&lt;/mstyle&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;0&lt;/mn&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;mo&gt;&amp;#215;&lt;/mo&gt;&lt;mi&gt;k&lt;/mi&gt;&lt;/mrow&gt;&lt;/munderover&gt;&lt;munderover&gt;&lt;mstyle mathsize="80%" displaystyle="true"&gt;&lt;mo&gt;&amp;#8721;&lt;/mo&gt;&lt;/mstyle&gt;&lt;mrow&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;0&lt;/mn&gt;&lt;/mrow&gt;&lt;mi&gt;M&lt;/mi&gt;&lt;/munderover&gt;&lt;msubsup&gt;&lt;mi&gt;I&lt;/mi&gt;&lt;mrow&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;b&lt;/mi&gt;&lt;mi&gt;j&lt;/mi&gt;&lt;/mrow&gt;&lt;/msubsup&gt;&lt;/mrow&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml><ephtml> &lt;math display="block" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;mrow&gt;&lt;mrow&gt;&lt;mover accent="true"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;mo stretchy="true"&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;l&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;g&lt;/mi&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;mover accent="true"&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;mo stretchy="true"&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;/mrow&gt;&lt;mi&gt;l&lt;/mi&gt;&lt;mi&gt;o&lt;/mi&gt;&lt;mi&gt;g&lt;/mi&gt;&lt;mrow&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;mo&gt;&amp;#8722;&lt;/mo&gt;&lt;msub&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/mrow&gt;&lt;/mrow&gt;&lt;/mrow&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> ) compute each bounding box predictor and the loss associated with the confidence score. <ephtml> &lt;math display="block" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> is the confidence score, and <ephtml> &lt;math display="block" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> is the intersection over union of the predicted box with the ground true. <ephtml> &lt;math display="block" xmlns="http://www.w3.org/1998/Math/MathML"&gt;&lt;semantics&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mover accent="true"&gt;&lt;mi&gt;C&lt;/mi&gt;&lt;mo&gt;^&lt;/mo&gt;&lt;/mover&gt;&lt;mi&gt;i&lt;/mi&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/semantics&gt;&lt;/math&gt; </ephtml> is expressed in Equation (<reflink idref="bib13" id="ref80">13</reflink>). 
The final term ($\sum_{i=0}^{k\times k} I_{ij}^{obj}\sum_{c\in classes}\big[\hat{p}_i(c)\log p_i(c)+(1-\hat{p}_i(c))\log(1-p_i(c))\big]$) is the classification loss.

(13)
$$
\hat{C}_i=\Pr(Object)\times IOU_{pred}^{truth}
$$
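To make Equation (13) concrete, the following minimal Python sketch computes the IoU between a predicted box and a ground-truth box, both given as (x_min, y_min, x_max, y_max) corners. It is an illustration written for this text (the example coordinates are hypothetical), not code from the authors' implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted roof box versus its ground-truth annotation (hypothetical values)
print(iou((10, 10, 110, 60), (20, 15, 120, 70)))  # ~0.63
```

In Equation (13), this IoU is multiplied by Pr(Object) to obtain the confidence target $\hat{C}_i$.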
In this study, the dataset comprises 99 drone images. We used 50 of these images, containing a total of 2200 house roofs, divided into 2000 training samples and 200 testing samples. The training time was 4 h; the training time of the deep learning model is excluded from the execution time (T) computation. Figure 6 shows the precision-recall curve generated by the model trained with our dataset (training samples = 2000, testing samples = 200). The average precision obtained is AP = 80.91%.

We trained the YOLOv3 [17] roof detection model on the datasets. Figure 7 depicts the roof detection results on the dataset.

3.3. Evaluation and Testing Process

Structural similarity (SSIM) [30] has been used to find the corresponding candidate regions between images in each image pair. Three traditional feature extraction and matching algorithms, SIFT [2], SURF [4], and ORB [6], were then run for image matching within the corresponding candidate regions. The quality of the candidate region pair was evaluated by four evaluation measures, namely, execution time (T) [31], match rate (MR) [31,33], match performance (MP) [34], and root mean squared error (RMSE) [35,37]. The execution time (T) measures the algorithm's efficiency.

(14)
$$
MR = 2\times Matches\,/\,(Keypoint_1+Keypoint_2)
$$

(15)
$$
MP = MR\,/\,T
$$

(16)
$$
RMSE=\sqrt{\frac{1}{k}\sum_{p,q=1}^{p,q=k}\Big[(x_p-x_q)^2+(y_p-y_q)^2\Big]}
$$

The match rate (MR) is the ratio between the number of correctly matched feature points and the total number of feature points detected by the algorithm. In Equation (14), $Keypoint_1$ and $Keypoint_2$ refer to the numbers of keypoints detected in the first and second images, respectively, and $Matches$ is the number of matches between these two sets of interest points. Equation (15) defines the match performance (MP), which describes the matching status per unit time. In Equation (16), k is the number of filtered match pairs, where $p\in[1,k]$, $q\in[1,k]$, and $(x_p, y_p)$ and $(x_q, y_q)$ are the spatial coordinates of the corresponding matching points on the registration image and the reference image, respectively. A smaller RMSE means a higher registration accuracy, and RMSE < 1 means that the registration accuracy reaches the sub-pixel level.
4. Experimental Results

4.1. Xizhi District, New Taipei City CASE 1

After the training of YOLOv3 is completed, the weights generated by the training can be used to detect the candidate overlapping areas of other UAV images. Figure 8a shows the first image (taken at time t), which is the reference image for the proposed YOLOv3-based roof region detection. In Figure 8b, three roof regions were detected and highlighted by the YOLOv3-based roof region detection. Figure 8c-e shows the candidate regions in the reference image detected by the YOLOv3 object detector.

Figure 9a shows the second image (taken at time t + interval shooting time), which is the image to be registered by the proposed YOLOv3-based roof region detection. In Figure 9b, three roof regions were detected and highlighted by the YOLOv3-based roof region detection. Figure 9c-e shows the candidate regions in the registered image detected by the YOLOv3 object detector.

The candidate roof regions were matched to find the corresponding region pair using SSIM [30]. Table 3 shows the SSIM measures between candidate regions and their execution times, respectively.
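This pairing step can be sketched with scikit-image's structural_similarity, scoring every reference/registered region combination as in Table 3. This is a minimal sketch under the assumption that the detector returns grayscale crops; the crop sizes and the helper name best_ssim_pair are hypothetical, and the sketch is illustrative rather than the authors' implementation.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def best_ssim_pair(ref_regions, reg_regions, size=(128, 128)):
    """Return (i, j, score): the reference/registered region pair with the highest SSIM.

    ref_regions, reg_regions: lists of grayscale crops produced by the roof detector.
    Crops are resized to a common size because SSIM compares images of equal shape.
    """
    best = (-1, -1, -1.0)
    for i, r in enumerate(ref_regions):
        r_small = cv2.resize(r, size)
        for j, s in enumerate(reg_regions):
            s_small = cv2.resize(s, size)
            score = structural_similarity(r_small, s_small)
            if score > best[2]:
                best = (i, j, score)
    return best

# Hypothetical crops standing in for Figure 8c-e and Figure 9c-e
ref_regions = [np.random.randint(0, 255, (200, 220), np.uint8) for _ in range(3)]
reg_regions = [np.random.randint(0, 255, (190, 210), np.uint8) for _ in range(3)]
print(best_ssim_pair(ref_regions, reg_regions))
```

Here the function returns only the single highest-scoring pair; pairing each reference region with its best counterpart follows the same pattern.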
After obtaining the corresponding region pairs, the traditional feature matching algorithms, SIFT [2], SURF [4], and ORB [6], were performed as shown in Figure 10, Figure 11 and Figure 12 (the right image is the registered image, and the left image is the reference image).

To compare the proposed method with the traditional image matching algorithms, SIFT [2], SURF [4], and ORB [6], feature extraction and matching were performed using these algorithms on the original image pairs, as shown in Figure 13, Figure 14 and Figure 15, respectively. We recorded the numbers of keypoints, the execution time, and the match point coordinates to compute the match rate (MR), match performance (MP), and root mean squared error (RMSE).

In this paper, we used the ENVI (Environment for Visualizing Images) software to compute the root-mean-squared error (RMSE). The manually selected ground control points (GCPs) were combined with the match point coordinates in the RMSE calculation. As shown in Figure 16, the 20 pairs of red markers denote the manually selected GCPs.

Table 4 and Figure 17 summarize the comparison of the traditional image matching algorithms, SIFT [2], SURF [4], and ORB [6], with the YOLOv3-based candidate region matching algorithms YOLOv3+SIFT, YOLOv3+SURF, and YOLOv3+ORB. As shown in Table 4, the proposed method was more than 13× faster than the traditional image matching algorithms.

Figure 18 shows the registration result obtained by using the proposed YOLOv3-based matching method.

Our proposed method is compared with the traditional image matching algorithms SIFT [2], SURF [4], and ORB [6] using the quantitative evaluation indexes execution time (T), match rate (MR), match performance (MP), and root mean squared error (RMSE). Among the traditional image matching algorithms, SIFT [2] produced the largest number of matches but also had the longest execution time and a low match rate. The SURF [4] algorithm achieved the best match rate (MR) and root mean squared error (RMSE) among the traditional algorithms, and ORB [6] had the best execution time (T). As shown in Table 4, the experimental results show that the proposed method performed better than the traditional image matching algorithms. The proposed method can be rapidly implemented and has high accuracy and strong robustness.
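The region-constrained matching used in this case can be illustrated with the following OpenCV sketch, which detects ORB features only inside the corresponding roof crops, filters the matches with a ratio test, and estimates a homography with RANSAC. The file names and bounding boxes are hypothetical, and the sketch is an illustration rather than the authors' code (the paper uses SIFT and SURF in the same role as well).

```python
import cv2
import numpy as np

# Hypothetical inputs: the full reference/registered images and the bounding
# boxes (x, y, w, h) of the roof region pair selected by the SSIM step.
ref_img = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)
reg_img = cv2.imread("registered.jpg", cv2.IMREAD_GRAYSCALE)
ref_box, reg_box = (850, 400, 640, 480), (300, 520, 640, 480)

def crop(img, box):
    x, y, w, h = box
    return img[y:y + h, x:x + w]

ref_roi, reg_roi = crop(ref_img, ref_box), crop(reg_img, reg_box)

# Feature extraction is restricted to the candidate roof regions,
# which is what reduces the matching cost compared with full-image matching.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(ref_roi, None)
kp2, des2 = orb.detectAndCompute(reg_roi, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]

# Map matched keypoints back to full-image coordinates before estimating
# the transform used for registration (cf. Figure 18).
src = np.float32([[kp1[m.queryIdx].pt[0] + ref_box[0], kp1[m.queryIdx].pt[1] + ref_box[1]] for m in good])
dst = np.float32([[kp2[m.trainIdx].pt[0] + reg_box[0], kp2[m.trainIdx].pt[1] + reg_box[1]] for m in good])
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print(len(good), "good matches,", int(inliers.sum()), "RANSAC inliers")
```

Because feature extraction and matching are restricted to the detected roof regions, far fewer keypoints are processed than in full-image matching, which is where the reported speed-up comes from; the estimated homography can then be used to warp the image to be registered onto the reference image, as in the registration results of Figure 18.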
4.2. Xizhi District, New Taipei City CASE 2

In this paper, we have also evaluated the performance of the YOLOv3-based roof region detection on other cases. Figure 19 shows the reference image and the candidate regions in the reference image detected by the YOLOv3 object detector.

Figure 20 shows the registered image and the candidate regions in the registered image detected by the YOLOv3 object detector.

Table 5 shows the SSIM measures between candidate regions and their execution times.

The traditional feature matching algorithms, SIFT [2], SURF [4], and ORB [6], were run on the corresponding region pairs as shown in Figure 21 and Figure 22. The right image is the registered image, and the left image is the reference image.

To compare the proposed method with the traditional image matching algorithms, SIFT [2], SURF [4], and ORB [6], feature extraction and matching were performed on the original image pairs as shown in Figure 23, Figure 24 and Figure 25.

The manually selected ground control points (GCPs) were combined with the match point coordinates to compute the root mean square error (RMSE), as shown in Figure 26.

Table 6 and Figure 27 summarize the comparison between the traditional image matching algorithms, SIFT [2], SURF [4], and ORB [6], and the YOLOv3-based candidate region matching algorithms YOLOv3+SIFT, YOLOv3+SURF, and YOLOv3+ORB. As shown in Table 6, the proposed method was more than 15× faster than the traditional image matching algorithms.

Figure 28 shows the registration result from the proposed YOLOv3-based matching method. As shown in Table 6, the results show that the proposed method performed better than the traditional image matching algorithms, especially in the execution time (T), where it is 15× faster than the traditional methods. The proposed method can be rapidly implemented and has high accuracy and strong robustness.

5. Conclusions

Traditional feature-based image matching algorithms have dominated image matching for decades. A fast image matching algorithm is desired as image resolution and size grow significantly. With the advances of GPUs, deep learning algorithms have been adopted in various computer vision and language processing fields. In this paper, we proposed a YOLOv3-based image matching approach for fast roof region detection from drone images. As the feature-based matching is performed only on the corresponding region pair instead of the original image pair, the computation complexity is reduced significantly. The proposed approach showed comparable results and performed 13× faster than the traditional methods. In future work, our model will be trained using overlapping regions with different object conditions, and the proposed approach will be applied to other UAV images.

Figures and Tables

Figure 1. The match points of an input image pair before and after adopting the RANSAC algorithm [9]. (a) Initial match points, (b) filtered match points.

Figure 2. Overall flow chart of the proposed algorithm.

Figure 3. YOLOv3 [17] based model for candidate region detection.
It formulates the candidate region (roof) detection as a regression problem. For illustration purposes, this example uses a grid cell size of 7 × 7. During the detection process, the image is first split into S × S grid cells, and three bounding boxes are estimated for each grid cell. Each bounding box outputs four box attributes indicating its size and location. The final detection is based on the box confidence and class probability.

Figure 4. Backbone network used by YOLOv3 [17] for three-scale object detection.

Figure 5. LabelImg software interface used to generate the ground truth object labels.

Figure 6. Precision-recall curve. The average precision obtained is AP = 80.91%.

Figure 7. Use of two subsequent drone images to perform the image matching task by YOLOv3 [17].

Figure 8. Drone images used to perform the image matching task. (a) The reference image, (b) three roof regions detected and highlighted by the YOLOv3-based roof region detection. (c-e) The candidate regions in the reference image detected by the YOLOv3 object detector.

Figure 9. Drone images used to perform the image matching task. (a) The image to be registered, (b) three roof regions detected and highlighted by the YOLOv3-based roof region detection. (c-e) The candidate regions in the registered image detected by the YOLOv3 object detector.

Figure 10. Traditional image matching performed on the candidate roof region pair. From left to right: (a) SIFT [2], (b) SURF [4], and (c) ORB [6] feature extraction and matching.

Figure 11. From left to right: (a) SIFT [2], (b) SURF [4], and (c) ORB [6] feature extraction and matching.

Figure 12. From left to right: (a) SIFT [2], (b) SURF [4], and (c) ORB [6] feature extraction and matching.

Figure 13. Traditional image matching by SIFT [2]. The corresponding matched keypoints are linked by color lines.

Figure 14. Traditional image matching by SURF [4]. The corresponding matched keypoints are linked by color lines.

Figure 15. Traditional image matching by ORB [6]. The corresponding matched keypoints are linked by color lines.

Figure 16. The 20 pairs of red markers denote the manual selection of ground control points used in the RMSE calculation.

Figure 17. Results of the four evaluation methods comparing the traditional image matching algorithms with the YOLOv3-based image matching algorithms. (a) Match rate (MR), (b) time (ms), (c) match performance (MP), and (d) root-mean-squared error (RMSE).

Figure 18. Registration result of Figure 8a and Figure 9a using: (a) YOLOv3+SIFT, (b) YOLOv3+SURF, (c) YOLOv3+ORB.

Figure 19. Drone images used to perform the image matching task.
(a) Reference image, (b) three roof regions detected and highlighted by the YOLOv3-based roof region detection. (c,d) The candidate regions in the reference image detected by the YOLOv3 object detector.

Figure 20. Drone images used to perform the image matching task. (a) Image to be registered, (b) three roof regions detected and highlighted by the YOLOv3-based roof region detection. (c,d) The candidate regions in the registered image detected by the YOLOv3 object detector.

Figure 21. After the corresponding candidate region was identified between the image pair, traditional image matching was performed on the candidate roof region pair. From left to right: (a) SIFT [2], (b) SURF [4], and (c) ORB [6] feature extraction and matching.

Figure 22. From left to right: (a) SIFT [2], (b) SURF [4], and (c) ORB [6] feature extraction and matching algorithms.

Figure 23. Traditional image matching by SIFT [2]. The corresponding matched keypoints are linked by color lines.

Figure 24. Traditional image matching by SURF [4]. The corresponding matched keypoints are linked by color lines.

Figure 25. Traditional image matching by ORB [6]. The corresponding matched keypoints are linked by color lines.

Figure 26. The 20 pairs of red markers denote the manual selection of ground control points used in the RMSE calculation.

Figure 27. Results of the four evaluation methods on the traditional image matching algorithms and the YOLOv3-based image matching algorithms.
(a) Match rate (MR), (b) time (ms), (c) match performance (MP), and (d) root-mean-squared error (RMSE).

Figure 28. Registration result of Figure 19a and Figure 20a using: (a) YOLOv3+SIFT, (b) YOLOv3+SURF, (c) YOLOv3+ORB.

Table 1. Computing hardware and training environment for YOLOv3-based candidate regions.

| Item | Description |
|---|---|
| Operating system | Ubuntu 16.04 LTS |
| Central processing unit (CPU) | Intel i7-8700 3.2 GHz |
| Random-access memory (RAM) | DDR4 2400 24 GB |
| Graphics card | TITAN Xp (Pascal) |
| Software | Darknet, CUDA 9.0 |

Table 2. UAV platform and sensor characteristics.

| Characteristic Name | Description |
|---|---|
| Platform | ALIAS |
| Flight altitude Above Ground Level (AGL) | 200 m |
| Sensor | SONY a7R |
| Resolution | 7360 × 4912 |
| Output data format | JPEG (Exif 2.3) / RAW (Sony ARW 2.3) |
| Spatial resolution | 25 cm (GSD) |
| Weather | Overcast |

Table 3. SSIM between candidate regions to find a matched roof region.

| Figure | SSIM | SSIM Execution Time (ms) |
|---|---|---|
| Figure 8c and Figure 9c | 0.7743 | 3.27 |
| Figure 8c and Figure 9d | 0.6528 | 3.12 |
| Figure 8c and Figure 9e | 0.2644 | 2.83 |
| Figure 8d and Figure 9c | 0.6236 | 3.03 |
| Figure 8d and Figure 9d | 0.8026 | 3.18 |
| Figure 8d and Figure 9e | 0.2836 | 2.91 |
| Figure 8e and Figure 9c | 0.2731 | 3.08 |
| Figure 8e and Figure 9d | 0.2836 | 3.15 |
| Figure 8e and Figure 9e | 0.7263 | 3.22 |

Table 4. Comparison between the traditional image matching methods and the YOLOv3-based candidate region image matching method for the image pair in Figure 8.

| Method | Keypoint1 | Keypoint2 | Matches | Match Rate (%) | YOLOv3 Time (ms) | Matching Time (ms) | Match Performance (%) | RMSE |
|---|---|---|---|---|---|---|---|---|
| SIFT | 2000 | 2001 | 1126 | 56.29 | 0 | 1183.68 | 0.05 | 0.9647 |
| YOLOv3+SIFT | 490 | 490 | 477 | 97.35 | 28.98 | 29.79 | 1.66 | 0.8578 |
| SURF | 1582 | 1507 | 1024 | 66.30 | 0 | 1064.84 | 0.06 | 0.9285 |
| YOLOv3+SURF | 80 | 80 | 78 | 97.50 | 28.98 | 22.94 | 1.88 | 0.8864 |
| ORB | 1500 | 1486 | 586 | 39.25 | 0 | 506.63 | 0.08 | 0.9751 |
| YOLOv3+ORB | 619 | 619 | 603 | 97.42 | 28.98 | 10.99 | 2.43 | 0.8962 |

Table 5. SSIM between candidate regions to find a matched roof region.

| Figure | SSIM | SSIM Execution Time (ms) |
|---|---|---|
| Figure 19c and Figure 20c | 0.7143 | 3.53 |
| Figure 19c and Figure 20d | 0.3597 | 3.48 |
| Figure 19d and Figure 20c | 0.4001 | 3.46 |
| Figure 19d and Figure 20d | 0.7688 | 3.58 |

Table 6. Comparison between the traditional image matching methods and the YOLOv3-based candidate region image matching method for the image pair in Figure 19.

| Method | Keypoint1 | Keypoint2 | Matches | Match Rate (%) | YOLOv3 Time (ms) | Matching Time (ms) | Match Performance (%) | RMSE |
|---|---|---|---|---|---|---|---|---|
| SIFT | 2000 | 2000 | 614 | 30.70 | 0 | 1094.49 | 0.03 | 0.9184 |
| YOLOv3+SIFT | 130 | 186 | 124 | 78.48 | 16.72 | 25.91 | 1.84 | 0.9069 |
| SURF | 309 | 520 | 239 | 57.66 | 0 | 936.49 | 0.06 | 0.9713 |
| YOLOv3+SURF | 37 | 32 | 27 | 78.26 | 16.72 | 21.40 | 2.05 | 0.8715 |
| ORB | 1500 | 1496 | 381 | 25.43 | 0 | 295.57 | 0.06 | 0.9742 |
| YOLOv3+ORB | 129 | 105 | 95 | 81.20 | 16.72 | 9.94 | 3.05 | 0.9242 |
thin"&gt;95&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;81.20&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;16.72&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;9.94&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;3.05&lt;/td&gt;&lt;td align="center" valign="middle" style="border-bottom:solid thin"&gt;0.9242&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; </ephtml> </p> <hd id="AN0147991657-22">Author Contributions</hd> <p>Conceptualization, C.-C.Y. and P.-H.H.; Data curation, C.-C.Y.; Methodology, C.-C.Y., Y.-L.C., P.-H.H., V.-C.K., B.H., and W.E.; Supervision, Y.-L.C., M.A., P.-H.H., B.H., L.C., and V.-C.K.; Validation, C.-C.Y., V.-C.K., and W.E.; Writing—original draft, W.E., M.A., and C.-C.Y.; Writing—review and editing, W.E., C.-C.Y., M.A., V.-C.K., B.H., and Y.-L.C.; C.-C.Y. and Y.-L.C. have the same contributions. All authors have read and agreed to the published version of the manuscript.</p> <hd id="AN0147991657-23">Funding</hd> <p>This research received no external funding.</p> <hd id="AN0147991657-24">Institutional Review Board Statement</hd> <p>Not applicable.</p> <hd id="AN0147991657-25">Informed Consent Statement</hd> <p>Not applicable.</p> <hd id="AN0147991657-26">Data Availability Statement</hd> <p>The data presented in this study are available on request from the corresponding author.</p> <hd id="AN0147991657-27">Conflicts of Interest</hd> <p>The authors declare no conflict of interest.</p> <hd id="AN0147991657-28">Acknowledgments</hd> <p>This work was sponsored by the Ministry of Science and Technology, Taiwan, (grant nos. MOST 108A27A, 108-2116-M-027-003, and 107-2116-M-027-003); National Space Organization, Taiwan, (grant no. NSPO-S-108216); Sinotech Engineering Consultants Inc., (grant no. A-RD-I7001-002), and National Taipei University of Technology, (grant nos. USTP-NTUT-NTOU-107-02, and NTUT-USTB-108-02).</p> <ref id="AN0147991657-29"> <title> Footnotes </title> <blist> <bibl id="bib1" idref="ref1" type="bt">1</bibl> <bibtext> Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.</bibtext> </blist> </ref> <ref id="AN0147991657-30"> <title> References </title> <blist> <bibtext> Brown L.G. A Survey of Image Registration Techniques. ACM. 1992; 24: 326-376. 10.1145/146370.146374</bibtext> </blist> <blist> <bibl id="bib2" idref="ref2" type="bt">2</bibl> <bibtext> Lowe D.G. Object Recognition from Local Scale-Invariant Features. ICCV. 1999; 99: 1150-1157</bibtext> </blist> <blist> <bibl id="bib3" idref="ref33" type="bt">3</bibl> <bibtext> Lowe D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004; 60: 91-110. 10.1023/B:VISI.0000029664.99615.94</bibtext> </blist> <blist> <bibl id="bib4" idref="ref3" type="bt">4</bibl> <bibtext> Bay H., Tuytelaars T., Van Gool L. Surf: Speeded up robust features. European Conference on Computer Vision. Graz, Austria. 7–13 May 2006; Springer: New York, NY, USA. 2006: 404-417. 10.1016/j.cviu.2007.09.014</bibtext> </blist> <blist> <bibl id="bib5" idref="ref39" type="bt">5</bibl> <bibtext> Bay H., Ess A., Tuytelaars T., Van Gool L. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008; 110: 346-359. 10.1016/j.cviu.2007.09.014</bibtext> </blist> <blist> <bibl id="bib6" idref="ref4" type="bt">6</bibl> <bibtext> Rublee E., Rabaud V., Konolige K., Bradski G.R. 
7. Harris, C.G.; Stephens, M.J. A Combined Corner and Edge Detector. In Proceedings of the Fourth Alvey Vision Conference, Manchester, UK, 31 August-2 September 1988; pp. 147-152.
8. Shi, J.; Tomasi, C. Good Features to Track. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, DC, USA, 21-23 June 1994; pp. 593-600.
9. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM 1981; 24: 381-395. doi:10.1145/358669.358692
10. Kamilaris, A.; Prenafeta-Boldú, F.X. Deep Learning in Agriculture: A Survey. Comput. Electron. Agric. 2018; 147: 70-90. doi:10.1016/j.compag.2018.02.016
11. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep Learning for Generic Object Detection: A Survey. Int. J. Comput. Vis. 2020; 128: 261-318. doi:10.1007/s11263-019-01247-4
12. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998; 86: 2278-2324. doi:10.1109/5.726791
13. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th Conference on Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 3-6 December 2012; pp. 1097-1105.
14. Serre, T. Deep Learning: The Good, the Bad, and the Ugly. Annu. Rev. Vis. Sci. 2019; 5: 399-426. doi:10.1146/annurev-vision-091718-014951
15. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27-30 June 2016; pp. 770-778.
17. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
18. Raina, R.; Madhavan, A.; Ng, A.Y. Large-Scale Deep Unsupervised Learning Using Graphics Processors. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14-18 June 2009; ACM: Montreal, QC, Canada, 2009; pp. 873-880.
19. Cireşan, D.C.; Meier, U.; Gambardella, L.M.; Schmidhuber, J. Deep, Big, Simple Neural Nets for Handwritten Digit Recognition. Neural Comput. 2010; 22: 3207-3220. doi:10.1162/NECO_a_00052
20. Sugihara, K.; Hayashi, Y. Automatic Generation of 3D Building Models with Multiple Roofs. Tsinghua Sci. Technol. 2008; 13: 368-374. doi:10.1016/S1007-0214(08)70176-7
21. Dahl, G.E.; Yu, D.; Deng, L.; Acero, A. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Trans. Audio Speech Lang. Process. 2011; 20: 30-42. doi:10.1109/TASL.2011.2134090
22. Lee D., Lee S.-J., Seo Y.-J. Application of Recent Developments in Deep Learning to ANN-Based Automatic Berthing Systems. Int. J. Eng. Technol. Innov. 2020; 10: 75-90. 10.46604/ijeti.2020.4354
23. Girshick R. Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision (ICCV). Santiago, Chile. 13–16 December 2015: 1440-1448
24. Mikolajczyk K., Schmid C. An affine invariant interest point detector. Proceedings of the European Conference on Computer Vision (ECCV). Copenhagen, Denmark. 28–31 May 2002: 128-142
25. Zhang Z., Geiger J., Pohjalainen J., Mousa A.E., Jin W., Schuller B. Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments. ACM Trans. Intell. Syst. Technol. 2018; 9: 49:1-49:28. 10.1145/3178115
26. Girshick R., Donahue J., Darrell T., Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Columbus, OH, USA. 23–28 June 2014: 580-587
27. Ren S., He K., Girshick R., Sun J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016; 39: 1137-1149. 10.1109/TPAMI.2016.2577031
28. Liu W., Anguelov D., Erhan D., Szegedy C., Reed S., Fu C.Y., Berg A.C. SSD: Single Shot MultiBox Detector. Proceedings of the European Conference on Computer Vision (ECCV). Amsterdam, The Netherlands. 8–16 October 2016: 21-37
29. Tzutalin. LabelImg. Available online: https://github.com/tzutalin/labelImg (accessed on 30 May 2019)
30. Wang Z., Bovik A.C., Sheikh H.R., Simoncelli E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004; 13: 600-612. 10.1109/TIP.2003.819861
31. Alhwarin F. Fast and Robust Image Feature Matching Methods for Computer Vision Applications; Shaker Verlag: Aachen, Germany. 2011
32. Karami E., Prasad S., Shehata M. Image Matching Using SIFT, SURF, BRIEF and ORB: Performance Comparison for Distorted Images. arXiv. 2017. arXiv:1710.02726
33. Preeti M., Bharat P. An Advanced Technique of Image Matching Using SIFT and SURF. Int. J. Adv. Res. Comput. Commun. Eng. 2016; 5: 462-466
34. He M.M., Guo Q., Li A., Chen J., Chen B., Feng X.X. Automatic Fast Feature-Level Image Registration for High-Resolution Remote Sensing Images. J. Remote Sens. 2018; 2: 277-292
35. Agüera-Vega F., Carvajal-Ramírez F., Martínez-Carricondo P. Accuracy of Digital Surface Models and Orthophotos Derived from Unmanned Aerial Vehicle Photogrammetry. J. Surv. Eng. 2016; 143: 4016025. 10.1061/(ASCE)SU.1943-5428.0000206
36. Manfreda S., Dvorak P., Mullerova J., Herban S., Vuono P., Arranz Justel J., Perks M. Assessing the Accuracy of Digital Surface Models Derived from Optical Imagery Acquired with Unmanned Aerial Systems. Drones. 2019; 3: 15. 10.3390/drones3010015
37. Gross J.W., Heumann B.W. A Statistical Examination of Image Stitching Software Packages for Use with Unmanned Aerial Systems. Photogramm. Eng. Remote Sens. 2016; 82: 419-425. 10.14358/PERS.82.6.419
38. Oniga V.-E., Breaban A.-I., Statescu F. Determining the Optimum Number of Ground Control Points for Obtaining High Precision Results Based on UAS Images. Proceedings. 2018; 2: 352. 10.3390/ecrs-2-05165