Bibliographic Details
Title: Video Frame Interpolation Transformer
Authors: Shi, Zhihao; Xu, Xiangyu; Liu, Xiaohong; Chen, Jun; Yang, Ming-Hsuan
Publication Year: 2021
Collection: Computer Science
Subject Terms: Computer Science - Computer Vision and Pattern Recognition
More Details: Existing methods for video interpolation rely heavily on deep convolutional neural networks and thus suffer from their intrinsic limitations, such as content-agnostic kernel weights and a restricted receptive field. To address these issues, we propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and considers long-range dependencies via self-attention operations. To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation and extend it to the spatial-temporal domain. Furthermore, we propose a space-time separation strategy that saves memory and also improves performance. In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers. Extensive experiments demonstrate that the proposed model performs favorably against state-of-the-art methods both quantitatively and qualitatively on a variety of benchmark datasets.
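To make the space-time separation idea from the abstract concrete, below is a minimal PyTorch sketch of local attention factorized into a spatial pass over windows followed by a temporal pass across frames. This is an illustrative reading of the abstract only, not the authors' implementation: the module name, window size, tensor layout, and the use of nn.MultiheadAttention are all assumptions made here for clarity.

```python
# Illustrative sketch of space-time separated local attention (assumption:
# this is NOT the paper's code; layout and hyperparameters are hypothetical).
import torch
import torch.nn as nn


class SeparatedSpaceTimeAttention(nn.Module):
    """Factorizes spatio-temporal self-attention into (1) spatial attention
    within local windows of each frame and (2) temporal attention across
    frames at each pixel location, instead of one joint attention over the
    full clip, which is where the memory saving comes from."""

    def __init__(self, dim: int, heads: int = 4, window: int = 8):
        super().__init__()
        self.window = window  # assumes H and W are divisible by this
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) features for T input frames.
        b, t, h, w, c = x.shape
        ws = self.window

        # Spatial local attention: tokens attend within ws x ws windows.
        xs = x.reshape(b * t, h // ws, ws, w // ws, ws, c)
        xs = xs.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)
        xs, _ = self.spatial_attn(xs, xs, xs)
        xs = xs.reshape(b * t, h // ws, w // ws, ws, ws, c)
        xs = xs.permute(0, 1, 3, 2, 4, 5).reshape(b, t, h, w, c)

        # Temporal attention: each pixel location attends across frames.
        xt = xs.permute(0, 2, 3, 1, 4).reshape(b * h * w, t, c)
        xt, _ = self.temporal_attn(xt, xt, xt)
        return xt.reshape(b, h, w, t, c).permute(0, 3, 1, 2, 4)


if __name__ == "__main__":
    attn = SeparatedSpaceTimeAttention(dim=32, heads=4, window=8)
    frames = torch.randn(1, 4, 32, 32, 32)  # B, T, H, W, C
    print(attn(frames).shape)  # torch.Size([1, 4, 32, 32, 32])
```

Note the cost intuition: joint attention over T x H x W tokens scales with (T*H*W)^2, while the factorized form scales with T*(ws^2)^2 + H*W*T^2 per batch, which is far smaller for typical clip lengths.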
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2111.13817
Accession Number: edsarx.2111.13817
Database: arXiv