Bibliographic Details
Title: |
Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting |
Authors: |
Zhong, Siru; Ruan, Weilin; Jin, Ming; Li, Huan; Wen, Qingsong; Liang, Yuxuan |
Publication Year: |
2025 |
Collection: |
Computer Science |
Subject Terms: |
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning |
More Details: |
Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments across diverse datasets demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting. |
Comment: |
19 pages |
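The abstract describes a late-fusion architecture: three augmented learners produce temporal, visual, and textual embeddings (the latter two via frozen pre-trained VLM encoders), which are concatenated with temporal features for the final forecast. The sketch below is only a rough illustration of that fusion pattern, not the authors' implementation; the use of PyTorch, the module names (`TimeVLMSketch`, `retrieval_attn`, `fusion_head`), the dimensions, and the linear stand-ins for the frozen VLM image/text towers are all assumptions made here for brevity.

```python
# Illustrative sketch of the fusion pattern described in the abstract:
# retrieval-augmented temporal features + frozen vision/text embeddings,
# concatenated and mapped to the forecast horizon.
# All names and dimensions are hypothetical, not from the Time-VLM paper.
import torch
import torch.nn as nn


class TimeVLMSketch(nn.Module):
    def __init__(self, seq_len=96, pred_len=24, d_model=128, vlm_dim=512):
        super().__init__()
        # Retrieval-Augmented Learner (hypothetical): a linear encoder plus
        # a learnable memory bank queried via attention.
        self.temporal_encoder = nn.Linear(seq_len, d_model)
        self.memory_bank = nn.Parameter(torch.randn(64, d_model))
        self.retrieval_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Stand-ins for the frozen VLM image/text encoders; a real system
        # would plug in pre-trained VLM towers with gradients disabled.
        self.frozen_vision_encoder = nn.Linear(seq_len, vlm_dim)
        self.frozen_text_encoder = nn.Linear(seq_len, vlm_dim)
        for enc in (self.frozen_vision_encoder, self.frozen_text_encoder):
            enc.requires_grad_(False)
        # Fuse temporal + visual + textual embeddings and predict the horizon.
        self.fusion_head = nn.Linear(d_model + 2 * vlm_dim, pred_len)

    def forward(self, x):  # x: (batch, seq_len) univariate series
        h = self.temporal_encoder(x).unsqueeze(1)                 # (B, 1, d_model)
        mem = self.memory_bank.unsqueeze(0).expand(x.size(0), -1, -1)
        h, _ = self.retrieval_attn(h, mem, mem)                   # retrieval-augmented features
        v = self.frozen_vision_encoder(x)                         # proxy for image-branch embedding
        t = self.frozen_text_encoder(x)                           # proxy for text-branch embedding
        fused = torch.cat([h.squeeze(1), v, t], dim=-1)           # late fusion by concatenation
        return self.fusion_head(fused)                            # (B, pred_len)


if __name__ == "__main__":
    model = TimeVLMSketch()
    forecast = model(torch.randn(8, 96))
    print(forecast.shape)  # torch.Size([8, 24])
```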
Document Type: |
Working Paper |
Access URL: |
http://arxiv.org/abs/2502.04395 |
Accession Number: |
edsarx.2502.04395 |
Database: |
arXiv |