DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Bibliographic Details
Title:	DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Authors:	Wu, Zhiyu; Chen, Xiaokang; Pan, Zizheng; Liu, Xingchao; Liu, Wen; Dai, Damai; Gao, Huazuo; Ma, Yiyang; Wu, Chengyue; Wang, Bingxuan; Xie, Zhenda; Wu, Yu; Hu, Kai; Wang, Jiawei; Sun, Yaofeng; Li, Yukun; Piao, Yishi; Guan, Kang; Liu, Aixin; Xie, Xin; You, Yuxiang; Dong, Kai; Yu, Xingkai; Zhang, Haowei; Zhao, Liang; Wang, Yisong; Ruan, Chong
Publication Year:	2024
Collection:	Computer Science
Subject Terms:	Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
More Details:	We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses the Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Code and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
Document Type:	Working Paper
Access URL:	http://arxiv.org/abs/2412.10302
Accession Number:	edsarx.2412.10302
Database:	arXiv

Publication Date:	13 December 2024
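
The abstract describes a dynamic tiling strategy for high-resolution images with different aspect ratios but does not spell out the selection rule. The following is a minimal sketch of one plausible way to pick a tile grid, not the authors' implementation. The function name `choose_tile_grid`, the 384-pixel tile size, the 9-tile budget, and the selection criterion (smallest fraction of padded canvas, ties broken by fewest tiles) are all assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def choose_tile_grid(width, height, tile=384, max_tiles=9):
    """Return (cols, rows) of the tile grid whose canvas wastes the
    smallest fraction of area on padding for a width x height image.

    Hypothetical helper: tile size, tile budget, and the criterion are
    illustrative assumptions, not values taken from the record above.
    """
    best_key, best_grid = None, None
    for cols, rows in product(range(1, max_tiles + 1), repeat=2):
        if cols * rows > max_tiles:
            continue  # grid exceeds the tile budget
        canvas_w, canvas_h = cols * tile, rows * tile
        # Aspect-preserving resize so the image fits inside the canvas.
        # Fractions keep the comparison exact (no floating-point ties).
        scale = min(Fraction(canvas_w, width), Fraction(canvas_h, height))
        used_area = width * height * scale * scale
        padding_fraction = 1 - used_area / (canvas_w * canvas_h)
        # Prefer less padding; break ties with fewer tiles.
        key = (padding_fraction, cols * rows)
        if best_key is None or key < best_key:
            best_key, best_grid = key, (cols, rows)
    return best_grid


if __name__ == "__main__":
    for size in [(1000, 1000), (1920, 640), (640, 1920), (800, 600)]:
        print(size, "->", choose_tile_grid(*size))
```

Under these assumptions a 3:1 panorama maps to a 3x1 grid and a 4:3 photo maps to a 3x2 grid, so the tile layout follows the image's aspect ratio instead of forcing every input into a square.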