DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Bibliographic Details
Title: DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Authors: Wu, Zhiyu; Chen, Xiaokang; Pan, Zizheng; Liu, Xingchao; Liu, Wen; Dai, Damai; Gao, Huazuo; Ma, Yiyang; Wu, Chengyue; Wang, Bingxuan; Xie, Zhenda; Wu, Yu; Hu, Kai; Wang, Jiawei; Sun, Yaofeng; Li, Yukun; Piao, Yishi; Guan, Kang; Liu, Aixin; Xie, Xin; You, Yuxiang; Dong, Kai; Yu, Xingkai; Zhang, Haowei; Zhao, Liang; Wang, Yisong; Ruan, Chong
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence; Computer Science - Computation and Language
More Details: We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2412.10302
Accession Number: edsarx.2412.10302
Database: arXiv
Header DbId: edsarx
DbLabel: arXiv
An: edsarx.2412.10302
PubType: Report
PubTypeId: report
Items – Name: Title
  Label: Title
  Group: Ti
  Data: DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
– Name: Author
  Label: Authors
  Group: Au
  Data: Wu, Zhiyu; Chen, Xiaokang; Pan, Zizheng; Liu, Xingchao; Liu, Wen; Dai, Damai; Gao, Huazuo; Ma, Yiyang; Wu, Chengyue; Wang, Bingxuan; Xie, Zhenda; Wu, Yu; Hu, Kai; Wang, Jiawei; Sun, Yaofeng; Li, Yukun; Piao, Yishi; Guan, Kang; Liu, Aixin; Xie, Xin; You, Yuxiang; Dong, Kai; Yu, Xingkai; Zhang, Haowei; Zhao, Liang; Wang, Yisong; Ruan, Chong
– Name: DatePubCY
  Label: Publication Year
  Group: Date
  Data: 2024
– Name: Subset
  Label: Collection
  Group: HoldingsInfo
  Data: Computer Science
– Name: Subject
  Label: Subject Terms
  Group: Su
  Data: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence; Computer Science - Computation and Language
– Name: Abstract
  Label: Description
  Group: Ab
  Data: We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
– Name: TypeDocument
  Label: Document Type
  Group: TypDoc
  Data: Working Paper
– Name: URL
  Label: Access URL
  Group: URL
  Data: http://arxiv.org/abs/2412.10302
– Name: AN
  Label: Accession Number
  Group: ID
  Data: edsarx.2412.10302
RecordInfo BibRecord:
  BibEntity:
    Subjects:
      – SubjectFull: Computer Science - Computer Vision and Pattern Recognition
        Type: general
      – SubjectFull: Computer Science - Artificial Intelligence
        Type: general
      – SubjectFull: Computer Science - Computation and Language
        Type: general
    Titles:
      – TitleFull: DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Wu, Zhiyu
      – PersonEntity:
          Name:
            NameFull: Chen, Xiaokang
      – PersonEntity:
          Name:
            NameFull: Pan, Zizheng
      – PersonEntity:
          Name:
            NameFull: Liu, Xingchao
      – PersonEntity:
          Name:
            NameFull: Liu, Wen
      – PersonEntity:
          Name:
            NameFull: Dai, Damai
      – PersonEntity:
          Name:
            NameFull: Gao, Huazuo
      – PersonEntity:
          Name:
            NameFull: Ma, Yiyang
      – PersonEntity:
          Name:
            NameFull: Wu, Chengyue
      – PersonEntity:
          Name:
            NameFull: Wang, Bingxuan
      – PersonEntity:
          Name:
            NameFull: Xie, Zhenda
      – PersonEntity:
          Name:
            NameFull: Wu, Yu
      – PersonEntity:
          Name:
            NameFull: Hu, Kai
      – PersonEntity:
          Name:
            NameFull: Wang, Jiawei
      – PersonEntity:
          Name:
            NameFull: Sun, Yaofeng
      – PersonEntity:
          Name:
            NameFull: Li, Yukun
      – PersonEntity:
          Name:
            NameFull: Piao, Yishi
      – PersonEntity:
          Name:
            NameFull: Guan, Kang
      – PersonEntity:
          Name:
            NameFull: Liu, Aixin
      – PersonEntity:
          Name:
            NameFull: Xie, Xin
      – PersonEntity:
          Name:
            NameFull: You, Yuxiang
      – PersonEntity:
          Name:
            NameFull: Dong, Kai
      – PersonEntity:
          Name:
            NameFull: Yu, Xingkai
      – PersonEntity:
          Name:
            NameFull: Zhang, Haowei
      – PersonEntity:
          Name:
            NameFull: Zhao, Liang
      – PersonEntity:
          Name:
            NameFull: Wang, Yisong
      – PersonEntity:
          Name:
            NameFull: Ruan, Chong
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 13
              M: 12
              Type: published
              Y: 2024