Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

A controlled study of Transformer, state space, and hybrid vision backbones for frozen vision-language models.

  • Controlled backbone swaps under a fixed LLaVA-style training recipe.
  • Matched evaluation across VQA, grounding, localization, and efficiency.
  • Released checkpoints for each backbone family; training and evaluation code to be released soon.

Abstract

Why this paper exists

Large vision-language models usually freeze a vision backbone and map its image features into an LLM through a lightweight connector. Most systems still rely on transformer-based vision encoders. This work asks whether state space model backbones can be a strong alternative in the same modular VLM setting.
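The frozen-backbone setup described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the dimensions, the two-layer MLP connector, and all function names are assumptions chosen to mirror a common LLaVA-style design, with numpy standing in for a real deep-learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: patch-token features from the vision backbone
# projected into the LLM's embedding space.
NUM_TOKENS, VIS_DIM, LLM_DIM = 196, 1024, 4096

def frozen_encoder(image_tokens):
    # Stand-in for a frozen vision backbone (ViT, VMamba, or hybrid);
    # in practice this is a pretrained model whose weights are never updated.
    return image_tokens  # shape: (NUM_TOKENS, VIS_DIM)

# Lightweight two-layer MLP connector -- the only vision-side component
# that is trained in this modular setting.
W1 = rng.standard_normal((VIS_DIM, LLM_DIM)) * 0.02
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02

def connector(features):
    h = np.maximum(features @ W1, 0.0)  # ReLU here for brevity; GELU is common
    return h @ W2                       # (NUM_TOKENS, LLM_DIM) visual tokens

# Visual tokens ready to be prepended to the LLM's text-token sequence.
visual_tokens = connector(frozen_encoder(rng.standard_normal((NUM_TOKENS, VIS_DIM))))
```

Because only the connector carries trainable vision-side parameters, swapping the encoder leaves the rest of the pipeline untouched, which is what makes the controlled comparison possible.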

Under matched ImageNet-1K initialization, the SSM backbone delivers the strongest overall balance across VQA and grounding/localization. After dense-task adaptation, SSM backbones remain competitive while operating at a substantially smaller scale. The study also shows that higher ImageNet accuracy and larger backbones do not reliably predict better downstream VLM behavior, and that simple stabilization steps can repair localization failures.

Scope

Controlled, frozen-backbone evaluation

The vision encoder is swapped while the multimodal interface and training recipe are held fixed, making the comparison about the backbone itself rather than about joint finetuning dynamics.

Findings

What the study shows

01

SSM backbones are strong VLM encoders

Under matched settings, VMamba improves localization while staying competitive on open-ended VQA, making SSMs a practical alternative to ViTs.

02

Dense-task adaptation helps across families

Detection or segmentation pretraining generally improves VQA and localization, with the largest gains appearing in backbones that need more spatial inductive bias.

03

ImageNet accuracy is not enough

Better classification scores and naive scaling do not consistently predict stronger downstream VLM behavior, especially for grounding-sensitive tasks.

04

Localization collapse can be stabilized

Some dense-objective checkpoints fail sharply on localization, but simple interface and connector adjustments restore much more robust behavior.

Selected Figures

Visual summary of the paper

Fig. 1

Overview of the controlled backbone study

Overview of the frozen VLM pipeline and the compared vision backbone families.
The paper evaluates Transformer, SSM, and hybrid vision encoders inside the same frozen-backbone VLM pipeline, then compares checkpoints from different pretraining objectives and model scales under a shared recipe.

Fig. 2

Grounding quality and token-region alignment

Qualitative grounding examples and token-region similarity maps comparing VMamba and ViT.
Qualitative examples show VMamba producing boxes closer to ground truth and sharper token-region similarity maps, indicating better preservation of spatial information than the matched ViT baseline.

Fig. 4

Inference cost as resolution scales

Latency and memory scaling plots for several vision backbones as resolution increases.
Figure 4 profiles batch-size-1 inference on a single NVIDIA H200 GPU for representative VMamba, ViT, and ViTDet VLMs. Host-side latency measures wall-clock time seen from the CPU, including preparation and launch overhead; GPU latency isolates the time spent computing on the device; and end-to-end latency captures the full multimodal forward pass. These three backbones expose the practical tradeoffs: ViT is a scale-matched baseline with a model size similar to VMamba but lower downstream VLM performance, whereas ViTDet is a competitive but substantially larger alternative whose memory cost grows much faster with resolution.
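The host-side latency notion above can be illustrated with a small wall-clock profiler. This is a generic sketch, not the paper's benchmarking code: the function name and iteration counts are assumptions, and GPU-side latency would additionally require device-event timing (e.g. CUDA events), which is omitted here.

```python
import time

def profile_host_latency(fn, warmup=3, iters=10):
    # Warm up first so one-time setup costs do not skew the measurement.
    for _ in range(warmup):
        fn()
    # Host-side latency: wall-clock time seen from the CPU, which folds in
    # preparation and launch overhead on top of the compute itself.
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)  # mean seconds per call
```

Isolating GPU latency would wrap the same call with device events and synchronize before reading them, which is why the host-side and GPU-side numbers in a profile like Figure 4 can diverge.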

Paper

Paper PDF

Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Full paper PDF.

Download PDF