Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

A controlled study of Transformer, state space, and hybrid vision backbones for frozen vision-language models.

  • Controlled backbone swaps under a fixed LLaVA-style training recipe.
  • Matched evaluation across VQA, grounding, localization, and efficiency.
  • Released checkpoints for each backbone family; training and evaluation code to be released soon.

Abstract

Why this paper exists

Large vision-language models usually freeze a vision backbone and map its image features into an LLM through a lightweight connector. Most systems still rely on transformer-based vision encoders. This work asks whether state space model backbones can be a strong alternative in the same modular VLM setting.
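The frozen-backbone setup described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the dimensions, the two-layer MLP connector, and all function names are assumptions chosen to mirror a common LLaVA-style design, with numpy standing in for a real deep-learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: patch-token features from the vision backbone
# projected into the LLM's embedding space.
NUM_TOKENS, VIS_DIM, LLM_DIM = 196, 1024, 4096

def frozen_encoder(image_tokens):
    # Stand-in for a frozen vision backbone (ViT, VMamba, or hybrid);
    # in practice this is a pretrained model whose weights are never updated.
    return image_tokens  # shape: (NUM_TOKENS, VIS_DIM)

# Lightweight two-layer MLP connector -- the only vision-side component
# that is trained in this modular setting.
W1 = rng.standard_normal((VIS_DIM, LLM_DIM)) * 0.02
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02

def connector(features):
    h = np.maximum(features @ W1, 0.0)  # ReLU here for brevity; GELU is common
    return h @ W2                       # (NUM_TOKENS, LLM_DIM) visual tokens

# Visual tokens ready to be prepended to the LLM's text-token sequence.
visual_tokens = connector(frozen_encoder(rng.standard_normal((NUM_TOKENS, VIS_DIM))))
```

Because only the connector carries trainable vision-side parameters, swapping the encoder leaves the rest of the pipeline untouched, which is what makes the controlled comparison possible.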

Under matched ImageNet-1K initialization, the SSM backbone delivers the strongest overall balance across VQA and grounding/localization. After dense-task adaptation, SSM backbones remain competitive while operating at a substantially smaller scale. The study also shows that higher ImageNet accuracy and larger backbones do not reliably predict better downstream VLM behavior, and that simple stabilization steps can repair localization failures.

Scope

Controlled, frozen-backbone evaluation

The vision encoder is swapped while the multimodal interface and training recipe are held fixed, making the comparison about the backbone itself rather than about joint finetuning dynamics.

Findings

What the study shows

01

SSM backbones are strong VLM encoders

Under matched settings, VMamba improves localization while staying competitive on open-ended VQA, making SSMs a practical alternative to ViTs.

02

Dense-task adaptation helps across families

Detection or segmentation pretraining generally improves VQA and localization, with the largest gains appearing in backbones that need more spatial inductive bias.

03

ImageNet accuracy is not enough

Better classification scores and naive scaling do not consistently predict stronger downstream VLM behavior, especially for grounding-sensitive tasks.

04

Localization collapse can be stabilized

Some dense-objective checkpoints fail sharply on localization, but simple interface and connector adjustments restore much more robust behavior.

Selected Figures

Visual summary of the paper

Fig. 1

Overview of the controlled backbone study

Overview of the frozen VLM pipeline and the compared vision backbone families.
The paper evaluates Transformer, SSM, and hybrid vision encoders inside the same frozen-backbone VLM pipeline, then compares checkpoints from different pretraining objectives and model scales under a shared recipe.

Fig. 2

Grounding quality and token-region alignment

Qualitative grounding examples and token-region similarity maps comparing VMamba and ViT.
Qualitative examples show VMamba producing boxes closer to ground truth and sharper token-region similarity maps, indicating better preservation of spatial information than the matched ViT baseline.

Fig. 4

Inference cost as resolution scales

Latency and memory scaling plots for several vision backbones as resolution increases.
Figure 4 profiles batch-size-1 inference on a single NVIDIA H200 GPU for representative VMamba, ViT, and ViTDet VLMs. Host-side latency measures wall-clock time seen from the CPU, including preparation and launch overhead; GPU latency isolates the time spent computing on the device; and end-to-end latency captures the full multimodal forward pass. These three backbones expose the practical tradeoffs: ViT is a scale-matched baseline with a model size similar to VMamba but lower downstream VLM performance, whereas ViTDet is a competitive but substantially larger alternative whose memory cost grows much faster with resolution.
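The host-side latency notion above can be illustrated with a small wall-clock profiler. This is a generic sketch, not the paper's benchmarking code: the function name and iteration counts are assumptions, and GPU-side latency would additionally require device-event timing (e.g. CUDA events), which is omitted here.

```python
import time

def profile_host_latency(fn, warmup=3, iters=10):
    # Warm up first so one-time setup costs do not skew the measurement.
    for _ in range(warmup):
        fn()
    # Host-side latency: wall-clock time seen from the CPU, which folds in
    # preparation and launch overhead on top of the compute itself.
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)  # mean seconds per call
```

Isolating GPU latency would wrap the same call with device events and synchronize before reading them, which is why the host-side and GPU-side numbers in a profile like Figure 4 can diverge.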

Paper

Paper PDF

Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Full paper PDF.

Download PDF