Robust Promptable Video Object Segmentation

1POSTECH, 2Google, 3ETH Zurich
CVPR 2026
Teaser figure showing the goal and benchmark composition of RobustPVOS.

Abstract

The performance of promptable video object segmentation (PVOS) models degrades substantially under input corruptions, which hinders their deployment in safety-critical domains. This paper presents the first comprehensive study of robust PVOS (RobustPVOS). We first construct a comprehensive benchmark comprising two real-world evaluation datasets with 351 video clips and more than 2,500 object masks captured under adverse conditions. In addition, we generate synthetic training data by applying diverse, temporally varying corruptions to existing VOS datasets. We further present a new RobustPVOS method, dubbed Memory-object-conditioned Gated-rank Adaptation (MoGA). Successful RobustPVOS hinges on two factors: effectively handling object-specific degradation and ensuring temporal consistency in predictions. MoGA leverages object-specific representations maintained in memory across frames to condition the robustification process, allowing the model to handle each tracked object differently and in a temporally consistent way. Extensive experiments on our benchmark validate MoGA's efficacy: it yields consistent and significant improvements across diverse corruption types on both synthetic and real-world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark will be made publicly available.

Method: MoGA

We propose Memory-object-conditioned Gated-rank Adaptation (MoGA) for RobustPVOS. MoGA decomposes the weight matrix of a low-rank adapter into rank-1 components and selectively activates them using object-specific representations from the memory bank. This design enables the adapter to handle distinct objects differently while ensuring temporal consistency.
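The gating idea above can be sketched as follows: the adapter weight is a sum of rank-1 components, each scaled by a gate computed from the object's memory embedding. This is an illustrative NumPy sketch, not the authors' implementation; the shapes, the function name `moga_adapter`, and the sigmoid gate are our assumptions.

```python
import numpy as np

def moga_adapter(x, U, V, Wg, z):
    """Hypothetical sketch of gated rank-1 adaptation.

    x  : (d,)        input feature for one tracked object
    V  : (r, d)      rank-1 "down" components of the adapter
    U  : (r, d_out)  rank-1 "up" components of the adapter
    Wg : (r, d_z)    gate projection (assumed form)
    z  : (d_z,)      object-specific embedding from the memory bank
    """
    # One sigmoid gate per rank-1 component, conditioned on the
    # memory object embedding z.
    g = 1.0 / (1.0 + np.exp(-(Wg @ z)))   # (r,)
    # Effective adapter weight is sum_i g_i * U[i]^T V[i]; applied to x:
    return (g * (V @ x)) @ U              # (d_out,)
```

Because the gates depend only on the per-object memory embedding, two different tracked objects activate different combinations of rank-1 components, while the same object keeps a consistent gating pattern across frames.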

Architecture of MoGA

RobustPVOS Benchmark

We present the first RobustPVOS benchmark suite that includes manually annotated real-world evaluation datasets under adverse conditions and a synthetic corruption dataset with temporally varying degradation patterns.

Benchmark dataset examples
Dataset                                    Clips     Frames    Objects
Real-world Evaluation Set
  ACDC-Video                                 149      3,259        613
  MVSeg                                      202     13,581      1,930
Synthetic Training Set
  MOSE-C + DAVIS-C + YouTube-VOS-C        46,768  1,774,560  3,872,048
Synthetic Evaluation Set
  YouTube-VOS-C                              507     13,710     25,574

The synthetic corrupted datasets (-C) are generated by applying temporally varying corruptions to the original datasets above. The corruption generation code will be publicly available in our code repository.
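To illustrate what "temporally varying corruption" means, here is a minimal sketch (not the released generation code) that applies Gaussian noise whose severity follows a smooth schedule over the clip; the function name `corrupt_video`, the sine schedule, and the noise type are our assumptions.

```python
import numpy as np

def corrupt_video(frames, max_sigma=0.2, seed=0):
    """Illustrative sketch: Gaussian noise with temporally varying severity.

    frames: (T, H, W, C) float array with values in [0, 1]
    """
    rng = np.random.default_rng(seed)
    T = frames.shape[0]
    # Smooth severity schedule: zero at the clip start, peaking mid-clip.
    sigma = max_sigma * np.sin(np.linspace(0.0, np.pi, T))
    noise = rng.standard_normal(frames.shape)
    out = frames + sigma[:, None, None, None] * noise
    return np.clip(out, 0.0, 1.0)
```

The same schedule idea extends to other corruption families (blur, fog, compression) by varying their severity parameter per frame instead of the noise standard deviation.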

Dataset Download

We release the two real-world evaluation test sets of our RobustPVOS benchmark. Each dataset contains video clips captured under natural adverse conditions (fog, rain, snow, nighttime, low light, motion blur, etc.) with dense, per-object, pixel-level annotations.

RobustPVOS ACDC-Video Test

Download

RobustPVOS MVSeg-adv Test

Download

RobustPVOS MVSeg Test

Download

License & Citation

The RobustPVOS ACDC-Video Test set is derived from the ACDC dataset and is released under the same license terms as the original ACDC packages (see the License file included in the download). If you use the ACDC-Video dataset in your research, please cite both the RobustPVOS paper and the ACDC T-PAMI paper:

@inproceedings{lee2026robustpvos,
  author    = {Lee, Sohyun and Gwon, Yeho and Hoyer, Lukas and Schindler, Konrad and Sakaridis, Christos and Kwak, Suha},
  title     = {Robust Promptable Video Object Segmentation},
  booktitle = {CVPR},
  year      = {2026},
}

@article{sakaridis2021acdc,
  author    = {Sakaridis, Christos and Dai, Dengxin and Van Gool, Luc},
  title     = {{ACDC}: The Adverse Conditions Dataset with Correspondences for Semantic Driving Scene Understanding},
  journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year      = {2022},
}

The RobustPVOS MVSeg Test sets are derived from the MVSeg dataset. If you use the MVSeg dataset in your research, please cite both the RobustPVOS paper (BibTeX entry above) and the original MVSeg paper:

@inproceedings{ji2023mvss,
  author    = {Ji, Wei and Li, Jingjing and Bian, Cheng and Zhou, Zongwei and Zhao, Jiaying and Yuille, Alan L. and Cheng, Li},
  title     = {Multispectral Video Semantic Segmentation: A Benchmark Dataset and Baseline},
  booktitle = {CVPR},
  year      = {2023},
}

Qualitative Results

Qualitative results

BibTeX

@inproceedings{lee2026robustpvos,
  author    = {Lee, Sohyun and Gwon, Yeho and Hoyer, Lukas and Schindler, Konrad and Sakaridis, Christos and Kwak, Suha},
  title     = {Robust Promptable Video Object Segmentation},
  booktitle = {CVPR},
  year      = {2026},
}