Abstract

Traditional semantic segmentation methods typically rely on a closed-set training process, which limits them to recognizing only the classes they were trained on. To overcome this limitation, Zero-Shot Semantic Segmentation (ZSSeg) has been introduced, aiming to segment both seen (labeled) and unseen (unlabeled) classes. Recently, large-scale vision-language pre-trained models such as CLIP have gained traction in ZSSeg as a way to harness pre-trained vision-language knowledge. However, existing methods have been restricted to using either predefined or learnable prompts for text feature extraction. Inspired by the human ability to understand a class through diverse descriptions, our work combines multiple predefined and learnable prompts, built on CLIP, for ZSSeg. Furthermore, we introduce a multi-scale contextual prompt learning method, named ZegMP, to mitigate overfitting. We rigorously evaluate our method on standard ZSSeg benchmarks, including Pascal VOC and COCO-Stuff. Comprehensive experiments show that our approach sets a new state of the art, surpassing existing zero-shot semantic segmentation techniques. The code will be available at ZegMP.

Full text