重要通知 今晚23:00-24:00 进行维护,期间可能无法访问
All-pairs Consistency Learning for Weakly Supervised Semantic Segmentation
澳大利亚国立大学、OpenNLPLab、上海人工智能实验室、厦门大学、OPPO研究院
Self-supervised vision transformers for semantic segmentation
Semantic segmentation is a fundamental task in computer vision and it is a building block of many other vision applications. Nevertheless, semantic segmentation annotations are extremely expensive to collect, so using pre-training to alleviate the need for a large number of labeled samples is appealing. Recently, self-supervised learning (SSL) has shown effectiveness in extracting strong representations and has been widely applied to a variety of downstream tasks. However, most works perform sub-optimally in semantic segmentation because they ignore the specific properties of segmentation: (i) the need of pixel level fine-grained understanding; (ii) with the assistance of global context understanding; (iii) both of the above achieve with the dense self-supervisory signal. Based on these key factors, we introduce a systematic self-supervised pre-training framework for semantic segmentation, which consists of a hierarchical encoder–decoder architecture MEVT for generating high-resolution features with global contextual information propagation and a self-supervised training strategy for learning fine-grained semantic features. In our study, our framework shows competitive performance compared with other main self-supervised pre-training methods for semantic segmentation on COCO-Stuff, ADE20K, PASCAL VOC, and Cityscapes datasets. e.g., MEVT achieves the advantage in linear probing by +1.3 mIoU on PASCAL VOC.
A Transformer-based Adaptive Prototype Matching Network for Few-Shot Semantic Segmentation
近年来,由于深度学习在计算机视觉领域的快速发展,所以传统的语义分割取得了飞速进步。在这种情况下,少样本分割(few-shot segmentation, FSS)被提出用于模拟有限数据和多类别的真实世界场景。
DSMF-Net Dual Semantic Metric Learning Fusion Network for Few-Shot Aerial Image Semantic Segmentation
Semantic segmentation of aerial images is crucial yet resource-intensive. Inspired by human ability to learn rapidly, few-shot semantic segmentation offers a promising solution by utilizing limited labeled data for efficient model training and generalization. However, the intrinsic complexities of aerial images, compounded by scarce samples, often result in inadequate feature representation and semantic ambiguity, detracting from themodel’s performance. In this article, we propose to tackle these challenging problems via dual semantic metric learning and multisemantic features fusion
and introduce a novel few-shot segmentation Network (DSMF-Net). On the one hand, we consider the inherent semantic gap between the feature of graph and grid structures and metric learning of few-shot segmentation. To exploit multiscale global semantic context, we construct scale-aware graph prototypes from different stages of the feature layers based on graph convolutional networks (GCNs), while also incorporating prior-guided metric learning to further enhance context at the high-level convolution features. On the other hand, we design a pyramid-based fusion and condensa-
tion mechanism to adaptively merge and couple the multisemantic information from support and query images. The indication and fusion of different semantic features can effectively emphasize the representation and coupling abilities of the network. We have conducted extensive experiments over the challenging iSAID-5i andDLRSD benchmarks. The experiments have demonstrated our network’s effectiveness and efficiency, yielding on-par performance with the state-of-the-art methods.
Stronger, Fewer, & Superior Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation(DGSS)
https://github.com/w1oves/Rein.git
LGAD Local and Global Attention Distillation for Efficient Semantic Segmentation
Shaoxing University、Central South University
Class Tokens Infusion for Weakly Supervised Semantic Segmentation
Weakly Supervised Semantic Segmentation (WSSS) relies on Class Activation Maps (CAMs) to extract spatial information from image-level labels. With the success of Vision Transformer (ViT), the migration of ViT is actively conducted in WSSS. This work proposes a novel WSSS framework with Class Token Infusion (CTI). By infusing the class tokens from images, we guide class tokens to possess class-specific distinct characteristics and global-local consistency. For this, we devise two kinds of token infusion: 1) Intra-image Class Token Infusion (I-CTI) and 2)Cross-image Class Token Infusion (C-CTI). In I-CTI, we infuse the class tokens from the same but differently augmented images and thus make CAMs consistent among var-
ious deformations (i.e. view, color). In C-CTI, by infusing the class tokens from the other images and imposing the resulting CAMs to be similar, it learns class-specific distinct characteristics. Besides the CTI, we bring the background (BG) concept into ViT with the BG token to reduce the false positive activation ofCAMs. We demonstrate the effectiveness ofour method on PASCAL VOC 2012 and MS COCO 2014 datasets, achieving state-of-the-art results in weakly supervised semantic segmentation. The code is available at https://github.com/yoon307/CTI.
USE Universal Segment Embeddings for Open-Vocabulary Image Segmentation
Bosch Research North America、Bosch Center for Artificial Intelligence (BCAI)
LLMFormer Large LanguageModel for Open-Vocabulary Semantic Segmentation
Hunan University、Monash University
CorrMatch Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation
Nankai University、NKIARI, Shenzhen Futian、SICE, UESTC
CC4S Encouraging Certainty and Consistency in Scribble-Supervised Semantic Segmentation
Deep learning-based solutions have achieved impressive performance in semantic segmentation but often require large
amounts of training data with fine-grained annotations. To alleviate such requisition, a variety of weakly supervised annotation
strategies have been proposed, among which scribble supervision is emerging as a popular one due to its user-friendly annotation way. However, the sparsity and diversity of scribble annotations make it nontrivial to train a network to produce deterministic and consistent predictions directly. To address these issues, in this paper we propose holistic solutions involving the design of network structure, loss and training procedure, named CC4S to improve Certainty and Consistency for Scribble-Supervised Semantic Segmentation. Specifically, to reduce uncertainty, CC4S embeds a random walkmodule into the network structure to make neural representations uniformly distributed within similar semantic regions, which works together with a soft entropy loss function to force the network to produce deterministic predictions. To encourage consistency, CC4S adopts self-supervision training and imposes the consistency loss on the eigenspace of the probability transition matrix in the random walk module (we named neural eigenspace). Such self-supervision inherits the category-level discriminability from the neural eigenspace and meanwhile helps the network focus on producing consistent predictions for the salient parts and neglect semantically heterogeneous backgrounds. Finally, to further improve the performance, CC4S uses the network predictions
as pseudo-labels and retrains the network with an extra color constraint regularizer. From comprehensive experiments, CC4S
achieves comparable performance to those from fully supervised methods and shows promising robustness under extreme supervision cases.
Scaling Up Multi-domain Semantic Segmentation with Sentence Embeddings
The state-of-the-art semantic segmentation methods have achieved impressive performance on predefined close-set individual datasets, but their generalization to zero-shot domains and unseen categories is limited. Labeling a large-scale dataset is challenging and expensive, Training a robust semantic segmentation model on multi-domains has drawn much attention. However, inconsistent taxonomies hinder the naive merging of current publicly available annotations. To address this, we propose a simple solution to scale up the multi-domain semantic segmentation dataset with less human effort. We replace each class label with a sentence embedding, which is a vector-valued embedding of a sentence describing the class. This approach enables the merging of multiple datasets from different domains, each with varying class labels and semantics. We merged publicly available noisy and weak annotations with the most finely annotated data, over 2 million images, which enables training a model that achieves performance equal to that of state-of-the-art supervised methods on 7 benchmark datasets, despite not using any images therefrom. Instead of manually tuning a consistent label space, we utilized a vector-valued embedding of short paragraphs to describe the classes. By fine-tuning the model on standard semantic segmentation datasets, we also achieve a significant improvement over the state-of-the-art supervised segmentation on NYUD-V2 (Silberman et al., in: European conference on computer vision, Springer, pp 746–760, 2012) and PASCAL-context (Everingham et al. in Int J Comput Visi 111(1):98–136, 2015) at 60% and 65% mIoU, respectively. Our method can segment unseen labels based on the closeness of language embeddings, showing strong generalization to unseen image domains and labels. Additionally, it enables impressive performance improvements in some adaptation applications, such as depth estimation and instance segmentation. Code is available at https://github.com/YvanYin/SSIW.
Scribble Hides Class Promoting Scribble-Based Weakly-Supervised Semantic Segmentation with Its Class Label
Peking University, Beijing, China
Learning Generalized Medical Image Segmentation from Decoupled Feature Queries
Jarvis Research Center、Wuhan University、Guangxi Medical University