Semi-supervised semantic segmentation using unreliable pseudo-labels
CVPR2022

https://arxiv.org/pdf/2203.03884.pdf

The key to semi-supervised semantic segmentation is assigning adequate pseudo-labels to the pixels of unlabeled images. A common practice is to select highly confident predictions as the pseudo ground truth, but this leads to a problem: most pixels may be left unused because their predictions are unreliable. We argue that every pixel matters to model training, even if its prediction is ambiguous. Intuitively, an unreliable prediction may get confused among the top classes (i.e., those with the highest probabilities), but it should be confident that the pixel does not belong to the remaining classes. Hence, such a pixel can be convincingly treated as a negative sample for those most unlikely categories. Based on this insight, we develop an effective pipeline to make full use of unlabeled data. Concretely, we separate reliable and unreliable pixels via prediction entropy, push each unreliable pixel into a category-wise queue of negative samples, and manage to train the model with all candidate pixels. Since predictions become more and more accurate as training evolves, we adaptively adjust the threshold of the reliable-unreliable partition. Experimental results on various benchmarks and training settings demonstrate that our method outperforms state-of-the-art alternatives.

1. Introduction

Semantic segmentation is a fundamental task in computer vision, and it has advanced greatly with the rise of deep neural networks. Existing supervised methods rely on large-scale annotated data, which is too costly to obtain in practice. To alleviate this problem, many attempts have been made at semi-supervised semantic segmentation, i.e., learning a model from only a small set of labeled samples together with a large number of unlabeled ones. In this setting, how to make full use of the unlabeled data becomes critical.

The typical solution is to assign pseudo-labels to the unannotated pixels. Concretely, given an unlabeled image, prior works based on self-training and entropy minimization exploit strong data augmentations, such as cropping, to generate pseudo-labels.

Predictions on the input image are generated by a teacher network, and confidence-based filtering of pseudo-labels is used to prevent the model from overfitting to incorrect pseudo-labels, as in FixMatch. In our setting, we do not focus on how to measure uncertainty: we simply use the entropy of the per-pixel probability distribution as the measure.
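
As a concrete illustration, this entropy measure can be computed directly from the network logits. The following PyTorch sketch (function name and tensor shapes are our own conventions, not from the paper) shows one way to do it:

```python
import torch
import torch.nn.functional as F

def pixel_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-pixel entropy of the predicted class distribution.

    logits: (B, C, H, W) raw segmentation outputs.
    Returns a (B, H, W) entropy map; higher entropy = less reliable pixel.
    """
    probs = F.softmax(logits, dim=1)          # p_c for each pixel
    log_probs = F.log_softmax(logits, dim=1)  # numerically stable log p_c
    return -(probs * log_probs).sum(dim=1)    # H(p) = -sum_c p_c log p_c
```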

Many successful works apply contrastive learning to self-supervised learning. However, these methods ignore the false negatives that are common in semi-supervised segmentation, where unreliable pixels may be mistakenly pushed apart by the contrastive loss. Distinguishing the unlikely classes of an unreliable pixel alleviates this problem.

Negative learning aims to reduce the risk of incorrect supervision by lowering the probability of negative samples, but those negative samples are selected only when they are highly reliable. In other words, such methods still rely exclusively on reliable predictions. In contrast, we propose to make full use of unreliable predictions for learning rather than filtering them out.

3. Method

In this section, we first formulate the problem mathematically and give an overview of our proposed method in Sec. 3.1. Our strategy for filtering reliable pseudo-labels is introduced in Sec. 3.2. Finally, we describe how to use unreliable pseudo-labels in Sec. 3.3.

3.1. Overview

Given a labeled set $\mathcal{D}_l = \{(x_i^l, y_i^l)\}_{i=1}^{N_l}$ and a much larger unlabeled set $\mathcal{D}_u = \{x_i^u\}_{i=1}^{N_u}$, our goal is to train a semantic segmentation model by leveraging both the large amount of unlabeled data and the smaller labeled set.

Figure 3 gives an overview of U2PL, which follows a typical self-training framework with two models of identical architecture, named the teacher and the student. The two models differ only in how their weights are updated: the weights θs of the student model are updated by conventional back-propagation, while the weights θt of the teacher model are an exponential moving average (EMA) of the student's weights. Each model consists of a CNN-based encoder h, a decoder with a segmentation head f, and a representation head g. At each training step, we equally sample B labeled images Bl and B unlabeled images Bu. For each labeled image, the goal is to minimize the standard cross-entropy loss in Eq. (2). For each unlabeled image, we first feed it into the teacher model to obtain predictions. Then, based on pixel-level entropy, we ignore unreliable pixel-level pseudo-labels when computing the unsupervised loss in Eq. (3); this part is detailed in Sec. 3.2. Finally, we use a contrastive loss to make full use of the unreliable pixels excluded from the unsupervised loss, as described in Sec. 3.3.
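
To make the teacher-student update concrete, here is a minimal EMA sketch in PyTorch. The momentum value and the omission of buffer synchronization are simplifying assumptions, not the paper's exact recipe:

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.99) -> None:
    """theta_t <- momentum * theta_t + (1 - momentum) * theta_s.

    The teacher receives no gradients; it is updated only by this EMA step,
    typically after each optimizer step on the student.
    """
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.data.mul_(momentum).add_(s.data, alpha=1.0 - momentum)
```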

Our optimization target is to minimize the overall loss:

$$\mathcal{L} = \mathcal{L}_s + \lambda_u \mathcal{L}_u + \lambda_c \mathcal{L}_c,$$

where $\mathcal{L}_s$ and $\mathcal{L}_u$ denote the supervised loss on labeled images and the unsupervised loss on unlabeled images, respectively, and $\mathcal{L}_c$ is the contrastive loss for making full use of unreliable pseudo-labels. $\lambda_u$ and $\lambda_c$ are the weights of the unsupervised loss and the contrastive loss. Both $\mathcal{L}_s$ and $\mathcal{L}_u$ are cross-entropy (CE) losses:

$$\mathcal{L}_s = \frac{1}{|\mathcal{B}_l|} \sum_{(x_i^l,\, y_i^l) \in \mathcal{B}_l} \ell_{ce}\big(f \circ h(x_i^l;\, \theta_s),\, y_i^l\big), \qquad \mathcal{L}_u = \frac{1}{|\mathcal{B}_u|} \sum_{x_i^u \in \mathcal{B}_u} \ell_{ce}\big(f \circ h(x_i^u;\, \theta_s),\, \hat{y}_i^u\big),$$
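
A minimal sketch of how these terms combine in PyTorch, assuming unreliable pixels in the pseudo-label map have already been set to an ignore index (the function name, the ignore value 255, and default weights are our assumptions):

```python
import torch.nn.functional as F

def overall_loss(logits_l, target_l, logits_u, pseudo_u,
                 loss_c, lambda_u=1.0, lambda_c=1.0, ignore_index=255):
    """L = L_s + lambda_u * L_u + lambda_c * L_c.

    logits_l / logits_u: (B, C, H, W) student predictions on labeled /
    unlabeled images. target_l / pseudo_u: (B, H, W) labels, with unreliable
    pseudo-labels set to `ignore_index` so CE skips them. loss_c: the
    precomputed contrastive term.
    """
    l_s = F.cross_entropy(logits_l, target_l, ignore_index=ignore_index)
    l_u = F.cross_entropy(logits_u, pseudo_u, ignore_index=ignore_index)
    return l_s + lambda_u * l_u + lambda_c * loss_c
```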

where $y_i^l$ denotes the hand-annotated mask label of the i-th labeled image, and $\hat{y}_i^u$ is the pseudo-label of the i-th unlabeled image; $f \circ h$ is the composition of h and f, i.e., an image is first fed into h and then into f to obtain the segmentation result. $\mathcal{L}_c$ is a pixel-level InfoNCE loss. Following this intuition, we filter out unreliable pseudo-labels according to Eq. (6).
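
The filtering step can be sketched as follows: compute pseudo-labels and per-pixel entropy from the teacher, then drop the highest-entropy fraction of pixels. The quantile-based threshold and the drop-fraction schedule are our reading of the adaptive partition described in the abstract, not a verbatim reimplementation:

```python
import torch

def filter_pseudo_labels(teacher_logits: torch.Tensor, drop_percent: float,
                         ignore_index: int = 255):
    """Split pixels into reliable / unreliable by entropy (cf. Sec. 3.2).

    Pixels whose entropy lies in the top `drop_percent` fraction of the batch
    are marked unreliable and their pseudo-label is set to `ignore_index`.
    `drop_percent` can be decayed over training as predictions improve.
    """
    probs = teacher_logits.softmax(dim=1)                         # (B, C, H, W)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)  # (B, H, W)
    pseudo = probs.argmax(dim=1)                                  # (B, H, W)

    # entropy threshold = (1 - drop_percent) quantile over all batch pixels
    threshold = torch.quantile(entropy.flatten(), 1.0 - drop_percent)
    unreliable = entropy > threshold
    pseudo[unreliable] = ignore_index                             # skipped by CE
    return pseudo, unreliable
```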

However, simply discarding unreliable pseudo-labels may lead to a loss of information, since unreliable pseudo-labels can still provide discriminative information. For example, the white cross in Figure 2 marks a typical unreliable pixel: its distribution demonstrates the model's uncertainty in distinguishing class person from class motorbike, yet it also demonstrates the model's certainty that the pixel does not belong to class car, class train, class bicycle, and so on. This characteristic is the main basis for our use of unreliable pseudo-labels in semi-supervised semantic segmentation.
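
Following this observation, the most unlikely classes of an unreliable pixel can be read off directly from its probability vector. A small sketch (the number of negative classes per pixel is a hypothetical parameter, not a value from the paper):

```python
import torch

def unlikely_classes(probs: torch.Tensor, num_neg: int = 3) -> torch.Tensor:
    """Indices of the lowest-probability classes for each unreliable pixel.

    probs: (N, C) softmax distributions of N unreliable pixels.
    Returns (N, num_neg) class indices usable as negative candidates.
    """
    # largest=False picks the num_neg *smallest* probabilities per pixel
    _, neg = probs.topk(num_neg, dim=1, largest=False)
    return neg
```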

The goal of U2PL is to exploit the information in unreliable pseudo-labels for better discrimination, which is in line with the recently popular contrastive learning paradigm for discriminative representations. However, due to the scarcity of labeled images in semi-supervised semantic segmentation, our U2PL relies on a more sophisticated strategy. U2PL has three components, namely anchor pixels, positive candidates, and negative candidates, all obtained by sampling from certain sets to reduce the enormous computational cost. We next describe how to select: (a) anchor pixels (queries); (b) positive samples for each anchor; and (c) negative samples for each anchor (a sketch of the per-class negative queue follows this paragraph).

Anchor pixels. During training, we sample anchor pixels (queries) for each class that appears in the current mini-batch. We denote the feature set of all labeled candidate anchor pixels of class c as

$$\mathcal{A}_c^l = \{\, \mathbf{e}_{ij} \mid y_{ij} = c,\ p_{ij}(c) > \delta_p \,\},$$

where $y_{ij}$ is the ground-truth label of the j-th pixel of the i-th labeled image, $p_{ij}(c)$ is its predicted probability for class c, $\mathbf{e}_{ij}$ is the corresponding pixel feature from the representation head, and $\delta_p$ denotes the positive threshold for a given class.
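
To keep enough negatives per class across iterations, a category-wise FIFO memory bank can store these features. The sketch below mirrors the idea of pushing unreliable pixels into per-class negative queues; the queue size and interface are our assumptions:

```python
from collections import deque
import torch

class ClassNegativeQueue:
    """Per-class FIFO bank of negative features for the contrastive loss."""

    def __init__(self, num_classes: int, max_size: int = 30000):
        self.queues = [deque(maxlen=max_size) for _ in range(num_classes)]

    def push(self, cls: int, feats: torch.Tensor) -> None:
        """Store (N, D) projected features as negatives for class `cls`."""
        for f in feats.detach().cpu():
            self.queues[cls].append(f)

    def sample(self, cls: int, n: int) -> torch.Tensor:
        """Draw n random negatives for class `cls` (queue must be non-empty)."""
        bank = torch.stack(list(self.queues[cls]))  # (Q, D)
        idx = torch.randint(0, bank.size(0), (n,))
        return bank[idx]
```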

After that, we use the SBD augmented set, and images are cropped to a fixed resolution for PASCAL VOC 2012. On Cityscapes, previous methods adopt sliding-window evaluation, and so do we. We then use mean intersection over union (mIoU) as the metric to evaluate the predictions, as sketched below. A common shortcoming of semi-supervised learning frameworks, including on Cityscapes, is that due to the extreme scarcity of labels they usually pay a price in training time for higher accuracy; we leave further exploration of training optimization to future work.
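
For reference, mIoU can be computed as below; this is a generic sketch of the metric, not the paper's evaluation code:

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray,
             num_classes: int, ignore_index: int = 255) -> float:
    """Mean intersection-over-union, averaged over classes that appear."""
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                 # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```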