The source code is here.

# Challenges

Two types of noise are prevalent in practice:

(1) Label noise caused by human annotator errors, i.e., people assigned the wrong identities.

(2) Data outliers caused by person detector errors or occlusion.

Having both types of noisy samples in a training set inevitably has a detrimental effect on the learned feature embedding:

**Noisy samples are often far from inliers of the same class in the input (image) space.**

**To minimise intra-class distance and pull the noisy samples close to their class centre, a ReID model often needs to sacrifice inter-class separability, leading to performance degradation.**

# Related Work

Existing robust deep learning approaches can be grouped into two categories, depending on whether human supervision/verification of noise is required.

In the first category, no such additional human noise annotation or pattern estimation is needed.

These methods address label noise by iterative label correction via bootstrapping, by adding layers on top of a classification layer to estimate the noise pattern [1, 2], or by loss correction [3].

The second category requires a subset of the noisy data to be re-annotated (cleaned) by more reliable sources to verify which samples contain noise. This subset is then used as a seed/reference set so that noisy samples in the full training set can be identified.

(1) The recently proposed CleanNet [4] learns the similarity between class and query embedding vectors, which is then used to detect noisy samples.

(2) MentorNet [5], on the other hand, resorts to curriculum learning and knowledge distillation to focus on samples whose labels are more likely to be correct.

# DistributionNet

We propose to explicitly model the feature distribution of an individual image as Gaussian, i.e., we assume the feature vector of an image is drawn from a Gaussian distribution parametrised by a mean vector $\mu$ and a variance $\Sigma$.

## Loss

$$L_{ce}=l(g_\phi(\mu), y)+\lambda(\frac{1}{N}\sum_{j=1}^N l(g_\phi(z^{(j)}), y))$$

$$L_{fu}=\max(0, \gamma-\sum_{i=1}^n q^{(i)})$$
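As a minimal sketch of the two losses above, the following plain-Python snippet evaluates $L_{ce}$ and $L_{fu}$ on toy inputs. Several pieces are assumptions rather than details from the text: `linear_classifier` is a hypothetical stand-in for $g_\phi$, $l$ is taken to be softmax cross-entropy, and $q^{(i)}$ is taken to be the entropy of a diagonal Gaussian with the given per-dimension variances, since $q$ is not defined here.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy l(logits, y) for a single sample."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]


def linear_classifier(weights, feature):
    """Hypothetical stand-in for g_phi: one weight row per class."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def loss_ce(weights, mu, samples, label, lam):
    """L_ce = l(g(mu), y) + lam * (1/N) * sum_j l(g(z_j), y)."""
    mean_term = cross_entropy(linear_classifier(weights, mu), label)
    sample_term = sum(
        cross_entropy(linear_classifier(weights, z), label) for z in samples
    ) / len(samples)
    return mean_term + lam * sample_term


def gaussian_entropy(variances):
    """Assumed definition of q: entropy of a diagonal Gaussian."""
    return 0.5 * sum(math.log(2.0 * math.pi * math.e * v) for v in variances)


def loss_fu(batch_variances, gamma):
    """L_fu = max(0, gamma - sum_i q_i) over the n samples of a batch."""
    total_uncertainty = sum(gaussian_entropy(v) for v in batch_variances)
    return max(0.0, gamma - total_uncertainty)


# Toy usage: 2 classes, 2-D features, one sample with two draws z_j.
weights = [[1.0, 0.0], [0.0, 1.0]]
mu = [2.0, 0.0]
samples = [[1.5, 0.2], [2.4, -0.3]]
print(loss_ce(weights, mu, samples, label=0, lam=0.1))
print(loss_fu([[0.01, 0.01], [1.0, 1.0]], gamma=5.0))
```

Under these assumptions, `loss_fu` is zero once the summed entropy of the batch reaches `gamma`, so it only penalises a batch whose total uncertainty is too low.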

DistributionNet models uncertainty and allocates it appropriately by introducing losses to promote high net uncertainty.

**Given a noisy training sample**, instead of forcing it to be closer to other inliers of the same class, **DistributionNet computes a large variance**, indicating that it is uncertain about what feature values should be assigned to the sample.

Training samples of larger variances have less impact on the learned feature embedding space.

Together with the supervised loss, the feature uncertainty loss $L_{fu}$ helps the model identify noisy training samples and mitigate their negative impact by assigning them large variances. The model then focuses on the clean inliers rather than overfitting to noisy samples, resulting in better class separability and better generalisation to test data.

## Reparameterisation Trick

$$z=\mu+\epsilon\Sigma,\qquad\epsilon\sim N(0, I)$$
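The sampling step above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: $\Sigma$ is assumed to be diagonal and stored as a vector, the product $\epsilon\Sigma$ is read element-wise so that `sigma` acts as a per-dimension scale, and `sample_features` is a hypothetical helper name.

```python
import random


def sample_features(mu, sigma, num_samples, rng):
    """Draw z = mu + eps * sigma element-wise, with eps ~ N(0, I)."""
    samples = []
    for _ in range(num_samples):
        # All randomness lives in eps; z is a deterministic function
        # of mu and sigma once eps is fixed.
        eps = [rng.gauss(0.0, 1.0) for _ in mu]
        samples.append([m + e * s for m, e, s in zip(mu, eps, sigma)])
    return samples


rng = random.Random(0)
mu = [1.0, -2.0, 0.5]
sigma = [0.0, 0.1, 2.0]  # assumed diagonal, one scale per dimension
draws = sample_features(mu, sigma, num_samples=5, rng=rng)
for z in draws:
    print(z)
```

Because the noise is drawn separately from $\mu$ and $\Sigma$, a framework with automatic differentiation can propagate gradients through $z$ to both parameters. In this toy run the first dimension has zero scale, so every draw keeps it at exactly `1.0`.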

# Discussion

## Why large variances for noisy samples

First, we need to understand what the supervised classification loss $L_{ce}$ wants: samples with large variances lead to large values of $L_{ce}$. Samples with wrong labels and outliers have the same effect, because they are normally far from the class centres and the clean inliers.

Second, we explained that with the feature uncertainty loss $L_{fu}$, the model cannot simply satisfy $L_{ce}$ by reducing the variance of every sample to zero – the overall variance/uncertainty level has to be maintained.

So who will get the large variance?

Now the decision is clear: **reducing variances of noisy samples would still lead to large $L_{ce}$, whilst reducing those of clean inliers will have a direct impact of reducing $L_{ce}$**; the model therefore allocates large variance to noisy samples.

## Why samples with larger variance contribute less to model training

The reason is intuitive – if an image embedding has a large variance, a sampled $z$ will be far from its original point (the mean vector $\mu$) yet carry the same class label. So when several diverse $z^{(1)}, z^{(2)}, \dots, z^{(N)}$ and $\mu$ are fed to the classifier, it is likely that **their gradients will cancel each other out**.

On the other hand, when a sample has a small variance, all $z^{(j)}$ will be close to $\mu$; feeding these to the classifier gives consistent gradients, thus reinforcing the importance of the sample.

The variance/uncertainty thus provides a mechanism for DistributionNet to give more/less importance to different training samples. Since noisy samples are given large variance, their contribution to model training is reduced, resulting in a better feature embedding space.
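One rough way to see the cancellation numerically is to compare the norm of the averaged gradient with the average of the per-draw gradient norms. A ratio near 1 means the gradients agree; a ratio near 0 means they largely cancel. The sketch below is a toy setting chosen for illustration, not the DistributionNet architecture: it assumes a binary logistic classifier with a fixed weight vector `w`, an isotropic scale `sigma`, and a hypothetical `agreement` measure.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def grad_wrt_weights(w, z):
    """Gradient of -log(sigmoid(w . z)) with respect to w (label = 1)."""
    coeff = sigmoid(sum(wi * zi for wi, zi in zip(w, z))) - 1.0
    return [coeff * zi for zi in z]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def agreement(w, mu, sigma, num_samples, rng):
    """||mean gradient|| / mean ||gradient|| over mu and its draws."""
    points = [mu] + [
        [m + rng.gauss(0.0, 1.0) * sigma for m in mu]
        for _ in range(num_samples)
    ]
    grads = [grad_wrt_weights(w, z) for z in points]
    mean_grad = [sum(col) / len(grads) for col in zip(*grads)]
    mean_norm = sum(norm(g) for g in grads) / len(grads)
    return norm(mean_grad) / mean_norm


rng = random.Random(0)
w = [0.1] * 8   # toy classifier weights (assumption)
mu = [1.0] * 8  # toy mean feature vector (assumption)
small = agreement(w, mu, sigma=0.05, num_samples=64, rng=rng)
large = agreement(w, mu, sigma=5.0, num_samples=64, rng=rng)
print(f"small variance: {small:.3f}, large variance: {large:.3f}")
```

In this toy setting the small-variance draws give nearly parallel gradients, while the large-variance draws point in scattered directions and contribute a much weaker net update relative to their individual sizes.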

# References

[1] Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. In ICLR, 2015.

[2] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.

[3] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.

[4] Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun Yang. CleanNet: Transfer learning for scalable image classifier training with label noise. In ICCV, 2017.

[5] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.