Omni-Scale Feature Learning for Person Re-ID

Full paper:
https://arxiv.xilesou.top/abs/1905.00953

The source code is here.

Introduction

Major Challenges

As an instance-level recognition problem, person ReID faces two major challenges, as illustrated in Figure 1:

  1. Intra-class (instance/identity) variations are typically large due to changes in camera viewing conditions. - hard positives

  2. Inter-class variations are small - people in public spaces often wear similar clothes; from a distance, as is typical in surveillance videos, they can look incredibly similar. - hard negatives

Omni-Scale Features

To match people and distinguish them from impostors, features corresponding to both small local regions and the global whole body are important.

  1. Looking at the global-scale features alone narrows down the search to the true match (middle) and an impostor (right).

  2. The local-scale features give away the fact that the person on the right is an impostor. (For more challenging cases, more complicated and richer features spanning multiple scales are required.)

OSNet:

  1. enabling omni-scale feature learning;

  2. a lightweight network.

    benefits:

    1) When trained on ReID datasets, which are often of moderate size due to the difficulty of collecting cross-camera matched person images, a lightweight network with a small number of model parameters is less prone to overfitting.

    2) In a large-scale surveillance application, the most practical way to deploy ReID is to perform feature extraction at the camera end, so that only features need to be sent to a central server instead of the raw videos.

Depthwise Separable Convolutions [notes]

Traditional Convolution

Depthwise Convolution

Pointwise Convolution

i.e., a $1\times 1$ convolution:

Lite $3\times 3$ Convolution
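A quick way to see why the Lite $3\times 3$ layer is lightweight is to count parameters. The sketch below (plain Python, bias terms ignored) compares a standard $3\times 3$ convolution with the factorised version, a $1\times 1$ pointwise convolution followed by a $3\times 3$ depthwise convolution; the channel sizes are illustrative choices, not values from the paper.

```python
# Parameter-count sketch for the convolution variants above
# (bias terms omitted for simplicity).

def standard_conv_params(c_in, c_out, k=3):
    # Every output channel has a full k x k x c_in kernel.
    return c_out * c_in * k * k

def depthwise_conv_params(c, k=3):
    # One k x k kernel per channel; channels are not mixed.
    return c * k * k

def pointwise_conv_params(c_in, c_out):
    # 1x1 convolution: mixes channels, no spatial extent.
    return c_in * c_out

def lite3x3_params(c_in, c_out, k=3):
    # Lite 3x3: pointwise (1x1) first, then a depthwise k x k
    # on the c_out channels.
    return pointwise_conv_params(c_in, c_out) + depthwise_conv_params(c_out, k)

c_in, c_out = 64, 64
print(standard_conv_params(c_in, c_out))  # 36864
print(lite3x3_params(c_in, c_out))        # 4096 + 576 = 4672
```

For 64-in/64-out channels the factorised layer needs roughly 8x fewer parameters, which is what makes the multi-stream block below affordable.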

Omni-Scale Residual Block

Multi-Scale Feature Learning

$$y=x+\tilde{x}$$

where

$$\tilde{x}=\sum_{t=1}^TF^t(x),\quad\mbox{s.t.}\quad T\ge 1$$
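The residual aggregation above can be sketched in framework-free Python. Here each stream $F^t$ stacks $t$ layer transforms; the per-layer transform is a hypothetical stand-in (a simple scaling) rather than a real Lite $3\times 3$ convolution, chosen only to make the structure of $y = x + \tilde{x}$ concrete.

```python
# Toy sketch of the multi-scale residual block:
# y = x + sum_{t=1}^{T} F^t(x), where F^t stacks t layers.

def make_stream(t):
    # Stand-in for a stream of t stacked Lite 3x3 layers; a real
    # stream would be t convolutions, with the receptive field
    # growing with t.
    def stream(x):
        out = x
        for _ in range(t):
            out = [0.5 * v for v in out]  # placeholder per-layer transform
        return out
    return stream

def os_residual_block(x, T):
    streams = [make_stream(t) for t in range(1, T + 1)]
    # x_tilde = sum over the T streams, elementwise.
    x_tilde = [sum(s(x)[i] for s in streams) for i in range(len(x))]
    # Residual connection: y = x + x_tilde.
    return [xi + ti for xi, ti in zip(x, x_tilde)]

# With the 0.5-scaling stand-in, the streams contribute
# 0.5x, 0.25x, 0.125x, so y = 1.875 * x for T = 3.
y = os_residual_block([1.0, 2.0], T=3)
```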

Unified Aggregation Gate

  • To learn omni-scale features, the outputs of the different streams are combined in a dynamic way, i.e., different weights are assigned to different scales according to the input image, rather than being fixed after training. - a learnable neural network, the aggregation gate (AG)

    1. The output of the AG network $G(x^t)$ is a vector rather than a scalar for the $t$-th stream, resulting in a more fine-grained fusion that tunes each feature channel.

    2. The weights are dynamically computed by being conditioned on the input data.

$$\tilde{x}=\sum_{t=1}^TG(x^t)\odot x^t,\quad\mbox{where}\quad x^t=F^t(x)$$

  • The AG is shared for all feature streams in the same omni-scale residual block.

    advantages:

    1. The number of parameters is independent of $T$ (number of streams), thus the model becomes more scalable.

    2. The supervision signals from all streams are gathered together to guide the learning of $G$.

      $$\frac{\partial L}{\partial G}=\frac{\partial L}{\partial \tilde{x}}\frac{\partial \tilde{x}}{\partial G}=\frac{\partial L}{\partial \tilde{x}}(\sum_{t=1}^T x^t)$$
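The gated fusion $\tilde{x}=\sum_t G(x^t)\odot x^t$ can be sketched as follows. Note the hedge: in the paper the AG is a small learned sub-network (global average pooling followed by fully connected layers and a sigmoid); here `gate` uses a fixed toy weight instead of learned parameters, and serves only to show the shared, per-channel, input-dependent weighting.

```python
import math

# Toy sketch of the unified aggregation gate: the same gate G is
# applied to every stream, and its output is a per-channel vector
# in (0, 1) that scales that stream's features.

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gate(xt, w=1.0):
    # Per-channel gating; `w` is a hypothetical shared weight
    # standing in for the AG's learned FC layers.
    return [sigmoid(w * v) for v in xt]

def aggregate(streams_out):
    # x_tilde = sum_t G(x^t) (elementwise *) x^t,
    # with the SAME gate G shared across all T streams.
    n = len(streams_out[0])
    return [sum(gate(xt)[i] * xt[i] for xt in streams_out)
            for i in range(n)]

# Two streams, two channels: weights depend on the inputs themselves.
x_tilde = aggregate([[1.0, -1.0], [2.0, 0.0]])
```

Because the gate is shared, adding more streams adds no gating parameters, which matches the scalability point above.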

Differences to Inception and ResNeXt

  1. The multi-stream design in OSNet strictly follows the scale-incremental principle dictated by the exponent $T$. Specifically, different streams have different receptive fields but are built with the same Lite $3\times 3$ layers. Such a design is more effective at capturing a wide range of scales. In contrast, Inception [1] was originally designed to have low computational costs by sharing computations across multiple streams, so its structure, which mixes convolution and pooling operations, was handcrafted. ResNeXt [2] has multiple equal-scale streams and thus learns representations at a single scale.

  2. Inception/ResNeXt aggregates features by concatenation/addition, while OSNet uses a unified AG, which facilitates the learning of combinations of multi-scale features. Critically, this means the fusion is dynamic and adaptive to each individual input image. OSNet's architecture is therefore fundamentally different in nature from that of Inception/ResNeXt.

  3. OSNet uses factorised convolutions, so the building block, and consequently the whole network, is lightweight.

Differences to SENet

SENet [3] aims to re-calibrate the feature channels by re-scaling the activation values for a single stream, whereas OSNet is designed to selectively fuse multiple feature streams of different receptive field sizes in order to learn omni-scale features.

References

[1] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015. [link]

[2] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017. [link]

[3] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018. [link]
