Full paper link:

https://arxiv.xilesou.top/abs/1905.00953

The source code is here.

# Introduction

## Major Challenges

As an instance-level recognition problem, person ReID faces two major challenges, as illustrated in Figure 1:

- **Hard positives**: intra-class (instance/identity) variations are typically large due to changes in camera viewing conditions.
- **Hard negatives**: inter-class variations are small; people in public spaces often wear similar clothes, and from a distance, as is typical in surveillance videos, they can look remarkably similar.

## Omni-Scale Features

To match people and distinguish them from impostors, features corresponding to both **small local regions** and **global whole-body regions** are important.

Looking at the global-scale features narrows the search down to the true match (middle) and an impostor (right).

The local-scale features then give away the fact that the person on the right is an impostor. (For more challenging cases, richer and more complicated features that span multiple scales are required.)

## OSNet

OSNet has two design goals:

- enabling omni-scale feature learning;
- being a lightweight network.

Benefits of the lightweight design:

1. When trained on ReID datasets, which are often of moderate size due to the difficulty of collecting cross-camera matched person images, a lightweight network with a small number of parameters is **less prone to overfitting**.
2. In a large-scale surveillance application, the most practical way to run ReID is to perform feature extraction at the camera end, in which case only features, rather than raw videos, need to be sent to a central server.

# Depthwise Separable Convolutions [notes]

## Traditional Convolution

## Depthwise Convolution

## Pointwise Convolution

That is, a $1\times 1$ convolution.

## Lite $3\times 3$ Convolution
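The layers above are easiest to compare by parameter count. A minimal sketch in plain Python (bias terms ignored; the pointwise-then-depthwise ordering of the Lite $3\times 3$ layer is my reading of the OSNet paper, so treat it as an assumption):

```python
def standard_conv_params(k, c_in, c_out):
    # Standard k x k conv: one k x k x c_in filter per output channel.
    return k * k * c_in * c_out

def depthwise_conv_params(k, c):
    # Depthwise conv: one k x k filter per channel, no cross-channel mixing.
    return k * k * c

def pointwise_conv_params(c_in, c_out):
    # Pointwise (1 x 1) conv: mixes channels only, no spatial aggregation.
    return c_in * c_out

def lite_3x3_params(c_in, c_out):
    # Lite 3x3 (assumed order): 1x1 conv c_in -> c_out,
    # followed by a 3x3 depthwise conv on the c_out channels.
    return pointwise_conv_params(c_in, c_out) + depthwise_conv_params(3, c_out)

# Example: a 3x3 layer mapping 64 -> 128 channels.
print(standard_conv_params(3, 64, 128))  # 73728
print(lite_3x3_params(64, 128))          # 8192 + 1152 = 9344
```

The factorisation cuts the 3x3 layer from ~74k to ~9k parameters in this example, which is what makes the multi-stream block affordable.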

# Omni-Scale Residual Block

## Multi-Scale Feature Learning

$$y=x+\tilde{x}$$

where

$$\tilde{x}=\sum_{t=1}^TF^t(x),\quad\mbox{s.t.}\quad T\ge 1$$
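Before gating is introduced, the block is just an element-wise sum of the $T$ stream outputs plus the identity shortcut. A toy sketch in plain Python, where the lambdas are hypothetical stand-ins for the $F^t$ (in OSNet each $F^t$ is a stack of $t$ Lite $3\times 3$ layers with an increasingly large receptive field):

```python
# Hypothetical stand-ins for the T = 3 streams F^t.
streams = [
    lambda x: [v * 1 for v in x],  # toy F^1 (smallest receptive field)
    lambda x: [v * 2 for v in x],  # toy F^2
    lambda x: [v * 3 for v in x],  # toy F^3 (largest receptive field)
]

def residual_block(x):
    # x_tilde = sum over t of F^t(x)
    x_tilde = [0] * len(x)
    for F in streams:
        x_tilde = [a + b for a, b in zip(x_tilde, F(x))]
    # y = x + x_tilde  (identity shortcut)
    return [xi + ti for xi, ti in zip(x, x_tilde)]

print(residual_block([1, 2]))  # [7, 14]
```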

## Unified Aggregation Gate

To learn omni-scale features, the outputs of the different streams are combined **in a dynamic way**: different weights are assigned to different scales according to the input image, rather than being fixed after training. This is done with a **learnable neural network**, the aggregation gate (AG). The output $G(x^t)$ of the AG network for the $t$-th stream is a **vector** rather than a scalar, resulting in a more fine-grained fusion that tunes each feature channel. The weights are computed dynamically, conditioned on the input data.

$$\tilde{x}=\sum_{t=1}^TG(x^t)\odot x^t,\quad\mbox{where}\quad x^t=F^t(x)$$

The AG is **shared** across all feature streams in the same omni-scale residual block. Advantages:

- The number of parameters is independent of $T$ (the number of streams), so the model is more scalable.
- The supervision signals from all streams are gathered together to guide the learning of $G$:

$$\frac{\partial L}{\partial G}=\frac{\partial L}{\partial \tilde{x}}\frac{\partial \tilde{x}}{\partial G}=\frac{\partial L}{\partial \tilde{x}}(\sum_{t=1}^T x^t)$$
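A minimal sketch of the gated fusion in plain Python. The gate here is a made-up stand-in (a sigmoid of a fixed per-channel weight vector) rather than the real AG mini-network, but it shows the channel-wise weighting $\tilde{x}=\sum_t G(x^t)\odot x^t$ with one $G$ shared across all streams:

```python
import math

# Assumed gate parameters: one weight per channel, shared by every stream.
gate_w = [0.5, -0.5]

def G(xt):
    # Channel-wise gate: sigmoid of a per-channel score of x^t,
    # so each output lies in (0, 1) and scales one feature channel.
    return [1.0 / (1.0 + math.exp(-w * v)) for w, v in zip(gate_w, xt)]

def aggregate(stream_outputs):
    # x_tilde = sum over t of G(x^t) (elementwise *) x^t, G shared across t.
    x_tilde = [0.0] * len(stream_outputs[0])
    for xt in stream_outputs:
        g = G(xt)
        x_tilde = [acc + gi * vi for acc, gi, vi in zip(x_tilde, g, xt)]
    return x_tilde

# Two toy stream outputs with two channels each.
out = aggregate([[1.0, 1.0], [2.0, -1.0]])
```

Because the same `G` sees every stream's output, its gradient accumulates contributions from all $T$ streams, matching the derivative above.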

## Differences to Inception and ResNeXt

The multi-stream design in OSNet strictly follows the scale-incremental principle dictated by the exponent $T$. Specifically, different streams have different receptive fields but are built with the same Lite $3\times 3$ layers. Such a design is more effective at capturing a wide range of scales. In contrast, Inception [1] was originally designed to have low computational costs by sharing computations across multiple streams, so its structure, which mixes convolution and pooling operations, was handcrafted. ResNeXt [2] has multiple equal-scale streams and thus learns representations at a single scale.

Inception/ResNeXt aggregates features by concatenation/addition, while OSNet uses a unified AG, which facilitates the learning of combinations of multi-scale features. Critically, this means the fusion is dynamic and adaptive to each individual input image. OSNet's architecture is therefore fundamentally different from that of Inception/ResNeXt.

OSNet uses factorised convolutions, so the building block, and consequently the whole network, is lightweight.

## Differences to SENet

SENet [3] aims to re-calibrate the feature channels by re-scaling the activation values for a single stream, whereas OSNet is designed to selectively fuse multiple feature streams of different receptive field sizes in order to learn omni-scale features.

# References

[1] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015. [link]

[2] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017. [link]

[3] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018. [link]