The source code is here.

# Introduction

## Challenge One

In supervised learning, a deep CNN is data-driven: it needs a large number of pairwise-labelled training samples to learn view-invariant representations. However, labelling enough pairwise RE-ID data is expensive and time-consuming. Improving the performance and scalability of deep RE-ID algorithms without pairwise-labelled data (i.e., unsupervised learning) is therefore a major challenge in recent person RE-ID research.

There have been a series of unsupervised image based methods to address this problem, which can be roughly divided into three categories:

1. image-to-image translation

Transfer the source-domain images to the target domain with a GAN.

2. domain adaptation

Transfer the model trained on the source domain to the target domain in an unsupervised manner.

3. unsupervised clustering

Obtain pseudo labels for the target-domain data with an unsupervised clustering algorithm, then fine-tune the source-domain model on the target domain using those pseudo labels.

## Challenge Two

The precondition of the above-mentioned methods is that the source domain and the target domain share some similarities.

## Tracklet Based Methods

Because UTAL [1] and TAUDL [2] match the underlying positive pairs within a mini-batch, both need a large batch size to sample those pairs.

RACE [3] and BUC [4], which progressively merge the underlying positive pairs during training, are easily damaged when noisy pairs are merged.

# Unsupervised Graph Association

The core ideas are to mine cross-view relationships and to reduce the damage caused by noisy associations.

The intra-camera learning stage learns person representations with respect to camera information, which helps reduce false cross-view associations in the inter-camera learning stage.

## Intra-camera Learning Stage

Each classifier branch corresponds to one camera’s classification task.

Suppose we have a dataset captured from $T$ cameras. We adopt sparse space-time tracklet sampling (SSTT [2]) to sample the training tracklets $\lbrace s_t^i, y_t^i\rbrace$ from each camera.

Let $s_t^i=\lbrace I_1^{s_t^i}, I_2^{s_t^i}, \dots, I_{N_{s_t^i}}^{s_t^i}\rbrace$, where $I_n^{s_t^i}$ is the $n$-th image of the $i$-th tracklet ($i\in[1,\dots,M_t]$) in the $t$-th camera ($t\in[1,\dots,T]$), $N_{s_t^i}$ is the number of images in the tracklet, and $M_t$ is the number of tracklets in camera $t$.

We randomly assign a unique pseudo label $y_t^i$ ($y_t^i\in \lbrace y_t^1, \dots, y_t^{M_t}\rbrace$) to each $s_t^i$.
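The labelling step can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `assign_pseudo_labels`, the dictionary layout, and the toy file names are all assumptions made for the example. Each camera gets its own label space, so label `0` in one camera is unrelated to label `0` in another.

```python
import random

def assign_pseudo_labels(tracklets_per_camera, seed=0):
    """Assign each tracklet a unique integer label within its own camera.

    tracklets_per_camera: dict mapping camera id -> list of tracklets,
    where a tracklet is a list of image identifiers (hypothetical layout).
    Returns a dict mapping camera id -> list of (tracklet, label) pairs.
    """
    rng = random.Random(seed)
    labelled = {}
    for cam, tracklets in tracklets_per_camera.items():
        # A random permutation of 0..M_t-1 gives unique, arbitrary labels.
        labels = rng.sample(range(len(tracklets)), len(tracklets))
        labelled[cam] = list(zip(tracklets, labels))
    return labelled

toy = {
    0: [["a1.jpg", "a2.jpg"], ["b1.jpg"]],
    1: [["c1.jpg", "c2.jpg", "c3.jpg"]],
}
labelled = assign_pseudo_labels(toy)
```

The labels carry no identity information across cameras; linking tracklets of the same person across views is left to the inter-camera stage.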

$\phi(\cdot )$ is the backbone function.

Note:

1. The batch normalization layer is effective at avoiding overfitting and suppressing negative pairs, i.e., it lowers the average similarity score of negative pairs and makes them easier to distinguish.

2. The assumption of our experiments is that one person has only one tracklet in each camera through SSTT sampling.

## Inter-camera Learning Stage

### Tracklet’s representation

$$c_t^i=\frac{1}{N_{s_t^i}}\sum_{n=1}^{N_{s_t^i}}\phi(I_n^{s_t^i}),\quad I_n^{s_t^i}\in s_t^i$$
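As a small illustration of this formula, the tracklet representation is just the element-wise mean of the backbone features of its images. The sketch below assumes the features $\phi(I_n^{s_t^i})$ have already been computed and are stored as plain lists of floats; `tracklet_centroid` is a hypothetical helper name.

```python
def tracklet_centroid(features):
    """Element-wise mean of the per-image feature vectors of one tracklet."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]

# Three images of one tracklet, each with a 2-dimensional feature.
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
centroid = tracklet_centroid(feats)
```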

### Cross-View Graph

For each $c_t^i$, the KNN set $\lbrace c_t^i\rbrace_K^m$ contains the $K$ tracklets in camera $m$ that are nearest to $c_t^i$.

$$e(c_t^i, c_m^j)= \begin{cases} \cos(c_t^i, c_m^j),\quad &\mbox{if}\quad\cos(c_t^i, c_m^j)\gt \lambda\quad \&\quad c_m^j\in \lbrace c_t^i\rbrace_K^m\quad \&\quad c_t^i\in \lbrace c_m^j\rbrace_K^t\\ 1, &\mbox{if}\quad c_t^i=c_m^j\\ 0, &\mbox{otherwise} \end{cases}$$
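The first case of the edge definition combines a similarity threshold with a mutual-KNN check. A minimal sketch of that case for two different cameras is given below. It assumes centroids are plain lists of floats and that "nearest" means highest cosine similarity; the helper names `cosine`, `knn`, and `edge_weight` are illustrative only. The self-edge case (weight 1) is not covered here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn(query, candidates, k):
    """Indices of the k candidates most cosine-similar to the query."""
    order = sorted(range(len(candidates)),
                   key=lambda j: cosine(query, candidates[j]),
                   reverse=True)
    return set(order[:k])

def edge_weight(i, j, cam_t, cam_m, k, lam):
    """Edge between centroid i of camera t and centroid j of camera m.

    Non-zero only if the similarity exceeds lam AND each centroid is
    among the other's k nearest neighbours in the opposite camera.
    """
    sim = cosine(cam_t[i], cam_m[j])
    mutual = j in knn(cam_t[i], cam_m, k) and i in knn(cam_m[j], cam_t, k)
    return sim if (sim > lam and mutual) else 0.0

cam_t = [[1.0, 0.0], [0.0, 1.0]]
cam_m = [[0.9, 0.1], [0.1, 0.9]]
strong = edge_weight(0, 0, cam_t, cam_m, k=1, lam=0.5)
weak = edge_weight(0, 1, cam_t, cam_m, k=1, lam=0.5)
```

In this toy setup `strong` keeps its cosine similarity as the edge weight, while `weak` is set to zero because the pair fails the threshold and is not a mutual nearest neighbour.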

### Cross-camera loss

Graph neighbor set $N(s_t^i)$:

$$N(s_t^i)=\lbrace (s_m^a, y_m^a)\mid e(c_t^i, c_m^a)\ne 0\rbrace$$

The weights of the MBC are replaced with the corresponding nodes of the cross-view graph (CVG), so that the CVG is updated quickly during training:

$$l_{ce}(I_n^{s_t^i}, s_m^a)=-\sum_{j=1}^{M_m}1(y_m^a==j)\log\left(\frac{\exp((c_m^j)^T\phi(I_n^{s_t^i}))}{\sum_{k=1}^{M_m}\exp((c_m^k)^T\phi(I_n^{s_t^i}))}\right)$$

### Graph weighted cross-camera loss

$$\begin{split} l_{inter}(I_n^{s_t^i})&=\sum_{N(s_t^i)-s_t^i}e(c_t^i, c_m^a)l_{ce}(I_n^{s_t^i}, s_m^a)+\alpha l_{ce}(I_n^{s_t^i}, s_t^i)\\ &=\sum_{N(s_t^i)}e(c_t^i, c_m^a)l_{ce}(I_n^{s_t^i}, s_m^a),\qquad\mbox{where}\quad\alpha=e(c_t^i, c_t^i) \end{split}$$
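To make the weighting concrete, the sketch below evaluates this loss for a single image feature. It reads $l_{ce}$ as the usual softmax cross-entropy over one camera's centroids, with the neighbour tracklet's pseudo label as the target. The data layout (a list of `(camera, tracklet_index, edge_weight)` triples that includes the self-edge with weight 1) and the function names are assumptions for illustration.

```python
import math

def cross_entropy(feature, centroids, target):
    """Negative log-softmax at index `target`, using inner products
    between the feature and one camera's centroids as logits."""
    logits = [sum(a * b for a, b in zip(c, feature)) for c in centroids]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]

def inter_camera_loss(feature, neighbours, centroids_by_camera):
    """Sum of edge-weighted cross-entropy terms over the graph neighbours."""
    return sum(
        weight * cross_entropy(feature, centroids_by_camera[cam], idx)
        for cam, idx, weight in neighbours
    )

centroids_by_camera = {
    0: [[1.0, 0.0], [0.0, 1.0]],
    1: [[1.0, 0.0], [0.0, 1.0]],
}
feature = [1.0, 0.0]
# Self-edge in camera 0 (weight 1) plus one cross-view neighbour in camera 1.
neighbours = [(0, 0, 1.0), (1, 0, 0.5)]
loss = inter_camera_loss(feature, neighbours, centroids_by_camera)
```

A neighbour with a small edge weight contributes little to the loss, which is how uncertain cross-view associations are down-weighted instead of being trusted fully.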

### CVG’s Updating

$$\frac{\partial l_{inter}}{\partial c_m^a}=-\sum_{N_{bs}}err(I_n^{s_t^i})e(c_t^i, c_m^a)\phi(I_n^{s_t^i})$$

where

$$err(I_n^{s_t^i})=1(y_m^a==j)-\frac{\exp((c_m^j)^T\phi(I_n^{s_t^i}))}{\sum_{k=1}^{M_m}\exp((c_m^k)^T\phi(I_n^{s_t^i}))}$$

$$c_m^a\leftarrow c_m^a-\eta\frac{\partial l_{inter}}{\partial c_m^a}$$

The updating of $c_t^i$ makes full use of underlying positive pairs from all camera views.
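The update can be sketched for a single image and a single neighbour edge as below. This is a hedged illustration, not the authors' implementation: it assumes a plain gradient-descent step (moving against the gradient of the weighted cross-entropy), applies it to every centroid of one camera, and uses the hypothetical names `softmax` and `update_centroids`.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def update_centroids(feature, centroids, target, weight, lr):
    """One descent step on weight * cross-entropy for one camera's centroids.

    For centroid j the gradient is -weight * err_j * feature, where
    err_j = 1(j == target) - softmax_j, so descending adds
    lr * weight * err_j * feature to centroid j.
    """
    logits = [sum(a * b for a, b in zip(c, feature)) for c in centroids]
    probs = softmax(logits)
    updated = []
    for j, c in enumerate(centroids):
        err = (1.0 if j == target else 0.0) - probs[j]
        updated.append([cj + lr * weight * err * f for cj, f in zip(c, feature)])
    return updated

centroids = [[0.5, 0.5], [0.5, 0.5]]
new_centroids = update_centroids([1.0, 0.0], centroids, target=0, weight=1.0, lr=0.1)
```

In this toy case the target centroid moves toward the image feature and the other centroid moves away from it, each scaled by the edge weight.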

# References

[1] Minxian Li, Xiatian Zhu, and Shaogang Gong. Unsupervised tracklet person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2019.

[2] Minxian Li, Xiatian Zhu, and Shaogang Gong. Unsupervised person re-identification by deep learning tracklet association. In Proceedings of the European Conference on Computer Vision (ECCV), pages 737–753, 2018.

[3] Mang Ye, Andy J Ma, Liang Zheng, Jiawei Li, and Pong C Yuen. Dynamic label graph matching for unsupervised video re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 5142–5150, 2017.

[4] Yutian Lin, Xuanyi Dong, Liang Zheng, Yan Yan, and Yi Yang. A bottom-up clustering approach to unsupervised person re-identification. In AAAI Conference on Artificial Intelligence, volume 2, 2019.