The source code is here.
Challenges
Deep re-ID models trained on the source domain can suffer a significant performance drop on the target domain because of the data bias between the source and target datasets.
-> unsupervised domain adaptation (UDA)
-> generative adversarial network (GAN)
Disparities between cameras are another critical factor influencing re-ID performance.
-> Hetero-Homogeneous Learning (HHL [1])
However, the performance of these UDA approaches is still far behind that of their fully supervised counterparts. The main reason is that most previous works focus on increasing the number of training samples or on comparing the similarity or dissimilarity between the source dataset and the target dataset, but ignore the natural similarities that exist among the training samples within the target domain.
Fully Supervised Pre-training
$$L_{baseline}=L_{softmax}+L_{triplet}$$
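As a rough illustration of how the two terms combine, the sketch below computes a softmax cross-entropy term and a triplet term for a toy batch in plain Python. The batch-hard mining, the margin of 0.3, and all function names are assumptions made for this example; the notes above do not specify them.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of one sample from raw class scores (log-sum-exp form)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplet(features, labels, margin=0.3):
    """Mean over anchors of max(0, margin + hardest positive - hardest negative).

    Assumption: batch-hard mining; anchors lacking a positive or a negative are skipped.
    """
    losses = []
    for i, (fi, yi) in enumerate(zip(features, labels)):
        pos = [euclidean(fi, fj)
               for j, (fj, yj) in enumerate(zip(features, labels))
               if yj == yi and j != i]
        neg = [euclidean(fi, fj)
               for fj, yj in zip(features, labels) if yj != yi]
        if not pos or not neg:
            continue
        losses.append(max(0.0, margin + max(pos) - min(neg)))
    return sum(losses) / len(losses) if losses else 0.0

def baseline_loss(logits_batch, features, labels, margin=0.3):
    """L_baseline = L_softmax + L_triplet, both averaged over the batch."""
    ce = sum(softmax_cross_entropy(l, y)
             for l, y in zip(logits_batch, labels)) / len(labels)
    return ce + batch_hard_triplet(features, labels, margin)
```

With well-separated identities the triplet term vanishes and only the classification term remains, which is the behaviour one would expect late in supervised pre-training.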
Self-Similarity Grouping
We compare the similarity between two persons not only through global information obtained from the whole body but also through finer, local information taken from the upper and lower parts of a person. In other words, the proposed SSG automatically mines the potential similarities in the target dataset from different appearance cues (from global to local) in an unsupervised manner.
DBSCAN (Density Based Spatial Clustering of Applications with Noise)
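The grouping step can be sketched as running DBSCAN independently on the global, upper-body, and lower-body features, so that each image receives three pseudo-labels. The code below is a minimal from-scratch DBSCAN over a precomputed distance matrix, written only to keep the example self-contained. The Euclidean distance, the `eps` and `min_samples` values, and every function name are assumptions for illustration, not the authors' settings.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(features):
    """Dense distance matrix (assumption: Euclidean; any metric could be swapped in)."""
    return [[euclidean(a, b) for b in features] for a in features]

def dbscan(dist, eps, min_samples):
    """Cluster points from a precomputed distance matrix; label -1 marks noise."""
    n = len(dist)
    labels = [None] * n  # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_samples:  # a point counts as its own neighbor
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist[j][k] <= eps]
            if len(j_neighbors) >= min_samples:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels

def self_similarity_grouping(f_global, f_up, f_low, eps, min_samples):
    """Return pseudo-labels (y_t, y_t_up, y_t_low), one clustering per cue."""
    return tuple(dbscan(pairwise(f), eps, min_samples)
                 for f in (f_global, f_up, f_low))

# Toy 1-D "features" for seven images.
f_global = [[0], [1], [2], [50], [51], [52], [200]]
f_up = [[0], [1], [50], [51], [52], [53], [200]]
f_low = f_global
y_t, y_t_up, y_t_low = self_similarity_grouping(f_global, f_up, f_low,
                                                eps=1.5, min_samples=2)
```

In this toy run the third image falls into different groups under the global and upper-body cues, which illustrates why the three sets of pseudo-labels are kept separate.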
$$L_{ssg}=L_{triplet}(f_t, y_t)+L_{triplet}(f_{t\_up}, y_{t\_up})+L_{triplet}(f_{t\_low}, y_{t\_low})+L_{triplet}(f_{t_e}, y_t)$$
Clustering-guided Semi-Supervised Training
We employ the k-reciprocal encoding [2], a variation of Jaccard distance between nearest neighbor sets, as the distance metric for similarity measurement.
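To make the metric concrete, here is a simplified sketch of the set-based core of that idea: the Jaccard distance between the k-reciprocal nearest-neighbor sets of two images. It deliberately omits the set expansion, weighting, and query-expansion refinements of the full method in [2]; the function names and the choice of `k` are assumptions for illustration.

```python
def k_nearest(dist, i, k):
    """Indices of the k nearest points to i under the initial distance matrix.

    Point i is normally included because its distance to itself is 0.
    """
    order = sorted(range(len(dist)), key=lambda j: dist[i][j])
    return set(order[:k])

def k_reciprocal(dist, i, k):
    """Neighbors j of i such that i is also among the k nearest neighbors of j."""
    mutual = {j for j in k_nearest(dist, i, k) if i in k_nearest(dist, j, k)}
    return mutual | {i}  # keep i itself so the set is never empty

def jaccard_distance(dist, k):
    """d_J(i, j) = 1 - |R(i) & R(j)| / |R(i) | R(j)| for every pair of images."""
    n = len(dist)
    recip = [k_reciprocal(dist, i, k) for i in range(n)]
    return [[1.0 - len(recip[i] & recip[j]) / len(recip[i] | recip[j])
             for j in range(n)]
            for i in range(n)]

# Toy example: five images described by a single scalar feature.
points = [0, 1, 2, 10, 11]
initial = [[abs(a - b) for b in points] for a in points]
d_j = jaccard_distance(initial, k=2)
```

Images whose reciprocal neighbor sets coincide get distance 0, while images with no shared reciprocal neighbors get distance 1, so the resulting matrix can be passed to a clustering step that accepts precomputed distances.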
We randomly sample a single image from each group to form a very small sub-dataset $X_g$ with $N_g$ images.
We label this small sub-dataset manually and perform label assignment based on this annotation.
Since each image in $X_g$ comes from a different group, it is less likely that two different images share the same identity, which allows us to adopt the one-shot learning approach [3] and further improve the performance. (multi gallery shot -> one shot)
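The sampling and label-assignment steps above might look like the following sketch. Two points are assumptions rather than details from the notes: images marked as noise (label `-1`) are skipped when sampling, and each remaining image simply takes the label of its nearest annotated image. All names are hypothetical.

```python
import random

def sample_one_per_group(group_labels, seed=0):
    """Pick one image index per group to form X_g (noise label -1 is skipped)."""
    rng = random.Random(seed)
    groups = {}
    for idx, g in enumerate(group_labels):
        if g == -1:
            continue
        groups.setdefault(g, []).append(idx)
    return {g: rng.choice(members) for g, members in sorted(groups.items())}

def assign_labels(dist, annotated):
    """Give every image the identity of its nearest manually labeled image.

    dist: pairwise distance matrix over all target images.
    annotated: dict mapping image index in X_g -> manual identity label.
    """
    labels = []
    for i in range(len(dist)):
        nearest = min(annotated, key=lambda j: dist[i][j])
        labels.append(annotated[nearest])
    return labels

# Toy example: five images, two groups plus one noise image.
points = [0, 1, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
group_labels = [0, 0, 1, 1, -1]

picks = sample_one_per_group(group_labels)       # N_g = 2 images to annotate
annotated = {picks[0]: "A", picks[1]: "B"}       # stands in for manual labeling
y_t_g = assign_labels(dist, annotated)
```

Only $N_g$ images need manual annotation here, which is what keeps the labeling cost small relative to the full target set.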
$$L_{semi}=L_{triplet}(f_t, y_{t_g})+L_{triplet}(f_{t\_up}, y_{t\_up})+L_{triplet}(f_{t\_low}, y_{t\_low})+L_{triplet}(f_{t_e}, y_{t_g})$$
Joint training strategy
$$L_{jointly}=L_{ssg}+L_{semi}$$
References
[1] Zhun Zhong, Liang Zheng, Shaozi Li, and Yi Yang. Generalizing a person retrieval model hetero-and homogeneously. In ECCV, pages 172–188, 2018.
[2] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Reranking person re-identification with k-reciprocal encoding. In IEEE CVPR, pages 3652–3661, 2017.
[3] Yu Wu, Yutian Lin, Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In IEEE CVPR, 2018.