MVP Matching - A Maximum-value Perfect Matching for Mining Hard Samples, with Application to Person Re-ID


The source code is here.


Hard Samples

  1. The appearance of different pedestrians may be highly similar;

  2. The pose of a person may vary significantly as time and space changed;

  3. The light conditions taken by some cameras are sometimes poor.

These hard samples would strongly slow down the convergence rate of the metric learning, which works on pulling similar samples to cluster together while pushing dissimilar ones to widen apart.

Or worse of all, the learned embedding metric and feature representation could be heavily biased by these hard samples.

Deep metric learning

  • Seminal contrastive loss [1] or triplet loss [2] learns the semantical information within image pairs or triplets based on the siamese-like networks [3]. They do not make full use of all available information within a batch of samples.

  • Batch all triplet loss [4] and N-pair loss [5] have been developed to remedy this flaw, but they do not attach enough attention to hard samples and require expensive resampling techniques to boost the performance.

  • Lifted loss [6] and quadruplet loss [7] only consider hard negative samples mining while ignoring the hard positive samples.

  • Batch hard triplet loss [4] considers the hardest positive and negative mining depended on the distances of features simply. Its performance is easily influenced by some outlier samples (e.g., indistinguishable or mislabeled images in person re-ID datasets), which could be regarded as hardest sample pairs by many other samples simultaneously and lead to oscillation during metric learning process.

Conventional batchwise loss objectives for metric learning can be optimized using all available information within training batches, so that all similar positive pairs with large ground distances and dissimilar negative pairs with small ground distances would be emphasized simultaneously.

  1. Overtraining:

    Similar positive pairs with small distances would still be optimized (e.g., $x_1$ and $x_4$).

  2. Counteracting:

    Or worse, if we treat the optimization as a whole rather than the individual particle, the metric learning process of these methods is vulnerable to oscillation. Since the hard samples might be emphasized by many particles simultaneously, they could all cancel each other out (e.g., $x_2$ and $x_5$).

In contrast, metric learning with the MVP matching based loss objective can guarantee that each sample selects one exclusive hard positive and negative pairs.

Thus, the learned MVP matching could deal with unbalanced samples according to the adaptive edge weights in the bipartite graph and the adverse optimization, e.g., overtraining and counteracting yielded by other anchors as black arrows shown in Figure 1.(c), would be effectively eliminated.

As a consequence, the convergence rate and recognition performance of metric learning could be improved significantly.


Adaptive Weighting for Positives and Negatives

We divide the complete bipartite graph $G(U,V,E)$ into two bipartite graphs $G_P$ and $G_N$ for positive and negative pairs respectively.

An adaptive weight $M_{ij}^+$ and $M_{ij}^−$ of each edge $(i, j)$ in $G_P$ and $G_N$:

M_{ij}^+(x_i, x_j; f)=


1,\quad&\mbox{if}\quad ||f(x_i)-f(x_j)||_2^2\gt\alpha\\
0,&\mbox{if}\quad ||f(x_i)-f(x_j)||_2^2\le\alpha

where $\alpha$ is a learnable variable.

This function would prevent the overtraining for positive samples in contrastive loss, which demands similar pairs gather as close as possible.

M_{ij}^-(x_i, x_j; f)=


1,\quad&\mbox{if}\quad ||f(x_i)-f(x_j)||_2^2\lt\beta\\
0,&\mbox{if}\quad ||f(x_i)-f(x_j)||_2^2\ge\beta

where $\beta = \epsilon + \alpha$.

This function penalizes the dissimilar pairs within the margin $\beta$ and ignores the others. (prevent the counteracting)

MVP Matching for Hard Samples Pairs

We define the matching variables $T_{ij}^+$ and $T_{ij}^−$ for the edge $(i, j)$ on these two weighted bipartite graphs $G_P$ and $G_N$:

1,\quad&\mbox{indicates a matching pair}\\
0,&\mbox{means unmatching one}

where $t_{ij}\in T^{+(or-)}$

$$\max_{T_{ij}\ge 0}\sum_{i,j=1}^nT_{ij}^{+(or-)}M_{ij}^{+(or-)}$$


Kuhn-Munkres assignment (KA) algorithm [8]

Definition 3.1. Feasible labeling :

$$l:\forall x_i, x_j\in U\cup V\rightarrow R$$

$$\mbox{s.t.}\quad M_{ij}\le l(x_i)+l(x_j)$$

Definition 3.2. Equality graph :

$$G_l=\lbrace\forall (i, j):M_{ij}=l(x_i)+l(x_j)\rbrace$$

Theorem 3.3. If $l$ is a feasible labeling function and $T$ is an arbitrary perfect matching (i.e., one-to-one matching) in the equality graph $G_l$, then $T$ must be a maximum-value perfect (MVP) matching $T^*$.

Proof: For $T$ in $G(U, V, E)$:

$$\sum_{i, j}^{(x_i, x_j)\in T}M_{ij}\le\sum_i^{x_i\in U}l(x_i)+\sum_j^{x_j\in V}l(x_j)$$

Only when $T$ is a matching in $G_l$:

$$\sum_{i, j}^{(x_i, x_j)\in T^*}M_{ij}=\sum_i^{x_i\in U}l(x_i)+\sum_j^{x_j\in V}l(x_j)$$

Batch-wise MVP Matching based Loss

L(x_i, x_j; f)&=L^++L^-\\


1,\quad &\mbox{if samples $x_i$ and $x_j$ are deemed similar}\\


$$\frac{\partial L}{\partial f(x_i)}=\sum_{j=1}^n2(f(x_i)-f(x_j))(\delta_{ij}^+y_{ij}T_{ij}^+-\delta_{ij}^-(1-y_{ij})T_{ij}^-)$$

$$\frac{\partial L}{\partial f(x_j)}=-\sum_{i=1}^n2(f(x_i)-f(x_j))(\delta_{ij}^+y_{ij}T_{ij}^+-\delta_{ij}^-(1-y_{ij})T_{ij}^-)$$

$\alpha$ is a learnable variable:

$$\frac{\partial L}{\partial \alpha}=\sum_{i,j=1}^n[-y_{ij}\delta_{ij}^++(1-y_{ij})\delta_{ij}^-]$$


[1] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 539–546. IEEE, 2005. [link]

[2] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 11(Mar):1109–1135, 2010. [link]

[3] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard S¨ackinger, and Roopak Shah. Signature verification using a “siamese” time delay neural network. In Advances in neural information processing systems, pages 737–744, 1994. [link]

[4] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017. [link]

[5] Kihyuk Sohn. Improved deep metric learning with multiclass n-pair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865, 2016. [link]

[6] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4004–4012, 2016. [link]

[7] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 403–412, 2017. [link]

[8] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955. [link]

-------------End of this articleThanks for reading-------------