Full text link:
http://openaccess.thecvf.com/content_ICCV_2019/html/Liu_Deep_Reinforcement_Active_Learning_for_Human-in-the-Loop_Person_Re-Identification_ICCV_2019_paper.html

Introduction

Most existing supervised person Re-ID approaches employ a train-once-and-deploy scheme, i.e., a large amount of pre-labelled data is fed into the training phase all at once.

However, in practice this assumption is hard to satisfy:

Pairwise pedestrian data is prohibitively expensive to collect, since it is unlikely that a large number of pedestrians reappear in other camera views.
The increasing number of camera views amplifies the difficulty of finding the same person across multiple camera views.

Solutions:

Unsupervised learning algorithms

Unsupervised learning based Re-ID models are inherently weaker than supervised learning based models, which compromises Re-ID effectiveness in any practical deployment.

Semi-supervised learning scheme

These models are still based on a strong assumption that parts of the identities (e.g. one third of the training set) are fully labelled for every camera view.

-> Reinforcement Learning + Active Learning:

human-in-the-loop (HITL) model learning process [1]

A step-by-step sequential active learning process is adopted by exploring human selective annotations on a much smaller pool of samples for model learning.

The data cumulatively labelled through human binary verification are used to update the model and improve Re-ID performance.

Such an approach to model learning is naturally suited for reinforcement learning together with active learning, the focus of this work.

Methodology

Base CNN Network

$$L_{total}=L_{cross-entropy}+L_{triplet}$$
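As a rough illustration of how the two terms combine, the following pure-Python sketch computes a softmax cross-entropy term and a margin-based triplet term for a single sample and adds them. The notes only state the sum; the hinge form of the triplet term, the margin value of 0.3, and all function names are assumptions made for illustration.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (identity classification term)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def triplet(d_ap, d_an, margin=0.3):
    """Hinge triplet term on anchor-positive and anchor-negative distances.

    The margin of 0.3 is an assumed value, not taken from the notes.
    """
    return max(0.0, d_ap - d_an + margin)

def total_loss(logits, label, d_ap, d_an, margin=0.3):
    """L_total = L_cross-entropy + L_triplet."""
    return cross_entropy(logits, label) + triplet(d_ap, d_an, margin)
```

When the negative is already farther away than the positive by more than the margin, the triplet term vanishes and only the classification term remains.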

A Deep Reinforced Active Learner - An Agent

As each query instance arrives, we treat its $n_s$ nearest neighbors as the unlabelled gallery pool.

Action

An action selects one instance from the unlabelled gallery pool, according to the policy:

$$\pi(A_t|S_t)$$

Once $A_t=g_k$ is performed, the agent is unable to choose it again in the subsequent steps.

The termination criterion of this process depends on a pre-defined $K_{max}$ which restricts the maximal annotation amount for each query anchor.
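The two constraints above (no repeated selection, and at most $K_{max}$ annotations per query) can be sketched as a small loop. The `choose` callable is a hypothetical stand-in for the policy network, and the scores in the toy example are made up.

```python
def annotate_query(choose, n_candidates, k_max):
    """Pick up to k_max distinct gallery indices for one query.

    `choose` is a hypothetical stand-in for the policy: it receives the
    list of still-available indices and returns one of them.
    """
    available = list(range(n_candidates))
    selected = []
    while available and len(selected) < k_max:
        k = choose(available)
        available.remove(k)   # g_k can never be chosen again
        selected.append(k)
    return selected

# Toy policy: prefer the candidate with the highest (made-up) score.
scores = [0.2, 0.9, 0.5, 0.7]
picked = annotate_query(lambda avail: max(avail, key=lambda i: scores[i]),
                        n_candidates=len(scores), k_max=3)
```

In the actual method the state changes after every annotation, so the policy would be re-evaluated on the new state at each step rather than on fixed scores.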

State

At each discrete time step $t$, the environment provides an observation state $S_t$ which reveals the instances’ relationship, and receives a response from the agent by selecting an action $A_t$:

Mahalanobis distance

$$d(x, y)=\mbox{Mahalanobis}(x, y)=(x-y)^T\Sigma^{-1}(x-y)$$
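A minimal sketch of this quadratic form in pure Python follows. It assumes the inverse covariance matrix $\Sigma^{-1}$ is already available as a list of rows; the function name and the toy matrices are illustrative. Note that the expression as written has no square root, so with the identity matrix it reduces to the squared Euclidean distance.

```python
def mahalanobis(x, y, sigma_inv):
    """Return (x - y)^T Sigma^{-1} (x - y) for plain Python lists.

    sigma_inv is the inverse covariance matrix, given as a list of rows.
    """
    diff = [a - b for a, b in zip(x, y)]
    # row . diff gives one entry of Sigma^{-1} (x - y)
    projected = [sum(s * d for s, d in zip(row, diff)) for row in sigma_inv]
    return sum(d * p for d, p in zip(diff, projected))

identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[0.25, 0.0], [0.0, 1.0]]   # first feature has variance 4
d_identity = mahalanobis([1.0, 2.0], [4.0, 6.0], identity)
d_scaled = mahalanobis([1.0, 2.0], [4.0, 6.0], scaled)
```

In the second call, the first feature is down-weighted because its assumed variance is larger, so the same difference contributes less to the distance.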

The k-reciprocal neighbors

$$R(n_i, k)=\lbrace n_j|(n_i\in N(n_j, k))\land(n_j\in N(n_i, k))\rbrace$$

where $N(n_i, k)$ is the top k-nearest neighbors of $n_i$
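The definition can be checked on a toy example. The sketch below works on a precomputed distance matrix and assumes that a node is not counted among its own neighbors; the helper names and the four points on a line are illustrative.

```python
def top_k(dist, i, k):
    """N(n_i, k): indices of the k nearest neighbors of i, excluding i itself."""
    others = [j for j in range(len(dist)) if j != i]
    return set(sorted(others, key=lambda j: dist[i][j])[:k])

def k_reciprocal(dist, i, k):
    """R(n_i, k): neighbors j of i that also have i among their own top k."""
    return {j for j in top_k(dist, i, k) if i in top_k(dist, j, k)}

# Four points on a line; dist[i][j] = |p_i - p_j|.
points = [0.0, 1.0, 3.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]
```

With `k=2`, point 3 (at position 10) lists points 1 and 2 as its nearest neighbors, but neither lists it back, so its k-reciprocal set is empty.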

Sparse similarity graph

The similarity value between every two samples $i,j(i\ne j)$

$$
\mbox{Sim}(i, j)=
\begin{cases}
1-\frac{d(i, j)}{\max_{i, j\in q, g}d(i, j)},\quad&\mbox{if}\quad j\in R(i, k)\\
0,&\mbox{otherwise}
\end{cases}
$$

The similarity value is retained only for the k-reciprocal neighbors of node $n_i$; for all other nodes it is set to zero.
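Putting the distance normalisation and the k-reciprocal filter together, a sparse similarity matrix could be built as follows. This is a sketch under stated assumptions: the distance matrix is precomputed with a positive maximum, a node is not its own neighbor, and the diagonal (not defined in the text, since $i\ne j$) is simply left at zero.

```python
def top_k(dist, i, k):
    """Indices of the k nearest neighbors of i, excluding i itself."""
    others = [j for j in range(len(dist)) if j != i]
    return set(sorted(others, key=lambda j: dist[i][j])[:k])

def sparse_similarity(dist, k):
    """Sim(i, j) = 1 - d(i, j) / max d if j is in R(i, k), else 0."""
    n = len(dist)
    d_max = max(max(row) for row in dist)   # assumed to be positive
    neighbors = [top_k(dist, i, k) for i in range(n)]
    sim = [[0.0] * n for _ in range(n)]     # diagonal left at 0 (i != j in the text)
    for i in range(n):
        for j in neighbors[i]:
            if i in neighbors[j]:           # k-reciprocal check
                sim[i][j] = 1.0 - dist[i][j] / d_max
    return sim

points = [0.0, 1.0, 3.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]
sim = sparse_similarity(dist, k=2)
```

Because the reciprocity test is symmetric in $i$ and $j$, the resulting matrix is symmetric whenever the distance matrix is.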

For a state $S_t$ at time $t$, the policy network selects the optimal action $A_t=g_k$, meaning that the $k$-th instance in the unlabelled gallery pool is sent to a human oracle for annotation. The oracle replies with binary feedback, true or false, against the query:

True Match:

$$y_k^t=1$$

then,

$$\mbox{Sim}(q, g_k)=1$$

$$\mbox{Sim}(q, g_i)=\frac12[\mbox{Sim}(q, g_i)+\mbox{Sim}(g_k, g_i)]$$

False Match:

$$y_k^t=-1$$

then,

$$\mbox{Sim}(q, g_k)=0$$

$$
\mbox{Sim}(q, g_i)=
\begin{cases}
\mbox{Sim}(q, g_i)-\mbox{Sim}(g_k, g_i),\quad&\mbox{if}\quad\mbox{Sim}(q, g_i)\gt\mbox{Sim}(g_k, g_i)\\
0,&\mbox{otherwise}
\end{cases}
$$

The goal of these updates is to pull positives closer together and push negatives farther apart.
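The two feedback rules can be sketched as a single update over the query's row of the similarity graph. The function name, argument layout, and toy values are illustrative assumptions; only the arithmetic follows the equations above.

```python
def update_similarity(sim_q, sim_gk, k, is_match):
    """Return the updated Sim(q, g_i) row after the oracle labels g_k.

    sim_q[i] holds Sim(q, g_i); sim_gk[i] holds Sim(g_k, g_i).
    """
    updated = []
    for i, (s_q, s_k) in enumerate(zip(sim_q, sim_gk)):
        if i == k:
            # The annotated instance itself becomes 1 (true) or 0 (false).
            updated.append(1.0 if is_match else 0.0)
        elif is_match:
            # True match: average with the similarity to the positive g_k.
            updated.append(0.5 * (s_q + s_k))
        else:
            # False match: subtract the similarity to the negative g_k, floor at 0.
            updated.append(s_q - s_k if s_q > s_k else 0.0)
    return updated

sim_q = [0.8, 0.4, 0.6]
sim_gk = [0.0, 0.2, 0.9]   # toy values of Sim(g_0, g_i)
after_true = update_similarity(sim_q, sim_gk, k=0, is_match=True)
after_false = update_similarity(sim_q, sim_gk, k=0, is_match=False)
```

In the false-match case, the third candidate is very similar to the rejected instance (0.9), so its similarity to the query drops to zero.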

The k-reciprocal operation is then applied again, which yields a renewed state $S_{t+1}$.

This process repeats until the maximum annotation amount $K_{max}$ for each query is exhausted.

Reward

Loss for action $\pi$

We use data uncertainty as the objective function of the reinforcement learning policy, i.e., higher uncertainty indicates that the sample is harder to distinguish.

$$R_t=[m+y_k^t(\max_{x_i\in X_p^t}d(x_i, g_k)-\min_{x_j\in X_n^t}d(x_j, g_k))]_+$$
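To make the reward concrete, here is a minimal sketch that evaluates it from lists of distances. It assumes both the positive set $X_p^t$ and the negative set $X_n^t$ are non-empty; the margin value of 0.2 and the function name are illustrative and not taken from the notes.

```python
def reward(y, d_pos, d_neg, margin=0.2):
    """R_t = [m + y * (max d(x_i, g_k) - min d(x_j, g_k))]_+

    y      : +1 for a true match, -1 for a false match
    d_pos  : distances from g_k to the positives X_p^t (non-empty)
    d_neg  : distances from g_k to the negatives X_n^t (non-empty)
    margin : m; 0.2 is an assumed value
    """
    return max(0.0, margin + y * (max(d_pos) - min(d_neg)))
```

For a true match, the reward grows when the selected sample is far from some positive yet close to some negative, which is the hard-positive case. For a false match the sign flips, so hard negatives are rewarded instead, and easy samples receive zero reward.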

All future rewards $(R_{t+1}, R_{t+2}, \cdots)$ are discounted by a factor $\gamma$ per time step:

$$Q^*=\max_{\pi}E[R_t+\gamma R_{t+1}+\gamma^2R_{t+2}+\cdots|\pi, S_t, A_t]$$
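The quantity inside the expectation is a discounted sum of rewards. The sketch below computes that sum for one observed reward sequence only; it does not estimate $Q^*$ itself, which additionally involves the expectation and the maximum over policies. The function name is illustrative.

```python
def discounted_return(rewards, gamma):
    """R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ... for one trajectory."""
    total = 0.0
    for r in reversed(rewards):   # accumulate backwards from the last step
        total = r + gamma * total
    return total
```

With `gamma` close to 0 the agent mostly values the immediate reward, while values close to 1 weight later annotations almost as much as the current one.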

CNN Network Updating

Once enough pairwise labelled data have been collected, the CNN parameters can be updated with the triplet loss, which in turn generates a new initial state for incoming data. By iterating between sample selection and CNN updating, the proposed algorithm can improve quickly.

This process terminates once every image in the training data pool has been browsed once by the DRAL agent.

Conclusion

The key task for the model design becomes how to select more informative samples at a fixed annotation cost.
The DRAL method removes the need for pre-labelling and keeps upgrading the model with progressively collected data.

Reference

[1] Hanxiao Wang, Shaogang Gong, Xiatian Zhu, and Tao Xiang. Human-in-the-loop person re-identification. In ECCV, 2016.

Wei Xie's Blog

Deep Reinforcement Active Learning for Human-In-The-Loop Person Re-Identification