Introduction
Most attention modules are trained in a weakly supervised manner with only the final objective, e.g., the triplet loss or classification loss in the person ReID task.
Since this supervision is not specifically designed for the attention module, the gradients of this weak supervisory signal may vanish during back-propagation.
Attention maps learned in such a manner are not always “transparent” in their meaning, and they lack discriminative ability and robustness.
Redundant and misleading attention maps can hardly be corrected without a direct and appropriate supervisory signal.
During training, the quality of the attention can only be evaluated qualitatively by human end-users, who must examine the attention maps one by one, which is labor-intensive and inefficient.
We learn the attention with a critic that measures the attention quality and provides a powerful supervisory signal to guide the learning process.
Since the most effective evaluation indicators, e.g., the gain of the attention model over the basic network, are usually non-differentiable, we jointly train our attention agent and critic in a reinforcement-learning manner: the agent produces the visual attention, while the critic analyzes the gain from the attention and guides the agent to maximize it.
We design spatial- and channel-wise attention models with our critic module.
Approach
Self-critical Attention Learning
- Given the input image $I$ as the state, the feature maps $X$ extracted by the basic network $F$ are
$$
X=F(I|\psi)
$$
where $\psi$ denotes the parameters of the basic network.
- The attention maps $A$ based on the feature maps $X$ are
$$
A=A(X|\theta)
$$
where $\theta$ denotes the parameters of the attention agent $A$.
- The value $V$ estimated by the critic module $C$ is formulated as
$$
V=C(X, A|\phi)
$$
where $\phi$ denotes the parameters of the critic network.
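To make this concrete, here is a minimal PyTorch sketch of such a critic. The fusion of $X$ and $A$ (element-wise product followed by global pooling) and the hidden width are assumptions for illustration, not the paper's exact architecture.
```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Estimates a scalar value V = C(X, A | phi).

    Illustrative sketch only: the fusion of X and A (element-wise
    product) and the hidden width are assumptions, not the paper's
    exact architecture.
    """
    def __init__(self, in_channels: int, hidden: int = 256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),             # scalar value estimate
        )

    def forward(self, x: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        attended = x * a                      # attention-weighted features
        v = self.mlp(self.pool(attended).flatten(1))
        return v.squeeze(1)                   # shape: (batch,)
```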
- The classification reward $R_c$ is defined as
$$
R_c=
\begin{cases}
1,\quad &y_i^c=y_i^p\\
0,&y_i^c\ne y_i^p
\end{cases}
$$
where $y_i^p$ denotes the label predicted from the attention-based features of person $i$, and $y_i^c$ is the ground-truth classification label.
- The amelioration reward $R_a$ is defined as
$$
R_a=
\begin{cases}
1,\quad &p^k(A_i, X_i)\gt p^k(X_i)\\
0,&p^k(A_i, X_i)\le p^k(X_i)
\end{cases}
$$
where $p^k(A_i, X_i)$ and $p^k(X_i)$ denote the predicted probability of the true class with and without the attention maps, respectively.
- The final reward $R$ of the attention model is
$$R=R_c+R_a$$
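Given the classifier outputs with and without attention, the reward is simple to compute. A sketch directly following the two definitions above (the function name and tensor shapes are assumed):
```python
import torch

def attention_reward(logits_att: torch.Tensor,
                     logits_base: torch.Tensor,
                     labels: torch.Tensor) -> torch.Tensor:
    """Computes R = R_c + R_a per sample.

    logits_att  -- classifier outputs from attention-weighted features
    logits_base -- classifier outputs from the plain backbone features
    labels      -- ground-truth identity labels, shape (batch,)
    """
    # R_c: 1 if the attention-based prediction is correct, else 0
    r_c = (logits_att.argmax(dim=1) == labels).float()
    # R_a: 1 if attention raises the probability of the true class
    idx = labels.unsqueeze(1)
    p_att = logits_att.softmax(dim=1).gather(1, idx).squeeze(1)
    p_base = logits_base.softmax(dim=1).gather(1, idx).squeeze(1)
    r_a = (p_att > p_base).float()
    return r_c + r_a
```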
Attention Agent
Spatial Attention
$$A^s=\sigma(W_2^s\max(0, W_1^s\overline X))$$
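A minimal PyTorch sketch of this two-layer design, assuming $\overline X$ is the feature map averaged over the channel axis, $W_1^s$ and $W_2^s$ are 1×1 convolutions, and $\sigma$ is the sigmoid; the hidden width is also an assumption:
```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """A^s = sigma(W2 . relu(W1 . Xbar)).

    Sketch only: we assume Xbar is X averaged over the channel axis,
    W1/W2 are 1x1 convolutions, and the hidden width is a free choice.
    """
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.w1 = nn.Conv2d(1, hidden, kernel_size=1)
        self.w2 = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xbar = x.mean(dim=1, keepdim=True)     # collapse channels
        return torch.sigmoid(self.w2(torch.relu(self.w1(xbar))))  # (B,1,H,W)
```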
Channel-wise Attention
$$A^c=\sigma(W_2^c\max(0, W_1^cX_{pool}))$$
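This mirrors the familiar squeeze-and-excitation pattern, with $X_{pool}$ the globally average-pooled feature vector. A sketch, where the reduction ratio is an assumed hyper-parameter:
```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """A^c = sigma(W2 . relu(W1 . X_pool)), squeeze-and-excitation style.

    Sketch only: the reduction ratio r is an assumed hyper-parameter.
    """
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)    # X_pool: (B, C, 1, 1)
        self.w1 = nn.Linear(channels, channels // r)
        self.w2 = nn.Linear(channels // r, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.pool(x).flatten(1)            # (B, C)
        a = torch.sigmoid(self.w2(torch.relu(self.w1(s))))
        return a.view(x.size(0), -1, 1, 1)     # broadcastable over H, W
```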
Stacked Attention Model
We stack five attention modules on the ResNet-50 backbone; a sketch of one plausible placement follows.
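Reusing the ChannelAttention sketch above, one plausible layout puts a module after the stem and one after each of the four residual stages; the paper's exact insertion points are not given here, so this placement is an assumption:
```python
import torch.nn as nn
from torchvision.models import resnet50

class AttentiveResNet50(nn.Module):
    """ResNet-50 with five attention modules (placement assumed):
    one after the stem and one after each of the four residual stages.
    Reuses the ChannelAttention sketch above.
    """
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = nn.ModuleList([net.layer1, net.layer2,
                                     net.layer3, net.layer4])
        self.attn = nn.ModuleList(ChannelAttention(c)
                                  for c in [64, 256, 512, 1024, 2048])

    def forward(self, x):
        x = self.stem(x)
        x = x * self.attn[0](x)                # attention after the stem
        for stage, attn in zip(self.stages, self.attn[1:]):
            x = stage(x)
            x = x * attn(x)                    # re-weight stage outputs
        return x
```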
Optimization
Triplet loss
$$J_{tri}(\psi, \theta)=\frac1N\sum_{i=1}^N[\|f_i-f_i^+\|_2^2-\|f_i-f_i^-\|_2^2+m]_+$$
where $f_i$, $f_i^+$, and $f_i^-$ denote the features of the anchor, the positive, and the negative sample, $m$ is the margin, and $[\cdot]_+=\max(0,\cdot)$.
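A direct translation of $J_{tri}$, assuming pre-mined triplets (in practice, ReID pipelines typically mine hard triplets within a batch):
```python
import torch

def triplet_loss(f: torch.Tensor, f_pos: torch.Tensor,
                 f_neg: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """J_tri with squared Euclidean distances and hinge [.]_+.

    f, f_pos, f_neg -- anchor / positive / negative features, shape (N, D).
    margin=0.3 is a common ReID choice, assumed here.
    """
    d_pos = (f - f_pos).pow(2).sum(dim=1)
    d_neg = (f - f_neg).pow(2).sum(dim=1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```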
Cross-entropy loss with label smoothing regularization [1]
$$J_{cls}(\psi, \theta)=-\frac1N\sum_{i=1}^N\sum_{k=1}^K\log(p_i^k)\Big((1-\epsilon)y_i^k+\frac{\epsilon}{K}\Big)$$
where $K$ is the number of identities, $y_i^k$ the one-hot ground-truth label, $p_i^k$ the predicted probability, and $\epsilon$ the smoothing parameter.
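A sketch of this loss; $\epsilon=0.1$ follows the common choice from [1] and is an assumption here:
```python
import torch
import torch.nn.functional as F

def smoothed_ce(logits: torch.Tensor, labels: torch.Tensor,
                eps: float = 0.1) -> torch.Tensor:
    """J_cls: cross-entropy against labels smoothed toward uniform [1]."""
    k = logits.size(1)                          # number of identities K
    log_p = F.log_softmax(logits, dim=1)
    one_hot = F.one_hot(labels, num_classes=k).float()
    target = (1.0 - eps) * one_hot + eps / k    # smoothed label
    return -(target * log_p).sum(dim=1).mean()
```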
Since the classification loss is sensitive to the scales of the features, we add a batch-norm (BN) layer before the classification loss to normalize them.
The critic loss, which drives the agent to maximize the value estimated by the critic:
$$J_{cri}(\theta)=-V_{\phi}^{A_{\theta}}(X, A)$$
The mean squared error (MSE) loss, which fits the critic's value estimate to the observed reward:
$$J_{mse}(\phi)=(V_{\phi}^{A_{\theta}}(X, A)-R)^2$$
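Putting the optimization together, one possible training step is sketched below. The gradient routing follows the loss signatures above ($J_{mse}$ updates $\phi$, $J_{cri}$ updates $\theta$); the variable names (`x`, `a`, `logits_att`, `j_tri`, ...) and the two-optimizer schedule are assumptions, not the paper's exact recipe.
```python
# One training step, sketched. Assumed names: critic, x (features),
# a (attention maps), logits_att/logits_base, labels, j_tri, j_cls,
# and separate optimizers critic_opt (phi) / agent_opt (psi, theta).
r = attention_reward(logits_att, logits_base, labels)

# J_mse: regress the critic toward the observed reward (updates phi only;
# detached inputs keep gradients out of the backbone and agent).
j_mse = (critic(x.detach(), a.detach()) - r).pow(2).mean()
critic_opt.zero_grad()
j_mse.backward()
critic_opt.step()

# J_cri: freeze the critic and push the agent to maximize its value,
# jointly with the supervised losses.
for p in critic.parameters():
    p.requires_grad_(False)
j_cri = -critic(x, a).mean()
loss = j_tri + j_cls + j_cri
agent_opt.zero_grad()
loss.backward()
agent_opt.step()
for p in critic.parameters():
    p.requires_grad_(True)
```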
Reference
[1] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016. [link]