<Show, Attend and Tell: Neural Image Caption Generation with Visual Attention>
1. This paper introduces "hard attention". The authors present two attention-based image caption generators under a common framework: 1) a "soft" deterministic attention mechanism trainable by standard back-propagation, and 2) a "hard" stochastic attention mechanism trainable by maximizing an approximate variational lower bound, or equivalently by REINFORCE.
2. We first introduce their common framework. The model takes a single raw image and generates a caption $y$ encoded as a sequence of 1-of-K encoded words:

$$y = \{y_1, \dots, y_C\}, \quad y_i \in \mathbb{R}^K$$

where $K$ is the size of the vocabulary and $C$ is the length of the caption.
3. We use a CNN to extract a set of feature vectors, which we call annotation vectors. The extractor produces $L$ vectors, each a $D$-dimensional representation of one part of the image:

$$a = \{a_1, \dots, a_L\}, \quad a_i \in \mathbb{R}^D$$

The features are taken from a lower convolutional layer rather than a fully connected layer, so each annotation vector corresponds to a spatial location of the image.
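As a rough sketch (not the authors' code), the toy encoder below stands in for the VGG features used in the paper; the layer sizes and the 14x14 grid are assumptions chosen so that $L = 196$ and $D = 512$:

```python
import torch
import torch.nn as nn

# Toy stand-in for the convolutional encoder; any CNN whose output is a
# D-channel feature map can be flattened into annotation vectors the same way.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 512, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((14, 14)),   # force a 14x14 spatial grid
)

images = torch.randn(8, 3, 224, 224)           # a batch of raw images
feature_map = encoder(images)                  # (B, D, 14, 14) with D = 512
B, D, H, W = feature_map.shape
# Flatten the spatial grid: L = H * W = 196 annotation vectors a_i, each in R^D.
annotations = feature_map.view(B, D, H * W).permute(0, 2, 1)   # (B, L, D)
print(annotations.shape)                       # torch.Size([8, 196, 512])
```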
4. We use an LSTM to produce the caption, generating one word at every time step conditioned on a context vector $\hat{z}_t$, the previous hidden state $h_{t-1}$ and the previously generated word $y_{t-1}$. The structure is:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,\,n} \begin{pmatrix} E y_{t-1} \\ h_{t-1} \\ \hat{z}_t \end{pmatrix}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \tanh(c_t)$$

where $i_t, f_t, o_t, g_t$ are the input, forget, output and candidate gates, $c_t$ and $h_t$ are the memory and hidden state, $E \in \mathbb{R}^{m \times K}$ is a word embedding matrix, and $T_{s,t}: \mathbb{R}^s \to \mathbb{R}^t$ is a simple affine transformation with parameters that are learned.
(The paper illustrates this step with a diagram of the LSTM cell.)
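A minimal sketch of one decoder step, assuming embedding size $m$, hidden size $n$ and annotation size $D$; the single linear layer plays the role of the affine transformation $T$, producing all four gate pre-activations at once:

```python
import torch
import torch.nn as nn

m, n, D = 256, 512, 512   # embedding, hidden and annotation dimensions (assumed)

# One affine map produces the pre-activations of all four gates at once.
T = nn.Linear(m + n + D, 4 * n)

def lstm_step(Ey_prev, h_prev, c_prev, z_hat):
    """One step of the attention-conditioned LSTM decoder."""
    gates = T(torch.cat([Ey_prev, h_prev, z_hat], dim=-1))   # (B, 4n)
    i, f, o, g = gates.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c = f * c_prev + i * g        # memory update
    h = o * torch.tanh(c)         # new hidden state
    return h, c

B = 8
h, c = lstm_step(torch.randn(B, m), torch.randn(B, n),
                 torch.randn(B, n), torch.randn(B, D))
```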
5. In simple terms, the context vector $\hat{z}_t$ is a dynamic representation of the relevant part of the image input at time $t$. We define a mechanism $\phi$ that computes $\hat{z}_t$ from the annotation vectors $a_i$, $i = 1, \dots, L$, corresponding to the features extracted at different image locations. For each location $i$, the mechanism generates a positive weight $\alpha_{t,i}$ giving its relative importance. The weight $\alpha_{t,i}$ is computed by:

$$e_{t,i} = f_{att}(a_i, h_{t-1}), \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{L} \exp(e_{t,k})}$$

where $f_{att}$ is a multilayer perceptron conditioned on the previous hidden state $h_{t-1}$.
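A sketch of $f_{att}$ as a small additive-attention MLP scoring each annotation vector against the previous hidden state; the layer sizes and the particular MLP shape are assumptions, not taken from the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, n, attn_dim = 512, 512, 256   # assumed sizes

# f_att: scores each annotation vector a_i against the previous hidden state.
proj_a = nn.Linear(D, attn_dim)
proj_h = nn.Linear(n, attn_dim)
score  = nn.Linear(attn_dim, 1)

def attention_weights(annotations, h_prev):
    """annotations: (B, L, D), h_prev: (B, n) -> alpha: (B, L)"""
    e = score(torch.tanh(proj_a(annotations) + proj_h(h_prev).unsqueeze(1)))  # (B, L, 1)
    alpha = F.softmax(e.squeeze(-1), dim=-1)   # positive weights summing to 1 over locations
    return alpha

alpha = attention_weights(torch.randn(8, 196, D), torch.randn(8, n))
print(alpha.sum(dim=-1))   # each row sums to 1
```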
6. Once the weights are computed, the context vector $\hat{z}_t$ is computed by

$$\hat{z}_t = \phi(\{a_i\}, \{\alpha_{t,i}\})$$

where $\phi$ is a function that returns a single vector given the set of annotation vectors and their corresponding weights. The two attention variants below differ only in how $\phi$ is defined.
7. The initial memory state and hidden state of the LSTM are predicted from an average of the annotation vectors fed through two separate MLPs ($f_{init,c}$ and $f_{init,h}$):

$$c_0 = f_{init,c}\Big(\frac{1}{L}\sum_{i=1}^{L} a_i\Big), \qquad h_0 = f_{init,h}\Big(\frac{1}{L}\sum_{i=1}^{L} a_i\Big)$$
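A minimal sketch, assuming a single linear layer with tanh for each initialisation MLP (the exact architecture is an assumption):

```python
import torch
import torch.nn as nn

D, n = 512, 512   # annotation and hidden sizes (assumed)

# f_{init,c} and f_{init,h}: two separate MLPs fed the mean annotation vector.
f_init_c = nn.Sequential(nn.Linear(D, n), nn.Tanh())
f_init_h = nn.Sequential(nn.Linear(D, n), nn.Tanh())

annotations = torch.randn(8, 196, D)     # (B, L, D)
a_mean = annotations.mean(dim=1)         # (1/L) * sum_i a_i
c0 = f_init_c(a_mean)                    # initial memory state
h0 = f_init_h(a_mean)                    # initial hidden state
```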
8. We use a deep output layer to compute the output word probability given the LSTM state, the context vector and the previous word:

$$p(y_t \mid a, y_1^{t-1}) \propto \exp\big(L_o (E y_{t-1} + L_h h_t + L_z \hat{z}_t)\big)$$

where $L_o \in \mathbb{R}^{K \times m}$, $L_h \in \mathbb{R}^{m \times n}$, $L_z \in \mathbb{R}^{m \times D}$ and $E$ are learned parameters.
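A sketch of this deep output layer with the same dimension names; the sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, m, n, D = 10000, 256, 512, 512   # vocab, embedding, hidden, annotation sizes (assumed)

E   = nn.Embedding(K, m)            # word embedding matrix
L_h = nn.Linear(n, m, bias=False)
L_z = nn.Linear(D, m, bias=False)
L_o = nn.Linear(m, K, bias=False)

def word_logits(y_prev, h, z_hat):
    """Deep output layer: unnormalized scores over the vocabulary for the next word."""
    return L_o(E(y_prev) + L_h(h) + L_z(z_hat))   # (B, K)

B = 8
logits = word_logits(torch.randint(0, K, (B,)), torch.randn(B, n), torch.randn(B, D))
log_probs = F.log_softmax(logits, dim=-1)         # log p(y_t | a, y_{t-1})
```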
9. Stochastic "Hard" Attention.
We define a location variable $s_t$ as where the model decides to focus attention when generating the $t$-th word. $s_{t,i}$ is an indicator one-hot variable set to 1 when the $i$-th location (out of $L$) is the one used to extract visual features. We can assign a multinoulli distribution parametrized by $\{\alpha_{t,i}\}$ and view $\hat{z}_t$ as a random variable:

$$p(s_{t,i} = 1 \mid s_{j<t}, a) = \alpha_{t,i}, \qquad \hat{z}_t = \sum_{i} s_{t,i}\, a_i$$
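A minimal sketch of hard attention at one time step, sampling the attended location from a categorical (multinoulli) distribution; the shapes are the assumed ones from earlier snippets:

```python
import torch

# alpha: (B, L) attention weights, i.e. the multinoulli over the L locations.
alpha = torch.softmax(torch.randn(8, 196), dim=-1)

dist = torch.distributions.Categorical(probs=alpha)
s = dist.sample()                          # (B,) index of the attended location
log_p_s = dist.log_prob(s)                 # log p(s_t | a), needed later for REINFORCE

annotations = torch.randn(8, 196, 512)     # (B, L, D)
# Hard attention: the context vector is the single selected annotation vector.
z_hat = annotations[torch.arange(8), s]    # (B, D)
```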
We define a new objective function $L_s$ that is a variational lower bound on the marginal log-likelihood $\log p(y \mid a)$ of observing the sequence of words $y$ given image features $a$:

$$L_s = \sum_{s} p(s \mid a) \log p(y \mid s, a) \le \log \sum_{s} p(s \mid a)\, p(y \mid s, a) = \log p(y \mid a)$$

$$\frac{\partial L_s}{\partial W} = \sum_{s} p(s \mid a) \left[ \frac{\partial \log p(y \mid s, a)}{\partial W} + \log p(y \mid s, a) \frac{\partial \log p(s \mid a)}{\partial W} \right]$$

(The gradient follows from the product rule $[f g]' = f'g + fg'$ together with the log-derivative identity $\partial p(s \mid a)/\partial W = p(s \mid a)\, \partial \log p(s \mid a)/\partial W$, which is the step that is easy to miss.)
The equation above suggests a Monte Carlo based sampling approximation of the gradient with respect to the model parameters. This can be done by sampling the locations $s_t$ from the multinoulli distribution defined above:

$$\tilde{s}_t^n \sim \mathrm{Multinoulli}_L(\{\alpha_i^n\})$$

$$\frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{\partial \log p(y \mid \tilde{s}^n, a)}{\partial W} + \log p(y \mid \tilde{s}^n, a) \frac{\partial \log p(\tilde{s}^n \mid a)}{\partial W} \right]$$
A moving average baseline is used to reduce the variance of the Monte Carlo estimator of the gradient. Upon seeing the $k$-th mini-batch, the moving average baseline is estimated as an accumulated sum of the previous log-likelihoods with exponential decay:

$$b_k = 0.9 \times b_{k-1} + 0.1 \times \log p(y \mid \tilde{s}_k, a)$$
To further reduce the estimator variance, an entropy term $H[s]$ on the multinoulli distribution is added. In addition, with probability 0.5 for a given image, the sampled attention location $\tilde{s}$ is set to its expected value $\alpha$. Both techniques improve the robustness of the stochastic attention learning algorithm.
The final learning rule is:

$$\frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{\partial \log p(y \mid \tilde{s}^n, a)}{\partial W} + \lambda_r \big(\log p(y \mid \tilde{s}^n, a) - b\big) \frac{\partial \log p(\tilde{s}^n \mid a)}{\partial W} + \lambda_e \frac{\partial H[\tilde{s}^n]}{\partial W} \right]$$

where $\lambda_r$ and $\lambda_e$ are hyperparameters set by cross-validation. This is equivalent to the REINFORCE learning rule, where the reward for the attention choice is proportional to the log-likelihood of the target sentence under the sampled attention trajectory.
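A minimal sketch of a surrogate loss whose gradient reproduces this estimator with one sample per image; `lambda_r` and `lambda_e` correspond to $\lambda_r$ and $\lambda_e$, the baseline update follows the exponential decay above, and the function and its argument names are illustrative, not from the paper's code:

```python
import torch

def hard_attention_surrogate(log_p_y, log_p_s, entropy, baseline,
                             lambda_r=1.0, lambda_e=0.01):
    """Surrogate loss for the hard-attention learning rule (one sample per image).

    log_p_y : (B,) log p(y | s~, a) for the sampled attention trajectory
    log_p_s : (B,) sum over time of log p(s~_t | a)
    entropy : (B,) sum over time of H[s_t]
    baseline: scalar moving average of past log-likelihoods
    """
    reward = log_p_y.detach() - baseline               # centred reward, treated as a constant
    surrogate = log_p_y + lambda_r * reward * log_p_s + lambda_e * entropy
    return -surrogate.mean()                           # minimise the negative to ascend L_s

# The moving-average baseline is updated after each mini-batch, e.g.:
# baseline = 0.9 * baseline + 0.1 * log_p_y.mean().item()
```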
10. Deterministic "Soft" Attention
Learning stochastic attention requires sampling the attention location $s_t$ each time; instead, we can take the expectation of the context vector directly,

$$\mathbb{E}_{p(s_t \mid a)}[\hat{z}_t] = \sum_{i=1}^{L} \alpha_{t,i}\, a_i$$

and formulate a deterministic attention model by computing a soft attention weighted annotation vector:

$$\phi(\{a_i\}, \{\alpha_{t,i}\}) = \sum_{i=1}^{L} \alpha_{t,i}\, a_i$$
The whole model is smooth and differentiable under the deterministic attention, so end-to-end learning is trivial with standard back-propagation.
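In code, $\phi$ becomes a single weighted sum that replaces the sampling step of the hard-attention snippet above; a short sketch with the same assumed shapes:

```python
import torch

alpha = torch.softmax(torch.randn(8, 196), dim=-1)   # (B, L) attention weights
annotations = torch.randn(8, 196, 512)                # (B, L, D) annotation vectors

# phi is the expectation of the context vector: a weighted sum of annotation vectors.
z_hat = torch.einsum('bl,bld->bd', alpha, annotations)   # (B, D)

# This operation is differentiable, so gradients flow back through both alpha
# and the annotation vectors with ordinary back-propagation.
```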
11. We define the NWGM (normalized weighted geometric mean) for the softmax word prediction:

$$\mathrm{NWGM}[p(y_t = k \mid a)] = \frac{\prod_i \exp(n_{t,k,i})^{p(s_{t,i}=1 \mid a)}}{\sum_j \prod_i \exp(n_{t,j,i})^{p(s_{t,i}=1 \mid a)}} = \frac{\exp\big(\mathbb{E}_{p(s_t \mid a)}[n_{t,k}]\big)}{\sum_j \exp\big(\mathbb{E}_{p(s_t \mid a)}[n_{t,j}]\big)}$$

where $n_t = L_o(E y_{t-1} + L_h h_t + L_z \hat{z}_t)$ and $n_{t,k,i}$ denotes $n_{t,k}$ computed with $\hat{z}_t$ set to $a_i$. The NWGM can therefore be computed with a single forward pass using the expected context vector, and it approximates $\mathbb{E}[p(y_t = k \mid a)]$, so the soft attention model approximately maximizes the marginal likelihood.
12. The model is trained end-to-end by minimizing the following penalized negative log-likelihood:

$$L_d = -\log p(y \mid a) + \lambda \sum_{i=1}^{L} \Big(1 - \sum_{t=1}^{C} \alpha_{t,i}\Big)^2$$

which encourages $\sum_t \alpha_{t,i} \approx 1$, i.e., the model pays roughly equal attention to every part of the image over the course of generation. Since the weights already satisfy $\sum_i \alpha_{t,i} = 1$ at every step, this is called DOUBLY STOCHASTIC ATTENTION.
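A sketch of this penalized loss, assuming per-step log-probabilities and attention maps collected over the whole caption; the function and tensor names are illustrative:

```python
import torch

def doubly_stochastic_loss(log_probs, targets, alphas, lam=1.0):
    """Penalized NLL: -log p(y|a) + lam * sum_i (1 - sum_t alpha_{t,i})^2.

    log_probs: (B, C, K) per-step log-probabilities over the vocabulary
    targets  : (B, C) ground-truth word indices
    alphas   : (B, C, L) attention weights for every time step and location
    """
    # Negative log-likelihood of the ground-truth caption.
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).sum(dim=1)   # (B,)
    # Penalty pushing the per-location attention mass summed over time towards 1.
    attn_penalty = ((1.0 - alphas.sum(dim=1)) ** 2).sum(dim=1)                  # (B,)
    return (nll + lam * attn_penalty).mean()
```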