<Neural Machine Translation and Sequence-to-sequence Models: A Tutorial>
1. Machine translation is a special case of seq2seq learning, but the most important one: its techniques can easily be applied to other seq2seq tasks, and it can also learn from other tasks.
2. A translation system includes Modeling, Learning and Search.
3. n-gram Language Models
3.1 Word-by-word Computation of Probabilities
3.2 Count-based n-gram Language Models
However, long word histories rarely appear in the corpus, so we introduce the n-gram assumption, approximating each word's probability using only the previous n−1 words: P(e_t | e_1, ..., e_{t-1}) ≈ P(e_t | e_{t-n+1}, ..., e_{t-1}).
But some n-grams still never appear in the training data, even when n is small, so we need smoothing, for example interpolation with the unigram distribution: P(e_t | e_{t-1}) = (1 − α) P_ML(e_t | e_{t-1}) + α P_ML(e_t).
There are many smoothing techniques, such as context-dependent smoothing coefficients, back-off, and modified distributions; see the paper for reference.
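To make the counting and smoothing concrete, here is a minimal sketch (my own, not code from the tutorial) of a count-based bigram model with linear interpolation; the toy corpus and the value of alpha are made up:

```python
# Count-based bigram LM with simple linear-interpolation smoothing (sketch).
from collections import Counter

corpus = [["<s>", "this", "is", "a", "pen", "</s>"],
          ["<s>", "this", "is", "an", "apple", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i - 1], sent[i]) for sent in corpus for i in range(1, len(sent)))
total_words = sum(unigrams.values())

def p_unigram(w):
    # Maximum-likelihood unigram estimate: c(w) / total word count.
    return unigrams[w] / total_words

def p_bigram_smoothed(w, prev, alpha=0.1):
    # Interpolation: (1 - alpha) * P_ML(w | prev) + alpha * P_ML(w).
    p_ml = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] > 0 else 0.0
    return (1 - alpha) * p_ml + alpha * p_unigram(w)

print(p_bigram_smoothed("pen", "a"))    # seen bigram: high probability
print(p_bigram_smoothed("apple", "a"))  # unseen bigram: still nonzero thanks to smoothing
```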
3.3 Evaluation of Language Models
As with most machine learning tasks, language models need training data, development (validation) data, and test data.
We usually use the log-likelihood on the test set as a measure: log P(E_test) = Σ_{E ∈ E_test} log P(E).
Another common measure of language model accuracy is perplexity: ppl(E_test) = exp(−(log P(E_test)) / length(E_test)).
An intuitive explanation of the perplexity is “how confused is the model about its decision?” More accurately, it expresses the value “if we randomly picked words from the probability distribution calculated by the language model at each time step, on average how many words would it have to pick to get the correct one?”
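A small sketch of how the two measures are computed from per-word probabilities (the numbers here are invented, just to show the formulas in action):

```python
# Log-likelihood and perplexity from per-word probabilities (sketch).
import math

# Made-up probabilities a language model might assign to each word of two test sentences.
test_sentences = [[0.1, 0.2, 0.05], [0.3, 0.1]]

log_likelihood = sum(math.log(p) for sent in test_sentences for p in sent)
num_words = sum(len(sent) for sent in test_sentences)

# ppl(E_test) = exp(-log P(E_test) / length(E_test))
perplexity = math.exp(-log_likelihood / num_words)

print(log_likelihood, perplexity)
```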
3.4 Handling Unknown Words
We usually assume that unknown words follow a uniform distribution over some fixed number of possible words.
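One common way to fold such a uniform unknown-word model into the n-gram probability is interpolation; the symbols λ_unk (mixing weight) and N_unk (assumed number of possible words) below are my own notation, not necessarily the tutorial's:

```latex
% Mixing a uniform unknown-word distribution into the model (sketch).
P(e_t \mid e_{t-n+1}^{t-1}) \approx
  (1 - \lambda_{\mathrm{unk}})\, P_{\mathrm{model}}(e_t \mid e_{t-n+1}^{t-1})
  + \lambda_{\mathrm{unk}} \cdot \frac{1}{N_{\mathrm{unk}}}
```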
3.5 Further Reading (long-distance dependencies, large-scale training, ...)
3.6 Exercise
4. Log-linear Language Models (trained with a loss function)
4.1 Model Formulation: calculate scores from context features, then apply the softmax to turn scores into probabilities.
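A quick numpy sketch of this scoring step (the dimensions and values are illustrative, not taken from the tutorial): features x for the context, score vector s = W x + b, then the softmax.

```python
# Log-linear LM scoring step: s = W x + b, then softmax (sketch).
import numpy as np

vocab_size, num_features = 5, 8                   # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, num_features))   # feature weight matrix
b = np.zeros(vocab_size)                          # bias, one value per output word

x = np.zeros(num_features)
x[2] = 1.0                                        # e.g. one-hot feature for the previous word

s = W @ x + b                                     # scores for every word in the vocabulary
p = np.exp(s) / np.exp(s).sum()                   # softmax: scores -> probabilities

print(p, p.sum())                                 # probabilities sum to 1
```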
4.2 Learning Model Parameters
This negative log-likelihood loss, ℓ = −log P, is insightful and reasonable: when P is close to 1 the loss is close to 0, and when P is close to 0 the loss grows toward positive infinity (the minus sign in the definition is what turns the −∞ of log P into +∞).
There are also a few things to consider to ensure that training remains stable:
1. Adjusting the learning rate: decay the learning rate over time.
2. Early stopping: stop at the point where development-set performance is best, to prevent overfitting.
3. Shuffling training order: prevent bias from ordered training data (the news example in the tutorial).
And some optimization methods:
SGD with momentum: keep an exponentially decaying average of past gradients and move in that direction, which smooths out noisy individual updates.
AdaGrad: AdaGrad focuses on the fact that some parameters are updated much more frequently than others. For example, in the model above, columns of the weight matrix W corresponding to infrequent context words will only be updated a few times for every pass through the corpus, while the bias b will be updated on every training example. Based on this, AdaGrad dynamically adjusts the training rate η for each parameter individually, with frequently updated (and presumably more stable) parameters such as b getting smaller updates, and infrequently updated parameters such as W getting larger updates.
Adam: Adam is another method that computes learning rates for each parameter. It does so by keeping track of exponentially decaying averages of the mean and variance of past gradients, incorporating ideas similar to both momentum and AdaGrad.
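As a rough sketch of the update rules behind the first two methods (this is my paraphrase, not the tutorial's code, and the hyperparameter defaults are illustrative):

```python
# Sketches of SGD-with-momentum and AdaGrad updates for one parameter vector.
import numpy as np

def sgd_momentum_step(param, grad, velocity, eta=0.1, mu=0.9):
    # Keep an exponentially decaying average of past gradients ("velocity")
    # and move in that direction instead of using the raw gradient alone.
    velocity = mu * velocity - eta * grad
    return param + velocity, velocity

def adagrad_step(param, grad, grad_sq_sum, eta=0.1, eps=1e-8):
    # Accumulate squared gradients per parameter; frequently updated parameters
    # accumulate more and therefore get smaller effective learning rates.
    grad_sq_sum = grad_sq_sum + grad ** 2
    return param - eta * grad / (np.sqrt(grad_sq_sum) + eps), grad_sq_sum
```

Each step returns the updated parameters plus the state (velocity or squared-gradient sum) that has to be carried over to the next step; Adam combines both ideas, a momentum-style average of the gradient plus AdaGrad-style per-parameter scaling.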
4.3 Derivatives for Log-linear Models
The answer to the above two questions is subtle: both derivatives involve the one-hot vector of the correct word, because the gradient of the loss with respect to the score vector is p − onehot(e_t).
Why? We need to refer to matrix calculus (see "Matrix calculus" on Wikipedia).
See also "Derivative of the softmax loss function" and "Commonly used matrix derivative formulas in machine learning" (机器学习中常用的矩阵求导公式) on 程序园.
I proved the first one and it is correct; I think the second one follows similarly.
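To double-check, here is a numerical gradient check (my own sketch) of the key fact that the gradient of the negative log-likelihood with respect to the scores is p − onehot(e_t), which gives dℓ/db directly and dℓ/dW after multiplying by the features:

```python
# Numerical check that d(-log softmax(s)[t]) / ds = softmax(s) - onehot(t).
import numpy as np

def softmax(s):
    s = s - s.max()                      # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum()

def nll(s, t):
    return -np.log(softmax(s)[t])

rng = np.random.default_rng(0)
s = rng.normal(size=5)                   # scores for a 5-word vocabulary
t = 2                                    # index of the correct word

analytic = softmax(s).copy()
analytic[t] -= 1.0                       # p - onehot(e_t)

eps = 1e-6
numeric = np.zeros_like(s)
for i in range(len(s)):
    s_plus, s_minus = s.copy(), s.copy()
    s_plus[i] += eps
    s_minus[i] -= eps
    numeric[i] = (nll(s_plus, t) - nll(s_minus, t)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```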
4.4 Other Features for Language Modeling
Context word features, Context class, Context suffix features, Bag-of-words features...
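A rough sketch of how a few of these feature templates might be turned into a feature dictionary for one context (the template names and the window size are my own choices, not the tutorial's):

```python
# Extract a few example feature types for a single context (sketch).
def extract_features(context, window=2):
    feats = {}
    for i, w in enumerate(reversed(context[-window:]), start=1):
        feats[f"prev{i}_word={w}"] = 1.0         # context word features
        feats[f"prev{i}_suffix={w[-2:]}"] = 1.0  # context suffix features
    for w in set(context):
        feats[f"bow={w}"] = 1.0                  # bag-of-words features
    return feats

print(extract_features(["this", "is", "a"]))
```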
4.5 Further Reading
Whole-sentence language models, Discriminative language models...
4.6 Exercise