<h1 id="neural-language-modeling-from-scratch-part-1">Neural Language Modeling From Scratch (Part 1)</h1>
<p><em>September 7, 2017</em></p>
<p>Language models assign probability values to sequences of words. The three word suggestions that appear above your phone’s keyboard, trying to predict the next word you’ll type, are one application of language modeling. In the case shown below, the language model predicts that “from”, “on” and “it” have a high probability of being the next word in the given sentence. Internally, for each word in its vocabulary, the language model computes the probability that it will be the next word, but the user only sees the top three most probable words.</p>
<div class="imgcap">
<img src="/images/lm/keyboard.png" />
</div>
<p>Language models are a fundamental part of many systems that attempt to solve natural language processing tasks such as machine translation and speech recognition. Currently, all state-of-the-art language models are neural networks.</p>
<p>The first part of this post presents a simple feedforward neural network that solves this task. In the second part of the post, we will improve the simple model by adding to it a recurrent neural network (RNN). The final part will discuss two recently proposed regularization techniques for improving RNN based language models.</p>
<h2 id="a-simple-model">A simple model</h2>
<p>To begin, we will build a simple model that, given a single word taken from a sentence, tries to predict the word following it.</p>
<div class="imgcap">
<img src="/images/lm/w2v.svg" />
</div>
<p>We represent words using one-hot vectors: we decide on an arbitrary ordering of the words in the vocabulary and then represent the <code class="highlighter-rouge">n</code>th word as a vector of the size of the vocabulary (<code class="highlighter-rouge">N</code>), which is set to <code class="highlighter-rouge">0</code> everywhere except element <code class="highlighter-rouge">n</code> which is set to <code class="highlighter-rouge">1</code>.</p>
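<p>As a toy illustration of this encoding (using a five-word vocabulary instead of the 10,000-word one used later in this post), a one-hot vector can be built like this:</p>

```python
import numpy as np

# A toy vocabulary of N = 5 words, in an arbitrary but fixed order.
vocab = ["the", "cat", "is", "on", "mat"]

def one_hot(word, vocab):
    """Return a vector of length N that is 1 at the word's index
    and 0 everywhere else."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("cat", vocab))  # [0. 1. 0. 0. 0.]
```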
<p>The model can be separated into two components:</p>
<ul>
<li>
<p>We start by <strong>encoding</strong> the input word. This is done by taking the one-hot vector representing the input word (<code class="highlighter-rouge">c</code> in the diagram) and multiplying it by a matrix of size <code class="highlighter-rouge">(N,200)</code> which we call the input embedding (<code class="highlighter-rouge">U</code>). This multiplication results in a vector of size <code class="highlighter-rouge">200</code>, which is also referred to as a word embedding. This embedding is a dense representation of the current input word. It is much smaller than the one-hot vector representing the same word, and it also has some other interesting properties: for example, while the distance between every two words represented by one-hot vectors is always the same, the dense representations of words that are close in meaning are close in the embedding space.</p>
</li>
<li>
<p>The second component can be seen as a <strong>decoder</strong>. After the encoding step, we have a representation of the input word. We multiply it by a matrix of size <code class="highlighter-rouge">(200,N)</code>, which we call the output embedding (<code class="highlighter-rouge">V</code>). The resulting vector of size <code class="highlighter-rouge">N</code> is then passed through the softmax function, normalizing its values into a probability distribution (meaning each one of the values is between <code class="highlighter-rouge">0</code> and <code class="highlighter-rouge">1</code>, and their sum is <code class="highlighter-rouge">1</code>). This distribution is denoted by <code class="highlighter-rouge">p</code> in the diagram above.</p>
</li>
</ul>
<p>The decoder is a simple function that takes a representation of the input word and returns a distribution which represents the model’s predictions for the next word: the model assigns to each word the probability that it will be the next word in the sequence.</p>
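<p>Both components can be sketched in a few lines of numpy. This is a toy-sized illustration with random, untrained weights, not the actual implementation (in practice, multiplying by a one-hot vector is implemented as a row lookup in <code class="highlighter-rouge">U</code>):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 4                    # toy sizes; the post uses N = 10,000 and d = 200

U = rng.normal(size=(N, d))    # input embedding
V = rng.normal(size=(d, N))    # output embedding

def softmax(z):
    e = np.exp(z - z.max())    # shift for numerical stability
    return e / e.sum()

def predict(word_index):
    c = np.zeros(N)
    c[word_index] = 1.0            # one-hot input vector
    embedding = c @ U              # encode: dense vector of size d
    return softmax(embedding @ V)  # decode: distribution over the N words

p = predict(1)
assert abs(p.sum() - 1.0) < 1e-12  # a valid probability distribution
```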
<p>To train this model, we need pairs of input and target output words. For the <code class="highlighter-rouge">(input, target-output)</code> pairs we use the Penn Treebank dataset which contains around 40K sentences from news articles, and has a vocabulary of exactly <code class="highlighter-rouge">10,000</code> words. To generate word pairs for the model to learn from, we will just take every pair of neighboring words from the text and use the first one as the input word and the second one as the target output word. So for example for the sentence <code class="highlighter-rouge">“The cat is on the mat”</code> we will extract the following word pairs for training: <code class="highlighter-rouge">(The, cat)</code>, <code class="highlighter-rouge">(cat, is)</code>, <code class="highlighter-rouge">(is, on)</code>, and so on.</p>
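<p>Extracting these training pairs is a one-liner; a minimal sketch:</p>

```python
def training_pairs(sentence):
    """Pair every word with the word that follows it."""
    words = sentence.split()
    return list(zip(words, words[1:]))

print(training_pairs("The cat is on the mat"))
# [('The', 'cat'), ('cat', 'is'), ('is', 'on'), ('on', 'the'), ('the', 'mat')]
```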
<p>We use stochastic gradient descent to update the model during training, and the loss used is the cross-entropy loss. Intuitively, this loss measures the distance between the output distribution predicted by the model and the target distribution at every iteration. The target distribution at each iteration is a one-hot vector representing the current target word.</p>
<p>The metric used for reporting the performance of a language model is its perplexity on the test set. It is defined as <script type="math/tex">e^{-\frac{1}{N}\sum_{i=1}^{N} \ln p_{\text{target}_i}}</script>, where <script type="math/tex">p_{\text{target}_i}</script> is the probability given by the model to the target word at iteration <script type="math/tex">i</script>. Perplexity is a decreasing function of the average log probability that the model assigns to the target word at each iteration. We want to maximize the probability assigned to the target word at every iteration, which means we want to minimize the perplexity (the optimal perplexity is <code class="highlighter-rouge">1</code>).</p>
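<p>The definition translates directly into code:</p>

```python
import math

def perplexity(target_probs):
    """e raised to the negative average log probability of the targets."""
    n = len(target_probs)
    return math.exp(-sum(math.log(p) for p in target_probs) / n)

print(perplexity([1.0, 1.0, 1.0]))         # 1.0  (the optimal value)
print(round(perplexity([0.005464] * 10)))  # 183
```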
<p>The perplexity for the simple model<sup id="fnref:sg"><a href="#fn:sg" class="footnote">1</a></sup> is about <code class="highlighter-rouge">183</code> on the test set, which means that on average it assigns a probability of about <script type="math/tex">0.005</script> to the target word in every iteration on the test set. It’s much better than a naive model which would assign an equal probability to each word (which would assign a probability of <script type="math/tex">\frac {1} {N} = \frac {1} {10,000} = 0.0001</script> to the correct word), but we can do much better.</p>
<h2 id="using-rnns-to-improve-performance">Using RNNs to improve performance</h2>
<p>The biggest problem with the simple model is that to predict the next word in the sentence, it only uses a single preceding word. If we could build a model that would remember even just a few of the preceding words there should be an improvement in its performance. To understand why adding memory helps, think of the following example: what words follow the word “drink”? You’d probably say that “coffee”, “beer” and “soda” have a high probability of following it. If I told you the word sequence was actually “Cows drink”, then you would completely change your answer.</p>
<p>We can add memory to our model by augmenting it with a <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">recurrent neural network</a> (RNN), as shown below.</p>
<div class="imgcap">
<img src="/images/lm/rnnlm.svg" />
</div>
<p>This model is similar to the simple one, except that after encoding the current input word we feed the resulting representation (of size <code class="highlighter-rouge">200</code>) into a two-layer <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">LSTM</a>, which then outputs a vector also of size <code class="highlighter-rouge">200</code> (at every time step the LSTM also receives a vector representing its previous state; this is not shown in the diagram). Then, just like before, we use the decoder to convert this output vector into a vector of probability values. (An LSTM is just a fancier RNN that is better at remembering the past. Its “API” is identical to that of a plain RNN: at each time step the LSTM receives an input and its previous state, and uses those two inputs to compute an updated state and an output vector<sup id="fnref:api"><a href="#fn:api" class="footnote">2</a></sup>.)</p>
<p>Now we have a model that at each time step gets not only the current word representation, but also the state of the LSTM from the previous time step, and uses this to predict the next word. The state of the LSTM is a representation of the previously seen words (note that words that we saw recently have a much larger impact on this state than words we saw a while ago).</p>
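<p>To make the recurrent “API” concrete, here is a minimal numpy sketch that uses a vanilla RNN cell as a stand-in for the LSTM (toy sizes, random untrained weights; a real LSTM has a more involved update rule, but the interface is the same):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                # toy hidden size; the post uses 200

Wx = rng.normal(size=(d, d)) * 0.1   # input-to-hidden weights
Wh = rng.normal(size=(d, d)) * 0.1   # hidden-to-hidden (recurrent) weights

def rnn_step(x, h_prev):
    """Take the current input and the previous state; return the new
    state, which here also serves as the output vector."""
    return np.tanh(x @ Wx + h_prev @ Wh)

h = np.zeros(d)                      # initial state
for x in rng.normal(size=(3, d)):    # three word embeddings in sequence
    h = rnn_step(x, h)               # the state carries the history forward
```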
<p>As expected, performance improves and the perplexity of this model on the test set is about <code class="highlighter-rouge">114</code>. An implementation of this model<sup id="fnref:zaremba"><a href="#fn:zaremba" class="footnote">3</a></sup>, along with a detailed explanation, is available in <a href="https://www.tensorflow.org/tutorials/recurrent">Tensorflow</a>.</p>
<h2 id="the-importance-of-regularization">The importance of regularization</h2>
<p><code class="highlighter-rouge">114</code> perplexity is good but we can still do much better. In this section I’ll present some recent advances that improve the performance of RNN based language models.</p>
<h3 id="dropout">Dropout</h3>
<p>We could try improving the network by increasing the size of the embeddings and LSTM layers (until now the size we used was <code class="highlighter-rouge">200</code>), but this soon stops improving performance because the network overfits the training data (it uses its increased capacity to memorize properties of the training set, which leads to inferior generalization, i.e. worse performance on the unseen test set). One way to counter this, by regularizing the model, is to use dropout.</p>
<p>The diagram below is a visualization of the RNN based model unrolled across three time steps. <code class="highlighter-rouge">x</code> and <code class="highlighter-rouge">y</code> are the input and output sequences, and the gray boxes represent the LSTM layers. Vertical arrows represent an input to the layer that is from the same time step, and horizontal arrows represent connections that carry information from previous time steps.</p>
<div class="imgcap">
<img src="/images/lm/no_dropout.svg" />
</div>
<p>We can apply dropout on the vertical (same time step) connections:</p>
<div class="imgcap">
<img src="/images/lm/regular_dropout.svg" />
</div>
<p>The arrows are colored in places where we apply dropout. A dropout mask for a certain layer indicates which of that layer’s activations are zeroed. In this case, we use different dropout masks for the different layers (this is indicated by the different colors in the diagram).</p>
<p>Applying dropout to the recurrent connections harms the performance, and so in this initial use of dropout we apply it only on connections within the same time step. Using two LSTM layers, with each layer containing <code class="highlighter-rouge">1500</code> LSTM units, we achieve a perplexity of <code class="highlighter-rouge">78</code> (we drop activations with a probability of <code class="highlighter-rouge">0.65</code>)<sup id="fnref:zarembaLarge"><a href="#fn:zarembaLarge" class="footnote">4</a></sup>.</p>
<p>The recently introduced <a href="https://arxiv.org/abs/1512.05287">variational dropout</a> solves this problem, making it possible to apply dropout to the recurrent connections as well, and improves the model’s performance even more (to <code class="highlighter-rouge">75</code> perplexity) by using the same dropout masks at each time step.</p>
<div class="imgcap">
<img src="/images/lm/variational_dropout.svg" />
</div>
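<p>The difference between the two schemes can be sketched in numpy. This is a toy illustration of the masks only (in the real models the masks are applied inside the network, and only at training time):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, p_drop = 6, 3, 0.65   # layer size, time steps, dropout probability

def dropout_mask():
    """Zero each activation with probability p_drop, scaling the
    survivors so the expected activation is unchanged."""
    return rng.binomial(1, 1 - p_drop, d) / (1 - p_drop)

activations = rng.normal(size=(T, d))

# Standard dropout: a fresh mask is sampled at every time step.
standard = np.stack([activations[t] * dropout_mask() for t in range(T)])

# Variational dropout: one mask is sampled and reused at every time step.
shared_mask = dropout_mask()
variational = activations * shared_mask  # broadcasts the mask over all T steps
```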
<h3 id="weight-tying">Weight Tying</h3>
<p>The input embedding and output embedding have a few properties in common. The first is that they are both of the same size (in our RNN model with dropout they are both of size <code class="highlighter-rouge">(10000,1500)</code>).</p>
<p>The second shared property is a bit more subtle. In the input embedding, words that have similar meanings are represented by similar vectors (similar in terms of <a href="https://en.wikipedia.org/wiki/Cosine_similarity#Definition">cosine similarity</a>). This is because the model learns that it needs to react to similar words in a similar fashion (the words that follow the word “quick” are similar to the ones that follow the word “rapid”).</p>
<p>This also occurs in the output embedding. The output embedding receives a representation of the RNN’s belief about the next output word (the output of the RNN) and has to transform it into a distribution. Given the representation from the RNN, the probability that the decoder assigns to a word depends mostly on that word’s representation in the output embedding (the probability is exactly the softmax-normalized dot product of this representation and the output of the RNN).</p>
<p>Because the model would like to, given the RNN output, assign similar probability values to similar words, similar words are represented by similar vectors. (Again, if, given a certain RNN output, the probability for the word “quick” is relatively high, we would also expect the probability for the word “rapid” to be relatively high).</p>
<p>These two similarities led us to recently propose a very simple method, <a href="https://arxiv.org/abs/1608.05859">weight tying</a>, to reduce the model’s number of parameters and improve its performance. We simply tie the input and output embeddings (i.e. we set <code class="highlighter-rouge">U=V</code>, meaning that we now have a single embedding matrix that is used both as an input and an output embedding). This reduces the perplexity of the RNN model that uses dropout to <code class="highlighter-rouge">73</code>, and its size is reduced by more than 20%<sup id="fnref:inan"><a href="#fn:inan" class="footnote">5</a></sup>.</p>
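<p>A minimal numpy sketch of a tied model’s forward pass (toy sizes, random weights; the single matrix <code class="highlighter-rouge">E</code> stands in for the tied <code class="highlighter-rouge">U</code> and <code class="highlighter-rouge">V</code>):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 4    # toy sizes; the dropout model above uses N = 10,000, d = 1,500

E = rng.normal(size=(N, d))   # the single, tied embedding matrix

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tied_forward(word_index, rnn_output):
    embedding = E[word_index]      # input side: row lookup in E
    p = softmax(rnn_output @ E.T)  # output side: the same matrix, transposed
    return embedding, p

embedding, p = tied_forward(1, rng.normal(size=d))
# One N-by-d matrix now does the work of two, removing N * d parameters.
```

<p>With the real sizes, tying removes a <code class="highlighter-rouge">(10000,1500)</code> matrix, i.e. 15 million parameters.</p>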
<h4 id="why-does-weight-tying-work">Why does weight tying work?</h4>
<p>The perplexity of the variational dropout RNN model on the test set is <code class="highlighter-rouge">75</code>. The same model achieves <code class="highlighter-rouge">24</code> perplexity on the training set, so it performs much better on the training set than it does on the test set. This means that it has started to memorize patterns and sequences that occur only in the training set and do not help the model generalize to unseen data. One of the ways to counter this overfitting is to reduce the model’s ability to ‘memorize’ by reducing its capacity (number of parameters). By applying weight tying, we remove a large number of parameters.</p>
<p>In addition to the regularizing effect of weight tying we presented another reason for the improved results. We showed that the word representations in the output embedding are of much higher quality than the ones in the input embedding of untied language models. This is shown using embedding evaluation benchmarks such as <a href="https://www.cl.cam.ac.uk/~fh295/simlex.html">Simlex999</a>. In a weight tied model, because the tied embedding’s parameter updates at each training iteration are very similar to the updates of the output embedding of the untied model, the tied embedding performs similarly to the output embedding of the untied model. So in the tied model, we use a single high quality embedding matrix in two places in the model. This contributes to the improved performance of the tied model<sup id="fnref:paper"><a href="#fn:paper" class="footnote">6</a></sup>.</p>
<p>To summarize, this post presented how to improve a very simple feedforward neural network language model, by first adding an RNN, and then adding variational dropout and weight tying to it.</p>
<p>In recent months, we’ve seen further improvements to the state of the art in RNN language modeling. The current state of the art results are held by two recent papers by <a href="https://arxiv.org/abs/1707.05589">Melis et al.</a> and <a href="https://arxiv.org/abs/1708.02182">Merity et al.</a>. These models make use of most, if not all, of the methods shown above, and extend them by using better optimization techniques, new regularization methods, and by finding better hyperparameters for existing models. Some of these methods will be presented in part two of this guide.</p>
<p>Feel free to ask questions in the comments below.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:sg">
<p>This model is the skip-gram word2vec model presented in <a href="https://arxiv.org/abs/1301.3781">Efficient Estimation of Word Representations in Vector Space</a>. <a href="#fnref:sg" class="reversefootnote">↩</a></p>
</li>
<li id="fn:api">
<p>For a detailed explanation of this watch Edward Grefenstette’s <a href="http://videolectures.net/deeplearning2016_grefenstette_augmented_rnn/">Beyond Seq2Seq with Augmented RNNs</a> lecture. <a href="#fnref:api" class="reversefootnote">↩</a></p>
</li>
<li id="fn:zaremba">
<p>This model is the small model presented in <a href="https://arxiv.org/abs/1409.2329">Recurrent Neural Network Regularization</a>. <a href="#fnref:zaremba" class="reversefootnote">↩</a></p>
</li>
<li id="fn:zarembaLarge">
<p>This is the large model from <a href="https://arxiv.org/abs/1409.2329">Recurrent Neural Network Regularization</a>. <a href="#fnref:zarembaLarge" class="reversefootnote">↩</a></p>
</li>
<li id="fn:inan">
<p>In parallel to our work, an explanation for weight tying based on <a href="https://arxiv.org/abs/1503.02531">Distilling the Knowledge in a Neural Network</a> was presented in <a href="https://arxiv.org/abs/1611.01462">Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling</a>. <a href="#fnref:inan" class="reversefootnote">↩</a></p>
</li>
<li id="fn:paper">
<p>Our <a href="https://arxiv.org/abs/1608.05859">paper</a> explains this in detail. <a href="#fnref:paper" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<h1 id="how-to-start-learning-deep-learning">How to Start Learning Deep Learning</h1>
<p><em>June 26, 2016</em></p>
<p>Due to the recent achievements of artificial neural networks across many different tasks (such as <a href="https://research.facebook.com/publications/deepface-closing-the-gap-to-human-level-performance-in-face-verification/">face recognition</a>, <a href="http://blogs.microsoft.com/next/2015/12/10/microsoft-researchers-win-imagenet-computer-vision-challenge/">object detection</a> and <a href="https://deepmind.com/alpha-go">Go</a>), deep learning has become extremely popular. This post aims to be a starting point for those interested in learning more about it.</p>
<p><strong>If you already have a basic understanding of linear algebra, calculus, probability and programming:</strong> I recommend starting with Stanford’s <a href="http://cs231n.stanford.edu/">CS231n</a>. The course notes are comprehensive and well-written. The slides for each lesson are also available, and even though the accompanying videos were removed from the official site, re-uploads are quite easy to find online.</p>
<p><strong>If you don’t have the relevant math background:</strong> There is an incredible amount of free material online that can be used to learn the required math knowledge. <a href="http://ocw.mit.edu/courses/mathematics/18-06sc-linear-algebra-fall-2011/index.htm">Gilbert Strang’s course on linear algebra</a> is a great introduction to the field. For the other subjects, edX has courses from MIT on both <a href="https://www.edx.org/course/calculus-1a-differentiation-mitx-18-01-1x">calculus</a> and <a href="https://www.edx.org/course/introduction-probability-science-mitx-6-041x-1">probability</a>.</p>
<p><strong>If you are interested in learning more about machine learning:</strong> <a href="https://www.coursera.org/learn/machine-learning">Andrew Ng’s Coursera class</a> is a popular choice as a first class in machine learning. There are other great options available such as <a href="https://work.caltech.edu/telecourse.html">Yaser Abu-Mostafa’s machine learning course</a> which focuses much more on theory than the Coursera class but it is still relevant for beginners. Knowledge in machine learning isn’t really a prerequisite to learning deep learning, but it does help. In addition, learning classical machine learning and not only deep learning is important because it provides a theoretical background and because deep learning isn’t always the correct solution.</p>
<p><strong>CS231n isn’t the only deep learning course available online.</strong> <a href="https://www.coursera.org/course/neuralnets">Geoffrey Hinton’s Coursera class “Neural Networks for Machine Learning”</a> covers a lot of different topics, and so does <a href="https://www.youtube.com/playlist?list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH">Hugo Larochelle’s “Neural Networks Class”</a>. Both of these classes contain video lectures. <a href="https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/">Nando de Freitas also has a course available online</a> which contains videos, slides and also a list of homework assignments.</p>
<p><strong>If you prefer reading over watching video lectures:</strong> <a href="http://neuralnetworksanddeeplearning.com/">Neural Networks and Deep Learning</a> is a free online book for beginners to the field. The <a href="http://www.deeplearningbook.org/">Deep Learning Book</a> is also a great free book, but it is slightly more advanced.</p>
<p><strong>Where to go after you’ve got the basics:</strong></p>
<ul>
<li><strong>Computer Vision</strong> is covered by most, if not all, of the deep learning resources mentioned above.</li>
<li><strong>Recurrent Neural Networks (RNNs)</strong> are the basis of neural network based models that solve tasks related to sequences such as machine translation or speech recognition. <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">Andrej Karpathy’s blog post on RNNs</a> is a great place to start learning about them. Christopher Olah has a <a href="http://colah.github.io/">great blog</a> where many deep learning concepts are explained in a very visual and easy to understand way. <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">His post on LSTM networks</a> is an introduction to LSTMs, a widely used RNN variant.</li>
<li><strong>Natural Language Processing (NLP):</strong> <a href="http://cs224d.stanford.edu/">CS224d</a> is an introduction to NLP with deep learning. Advanced courses are available from both <a href="http://www.kyunghyuncho.me/home/courses/ds-ga-3001-fall-2015">Kyunghyun Cho</a> (with lecture notes <a href="https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf">here</a>) and <a href="http://u.cs.biu.ac.il/~yogo/nnlp.pdf">Yoav Goldberg</a>.</li>
<li><strong>Reinforcement Learning:</strong> If you’d like to control robots or beat the human champion of Go, you should probably use reinforcement learning. <a href="http://karpathy.github.io/2016/05/31/rl/">Andrej Karpathy’s post on deep reinforcement learning</a> is an excellent starting point. David Silver also recently published a short <a href="https://deepmind.com/blog/deep-reinforcement-learning/">blog post</a> introducing deep reinforcement learning.</li>
</ul>
<p><strong>Deep learning frameworks:</strong> There are many frameworks for deep learning but the top three are probably <a href="http://tensorflow.org/">Tensorflow</a> (by Google), <a href="http://torch.ch/">Torch</a> (by Facebook) and <a href="http://deeplearning.net/software/theano/">Theano</a> (by <a href="https://mila.umontreal.ca/en/">MILA</a>). All of them are great, but if I had to select just one to recommend I’d say that Tensorflow is the best for beginners, mostly because of the great <a href="https://www.tensorflow.org/versions/r0.9/tutorials/index.html">tutorials</a> available.</p>
<p><strong>If you’d like to train neural networks you should probably do it on a GPU.</strong> You don’t have to, but it’s much faster if you do. NVIDIA cards are the industry standard, and while most research labs use $1,000 graphics cards, there are a few affordable cards that can also get the work done. An even cheaper option is to rent a GPU-enabled instance from a cloud server provider like Amazon’s EC2 (short guide <a href="https://www.kaggle.com/c/facial-keypoints-detection/details/deep-learning-tutorial">here</a>).</p>
<p>Good luck!</p>