Neural cache language model

Abdalsamad Keramatfar
3 min read · Aug 13, 2020
Neural cache model

In this very short post I want to share an interesting idea, the one mentioned in the title. If you notice, I have used the term "post" a few times in this post! That is actually the topic we want to talk about: research shows that if you see a term in a document, the probability of seeing that term again increases. For example, the frequency of the word "tiger" on the Wikipedia page of the same title is 2.8%, while it is only 0.0037% in the whole of Wikipedia [1]. In language modeling, this phenomenon is addressed by an engineering artifact called caching.

The idea was introduced by Kuhn and De Mori (1990) [2]. They argue that prevalent language models such as n-grams capture the statistics of the whole corpus they were trained on, while it is also important for the model to be aware of the tokens of the given document and the most recent tokens.

For a recurrent language model (one of the most widely used models of the previous decade), the conditional probability of a word w given its context words (the x_i's) can be parameterized as:
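In the notation of [1], where o_w denotes the output embedding of word w, this distribution reads roughly as:

```latex
p_{\mathrm{vocab}}(w \mid x_{1..t}) \propto \exp\left(h_t^\top o_w\right)
```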

where h_t is computed as a recurrent function of the previous state and the input at time t (x_t):
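In symbols, with Φ standing for the recurrent update, this is approximately:

```latex
h_t = \Phi(x_t, h_{t-1})
```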

For a vanilla RNN, this function is a tanh applied to a linear combination of the previous state and the current input.
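Concretely, one common parameterization (the weight names W, U, and b here are illustrative, not taken from the post) is:

```latex
h_t = \tanh\left(W x_t + U h_{t-1} + b\right)
```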

Specifically, at time t, the model defines a probability distribution over the words stored in the cache as follows:
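In the notation of [1], where the cache stores pairs (h_i, x_{i+1}) of a hidden state and the word that followed it, and θ is a scaling parameter, the cache distribution is roughly:

```latex
p_{\mathrm{cache}}(w \mid h_{1..t}, x_{1..t}) \propto \sum_{i=1}^{t-1} \mathbb{1}\{w = x_{i+1}\}\, \exp\left(\theta\, h_t^\top h_i\right)
```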

The formula says that the probability of a word, given its context words (the x_i's) and their representations (the h_i's), is proportional to the sum, over the cache positions where that word appeared, of the exponentiated inner product between the hidden state stored at that position and the current state h_t. In other words, words whose stored states are similar to h_t, and words that have been repeated more often, are weighted more heavily by the model and are more likely to be chosen as the next word. The parameter θ controls the strength of this effect: by choosing θ = 0 we get a uni-gram language model over the current document, since each word is simply weighted by the number of times it appears in the cache.
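As a minimal numpy sketch of this cache distribution (the function name and arguments are my own, assuming word ids index into a vocabulary of size vocab_size):

```python
import numpy as np

def cache_distribution(h_t, cache_states, cache_words, vocab_size, theta=1.0):
    """Probability over the vocabulary defined by the cache.

    cache_states: hidden states h_i stored in the cache
    cache_words:  word ids x_{i+1} that followed each stored state
    theta:        scaling parameter; theta = 0 reduces the cache to a
                  unigram count over the words seen so far
    """
    scores = np.zeros(vocab_size)
    for h_i, w in zip(cache_states, cache_words):
        # each occurrence of w adds a weight based on how similar its
        # stored state is to the current state h_t
        scores[w] += np.exp(theta * np.dot(h_t, h_i))
    total = scores.sum()
    return scores / total if total > 0 else scores
```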

Following the practice in n-gram cache-based language models, the probability of a word is calculated by linear interpolation of the cache language model with the base language model:
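With an interpolation weight λ, this reads (following [1]):

```latex
p(w \mid h_{1..t}, x_{1..t}) = (1 - \lambda)\, p_{\mathrm{vocab}}(w \mid h_{1..t}) + \lambda\, p_{\mathrm{cache}}(w \mid h_{1..t}, x_{1..t})
```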

Finally, the paper points out the following advantages of caching:

  • It is a very efficient way to adapt a language model to a new domain.
  • It can predict out-of-vocabulary (OOV) words after seeing them once.
  • It captures long-range dependencies in documents, in order to generate more coherent text.

[1] Grave E, Joulin A, Usunier N. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426. 2016 Dec 13.

[2] Kuhn R, De Mori R. A cache-based natural language model for speech recognition. IEEE transactions on pattern analysis and machine intelligence. 1990 Jun;12(6):570–83.
