Attention in the natural language processing realm

Abdalsamad Keramatfar
4 min read · May 15, 2021

Attention is a mechanism by which we focus on the parts of the data that matter most to us. For example, we should pay more attention to sentiment-bearing words when doing sentiment classification. The attention mechanism helps us in two ways: first, it often results in better performance; second, the classification decision can be made clearer and visualized, giving us a more explainable model, which is of interest in its own right. In this post we first recap the math behind attention and then train two models, with and without it, to build a better intuition.

Figure 1 shows a conventional LSTM cell unfolded through time. At each step a d-dimensional word embedding enters the cell and an s-dimensional state comes out. Stacking the states gives an output matrix of shape n × s, where n is the sentence length.
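As a quick sanity check on these shapes, here is a minimal Keras sketch (the values n = 40, d = 100 and s = 128 are illustrative assumptions, not numbers from this post):

```python
import tensorflow as tf
from tensorflow.keras import layers

n, d, s = 40, 100, 128                    # sentence length, embedding size, state size (assumed)
embeddings = tf.random.normal((1, n, d))  # one sentence: n word embeddings of dimension d
states = layers.LSTM(s, return_sequences=True)(embeddings)
print(states.shape)                       # (1, n, s): one s-dimensional state per time step
```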

Fig 1. An LSTM cell with its input

For the classification task, how can we use this output? A naive approach is to keep only the final state and drop all the others (recall that the last state carries information from the last input and from the previous states).
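In Keras terms, that last-state baseline looks roughly like the following. This is a minimal sketch; the vocabulary size and other hyperparameters are placeholders, not the ones used in the experiment later on.

```python
from tensorflow.keras import layers, models

vocab_size, max_len, d, s = 10000, 40, 100, 128   # placeholder hyperparameters
baseline = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, d),
    layers.LSTM(s),                         # return_sequences=False: only the final state is kept
    layers.Dense(3, activation="softmax"),  # positive / negative / neutral
])
```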

Fig 2. Using the last state of the LSTM

Here we can act more wisely and use the attention mechanism. Essentially, a vector of weights, alpha, assigns an importance value to each state, and a weighted average of the states then gives our representation.

Fig 3. Attention
Final attention-based representation of a sentence (h is the states matrix produced by the LSTM)
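For readers who cannot see the formula image, this is the attention-based sentence representation from Yang et al. (2016), in LaTeX notation:

s = \sum_{t=1}^{n} \alpha_t \, h_t

where h_t is the LSTM state at step t and \alpha_t is its attention weight.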

But alpha is not trained as a plain parameter vector, because then the weights would be fixed per time step regardless of the actual words, which is not helpful. Instead, we compute alpha as follows:

alpha calculation
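Written out (again following Yang et al., 2016), each weight is a softmax over similarity scores:

\alpha_t = \frac{\exp(u_t^\top u_w)}{\sum_{t'} \exp(u_{t'}^\top u_w)}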

The formula is quite simple: a softmax over the scores formed from u_t and u_w. What are these?

The first is a transformed version of h_t, computed as a single neural network layer:

non-linear transformation of the states
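In symbols (Yang et al., 2016):

u_t = \tanh(W_w h_t + b_w)

where W_w and b_w are the layer's weight matrix and bias.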

This transformation gives the network extra capacity. The second, u_w, is the context vector, learned jointly with the other parameters. It is trained so that it can boost the more important words, because the attention weight of each u_t is computed as its normalized similarity (dot product) with u_w. So we now have a learnable, non-rigid alpha and can compute the final representation.
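One way to implement this in Keras is as a small custom layer. The following is an illustrative sketch of the mechanism described above (the class and variable names are mine, not the author's notebook code):

```python
import tensorflow as tf
from tensorflow.keras import layers

class WordAttention(layers.Layer):
    """Word-level attention pooling as in Yang et al. (2016)."""

    def build(self, input_shape):
        s = int(input_shape[-1])                    # LSTM state size
        # W_w and b_w for the non-linear transformation u_t = tanh(W_w h_t + b_w)
        self.W = self.add_weight(name="W_w", shape=(s, s), initializer="glorot_uniform")
        self.b = self.add_weight(name="b_w", shape=(s,), initializer="zeros")
        # context vector u_w, learned jointly with the rest of the network
        self.u = self.add_weight(name="u_w", shape=(s,), initializer="glorot_uniform")
        super().build(input_shape)

    def call(self, h):
        # h: (batch, n, s) -- the full sequence of LSTM states
        u_t = tf.tanh(tf.tensordot(h, self.W, axes=1) + self.b)       # (batch, n, s)
        scores = tf.tensordot(u_t, self.u, axes=1)                    # (batch, n)
        alpha = tf.nn.softmax(scores, axis=1)                         # attention weight per word
        return tf.reduce_sum(h * tf.expand_dims(alpha, -1), axis=1)   # weighted average: (batch, s)
```

The layer maps the (batch, n, s) states matrix to a single (batch, s) sentence vector, exactly the weighted average described above.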

Now we implement sentiment classification with Keras. Both models consist of an embedding layer, an LSTM, and a dense layer, with an attention layer added in the second one. The task is to classify each tweet as positive, negative, or neutral. We train each model for 60 epochs. If you are interested, you can see the code in the following notebook:
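As a rough, illustrative sketch of the second model (not the notebook's exact code or hyperparameters), the attention variant can be wired up like this, reusing the WordAttention layer from the sketch above:

```python
from tensorflow.keras import layers, models

vocab_size, max_len, d, s = 10000, 40, 100, 128   # same placeholder hyperparameters as before

inputs = layers.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, d)(inputs)
h = layers.LSTM(s, return_sequences=True, name="lstm_states")(x)   # keep all states for attention
context = WordAttention(name="word_attention")(h)
outputs = layers.Dense(3, activation="softmax")(context)           # positive / negative / neutral

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=60, validation_data=(x_val, y_val))
```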

The result is slightly better with attention: 0.7875 accuracy versus 0.7831 without it.

Now we visualize the alpha values for a test sample to see how the model attends to the different words (the notebook includes the code).
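For completeness, here is one hedged way such a visualization could be produced with the sketched model above. This is not the notebook's code; `x_sample` (a preprocessed tweet of shape (1, max_len)) and `tokens` (its word list) are hypothetical inputs.

```python
import tensorflow as tf
from tensorflow.keras import models

# Recompute alpha for one test sample using the named layers from the sketch above.
att = model.get_layer("word_attention")
state_model = models.Model(model.input, model.get_layer("lstm_states").output)

h = state_model(x_sample)                                  # LSTM states for the sample
u_t = tf.tanh(tf.tensordot(h, att.W, axes=1) + att.b)
alpha = tf.nn.softmax(tf.tensordot(u_t, att.u, axes=1), axis=1).numpy()[0]

for token, weight in zip(tokens, alpha):                   # print one attention weight per word
    print(f"{token:>12s}  {weight:.3f}")
```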

Fig 4. Visualization of attention alpha for a test sample

We can see that the word "terrible" has the lowest value, while "no" and "hrs" act as the more negative words; the latter is because this is an airline sentiment dataset, where "hrs" refers to flight delays.

In this post I played with the essence of attention for text classification. I plan to publish further posts on more advanced attention topics, such as GATs and Transformers.

Sources:

Keramatfar A, Amirkhani H, Jalaly Bidgoly A. Multi-thread hierarchical deep model for context-aware sentiment analysis. Journal of Information Science. February 2021. doi:10.1177/0165551521990617

Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, June 2016 (pp. 1480–1489).
