Introduction to Deep Learning for Natural Language Processing

In this guide we'll review key concepts regarding the application of deep learning for natural language processing.

4 years ago   •   7 min read

By Peter Foy

This article is based on notes from Udacity's AI for Trading program and is organized as follows:

  1. Introduction to RNNs & LSTMs
  2. Word Embeddings & Word2Vec
  3. Resources: Deep Learning for Natural Language Processing

1. Introduction to RNNs & LSTMs

Recurrent neural networks (RNNs) and LSTMs are well suited to text data because they learn from sequences of data. In particular, at each step in the sequence they pass the hidden state from the previous step forward, combined with the current input.

Long short-term memory (LSTM) networks improve on RNNs, which have a hard time retaining information over long sequences.

LSTMs are useful when our neural network needs to switch between remembering recent data and remembering data from a longer time ago.

We won't cover RNNs in detail here, but if you want to learn more about them you can check out our Introduction to Recurrent Neural Networks & LSTMs.

Overview of LSTMs

As mentioned, LSTMs are an improvement of RNNs that makes use of long and short-term memory cells. In particular:

An LSTM network is a type of RNN that uses special units as well as standard units.

To do this, the network takes three pieces of information:

  • Long term memory
  • Short term memory
  • An event

These three pieces of information go into a node where a mathematical operation is performed, producing an output in the form of a probability. This output is then used to update both the long and short term memory.

Specifically, the architecture of an LSTM contains several gates:

  • Forget gate
  • Learn gate
  • Remember gate
  • Use gate

At a high-level, here's how all these work together:

  • The long term memory goes to the forget gate, where it forgets everything that it doesn't consider useful.
  • The short term memory and the event are joined together in the learn gate, which contains information that was recently learned and removes unnecessary information.
  • The long term memory that we haven't forgotten and the new learned information gets joined together in the remember gate. The remember gate outputs an updated long-term memory.
  • The use gate decides what information we use from what we previously know (long-term memory) and what we recently learned to make a prediction.
  • The output of the use gate is both the prediction and the new short term memory.

LSTM Architecture

Before looking at LSTMs let's recap the architecture of an RNN:

  • We take our event $E_t$ and memory $M_{t-1}$
  • We apply a simple tanh or sigmoid activation function to obtain the output and new memory $M_t$

The output is the prediction and also the memory that is carried to the next node.
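
As a rough illustration of the recap above, here is a minimal sketch of a single RNN step in plain Python. It assumes the update takes the form $M_t = tanh(W[M_{t-1}, E_t] + b)$, which the text does not spell out; the function name `rnn_step` and the toy weights are hypothetical.

```python
import math

def rnn_step(event, memory, weights, bias):
    """One simple RNN step, assuming M_t = tanh(W [M_{t-1}, E_t] + b)."""
    combined = memory + event  # concatenate [M_{t-1}, E_t]
    new_memory = []
    for row, b in zip(weights, bias):
        pre_activation = sum(w * x for w, x in zip(row, combined)) + b
        new_memory.append(math.tanh(pre_activation))
    return new_memory

# Toy sizes (hypothetical values): memory of length 2, event of length 3.
W = [[0.1, -0.2, 0.3, 0.0, 0.5],
     [0.4, 0.1, -0.3, 0.2, -0.1]]
b = [0.0, 0.1]

memory = [0.0, 0.0]
for event in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
    # The output of each step is also the memory carried to the next step.
    memory = rnn_step(event, memory, W, b)
```

The same vector plays both roles here: it is returned as the output and fed back in as the memory on the next call.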

The LSTM architecture is similar but with many more nodes inside, as well as two inputs and two outputs for long term $LTM_t$ and short term $STM_t$ memory.

The Learn Gate

The learn gate does the following:

  • Takes short term memory and the event and combines them
  • It then ignores part of it, only keeping what is deemed important.

Mathematically, the output of the learn gate is $N_t i_t$ where:

$$N_t = \tanh(W_n [STM_{t-1}, E_t] + b_n)$$

$$i_t = \sigma(W_i [STM_{t-1}, E_t] + b_i)$$

$N_t$ is the new information and $i_t$ is the ignore factor.
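
The learn gate can be sketched in a few lines of plain Python. This is only an illustration under assumptions: vectors are Python lists, the product $N_t i_t$ is taken element by element, and the helper names (`affine`, `learn_gate`) and parameter names (`W_n`, `b_n`, `W_i`, `b_i`) are chosen here to mirror the symbols in the equations.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weights, inputs, bias):
    """Compute W x + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def learn_gate(stm_prev, event, W_n, b_n, W_i, b_i):
    combined = stm_prev + event  # [STM_{t-1}, E_t]
    new_info = [math.tanh(z) for z in affine(W_n, combined, b_n)]  # N_t
    ignore = [sigmoid(z) for z in affine(W_i, combined, b_i)]      # i_t
    # Element-wise product: keep only the part of N_t that i_t lets through.
    return [n * i for n, i in zip(new_info, ignore)]

# Toy example (hypothetical values): one memory unit, one event feature.
learned = learn_gate(stm_prev=[0.0], event=[1.0],
                     W_n=[[0.0, 1.0]], b_n=[0.0],
                     W_i=[[0.0, 0.0]], b_i=[0.0])
```

With zero weights in `W_i` the ignore factor is 0.5, so half of the new information is kept in this toy case.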

The Forget Gate

The forget gate takes long-term memory and decides what to keep, and what to forget.

Mathematically, the output of the Forget Gate is $LTM_{t-1}f_t$ where:

$$f_t = \sigma(W_f[STM_{t-1}, E_t] + b_f)$$

The Remember Gate

The remember gate takes the long-term memory coming out of the forget gate and the short-term memory coming out of the learn gate (the new information) and adds them together.

The output of the remember gate is as follows:

$$LTM_{t-1} f_t + N_t i_t$$

The Use Gate

The use gate uses the long-term memory from the forget gate and the short-term memory and event that feed the learn gate to produce an output. The output then works as the new short-term memory.

The output of the Use Gate is $U_t V_t$ where:

$$U_t = \tanh(W_u LTM_{t-1} f_t + b_u)$$

$$V_t = \sigma(W_v[STM_{t-1}, E_t] + b_v)$$

Four Gate LSTM Architecture

This is just one LSTM architecture, but here's an overview of how these gates all work together:

  • The Forget Gate takes long-term memory and forgets part of it
  • The Learn Gate takes short-term memory and combines it with the event
  • The Remember Gate combines our long-term memory with the new information learned in order to update the long-term memory and output it
  • The Use Gate takes the information just learned and long-term memory and uses it to make a prediction and update the short-term memory
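
Putting the four gates together, one step of this architecture might look like the sketch below. It follows the gate equations given earlier, assumes element-wise products between vectors, and uses hypothetical names throughout (`lstm_step`, `random_params`, the parameter dictionary keys, and the toy sizes); it is meant as a reading aid rather than a reference implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weights, inputs, bias):
    """Compute W x + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def lstm_step(ltm_prev, stm_prev, event, p):
    """One step of the four-gate LSTM; p is a dict of weights and biases."""
    combined = stm_prev + event  # [STM_{t-1}, E_t]

    # Forget gate: keep a fraction f_t of each long-term memory entry.
    f = [sigmoid(z) for z in affine(p["W_f"], combined, p["b_f"])]
    kept = [mem * gate for mem, gate in zip(ltm_prev, f)]

    # Learn gate: new information N_t scaled by the ignore factor i_t.
    n = [math.tanh(z) for z in affine(p["W_n"], combined, p["b_n"])]
    i = [sigmoid(z) for z in affine(p["W_i"], combined, p["b_i"])]
    learned = [new * gate for new, gate in zip(n, i)]

    # Remember gate: the sum becomes the new long-term memory.
    ltm = [k + l for k, l in zip(kept, learned)]

    # Use gate: the product U_t * V_t is the output and new short-term memory.
    u = [math.tanh(z) for z in affine(p["W_u"], kept, p["b_u"])]
    v = [sigmoid(z) for z in affine(p["W_v"], combined, p["b_v"])]
    stm = [a * b for a, b in zip(u, v)]
    return ltm, stm

def random_params(hidden_size, event_size, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]

    p = {}
    for name in ("f", "n", "i", "v"):
        p["W_" + name] = matrix(hidden_size, hidden_size + event_size)
        p["b_" + name] = [0.0] * hidden_size
    p["W_u"] = matrix(hidden_size, hidden_size)
    p["b_u"] = [0.0] * hidden_size
    return p

params = random_params(hidden_size=2, event_size=3)
ltm, stm = [0.0, 0.0], [0.0, 0.0]
for event in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    ltm, stm = lstm_step(ltm, stm, event, params)
```

Both memories are threaded through the loop, which is the main structural difference from the simple RNN, where a single memory is carried forward.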

Another popular gated recurrent architecture, closely related to the LSTM, is the Gated Recurrent Unit (GRU).

2. Word Embeddings & Word2Vec

Now let's look at two useful techniques for natural language processing: word embeddings and word2vec.

Word Embeddings

Word embedding is the collective term for models that learn to map a set of words or phrases to vectors of numerical values—these vectors are called embeddings.

We can use neural network models to learn how to do word embedding. In particular, we can use this technique to reduce the dimensionality of text data.

Embedding models can also learn interesting characteristics of words in a vocabulary.

Next we'll look at the Word2vec model, which learns to map words to embeddings that contain semantic meaning. For example, word embeddings can learn the relationship between past and present tense verbs, or the relationship between gendered words (such as queen and woman).

It's important to note that the embeddings are learned from a body of text, which means that word associations in the source text will be present in the embeddings. This leads to the risk that the mappings are biased or incorrect.

Embedding Weight Matrix

As mentioned, embeddings can improve the ability of networks to learn from text data by reducing its dimensionality.

For example, let's say we have a neural network that is learning from a document with 10,000+ different words. To use these words as input to a neural network, such as an RNN or LSTM, we can one-hot encode them, but this leads to huge vectors that are thousands of values long, in which only one value is 1 and all the rest are 0.

When we pass this long vector as input to a hidden layer, the result is a huge matrix multiplication in which most of the values are 0, and as you can imagine, this is very computationally inefficient.

Embeddings can help solve this by providing a shortcut to performing the network's matrix multiplication.

To learn embeddings we can use a fully-connected linear layer, which we call the embedding layer; its weights are the embedding weights. These weights are learned while training the embedding model.

With this matrix we can skip the computationally inefficient multiplication by taking the input of the hidden layer from a row in the weight matrix.

This means we can use the embedding weight matrix as a lookup table, so instead of representing words as one-hot vectors we can encode each word as a unique integer.

This process is referred to as an embedding lookup and the number of hidden units is the embedding dimension.
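
To see why the lookup is a valid shortcut, the sketch below multiplies a one-hot vector by a small weight matrix and compares the result with simply reading the corresponding row. The 5-word vocabulary, the embedding dimension of 3, and the weight values are made up for illustration.

```python
def one_hot(index, vocab_size):
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

# Hypothetical embedding weights: 5 words in the vocabulary, 3 hidden units.
embedding_weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

word_index = 3  # each word is encoded as a unique integer

via_matmul = vector_times_matrix(one_hot(word_index, 5), embedding_weights)
via_lookup = embedding_weights[word_index]  # the embedding lookup
```

Both routes give the same vector, but the lookup touches one row instead of multiplying through every row of zeros.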


Word2Vec

The word2vec algorithm is a way of finding vectors that represent words and also contain context about the words.

As described in this guide on word2vec and neural word embeddings:

Word2vec is a two-layer neural net that processes text by “vectorizing” words. Its input is a text corpus and its output is a set of vectors: feature vectors that represent words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep neural networks can understand.

In essence, words that appear in similar contexts, such as "coffee", "tea", or "water" in sentences about drinking, will have vectors that are close to each other.

Relationships between words can then be represented by distances in vector space.

There are two architectures for implementing word2vec:

  • CBOW (continuous bag-of-words)
  • Skip-gram

You can find a good tutorial about the skip-gram model here and the original word2vec paper by Mikolov here, but here are a few high-level features of the model:

  • Words that show up often such as "the" or "of" typically don't provide much context to nearby words
  • We can discard some of these words with a process called subsampling, in which each word $w_i$ is dropped from the training set with probability:

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where:

  • $t$ is a threshold parameter
  • $f(w_i)$ is the frequency of word $w_i$ in the dataset
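
Here is a small sketch of subsampling in plain Python, assuming the corpus is a list of word strings. The function names, the toy corpus, and the large toy threshold are hypothetical, and clamping negative probabilities to 0 for rare words is an added assumption rather than part of the equation.

```python
import math
import random
from collections import Counter

def discard_probabilities(words, threshold=1e-5):
    """P(w) = 1 - sqrt(t / f(w)), clamped at 0 for rare words (assumption)."""
    counts = Counter(words)
    total = len(words)
    probs = {}
    for word, count in counts.items():
        freq = count / total  # f(w): relative frequency in the dataset
        probs[word] = max(0.0, 1.0 - math.sqrt(threshold / freq))
    return probs

def subsample(words, threshold=1e-5, seed=0):
    """Drop each occurrence of a word with its discard probability."""
    rng = random.Random(seed)
    p_drop = discard_probabilities(words, threshold)
    return [w for w in words if rng.random() >= p_drop[w]]

# Toy corpus; the threshold is far larger than usual so the effect is visible.
words = ["the"] * 90 + ["coffee"] * 10
p_drop = discard_probabilities(words, threshold=0.1)
kept = subsample(words, threshold=0.1)
```

In this toy corpus "the" makes up 90% of the words and is dropped about two thirds of the time, while "coffee" is never dropped.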

Next we need to get our data into the right format to be passed into the network.

With the skip-gram architecture, for each word we define its surrounding context as the words that fall within a window of size $C$ around it. The window refers to a window in time, for example 3 words in the past and future from a given word.

The original word2vec paper gives less weight to distant words by sampling them less often when building the training examples.
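
One common way to implement this is to draw a window size $R$ at random between 1 and $C$ for each word, so that nearby words end up in more training examples than distant ones. The sketch below assumes that approach; `get_context` and its parameters are hypothetical names.

```python
import random

def get_context(words, idx, max_window=5, rng=random):
    """Return the words within a randomly sized window around words[idx]."""
    r = rng.randint(1, max_window)  # window size R drawn from 1..C
    start = max(0, idx - r)
    before = words[start:idx]
    after = words[idx + 1:idx + r + 1]
    return before + after

sentence = "i drink coffee every single morning".split()
# Context for "coffee" (index 2) with C = 2 and a seeded generator.
context = get_context(sentence, 2, max_window=2, rng=random.Random(0))
```

Each returned context word would be paired with the centre word to form one (input, target) training example.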

Now let's review how to build the word2vec model network. A high-level overview of the network structure is as follows:

  • The input is batches of word tokens from our training data, encoded as integers
  • We pass this long list of integers into a hidden layer, the embedding layer. As mentioned, this takes the integers and basically creates a lookup table.
  • The embeddings are fed into a final fully connected softmax output layer. The output of the softmax layer is what's used to predict our target context words.
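
The structure above can be sketched as a forward pass in plain Python. This is an untrained toy model with random weights; the class name `SkipGram`, the vocabulary size, and the embedding dimension are hypothetical, and a real implementation would use a deep learning framework and process whole batches.

```python
import math
import random

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class SkipGram:
    def __init__(self, vocab_size, embed_dim, seed=0):
        rng = random.Random(seed)
        # Embedding layer: one row of weights per word (the lookup table).
        self.embedding = [[rng.uniform(-1, 1) for _ in range(embed_dim)]
                          for _ in range(vocab_size)]
        # Output layer: one row of weights per candidate context word.
        self.output = [[rng.uniform(-1, 1) for _ in range(embed_dim)]
                       for _ in range(vocab_size)]

    def forward(self, word_index):
        hidden = self.embedding[word_index]  # embedding lookup
        scores = [sum(w * h for w, h in zip(row, hidden))
                  for row in self.output]
        return softmax(scores)  # probability of each word being in the context

model = SkipGram(vocab_size=6, embed_dim=3)
probs = model.forward(2)
```

After training, the rows of `embedding` are the word vectors; the output layer is only needed to compute the training loss.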

As mentioned, Word2Vec allows us to mathematically operate on words in vector space. To calculate how similar words are we can calculate the cosine similarity, which looks at two vectors $a$ and $b$ and the angle $\theta$ between them.

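The cosine similarity of two words ends up being a value between -1 and 1 that tells you how similar two vectors are in vector space: values near 1 mean the vectors point in nearly the same direction, and values near 0 mean they are unrelated.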
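
Cosine similarity is defined as $\cos\theta = \frac{a \cdot b}{\|a\| \|b\|}$. A minimal sketch in plain Python follows; the two-dimensional vectors are toy values rather than real embeddings.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])  # 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # 0.0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # -1.0
```

Because the vectors are divided by their lengths, only the direction matters: `[1.0, 0.0]` and `[2.0, 0.0]` count as identical.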

The skip-gram model can take a long time to train, so one way we can speed it up is with what's referred to as "negative sampling". What this does is approximate the loss from the softmax layer by only updating a small subset of all the weights at once. You can learn more about negative sampling and word2vec in this tutorial by Chris McCormick.
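
To make the idea concrete, the sketch below computes a negative sampling loss for one training pair: the true context word is pushed towards the centre word, and a handful of randomly drawn "noise" words are pushed away. It assumes the usual form of the objective, $-\log\sigma(u_o \cdot v_c) - \sum_k \log\sigma(-u_k \cdot v_c)$; the function and variable names are hypothetical and the vectors are toy values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(center_vec, context_vec, noise_vecs):
    """Loss for one (centre, context) pair plus a few sampled noise words."""
    # Reward a high score for the true context word.
    loss = -math.log(sigmoid(dot(context_vec, center_vec)))
    # Reward a low score for each sampled noise word.
    for noise_vec in noise_vecs:
        loss -= math.log(sigmoid(-dot(noise_vec, center_vec)))
    return loss

center = [1.0, 0.5]
good = negative_sampling_loss(center, [1.0, 0.5], [[-1.0, -0.5], [-0.5, -1.0]])
bad = negative_sampling_loss(center, [-1.0, -0.5], [[1.0, 0.5], [0.5, 1.0]])
```

Only the vectors for the context word and the few noise words enter the loss, so only those rows of the output weights are updated, instead of every row as with a full softmax.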

3. Resources: Deep Learning for Natural Language Processing

In this guide we covered the application of deep learning to natural language processing at a very high level. If you want to learn more, here are additional resources I've found useful on the subject:
