### Background

You can find a very good introduction to LSTMs on Christopher Olah's blog. I first discovered recurrent neural networks by reading Andrej Karpathy’s excellent blog post, ”The Unreasonable Effectiveness of Recurrent Neural Networks”.

### How it works

Using standard mathematical notation, the equations for updating an LSTM cell are:

- Make a forget gate \(f_t\) to tell the network what to remove from its memory cell $$ f_t = \sigma(W_f \ \dot \ [h_{t-1}, x_t] + b_f ) $$
- Make a input gate \(i_t\) to tell the network what to add to its memory cell $$ i_t = \sigma(W_i \ \dot \ [h_{t-1}, x_t] + b_i $$
- Make a tensor \( \tilde C_t\) with contents we
*might*add to its memory cell
$$ \tilde C_t = \mathrm{tanh}(W_C \ \dot \ [h_{t-1},x_t] + b_C) $$
- Make a new memory cell \( C_t\) that is a combination of the existing memory state \( C_{t-1}\) and the candidate state \( \tilde C_t\) $$ C_t = f_t \ * \ C_{t-1} + i_t \ * \ \tilde C_t $$
- Make an output gate \(o_t\) to tell the network what part of its updated memory cell to output $$ o_t = \sigma(W_o \ \dot \ [h_{t-1}, x_t] + b_o $$
- Finally, use the output gate \(o_t\) to choose information from the memory cell \(C_t\) and make a hypothesis \(h_t\) $$ h_t = o_t * \mathrm{tanh}(C_t) $$

The two parallel arrows represent the two internal 'states' of the cell and the vertical arrows represent data flowing in and out at each time step

Notice that the output will be a function of the current input vector \(x_t\), the previous hypothesis \(h_{t-1}\), and the cell's memory state \(C_t\). This is how LSTMs build internal representations of sequences that span dozens of neighboring timesteps.

Recurrent neural networks are notoriously difficult to train. To make training more efficient, Keras splits the four weight tensors into \(W\) and \(U\) components. First, it dots the input vector by weights \(W_i\), \(W_f\), \(W_c\), and \(W_o\). Then it dots \(h_{t-1}\) with weights \(U_i\), \(U_f\), \(U_c\), and \(U_o\) and evaluates the rest of the equations as they appear above. My numpy reimplementation is the same; it starts with the \(W\) tensors:

```
xi = np.dot(X, W_i) + b_i
xf = np.dot(X, W_f) + b_f
xc = np.dot(X, W_c) + b_c
xo = np.dot(X, W_o) + b_o
```

In a helper function we finish updating the cell's state:

```
i_t = self.hard_sigmoid(xi_t + np.dot(hprev, U_i)) #[1,m] + [m]*[m,m] -> [1,m]
f_t = self.hard_sigmoid(xf_t + np.dot(hprev, U_f)) #[1,m] + [m]*[m,m] -> [1,m]
o_t = self.hard_sigmoid(xo_t + np.dot(hprev, U_o)) #[1,m] + [m]*[m,m] -> [1,m]
c_t = f_t*Cprev + i_t * np.tanh(xc_t + np.dot(hprev, U_c)) #[1,m]*[m] + [1,m] * [1,m] -> [1,m]
h_t = o_t * np.tanh(c_t) #[1,m]*[1,m] (elementwise)
```

The demo model consists of two LSTM cells with a dense softmax classification layer perched on top. I used the same architecture as the Keras LSTM text example. Training takes a lot of computation, so I ran it as a 24-hour job on a GPU node of Dartmouth's Discovery computing cluster.

### Discussion

I succeeded in building a working deep LSTM from the ground up. It learned English well enough to spell words correctly most of the time and follow basic grammar rules.

In the future I hope to train my models on a dedicated computing cluster that has GPU. I also need to redesign the architecture because the model writes too slow for the purposes of this demo. One way to do this would be to move to a word-level model instead of a letter-level model.