First, read an overview of seq2seq if you haven’t already.
Essentially, a recurrent neural network takes in an input and its past state, and emits an output and a new state. This lets it carry a “memory” of previous inputs through the state variable.
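To make that concrete, here’s a minimal sketch of that loop using PyTorch’s nn.RNNCell (the sizes and names are arbitrary, just for illustration):

```python
import torch
import torch.nn as nn

rnn = nn.RNNCell(input_size=10, hidden_size=20)

h_t = torch.zeros(1, 20)          # the state, i.e. the network's "memory"
inputs = torch.randn(5, 1, 10)    # a sequence of 5 inputs, batch size 1

for x_t in inputs:
    # each step consumes the current input and the previous state;
    # for a plain RNN cell the returned tensor is both the output and the new state
    h_t = rnn(x_t, h_t)
```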
LSTMs are a more sophisticated type of RNN designed so that gradients backpropagate better over long sequences (they are less prone to vanishing gradients).
GRUs are a simplified variant of LSTMs where the output and hidden state are the same (there is no separate cell state). You might expect the simplification to hurt, but in practice they often work just as well.
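A quick sketch contrasting the two cell types in PyTorch (the sizes are arbitrary): the LSTM carries two state tensors, while the GRU’s single returned state is also its output.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 10)   # one input vector, batch size 1

# LSTM keeps two pieces of state: the hidden state h (also its output)
# and a separate cell state c that the gates help preserve over long spans.
lstm = nn.LSTMCell(input_size=10, hidden_size=20)
h, c = lstm(x, (torch.zeros(1, 20), torch.zeros(1, 20)))

# GRU merges them: the single hidden state it returns is also its output.
gru = nn.GRUCell(input_size=10, hidden_size=20)
h = gru(x, torch.zeros(1, 20))
```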
A seq2seq model takes in a sentence in one language and passes it through a series of RNN cells that update a state. At the end of the sentence, that state is used to generate the translation word by word: at each step the “input” is just the last word it generated, and it keeps going until it emits an end-of-sentence token.
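Here’s a rough, untrained sketch of that encode-then-decode loop; the vocabulary size, the SOS/EOS ids, and the greedy argmax decoding are just assumptions to keep the example self-contained.

```python
import torch
import torch.nn as nn

SOS, EOS, MAX_LEN = 0, 1, 20

embed_src = nn.Embedding(1000, 32)
embed_tgt = nn.Embedding(1000, 32)
encoder = nn.GRUCell(32, 64)
decoder = nn.GRUCell(32, 64)
to_vocab = nn.Linear(64, 1000)

src = torch.tensor([5, 42, 7, EOS])            # token ids of the source sentence

h = torch.zeros(1, 64)
for tok in src:                                # encoder: fold the sentence into the state
    h = encoder(embed_src(tok.view(1)), h)

tok, out = torch.tensor([SOS]), []
for _ in range(MAX_LEN):                       # decoder: feed back the last generated word
    h = decoder(embed_tgt(tok), h)
    tok = to_vocab(h).argmax(dim=-1)
    if tok.item() == EOS:
        break
    out.append(tok.item())
```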
For some reason, it worked better when the input sentence was fed in reversed; the usual explanation is that reversing shortens the distance between the first source words and the first target words.
Attention weights are calculated over the encoder outputs, then pointwise multiplied with those outputs and summed (a weighted sum). This lets the network weight some parts of the input more heavily than others at each decoding step.
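A minimal dot-product attention sketch, assuming the usual (batch, length, hidden) layout for the encoder outputs and a single decoder hidden state:

```python
import torch
import torch.nn.functional as F

encoder_outputs = torch.randn(1, 7, 64)   # (batch, src_len, hidden)
decoder_hidden = torch.randn(1, 1, 64)    # (batch, 1, hidden)

# score each input position against the decoder state, then normalize
scores = torch.matmul(decoder_hidden, encoder_outputs.transpose(1, 2))  # (1, 1, 7)
weights = F.softmax(scores, dim=-1)       # one weight per input position

# weighted sum of the encoder outputs: the "context" the decoder attends to
context = torch.matmul(weights, encoder_outputs)                         # (1, 1, 64)
```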
This is just to get started with PyTorch. I recommend using Google Colab.
Make sure to read over the code and understand what it’s doing. Some of the coding standards used aren’t ideal. For instance, at one point torch.bmm is used, where torch.matmul is the most general way of multiplying tensors (it works with vectors, 2d matrices, batched 3d tensors, etc.).
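For example (toy shapes chosen arbitrarily), the two calls agree on batched 3d tensors, but only torch.matmul handles the other cases:

```python
import torch

a = torch.randn(4, 2, 3)
b = torch.randn(4, 3, 5)

# Both produce the same (4, 2, 5) result for batched 3d tensors...
out_bmm = torch.bmm(a, b)
out_matmul = torch.matmul(a, b)

# ...but matmul also handles vectors, 2d matrices and broadcasting,
# while bmm only accepts 3d inputs.
v = torch.randn(3)
m = torch.randn(3, 5)
torch.matmul(v, m)        # fine: result has shape (5,)
# torch.bmm(v, m)         # would raise an error
```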