On recent progress in dialogue generation


Posted by louishenrifranc on June 3, 2017


At True Ai, we apply state-of-the-art machine learning models to improve our product, which helps customer-support agents answer questions faster and better. As I recently started as an intern, I’ve spent some of my free time reading the literature. Here is a non-exhaustive summary of NLP papers I found relevant.

This article is quite a mess. I tried to keep the same structure for every review. Hopefully you’ll find something useful in this jungle.

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Problems

Evaluating the quality of answers generated by a seq2seq model is an open problem. The paper compares different metrics and different models used in the context of dialogue systems.

Solutions

No real solution is given to this problem, only a review of the metrics used.
Here are the word-based metrics compared:

  • BLEU
  • METEOR
  • ROUGE

Here are the embedding-based metrics compared:

  • Mean of word embeddings
  • Mean combined with TF-IDF (this one was not presented in the paper, but it is currently state of the art; see here)
  • Greedy matching: computes the average cosine similarity between pairs of words in the two sentences, with pairs matched greedily
  • Vector extrema: for each dimension of the word embeddings, take the most extreme value across the sentence to get the sentence embedding (both are sketched below)
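
To make the last two metrics concrete, here is a minimal numpy sketch; `emb(token)` is a hypothetical lookup returning a word vector (e.g. word2vec), and the greedy matching shown averages over one direction only.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def vector_extrema(tokens, emb):
    # For each embedding dimension, keep the most extreme (largest-magnitude)
    # value across the sentence; the result is the sentence embedding.
    vecs = np.stack([emb(t) for t in tokens])            # (sentence_len, d)
    idx = np.abs(vecs).argmax(axis=0)                    # (d,)
    return vecs[idx, np.arange(vecs.shape[1])]

def greedy_match(ref_tokens, hyp_tokens, emb):
    # Average, over reference words, of the best cosine match in the hypothesis
    # (the metric is usually averaged over both directions; one direction shown).
    scores = [max(cosine(emb(r), emb(h)) for h in hyp_tokens) for r in ref_tokens]
    return float(np.mean(scores))
```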

Here are the retrieval-based models compared:

  • TF-IDF: every context and response is encoded as a TF-IDF vector, and responses are compared and ranked using cosine similarity with the context.
  • Dual encoder: two RNNs encode the question and the answer respectively, each outputting a final embedding. Then the probability that a reply \(a\) is a good answer to a question \(q\) is computed as \(p(a|q) = \sigma(q^T M a + b)\) (sketched below). The model is trained with negative sampling (“word2vec loss”).
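
A tiny sketch of the dual-encoder scoring step, assuming the two RNN encoders have already produced the final embeddings `q` and `a`; the trained parameters `M` and `b` are taken as given.

```python
import numpy as np

def dual_encoder_score(q, a, M, b):
    """Probability that answer embedding `a` is a good reply to question embedding `q`."""
    return 1.0 / (1.0 + np.exp(-(q @ M @ a + b)))

def rank_answers(q, answer_embeddings, M, b):
    # Rank candidate answers by their score against the question, best first.
    scores = [dual_encoder_score(q, a, M, b) for a in answer_embeddings]
    return np.argsort(scores)[::-1]
```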

Here are the generative models compared:

  • LSTM: simple LSTM model
  • HRED: See my review on it

Results

  • Dataset: Twitter corpus and Ubuntu Dialogue Corpus
  • BLEU-3 has the best correlation with human judgment on dialogue answer generation, better than METEOR, ROUGE, and all embedding-based metrics.
  • The best performance of HRED is obtained when it is used with the mean of word embeddings.
  • HRED is better than the LSTM every time on both the Twitter and Ubuntu corpora.
  • TF-IDF does not perform well; the dual encoder is better.
  • Embedding metrics: the mean of word embeddings is good, vector extrema is bad.

Mutual Information and Diverse Decoding Improve Neural Machine Translation

Problems

  1. Vanilla seq2seq ignores the probability \(p(question|answer)\). In the context of translation, it has been shown that this information is relevant to the quality of the answer (see here). There, they came up with a formula computing the best answer using \(p(q|a)\), derived from Maximum Mutual Information and Bayes’ theorem. However, as mentioned in the paper, it is impossible to compute this distribution directly in a sequential model like seq2seq, because you don’t know the answer until you’ve decoded it.
  2. Beam search usually returns very similar answers.

Solution

Problem 1

They trained two seq2seq models asynchronously:

  • The first one, \(S2S_f\), returns a list of potential answers, each selected by a variant of beam search (see Problem 2). It allows them to compute the log-likelihood of an answer given the question.
  • The second seq2seq model, \(S2S_b\), is trained to predict the question that would have generated an answer. It can be seen as a reversed seq2seq model and allows them to compute \(\log p(x|y)\) for every \(y\) in the beam-search list. This score is added to the final objective (multiplied by a regularization weight) and is used to rank predicted answers.
  • Finally, they also added a reward \(\nu L_t\) for the length \(L_t\) of the sequence.

The final objective is to maximize the sum of these three terms; a small sketch of the resulting reranking follows.
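
A sketch of how the reranking could look once the beam is produced; `forward_logp` and `backward_logp` are hypothetical wrappers around the two trained models, and the weights `lam` and `nu` are placeholders for the paper’s hyperparameters.

```python
def rerank(question, candidates, forward_logp, backward_logp, lam=0.5, nu=0.1):
    # Score(a) = log p(a|q) + lambda * log p(q|a) + nu * |a|, maximized over the beam.
    def score(answer):
        return (forward_logp(question, answer)           # log p(a|q) from S2S_f
                + lam * backward_logp(answer, question)  # log p(q|a) from S2S_b
                + nu * len(answer))                      # length reward
    return max(candidates, key=score)
```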

Problem 2

For this second issue, the classic beam search formulation is slightly transformed. During decoding, at each time-step \(t\), a penalty term \(-\lambda\, rank_t\) is added to the log-probability, where \(rank_t\) is the rank of the candidate token among the decoder’s top predictions within its beam. So if the beam size is \(K\), a re-ranking is done on the top predictions of the decoder for every list \([y_1^1, \dots, y_{t-1}^1], [y_1^2, \dots, y_{t-1}^2], \dots, [y_1^K, \dots, y_{t-1}^K]\). Each prediction then adds the negative of its rank, \(-\lambda\, rank_t\), to its score. Because the goal is to maximize, low-ranked predictions are penalized.
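
A minimal sketch of one rank-penalized decoding step, assuming log-probabilities for the next token are already available for each of the \(K\) beams; bookkeeping such as end-of-sequence handling is omitted.

```python
import numpy as np

def diverse_beam_step(beam_scores, token_logprobs, lam=0.5, K=5):
    """One decoding step; beam_scores: (K,), token_logprobs: (K, vocab) next-token log-probs."""
    candidates = []
    for b in range(K):
        top = np.argsort(-token_logprobs[b])[:K]             # top-K tokens within beam b
        for rank, tok in enumerate(top, start=1):
            # Add the intra-beam rank penalty -lambda * rank to the running score.
            score = beam_scores[b] + token_logprobs[b, tok] - lam * rank
            candidates.append((score, b, int(tok)))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:K]                                     # (score, parent beam, token) triples
```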

Results

  • Dataset: English-German dataset
  • Got a better BLEU score (~+4 BLEU between the standard seq2seq and the model trained with the new loss function)
  • For translation tasks, the average number of distinct unigrams and bigrams in the returned list of sequences roughly doubled (from 0.5 to 1.0 for unigrams, and from 1.5 to 2.8 for bigrams)

Sequence to Backward and Forward Sequences: A Content-Introducing Approach to Generative Short-Text Conversation

Problems

Answers generated by seq2seq models are too universally relevant and usually too “safe” (Yes, No, …).

Solutions

  1. For a given question, defined as a list of tokens \(q = (q_1, q_2, \dots, q_{|Q|})\), they computed the pointwise mutual information between the words of the question and every other word \(w_r\) in the decoder vocabulary. From this comparison, they selected the word \(w_r = \arg\max_{w_r} \sum_{i=1}^{|Q|}\log \frac{p(q_i|w_r)}{p(q_i)}\) (see the sketch after this list). This word can be seen as the most important word in the answer.
  2. Then they used two seq2seq models; let’s call them \(S_b\) and \(S_f\). Starting from the word \(w_r\), \(S_b\)’s decoder generates, in reverse order, what comes in the sentence before \(w_r\). \(S_f\)’s decoder, given the sequence generated so far, predicts what comes after \(w_r\). To summarize, \(S_b\) predicts the words before \(w_r\) and \(S_f\) those after. Both seq2seq models see and encode the question.
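
A sketch of the keyword-selection step, assuming hypothetical precomputed corpus statistics `p_word[w]` \(\approx p(w)\) and `p_cond[(w, w_r)]` \(\approx p(w|w_r)\).

```python
import math

def pick_keyword(question_tokens, vocab, p_word, p_cond):
    # PMI(q_i, w_r) = log p(q_i|w_r) / p(q_i), summed over the question words.
    def pmi_sum(w_r):
        return sum(math.log(p_cond[(q, w_r)] / p_word[q]) for q in question_tokens)
    return max(vocab, key=pmi_sum)
```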

Results

  • Dataset: Human conversation from Baidu Tieba forum
  • Vanilla seq2seq generates longer sentences (5.61 versus 5.3 words)
  • Human evaluation gives better score to sequence generated using their method (0.67 versus 0.58)
  • Entropy for the answer is greater for their model than vanilla seq2seq.

A Simple, Fast Diverse Decoding Algorithm for Neural Generation

Problems

Not a lot of diversity in neural generation. Beam search gives a small list of redundant predictions compared to all the possibilities that could have been explored.

Solutions

See the solution to Problem 2 of Mutual Information and Diverse Decoding Improve Neural Machine Translation. It is the same global idea, but in this paper they made \(\lambda\) a non-hyperparameter by using a policy gradient algorithm to train a policy that predicts a good \(\lambda\) for each input.

Results

  • Dataset: OpenSubtitles
  • They used the BLEU score. Having a reinforcement learning policy predict \(\lambda\) seems to give slightly better results than fixing \(\lambda\). The greatest difference between the two techniques across all experiments is 6% more BLEU (baseline 1.88, no RL: 2.21, with RL: 2.32).

Neural Conversational Model with Mutual Information Ranking

Problems

The “I don’t know” problem and de-prioritizing over-generic responses

Solutions

They trained a seq2seq model with soft attention and a loss function for re-ranking beam search results, based on the Mutual Information formula. After deriving the maths, they arrive at \(r_i = \arg\max_r \, P(r|s_i)\big(\ln P(r|s_i) - \ln P(r)\big)\), which is different from A Diversity-Promoting Objective Function for Neural Conversation Models. The first two terms are computed while decoding; the last one is precomputed using a larger dataset.

Results

  • Dataset: Ubuntu
  • Well, not a good paper, and badly written. The results and experiments are not the ones usually expected. They gave some samples, and the model seems to produce more varied answers.

A Diversity-Promoting Objective Function for Neural Conversation Model

Problems

  1. the “I don’t know” problem and de-prioritizing over-generic responses
  2. Introducing another metric than perplexity for measuring diversity in generated sentences

Solutions

Problem 1

Starting from the Maximum Mutual Information formula, they derived two models. In an approximate Bayesian framework, they take the marginal probability of an answer into account as a regularization term against generic responses. Here are the two models:

  • MMI-antiLM: the first model consists of a new scoring function used during beam search: \(\log p(A|Q) - \lambda U(A)\), where \(U(A)\) is an anti-language-model term applied only while \(|A| < k\), with \(k\) chosen equal to 5.
  • MMI-bidi: the second model runs beam search until the end, then reranks answers using the scoring function \((1 - \lambda) \log p(A|Q) + \lambda \log p(Q|A)\). The first term is the standard objective; the second one, well, I don’t know how they computed it.

Problem 2

They also introduced new metrics for measuring diversity: distinct-1 and distinct-2, respectively the number of distinct unigrams and bigrams in the generated sentences, divided by the total number of generated tokens (see the sketch below).
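
A minimal sketch of the metric as described above, on tokenized responses.

```python
def distinct_n(responses, n):
    """distinct-n: number of distinct n-grams divided by the total number of generated tokens."""
    ngrams, total_tokens = set(), 0
    for tokens in responses:                 # each response is a list of tokens
        total_tokens += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total_tokens, 1)

# distinct_1 = distinct_n(generated, 1); distinct_2 = distinct_n(generated, 2)
```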

A practical approach to dialogue response generation in closed domains

Problems

  1. Each agent working at Amazon has a list of blurbs they can copy and paste to answer client questions. However, there are a lot of these, and different agents tend to favor different ones. Hence it would be good to have a list of templated answers based on the agents’ history.
  2. Now that we have a proper list of templated answers for every agent, the goal is to find the best one given a client query.

Solutions

Problem 1

It isn’t answered in the paper, but the question is at least raised.

Problem 2

  1. Train a dual-encoder network to encode a question \(q\) and an answer \(a\). Then rank candidate responses using the score function \(\sigma(q^T M a + b)\). The model is trained to predict the best answer using negative sampling (the ratio of positive to negative pairs for training is 1:2).
  2. Sample 400K answers from a dataset and compute a reply embedding for each by retrieving the output of the trained dual encoder. Cluster the embeddings into 500 clusters using mini-batch K-means with K-means++ initialization (see the sketch after this list). Each cluster is assigned a template answer: the reply whose embedding is closest to the centroid. From these initial 500 groups, human evaluators picked 200.
  3. During inference, each of the 200 templated answers is compared with the question using the dual encoder. The answers with the top-k highest probabilities are returned.
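
A sketch of the template-mining step with scikit-learn, assuming `reply_embeddings` holds the dual-encoder reply embeddings and `replies` the matching texts; parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def mine_templates(reply_embeddings, replies, n_clusters=500):
    """Cluster reply embeddings and keep, per cluster, the reply closest to the centroid."""
    km = MiniBatchKMeans(n_clusters=n_clusters, init="k-means++").fit(reply_embeddings)
    templates = []
    for c in range(n_clusters):
        dists = np.linalg.norm(reply_embeddings - km.cluster_centers_[c], axis=1)
        dists[km.labels_ != c] = np.inf      # only consider replies assigned to cluster c
        templates.append(replies[int(dists.argmin())])
    return templates
```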

Results

  • Dataset: Amazon internal item delivery issue
  • Response suggestions are more relevant than TF-IDF methods (Human evaluation).
  • Out of 100 questions, the model found a very relevant answer in its top 3 for 48 of them; TF-IDF only for 31.
  • Precision@3: 80% for their model, 65% for TF-IDF.

A Deep Reinforced Model for Abstractive Summarization

Problems

  1. Attentional RNN-based encoder-decoder models for abstractive summarization have achieved good performance on short input and output sequences. However, for longer documents these models perform badly and often produce repetitive phrases. Also, models trained only with maximum likelihood (teacher forcing) often exhibit “exposure bias”, because the ground truth is provided during training but not at test time.

Solutions

The model is based on the standard seq2seq encoder-decoder probabilistic model. The “vanilla way” of doing attention is to compute an attention vector over the encoder at each decoding step \(t\) (vector \(c_t^e\)). However, this form of attention does not prevent the model from generating repetitive phrases in the summary.
The proposed solution is intra-decoder attention, introducing a new attention vector \(c_t^d\) for each decoding step. This attention mechanism looks at the previous hidden states \(h_{t'}\), \(t' < t\), of the decoded sequence. Here are the formulas (a small sketch follows the list):

  • Attention scores: \(e_{tt'} = h_t W_{att}^{dec} h_{t'}\),
  • Weighting coefficients: \(\alpha_{tt'} = \frac{\exp(e_{tt'})}{\sum_{j=1}^{t-1} \exp(e_{tj})}\),
  • Intra-decoder context vector: \(c_t^d = \sum_{j=1}^{t-1} \alpha_{tj} h_j\). This context vector is used when generating a new token.
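
A small numpy sketch of this intra-decoder attention at a step \(t > 1\), assuming `h` stacks the decoder hidden states \(h_1, \dots, h_t\).

```python
import numpy as np

def intra_decoder_attention(h, W_att):
    """h: (t, d) decoder hidden states h_1..h_t; returns the context vector c_t^d."""
    h_t, h_prev = h[-1], h[:-1]
    e = h_t @ W_att @ h_prev.T                        # scores e_{t t'} for all t' < t
    alpha = np.exp(e - e.max()); alpha /= alpha.sum() # softmax over previous steps
    return alpha @ h_prev                             # weighted sum of previous states
```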

While generating a new token, they allow tokens from the input sequence to be copied, using a differentiable copying mechanism: at each decoding time-step \(t\), the model can decide to copy a token from the input or use the classical softmax.
The decision variable for copying from the input is \(u_t\). Here is how it is computed: \(u_t = 1\) with probability \(p(u_t = 1) = \sigma(W [h_t \| c_t^e \| c_t^d] + b)\).

  • If \(u_t = 1\), they choose a token from the input, using the already-computed attention weights \(\alpha_t^e\) over the input sequence.
  • Otherwise, if \(u_t = 0\), they use the classical softmax, which takes as input the hidden state \(h_t\), the input-sequence attention vector \(c_t^e\), and the intra-decoder context vector \(c_t^d\). A sketch of the combined distribution follows.
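
A sketch of how the two distributions could be combined, written here as a soft mixture weighted by \(p(u_t = 1)\); the weight matrices and the scatter of encoder attention onto vocabulary ids are illustrative, not the paper’s exact implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_distribution(h_t, c_e, c_d, alpha_e, src_token_ids, W_u, b_u, W_gen, b_gen, vocab_size):
    z = np.concatenate([h_t, c_e, c_d])
    p_copy = sigmoid(W_u @ z + b_u)                     # p(u_t = 1)
    p_gen = np.exp(W_gen @ z + b_gen)
    p_gen /= p_gen.sum()                                # softmax over the output vocabulary
    p_point = np.zeros(vocab_size)
    np.add.at(p_point, src_token_ids, alpha_e)          # scatter encoder attention onto source tokens
    return p_copy * p_point + (1.0 - p_copy) * p_gen    # next-token distribution
```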

Efficient Natural Language Response Suggestion for Smart Reply

Problems

  1. End-to-end models are difficult to train: the model needs to learn the language, but also to do “something with it.”
  2. End-to-end models like seq2seq need to be consistent in a dialogue with a human, maintaining some memory.
  3. While seq2seq models provide a generalized solution to answer generation, training is long and hard.
  4. seq2seq models are generative models, and after generating answers we still need a method to rank them.

This paper is based on the original SmartReply paper. In SmartReply, an initial neural network is trained to predict whether an email will be answered by a short answer. If the network predicts ‘yes’, then a list of possible answers is retrieved from a database using a seq2seq with beam search, trained to maximize the conditional distribution of the reply given the question. To ensure diversity in the predicted answers, a clustering algorithm is applied to all answers so that no duplicate answers are provided. Furthermore, because all answers are labeled, before returning the top 3 answers they ensure there is at least one negative and one positive suggestion.

Solutions… an overview of the model

Instead of using a seq2seq model, they went for a feedforward neural network. The model tries to find \(\arg\max_y P(y|x)\), with \(P(y|x) \approx \frac{P(x, y)}{\sum_{k=1}^K P(x, y_k)}\), where \(K\) is the number of responses sampled to compute the conditional probability. Sampling is done uniformly and is used to reduce the computational bottleneck. The joint distribution \(P(x, y) \propto \exp(S(x,y))\) is computed by a feedforward network. For faster inference, the model doesn’t model the full joint distribution; instead, question and answer are passed as separate inputs to the network, and the two information flows are combined only at the last layer, as a dot product between the two towers’ outputs. It returns a score of how likely \(x\) and \(y\) are to be paired, defined as \(S(h_x,h_y)\).

  • The embeddings: answer and question representations are computed using a combination of feature hashing, n-gram representations, and sums of word embeddings.
  • Loss function: in a batch, every \((x_i, y_i)\) corresponds to a real question-answer pair. For a given \(x_i\), all the other \(y_j, j \neq i\), are used as negative candidates. Training minimizes \(L = -\frac{1}{K} \sum_{i=1}^K \log P(y_i|x_i) = -\frac{1}{K} \sum_{i=1}^K \Big[ S(x_i, y_i) - \log \sum_{j=1}^{K} \exp(S(x_i, y_j)) \Big]\) (a sketch of this batch-negatives loss follows the list).
  • Dealing with email systems, it is usually interesting to encode an email \(x\) as a combination of multiple features. Hence, for a single example, multiple features \(x_i,\ i \in \{1, \dots, |F|\}\), are extracted from it. For this reason, \(|F|\) different models with the same architecture are created, each one returning a single score \(S(x_i, y)\), a response output vector \(h_{y_i}\), and a feature output vector \(h_{x_i}\). Afterwards, the means of the \(h_{x_i}\) and of the \(h_{y_i}\) are computed and fed to another neural network with the same architecture, whose inputs are these mean vectors; its outputs are a single \(h_x\) and the final response vector \(h_y\).
    Both the first networks and the last one are trained jointly, and the resulting final loss is \(L(x, y) + \sum_{i=1}^{|F|} L(x_i, y)\).
  • During training, the approximation of the denominator leads to a biased estimate, because common responses appearing more often in the training data are more likely to be selected as negative samples; as a result, during inference the model often returns long answers. One solution is to include a term \(+\alpha P_{LM}(y)\) in the final score; this term biases the model toward providing more generic answers.
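
A compact PyTorch sketch of the dot-product scorer with in-batch negatives; the tower sizes, activations, and input featurization are illustrative, and the loss uses the batch-softmax formulation discussed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotProductScorer(nn.Module):
    def __init__(self, input_dim, hidden_dim=300):
        super().__init__()
        self.question_tower = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Tanh(),
                                            nn.Linear(hidden_dim, hidden_dim), nn.Tanh())
        self.answer_tower = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Tanh(),
                                          nn.Linear(hidden_dim, hidden_dim), nn.Tanh())

    def forward(self, x, y):
        h_x, h_y = self.question_tower(x), self.answer_tower(y)  # (K, hidden_dim) each
        return h_x @ h_y.t()                                     # S(x_i, y_j) for all pairs in the batch

def batch_softmax_loss(scores):
    # The i-th question's correct answer is the i-th answer; all other answers
    # in the batch act as negatives, so the target is the diagonal.
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)
```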

Results

  • Dataset: Google internal dataset of 300 million question-answer pairs.
  • The joint neural network outperforms the described (independent) one on precision@1 given 100 candidate answers with one correct. But if the batch size is increased during training, the joint network becomes longer to train, and it is then better to use the described independent model.
  • The model is not much better than the classical seq2seq model. However, it is much faster (100%).

End-To-End Memory Networks

Problems

The model from the first Memory Networks paper was hard to train and required supervision of the memory lookups.

Solutions

The problem consists in predicting an answer \(a\) given a context (list of sentences \((x_0, …, x_i)\)), and a query \(q\).
The model is very similar to the original memory network architecture.

First, every sentence embedding is computed as a weighted sum of its word embeddings. For this weighting they went for a crazy scheme that I’m personally not convinced about, but here is the formula: \(l_{jk} = (1 - \frac{j}{J}) - \frac{k}{K}(1 - \frac{2j}{J})\), where \(j\) is the position of the word in the sentence (of length \(J\)), and \(k\) indexes the dimensions of the word embedding (of size \(K\)).

There are two main embedding matrices:

  • \(A\) transforms inputs into embeddings. These embeddings are used to compute the match between the query and the inputs.
  • \(C\) is only used to update the query vector. These embeddings are not used to match inputs against the query.

Now, here are the formulas:

  • \(s_i = \sum_{j=1}^{n_{words}} l_j A[x_{ij}]\), and \(c_i = \sum_{j=1}^{n_{words}} l_j C[x_{ij}]\)
  • The query \(q\) is also transformed into a vector \(u^0\) in the same way as the other sentences.
  • The match between the query vector \(u^0\) and each sentence embedding \(s_i\) is computed as a dot product, followed by a softmax, giving a weight \(p_i\) corresponding to the importance of each input.
  • This weighting is used to create an output vector \(o_t = \sum_i p_i c_i\), which is linearly combined with the previous query vector \(u_{t-1}\) (i.e. \(u_t = W(o_t + u_{t-1})\)).

\(t\) represents the number of times this “memory lookup” is done; it is often referred to as the number of hops.
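
A numpy sketch of a single hop, assuming the sentence memories \(s_i\) and \(c_i\) are already stacked into the matrices `A_emb` and `C_emb`.

```python
import numpy as np

def memory_hop(A_emb, C_emb, u, W):
    """One hop: A_emb, C_emb are (n_sentences, d) memories; u is the current query vector."""
    scores = A_emb @ u                                # match the query with each memory slot
    p = np.exp(scores - scores.max()); p /= p.sum()   # softmax over memory slots
    o = p @ C_emb                                     # weighted sum of output embeddings
    return W @ (o + u)                                # next query vector u_{t+1}
```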

The difference between this paper and the previous Memory Networks paper is that they use a softmax instead of a hard max to retrieve information from the input sentences.

Results

  • Dataset: bAbI
  • MemN2N performs almost as well as the supervised Memory Network. More hops help the model.
  • At each hop, the matrices A and C are shared.
  • Joint training on all tasks helps.

Adversarial Generation of Natural Language

Problems

  1. Training GANs for text generation is hard: in the context of text generation, the generator is asked to generate a fake sequence of tokens, which is then passed to the discriminator. Extracting tokens from the generator usually involves a non-differentiable operation: the arg max.
  2. Generating text from noise is hard

Solutions

Problem 1

Instead of computing the arg max in the generator, the generator stops at the probability distribution (i.e. \(softmax(o_t^T M)\)), where \(o_t\) is the output of an RNN cell at time-step \(t\) and \(M\) is a dictionary matrix. This sequence of distribution vectors is passed to the discriminator, which is also a recurrent neural network. For sentences coming from the true distribution, we pass one-hot vectors instead, each vector serving as an indicator of the observed word in the sample. I suppose they may also have used some feature hashing. At the last time-step, the discriminator is a binary logistic regression determining how likely the sequence is to come from the true distribution. Different losses have been employed (classic GAN, WGAN, LSGAN).
During inference, they always take the arg max of the softmax.

Problem 2

For the encoding of a fake sentence, at each time-step the RNN receives some Gaussian noise as input. To create coherent sequences, the classical RNN probabilistic model conditions every output on the previous input. However, when we want to generate samples from noise, this conditioning does not help the model. For this reason, they added a peephole connection between the output \(o_t\) of the RNN and every gate \((i_{t+1}, o_{t+1}, f_{t+1})\) at the next time-step. I also think they could have used something like teacher forcing instead of always passing noise.

Results

  • Dataset: Chinese poetry
  • Still not better than the MLE approach, but on its way
  • Best results with WGAN with gradient penalty instead of weight clipping.

A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion

Problems

Predicting the next query given the previous ones. Models using query co-occurrence lack the ability to generalize to unseen or rare query sequences.

Solutions

Three RNNs are used in the model.

  • Query-level encoder: the first RNN encodes a query at the word level. The hidden state is initialized to zero. At the last time-step, it returns a vector \(h_q^n\) representing the whole query, where \(n\) indexes the queries in the session.
  • Session-level encoder: this encoder is an RNN which takes the query embeddings as input.
  • Next-query decoder: this RNN generates the next query given the previous ones. Its initial state is a linear transformation of the session encoder output. Instead of using the recurrent state directly to compute the probability distribution, they use a linear transformation taking into account the previous recurrent state and the previously decoded word embedding.

Training is done by maximizing the log-likelihood, using backpropagation through time. A compact sketch of the three RNNs follows.
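
A compact PyTorch sketch of the three RNNs, assuming queries are padded token-id tensors; dimensions and module names are illustrative.

```python
import torch
import torch.nn as nn

class HRED(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.query_encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)    # word-level encoder
        self.session_encoder = nn.GRU(hid_dim, hid_dim, batch_first=True)  # query-level encoder
        self.init_decoder = nn.Linear(hid_dim, hid_dim)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)          # next-query decoder
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, queries, next_query):
        # queries: (batch, n_queries, seq_len) token ids; next_query: (batch, seq_len)
        b, n, t = queries.shape
        _, h_q = self.query_encoder(self.embed(queries.reshape(b * n, t)))
        query_vectors = h_q.squeeze(0).reshape(b, n, -1)        # one vector per query
        _, h_s = self.session_encoder(query_vectors)            # session summary
        h0 = torch.tanh(self.init_decoder(h_s))                 # initial decoder state
        dec_out, _ = self.decoder(self.embed(next_query), h0)
        return self.out(dec_out)                                # logits for the next query's tokens
```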

Side Note: This paper has been extended to dialogue in this article Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models

Results

  • Dataset: AOL search log dataset, 16M searches from 600,000 different users
  • State of the art results

Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks

Problems

This paper tackles the problem of structured output prediction, in the specific case where the output is a sequence and we represent the sequence as a directed graphical model that generates it from the first token to the last. The paper starts from the observation that training such models by maximum likelihood does not reflect well how the model is used at test time. Indeed, ML training implies that the model is effectively trained to predict each token conditioned on the previous tokens from the ground-truth sequence (this is known as “teacher forcing”). At test time, however, the model generates a sequence by producing tokens one after another, conditioning on its own predicted tokens instead.

Solutions

A different training procedure is proposed, where at training time each conditioning ground-truth token is sometimes replaced by the model’s previous prediction. The choice of replacing the ground truth by the model’s prediction is made by “flipping a coin” with some probability, independently for each token. Importantly, the authors propose to start with a high probability of using the ground truth (i.e. start close to ML) and anneal that probability closer to 0 according to some schedule (thus the name Scheduled Sampling). A sketch of the sampling step is given below.
Another reason Scheduled Sampling helps is that ML training does not inform the model about the relative quality of the errors it can make: under ML, it is as bad to put high probability on an output sequence that has just one wrong token as it is to put the same amount of probability on a sequence that has all tokens wrong.
There is still a problem: the current algorithm ignores the fact that, by sometimes taking its previous prediction as input, it induces an additional relationship between the model’s parameters and its ultimate prediction, a relationship that isn’t considered during training.
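
A minimal sketch of the per-token coin flip, with `eps` the (annealed) probability of keeping the ground truth.

```python
import random

def choose_previous_token(ground_truth_prev, model_prev_prediction, eps):
    """eps = probability of conditioning on the ground truth; annealed toward 0 over training."""
    return ground_truth_prev if random.random() < eps else model_prev_prediction

# A typical schedule anneals eps (e.g. linearly or with inverse-sigmoid decay), so training
# starts close to teacher forcing and ends close to free-running decoding.
```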

Convolutional Sequence to Sequence Learning

Problems

Training RNNs is hard, slow, and not parallelizable across time-steps. CNNs have been applied with success to a wide range of applications, but not to sequence-to-sequence modeling.

Solutions… an overview of the model

  • A convolutional neural network does not depend on the computation of the previous time-step, so it parallelizes well over the sequence.
  • Stacking convolutional layers allows nearby input elements to interact at the bottom layers, while long-term dependencies interact at higher layers. The hierarchical structure provides a shorter path for long-term dependencies. Fixing the number of non-linearities applied to the input also eases learning.

Here is an overview of the model:
As input, the model receives a 2D array where the first dimension is the size of the sequence \(m\) and the second dimension is the embedding size \(d\). During decoding, the already-computed output has the same shape. Furthermore, a positional embedding is added to each row to give the model a sense of which portion of the input or output sequence it is currently dealing with.
A convolution kernel \(W \in \mathbb{R}^{2d \times kd}\), where \(k\) is the kernel size, is used to perform a 1D convolution along the first axis. Hence \(k\) rows are embedded into a single vector \(y \in \mathbb{R}^{2d}\). As the last dimension is doubled, a gated linear unit can be used as the non-linearity (the second half of the vector gates the first half). Furthermore, to enable deep convolutional networks, they added residual connections from the input of each convolution to the output of the block (sketched below).
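
A small PyTorch sketch of one such block (gated linear unit plus residual connection), assuming an odd kernel size so that the padding preserves the sequence length; the paper’s exact padding scheme and weight normalization are omitted.

```python
import torch
import torch.nn as nn

class GLUConvBlock(nn.Module):
    def __init__(self, d, kernel_size):
        super().__init__()
        # Doubling the output channels lets the second half gate the first half.
        self.conv = nn.Conv1d(d, 2 * d, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # x: (batch, d, seq_len)
        y = self.conv(x)                      # (batch, 2d, seq_len)
        a, b = y.chunk(2, dim=1)
        return a * torch.sigmoid(b) + x       # gated linear unit + residual connection
```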

For the encoder network, the output of the convolutional layers matches the input length thanks to padding. For the decoder network, they pad the input by \(k - 1\) elements on both the left and right sides and remove \(k\) elements from the end of the convolution output.

Furthermore, a linear mapping between embedding sizes is applied before the encoder input, at the output of the encoder, to the final layer of the decoder \(h^L\) before the softmax, and to all decoder layers \(h^l\) before computing attention scores.

For each decoder layer, they introduce a separate attention mechanism. At every layer \(l\), for every position \(i\), the intermediate vector \(h_i^l\) is linearly combined with the embedding of the decoder input at position \(i\) (i.e. \(d_i^l = W_d^l h_i^l + b_i + g_i\)). From this, a probability distribution \(\alpha_{ij}^l\) is computed between \(d_i^l\) and all encoder outputs \(z_j\). These weights are used to compute a weighted sum for every intermediate vector, \(c_i^l = \sum_{j=1}^m \alpha_{ij}^l (z_j + e_j)\), where \(e_j\) is the embedding of the \(j\)-th input. It is a more complex attention mechanism than in the seq2seq model because at every decoder layer, for every intermediate vector, a conditioning is done over the encoder input, the encoder output, and the previous decoder input. Once calculated, the attention vector is added to the corresponding decoder layer.

There is a lot of normalization: residual blocks are scaled to keep the variance unchanged. More precisely, to keep the variance bounded during the forward pass, they initialize weight matrices from a Gaussian \(\mathcal{N}\!\left(0, \sqrt{\tfrac{4}{\text{kernel size}}}\right)\).

Results

  • Dataset: WMT English-Romanian, English-French, English-German
  • Slightly better than RNN, but model was highly optimized with extensive HP search
  • From 7 to 21 times faster on GPU/CPU.
  • Removing the source position embeddings results in a larger accuracy decrease than removing the target position embeddings.
  • A 2-layer decoder already performs well; at nine layers there is no further gain in accuracy.

Coherent Dialogue with Attention-based Language Models

Problems

  1. seq2seq models are better suited to converting the same information from one modality to another than to dialogue modeling.
  2. Current attention mechanisms attend over the output of the encoder; what if we extended attention to all the previous outputs of the generated sentence?

Solutions

Problem 1

Their architecture (an RNN language model) is a simple recurrent neural network which takes as input the context plus the question and predicts the next utterance. The model is trained via maximum likelihood (teacher forcing).

Problem 2

  • They created a new attention mechanism model (A-RNN). The attention vector at time-step \(t\) is formulated as a combination of all the previous hidden vectors \(h_{i}, i \le t\), and the input embedding vectors \(r_{i}, i \le t\): scores \(\beta_{ti} = b^T \tanh(W h_{i} + U r_{i})\) are computed, a softmax is applied to get a proper probability distribution \(\alpha_{ti}\), and the attention is finally computed as a convex sum \(z_t = \sum_{i=1}^t \alpha_{ti} r_{i}\). This attention vector is used when predicting the next token (sketched below).
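
A numpy sketch of this dynamic attention step, assuming `H` stacks the hidden states \(h_1, \dots, h_t\) and `R` the corresponding input embeddings; parameter shapes are illustrative.

```python
import numpy as np

def dynamic_attention(H, R, W, U, b):
    """H: (t, d_h) hidden states, R: (t, d_e) input embeddings; returns the attention vector z_t."""
    scores = np.tanh(H @ W.T + R @ U.T) @ b               # beta_{t,i} for i <= t
    alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()
    return alpha @ R                                      # convex sum of the input embeddings
```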

Results

  • Dataset: MovieTriples and Ubuntu TroubleShoot
  • On MovieTriples, lowest perplexity compared to an RNN, seq2seq, and HRED.
  • Compared to classical RNN language modeling, recall@1 and recall@2 are better (~+5%).
  • The distinct-1 measure of diversity is better than for a traditional RNN without dynamic attention.
  • In human evaluation, A-RNN is judged better than the simple RNN at least 90% of the time.

Decoupled Neural Interfaces Using Synthetic Gradients

Problems

Training neural networks is sequential in three ways: each layer has to wait for the previous layer to finish its computation, each layer has to wait for the future layers to send back the backpropagated signal, and backpropagation can only occur after a forward pass has been completed.

Solutions

Every layer becomes a sort of independent module, and it updates its weights based on a gradient computed by a small neural network attached to the next layer. This neural network receives as input the post-activation output \(o_l\) of layer \(l\) and predicts what the partial derivative \(\frac{\partial L}{\partial o_l}\) would be. The paper mentions that the input of the gradient predictor could contain other information than the layer output, but it did not significantly improve the model.
Only the last layer is trained directly from the data.
The whole point of this technique is to allow individual layers to train without waiting for each other to finish the forward and backward passes. You train a layer based on the gradient predicted by the Synthetic Gradient (SG) module of layer \(l+1\), and when the error from the loss is finally backpropagated, you update the SG module and pass the gradient to the previous layer. A sketch of such a module follows.
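
A PyTorch sketch of what a synthetic-gradient module could look like; the hidden size, the zero initialization of the output layer, and the usage note are assumptions about the setup, not the paper’s exact recipe.

```python
import torch
import torch.nn as nn

class SyntheticGradient(nn.Module):
    """Predicts dL/d(activations) from the activations of a layer."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        # Assumed trick: initialize the output layer to zero so early synthetic
        # gradients are zero rather than random noise.
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, activations):
        return self.net(activations)

# Usage sketch: a layer output `h` can be updated immediately with
# h.backward(gradient=sg(h.detach()).detach()), and the SG module itself is later
# regressed (e.g. with an MSE loss) toward the true gradient once it arrives.
```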

Hybrid computing using a neural network with dynamic external memory

Problems

Neural network models lack the ability to access memory. Previous architectures used external memory; the DNC tries to remedy this issue.

Solutions … an overview of the model

The DNC uses a differentiable attention mechanism to define distributions over the \(N\) rows of the memory matrix (of size \(N \times W\)).

Reading and writing

  • \(r\) is a read vector: \(r = \sum_{i=1}^N M[i, \cdot]\, w^r[i]\), where \(w^r\) is a read weighting vector.
  • \(v\) is a write vector, used together with a write weighting \(w^w\) and an erase vector \(e\). Writing updates every cell of the memory: \(M[i, j] \leftarrow M[i, j] (1 - w^w[i] e[j]) + w^w[i] v[j]\).

There are three types of interaction between heads and memory.

  1. Content lookup between a key vector emitted by the controller and all the locations in memory. The similarity score defines a weighting used by the read heads for associative recall, or by the write head for modifying an existing vector in memory.
  2. A temporal link matrix records transitions between consecutively written locations in an \(N \times N\) matrix \(L\); \(L[i, j]\) is close to 1 if location \(i\) was written right after location \(j\).
  3. The usage of each location is represented as a number between 0 and 1, and a weighting that picks unused locations is delivered to the write head. Usage is automatically increased after every write and can be decreased after every read.

The Controller
The controller network \(\mathcal{N}\) receives at each time-step \(t\) a vector \(x_t\) and emits an output vector \(y_t\) that parametrizes a predictive distribution for a target vector \(v_t\). \(\mathcal{N}\) also receives a set of \(R\) read vectors from the memory matrix at the previous time-step via the read heads. It then emits an interface vector \(\epsilon_t\) that defines its interactions with the memory at the current time-step. The controller network can be an LSTM cell, for example, and all inputs (input vector + read vectors) are concatenated.
\(v_t\) and \(\epsilon_t\) are functions of the controller’s hidden vector, and \(y_t = v_t + W[r_t^1, \dots, r_t^R]\).

Before being used to parametrize memory interactions, the \(\epsilon_t\) vector is subdivided into multiple components:

  • R read keys (in \(\mathbb{R}^W\))
  • R read strengths (in \([1, \infty)\))
  • A write key (in \(\mathbb{R}^W\))
  • A write strength (in \([1, \infty)\))
  • An erase vector (in \([0, 1]^W\))
  • A write vector (in \(\mathbb{R}^W\))
  • R free gates (in \([0, 1]\))
  • An allocation gate (in \([0, 1]\))
  • A write gate (in \([0, 1]\))
  • R read modes (in \([0, 1]^3\))

Memory addressing

  • Content-based addressing: \(C(M, k, \beta)[i] = \frac{\exp(\beta\, \mathrm{cossim}(k, M[i,\cdot]))}{\sum_{j=1}^N \exp(\beta\, \mathrm{cossim}(k, M[j,\cdot]))}\), where \(k \in \mathbb{R}^W\) is a lookup key and \(\beta \in [1, \infty)\) is the key strength. (A small numpy sketch of this weighting is given at the end of this section.)

  • Dynamic memory allocation: We need to define a memory usage vector \(u_t \in [0, 1]^N\). To do so, we use the free gates \(f_t^i\) output by the controller and define the memory retention vector \(\sigma_t \in [0, 1]^N\), which represents how much each location will not be freed by the free gates: \(\sigma_t = \prod_{i=1}^R(1 - f_t^i w_{t-1}^{r, i})\). If a row was read from memory at the previous time-step, then the higher its read weighting, the more it becomes a free location for writing (if the free gates decide so). The free gates allow the network to decide whether a row should be kept longer in memory even though it has already been read. Afterwards, the usage vector is defined as \(u_t = (u_{t-1} + w_{t-1}^w - u_{t-1} \cdot w_{t-1}^w) \cdot \sigma_t\). Once \(u_t\) has been determined, we sort its entries and obtain the list of indices in ascending order of usage, \(\phi_t\); the first rows are the least used memory slots. Finally, they define an allocation weighting \(a_t[\phi_t[j]] = (1 - u_t[\phi_t[j]]) \prod_{i=1}^{j-1}u_t[\phi_t[i]]\). If all usages are one, then the controller cannot allocate memory without first freeing used locations.

  • Temporal memory linkage: We first need a precedence weighting \(p_t \in [0, 1]^N\), where \(p_t[i]\) is the degree to which location \(i\) was the last one written to: \(p_t = (1 - \sum_{i=1}^N w_t^w[i])\,p_{t-1} + w_t^w\). Every time a location is modified, the link matrix is updated to remove old links to and from that location:
  • \(L_0[i, j] = 0\ \forall i, j\)
  • \(L_t[i, i] = 0\ \forall i\)
  • \(L_t[i, j] = (1 - w_t^w[i] - w_t^w[j])L_{t-1}[i, j] + w_t^w[i]\, p_{t-1}[j]\). From this matrix, two sets of \(R\) vectors are created: forward weightings \(fo_{t}^i = L_t w_{t-1}^{r, i}\) and backward weightings \(b_t^i = L_t^T w_{t-1}^{r, i}\). Each vector has dimension \(N\); entry \(j\) represents to which degree row \(j\) should be read next, either because we previously read the row written just before it, or the row written just after it.
    All of this is done after computing the allocation weights for writing.

  • To determine where to write in memory: The system uses a combination of content-based addressing and dynamic memory allocation. The controller can write to newly allocated locations, or to locations found by content addressing. First, a write content weighting \(c_t^w = C(M_{t-1}, k_t^w, \beta_t^w)\) is constructed from the write key and the write strength. This weighting is interpolated with the allocation weighting (i.e. \(w_t^w = g_t^w[g_t^a a_t + (1 - g_t^a) c_t^w]\)) to determine the write weighting \(w_t^w\), where \(g_t^w\) and \(g_t^a\) are the write and allocation gates. If the write gate is zero, then nothing is written, independently of the other parameters.

  • To determine where to read: The model uses a combination of content-based addressing and temporal memory linkage. First, a content read weighting is created for each read head: \(c_t^{r, i} = C(M_{t-1}, k_t^{r, i}, \beta_t^{r, i})\). Each read head also receives a read mode vector \(\pi_t^i \in [0, 1]^3\), which interpolates among the backward weighting, the content read weighting, and the forward weighting, hence determining the read weighting: \(w_t^{r, i} = \pi_t^i[0]\, b_t^i + \pi_t^i[1]\, c_t^{r, i} + \pi_t^i[2]\, fo_t^i\). If \(\pi_t^i[1]\) dominates, the weighting reverts to content lookup; if \(\pi_t^i[0]\) or \(\pi_t^i[2]\) dominates, the read head iterates through memory in reverse or forward write order, respectively.
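
To close this section, here is a small numpy sketch of the content-based addressing weighting \(C(M, k, \beta)\) defined above.

```python
import numpy as np

def content_addressing(M, k, beta):
    """M: (N, W) memory, k: (W,) lookup key, beta >= 1: key strength; returns a weighting over rows."""
    sims = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)  # cosine similarities
    logits = beta * sims                                                     # sharpened by the key strength
    w = np.exp(logits - logits.max())
    return w / w.sum()                                                       # softmax over the N memory rows
```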