Injecting Stochastic Noise during Dialogue Generation.

Discussion around a new generative model trained to produce meaningful dialogue utterances using a latent variable at the decoder level.

Posted by louishenrifranc on July 30, 2017

In this blog post, we'll review a recent paper, A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues by Serban et al. [1]. This article introduces a new generative model called VHRED, trained to produce meaningful conversation utterances using a latent variable \(z\) at the decoder. As marginalizing the joint distribution \(p(z, x)\), where \(x\) is our data, is intractable, they resort to variational inference. As I'm currently learning about variational inference, I thought it was a good time to explain the mathematical part of this paper.

Nota Bene: The following part requires neither knowledge of variational inference nor having read the VHRED paper, but if you have some extra time, I'd start with this fabulous introduction to VI and then go quickly over the paper :).

VHRED is an extension of HRED, the hierarchical recurrent encoder-decoder [2], which belongs to the family of sequence-to-sequence models. For dialogue generation, instead of generating an answer given only a question, the model is conditioned on all the previous utterances in the conversation (questions and answers). Each utterance is first encoded by a recurrent neural network (RNN), which produces an output vector. Each of these sentence vectors is then passed to a second RNN, which encodes the whole dialogue as a single vector. The decoder's task is to generate a response conditioned on this vector.
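
To fix ideas, here is a rough PyTorch sketch of that hierarchy. The class and variable names are mine, the sizes are arbitrary, and practical details such as padding and the teacher-forcing shift are ignored:

```python
import torch
import torch.nn as nn

class HREDSketch(nn.Module):
    """Rough sketch of the HRED hierarchy: an utterance-level encoder,
    a dialogue-level context encoder, and a decoder conditioned on the
    resulting context vector. Illustrative only."""

    def __init__(self, vocab_size=10000, emb=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.utterance_enc = nn.GRU(emb, hidden, batch_first=True)
        self.context_enc = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(emb + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, utterances, response):
        # utterances: (batch, n_utterances, n_words) word ids of the history
        # response:   (batch, n_words) word ids of the reply to predict
        b, n, m = utterances.shape
        # 1) Encode each utterance into a single vector.
        flat = self.embed(utterances.view(b * n, m))
        _, utt_vecs = self.utterance_enc(flat)        # (1, b*n, hidden)
        utt_vecs = utt_vecs.view(b, n, -1)
        # 2) Encode the sequence of utterance vectors as one dialogue vector.
        _, context = self.context_enc(utt_vecs)       # (1, b, hidden)
        context = context.squeeze(0)
        # 3) Decode the response, feeding the context vector at every step.
        emb = self.embed(response)
        ctx = context.unsqueeze(1).expand(-1, emb.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([emb, ctx], dim=-1))
        return self.out(dec_out)                      # logits for each word
```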

The model is trained to maximize the likelihood of the data via stochastic gradient descent. This likelihood is modeled as a probability distribution over the outputs of the recurrent neural network:
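
For a dialogue of \(N\) utterances \(w_1, \dots, w_N\), where utterance \(w_i\) contains the words \(x_{i,1}, \dots, x_{i,M_i}\) (the double-index notation is mine), the likelihood factorizes utterance by utterance and word by word:

\[
p_\theta(w_1, \dots, w_N) = \prod_{i=1}^{N} p_\theta(w_i \mid w_1, \dots, w_{i-1}) = \prod_{i=1}^{N} \prod_{m=1}^{M_i} p_\theta(x_{i,m} \mid x_{i,1}, \dots, x_{i,m-1}, w_1, \dots, w_{i-1})
\]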

After it has been trained, we can use our generative model to produce some utterances. To do so, we generate a new sentence word by word, conditioning each word selection on all the previously sampled words. Usually, people use different techniques to sample the next word: greedy sampling or stochastic sampling. These methods only act on the output distribution of the model. VHRED tries to remedy this limitation by adding a latent variable which conditions the whole decoding process.
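
To make the distinction concrete, here is a minimal sketch of the two sampling strategies applied to a decoder's output logits over the vocabulary (plain NumPy, made-up logits; this is not the paper's code):

```python
import numpy as np

def greedy_sample(logits):
    """Greedy sampling: always pick the most probable next word."""
    return int(np.argmax(logits))

def stochastic_sample(logits, temperature=1.0):
    """Stochastic sampling: draw the next word from the softmax
    distribution, optionally sharpened/flattened by a temperature."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Toy example: logits over a 5-word vocabulary.
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
print(greedy_sample(logits))        # always index 0
print(stochastic_sample(logits))    # usually 0, sometimes another index
```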

For simplicity, I'll follow the notation of the paper as much as possible, dropping some flourishes for clarity.

Let's start by defining our data: \((w_1, w_2, \dots, w_N)\) represents a list of sequences (the utterances of one dialogue), where each sequence contains a series of words \((x_1, x_2, \dots, x_M)\). \(N\) is the number of sequences, and \(M\) is the number of words in a given sequence.

Our goal is to train a model that maximizes the probability of our list of sequences. If we have \(S\) lists of sequences, we will maximize:
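
Writing \(w^{(s)}\) for the \(s\)-th list of sequences (the superscript is my own notation, not the paper's), this objective is:

\[
\prod_{s=1}^{S} p\big(w_1^{(s)}, w_2^{(s)}, \dots, w_N^{(s)}\big)
\]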

Here, we write the data distribution as a product of probabilities because we assume that every list of sequences was generated independently from the same distribution.

As we don't have access to the exact distribution, we will instead fit a parameterized distribution \(p_{\theta}\) represented by our neural network.

The logarithm of a product of probabilities simplifies to the sum of the logarithms of the individual probabilities (maximizing a sum of logs is also numerically more stable):
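
\[
\log \prod_{s=1}^{S} p_\theta\big(w_1^{(s)}, \dots, w_N^{(s)}\big) = \sum_{s=1}^{S} \log p_\theta\big(w_1^{(s)}, \dots, w_N^{(s)}\big)
\]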

Let’s try to decompose the term inside the sum:
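
Dropping the superscript \((s)\) for readability, factorizing the dialogue utterance by utterance, and then writing each factor as a marginal over a latent variable \(z_i\) (more on \(z\) in a moment), the term becomes:

\[
\log p_\theta(w_1, \dots, w_N) = \sum_{i=1}^{N} \log p_\theta(w_i \mid w_1, \dots, w_{i-1}) = \sum_{i=1}^{N} \log \sum_{z_i} p_\theta(w_i, z_i \mid w_1, \dots, w_{i-1})
\]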

Wait... what is this \(z\)? In general, we can recover the marginal distribution of a random variable by summing the joint distribution of this variable and a new random variable over all states of the new variable. In our case, we will refer to \(z\) as the latent variable, and to \(Z\) as the number of states of this variable.
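
In symbols:

\[
p(x) = \sum_{z=1}^{Z} p(x, z)
\]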

We have arrived at a more complicated formula that we can no longer optimize directly. If \(z\) is continuous, then \(z\) has an infinite number of states, and we have just made the optimization much harder, as we would have to deal with an infinite sum (an integral). Also, we can't simply push the log inside the sum, for two reasons. From the calculus perspective, it is just incorrect, and from a probabilistic viewpoint, it would mean maximizing the joint probability of the list of sequences with every possible state of \(z\). This essentially means that every latent value \(z_i\) would need to do a good job of explaining the data. Bad!

One of the most common tricks in calculus is to extend an expression with extra terms that leave its value unchanged (like adding and subtracting 1, or multiplying and dividing by the same quantity).

Let's say for now that \(q_{\beta}(z_i \mid w_1, \dots, w_i)\) is a function which computes the probability of every state of the latent variable at decoding time step \(i\), given the previous utterances and the current one being decoded.
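
Multiplying and dividing the term inside the sum by \(q_\beta\), and writing \(w_{<i}\) for \(w_1, \dots, w_{i-1}\) and \(w_{\le i}\) for \(w_1, \dots, w_i\) (a shorthand of mine), each term of the sum becomes an expectation under \(q_\beta\):

\[
\log \sum_{z_i} p_\theta(w_i, z_i \mid w_{<i})
= \log \sum_{z_i} q_\beta(z_i \mid w_{\le i}) \, \frac{p_\theta(w_i, z_i \mid w_{<i})}{q_\beta(z_i \mid w_{\le i})}
= \log \mathbb{E}_{q_\beta(z_i \mid w_{\le i})}\!\left[ \frac{p_\theta(w_i, z_i \mid w_{<i})}{q_\beta(z_i \mid w_{\le i})} \right]
\]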

Now, because this expression is intractable, we will settle for a lower bound on it using Jensen's inequality. This relation states that given a concave function \(f\), \(f(\mathbb{E}_q[x]) \ge \mathbb{E}_q[f(x)]\).

Applying this with \(f = \log\) (which is concave) to our expression, we find that:
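
\[
\log \mathbb{E}_{q_\beta(z_i \mid w_{\le i})}\!\left[ \frac{p_\theta(w_i, z_i \mid w_{<i})}{q_\beta(z_i \mid w_{\le i})} \right]
\;\ge\;
\mathbb{E}_{q_\beta(z_i \mid w_{\le i})}\!\left[ \log \frac{p_\theta(w_i, z_i \mid w_{<i})}{q_\beta(z_i \mid w_{\le i})} \right]
\]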

We can now decompose our fraction into two terms:
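
Using \(p_\theta(w_i, z_i \mid w_{<i}) = p_\theta(w_i \mid z_i, w_{<i}) \, p_\theta(z_i \mid w_{<i})\), the expectation splits into:

\[
\mathbb{E}_{q_\beta(z_i \mid w_{\le i})}\!\left[ \log \frac{p_\theta(w_i, z_i \mid w_{<i})}{q_\beta(z_i \mid w_{\le i})} \right]
= \mathbb{E}_{q_\beta(z_i \mid w_{\le i})}\!\left[ \log p_\theta(w_i \mid z_i, w_{<i}) \right]
- \mathrm{KL}\!\left[ q_\beta(z_i \mid w_{\le i}) \,\|\, p_\theta(z_i \mid w_{<i}) \right]
\]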

Yeah, we're getting close! We now have two separate terms. The first term is the conditional likelihood of the next observed utterance given the previous ones and a value of \(z\) sampled from the approximate posterior. The second term is nothing other than the Kullback-Leibler divergence between our approximate posterior over the latent variable and its prior distribution. If you want to read more about KL divergence, this article is a good place to start. In essence, minimizing this KL divergence forces the approximate posterior to stay close to the prior, putting mass on the same points.

We can rearrange our formula to match the expression in [1]:
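
\[
\log p_\theta(w_1, \dots, w_N) \;\ge\; \sum_{i=1}^{N} \Big(
\mathbb{E}_{q_\beta(z_i \mid w_{\le i})}\!\left[ \log p_\theta(w_i \mid z_i, w_{<i}) \right]
- \mathrm{KL}\!\left[ q_\beta(z_i \mid w_{\le i}) \,\|\, p_\theta(z_i \mid w_{<i}) \right]
\Big)
\]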

We have found a lower bound on the log-likelihood of a single list of sequences \((w_1, \dots, w_N)\). Because we are using neural networks, the approximate posterior distribution will also be parameterized by a neural network.
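
In practice, [1] chooses both the prior \(p_\theta(z_i \mid w_{<i})\) and the approximate posterior \(q_\beta(z_i \mid w_{\le i})\) to be multivariate Gaussians with diagonal covariance, each produced by a neural network, so the KL term has a closed form and the expectation is estimated with a single reparameterized sample. A minimal sketch of that computation (plain NumPy; the function names and tensor shapes are mine, not the paper's):

```python
import numpy as np

def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL[ N(mu_q, var_q) || N(mu_p, var_p) ] for diagonal
    Gaussians, summed over the latent dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def sample_latent(mu_q, logvar_q):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    so gradients can flow through mu and sigma."""
    eps = np.random.randn(*mu_q.shape)
    return mu_q + np.exp(0.5 * logvar_q) * eps

# One term of the bound for utterance i (all tensors below are made up):
# mu_q, logvar_q would come from the posterior network q_beta(z_i | w_<=i),
# mu_p, logvar_p from the prior network p_theta(z_i | w_<i).
latent_dim = 16
mu_q, logvar_q = np.random.randn(latent_dim), np.zeros(latent_dim)
mu_p, logvar_p = np.zeros(latent_dim), np.zeros(latent_dim)

z_i = sample_latent(mu_q, logvar_q)          # fed to the decoder RNN
kl_term = diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
# reconstruction_term = log p_theta(w_i | z_i, w_<i), given by the decoder
# loss_for_utterance_i = -(reconstruction_term - kl_term)
```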

That's it for the derivation. Now you might ask: how do we compute the posterior at generation time?

Well, to my understanding, this is where the math becomes vague. In the paper, during training, they optimize the lower bound of the data likelihood, approximating the posterior of the latent variable given the previous and current utterances. During generation, however, they sample the latent variable conditioned only on the previous utterances. I don't know why they did so, and it seems mathematically inconsistent to me. I guess it was just one of those moments in deep learning research when "it just worked better."

Bibliography

  1. Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, Yoshua Bengio. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues.

  2. Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, Joelle Pineau. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models.