Insu Jeon

Neural Variational Dropout Processes


Humans can generalize well from only a few examples, but conventional neural network (NN) based machine learning approaches require large amounts of training data to learn a new task. A goal of meta-learning is to develop a model that generalizes well to new tasks from only a few examples.

Bayesian Meta-Learning

For efficient model adaptation in meta-learning, we can use a probabilistic objective built from a task-specific likelihood and a conditional posterior over task-specific parameters. Given a collection of T tasks, the evidence lower bound (ELBO) over the multi-task dataset combines, for each task, an expected log-likelihood term with a KL term that regularizes the conditional posterior toward p(ϕ^t).
Here, p(y|x) is the likelihood (or NN model) on the t-th task's training data, q(ϕ^t | D^t; θ) is a tractable conditional posterior over the task-specific variable ϕ^t, and p(ϕ^t) is a prior that regularizes the conditional posterior. θ is the parameter (or structure) shared across tasks. Essentially, the goal is to learn and leverage the common structure θ to approximate the task-conditional posterior for each given task. The learned conditional posterior and shared structure enable fast model adaptation to a new task and also provide uncertainty estimates for the predictions.
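For reference, the bound described above can be written in the standard multi-task variational form. This is a sketch consistent with the symbols defined in this section, not necessarily the exact expression used by the authors; in particular, writing each task dataset as D^t = (x^t, y^t) and conditioning the likelihood on ϕ^t are assumptions.

```latex
\sum_{t=1}^{T} \log p(\mathcal{D}^t; \theta)
\;\ge\;
\sum_{t=1}^{T} \Big\{
  \mathbb{E}_{q(\phi^t \mid \mathcal{D}^t; \theta)}
    \big[ \log p(y^t \mid x^t, \phi^t) \big]
  - \mathrm{KL}\big( q(\phi^t \mid \mathcal{D}^t; \theta) \,\|\, p(\phi^t) \big)
\Big\}
```

The first term rewards task-specific parameters that explain the task's data, and the second keeps the conditional posterior close to the prior.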

Neural Variational Dropout Processes

We propose Neural Variational Dropout Processes (NVDPs), a new model-based Bayesian meta-learning approach that extends Variational Dropout, a probabilistic regularization method for NNs.

Conditional Dropout Posterior

Figure 3: The conditional posterior model based on a low-rank product of Bernoulli experts.
The conditional posterior in NVDPs is based on Bernoulli dropout whose rates are predicted separately for each task. It is defined over the task-specific parameter ϕ^t in terms of the shared structure θ and the task-specific dropout rates P^t.
Here, the shared structure θ is the deterministic parameter of a conventional NN, while the dropout rates P^t are specific to each task. The key idea is to employ an NN-based meta-model g_ψ(⋅) to predict the task-specific dropout rates from the small context set D_C^t.
In this prediction, s(⋅) denotes the sigmoid function, which keeps each predicted rate in (0, 1). The set representation r^t is defined as the mean of the features obtained from each data point in the t-th context set, and it summarizes the task-conditional information. Modeling the conditional posterior through dropout rates can greatly reduce the complexity of the output space, and it also bypasses the under-fitting and posterior-collapse issues of previous model-based Bayesian meta-learning approaches.
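The prediction step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the single tanh feature layer, the three linear heads `a`, `b`, `c`, the factorization of each rate as `s(a_k) * s(b_d) * s(c)` (one way to realize a low-rank product of experts), and all helper names and sizes are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(rng, n_out, n_in):
    """Random dense layer (hypothetical initialization): weights and zero bias."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def set_representation(context, feat_w, feat_b):
    """r^t: mean of per-point features over the context set D_C^t."""
    feats = [[math.tanh(h) for h in linear(feat_w, feat_b, x + y)]
             for x, y in context]
    return [sum(f[j] for f in feats) / len(feats)
            for j in range(len(feats[0]))]


def predict_dropout_rates(r, heads, K, D):
    """Assumed low-rank form: P[k][d] = s(a_k) * s(b_d) * s(c)."""
    a = linear(*heads["a"], r)      # K row factors
    b = linear(*heads["b"], r)      # D column factors
    c = linear(*heads["c"], r)[0]   # one layer-wide factor
    return [[sigmoid(a[k]) * sigmoid(b[d]) * sigmoid(c) for d in range(D)]
            for k in range(K)]


rng = random.Random(0)
K, D, H = 3, 4, 8                       # weight matrix is K x D, H feature units
feat_w, feat_b = init_layer(rng, H, 2)  # each point: 1-D input x and 1-D target y
heads = {"a": init_layer(rng, K, H),
         "b": init_layer(rng, D, H),
         "c": init_layer(rng, 1, H)}

# A toy context set of three (x, y) pairs.
context = [([0.1], [0.3]), ([-0.5], [0.9]), ([0.7], [-0.2])]
r = set_representation(context, feat_w, feat_b)
P = predict_dropout_rates(r, heads, K, D)
```

Because r^t is a mean over data points, the predicted rates do not depend on the order of the context set, and the heads emit only K + D + 1 numbers for a K x D weight matrix instead of K * D.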

Variational Prior

The probabilistic graphical model of NVDPs with the concept of variational prior.
The prior distribution in NVDPs is defined with the same dropout posterior model, but it is conditioned on the whole task dataset instead of the context set. The KL divergence between the conditional posterior and this prior can be derived analytically.
In this KL term, P^t is obtained from the small context set D_C^t, while P̂^t is obtained from the whole task dataset D^t. The analytical KL does not depend on the deterministic shared parameter θ, which ensures consistent optimization of the ELBO objective w.r.t. all independent parameters in the variational inference framework. The overall optimization can be performed with stochastic gradient descent (SGD).
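The independence from θ can be checked numerically. The sketch below assumes the usual Gaussian approximation of Bernoulli dropout, in which a weight θ with dropout rate p has mean (1 - p)θ and variance p(1 - p)θ²; that distributional form, the KL direction (posterior first, prior second), and the function names are assumptions made for illustration.

```python
import math


def gaussian_from_dropout(theta, p):
    """Assumed Gaussian approximation of Bernoulli dropout on weight theta."""
    return (1.0 - p) * theta, p * (1.0 - p) * theta ** 2


def kl_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, var_q) || N(mu_p, var_p)) for univariate Gaussians."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p
                  - 1.0)


def kl_dropout(p_post, p_prior):
    """Same KL after theta**2 cancels: a function of the two rates only."""
    var_q = p_post * (1.0 - p_post)
    var_p = p_prior * (1.0 - p_prior)
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (p_prior - p_post) ** 2) / var_p
                  - 1.0)


p_post, p_prior = 0.3, 0.2   # rate from the context set vs. the whole task set
thetas = (0.5, -2.0, 10.0)   # arbitrary non-zero shared weights

# KL computed with theta kept explicit, once per weight value.
kls_with_theta = [
    kl_gaussians(*gaussian_from_dropout(theta, p_post),
                 *gaussian_from_dropout(theta, p_prior))
    for theta in thetas
]
kl_rates_only = kl_dropout(p_post, p_prior)
```

Every entry of `kls_with_theta` agrees with `kl_rates_only` up to floating-point error, since both the variances and the squared mean difference scale with θ².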


We tested NVDPs on various few-shot learning tasks such as 1D regression, image in-painting, and classification.
Figure 4: 1D few-shot regression results on the GP dataset. The black dashed line represents the true unknown task function. Black dots are the few context points (S = 5) given to the posteriors. The blue lines (and light blue areas, in the learned-variance settings) are the mean values (and variances) predicted by the sampled NNs.
Figure 5: Results of the 2D image in-painting tasks on the MNIST, CelebA, and Omniglot datasets.
Figure 6: Few-shot classification results on the Omniglot and MiniImageNet datasets.


NVDPs extend Variational Dropout to the Bayesian meta-learning framework. They introduce the concept of a variational prior, which can be universally applied to other Bayesian meta-learning approaches. NVDPs can bypass the under-fitting and posterior-collapse behavior of previous model-based conditional posterior approaches and improve likelihood, prediction accuracy, and generalization across various few-shot learning tasks.