Yoshua Bengio, Kolya Malkin, Moksh Jain (2022)

(you have to click on the right-pointing triangle of toggle-able sections to see them if they are hidden)

A GFlowNet is a trained stochastic policy or generative model, trained such that it samples objects  through a sequence of constructive steps, with probability proportional to a reward function , where  is a non-negative integrable function. This makes a GFlowNet able to sample a diversity of solutions  that have a high value of .

It is a stochastic policy because it makes sense to apply it when constructing  can be done through a sequence of steps, which can be thought of as internal actions (they are meant to construct things like thoughts or plans or explanations). That sequential construction is convenient to sample compositional objects, which can often be defined by composing elements in some order, with generally many possible orders leading to the same object.

Figure 1. A constructive trajectory to build a lego object goes through a sequence of steps (each adding a lego block). The same object could be obtained in exponentially many different ways. The state of the GFlowNet describes the object (or partially constructed object) and the trajectories leading into it specify how it could have been built. The forward-sampling policy of the GFlowNet performs one change to the state at a time (e.g. adding one lego block), but we can also define a backward-sampling policy that undoes such changes (e.g. removing a lego block), and they can be combined (e.g., to form a Markov chain by which we can fix our current attempt at a lego boat). When we decide to show our construction, we have reached a terminal state and we may get a reward which, e.g., is higher for boat-looking constructions, and can also be interpreted as exp(-energy), i.e., as an unnormalized probability function. The GFlowNet forward-sampling policy tells us how we could generate the potentially exponentially many boat-looking constructions. An intermediate quantity can be estimated, the flow, and the flow in the initial state corresponds to a free energy or normalizing constant (summing over all values of the reward of attainable terminal states). This flow can be generalized to estimate marginalized quantities such as marginal and conditional probabilities over subsets of random variables, which require in principle summing over exponentially many paths or terminal states.

A neural net can be used to sample each of these forward-going constructive actions, one at a time. An object  is done being constructed when a special “exit” action or a deterministic criterion of the state (e.g.,  has exactly  elements) has been triggered, when we have reached a terminal state , and after which we can get a reward . Seen as a stochastic policy, the GFlowNet has an action space (for deciding what to do at each step) and a state space (for the partially constructed objects). Each sequence of actions  forms a trajectory  that is a sequence of states . There may be many trajectories that lead to the same state  (think about how one can construct a set by inserting elements in any order, or build a graph by pasting graph pieces again in many possible orders, transforming a vector through a sequence of stochastic steps like in Denoising Diffusion, or like in Figure 1, constructing the same lego boat in many different ways). The last state  of a complete trajectory  is an object  that the GFlowNet can sample, and the training objective aims at making it sample  with probability proportional to . Note that [2] discusses how to generalize this to having intermediate rewards and not just at the end of the trajectory.

A GFlowNet is a generative model because after (and during) training we can sample from it, but in its most basic form its training objective is about matching a reward function  rather than fitting a finite dataset (the usual objective for generative models). Since  is not an external quantity (like in typical RL) but an internal quantity (e.g. corresponding to an energy function in a world model), it means that the quality of training of a GFlowNet does not depend on the size of a fixed external training dataset but rather on how much compute is available to query  on many trajectories. Training the GFlowNet can be done by repeatedly querying that function, so we can make our GFlowNet as big a neural net as we can afford computationally, without necessarily worrying about overfitting. Underfitting is the real danger, instead. Going back to the lego construction, we are not trying to find the most boat-looking lego structure, but a way to sample from all the boat-looking structures. Because of the sequential construction of objects  (with an arbitrary number of steps), it is very convenient to generate variable-size structured objects, like sets, graphs, lists, programs or other recursively constructed data structures. Note that a GFlowNet can also define a backward-sampling procedure, i.e., given a constructed object, we could sample a plausible trajectory that could have led to constructing it. The forward-sampling policy is called  below and the backward-sampling policy is called .

Importantly, when the reward function  represents the product of a prior (over some random variable) times a likelihood (measuring how well that choice of value of the random variable fits some data), the GFlowNet will learn to sample from the corresponding Bayesian posterior. This makes GFlowNets amortized probabilistic inference machines that can be used both to sample latent variables (as in [14]) or parameters and theories shared across examples (as in [5]). They can also be used to perform approximate and amortized marginalization (without the need for sampling at run-time), as discussed in this other tutorial, and this is useful for training AI systems that can answer probabilistic questions (e.g., about probabilities or expected values of some variables given others, or to estimate information theoretic quantities like conditional entropies or mutual information).

In terms of neural net architectures, the simplest GFlowNet architecture is one where we have a neural net (as large as we can afford) that outputs a stochastic policy , where  represents a partially constructed object and  is one of the possible actions from . In regular GFlowNets, choosing  from  deterministically yields some , which means that we can also write  for that policy. It makes sense to consider deterministic transitions when the policy is an internal policy, not operating in some stochastic external environment, but instead performing computational choices (like what to attend to, what computation to perform, what memory to retrieve, etc). GFlowNets are motivated by the kind of internal policy which could model cognition, i.e., the sequence of micro-choices about internal computation of a thinking agent corresponding to a form of probabilistic inference.  stands for the “forward” transition probability along the GFlowNet constructive process to generate these structured internal computational constructions. The same neural net is used at every step, but it produces a stochastic output , from which a next state  is obtained (the form of  is application-dependent, e.g., following the rules of chemistry if  is a partially constructed molecule represented as a graph and  puts an atom as a specific new node connected to an existing node in the graph). Since the state is a variable-size object, the neural net better have an appropriate architecture for taking such objects in input (e.g., an RNN, a graph neural net or a transformer). We can thus more generally see the iterated application of the GFlowNet decisions as a particular form of recurrent stochastic net where the hidden recurrent state () is stochastic.

Figure 2: the most basic component of a GFlowNet is a neural network that defines its constructive policy, i.e., how to sample the next state  given the previous state , through the choice of an action . The new state  becomes the input for the next application of the GFlowNet policy denoted  or , until a special “exit” action signals the end of the sequence, that an object  has been generated, and a reward  is provided to encourage or discourage the policy from generating such a solution. The training objective optimizes the parameters of  towards making  be generated with probability proportional to .

Brain sciences show that conscious reasoning involves a sequential process of thought formation, where at each step a competition takes place among possible thought contents (and relevant parts of the brain with expertise on that content), and each thought involves very few symbolic elements (a handful). This is the heart of the Global Workspace Theory (GWT), initiated by Baars (1993,1997) and extended (among others) by Dehaene et al (2011, 2017, 2020) as well as through Graziano’s Attention Schema Theory (2011, 2013, 2017). In addition, it is plausible (and supported by works on the link between Bayesian reasoning and neuroscience) that each such step is stochastic. This suggests that something like a GFlowNet could learn the internal policy that selects that sequence of thoughts. In addition, the work on GFlowNet as amortized inference learners [4,5,7,9 below], both from a Bayesian [7] and variational inference [9] perspectives, would make this kind of computation very useful to achieve probabilistic inference in the brain. GFlowNets can learn a type of System 1 inference machine (corresponding to the amortized approximate inference  in variational inference) that is trained to probabilistically select answers to questions of a particular kind so as to be consistent with System 2 modular knowledge (the “world model”  in variational inference). That System 1 inference machinery  is also crucial to help train the model  (as shown in several of these papers). Unlike typical neural networks, the inference machinery does not need to be trained only from external (real) data, it can take advantage of internally generated (hallucinated) pseudo-data in order to make  consistent with . See also this blog post on model-based machine learning. The real data is of course important to make  (the world-model) compatible with the real world. The kind of sequential probabilistic inference that GFlowNets can perform is a powerful form of learned reasoning machinery, which could be used for interpretation of sensory inputs, interpretation of selected past observations, planning, and counterfactuals (what if an agent had acted differently in the past). The stochastic selection of just a few elements of content (that go into a thought) make GFlowNets a good candidate to implement the “consciousness priors” introduced by Bengio (2017) and elaborated by Goyal and Bengio (2022). In particular, the GWT bottleneck, when applied to such probabilistic inference, would enforce the inductive bias that the graph of dependencies between high-level concepts (that we can manipulate consciously and reason with) is very sparse, in the sense that each factor or energy term only has at most a handful of arguments.

A trained GFlowNet is both a sampler (to generate objects  with probability proportional to ) and an inference machine (it can be used to answer questions and predict probabilities about some variables in  given other variables, marginalizing over the others). But that is with respect to a given unnormalized probability function  or energy function  for that joint probability model. Where would a learning agent get that energy function from?

We are working on several approaches to do that (more details in this section below):

The energy function can be parametrized (say with parameters ) and its parameters learned by maximum likelihood (this is in the realm of energy-based modelling). It turns out that in order to obtain a maximum likelihood gradient on , we need to sample from the distribution over objects  captured by the energy function, and GFlowNets can do that for us. If there are latent variables  involved in addition to the observed  (a much more interesting scenario), then GFlowNets can approximately sample both from  and  which are needed to get this stochastic gradient (and the GFlowNet can also learn to sample from ,  or  if we train it accordingly).

We can view parameters  or the energy function  itself as a latent variable and be Bayesian about them, i.e., let the GFlowNet sample them too! In this case, given  (which indexes energy functions) and  (and optionally latent ), the energy is a fixed function (e.g., an MLP with weights  and inputs ). See [5] for a first paper on using a GFlowNet to sample from the Bayesian posterior over causal graphs. The GFLowNet objective is naturally suited to learn to sample from Bayesian posteriors because the posterior distribution (which would be estimated by a GFlowNet sampler) must be proportional to prior times likelihood. Both log-prior and log-likelihood are easy to compute (or to estimate unbiasedly with a minibatch) and can serve to form the GFlowNet reward.