Statistical model used in machine learning
A flow-based generative model is a generative model used in machine learning that explicitly models a probability distribution by leveraging normalizing flows,[1][2][3] a statistical method that uses the change-of-variables law of probabilities to transform a simple distribution into a complex one.
The direct modeling of likelihood provides many advantages. For example, the negative log-likelihood can be directly computed and minimized as the loss function. Additionally, novel samples can be generated by sampling from the initial distribution, and applying the flow transformation.
In contrast, many alternative generative modeling methods, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), do not explicitly represent the likelihood function.
Scheme for normalizing flows
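Let $z_0$ be a (possibly multivariate) random variable with distribution $p_0(z_0)$.
For $i = 1, \dots, K$, let $z_i = f_i(z_{i-1})$ be a sequence of random variables transformed from $z_0$. The functions $f_1, \dots, f_K$ should be invertible, i.e. the inverse function $f_i^{-1}$ exists. The final output $z_K$ models the target distribution.
The log likelihood of $z_K$ is (see derivation):
$$\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{d f_i(z_{i-1})}{d z_{i-1}} \right|$$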
To efficiently compute the log likelihood, the functions $f_1, \dots, f_K$ should (1) be easy to invert and (2) have a Jacobian determinant that is easy to compute. In practice, the functions $f_1, \dots, f_K$ are modeled using deep neural networks and are trained to minimize the negative log-likelihood of data samples from the target distribution. These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations. Examples of such architectures include NICE,[4] RealNVP,[5] and Glow.[6]
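The composition rule can be illustrated with a minimal NumPy sketch. It assumes a standard normal base distribution and elementwise affine layers, chosen here only because their Jacobian is diagonal; the helper names `affine_layer` and `flow_log_likelihood` are hypothetical and not taken from any library.

```python
import numpy as np

def affine_layer(a, b):
    """Return (forward, log_abs_det) for the elementwise map z -> a * z + b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    def forward(z):
        return a * z + b

    def log_abs_det(z):
        # The Jacobian is diag(a), so log|det| = sum(log|a|), independent of z.
        return np.sum(np.log(np.abs(a)))

    return forward, log_abs_det

def standard_normal_logpdf(z):
    """Log density of N(0, I) at z (assumed base distribution p_0)."""
    z = np.asarray(z, dtype=float)
    return -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2.0 * np.pi)

def flow_log_likelihood(z0, layers):
    """Push z0 through all layers; return (z_K, log p_K(z_K))."""
    z = np.asarray(z0, dtype=float)
    log_p = standard_normal_logpdf(z)
    for forward, log_abs_det in layers:
        log_p -= log_abs_det(z)  # Jacobian term evaluated at z_{i-1}
        z = forward(z)           # z_i = f_i(z_{i-1})
    return z, log_p

layers = [
    affine_layer([2.0, 0.5], [1.0, -1.0]),
    affine_layer([3.0, 3.0], [0.0, 0.0]),
]
z_K, log_p_K = flow_log_likelihood([0.3, -0.2], layers)
print(z_K, log_p_K)
```

Because both layers are affine, $z_K$ is itself Gaussian here, which gives an independent check of the accumulated log-determinant terms.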
Derivation of log likelihood
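Consider $z_1$ and $z_0$. Note that $z_0 = f_1^{-1}(z_1)$.
By the change of variable formula, the distribution of $z_1$ is:
$$p_1(z_1) = p_0(z_0) \left| \det \frac{d f_1^{-1}(z_1)}{d z_1} \right|$$
where $\det \frac{d f_1^{-1}(z_1)}{d z_1}$ is the determinant of the Jacobian matrix of $f_1^{-1}$.
By the inverse function theorem:
$$p_1(z_1) = p_0(z_0) \left| \det \left( \frac{d f_1(z_0)}{d z_0} \right)^{-1} \right|$$
By the identity $\det(A^{-1}) = \det(A)^{-1}$ (where $A$ is an invertible matrix), we have:
$$p_1(z_1) = p_0(z_0) \left| \det \frac{d f_1(z_0)}{d z_0} \right|^{-1}$$
The log likelihood is thus:
$$\log p_1(z_1) = \log p_0(z_0) - \log \left| \det \frac{d f_1(z_0)}{d z_0} \right|$$
In general, the above applies to any $z_i$ and $z_{i-1}$. Since $\log p_i(z_i)$ is equal to $\log p_{i-1}(z_{i-1})$ minus a non-recursive term, we can infer by induction that:
$$\log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{d f_i(z_{i-1})}{d z_{i-1}} \right|$$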
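As a sanity check of the derivation, consider a one-dimensional worked example. The choices $p_0 = \mathcal{N}(0, 1)$ and $f_1(z) = 2z + 1$ are assumptions made here purely for illustration.

```latex
\begin{aligned}
z_0 &= f_1^{-1}(z_1) = \frac{z_1 - 1}{2}, \qquad \frac{d f_1(z_0)}{d z_0} = 2, \\
\log p_1(z_1) &= \log p_0\!\left(\frac{z_1 - 1}{2}\right) - \log |2|
  = -\frac{1}{2}\log(2\pi) - \frac{(z_1 - 1)^2}{8} - \log 2 .
\end{aligned}
```

This is the log-density of $\mathcal{N}(1, 2^2)$, which is what one expects for $2 z_0 + 1$ when $z_0 \sim \mathcal{N}(0, 1)$.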
As is generally done when training a deep learning model, the goal with normalizing flows is to minimize the Kullback–Leibler divergence between the model's likelihood and the target distribution to be estimated. Denoting by $p_\theta$ the model's likelihood and by $p^*$ the target distribution to learn, the (forward) KL-divergence is:
$$D_{\text{KL}}[p^*(x) \,\|\, p_\theta(x)] = -\mathbb{E}_{p^*(x)}[\log p_\theta(x)] + \mathbb{E}_{p^*(x)}[\log p^*(x)]$$
The second term on the right-hand side of the equation is the negative entropy of the target distribution and is independent of the parameter $\theta$ we want the model to learn, which only leaves the expectation of the negative log-likelihood to minimize under the target distribution. This intractable term can be approximated by a Monte Carlo estimate. Indeed, if we have a dataset $\{x_i\}_{i=1}^{N}$ of samples each independently drawn from the target distribution $p^*(x)$, then this term can be estimated as:
$$-\hat{\mathbb{E}}_{p^*(x)}[\log p_\theta(x)] = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i)$$
Therefore, the learning objective
$$\underset{\theta}{\arg\min}\; D_{\text{KL}}[p^*(x) \,\|\, p_\theta(x)]$$
is replaced by
$$\underset{\theta}{\arg\max}\; \sum_{i=1}^{N} \log p_\theta(x_i)$$
In other words, minimizing the Kullback–Leibler divergence between the model's likelihood and the target distribution is equivalent to maximizing the model likelihood under observed samples of the target distribution.[7]
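Pseudocode for training normalizing flows is as follows:[8]
- INPUT: dataset $x_{1:n}$, normalizing flow model $f_\theta(\cdot)$, base distribution $p_0$
- SOLVE: $\max_\theta \sum_j \ln p_\theta(x_j)$ by gradient descent
- RETURN: $\hat\theta$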
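A minimal runnable sketch of this training procedure is given below. It assumes the simplest possible flow, a one-dimensional affine map $x = \mu + e^{s} z$ with $z \sim \mathcal{N}(0, 1)$, and uses hand-derived gradients with plain gradient descent. The function names, learning rate and step count are illustrative choices, not part of the cited pseudocode.

```python
import numpy as np

def nll(params, x):
    """Mean negative log-likelihood of x under the flow x = mu + exp(s) * z."""
    mu, s = params
    z = (x - mu) * np.exp(-s)  # inverse flow
    # -log p(x) = -log N(z; 0, 1) + log|dx/dz| = 0.5 z^2 + 0.5 log(2 pi) + s
    return np.mean(0.5 * z ** 2 + 0.5 * np.log(2.0 * np.pi) + s)

def nll_grad(params, x):
    """Analytic gradient of nll with respect to (mu, s)."""
    mu, s = params
    z = (x - mu) * np.exp(-s)
    return np.array([np.mean(-z * np.exp(-s)), np.mean(1.0 - z ** 2)])

def train(x, lr=0.1, steps=2000):
    """Gradient descent on the negative log-likelihood."""
    params = np.zeros(2)
    for _ in range(steps):
        params -= lr * nll_grad(params, x)
    return params

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=5000)  # samples from the target
mu_hat, s_hat = train(data)

# New samples: draw from the base distribution and apply the flow.
samples = mu_hat + np.exp(s_hat) * rng.normal(size=1000)
print(mu_hat, np.exp(s_hat))
```

For this toy model the maximum-likelihood solution is known in closed form (the sample mean and standard deviation), so the result of the loop can be checked directly.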
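The earliest example is the planar flow.[9] Fix some activation function $h$, and let $\theta = (u, w, b)$ with the appropriate dimensions; then
$$x = f_\theta(z) = z + u\, h(\langle w, z \rangle + b)$$
The inverse $f_\theta^{-1}$ has no closed-form solution in general.
The Jacobian determinant is $|\det(I + h'(\langle w, z \rangle + b)\, u w^T)| = |1 + h'(\langle w, z \rangle + b) \langle u, w \rangle|$.
For the map to be invertible everywhere, this determinant must be nonzero everywhere. For example, $h = \tanh$ and $\langle u, w \rangle > -1$ satisfy the requirement.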
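The closed-form determinant can be checked numerically. The sketch below assumes $h = \tanh$ and arbitrary illustrative parameter values with $\langle u, w \rangle > -1$; the finite-difference Jacobian is only there as a cross-check and would not be used in practice.

```python
import numpy as np

def planar_forward(z, u, w, b):
    """x = z + u * tanh(<w, z> + b)."""
    return z + u * np.tanh(w @ z + b)

def planar_log_abs_det(z, u, w, b):
    """log|1 + h'(<w, z> + b) <u, w>| with h = tanh, h' = 1 - tanh^2."""
    h_prime = 1.0 - np.tanh(w @ z + b) ** 2
    return np.log(np.abs(1.0 + h_prime * (u @ w)))

def numerical_jacobian(f, z, eps=1e-6):
    """Central-difference Jacobian of f at z (verification only)."""
    n = z.size
    jac = np.zeros((n, n))
    for j in range(n):
        dz = np.zeros(n)
        dz[j] = eps
        jac[:, j] = (f(z + dz) - f(z - dz)) / (2.0 * eps)
    return jac

# Illustrative parameters; <u, w> = 0.24 > -1, so the map is invertible.
u = np.array([0.5, -0.3, 0.2])
w = np.array([1.0, 0.4, -0.7])
b = 0.1
z = np.array([0.2, -1.0, 0.5])

closed_form = planar_log_abs_det(z, u, w, b)
jac = numerical_jacobian(lambda v: planar_forward(v, u, w, b), z)
brute_force = np.linalg.slogdet(jac)[1]  # log|det| of the full n x n Jacobian
print(closed_form, brute_force)
```

The closed form costs $O(n)$, whereas the generic determinant costs $O(n^3)$, which is the practical reason for choosing this structure.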
Nonlinear Independent Components Estimation (NICE)
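Let $x, z \in \mathbb{R}^{2n}$ be even-dimensional, and split them in the middle.[4] Then the normalizing flow functions are
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} + \begin{bmatrix} 0 \\ m_\theta(z_1) \end{bmatrix}$$
where $m_\theta$ is any neural network with weights $\theta$.
The inverse $f_\theta^{-1}$ is just $z_1 = x_1,\ z_2 = x_2 - m_\theta(x_1)$, and the Jacobian determinant is just 1; that is, the flow is volume-preserving.
When $n = 1$, this is seen as a curvy shearing along the $x_2$ direction.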
Real Non-Volume Preserving (Real NVP)
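The Real Non-Volume Preserving model generalizes the NICE model by:[5]
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ e^{s_\theta(z_1)} \odot z_2 \end{bmatrix} + \begin{bmatrix} 0 \\ m_\theta(z_1) \end{bmatrix}$$
Its inverse is $z_1 = x_1,\ z_2 = e^{-s_\theta(x_1)} \odot (x_2 - m_\theta(x_1))$, and its Jacobian determinant is $\prod_{i=1}^{n} e^{s_\theta(z_1)_i}$. The NICE model is recovered by setting $s_\theta = 0$. Since the Real NVP map keeps the first and second halves of the vector separate, it is usually necessary to add a permutation after every Real NVP layer.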
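A coupling layer of this kind can be sketched in a few lines of NumPy. The functions `s_theta` and `m_theta` below are hypothetical stand-ins for the neural networks (single random layers, chosen only so that the example runs); passing a zero scale function gives the additive, volume-preserving NICE layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3  # each half of the 2n-dimensional vector has n entries

# Hypothetical stand-ins for the networks s_theta and m_theta.
W_s = rng.normal(size=(n, n))
W_m = rng.normal(size=(n, n))

def s_theta(z1):
    return np.tanh(W_s @ z1)

def m_theta(z1):
    return W_m @ np.tanh(z1)

def coupling_forward(z, s=s_theta, m=m_theta):
    """z -> x; also returns log|det Jacobian| = sum_i s(z_1)_i."""
    z1, z2 = z[:n], z[n:]
    x2 = np.exp(s(z1)) * z2 + m(z1)
    return np.concatenate([z1, x2]), np.sum(s(z1))

def coupling_inverse(x, s=s_theta, m=m_theta):
    """x -> z; needs only forward evaluations of s and m."""
    x1, x2 = x[:n], x[n:]
    z2 = np.exp(-s(x1)) * (x2 - m(x1))
    return np.concatenate([x1, z2])

z = rng.normal(size=2 * n)
x, log_det = coupling_forward(z)
z_rec = coupling_inverse(x)

# NICE as the special case s_theta = 0: the log-determinant vanishes.
x_nice, log_det_nice = coupling_forward(z, s=lambda z1: np.zeros(n))
print(np.allclose(z_rec, z), log_det, log_det_nice)
```

Note that neither direction requires inverting `s_theta` or `m_theta`, so these networks can be arbitrarily complex.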
Generative Flow (Glow)
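In the generative flow (Glow) model,[6] each layer has 3 parts:
- a channel-wise affine transform (activation normalization) $y_{cij} = s_c (x_{cij} + b_c)$, with Jacobian determinant $\prod_c s_c^{HW}$, where $H$ and $W$ are the height and width of the feature map;
- an invertible 1x1 convolution $z_{cij} = \sum_{c'} K_{cc'} y_{c'ij}$, with Jacobian determinant $\det(K)^{HW}$, where $K$ is any invertible matrix;
- a Real NVP (affine coupling) layer, with Jacobian determinant as described for Real NVP.
The idea of using the invertible 1x1 convolution is to permute all channels in general, instead of merely permuting the first and second half, as in Real NVP.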
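The invertible 1x1 convolution can be sketched as follows. A 1x1 convolution applies the same $C \times C$ matrix $K$ to the channel vector at every pixel, so for a feature map of height $H$ and width $W$ its log-determinant is $HW \log|\det K|$. The array layout `(C, H, W)` and the function names are assumptions of this sketch; a channel permutation appears as the special case where $K$ is a permutation matrix.

```python
import numpy as np

def conv1x1_forward(y, K):
    """Apply z[c, i, j] = sum_d K[c, d] * y[d, i, j]; return (z, log|det|)."""
    _, height, width = y.shape
    z = np.einsum("cd,dij->cij", K, y)
    log_det = height * width * np.linalg.slogdet(K)[1]
    return z, log_det

def conv1x1_inverse(z, K):
    """Undo the convolution by applying K^{-1} at every pixel."""
    return np.einsum("cd,dij->cij", np.linalg.inv(K), z)

rng = np.random.default_rng(0)
C, H, W = 3, 2, 2
y = rng.normal(size=(C, H, W))
K = rng.normal(size=(C, C)) + 3.0 * np.eye(C)  # illustrative, well-conditioned

z, log_det = conv1x1_forward(y, K)
y_rec = conv1x1_inverse(z, K)

# A fixed channel permutation is the special case of a permutation matrix.
perm = [2, 0, 1]
P = np.eye(C)[perm]
z_perm, log_det_perm = conv1x1_forward(y, P)
print(np.allclose(y_rec, y), log_det, log_det_perm)
```

Learning a general $K$ lets the model mix channels freely, at the cost of one $C \times C$ determinant per layer.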
Masked autoregressive flow (MAF)
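An autoregressive model of a distribution on $\mathbb{R}^n$ is defined as the following stochastic process:[10]
$$\begin{aligned} x_1 &\sim \mathcal{N}(\mu_1, \sigma_1^2) \\ x_2 &\sim \mathcal{N}(\mu_2(x_1), \sigma_2(x_1)^2) \\ &\;\;\vdots \\ x_n &\sim \mathcal{N}(\mu_n(x_{1:n-1}), \sigma_n(x_{1:n-1})^2) \end{aligned}$$
where $\mu_i : \mathbb{R}^{i-1} \to \mathbb{R}$ and $\sigma_i : \mathbb{R}^{i-1} \to (0, \infty)$ are fixed functions that define the autoregressive model.
By the reparameterization trick, the autoregressive model is generalized to a normalizing flow:
$$\begin{aligned} x_1 &= \mu_1 + \sigma_1 z_1 \\ x_2 &= \mu_2(x_1) + \sigma_2(x_1) z_2 \\ &\;\;\vdots \\ x_n &= \mu_n(x_{1:n-1}) + \sigma_n(x_{1:n-1}) z_n \end{aligned}$$
The autoregressive model is recovered by setting $z \sim \mathcal{N}(0, I_n)$.
The forward mapping is slow (because it's sequential), but the backward mapping is fast (because it's parallel).
The Jacobian matrix is lower-triangular, so the Jacobian determinant is $\sigma_1 \sigma_2(x_1) \cdots \sigma_n(x_{1:n-1})$.
Reversing the two maps $f_\theta$ and $f_\theta^{-1}$ of MAF results in Inverse Autoregressive Flow (IAF), which has fast forward mapping and slow backward mapping.[11]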
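The asymmetry between the two directions can be seen in a small sketch. The conditioners below are hypothetical: strictly lower-triangular linear maps stand in for the masked networks, so that the $i$-th mean and scale depend only on the first $i-1$ coordinates of $x$. The forward map has to loop over coordinates, whereas the inverse is a single vectorized expression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Hypothetical conditioners: strictly lower-triangular weights ensure that
# entry i of mu(x) and sigma(x) depends only on x[0], ..., x[i-1].
A_mu = np.tril(rng.normal(size=(n, n)), k=-1)
A_s = np.tril(rng.normal(size=(n, n)), k=-1)

def mu(x):
    return A_mu @ x

def sigma(x):
    return np.exp(np.tanh(A_s @ x))  # strictly positive scales

def maf_forward(z):
    """z -> x, one coordinate at a time (sequential, hence slow)."""
    x = np.zeros(n)
    for i in range(n):
        # Only x[:i] influences entry i, and those entries are already final.
        x[i] = mu(x)[i] + sigma(x)[i] * z[i]
    return x

def maf_inverse(x):
    """x -> z in one vectorized step; also returns log|det dx/dz|."""
    scale = sigma(x)
    return (x - mu(x)) / scale, np.sum(np.log(scale))

z = rng.normal(size=n)
x = maf_forward(z)
z_rec, log_det = maf_inverse(x)
print(np.allclose(z_rec, z), log_det)
```

Density evaluation uses only `maf_inverse`, which is why MAF suits density estimation, while IAF, with the roles swapped, suits fast sampling.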
Continuous Normalizing Flow (CNF)
Instead of constructing a flow by function composition, another approach is to formulate the flow as continuous-time dynamics.[12][13] Let $z_0$ be the latent variable with distribution $p(z_0)$. Map this latent variable to data space with the following flow function:
$$x = F(z_0) = z_T = z_0 + \int_0^T f(z_t, t)\, dt$$
where $f$ is an arbitrary function and can be modeled with e.g. neural networks.
The inverse function is then naturally:[12]
$$z_0 = F^{-1}(x) = z_T + \int_T^0 f(z_t, t)\, dt = z_T - \int_0^T f(z_t, t)\, dt$$
And the log-likelihood of $x$ can be found as:[12]
$$\log p(x) = \log p(z_0) - \int_0^T \operatorname{Tr}\left[ \frac{\partial f}{\partial z_t} \right] dt$$
Since the trace depends only on the diagonal of the Jacobian $\partial_{z_t} f$, this allows a "free-form" Jacobian.[14] Here, "free-form" means that there is no restriction on the Jacobian's form. This contrasts with earlier discrete normalizing flow models, where the Jacobian is carefully designed to be upper- or lower-triangular so that its determinant can be evaluated efficiently.
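The trace can be estimated by "Hutchinson's trick":[Finlay et al., pp. 3154–3164][15]
Given any matrix $W \in \mathbb{R}^{n \times n}$ and any random $u \in \mathbb{R}^n$ with $\mathbb{E}[u u^T] = I$, we have $\mathbb{E}[u^T W u] = \operatorname{tr}(W)$. (Proof: expand the expectation directly.)
Usually, the random vector is sampled from $\mathcal{N}(0, I)$ (normal distribution) or uniformly from $\{-1, +1\}^n$ (Rademacher distribution).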
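The estimator can be sketched as below. The function `hutchinson_trace` is a hypothetical helper that only needs matrix-vector products, which is the point of the trick: the matrix itself never has to be formed. The matrix size and sample count are arbitrary illustrative choices.

```python
import numpy as np

def hutchinson_trace(matvec, n, num_samples, rng, dist="rademacher"):
    """Estimate tr(W) as the average of u^T (W u) over random probes u."""
    total = 0.0
    for _ in range(num_samples):
        if dist == "rademacher":
            u = rng.choice([-1.0, 1.0], size=n)  # independent random signs
        else:
            u = rng.normal(size=n)               # standard normal probe
        total += u @ matvec(u)
    return total / num_samples

rng = np.random.default_rng(0)
n = 10
W = rng.normal(size=(n, n))

estimate = hutchinson_trace(lambda u: W @ u, n, 10000, rng)
exact = np.trace(W)
print(estimate, exact)
```

With Rademacher probes the diagonal contributes no variance, since $u_i^2 = 1$ exactly; only the off-diagonal entries of $W$ make the estimate noisy.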
When $f$ is implemented as a neural network, neural ODE methods[16] are needed. Indeed, CNF was first proposed in the same paper that proposed neural ODEs.
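The flow, its inverse and the log-likelihood integral can be sketched with a fixed-step Runge-Kutta integrator. This is a toy illustration under stated assumptions: the velocity field `f` is a small hand-written function rather than a trained network, its Jacobian trace is computed analytically, and a real implementation would typically use an adaptive ODE solver with automatic differentiation.

```python
import numpy as np

W = np.array([[0.5, -1.0], [1.0, 0.3]])  # illustrative fixed weights

def f(z, t):
    """Hypothetical velocity field; in practice a neural network."""
    return np.tanh(W @ z)

def trace_df_dz(z, t):
    """Exact trace of df/dz: sum_i (1 - tanh^2((W z)_i)) * W[i, i]."""
    return np.sum((1.0 - np.tanh(W @ z) ** 2) * np.diag(W))

def rk4_step(g, y, t, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = g(y, t)
    k2 = g(y + 0.5 * h * k1, t + 0.5 * h)
    k3 = g(y + 0.5 * h * k2, t + 0.5 * h)
    k4 = g(y + h * k3, t + h)
    return y + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def integrate(z_start, t0, t1, steps=200):
    """Integrate z_t and the running trace integral from t0 to t1."""
    def augmented(y, t):
        z = y[:-1]
        return np.append(f(z, t), trace_df_dz(z, t))

    y = np.append(np.asarray(z_start, dtype=float), 0.0)
    h = (t1 - t0) / steps  # negative when integrating backward in time
    for k in range(steps):
        y = rk4_step(augmented, y, t0 + k * h, h)
    return y[:-1], y[-1]

T = 1.0
z0 = np.array([0.4, -0.7])
x, trace_integral = integrate(z0, 0.0, T)   # x = F(z0)
log_p_z0 = -0.5 * np.sum(z0 ** 2) - np.log(2.0 * np.pi)  # N(0, I) in 2-D
log_p_x = log_p_z0 - trace_integral
z0_rec, _ = integrate(x, T, 0.0)            # z0 = F^{-1}(x)
print(x, log_p_x, np.allclose(z0_rec, z0))
```

Integrating the same field backward in time recovers the latent point, so no separate inverse network is needed.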
CNF has two main deficiencies. First, a continuous flow must be a homeomorphism, and thus preserves orientation and ambient isotopy (for example, it's impossible to flip a left hand into a right hand by continuously deforming space, and it's impossible to turn a sphere inside out or undo a knot). Second, the learned flow might be ill-behaved due to degeneracy (that is, there are infinitely many possible $f$ that all solve the same problem).
By adding extra dimensions, the CNF gains enough freedom to reverse orientation and go beyond ambient isotopy (just as one can pick up a polygon from a desk and flip it around in 3-space, or unknot a knot in 4-space), yielding the "augmented neural ODE".[17]
Any homeomorphism of $\mathbb{R}^n$ can be approximated by a neural ODE operating on $\mathbb{R}^{2n+1}$, as proved by combining the Whitney embedding theorem for manifolds and the universal approximation theorem for neural networks.[18]
To regularize the flow $f$, one can impose regularization losses. The paper by Finlay et al. (pp. 3154–3164) proposed the following regularization loss based on optimal transport theory:
$$\lambda_K \int_0^T \| f(z_t, t) \|^2\, dt + \lambda_J \int_0^T \| \nabla_z f(z_t, t) \|_F^2\, dt$$
where $\lambda_K, \lambda_J > 0$ are hyperparameters. The first term punishes the model for oscillating the flow field over time, and the second term punishes it for oscillating the flow field over space. Both terms together guide the model into a flow that is smooth (not "bumpy") over space and time.
Despite the success of normalizing flows in estimating high-dimensional densities, some downsides still exist in their designs. First, the latent space onto which input data is projected is not lower-dimensional than the data space; flow-based models therefore do not compress data by default and require a lot of computation. However, it is still possible to perform image compression with them.[19]
Flow-based models are also notorious for failing to estimate the likelihood of out-of-distribution samples (i.e. samples that were not drawn from the same distribution as the training set).[20] Several hypotheses have been formulated to explain this phenomenon, including the typical set hypothesis,[21] estimation issues when training models,[22] and fundamental issues due to the entropy of the data distributions.[23]
One of the most interesting properties of normalizing flows is the invertibility of their learned bijective map. This property is given by constraints in the design of the models (cf. RealNVP, Glow) which guarantee theoretical invertibility. The integrity of the inverse is important to ensure the applicability of the change-of-variable theorem, the computation of the Jacobian of the map, and sampling with the model. However, in practice this invertibility can be violated and the inverse map can explode because of numerical imprecision.[24]
Flow-based generative models have been applied to a variety of modeling tasks, including:
- Audio generation[25]
- Image generation[6]
- Molecular graph generation[26]
- Point-cloud modeling[27]
- Video generation[28]
- Lossy image compression[19]
- Anomaly detection[29]
Footnotes
1. Tabak, Esteban G.; Vanden-Eijnden, Eric (2010). "Density estimation by dual ascent of the log-likelihood". Communications in Mathematical Sciences. 8 (1): 217–233. doi:10.4310/CMS.2010.v8.n1.a11.
2. Tabak, Esteban G.; Turner, Cristina V. (2012). "A family of nonparametric density estimation algorithms". Communications on Pure and Applied Mathematics. 66 (2): 145–164. doi:10.1002/cpa.21423. hdl:11336/8930. S2CID 17820269.
3. Papamakarios, George; Nalisnick, Eric; Jimenez Rezende, Danilo; Mohamed, Shakir; Lakshminarayanan, Balaji (2021). "Normalizing flows for probabilistic modeling and inference". Journal of Machine Learning Research. 22 (1): 2617–2680. arXiv:1912.02762.
4. Dinh, Laurent; Krueger, David; Bengio, Yoshua (2014). "NICE: Non-linear Independent Components Estimation". arXiv:1410.8516 [cs.LG].
5. Dinh, Laurent; Sohl-Dickstein, Jascha; Bengio, Samy (2016). "Density estimation using Real NVP". arXiv:1605.08803 [cs.LG].
6. Kingma, Diederik P.; Dhariwal, Prafulla (2018). "Glow: Generative Flow with Invertible 1x1 Convolutions". arXiv:1807.03039 [stat.ML].
7. Papamakarios, George; Nalisnick, Eric; Rezende, Danilo Jimenez; Mohamed, Shakir; Lakshminarayanan, Balaji (March 2021). "Normalizing Flows for Probabilistic Modeling and Inference". Journal of Machine Learning Research. 22 (57): 1–64. arXiv:1912.02762.
8. Kobyzev, Ivan; Prince, Simon J.D.; Brubaker, Marcus A. (November 2021). "Normalizing Flows: An Introduction and Review of Current Methods". IEEE Transactions on Pattern Analysis and Machine Intelligence. 43 (11): 3964–3979. arXiv:1908.09257. doi:10.1109/TPAMI.2020.2992934. ISSN 1939-3539. PMID 32396070. S2CID 208910764.
9. Rezende, Danilo Jimenez; Mohamed, Shakir (2015). "Variational Inference with Normalizing Flows". arXiv:1505.05770 [stat.ML].
10. Papamakarios, George; Pavlakou, Theo; Murray, Iain (2017). "Masked Autoregressive Flow for Density Estimation". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc. arXiv:1705.07057.
11. Kingma, Durk P; Salimans, Tim; Jozefowicz, Rafal; Chen, Xi; Sutskever, Ilya; Welling, Max (2016). "Improved Variational Inference with Inverse Autoregressive Flow". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc. arXiv:1606.04934.
12. Grathwohl, Will; Chen, Ricky T. Q.; Bettencourt, Jesse; Sutskever, Ilya; Duvenaud, David (2018). "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models". arXiv:1810.01367 [cs.LG].
13. Lipman, Yaron; Chen, Ricky T. Q.; Ben-Hamu, Heli; Nickel, Maximilian; Le, Matt (2022-10-01). "Flow Matching for Generative Modeling". arXiv:2210.02747 [cs.LG].
14. Grathwohl, Will; Chen, Ricky T. Q.; Bettencourt, Jesse; Sutskever, Ilya; Duvenaud, David (2018-10-22). "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models". arXiv:1810.01367 [cs.LG].
15. Hutchinson, M.F. (January 1989). "A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines". Communications in Statistics - Simulation and Computation. 18 (3): 1059–1076. doi:10.1080/03610918908812806. ISSN 0361-0918.
16. Chen, Ricky T. Q.; Rubanova, Yulia; Bettencourt, Jesse; Duvenaud, David K. (2018). "Neural Ordinary Differential Equations" (PDF). In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; Garnett, R. (eds.). Advances in Neural Information Processing Systems. Vol. 31. Curran Associates, Inc. arXiv:1806.07366.
17. Dupont, Emilien; Doucet, Arnaud; Teh, Yee Whye (2019). "Augmented Neural ODEs". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc.
18. Zhang, Han; Gao, Xi; Unterman, Jacob; Arodz, Tom (2019-07-30). "Approximation Capabilities of Neural ODEs and Invertible Residual Networks". arXiv:1907.12998 [cs.LG].
19. Helminger, Leonhard; Djelouah, Abdelaziz; Gross, Markus; Schroers, Christopher (2020). "Lossy Image Compression with Normalizing Flows". arXiv:2008.10486 [cs.CV].
20. Nalisnick, Eric; Matsukawa, Akihiro; Teh, Yee Whye; Gorur, Dilan; Lakshminarayanan, Balaji (2018). "Do Deep Generative Models Know What They Don't Know?". arXiv:1810.09136v3 [stat.ML].
21. Nalisnick, Eric; Matsukawa, Akihiro; Teh, Yee Whye; Lakshminarayanan, Balaji (2019). "Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality". arXiv:1906.02994 [stat.ML].
22. Zhang, Lily; Goldstein, Mark; Ranganath, Rajesh (2021). "Understanding Failures in Out-of-Distribution Detection with Deep Generative Models". Proceedings of Machine Learning Research. 139: 12427–12436. PMC 9295254. PMID 35860036.
23. Caterini, Anthony L.; Loaiza-Ganem, Gabriel (2022). "Entropic Issues in Likelihood-Based OOD Detection". pp. 21–26. arXiv:2109.10794 [stat.ML].
24. Behrmann, Jens; Vicol, Paul; Wang, Kuan-Chieh; Grosse, Roger; Jacobsen, Jörn-Henrik (2020). "Understanding and Mitigating Exploding Inverses in Invertible Neural Networks". arXiv:2006.09347 [cs.LG].
25. Ping, Wei; Peng, Kainan; Zhao, Kexin; Song, Zhao (2019). "WaveFlow: A Compact Flow-based Model for Raw Audio". arXiv:1912.01219 [cs.SD].
26. Shi, Chence; Xu, Minkai; Zhu, Zhaocheng; Zhang, Weinan; Zhang, Ming; Tang, Jian (2020). "GraphAF: A Flow-based Autoregressive Model for Molecular Graph Generation". arXiv:2001.09382 [cs.LG].
27. Yang, Guandao; Huang, Xun; Hao, Zekun; Liu, Ming-Yu; Belongie, Serge; Hariharan, Bharath (2019). "PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows". arXiv:1906.12320 [cs.CV].
28. Kumar, Manoj; Babaeizadeh, Mohammad; Erhan, Dumitru; Finn, Chelsea; Levine, Sergey; Dinh, Laurent; Kingma, Durk (2019). "VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation". arXiv:1903.01434 [cs.CV].
29. Rudolph, Marco; Wandt, Bastian; Rosenhahn, Bodo (2021). "Same Same But DifferNet: Semi-Supervised Defect Detection with Normalizing Flows". arXiv:2008.12577 [cs.CV].