Abstract
Energy-based models (EBMs) estimate unnormalized densities in an elegant framework, but they are generally difficult to train. Recent work has linked EBMs to generative adversarial networks by noting that they can be trained through a minimax game using a variational lower bound. To avoid the instabilities caused by minimizing a lower bound, we propose to instead work with bidirectional bounds, meaning that we maximize a lower bound and minimize an upper bound when training the EBM. We investigate four different bounds on the log-likelihood, derived from different perspectives. We derive lower bounds based on the singular values of the generator Jacobian and on mutual information. To upper bound the negative log-likelihood, we consider a gradient-penalty-like bound as well as one based on diffusion processes. In all cases, we provide algorithms for evaluating the bounds. We compare the different bounds to investigate the pros and cons of the different approaches. Finally, we demonstrate that the use of bidirectional bounds stabilizes EBM training and yields high-quality density estimation and sample generation.
Data Availability
All data analyzed during this study are publicly accessible.
Notes
Refer to Sec. A for the rationale behind the design of \(p_g\).
Note that Eq. (10) is only the density on the spanned manifold, and that the density is zero off the manifold.
References
Abbasnejad, M Ehsan., Shi, Qinfeng., Hengel, Anton van den., & Liu, Lingqiao. A generative adversarial density estimator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10782–10791, 2019.
Abbasnejad, M Ehsan., Shi, Javen., Hengel, Anton van den., & Liu, Lingqiao. Gade: A generative adversarial approach to density estimation and its applications. International Journal of Computer Vision, 128 (10):2731–2743, 2020.
Ackley, David H., Hinton, Geoffrey E., & Sejnowski, Terrence J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9(1), 147–169.
Alemi, Alexander A., Fischer, Ian., & Dillon, Joshua V. Uncertainty in the variational information bottleneck. In Uncertainty in Artificial Intelligence Workshop, 2018.
Arjovsky, Martin., Chintala, Soumith., & Bottou, Léon. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
Blei, David M., Kucukelbir, Alp., & McAuliffe, Jon D. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, (2017).
Burda, Yuri., Grosse, Roger., & Salakhutdinov, Ruslan. Accurate and conservative estimates of MRF log-likelihood using reverse annealing. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38, pages 102–110. PMLR, 09–12 May 2015.
Che, Tong, Zhang, Ruixiang, Sohl-Dickstein, Jascha, Larochelle, Hugo, Paull, Liam, Cao, Yuan, & Bengio, Yoshua. (2020). Your GAN is secretly an energy-based model and you should use discriminator driven latent sampling. In Advances in Neural Information Processing Systems, 33, 12275–12287.
Choi, Hyunsun., Jang, Eric., & Alemi, Alexander A. WAIC, but why? Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392, 2018.
Choi, Jaemoo., Choi, Jaewoong., & Kang, Myungjoo. Generative modeling through the semi-dual formulation of unbalanced optimal transport. Advances in Neural Information Processing Systems, 36, 2024.
Dai, Zihang., Almahairi, Amjad., Bachman, Philip., Hovy, Eduard., & Courville, Aaron. Calibrating energy-based generative adversarial networks. In International Conference on Learning Representations, 2017.
Dieng, Adji B., Ruiz, Francisco JR., Blei, David M., & Titsias, Michalis K Prescribed generative adversarial networks. arXiv preprint arXiv:1910.04302, 2019.
Dinh, Laurent., Krueger, David., & Bengio, Yoshua. NICE: Non-linear independent components estimation. In International Conference on Learning Representations Workshop, 2015.
Du, Yilun., & Mordatch, Igor. Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems, volume 32, 2019.
Frellsen, Jes., Winther, Ole., Ghahramani, Zoubin., & Ferkinghoff-Borg, Jesper. Bayesian generalised ensemble Markov chain Monte Carlo. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51, pages 408–416. PMLR, 09–11 May 2016.
Gao, Ruiqi., Song, Yang., Poole, Ben., Wu, Ying Nian., & Kingma, Diederik P. Learning energy-based models by diffusion recovery likelihood. International Conference on Learning Representations, 2021.
Geng, Cong, Wang, Jia, Gao, Zhiyong, Frellsen, Jes, & Hauberg, Søren. (2021). Bounds all around: training energy-based models with bidirectional bounds. Advances in Neural Information Processing Systems, 34, 19808–19821.
Geng, Cong., Han, Tian., Jiang, Peng-Tao., Zhang, Hao., Chen, Jinwei., Hauberg, Søren., & Li, Bo. Improving adversarial energy-based model via diffusion process. Proceedings of the 41st International Conference on Machine Learning, 2024.
Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Bing, Xu., Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, & Bengio, Yoshua. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139–144.
Grathwohl, Will Sussman., Kelly, Jacob Jin., Hashemi, Milad., Norouzi, Mohammad., Swersky, Kevin., & Duvenaud, David. No MCMC for me: Amortized sampling for fast and stable training of energy-based models. In International Conference on Learning Representations, 2021.
Grosse, Roger B., Maddison, Chris J., & Salakhutdinov, Russ R. Annealing between distributions by averaging moments. Advances in Neural Information Processing Systems, 26, 2013.
Gulrajani, Ishaan., Ahmed, Faruk., Arjovsky, Martin., Dumoulin, Vincent., & Courville, Aaron C. Improved training of Wasserstein GANs. Advances in Neural Information Processing Systems, 30, 2017.
Gutmann, Michael., & Hyvärinen, Aapo. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010.
Han, Tian., Nijkamp, Erik., Fang, Xiaolin., Hill, Mitch., Zhu, Song-Chun., & Wu, Ying Nian. Divergence triangle for joint training of generator model, energy-based model, and inferential model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8670–8679, 2019.
Han, Tian., Nijkamp, Erik., Zhou, Linqi., Pang, Bo., Zhu, Song-Chun., & Wu, Ying Nian. Joint training of variational auto-encoder and latent energy-based model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7978–7987, 2020.
Havtorn, Jakob D., Frellsen, Jes., Hauberg, Søren., & Maaløe, Lars. Hierarchical vaes know what they don’t know. In International Conference on Machine Learning, pages 4117–4128. PMLR, 2021.
Hendrycks, Dan., & Gimpel, Kevin. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017.
Hendrycks, Dan., Mazeika, Mantas., & Dietterich, Thomas. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations, 2019.
Hinton, Geoffrey E. (2002). Training products of experts by minimizing contrastive divergence. Neural computation, 14(8), 1771–1800.
Hinton, Geoffrey E., Sejnowski, Terrence J. Optimal perceptual inference. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, volume 448, 1983.
Hinton, Geoffrey E., Osindero, Simon., & Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
Hjelm, R Devon., Fedorov, Alex., Lavoie-Marchildon, Samuel., Grewal, Karan., Bachman, Phil., Trischler, Adam., & Bengio, Yoshua. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019.
Ho, Jonathan, Jain, Ajay, & Abbeel, Pieter. (2020). Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 33, 6840–6851.
Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982. ISSN 0027-8424.
Hutchinson, Michael F. (1989). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3), 1059–1076.
Hyvärinen, Aapo. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
Kan, Ge., Lü, Jinhu., Wang, Tian., Zhang, Baochang., Zhu, Aichun., Huang, Lei., Guo, Guodong., & Snoussi, Hichem. Bi-level doubly variational learning for energy-based latent variable models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18460–18469, 2022.
Kang, Minguk, & Park, Jaesik. (2020). ContraGAN: Contrastive learning for conditional image generation. In Advances in Neural Information Processing Systems, 33, 21357–21369.
Kim, Dongjun., Shin, Seungjae., Song, Kyungwoo., Kang, Wanmo., & Moon, Il-Chul. Soft truncation: A universal training technique of score-based diffusion model for high precision score estimation. In Proceedings of the 38th International Conference on Machine Learning, 2021.
Kim, Taesup., & Bengio, Yoshua. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
Knyazev, Andrew V. (2001). Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM journal on scientific computing, 23(2), 517–541.
Krizhevsky, Alex., & Hinton, Geoffrey. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Kumar, Abhishek., Poole, Ben., & Murphy, Kevin. Regularized autoencoders via relaxed injective probability flow. In International Conference on Artificial Intelligence and Statistics, pages 4292–4301. PMLR, 2020.
Kumar, Rithesh., Ozair, Sherjil., Goyal, Anirudh., Courville, Aaron., & Bengio, Yoshua. Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508, 2019.
LeCun, Yann., Chopra, Sumit., Hadsell, Raia., Ranzato, M., & Huang, F. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.
Lee, Hankook., Jeong, Jongheon., Park, Sejun., & Shin, Jinwoo. Guiding energy-based models via contrastive latent variables. In International Conference on Learning Representations, 2023.
Li, Yingzhen., & Turner, Richard E. Gradient estimators for implicit models. In International Conference on Learning Representations, 2018.
Liu, Ziwei., Luo, Ping., Wang, Xiaogang., & Tang, Xiaoou. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
Luhman, Eric., & Luhman, Troy. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021.
Metropolis, Nicholas., Rosenbluth, Arianna W., Rosenbluth, Marshall N., Teller, Augusta H., & Teller, Edward. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
Miyato, Takeru., Kataoka, Toshiki., Koyama, Masanori., & Yoshida, Yuichi. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
Neal, Radford M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
Nijkamp, Erik., Hill, Mitch., Zhu, Song-Chun., & Wu, Ying Nian. Learning non-convergent non-persistent short-run MCMC toward energy-based model. Advances in Neural Information Processing Systems, 32, 2019.
Nijkamp, Erik., Hill, Mitch., Han, Tian., Zhu, Song-Chun., & Wu, Ying Nian. On the anatomy of MCMC-based maximum likelihood learning of energy-based models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5272–5280, 2020.
Osogami, Takayuki. Boltzmann machines and energy-based models. arXiv preprint arXiv:1708.06008, 2017.
Paszke, Adam., Gross, Sam., Chintala, Soumith., Chanan, Gregory., Yang, Edward., DeVito, Zachary., Lin, Zeming., Desmaison, Alban., Antiga, Luca., & Lerer, Adam. Automatic differentiation in pytorch. NIPS 2017 Workshop Autodiff, 2017.
Radford, Alec., Metz, Luke., & Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations, 2016.
Ren, Jie., Liu, Peter J., Fertig, Emily., Snoek, Jasper., Poplin, Ryan, Depristo, Mark., Dillon, Joshua., & Lakshminarayanan, Balaji. Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems, volume 32, 2019.
Sajjadi, Mehdi SM., Bachem, Olivier., Lucic, Mario., Bousquet, Olivier., & Gelly, Sylvain. Assessing generative models via precision and recall. Advances in neural information processing systems, 31, 2018.
Salakhutdinov, Ruslan., & Hinton, Geoffrey. Deep Boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5, pages 448–455. PMLR, 16–18 Apr 2009.
Salakhutdinov, Ruslan., & Murray, Iain. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pages 872–879, 2008.
Salimans, Tim., Karpathy, Andrej., Chen, Xi., & Kingma, Diederik P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations, 2017.
Scellier, Benjamin. A deep learning theory for neural networks grounded in physics. PhD thesis, Université de Montréal, Quebec, Canada, 2020.
Shi, Jiaxin., Sun, Shengyang., Zhu, Jun. A spectral approach to gradient estimation for implicit distributions. In International Conference on Machine Learning, pages 4644–4653. PMLR, 2018.
Smolensky, P. (1986). Information Processing in Dynamical Systems: Foundations of Harmony Theory, page 194–281. Cambridge, MA, USA: MIT Press.
Song, Jiaming., Meng, Chenlin., & Ermon, Stefano. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a.
Song, Yang., & Ermon, Stefano. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
Song, Yang, & Ermon, Stefano. (2020). Improved techniques for training score-based generative models. Advances in neural information processing systems, 33, 12438–12448.
Song, Yang, Durkan, Conor, Murray, Iain, & Ermon, Stefano. (2021). Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34, 1415–1428.
Song, Yang., Sohl-Dickstein, Jascha., Kingma, Diederik P., Kumar, Abhishek., Ermon, Stefano., & Poole, Ben. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021c.
Song, Yang., Dhariwal, Prafulla., Chen, Mark., & Sutskever, Ilya. Consistency models. Proceedings of the 40th International Conference on Machine Learning, 2023.
Särkkä, Simo, & Solin, Arno. (2019). Applied Stochastic Differential Equations. Cambridge, UK: Cambridge University Press.
Thanh-Tung, Hoang., Tran, Truyen., & Venkatesh, Svetha. Improving generalization and stability of generative adversarial networks. In International Conference on Learning Representations, 2019.
Vincent, Pascal. (2011). A connection between score matching and denoising autoencoders. Neural computation, 23(7), 1661–1674.
Wang, Binxu., & Ponce, Carlos R. The geometry of deep generative image models and its applications. In International Conference on Learning Representations, 2021.
Wu, Ying Nian., Xie, Jianwen., Lu, Yang., & Zhu, Song-Chun. Sparse and deep generalizations of the frame model. Annals of Mathematical Sciences and Applications, 3 (1):211–254, 2018.
Xiao, Zhisheng., Kreis, Karsten., & Vahdat, Arash. Tackling the generative learning trilemma with denoising diffusion gans. International Conference on Learning Representations, 2022.
Xie, Jianwen., Hu, Wenze., Zhu, Song-Chun., & Wu, Ying Nian. Learning sparse FRAME models for natural image patterns. International Journal of Computer Vision, 114 (2):91–112, 2015.
Xie, Jianwen., Lu, Yang., Zhu, Song-Chun., & Wu, Ying Nian. Inducing wavelets into random fields via generative boosting. Applied and Computational Harmonic Analysis, 41 (1):4–25, 2016.
Xie, Jianwen., Zhu, Song-Chun., & Wu, Ying Nian. Synthesizing dynamic patterns by spatial-temporal generative convnet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7093–7101, 2017.
Xie, Jianwen., Zheng, Zilong., Gao, Ruiqi., Wang, Wenguan., Zhu, Song-Chun., & Wu, Ying Nian. Learning descriptor networks for 3d shape synthesis and analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8629–8638, 2018a.
Xie, Jianwen., Lu, Yang., Gao, Ruiqi., & Wu, Ying Nian. Cooperative learning of energy-based model and latent variable model via MCMC teaching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018b.
Xie, Jianwen., Lu, Yang., Gao, Ruiqi., Zhu, Song-Chun., & Wu, Ying Nian. Cooperative training of descriptor and generator networks. IEEE transactions on pattern analysis and machine intelligence, 42(1):27–45, 2018c.
Xie, Jianwen., Zheng, Zilong., Fang, Xiaolin., Zhu, Song-Chun., & Wu, Ying Nian. Cooperative training of fast thinking initializer and slow thinking solver for conditional learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021a.
Xie, Jianwen., Zheng, Zilong., & Li, Ping. Learning energy-based model with variational auto-encoder as amortized sampler. In The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), volume 2, 2021b.
Zhai, Shuangfei., Cheng, Yu., Feris, Rogerio., & Zhang, Zhongfei. Generative adversarial networks as variational training of energy based models. arXiv preprint arXiv:1611.01799, 2016.
Zhu, Song Chun., & Mumford, David. Grade: Gibbs reaction and diffusion equations. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pages 847–854. IEEE, 1998.
Zhu, Song Chun., Wu, Yingnian., & Mumford, David. Filters, random fields and maximum entropy (FRAME): Towards a unified theory for texture modeling. International Journal of Computer Vision, 27 (2):107–126, 1998.
Zhu, Yaxuan., Xie, Jianwen., Wu, Yingnian., & Gao, Ruiqi. Learning energy-based models by cooperative diffusion recovery likelihood. In International Conference on Learning Representations, 2024.
Funding
This work was supported by the National Natural Science Foundation of China under Grant U24A20220 and partly funded by the Novo Nordisk Foundation through the Center for Basic Machine Learning Research in Life Science (NNF20OC0062606), STCSM under Grant 22DZ2229005, and 111 project BP0719010. It also received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (757360), the National Science Foundation of China under grant 61771305, and the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102). JF was supported in part by the Novo Nordisk Foundation (NNF20OC0065611) and the Independent Research Fund Denmark (9131-00082B). SH was supported in part by research grants (15334, 42062) from VILLUM FONDEN. The authors also acknowledge the support of the Pioneer Centre for AI, DNRF grant number P1.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Competing interests
The authors have no relevant financial or non-financial interests to disclose. The authors have no competing interests to declare that are relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript. The authors have no financial or proprietary interests in any material discussed in this article.
Additional information
Communicated by Zhouchen Lin.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Model architecture
Our key motivation for the design of \(p_g\) is to ensure that the model has support over the entire data space. This is needed because the KL divergence requires the two distributions to be defined on the same measurable space. Since our energy distribution is defined over the entire data space, we need the generated distribution to have the same support. The simplest way to achieve this is with our choice \(x=G(z)+\epsilon \), where \(\epsilon \) is a small-variance Gaussian variable. This concentrates the distribution around the manifold spanned by our generator while ensuring that the generated distribution has full support over the data space. Moreover, if we did not add noise to our generator, our mutual information lower bound would be infinite; the added noise makes the bound finite.
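As a concrete illustration, the sampling procedure for \(p_g\) can be sketched in a few lines of PyTorch; the function name and the noise level \(\sigma\) below are illustrative placeholders and not the exact values used in our experiments.

```python
import torch

def sample_pg(G, batch_size, z_dim, sigma=0.01):
    """Sample x = G(z) + eps with eps ~ N(0, sigma^2 I), so that the generated
    distribution concentrates around the generator manifold while retaining
    full support over the data space."""
    z = torch.randn(batch_size, z_dim)       # latent prior p(z) = N(0, I)
    x = G(z)                                 # point on the spanned manifold
    return x + sigma * torch.randn_like(x)   # add small-variance Gaussian noise
```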
We consistently assume that \(G: \mathbb {R}^d \rightarrow \mathbb {R}^D\) spans an immersed d-dimensional manifold in \(\mathbb {R}^D\) as this allows us to apply the change-of-variables formula to get a density for the generator. In order to ensure that this assumption is valid, we minimally restrict the architecture of the generator neural network.
An immersion requires the Jacobian of G to exist and have full rank. Existence can be ensured by using activation functions that are differentiable almost everywhere; any smooth activation trivially satisfies this, and even ReLU activations only lack a derivative at a single point. The resulting non-smooth region has measure zero provided the linear map inside the activation is not degenerate, in which case the change-of-variables technique still applies.
We are unaware of architectural tricks that guarantee a full-rank Jacobian. Still, a minimum requirement is that no hidden layer has dimensionality lower than the d dimensions of the latent space; this is a natural requirement that is practically always satisfied. Moreover, our model maximizes the generator entropy, which in practice encourages full-rank Jacobians, since a degenerate Jacobian reduces the entropy. We have generally not experienced degenerate Jacobians during training.
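For small generators, the full-rank assumption can be checked directly by forming the Jacobian and inspecting its smallest singular value. The brute-force sketch below is illustrative only (for large networks an iterative eigensolver such as LOBPCG would be preferable) and assumes G maps a single latent vector to an output tensor.

```python
import torch
from torch.autograd.functional import jacobian

def smallest_singular_value(G, z):
    """Jacobian of the generator at a single latent point z and its smallest
    singular value; a value bounded away from zero indicates that the
    full-rank (immersion) assumption holds at z."""
    J = jacobian(G, z)               # shape: output shape x input shape
    J = J.reshape(-1, z.numel())     # flatten to a (D, d) matrix
    return torch.linalg.svdvals(J).min()
```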
Appendix B Training details
Table 9 gives the hyper-parameters used during training. In Eq. 33 we set \(p=2\). We normalize the data to \([-1, 1]\) and do not use dequantization. The only augmentation during training is random horizontal flipping. We used the official train-test split from PyTorch for CIFAR-10 and an 85:15 train-test split for ANIMEFACE. For our gradient penalty upper bound, we set \(\frac{M}{s_1^2}=\frac{0.1}{z_{dim}}\), where \(s_1\) is the smallest singular value of the Jacobian and \(z_{dim}\) is the latent dimension; we find this setting generally works well across datasets. Furthermore, we let \(\frac{M}{s_1^2}\) decay dynamically, i.e., \(\frac{M}{s_1^2}=\frac{a(b+i)^{-\gamma }}{2}\), where i denotes the iteration, \(\gamma = 0.55\), and a and b are chosen such that \(\frac{2M}{s_1^2}\) decreases from 0.01 to 0.0001 during training. These dynamics are consistent with the Taylor-series approximation of \(\textrm{KL}\left( p_g \Vert p_\theta \right) \). In practice, for \(\textrm{EBM}_{\textrm{SV}+\textrm{GP}}\), \(\frac{M}{s_1^2}\) must be adjusted together with the entropy regularizer \(\lambda \). We also observe that the network design affects the final performance; for example, not using batch normalization in the energy function can improve training stability on the ANIMEFACE dataset.
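For concreteness, a and b can be obtained by solving the two endpoint conditions \(a\,b^{-\gamma}=0.01\) and \(a\,(b+N)^{-\gamma}=0.0001\), where N is the total number of iterations. The sketch below illustrates one way to do this; the function names are ours and not part of the released code.

```python
def decay_coefficients(num_iters, start=0.01, end=0.0001, gamma=0.55):
    """Solve for a and b in 2M/s_1^2 = a * (b + i)^(-gamma), so that the
    quantity decays from `start` at i = 0 to `end` at i = num_iters."""
    ratio = (start / end) ** (1.0 / gamma)   # equals (b + num_iters) / b
    b = num_iters / (ratio - 1.0)
    a = start * b ** gamma
    return a, b

def penalty_weight(i, a, b, gamma=0.55):
    """M / s_1^2 at iteration i (half of the decayed quantity above)."""
    return 0.5 * a * (b + i) ** (-gamma)
```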
Appendix C Selection of bidirectional bound pairings
Given the two available options for both the lower and the upper bound, we provide recommendations for selecting bidirectional bound pairings based on our training experience. For real-world applications, as elaborated in the main text, we generally recommend using combinations \(\textrm{EBM}_{\textrm{SV}+\textrm{GP}}\) or \(\textrm{EBM}_{\textrm{MI}+\textrm{diff}}\) first due to their respective advantages in memory and time efficiency. If the data and latent space dimensions are relatively small and the network is shallow, we recommend using \(\textrm{EBM}_{\textrm{SV}+\textrm{GP}}\), as it tends to achieve better performance in such scenarios. Conversely, for larger dimensions or deeper networks, we suggest using \(\textrm{EBM}_{\textrm{MI}+\textrm{diff}}\) to reduce computational cost. Furthermore, \(\textrm{GP}\) and \(\textrm{SV}\) may suffer from training instability due to their reliance on iterative optimization. Therefore, in cases of unstable training, it is advisable to consider replacing \(\textrm{GP}\) or \(\textrm{SV}\) with \(\textrm{diff}\) or \(\textrm{MI}\).
Appendix D Additional results
1.1 D.1 CIFAR-10
Although our bidirectional EBM obtains superior performance among EBMs, it still lags far behind diffusion-based models in terms of generation metrics. To demonstrate the scalability and potential of our model, we also scale our network up to a UNet for the CIFAR-10 dataset. We adopt the same architecture as UOTM (Choi et al., 2024), whose discriminator is used as our energy function, while our generator has the same scale as the networks in NCSN and DDPM. We borrow some training strategies from UOTM to stabilize training and improve generative performance, such as the form of the energy discrepancy and the cost function. For the choice of bidirectional bounds, we only use \(\textrm{EBM}_{\textrm{MI}+\textrm{diff}}\), since the UNet is complex and its input has the same dimension as its output, making the Jacobian computation costly. We present our quantitative results in Table 10 and qualitative results in Fig. 10. All baseline results used for comparison are sourced from Xiao et al. (2022) and Song et al. (2023). For our method, we run the experiments with different seeds and report the standard deviation. For FID evaluation, we use the training set as a reference, in line with common practice for diffusion-based models (Xiao et al., 2022). From Table 10 we observe that \(\textrm{EBM}_{\textrm{MI}+\textrm{diff}}\) achieves results comparable to diffusion-based models while using only one-step generation, offering significantly greater efficiency. Although recent techniques such as distillation and consistency models also enable one-step generation with minimal performance decline, our model demonstrates competitive results compared with them. This experiment suggests that the superiority of diffusion-based models stems more from large network architectures and meticulous parameter fine-tuning than from the methodology itself: our bidirectional EBM can achieve performance comparable to recent advanced diffusion-based models when equipped with a similar-scale UNet and an appropriate training strategy. Moreover, as shown in Geng et al. (2024), the EBM can be further improved by integrating it with a diffusion process, which presents an opportunity for future work.
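For reference, once Inception features have been extracted for the generated samples and for the training set used as reference, the FID reduces to the Fréchet distance between two Gaussians fitted to those features. The sketch below shows only this final step (feature extraction is omitted) and is illustrative rather than the exact evaluation code used here.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}) between two Gaussians
    fitted to Inception features of generated and reference images."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # discard small imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```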
1.2 D.2 Imagenet
To further evaluate the efficacy of our model on complex datasets that better approximate real-world scenarios, we assess its generative performance on the \(32\times 32\) ImageNet dataset, which contains 1,281,167 training images and 50,000 test images. We perform experiments using both the DCGAN and UNet architectures, as for the CIFAR-10 dataset. For \(\textrm{EBM}_{\textrm{SV}+\textrm{GP}}\), we set the number of inner-loop iterations in \(\textrm{SV}\) to zero and find that this contributes to stabilizing the training process; a detailed exploration of this phenomenon is left for future work. Table 11 and Fig. 12 show our quantitative and qualitative results, respectively. For quantitative evaluation, we report only the FID score, as it is the most widely used metric for assessing generative models on this dataset. From Table 11 we see that \(\textrm{EBM}_{\textrm{MI}+\textrm{diff}}\)-Large achieves the best performance compared to other advanced generative models, including the diffusion-based model DDPM++ (Kim et al., 2021) and the recent energy-based model CDRL (Zhu et al., 2024). Even though \(\textrm{EBM}_{\textrm{SV}+\textrm{GP}}\) and \(\textrm{EBM}_{\textrm{MI}+\textrm{diff}}\) only use a simple DCGAN architecture, they still outperform the competitive EBM framework CLEL-Large, which employs a deep ResNet. This experiment further verifies the effectiveness of our bidirectional EBM on large, complex datasets.
1.3 D.3 LSUN Church
We also test our model on the large-scale LSUN dataset. For this dataset, we choose the Church outdoor category at a resolution of \(128\times 128\). We use 126,227 training images and 300 test images, following the official PyTorch split. The latent dimension is set to 256, and we employ the same ResNet architecture as for the ANIMEFACE and CelebA datasets. We trained \(\textrm{EBM}_{\textrm{SV}+\textrm{GP}}\) for 100,000 iterations and \(\textrm{EBM}_{\textrm{MI}+\textrm{diff}}\) for 200,000 iterations. Figure 11 illustrates the scalability of our models. While these samples are not state-of-the-art, they showcase the ability of our model to scale to high-dimensional problems. Further improving these results most likely requires picking the right network architecture, fine-tuning parameters, and specialized optimization strategies; we leave this for future work.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Geng, C., Wang, J., Chen, L. et al. Exploring Bidirectional Bounds for Minimax-Training of Energy-Based Models. Int J Comput Vis 133, 5898–5919 (2025). https://doi.org/10.1007/s11263-025-02460-0
Received:
Accepted:
Published:
Version of record:
Issue date:
DOI: https://doi.org/10.1007/s11263-025-02460-0