Tractable Generative Modelling of Cosmological Numerical Simulations
Amit Parag
1 a
and Vaishak Belle
1,2 b
1
The University of Edinburgh, U.K.
2
Alan Turing Institute, U.K.
Keywords:
Cosmological Simulations, Generative Models, Sum-Product Networks.
Abstract:
Cosmological simulations aim to understand the matter distribution in the universe by employing either semi-
analytic methods or hydrodynamical models of matter distribution. These simulations describe the evolution
of baryonic structures within dark matter potential wells, where dark matter is modeled as a self-gravitating,
collisionless system. Despite advances in reducing computational costs, these simulations still require millions
of CPU hours to achieve stable solutions. This raises the question: can generative models predict galaxy
properties from a partial history of their dynamical evolution? Tractable probabilistic models, such as sum-
product networks, enable efficient computation of conditional probabilities, allowing conditional marginals
to be computed in time linear to the model size. In this work, we investigate the application of sum-product
networks to compactly represent and learn distributions for predictions in concordance cosmology. Using
the Eagle suite of cosmological hydrodynamical simulations, we demonstrate that these graphical models can
effectively reproduce mock galaxy catalogs, capturing the relationship between baryonic and dark matter with
promising accuracy.
1 INTRODUCTION
Probabilistic models, like Bayesian and Markov net-
works, are key in statistical machine learning for ex-
pressing dependencies compactly. However, infer-
ence in such models is intractable, making learn-
ing difficult (Koller et al., 2009). Tractable learn-
ing, which enables efficient probabilistic querying,
addresses this issue, with early work focusing on low
tree-width models (Bach and Jordan, 2002) and later
efforts utilizing local structures (Chavira and Dar-
wiche, 2008), leading to arithmetic circuits (ACs) that
support exact inference in polynomial time.
Sum-product networks (SPNs) (Poon and Domin-
gos, 2011) are notable instances of ACs with a recur-
sive structure, where an SPN is a weighted sum of
products of SPNs, and its leaves represent tractable
distributions (e.g., univariate Bernoulli). SPNs are of-
ten seen as tractable deep architectures due to their
hierarchical nature. While deep learning models face
challenges in structure learning (Bengio et al., 2009),
SPNs inherently provide a reliable structure learning
framework. Although SPNs can be manually speci-
fied, weight learning and adherence to conditions like
a
https://orcid.org/0009-0001-6597-3976
b
https://orcid.org/0000-0001-5573-8465
completeness and decomposability make automated
structure learning preferable.
Since their introduction, various structure learning
methods for SPNs and related models have emerged,
such as LearnSPN (Gens and Domingos, 2013), OSL
(Hsu et al., 2017), and ID-SPN (Liang et al., 2017).
Extensions like RAT-SPNs (Peharz et al., 2020), deep
tractable models (Vergari et al., 2021), and hybrid
SPN frameworks (Rahman and Gogate, 2014) further
demonstrate their adaptability. Although related mod-
els may differ in properties and features (Liang et al.,
2017), SPNs remain attractive for generative model-
ing due to their simplicity and scalability.
In this work, we study how SPNs can be used to
model a novel and challenging problem related to the
evolution of galaxies, particularly in the context of
concordance cosmology with hydrodynamical simu-
lations. Since actual experiments in cosmology are
not feasible, simulations are crucial for testing cosmo-
logical theories, properties, and parameters. Numeri-
cal simulations are indispensable due to the inability
to derive analytic solutions for gravitationally inter-
acting particles, which form the basis of all cosmo-
logical simulations. Dark matter, modeled as particles
interacting solely through gravity, plays a pivotal role
in these simulations. When combined with baryonic
Parag, A. and Belle, V.
Tractable Generative Modelling of Cosmological Numerical Simulations.
DOI: 10.5220/0013229100003890
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025) - Volume 3, pages 901-908
ISBN: 978-989-758-737-5; ISSN: 2184-433X
Proceedings Copyright © 2025 by SCITEPRESS Science and Technology Publications, Lda.
901
physics, numerical simulations validate cosmological
models and provide a general picture of structure for-
mation in the universe.
Simulations attempt to describe the cosmological
structure by modeling galaxies inside dark matter ha-
los, which involves making unverifiable assumptions
(Somerville and Dav
´
e, 2015) and incurs high com-
putational costs, as shown by the millions of CPU
hours required by simulations like Eagle and Illus-
tris (Schaye et al., 2014; Llinares, 2017). Several
algorithms aim to reduce computation times while
improving accuracy (Llinares, 2017; Gheller et al.,
2015). Machine learning has been used in cosmol-
ogy for particle tracing and classification (Guest et al.,
2018), but its potential for enhancing cosmological
research is still underexplored. This work takes initial
steps toward using recent advances in tractable proba-
bilistic models for generative modeling in cosmology.
The focus of the paper is on applying these models
rather than extending existing algorithms. As demon-
strated, this involves understanding concordance cos-
mology and overcoming challenges in feature selec-
tion. Ultimately, we hope our work inspires further
interdisciplinary research, enabling machine learning
to tackle significant cosmological problems.
2 BACKGROUND
In this section, we proceed by first by giving a fairly
informal picture of the cosmological model, before
turning to the equations driving the computational
task. We then discuss works where machine learning
has been used to model structure formation.
2.1 Concordance Cosmology
The concordance cosmological model, the ΛCDM
model, is based on the Copernican principles of
isotropy and homogeneity (Ryden, 2016). This model
assumes that observers in any location in the universe
cannot be the central observers, implying that the uni-
verse is isotropic and homogeneous on the largest
scales. These properties, however, only hold on cos-
mological scales where the differences between bary-
onic features are smoothed over.
Under these assumptions, the current cosmolog-
ical model describes a universe with rotational and
translational symmetry, dominated by dark matter.
Dark energy comprises the majority of the universe’s
energy, while dark matter plays a secondary role.
Baryonic matter, forming structures like galaxies, is
a small fraction of the universe. The ΛCDM model
(Bertschinger, 1994) describes the curvature of space-
time using the Robertson-Walker metric, and the evo-
lution of the universe follows the Friedmann equa-
tions (M
¨
ortsell, 2016).
The concordance cosmology posits that the large-
scale structure of baryonic matter today originated
from seeds in dark matter halos. Galaxies form within
these halos, with their morphology largely determined
by the properties of the surrounding halo, merger his-
tory, and feedback effects. The presence of a dark
matter halo is essential for galaxy formation, with
massive subhalos at the center of halos containing
the central galaxies. Modeling the universe involves
painting a temporal picture of the cosmic web.
There are broadly two flavors of simulations that
model matter distribution in the universe (Dolag et al.,
2008): semi-analytic simulations and hydrodynami-
cal simulations. The predictive power of both of these
approaches is in agreement with actual observations.
We refer the reader to (Benson et al., 2001) for a
comparison between semi-analytic methods and hy-
drodynamical modelling. In particular, physical pro-
cesses critical to galaxy formation and evolution such
as core collapse supernovae, accretion shocks, stel-
lar winds, involve multiple sets of partial differen-
tial equations (Somerville and Dav
´
e, 2015) such that
modeling structure formation through either approach
becomes extremely difficult. The already intractable
complexity of this problem is further compounded by
the addition of approximations of physical phenom-
ena which cannot be derived ab initio.
2.2 Learning Structure Formation
Using machine learning algorithms to model structure
formation has inevitably resulted in varying degrees
of efficacy. Algorithms like k-nearest neighbors and
support vector machines used in (Xu et al., 2013) have
conclusively shown that machine learning galaxy-
halo relation is not unsuccessful. The work was fur-
ther extended in (Kamdar et al., 2016), (Agarwal
et al., 2018), and (Cavuoti et al., 2018) by including
other discriminative or ensemble algorithms like de-
cision trees and/or random forests. Recent advances
in deep learning have shown significant potential in
this domain. For instance, (Villaescusa-Navarro et al.,
2021) introduced the CAMELS project, leveraging
convolutional neural networks (CNNs) and varia-
tional autoencoders (VAEs) to predict galaxy proper-
ties. Similarly, (Lucie-Smith et al., 2023) employed
graph neural networks (GNNs) to model complex
galaxy-halo interactions. (Ho et al., 2022) explored
the application of foundation models in astrophysics,
focusing on tasks like large-scale structure model-
ing. Additionally, (Heitmann et al., 2021) emphasized
ICAART 2025 - 17th International Conference on Agents and Artificial Intelligence
902
the use of emulators for precision cosmology, and
(Kobayashi et al., 2022) introduced machine learn-
ing frameworks for predicting baryonic effects on the
matter power spectrum. Works like (Davies et al.,
2021) further demonstrate how simulations combined
with machine learning can enhance our understand-
ing of galaxy morphologies, while (Tamosiunas et al.,
2023) applied semi-supervised learning to improve
galaxy classification in limited data scenarios.
However, focusing on the algorithmic aspects of
the task is equally important since the choice of algo-
rithms usually involves some trade-offs between scal-
ability and accuracy, while certain algorithms like de-
cision trees are prone to overfitting. To our knowl-
edge, tractable graphical models have never been ap-
plied to this problem. Our contribution in this paper is
to apply a deep architecture with probabilistic seman-
tics, sum product networks (SPNs) (Poon and Domin-
gos, 2011), to estimate a generative model for the data
such that a mock catalog of galaxies can be built. The
added advantage of using SPNs is that they guaran-
tee that inference will always be in time linear in the
model size.
3 METHOD
Making machines learn to recognize halos and their
corresponding baryonic content broadly involves two
steps.
The first step is finding features which are good
representatives of a halo-galaxy system and indicate a
strong correlation between the potential well of host
dark matter halo and the galaxy inside it. This is usu-
ally followed up by providing a merger history of the
galaxy-halo system to the machine. The choice of the
depth of history to be provided is generally the pre-
rogative of the machine learning practitioner. How-
ever, this choice comes with a few caveats. Since
galaxy clusters generally formed in a very early uni-
verse, their merger histories usually cover billions of
years and involve thousands of progenitors. Provid-
ing a description of all the progenitors of any galaxy
is simply an impractical task. A good way to approach
this choice is by constraining the number of progen-
itors of a galaxy (subhalo) and providing their corre-
sponding properties only for a subset of the cosmic
time. This is done keeping in mind that even though a
subhalo may have thousands of progenitors and con-
tinuously morphs through multiple collisions and ac-
cretions, only a few of its progenitors play an over-
whelming role in its overall shape and so only these
few progenitors are sufficient to indicate the overall
lineage of the subhalo. A partial merger history is
choosing how far to travel along the main branch of
a galaxy. As shown in Figure 1, the morphology
of a galaxy at some redshift, is the result of evolu-
tion along many branches, but its protogalaxies along
the main branch can adequately trace the history of a
galaxy.
An alternate approach is to provide the algorithm
with only a few random snapshots of the universe cor-
responding to different look-back times. In this ap-
proach, merger history need not be provided. The al-
gorithm learns the underlying generative model which
can be subsequently used to infer the morphology of
galaxies at different redshifts. The motivation behind
this is the drastic reduction in the dimensionality of
the dataset. This then reduces the computation time.
Table 1: Features of dark and baryonic matter.
Dark Matter Features
Feature Description
Halo Group
Mass
Aggregate Group Mass of all subhalos
within a larger halo
Mass Critical
200, M
200
Defines the mass of a halo
Radius Critical
200, R
200
The Radius that bounds Mass Critical
200
Number of
Subhalos
Representative of the number of smaller
subhalos that make up a larger halo
Baryonic Matter Features
Feature Description
Black Hole
Mass
The mass of the central black hole in a
halo
Stellar Mass Representative of the stellar content of
a galaxy
Velocity Dis-
persion
Provides a measure of velocity of a
galaxy
Maximum of
Circular Veloc-
ity, Vmax
Maxima of the circular velocity curve
of a galaxy
In this paper, we find the set of progenitors of a
galaxy along the main branch between redshift 0 and
0.5 sufficient for our purposes. We provide progenitor
history in our first approach. In our second approach
to model the relation between dark and baryonic mat-
ter, we do not provide progenitor history at all. The
dataset construction, as well as a reporting of the re-
sults in discussed in a subsequent section.
The added advantage of using a graphical model is
the greater interpretability. SPNs augment this by al-
lowing probabilistic semantics even when there are no
conditional dependencies present (Butz et al., 2017),
while guaranteeing inference in time linear in tree
width of the network. We generate the dataset using
the results of the Eagle suite of smoothed particle hy-
drodynamical simulations (Schaye et al., 2014).
In the interest of space, we do not go into the de-
tails of SPNs, and refer interested readers to (Poon
Tractable Generative Modelling of Cosmological Numerical Simulations
903
and Domingos, 2011). These data structures allow
the modes and marginals of a probability distribution
to be computed efficiently. (See (Kisa et al., 2014)
for other data structures with such properties.) More-
over, the size, shape and the weights of the network
can also be learned from the data, either discrimi-
natively or generatively (Gens and Domingos, 2012;
Gens and Domingos, 2013). In this work, we learn
generatively with the leaf nodes of our SPN explicitly
encoded to contain univariate Bernoulli distributions.
This is however not a strict requirement: as shown
in works such as (Molina et al., 2017; Molina et al.,
2018; Rashwan et al., 2016; Hsu et al., 2017; Bueff
et al., 2018), SPNs can also be learned online with
Gaussian and other distributions, which might be use-
ful for future work on modelling physical phenomena
via generative models.
4 EMPIRICAL EVALUATIONS
Simulation Overview. The suite of Eagle simula-
tions (Schaye et al., 2014) uses a modified version
of Gadget3 hydrodynamical code, last described in
(Springel, 2005), to evolve resolution elements in
boxes of size 12, 25, 50 and 100 comoving mega
parsecs (cMpc) on a side. The cosmology employed
in the simulations is consistent with the results of
(Planck et al., 2014), where
Λ
= 0.693,
m
= 0.307,
b
= 0.04825, σ
8
= 0.8288, n
s
= 0.9611, h = 0.677,
where,
Λ
,
m
,
b
,σ
8
, n
s
, h stand for the contribu-
tions to matter/energy content of the universe from
cosmological constant, matter, baryons respectively,
h is the dimensionless Hubble parameter, n
s
is the
spectral index of the primordial power spectrum while
σ
8
is the rms amplitude of the linear mass fluctua-
tions. High resolution simulations correspond to sim-
ulations with an initial baryonic particle mass of m
g
=
2.26 x 10
5
M
while intermediate resolution simula-
tions have a higher initial baryonic particle mass, m
g
= 1.81 10
6
M
, where M
is 1 solar mass.
The key run of the simulations, which we use in
this paper, the Fiducial Ref-L0100N1504 simulation
is an intermediate resolution simulation with periodic
box with a volume of (100cMpc)
3
, initially containing
1504
3
gas particles, with an initial mass of 1.81 10
6
M
and the same amount of dark matter particles with
9.70 10
6
M
.
Substructures, including galaxies, in Eagle simu-
lations were identified using the SUBFIND algorithm
(developed in (Springel et al., 2001)). First, halos
were detected with the Friends-of-Friends (FOF) al-
gorithm (More et al., 2011) on dark matter particles,
with a linking length of 0.2 times the mean interpar-
ticle separation. Gas and star particles were assigned
to the same halo as their nearest dark matter particles.
Next, SUBFIND identified substructure candidates by
locating overdense regions within halos, defined by
saddle points in the density distribution. Finally, grav-
itationally unbound particles were removed, and the
remaining substructures were classified as galaxies.
The simulations themselves have a finite resolu-
tion and are generally not reliable on lower mass
range of satellite galaxies and dwarf halos; the
physics on lower scales is more influenced by feed-
back effects and stellar winds which are poorly un-
derstood and have no analytic solutions. In general,
many galaxy properties are unreliable below a stellar
mass of 10
9
M
. Thus, we only select central galaxies
with halo mass above 10
10
M
. For a comprehensive
discussion on the parameters of the simulation, we re-
fer readers to (Schaye et al., 2014).
4.1 Feature Engineering
Modeling galaxy formation involves complex physi-
cal processes like supernovae, accretion shocks, and
stellar winds, which require approximations and par-
tial differential equations. Probabilistic machine
learning can help model the interactions between
baryonic and dark matter, capturing their joint dynam-
ical evolution. However, feature selection remains
challenging due to the high-dimensional nature of
cosmological datasets and the influence of non-linear
processes.
Feature selection in cosmology relies heavily on
domain knowledge, particularly because most of the
universe’s energy and matter is dark. A simplified
galaxy model includes four components: dark matter
halo, stellar halo, central black hole, and stellar bulge.
The virial radius, which represents the galaxy’s size,
is often used heuristically but may not always be ac-
curate, especially for galaxies undergoing tidal stress
or collisions.
A generative model of dark and baryonic mat-
ter is crucial for understanding their distribution and
the mapping between them. For instance, predicting
baryonic content based on a halo’s merger history is
challenging, particularly considering the peak of star
formation at redshift 1–2.
The baryonic features we model as random vari-
ables are:
Black hole mass: Modeled in simulations with
feedback from active galactic nuclei, where black
holes grow through mergers and accretion.
Stellar mass: Determined within a 30 kpc aper-
ture, aligning with observed data from the Galaxy
ICAART 2025 - 17th International Conference on Agents and Artificial Intelligence
904
And Mass Assembly (Baldry et al., 2012) and
SDSS (Li and White, 2010).
Velocity dispersion: Calculated as
p
2E
k
/3M,
with E
k
being kinetic energy and M the stellar
mass within the aperture.
Maximum circular velocity: Derived using
v
c
(r) =
q
G
N
M(<r)
r
, where M(< r) is the enclosed
mass at radius r.
Dark matter features include:
Halo mass (M
200
) and radius (R
200
), defined at the
virial radius.
Halo group mass, referring to the total mass of
dark matter subhalos within a group, as identified
by the SUBFIND and FOF algorithms.
4.2 Dataset Construction
Since our method involves two different approaches,
we construct four datasets by querying both the fidu-
cial and dark matter-only models in the database for
the properties of sub-halos (galaxies) with their cor-
responding dark matter halos and halos only.
The first approach, where we provide a merger
history, corresponds to Dataset 1 and Dataset 3. With
Dataset 1, we provide SPNs with a selection of prop-
erties of the central galaxy at zero redshift in each halo
along with a description of their corresponding cen-
tral subhalo merger history from redshift 0 to redshift
0.50. This is equivalent to providing the halo history
for approximately the last 5 billion years. The merger
tree was traversed only along the main branch, see
Figure 1, of every galaxy.
The galactic properties we model are the mass of
its central black hole, stellar mass, velocity dispersion
of the stars and the maximum of the circular velocity
rotation curve of the galaxy. Dataset 3 was generated
in a similar way through the Dark Matter-Only snap-
shots in the Eagle simulations. In Dataset 3, we only
use halo properties and halo merger histories, from
redshift 0 to redshift 0.50, as inputs and query for
properties redshift 0. The common factors in Dataset
1 and Dataset 3 are the halo properties and merger
histories.
In the second approach, applied to Datasets 2 and
4 where halo history was not provided, SPN generates
models of matter distribution from snapshots. These
datasets, created from the fiducial and dark matter-
only runs, focus on galaxy-halo systems and halos
between redshifts 3.5 and 1.7. This redshift range
was chosen to model the universe more accurately, as
star formation peaked during this period. The genera-
tive model trained on this data was tested by querying
Figure 1: Merger history of a galaxy with stellar mass,
M
star
> 10
5
M
. Figure from the Eagle Database (McAlpine
et al., 2016). A galaxy’s present state results from mergers
over billions of years. Redshift 0 represents the present,
while redshift 10 corresponds to a 12 Gigayear lookback
time. The galaxy’s merger history follows a main pro-
genitor branch, shown as a thick black line in the figure.
The Descendant ID represents a galaxy at a specific time,
while the TopLeafID indicates the first progenitor along the
main branch. All other branches are indicated with a thin
line. The merger history is traced along the main progenitor
branch.
properties at redshift 0 to evaluate how well SPN cap-
tures the matter distribution. The use of dark matter-
only simulations helps assess how well SPNs approx-
imate N-body calculations, given the lack of an ana-
lytic solution and the reliance on energy and momen-
tum conservation for convergence.
5 ANALYSIS
In this section, we present and discuss the results
obtained when applying the algorithm to the Eagle
data. Using dark matter internal halo properties as
inputs, we predict the following baryonic features:
black hole mass, stellar mass, velocity dispersion, and
V
max
. These attributes result from billions of years
of evolution through dissipative, nonlinear baryonic
processes. While large-scale structure formation fol-
lows the ΛCDM model, smaller scales are vastly more
complex.
Once the SPN captures the joint distribution over
all the variables at its root node, it can be queried for
conditional and marginal likelihoods of any random
variable, such as stellar mass or central black hole
mass. The SPN’s root node can also generate syn-
thetic datasets following the learned joint distribution.
Out trained SPN has 143 edges with 144 nodes in 20
layers. The network has 19 sum nodes, 40 product
nodes and 85 leaf nodes each modeling a univariate
Bernoulli distribution. The SPN took 149 seconds to
be learned.
Tables 2 and 3 show minimal difference in errors
Tractable Generative Modelling of Cosmological Numerical Simulations
905
Table 2: Dataset 1: The structure of SPN for this dataset was learned in 847.6 seconds. Progenitor history was provided.
Feature MSE R
2
Accuracy Score PearsonR
Central Black Hole
Mass
0.041714 0.464182 0.958286 0.743506
Stellar Mass 0.019964 0.732150 0.980036 0.870518
Velocity Dispersion 0.118812 0.464540 0.881188 0.727086
V
max
0.065533 0.680239 0.934467 0.837789
Table 3: Dataset 2: The structure of SPN for this dataset was learned in 144.7 seconds. Only random snapshots were provided.
Feature MSE R
2
Accuracy Score PearsonR
Central Black Hole
Mass
0.039717 0.469593 0.960283 0.735701
Stellar Mass 0.019542 0.727792 0.980458 0.867607
Velocity Dispersion 0.107178 0.512861 0.892822 0.751796
V
max
0.055159 0.728921 0.944841 0.863211
between the two approaches. However, the computa-
tion time to learn the joint distribution is much shorter
when only snapshots are provided. Merger histories
do not significantly enhance model richness. Tables
4 and 5 present similar results for dark matter proper-
ties.
The results show that SPNs can recreate mock cat-
alogs with properties similar to those from hydrody-
namic codes. Baryonic properties, which are mass-
dependent, are predicted accurately, with the central
black hole and stellar mass linearly related to halo
mass M
200
, and velocity dispersion and V
max
governed
by mass and radius. The predicted and true distri-
butions for stellar mass and central black hole mass
match closely.
A key observation is that progenitor history does
not improve prediction accuracy, even with increased
computation time. For example, the mean squared
errors for stellar mass are nearly identical with and
without progenitor history, but computation time in-
creases with progenitor history.
This trend also holds for dark matter-only runs,
where errors in subhalo number and halo group mass
are similar with or without progenitor history, though
training with progenitor history takes longer.
Overall, the results are surprising given the com-
plexity of numerical simulations. While our model
cannot replace numerical simulations, it provides a
useful tool for exploring the galaxy-halo connection
and the impact of different simulation physics, as seen
in semi-analytic modeling.
6 DISCUSSION AND
CONCLUSIONS
We conducted an empirical study to explore the rela-
tionship between dark matter halos and their enclosed
galaxies using a Sum-Product Network (SPN), a prob-
abilistic graphical model, in the context of a large
cosmological hydrodynamic simulation. Our study
demonstrates that SPNs offer significant computa-
tional savings, making predictions in minutes com-
pared to the millions of CPU hours required by hy-
drodynamical simulations. The model accurately pre-
dicts baryonic properties like stellar mass and central
black hole mass, with strong and Pearson correla-
tion metrics. Additionally, SPNs generate synthetic
datasets, enabling further exploration of galaxy-halo
relationships. Comparing different approaches, we
found that using random snapshots instead of progen-
itor histories does not greatly affect accuracy. SPNs’
hierarchical structure and probabilistic nature provide
enhanced interpretability over other machine learning
models.
However, the model is insensitive to progenitor
histories, questioning its ability to capture complex
baryonic feedback. Its phenomenological nature lim-
its its ability to simulate processes like AGN feed-
back or star formation. The need for extensive domain
knowledge and the finite resolution of the Eagle sim-
ulations also limit its generalizability. Future work
could combine SPNs with physics-based constraints,
test on other simulations or observational data, and
develop methods to better model temporal dependen-
cies. Despite these challenges, SPNs show promise
for tractable generative modeling in cosmology, of-
fering efficiency and accuracy while complementing
traditional simulations.
The aim of this work was to assess how dark mat-
ter properties can inform the evolutionary properties
of galaxies, rather than replicating a numerically iden-
tical population. The results suggest SPNs can ef-
fectively mimic galaxy evolution in a hydrodynamic
context, with runtimes in the order of minutes versus
the millions of hours required by simulations. This
highlights the potential of probabilistic models in an-
alyzing complex physical phenomena and their role
ICAART 2025 - 17th International Conference on Agents and Artificial Intelligence
906
Table 4: Dataset 3: The structure of SPN for this dataset was learned in 1890.15 seconds. Dark Matter Only run with halo
history.
Feature MSE R
2
Accuracy Score PearsonR
Number of Subhalos 0.053304 0.442150 0.946696 0.701084
Halo group Mass 0.014672 0.799276 0.985328 0.905383
M
200
0.005449 0.929433 0.994551 0.965560
R
200
0.012702 0.938336 0.987298 0.969134
Table 5: Dataset 4: The structure of SPN for this dataset was learned in 149.5 seconds. Dark Matter Only run with just
snapshots.
Feature MSE R
2
Accuracy Score PearsonR
Number of Subhalos 0.051744 0.464274 0.948256 0.714493
Halo group Mass 0.015195 0.793547 0.974805 0.914319
M
200
0.004964 0.933783 0.994036 0.963175
R
200
0.010756 0.947684 0.989244 0.973857
in testing machine learning models. Future work will
explore advanced algorithms to further integrate ma-
chine learning in cosmology.
ACKNOWLEDGEMENTS
This work is partly supported by the EPSRC grant To-
wards Explainable and Robust Statistical AI: A Sym-
bolic Approach
REFERENCES
Agarwal, S., Dav
´
e, R., and Bassett, B. A. (2018). Painting
galaxies into dark matter haloes using machine learn-
ing. Monthly Notices of the Royal Astronomical Soci-
ety, 478(3):3410–3422.
Bach, F. R. and Jordan, M. I. (2002). Kernel independent
component analysis. Journal of machine learning re-
search, 3(Jul):1–48.
Baldry, I., Driver, S. P., Loveday, J., Taylor, E., Kelvin, L.,
Liske, J., Norberg, P., Robotham, A., Brough, S., Hop-
kins, A. M., et al. (2012). Galaxy and mass assembly
(gama): the galaxy stellar mass function at 0.06.
Monthly Notices of the Royal Astronomical Society,
421(1):621–634.
Bengio, Y. et al. (2009). Learning deep architectures for
ai. Foundations and trends® in Machine Learning,
2(1):1–127.
Benson, A., Pearce, F., Frenk, C., Baugh, C., and Jenk-
ins, A. (2001). A comparison of semi-analytic and
smoothed particle hydrodynamics galaxy formation.
Monthly Notices of the Royal Astronomical Society,
320(2):261–280.
Bertschinger, E. (1994). Cosmic structure formation. Phys-
ica D: Nonlinear Phenomena, 77(1):354 379. Spe-
cial Issue Originating from the 13th Annual Interna-
tional Conference of the Center for Nonlinear Studies
Los Alamos, NM, USA.
Bueff, A., Speichert, S., and Belle, V. (2018). Tractable
querying and learning in hybrid domains via sum-
product networks. CoRR, abs/1807.05464.
Butz, C. J., Oliveira, J. S., and dos Santos, A. E. (2017). On
learning the structure of sum-product networks. Com-
putational Intelligence (SSCI), 2017 IEEE Symposium
Series on, pages 1–8.
Cavuoti, S., Brescia, M., Riccio, G., Longo, G., et al.
(2018). Stellar formation rates in galaxies us-
ing machine learning models. arXiv preprint
arXiv:1805.06338.
Chavira, M. and Darwiche, A. (2008). On probabilistic in-
ference by weighted model counting. Artificial Intel-
ligence, 172(6-7):772–799.
Davies, E. et al. (2021). Galsim: Combining simulations
and machine learning for galaxy morphology analysis.
Astronomy and Computing, 35:100453.
Dolag, K., Borgani, S., Schindler, S., Diaferio, A., and
Bykov, A. M. (2008). Simulation techniques for cos-
mological simulations. Space science reviews, 134(1-
4):229–268.
Gens, R. and Domingos, P. (2012). Discriminative learning
of sum-product networks. Advances in Neural Infor-
mation Processing Systems, pages 3239–3247.
Gens, R. and Domingos, P. (2013). Learning the structure
of sum-product networks. International conference on
machine learning, pages 873–880.
Gheller, C., Wang, P., Vazza, F., and Teyssier, R. (2015).
Numerical cosmology on the gpu with enzo and ram-
ses. In Journal of Physics: Conference Series, volume
640, page 012058. IOP Publishing.
Guest, D., Cranmer, K., and Whiteson, D. (2018). Deep
learning and its application to lhc physics. Annual Re-
view of Nuclear and Particle Science, 68:161–181.
Heitmann, K. et al. (2021). Precision cosmology with
machine learning emulators. Astrophysical Journal,
909:122.
Ho, S. et al. (2022). Astrophysics with foundation mod-
els: Prospects and challenges. Nature Astronomy,
6:489–495.
Hsu, W., Kalra, A., and Poupart, P. (2017). Online struc-
ture learning for sum-product networks with gaussian
leaves. CoRR, abs/1701.05265.
Tractable Generative Modelling of Cosmological Numerical Simulations
907
Kamdar, H., Turk, M., and Brunner, R. (2016). Machine
learning and cosmological simulations. American As-
tronomical Society Meeting Abstracts# 227, 227.
Kisa, D., Van den Broeck, G., Choi, A., and Darwiche, A.
(2014). Probabilistic sentential decision diagrams. In
Fourteenth International Conference on the Principles
of Knowledge Representation and Reasoning.
Kobayashi, K. et al. (2022). Predicting baryonic effects on
the matter power spectrum using machine learning.
Monthly Notices of the Royal Astronomical Society,
511:3453–3463.
Koller, D., Friedman, N., and Bach, F. (2009). Probabilistic
graphical models: principles and techniques. MIT
press.
Li, C. and White, S. D. (2010). Autocorrelations of
stellar light and mass in the low-redshift universe.
Monthly Notices of the Royal Astronomical Society,
407(1):515–519.
Liang, Y., Bekker, J., and Van den Broeck, G. (2017).
Learning the structure of probabilistic sentential deci-
sion diagrams. In Proceedings of the 33rd Conference
on Uncertainty in Artificial Intelligence (UAI).
Llinares, C. (2017). The shrinking domain framework i: a
new, faster, more efficient approach to cosmological
simulations. arXiv preprint arXiv:1709.04703.
Lucie-Smith, J. et al. (2023). Learning galaxy-halo relation-
ships with graph neural networks. Monthly Notices of
the Royal Astronomical Society, 519:501–516.
McAlpine, S., Helly, J. C., Schaller, M., Trayford, J. W., Qu,
Y., Furlong, M., Bower, R. G., Crain, R. A., Schaye,
J., Theuns, T., et al. (2016). The eagle simulations of
galaxy formation: Public release of halo and galaxy
catalogues. Astronomy and Computing, 15:72–89.
Molina, A., Natarajan, S., and Kersting, K. (2017). Pois-
son sum-product networks: A deep architecture for
tractable multivariate poisson distributions. AAAI,
pages 2357–2363.
Molina, A., Vergari, A., Di Mauro, N., Natarajan, S., Espos-
ito, F., and Kersting, K. (2018). Mixed sum-product
networks: A deep architecture for hybrid domains.
Proceedings of the AAAI Conference on Artificial In-
telligence (AAAI).
More, S., Kravtsov, A. V., Dalal, N., and Gottl
¨
ober, S.
(2011). The overdensity and masses of the friends-
of-friends halos and universality of halo mass func-
tion. The Astrophysical Journal Supplement Series,
195(1):4.
M
¨
ortsell, E. (2016). Cosmological histories from the fried-
mann equation: The universe as a particle. European
Journal of Physics, 37(5):055603.
Peharz, R., Vergari, A., Stelzner, K., Molina, A., de Cam-
pos, C. P., and Kersting, K. (2020). Randomly assem-
bled tractable probabilistic models. Journal of Ma-
chine Learning Research, 21(148):1–60.
Planck, Ade, P., Aghanim, N., Armitage-Caplan, C., et al.
(2014). Planck 2013 results. xvi. cosmological param-
eters. Astron. Astrophys, 571:A16.
Poon, H. and Domingos, P. (2011). Sum-product networks:
A new deep architecture. Computer Vision Workshops
(ICCV Workshops), 2011 IEEE International Confer-
ence on, pages 689–690.
Rahman, T. and Gogate, V. (2014). Hybrid probabilis-
tic models with tractable inference. Artificial Intel-
ligence, 266:196–225.
Rashwan, A., Zhao, H., and Poupart, P. (2016). Online and
distributed bayesian moment matching for parameter
learning in sum-product networks. In Artificial Intel-
ligence and Statistics, pages 1469–1477.
Ryden, B. (2016). Introduction to cosmology. Cambridge
University Press.
Schaye, J., Crain, R. A., Bower, R. G., Furlong, M.,
Schaller, M., Theuns, T., Dalla Vecchia, C., Frenk,
C. S., McCarthy, I., Helly, J. C., et al. (2014). The
eagle project: simulating the evolution and assembly
of galaxies and their environments. Monthly Notices
of the Royal Astronomical Society, 446(1):521–554.
Somerville, R. S. and Dav
´
e, R. (2015). Physical models of
galaxy formation in a cosmological framework. An-
nual Review of Astronomy and Astrophysics, 53:51–
113.
Springel, V. (2005). The cosmological simulation code
gadget-2. Monthly notices of the royal astronomical
society, 364(4):1105–1134.
Springel, V., White, S., Tormen, G., and Kauffmann, G.
(2001). Populating a cluster of galaxies-i. results at
[formmu2] z= 0, mnras 328 (dec., 2001) 726–750.
arXiv preprint astro-ph/0012055.
Tamosiunas, A. et al. (2023). Semi-supervised learning for
galaxy classification with limited labeled data. Astron-
omy & Astrophysics, 672:A1.
Vergari, A., Mauro, N. D., Esposito, F., and Peharz,
R. (2021). Compositional generative models with
tractable inference. In Proceedings of the 34th Inter-
national Conference on Neural Information Process-
ing Systems (NeurIPS).
Villaescusa-Navarro, F. et al. (2021). The camels project:
Machine learning cosmological and astrophysical
constraints from galaxy catalogues. Astrophysical
Journal, 915:71.
Xu, X., Ho, S., Trac, H., Schneider, J., Poczos, B., and
Ntampaka, M. (2013). A first look at creating mock
catalogs with machine learning techniques. The As-
trophysical Journal, 772(2):147.
ICAART 2025 - 17th International Conference on Agents and Artificial Intelligence
908