networks are constructed from a set of neurons, each of which computes a weighted sum of its inputs (from other neurons or from external inputs) and then applies an activation function to the result. Each neuron may be interconnected with many other neurons. Provided that non-linear activation functions are used, such networks can be applied to complex problems (Hopfield, 1985; Guo et al., 2022).
In (Hopfield, 1985), the authors use ANNs to show that these networks can be applied to optimization. Specifically, they apply them to the Traveling Salesman Problem, a well-known complex optimization problem, and demonstrate their effectiveness in solving it.
In the literature, different ANN architectures have been proposed to address different classes of problems. The Discrete Hopfield Neural Network (DHNN) is a specific type of ANN designed for combinatorial problems. These networks divide the neurons into input and output neurons, and the neurons take either bipolar (-1, 1) or binary (0, 1) states. One shortcoming of DHNNs is the lack of a symbolic rule to represent the connectivity of the neurons. In (Guo et al., 2022), the authors propose a novel logical rule that mixes systematic logical rules with non-systematic ones by exploiting a random clause generator.
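As a minimal illustration of the Hopfield-style dynamics underlying DHNNs (a hypothetical sketch with Hebbian weights, not the logical-rule construction of (Guo et al., 2022)), a bipolar network can store a pattern and recover it from a corrupted input:

```python
import numpy as np

def train_hebbian(patterns):
    """Hebbian learning: W is the average outer product of the bipolar patterns."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)  # no self-connections
    return W / patterns.shape[0]

def recall(W, state, steps=10):
    """Asynchronous updates: each neuron takes the sign of its weighted input."""
    state = state.copy()
    for _ in range(steps):
        for i in range(len(state)):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

pattern = np.array([1, -1, 1, -1, 1, -1])
W = train_hebbian(pattern[None, :])
noisy = pattern.copy()
noisy[0] = -1                 # flip one bit
print(recall(W, noisy))       # converges back to the stored pattern
```

The sign update drives the state toward a minimum of the network energy, which is what makes such networks usable for combinatorial optimization.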
Variational Autoencoders (VAEs) are another type of ANN, used in fields such as data compression, de-noising, and data generation. These networks consist of two main parts, an Encoder and a Decoder, each built from multiple neural network layers. The network has a low-capacity bottleneck in the middle through which information passes from the Encoder to the Decoder, forcing the system to generate a lower-dimensional representation of each signal known as a latent code.
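The Encoder/bottleneck/Decoder structure can be sketched as follows; this is a toy single-layer forward pass with randomly initialized weights (not any particular published architecture), using the standard VAE reparameterization step to sample the latent code:

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, latent_dim = 16, 2   # bottleneck is much smaller than the input

# Hypothetical single-layer encoder/decoder weights, randomly initialized.
W_mu = rng.normal(size=(latent_dim, in_dim))
W_logvar = rng.normal(size=(latent_dim, in_dim))
W_dec = rng.normal(size=(in_dim, latent_dim))

def encode(x):
    """Encoder maps the input to a distribution over latent codes."""
    return W_mu @ x, W_logvar @ x

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps so gradients can flow through the sampling."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Decoder reconstructs the input from the low-dimensional latent code."""
    return W_dec @ z

x = rng.normal(size=in_dim)
mu, logvar = encode(x)
z = reparameterize(mu, logvar)
x_hat = decode(z)
print(z.shape, x_hat.shape)   # latent code is far smaller than the input
```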
In (Cosmo et al., 2020), a VAE is used to generate deformable 3D shapes with higher accuracy. The system applies a disentanglement technique, dividing the latent space into two sub-spaces: one dedicated to the intrinsic features of the 3D shapes and the other to their extrinsic features. In this way, each sub-space of the latent code stores a specific type of information about the object. The algorithm is trained with three loss functions: Reconstruction, Interpolation, and Disentanglement. The Disentanglement loss term is itself a combination of two terms, disentanglement-int and disentanglement-ext.
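The split of the latent code into two dedicated sub-spaces can be shown schematically; the two loss terms below are hypothetical placeholders (simple MSE targets) standing in for the paper's actual disentanglement-int and disentanglement-ext terms:

```python
import numpy as np

def disentanglement_loss(z, target_int, target_ext):
    """Placeholder: combine an intrinsic and an extrinsic term, as in the paper's
    two-part Disentanglement loss. The MSE targets here are illustrative only."""
    z_int, z_ext = np.split(z, 2)                    # two dedicated sub-spaces
    loss_int = np.mean((z_int - target_int) ** 2)    # hypothetical intrinsic term
    loss_ext = np.mean((z_ext - target_ext) ** 2)    # hypothetical extrinsic term
    return loss_int + loss_ext

z = np.array([0.2, 0.4, 1.0, 1.2])
print(disentanglement_loss(z, np.zeros(2), np.ones(2)))  # ≈ 0.12
```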
In (Pu et al., 2016), a specific VAE configuration is used to predict labels and captions for images. In this system, a Convolutional Neural Network (CNN) serves as the encoder and a Deep Generative Deconvolutional Network (DGDN) as the decoder. The latent codes are also fed into a Bayesian Support Vector Machine (BSVM) to generate labels, and into a Recurrent Neural Network (RNN) to generate captions for the images. The latent space is thus shared between the DGDN, which is responsible for decoding and reconstructing the images, and the BSVM or RNN. The network is trained by maximizing a variational lower bound on the data likelihood.
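For a generic VAE, this variational lower bound (the ELBO) on the log-likelihood takes the standard form, with encoder $q_\phi(z \mid x)$, decoder $p_\theta(x \mid z)$, and prior $p(z)$:

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
\;-\; \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```

The first term rewards accurate reconstruction; the KL term keeps the latent distribution close to the prior.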
In (Venkataramani et al., 2019), a VAE is used for source separation. To this end, the system learns a latent code space shared between mixed and clean voice signals. This work treats source separation as a style-transfer problem in VAEs: it assumes the mixture is a clean voice signal mixed with noise, so the objective of the network is to transfer the style of the input and output a clean form of the signal.
In (Sadeghi and Alameda-Pineda, 2020) and (Sadeghi et al., 2020), VAEs are used to enhance voice quality using audio-visual information. The model exploits visual data of lip movements in addition to the audio data. The main idea is that even when the voice is poorly recorded or degraded by noise, the visual data of the lip movements remain mostly untouched and extractable. In this work, the Short-Time Fourier Transform (STFT) of the voice is extracted first, providing the frequency representation of the voice signal in each time frame. The VAE is trained on these frequency bins and learns the latent-domain distribution of the voice signals. The evidence lower bound (ELBO) of the likelihood function is maximized to estimate the parameters of the network. In addition to this Audio VAE (AVAE), the method introduces two networks to extract speech information from the visual inputs, referred to as the Base Visual VAE (BVVAE) and the Augmented Visual VAE (AVVAE). The inputs of these networks are lip images that are captured and centered using computer vision methods. The BVVAE is a two-layer fully connected network. Finally, to obtain a combined audio-visual model, the Conditional VAE (CVAE) framework is exploited. For training the combined model, the network is provided with the data as well as the related class labels in order to estimate the data distribution.
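The STFT step described above, which turns the waveform into per-frame frequency bins, can be sketched as follows (the frame length, hop size, and window choice here are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Windowed FFT per frame: one row of frequency bins per time frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # shape: (time frames, frequency bins)

# 1 second of a 440 Hz tone sampled at 16 kHz stands in for a voice signal.
voice = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = stft(voice)
print(spec.shape)   # (124, 129): the time-frequency bins the VAE is trained on
```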
In (Palash et al., 2019), the authors use a VAE for processing textual data in order to transfer the
ICAART 2023 - 15th International Conference on Agents and Artificial Intelligence