
and adaptability across various data classification scenarios.
First, we build a 3D matrix by concatenating the classifiers' predictions, with the training samples followed by the testing samples. In this matrix, the first dimension represents the classifier, the second represents the sample objects, and the third represents the class. Therefore, each matrix cell holds the value predicted for a specific class by a given classifier for a given sample object.
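The construction of this 3D matrix can be sketched as follows; the classifier outputs here are random placeholders, and the array shapes are assumptions for illustration:

```python
import numpy as np

# Hypothetical raw outputs: for each classifier, a 2D array of shape
# (n_samples, n_classes), with training samples stacked before test samples.
rng = np.random.default_rng(0)
n_classifiers, n_samples, n_classes = 3, 10, 4
per_classifier_preds = [rng.normal(size=(n_samples, n_classes))
                        for _ in range(n_classifiers)]

# Stack into the 3D matrix: (classifier, sample object, class).
pred_matrix = np.stack(per_classifier_preds, axis=0)
print(pred_matrix.shape)  # (3, 10, 4)
```

Indexing `pred_matrix[c, s, k]` then yields classifier `c`'s prediction for class `k` on sample `s`.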
The second step normalizes the 3D prediction matrix with a function that maps the raw predictions into the range [0, 1]. We employed the softmax function in this work, but other functions, such as normalization and the sigmoid, can be used.
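A minimal sketch of this normalization step, applying softmax over the class axis of the 3D matrix:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the per-row maximum for numerical stability, then normalize.
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

# Placeholder raw prediction matrix: (classifier, sample, class).
pred_matrix = np.random.default_rng(1).normal(size=(3, 10, 4))
norm_matrix = softmax(pred_matrix)  # softmax over the class axis

# Every value lands in [0, 1] and each class distribution sums to 1.
assert ((norm_matrix >= 0) & (norm_matrix <= 1)).all()
assert np.allclose(norm_matrix.sum(axis=-1), 1.0)
```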
In the third step, we start the optimization process. The algorithm creates a random population of size population size + new individuals. The random generation of each tree follows three restrictions: (i) the root node is always a function (calculation) node, (ii) a node at depth equal to max depth is always an extraction node, and (iii) the remaining nodes are chosen randomly between calculation and extraction nodes.
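The three restrictions above can be sketched as a recursive generator; the dictionary-based node representation, the placeholder function set, and the 50/50 choice at intermediate depths are assumptions for illustration:

```python
import random

FUNCTIONS = ["mean", "max", "product"]  # placeholder calculation functions
N_CLASSIFIERS = 3                       # assumed number of base classifiers

def random_tree(depth=0, max_depth=4):
    at_root = depth == 0
    at_limit = depth == max_depth
    # (i) root is always a function node; (ii) nodes at max_depth are always
    # extraction nodes; (iii) everything in between is chosen at random.
    make_function = at_root or (not at_limit and random.random() < 0.5)
    if make_function:
        return {"type": "function",
                "op": random.choice(FUNCTIONS),
                "children": [random_tree(depth + 1, max_depth)
                             for _ in range(2)]}
    # Extraction node: copies one classifier's 2D prediction matrix.
    return {"type": "extract", "classifier": random.randrange(N_CLASSIFIERS)}
```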
It is important to note that population size and number of generations trade off inversely: a larger population may require fewer generations to find good solutions, and vice versa. In addition, the available GPU memory plays a crucial role in determining the population size: a higher GPU capacity allows for larger populations, which can explore the solution space more comprehensively in fewer generations.
Once the population is generated, each tree's fitness is computed as the accuracy, on the training set, of the prediction matrix produced by that tree. The algorithm then performs the crossover operator based on each tree's fitness and applies the mutation operator to each resulting tree. Furthermore, new individuals random individuals are generated and added to the population. This process repeats until the number of generations is reached.
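The fitness computation described above can be sketched as follows: the fused 2D matrix produced by a tree is sliced to its training rows, the predicted class is taken as the argmax over the class axis, and accuracy is measured against the training labels. The example values are illustrative only:

```python
import numpy as np

def fitness(fused_matrix, y_train, n_train):
    # Training samples come first in the matrix (training-then-testing order),
    # so the first n_train rows are the training split.
    predicted = fused_matrix[:n_train].argmax(axis=1)
    return float((predicted == y_train).mean())

# Toy fused matrix: 4 samples (3 training, 1 testing), 2 classes.
fused = np.array([[0.7, 0.3],
                  [0.2, 0.8],
                  [0.9, 0.1],
                  [0.4, 0.6]])
y_train = np.array([0, 1, 0])
print(fitness(fused, y_train, n_train=3))  # 1.0
```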
In this work, GP comprises two types of nodes:
• Prediction matrix extraction nodes: these nodes are always leaves of the tree and copy the 2D prediction matrix of one classifier, chosen at random when the node is generated.
• Function nodes: when this type of node is gen-
erated, one of the functions is randomly chosen.
Table 1 lists all available functions employed in
this work.
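The interplay of the two node types can be sketched as a recursive evaluation: extraction leaves return a classifier's 2D matrix, and function nodes combine their children's matrices. The dictionary node layout is an assumption, and element-wise mean stands in for whichever function from Table 1 the node carries:

```python
import numpy as np

def evaluate(node, pred_matrix):
    if node["type"] == "extract":
        # Leaf: copy one classifier's 2D (sample, class) matrix.
        return pred_matrix[node["classifier"]]
    # Function node: combine the children's matrices (here: element-wise
    # mean, one of the [0, 1]-preserving function choices).
    child_outputs = [evaluate(c, pred_matrix) for c in node["children"]]
    return np.mean(child_outputs, axis=0)

pred_matrix = np.full((2, 4, 3), 0.5)  # (classifier, sample, class)
tree = {"type": "function", "children": [
    {"type": "extract", "classifier": 0},
    {"type": "extract", "classifier": 1}]}
out = evaluate(tree, pred_matrix)
print(out.shape)  # (4, 3)
```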
Notice that this approach can be applied to any classification problem (e.g., text, image, video, or audio), even to a combination of classifiers designed for different types of data, e.g., video classifiers to classify the video while text classifiers analyze and categorize the video description. The only requirement is that the order of the sample objects and classes must be the same across all classifiers.
Figure 1 illustrates the pipeline of the methodology proposed in this work.
3.1 Datasets
The experiments were performed on two datasets:
• The HMDB51 dataset is a large collection of re-
alistic videos from movies and web videos, com-
prising 6,766 clips in 51 action categories, with
a fixed frame rate of 30 FPS, a fixed height of
240, and a scaled width to maintain the original
aspect ratio. These categories cover a wide range
of human actions, like driving, fighting, running,
and drinking, among other classes (Kuehne et al.,
2011).
• The UCF101 dataset, an extension of UCF50,
contains 13,320 video clips classified into 101 cat-
egories. All videos are sourced from YouTube,
with a fixed frame rate of 25 FPS and a resolution
of 320×240. Some videos present challenges such as poor lighting, cluttered backgrounds, and significant camera motion (Soomro et al., 2012).
3.2 Experimental Setup
In this section, we present the experimental setup con-
cerning the optimization process employing the GP
algorithm:
• Maximum depth of the tree (max depth): [2, 7].
• Population size (population size): 20 and 10 for
HMDB51 and UCF101, respectively.
• New individuals (new individuals): 10 and 5 for
HMDB51 and UCF101, respectively.
• Number of generations: 400.
• Mutation rate (mutation rate): 0.5.
The values above were empirically set.
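The hyperparameters listed above can be gathered into a configuration sketch; the dictionary layout and key names are illustrative, with the values taken directly from the text:

```python
# GP hyperparameters as reported, per dataset where they differ.
GP_CONFIG = {
    "max_depth_range": (2, 7),                      # tree depth interval
    "population_size": {"HMDB51": 20, "UCF101": 10},
    "new_individuals": {"HMDB51": 10, "UCF101": 5},
    "n_generations": 400,
    "mutation_rate": 0.5,
}
```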
Six groups of functions are used in this work:
Mathematical, Fuzzy, Geometric, Average, Weighted
Average, and Self-Functions. All these functions ad-
here to a crucial premise: they accept input values
in the range [0, 1] and return values within the same
interval. The specific functions can be found in Ta-
ble 1. Weighted average and self-math functions are
the foundation for the system’s core functions, which
VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications