the inner nodes, not between the nodes with different labels (i.e. their content). (For example, a unit can become sensitive to a tree (a(b –)), i.e. it is insensitive to the contents of the right child, which is an inner tree node.) STQD is defined as
\[
\mathrm{STQD} = \sum_{i=1}^{N} p_i\, s'_i \qquad (4)
\]
where $s'_i$ denotes the STRF of the $i$-th neuron and $p_i$ is the same as above.
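As an illustration, the following minimal sketch (in Python) shows how Eq. 4 could be evaluated, assuming the per-neuron quantities $p_i$ and $s'_i$ have already been computed as defined above; the names stqd, p and s_prime are ours and serve only to spell out the weighted sum.

import numpy as np

def stqd(p, s_prime):
    # Eq. 4: weighted sum of the per-neuron STRF values s'_i,
    # with the same weights p_i as used in the preceding measure.
    p = np.asarray(p, dtype=float)
    s_prime = np.asarray(s_prime, dtype=float)
    return float(np.sum(p * s_prime))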
The other pair of measures was introduced to quantify the discrimination capacity of the models, i.e. their ability to unambiguously represent different trees. This implies the ability to uniquely represent all vertices (subtrees) contained in all trees (Hammer et al., 2004b). One view of the vertex representation is in terms of a separate winner reserved for each vertex. The alternative view is based on a distributed representation of vertices that involves the overall map output activation. The proposed measures focus on these two alternatives.
The third measure, the winner differentiation (WD), refers to the level of winners and is computed as the ratio of the number of different winners identified to the number of all different subtrees in the data set (including the leaves), that is
\[
\mathrm{WD} = \frac{|\{\, j \mid \exists t : j = i^{*}(t) \,\}|}{|\{\text{vertices in data set}\}|}. \qquad (5)
\]
WD < 1 indicates that not all vertices could be distinguished by the map (i.e. two or more different vertices would share the winner). The fourth, more “detailed” measure looks at the differences between map output activation vectors and yields the difference between the two most similar representations (probably corresponding to two very similar trees such as (v(d(an))) and (v(dn))). We will refer to this measure as the normalized minimum Euclidean distance
\[
\mathrm{MED} = \min_{u \neq v} \left\{ \, \| y(T_u) - y(T_v) \| / N \, \right\}, \qquad (6)
\]
where $y(T_z)$ is the map output activation vector (whose components are obtained using Eq. 2) corresponding to the processing of the root of the tree $T_z$. MED > 0 implies that all vertices can be distinguished in terms of map output activation vectors.
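Both discrimination measures are straightforward to evaluate once the winners $i^{*}(t)$ and the output activation vectors $y(T_z)$ (Eq. 2) have been collected. The following sketch assumes these are available as plain Python lists; the function and argument names are ours.

import numpy as np
from itertools import combinations

def winner_differentiation(winners, n_vertices):
    # Eq. 5: fraction of vertices that obtained a distinct winner.
    # winners    -- winner index i*(t) for every vertex in the data set
    # n_vertices -- number of all different vertices (subtrees and leaves)
    return len(set(winners)) / n_vertices

def min_euclidean_distance(outputs, n_units):
    # Eq. 6: smallest normalized Euclidean distance between any two
    # map output activation vectors y(T_u) and y(T_v).
    return min(np.linalg.norm(np.asarray(y_u) - np.asarray(y_v)) / n_units
               for y_u, y_v in combinations(outputs, 2))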
4 EXPERIMENTS
We tuned the model parameters experimentally according to the task difficulty. We started with basic maps of 10×10 units in the case of binary trees and 15×15 units for ternary propositions. We looked for the models with the best discrimination capacity, as determined by the largest number of unique winners in representing different vertices (captured by the WD measure). We also tested larger maps with 225 and 400 units, using the optimal parameters (α, β) found
for the initial map size. In MSOM, the parameter
of the context descriptor was set to its default value
γ = 0.5 in both experiments. In all simulations, the
leaves were assigned localist (one-hot) codes (to be
treated as symbols). We systematically searched the (α, β) parameter space, since each model can trade off the effect of leaves against that of contexts. It was observed that
increasing α (while keeping β constant) did not affect
output representations of leaves but led to the overall
decrease of activations for trees. Increasing β (with constant α) led to a gradual vanishing of the output representations of leaves and to a focusing of the activations for trees (which also vanished in combination with a higher α).
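The search itself can be organized as a plain grid search. The sketch below assumes a hypothetical routine train_map(alpha, beta) that trains one model and returns its WD score; the grid bounds in the usage comment are illustrative only.

def search_alpha_beta(train_map, alphas, betas):
    # Exhaustive search over the (alpha, beta) grid; keeps the pair
    # with the largest WD score (Eq. 5).
    best_params, best_wd = None, -1.0
    for a in alphas:
        for b in betas:
            wd = train_map(a, b)
            if wd > best_wd:
                best_params, best_wd = (a, b), wd
    return best_params, best_wd

# Illustrative usage:
# best_params, best_wd = search_alpha_beta(
#     train_map,
#     alphas=[0.1 * k for k in range(1, 11)],
#     betas=[0.1 * k for k in range(1, 11)])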
4.1 Binary Syntactic Trees
This data set contained 7 syntactic trees with labeled
leaves and unlabeled inner nodes (vertices) (Table 1).
The trees were generated by a simple grammar origi-
nally developed for testing the representational capac-
ity of the Recursive Auto-Associative Memory (Pol-
lack, 1990). For the RAAM, being a two-layer perceptron trained as an auto-associator, the ability to represent a tree involves its successful encoding (at the hidden layer) and the subsequent unambiguous decoding (at the output layer). In the case of our unsupervised feedback maps, there is only the encoding part. The ability
of the map to represent a tree implies its ability to also
uniquely represent all vertices (subtrees) contained in
the training set (listed in the right half of Table 1).
Similarly to RAAM, processing a tree in a feed-
back map proceeds bottom-up from the leaves, for
which context activations are set to zero vectors, up to
the root. When processing the inner nodes, the inputs
s(t) are set to zero vectors. Intermediate results (activations $p_{ch(j)}$) are stored in a buffer to be retrieved later. The weights are updated in each discrete step.
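The bottom-up order and the buffering of child activations can be sketched as a simple recursion. The node attributes (label, children) and the routine process_node, which stands for presenting one vertex to the map and returning its activation, are hypothetical names introduced here for illustration.

import numpy as np

def process_tree(node, process_node, input_dim):
    # Returns the map activation obtained after processing the root of `node`.
    if not node.children:
        # Leaf: present its one-hot label; context activations are zero.
        return process_node(node.label, child_activations=[])
    # Inner node: process the children first and buffer their activations.
    buffer = [process_tree(child, process_node, input_dim)
              for child in node.children]
    zero_input = np.zeros(input_dim)  # inner nodes carry no label, s(t) = 0
    return process_node(zero_input, child_activations=buffer)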
The models were trained for 2000 epochs. During
the first 60% of epochs, the neighborhood width decreased linearly, σ: 3 → 0.5 (ordering phase), and was then kept constant (fine-tuning phase). For the larger maps, the initial neighborhood width was proportionally increased and the profile was kept the same. The learning rate decreased linearly, µ: 0.3 → 0.1, during the ordering phase and was then kept constant.
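Both parameters follow the same piecewise-linear annealing profile, which can be written as a small helper; the function and argument names are ours.

def linear_schedule(epoch, n_epochs, start, end, ordering_frac=0.6):
    # Linear decrease from `start` to `end` during the ordering phase
    # (the first 60% of epochs by default), constant afterwards.
    ordering_epochs = ordering_frac * n_epochs
    if epoch >= ordering_epochs:
        return end
    return start + (end - start) * epoch / ordering_epochs

# sigma = linear_schedule(epoch, 2000, 3.0, 0.5)   # neighborhood width
# mu    = linear_schedule(epoch, 2000, 0.3, 0.1)   # learning rate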
For the best models of all sizes, we present the four
quantitative measures (averaged over 100 runs) in Ta-
bles 2 and 3 and also the (typical) graphical informa-
tion about unit weights and output activations. Stan-