view, these non-semantic aims have the upper hand to such an extent that semantic
structure is often very difficult to discern from syntactic structure; syntactic structure
is then more an encryption of semantics than a useful gateway into it. This section
addresses this question, and provides some evidence that one need not be overly
pessimistic.
What we look at is the relationship between parse quality and AR performance.
First we establish the performance of a particular parsing system, making use of
particular linguistic knowledge bases (see full in the tables below). For this discussion
the details of this parsing system are not important, save to say that it is a chart parser
whose final step is to select a sequence of edges spanning the input. Though the parsing
system is imperfect, it could be worse. Randomly removing 50% of the linguistic
knowledge base should make for worse structures (see thin50), as should manually
stripping out parts of it (see manual), while worst of all should be the entirely flat
parses which result if the knowledge bases are empty. In each case, the AR performance
was determined. To get a picture of AR performance should parse quality improve, we
hand-corrected the parses of the queries and their correct answers – see gold in the
tables below.
The tables below give the results using the sub-tree distance and the weighted sub-tree
distance. The numbers describe the distribution of the correct cutoff over the queries,
so lower numbers are better. Figure 2 shows the empirical cumulative density function
(ecdf) for the correct cutoff obtained with the weighted sub-tree measure with wild
cards: for each cut-off d, it plots the percentage of queries q with cc(q) ≤ d. Clearly,
as we move through the different parse settings, performance improves.
sub-tree distance
Parsing   1st Qu.    Median     Mean      3rd Qu.
flat      0.1559     0.2459     0.2612    0.3920
manual    0.0349     0.2738     0.2454    0.3940
thin50    0.01936    0.1821     0.2115    0.4193
full      0.0157     0.1195     0.1882    0.2973
gold      0.00478    0.04       0.1450    0.1944
weighted sub-tree distance
Parsing   1st Qu.    Median     Mean      3rd Qu.
flat      0.1559     0.2459     0.2612    0.3920
manual    0.0215     0.2103     0.2203    0.3926
thin50    0.01418    0.02627    0.157     0.2930
full      0.00389    0.04216    0.1308    0.2198
gold      0.00067    0.0278     0.1087    0.1669
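The ecdf of Figure 2 can be computed directly from the per-query correct cutoffs. A minimal sketch in Python, where the cc values are invented purely for illustration (the actual per-query values are not given here):

```python
import numpy as np

def ecdf(cutoffs, d):
    """Fraction of queries q whose correct cutoff cc(q) is <= d."""
    cc = np.asarray(cutoffs, dtype=float)
    return float(np.mean(cc <= d))

# hypothetical correct-cutoff values for five queries
cc_values = [0.004, 0.042, 0.131, 0.220, 0.390]

# evaluating ecdf(cc_values, d) for increasing d traces the curve;
# a parse setting whose curve rises faster (more mass at low cutoffs) is better
print(ecdf(cc_values, 0.131))  # 0.6
```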
The results indicate that the more successful the parser is in recovering the correct
syntactic structure, the more successfully tree-distance can be used in the AR task.
Reversing this implication, this also suggests a way in which the AR task could be
used as an evaluation metric for a parser. Such an application has some attractions,
the main one being that the raw materials for the AR task are questions and answers, as
plain text, not syntactic structures. This avoids the difficulties that arise when trying to
evaluate by comparing a parser's native output with non-native gold-standard treebank
parses. So long as the parser can assign tree structures to queries and answers, the
tree-distance measure can compare them, regardless of that parser's native notational or
theoretical commitments.
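The paper's own definition of the correct cutoff appears in an earlier section; the sketch below is one plausible reading, assuming cc(q) is the normalised rank at which the correct answer appears when candidates are sorted by tree-distance to the query. The tree_distance argument and the toy string-length distance are placeholders, not the paper's measure:

```python
def correct_cutoff(query_tree, candidate_trees, correct_idx, tree_distance):
    """Normalised rank of the correct answer when candidates are sorted
    by increasing tree-distance to the query (0 = ranked first)."""
    dists = [tree_distance(query_tree, t) for t in candidate_trees]
    order = sorted(range(len(dists)), key=lambda i: dists[i])
    return order.index(correct_idx) / (len(candidate_trees) - 1)

# toy stand-in distance on strings, purely for illustration
toy_dist = lambda a, b: abs(len(a) - len(b))
print(correct_cutoff("abcd", ["ab", "abc", "abcdefgh"], 1, toy_dist))  # 0.0
```

Under this reading, any parser that outputs trees, whatever its notation, can be scored by how low its cc(q) distribution sits.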
5 Distance Measures Compared
In section 2 some parameters of variation in the definition of tree-distance were
introduced: sub-tree vs. whole-tree, weights, wild cards, and lexical emphasis. The perfor-