Figure 1: Term mapping.
Automatic indexation based on term
detection requires the processing chain shown in Figure 2:
Figure 2: Indexation and classification architecture.
The first step is tokenisation; a morphological
analysis then determines word structure, delivering a
(richly) tagged sentence. A grammatical analysis
identifies phrases, multi-word units, and syntactic
variations of compounds. A component not
mentioned so far is named entity recognition
(NER), which also works on a linguistic basis (not
described here). After the detection of terms, a weighting
component determines the descriptors' relevance.
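The chain just described can be sketched as a sequence of stages. The toy rules below (capitalised words as nouns, adjacent nouns as multi-word terms) are illustrative stand-ins, not the actual linguistic components of the system.

```python
# Illustrative sketch of the indexation chain: tokenisation ->
# morphological tagging -> phrase detection -> term extraction.
# All rules here are toy stand-ins for the real linguistic components.

def tokenise(text):
    # Split on whitespace and strip simple punctuation.
    return [t.strip(".,;:!?") for t in text.split() if t.strip(".,;:!?")]

def tag_morphology(tokens):
    # Toy tagger: capitalised words count as nouns, the rest unknown.
    return [(t, "N" if t[:1].isupper() else "X") for t in tokens]

def detect_terms(tagged):
    # Toy grammar: runs of adjacent nouns form a multi-word term.
    terms, current = [], []
    for word, tag in tagged + [("", "X")]:
        if tag == "N":
            current.append(word)
        elif current:
            terms.append(" ".join(current))
            current = []
    return terms

def index(text):
    return detect_terms(tag_morphology(tokenise(text)))

print(index("Die Klinik senkt Kosten im Krankenhaus Betrieb."))
# → ['Die Klinik', 'Kosten', 'Krankenhaus Betrieb']
```

A real implementation would replace each toy function with the corresponding linguistic analyser; the interface between the stages stays the same.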
LINSearch uses a heuristic procedure (not TF-IDF).
The relevant factors are the following. Term
frequency (TF): the more frequent a term, the more important it is;
the linguistic analysis also allows compound
parts to be taken into account. The weighting further considers
'semantic classes', i.e. the number of items of a
specific semantic class to which a descriptor
belongs. The idea is that if a text mainly exhibits a
specific semantic class, say 'institution', then it
can be expected to be predominantly about
institutions. Another factor is the position of a term in
the text, e.g. in a heading or in plain text. A factor
still under research is the relevance of linguistic
information, e.g. the information structure of the sentence
(theme vs. rheme). A statistical component, described
in the next section, classifies the document.
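How such factors might be combined can be sketched as below. The particular weights and factor values are invented for illustration and are not the actual LINSearch heuristic.

```python
# Toy heuristic weighting combining term frequency, compound-part
# frequency, semantic-class prominence, and position (heading vs. body).
# All numeric weights are invented for illustration.

def weight(term, tf, compound_tf, class_share, in_heading):
    score = tf + 0.5 * compound_tf   # occurrences as compound part count half
    score *= 1.0 + class_share       # boost terms of the dominant semantic class
    if in_heading:
        score *= 2.0                 # headings signal topicality
    return score

# A term seen 3 times on its own and twice inside compounds, whose
# semantic class covers 40% of the text's descriptors, in a heading:
print(weight("Krankenhaus", tf=3, compound_tf=2, class_share=0.4, in_heading=True))
```

The multiplicative combination is one plausible design choice; an additive model with tuned coefficients would fit the same description.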
2.2 Query Processing
The previous section described how term processing
is used for indexation: variants of terms are mapped
onto standard thesaurus terms. A query may contain
such variants as well, so the same problem occurs at
retrieval time, namely the mapping onto standard
terms. This is shown in the lower half of Figure 1.
Promising attempts have been made with query
processing: each query is indexed like an ordinary
text document in the database, delivering the terms
for the search in the database, so that 'hospital cost
reduction strategy' leads to 'strategies for cost
reduction in hospitals'.
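The effect of indexing the query like a document can be sketched as follows: both query and documents are reduced to the same descriptor set, so variant formulations retrieve each other. The mapping table is a hypothetical stand-in for the thesaurus.

```python
# Sketch: query and document are normalised to the same descriptors.
# The mapping table below is a hypothetical stand-in for the thesaurus.

THESAURUS = {
    "hospital": "hospital", "hospitals": "hospital",
    "cost": "cost", "costs": "cost",
    "reduction": "reduction",
    "strategy": "strategy", "strategies": "strategy",
}

def descriptors(text):
    # Keep only words known to the thesaurus, mapped to standard terms.
    words = [w.strip(".,").lower() for w in text.split()]
    return {THESAURUS[w] for w in words if w in THESAURUS}

q = descriptors("hospital cost reduction strategy")
d = descriptors("strategies for cost reduction in hospitals")
print(q == d)  # → True: both map to the same descriptor set
```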
Retrieval exhibits further problems. A
query corpus revealed several reasons leading to 'no
hits'. For German there is the problem of
morphological variants (Haus vs. Häuser). It also
often happened that the user applied a wrong
language parameter, e.g. searching English documents
with German terms. Another problem was
orthographic errors in the query. The first error
type will be addressed by a full analysis of
both the query and the database, with the mapping done
at the level of the 'lexical unit'. A language detection
component identifies the language of the query. A spell
checker provides correction proposals and suggests
an automatic correction of orthographic errors. The
combined effect of these tools is a substantial reduction of 'no
hits' (roughly 70%).
2.3 Evaluation
The basis of the evaluation is a manual reference
indexation of 500 documents, carefully done by five
experts (to ensure consistency). The evaluation
measures are recall (results measured against the reference)
and precision (the proportion of correct items
among all items found).
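With these definitions, both measures can be computed over descriptor sets as follows; the example sets are invented, not taken from the evaluation corpus.

```python
# Recall and precision of an automatic indexation against a manual
# reference, computed over descriptor sets. Example data is invented.

def recall_precision(found, reference):
    correct = found & reference
    recall = len(correct) / len(reference)
    precision = len(correct) / len(found)
    return recall, precision

reference = {"hospital", "cost", "reduction", "strategy", "finance"}
found = {"hospital", "cost", "budget", "clinic", "reduction"}
r, p = recall_precision(found, reference)
print(r, p)  # → 0.6 0.6
```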
Recall turned out to be 40% on the
reference corpus, while precision is 28%, which
does not seem impressive. If synonyms (that are
themselves thesaurus terms) are accepted instead of the
descriptors, the figures improve to about 45%.
Recall and precision alone are not sufficient, however.
Therefore a 'qualitative' evaluation was carried out,
an evaluation by experts in terms of
'qualitative acceptability'. The general result is that
the automatic indexation is of such good quality that
the system will be used for everyday work.
KDIR 2009 - International Conference on Knowledge Discovery and Information Retrieval