put for several patent-related applications, machine translation being the most important one.
The NLU component is based on empirical data gathered from real patent claims and features robust, data-tuned techniques for the natural language understanding of combinations of linguistic phenomena known as hard parsing problems: long-distance dependencies, parallel structures, noun and verb attachment of prepositional phrases, syntactic gaps, etc. Parsing proper is based on a domain-tuned lexicon and a combination of phrase and dependency grammars.
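As a minimal sketch of how the two formalisms can complement one another (the types and relation labels below are illustrative assumptions, not the system's actual data model), the phrase grammar can be thought of as yielding flat NP/PP/VP constituents, while the dependency grammar links their heads within a simple clause:

    // Illustrative C++ sketch: hypothetical types, not the system's API.
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Phrase {
        std::string label;   // "NP", "PP", "VP", ...
        std::string text;    // surface span covered by the phrase
    };

    struct Dependency {
        std::size_t head, dependent;  // indices into the phrase list
        std::string relation;         // e.g., "subj", "obj", "pp-attach"
    };

    int main() {
        // Phrase-level analysis of "the valve controls the flow of fluid".
        std::vector<Phrase> phrases = {
            {"NP", "the valve"}, {"VP", "controls"},
            {"NP", "the flow"}, {"PP", "of fluid"}};
        // Dependency links among phrase heads resolve, e.g., PP attachment.
        std::vector<Dependency> links = {
            {1, 0, "subj"}, {1, 2, "obj"}, {2, 3, "pp-attach"}};
        for (const auto& d : links)
            std::cout << phrases[d.head].text << " -" << d.relation << "-> "
                      << phrases[d.dependent].text << '\n';
    }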
Although the correlation between sentence length and ambiguity is clear, the greater portion of ambiguities occurs in the treatment of the higher (clause) nodes of the syntactic tree; processing at the phrase level (NP, PP, etc.), by contrast, does not usually generate more ambiguity as a sentence becomes longer [1].
The specificity of our approach is that it does not rely on structural information above the level of a simple clause in the syntactic tree. Our strategy is to divide the analysis into two levels, a phrase level and a simple-clause level, and to use a generator in the NLU component to take over the load of detecting the hierarchy of individual clauses in a long sentence. In addition, the generation module is used to display the results of clause understanding to the user, so that these results can be interactively corrected, if necessary, before the final parse is sent to a subsequent module of a particular application. The generator is equipped with a natural language interface at the user end.
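The following self-contained sketch illustrates this two-level strategy; all function names and the trivial stub logic (splitting at semicolons) are assumptions for exposition, not the system's real grammars, which would run the phrase- and clause-level analyses at the marked point:

    // Illustrative C++ sketch of the two-level pipeline with an
    // interactive generation step; names and stubs are hypothetical.
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct Clause { std::string text; };  // stand-in for a full clause parse

    // Stub for levels 1 and 2: a real implementation would run the phrase
    // grammar and then the simple-clause grammar here.
    std::vector<Clause> parseClauses(const std::string& sentence) {
        std::vector<Clause> clauses;
        std::stringstream ss(sentence);
        std::string part;
        while (std::getline(ss, part, ';')) {
            // Trim the leading space left by the split.
            if (!part.empty() && part.front() == ' ') part.erase(0, 1);
            clauses.push_back({part});
        }
        return clauses;
    }

    // Generator stub: render a clause parse back into natural language.
    std::string generateText(const Clause& c) { return c.text; }

    int main() {
        std::string claim = "A device comprising a housing; "
                            "wherein the housing encloses a sensor";
        for (const Clause& c : parseClauses(claim)) {
            // Display the generator's rendering; in the interactive system
            // the user could correct the clause parse at this point, before
            // the final parse is sent to the application module.
            std::cout << "Clause: " << generateText(c) << '\n';
        }
    }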
The NLU component was initially modeled for patent claims in English and Russian, but it is readily extensible to other languages; Danish and Swedish are now being added to the system. For ease of exposition, all examples in this paper are in English. The NLU component is implemented in C++.
2 Related Work
One known strategy for parsing long sentences is to automatically limit the number of words. Very long sentences are broken up at punctuation and at certain words (such as “which”) and parsed separately, as, for example, in many MT systems [2]. This method can lead to errors in the subsequent analysis, since segmentation can be incorrect due to punctuation ambiguity. The problem is sometimes approached by being very selective about which sentences to parse: to find the most relevant sentences, a statistical relevance filter is built that returns a relevance score for each sentence in a message [3]. In pattern-based English-Korean MT, chunking information is used at the phrase level of the parse result, to which a sentence pattern is applied directly, thus overcoming many problems in sentence segmentation [4].
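The first of these strategies, segmentation at punctuation and cue words, can be illustrated by the following minimal sketch; the cue-word list and splitting rules are our own assumptions and are not taken from any of the cited systems:

    // Illustrative C++ sketch of sentence segmentation at punctuation
    // and cue words such as "which"; rules are hypothetical.
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    std::vector<std::string> segment(const std::string& sentence) {
        std::vector<std::string> segments;
        std::string current, token;
        std::istringstream in(sentence);
        while (in >> token) {
            // Start a new segment before the cue word "which",
            // and close the current segment after a comma.
            bool breakBefore = (token == "which");
            bool breakAfter = !token.empty() && token.back() == ',';
            if (breakBefore && !current.empty()) {
                segments.push_back(current);
                current.clear();
            }
            current += (current.empty() ? "" : " ") + token;
            if (breakAfter) {
                segments.push_back(current);
                current.clear();
            }
        }
        if (!current.empty()) segments.push_back(current);
        return segments;
    }

    int main() {
        for (const auto& s :
             segment("The apparatus includes a valve, which controls the flow"))
            std::cout << "Segment: " << s << '\n';
    }

Each segment would then be parsed separately, which is exactly where the punctuation ambiguity mentioned above can mislead the subsequent analysis.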
A number of successful statistical parsers [5] interleave all disambiguation tasks and demonstrate good performance. It is seldom possible, however, to incorporate such parsers directly into a particular NLP application, due to the unavailability of tagged corpora for parser retraining. There are parsers that use a constraint-based integration of deep and shallow parsing techniques, e.g., a combination of a shallow tagger or chunker and a deep syntactic parser [6]. There is also a trend to merge different stages