NLU METHODOLOGIES FOR CAPTURING NON-REDUNDANT
INFORMATION FROM MULTI-DOCUMENTS
A Survey
Michael T. Mills and Nikolaos G. Bourbakis
College of Engineering, ATRC, Wright State University, Dayton, Ohio 45435, U.S.A.
Keywords: Natural language understanding, Natural language processing, Document analysis.
Abstract: This paper provides a comparative survey of natural language understanding (NLU) methodologies for
capturing non-redundant information from multiple documents. The scope of these methodologies is to
generate a text output with reduced information redundancy and increased information coverage. The
purpose of this paper is to inform the reader of the methodologies that exist and of their features, based on evaluation criteria selected by users. Comparison tables at the end of this survey provide a quick glance at these technical attribute indicators, abstracted from the information available in the publications.
1 INTRODUCTION
Over the past several years, information has become
so vast that professionals, such as medical doctors,
have difficulty keeping up to date within their
respective fields. Time is wasted reading redundant
information from various documents. Needed
information may also be lost in the process of
summarization. Advanced methods of search,
database technologies, data mining, and other areas
have helped, but not enough to meet the growing
need from these professionals.
For the past 40 years, researchers have made advances in automatically or semi-automatically capturing information from single and multiple documents into less redundant text, typically in the form of
summaries. Several methodologies have been
developed to advance the area of natural language
processing in order to find solutions to this problem.
However, no known methodology appears to capture
the needed information and generate text with
enough quality and speed to satisfy this need. Thus, this survey summarizes current methodologies that deal with removing redundancy from documents retrieved from different sources. The
purpose is to document the progress in natural
language understanding research and how it can be
applied to capturing concepts from multi-documents
and producing non-redundant text while attempting
to maximize coverage of the significant information
needed by the user.
The methodologies under evaluation in this paper
cover the following areas: (1) detection of important
sentences, (2) concept extraction from text, (3)
building concept graphs, (4) attribute and relation
structures leading toward knowledge discovery from
text, (5) increasing efficiency in the processes
leading to concept representations, (6) generation of
non-redundant text summaries, and (7) maximizing
the readability (or coherence) of automatically
generated or extracted text.
2 METHODS AND FEATURES
In this section we present a variety of methodologies classified according to their features. In particular, this section covers the following groups: text relationship map with latent semantic analysis, extraction methods for text summarization, cluster summarization, formulated semantic relations, SPN representation for document understanding, concept representations for text, learning ontologies from text, synthesis of documents, generation of semantically meaningful text using logical order, text generation methods, document structural understanding, and other relevant methods. The methods presented here will be compared and evaluated based on their maturity. The overall results are presented in Section 3.
2.1 Text Relationship Map with Latent
Semantic Analysis (LSA)
Yeh et al. present two methodologies, text relationship map (TRM) and latent semantic analysis (LSA), used together for text summarization. TRM uses feature weights to create similarity links between sentences, forming a text relationship map (Yeh et al., 2008a).
Advantages: This methodology captures various
features that help in calculating the similarity of
sentences throughout one or more documents. The
paper gives significant detail about the methodology.
Disadvantages: This methodology operates at the word level.
Yeh et al.'s LSA-based text relationship map (TRM) approach derives semantically salient
structures from a document. Latent semantic
analysis (LSA) is used for extracting and inferring
relations of words with their expected context (Yeh
et al., 2008b).
Advantages: The paper gives significant detail
about the methodology. Several features are used in
the similarity calculation.
Disadvantages: This methodology also operates at the word level. The LSA approach uses a word-sentence matrix that can become very large due to the number of words in a document or document collection.
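For illustration, a minimal sketch of the LSA step is given below (Python with numpy; the function, data, and names are our own and not from Yeh et al.): a word-sentence count matrix is factored with a truncated SVD, and sentence similarity is then computed in the reduced latent space.

    # Minimal LSA sketch: word-sentence matrix -> truncated SVD -> similarity.
    # Illustrative only; not the cited implementation.
    import numpy as np

    def lsa_sentence_similarity(sentences, k=2):
        # Build the vocabulary and the word-sentence count matrix A (|V| x |S|).
        vocab = sorted({w for s in sentences for w in s.lower().split()})
        index = {w: i for i, w in enumerate(vocab)}
        A = np.zeros((len(vocab), len(sentences)))
        for j, s in enumerate(sentences):
            for w in s.lower().split():
                A[index[w], j] += 1
        # Truncated SVD keeps the k strongest latent dimensions.
        U, S, Vt = np.linalg.svd(A, full_matrices=False)
        sent_vecs = (np.diag(S[:k]) @ Vt[:k]).T   # one k-dim vector per sentence
        # Cosine similarity between every pair of sentences.
        norms = np.linalg.norm(sent_vecs, axis=1, keepdims=True)
        unit = sent_vecs / np.maximum(norms, 1e-12)
        return unit @ unit.T

    sims = lsa_sentence_similarity([
        "the cat sat on the mat",
        "a cat lay on a rug",
        "stock prices fell sharply today",
    ])
    print(np.round(sims, 2))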
2.2 Extraction Methods for Text
Summarization
Ko and Seo present a hybrid sentence extraction method that augments mainline statistical approaches with contextual information to find important sentences in documents. This model
combines two consecutive sentences into a bi-gram
pseudo sentence representation to overcome feature
sparseness (Ko and Seo, 2008).
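For intuition, the pairing step can be sketched as follows (an illustrative helper of our own, not Ko and Seo's code):

    # Combine consecutive sentences into bi-gram pseudo sentences so that
    # sparse per-sentence features are computed over richer units.
    def bigram_pseudo_sentences(sentences):
        return [sentences[i] + " " + sentences[i + 1]
                for i in range(len(sentences) - 1)]

    doc = ["Alice went home.", "She was tired.", "The sky darkened."]
    for p in bigram_pseudo_sentences(doc):
        print(p)
    # -> "Alice went home. She was tired."
    #    "She was tired. The sky darkened."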
Advantages: Test results showed that the hybrid sentence extraction approach outperformed the other approaches listed, by a small percentage.
Disadvantages: What the authors (of the hybrid
approach) call context information is limited to two
consecutive (i.e., adjacent) sentences with no
apparent global context capability. Generally,
context implies more extensive surrounding
information than groups of two adjacent sentences.
2.3 Cluster-based Summarization
Moens et al. extract important sentences and detect redundant content across sentences. Their method uses generic
linguistic resources and statistical techniques to
detect important content from topics and patterns of
themes throughout text (Moens et al., 2005).
Advantages: Moens et al.'s methodology provides a
significant capability in automatically finding
content from text and representing it by hierarchical
topics and subtopics. This provides flexibility in
selecting how much detail goes into the summary.
In competitive testing at DUC 2002 and 2003, the methodology produced good results, even when compared with trained methodologies.
Disadvantages: Topic trees and themes are the main
information sources to be captured using this
methodology. Although these contribute to forming a summary, more cues could be added to enhance the accuracy of this approach. The authors discuss
several improvements that could be made. This
system incorporates several technologies to provide
flexibility. It appears that system integration could
be improved to make this a better product.
Radev et al. present a cluster centroid-based summarization technique, called MEAD, evaluated using topic detection and tracking. The methodology measures how many times a word appears in a document and what percentage of all documents in a collection contain a given word. A centroid is a set of words that are statistically important to a cluster of documents and is used to identify important (or salient) sentences in the cluster (Radev et al., 2004).
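The scoring idea can be sketched in simplified form (our own illustration with naive tokenization, not the MEAD implementation): centroid weights are cluster term frequencies scaled by inverse document frequency, and each sentence is scored by the centroid weight it contains.

    # Simplified centroid scoring in the spirit of MEAD (illustrative only).
    import math
    from collections import Counter

    def centroid_scores(documents):
        n_docs = len(documents)
        tokenized = [[w.lower() for w in d.split()] for d in documents]
        cluster_tf = Counter(w for doc in tokenized for w in doc)
        doc_freq = Counter(w for doc in tokenized for w in set(doc))
        # Centroid weight: frequency in the cluster x inverse document frequency.
        centroid = {w: tf * math.log(n_docs / doc_freq[w])
                    for w, tf in cluster_tf.items()}
        scores = []
        for doc in documents:
            for sent in doc.split(". "):
                s = sum(centroid.get(w.lower(), 0.0) for w in sent.split())
                scores.append((s, sent))
        return sorted(scores, reverse=True)   # most salient sentences first

    docs = ["The court adjourned. The judge left early.",
            "The judge read the verdict. The court adjourned."]
    print(centroid_scores(docs)[0])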
Advantages: The authors state that the MEAD
algorithms produced summaries similar in quality to
summaries produced by humans for the same
documents.
Disadvantages: Additional factors could be
addressed to help provide higher quality output.
Scores determined by using this methodology are
limited to word frequency, position, and sentence
overlap. More factors could be added to improve
redundancy removal of the resulting summary
output.
2.4 Chaining Lexically to Formulate
Semantic Relations
Silber and McCoy propose an algorithm to improve
the execution time and space complexity of creating
lexical chains from exponential to linear in order to
make computation feasible for large documents.
Lexical chains are created as an intermediate
representation to extract the most important concepts
from text to be used for generating a summary. An
implementation of Lexical chains is evaluated as an
efficient intermediate representative format. Silber
and McCoy implicitly store every interpretation of
source documents without creating each
interpretation as a lexical chain, thus reducing the
vast number of lexical chains from multiple word
senses per noun instance (Silber and McCoy, 2002).
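A toy sketch of greedy lexical chaining follows (our own simplification with a stub relatedness table; the linear-time bookkeeping over WordNet senses that is Silber and McCoy's contribution is not reproduced here):

    # Toy greedy lexical chainer (illustrative only). RELATED stands in for a
    # WordNet-style relatedness test between noun senses.
    RELATED = {("car", "vehicle"), ("vehicle", "truck"), ("car", "truck")}

    def related(a, b):
        return a == b or (a, b) in RELATED or (b, a) in RELATED

    def lexical_chains(nouns):
        chains = []
        for noun in nouns:
            for chain in chains:
                if any(related(noun, member) for member in chain):
                    chain.append(noun)   # attach to the first related chain
                    break
            else:
                chains.append([noun])    # otherwise start a new chain
        return chains

    print(lexical_chains(["car", "tree", "vehicle", "truck", "forest"]))
    # -> [['car', 'vehicle', 'truck'], ['tree'], ['forest']]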
Advantages: Silber and McCoy's algorithm provides linear-time computation of lexical chains, a major improvement over the exponential-time implementations they reference from 1997 and earlier.
Disadvantages: Their focus is on the efficiency of one part of the entire process; some issues are left for future work.
Manabu and Hajime perform lexical chaining based on a topic submitted by a user. Lexical chains are sequences of words related to each other that form a semantic unit. This procedure increases the coherence and readability of the resulting summaries, which yields improved accuracy and relevance for the user. (The objective of increasing the coherence and readability of a generated summary is similar to that of Barzilay and Lapata, but here the lexical chaining methodology is applied.) The methodology constructs lexical
chains, calculates scores of the chains based on high
connectivity with other sentences, and constructs
clusters of words using the similarity score (Manabu
& Hajime, 2000).
Advantages: This methodology provides a higher-level calculation of semantic similarity and offers a potential increase in accuracy.
Disadvantages: Results showed improved accuracy but leave open the possibility that other useful information is ignored. Further improvements are needed.
Reeve et al. propose to use lexical chaining for
concept chaining (distinguished from term chaining)
to identify candidate sentences for extraction for use
in generating biomedical summaries. This concept
chaining process consists of text to concept
mapping, concept chaining, identifying strong
chains, identifying frequent concepts and
summarizing. The resulting sentences are used to
generate the summary (Reeve et al., 2006).
Advantages: Test results (90% precision and 92% recall) are high compared to the results of other lexical chaining methodologies in this survey.
Disadvantages: Concept disambiguation is not implemented but is planned for future work. Complexity appears not to be addressed. The internal evaluation addressed only the quality of the generated summary.
2.5 Stochastic Petri-net (SPN)
Representations
Bourbakis and Manaris presented an SPN-based methodology for document understanding. They describe four levels of processing: lexical, to enforce case (subject-verb) agreement; syntactic, to combine words into sentences; semantic, to assign meaning to words and sentences; and pragmatic, to form context from relations to previous sentences, paragraphs, topics, and information from related data (Bourbakis and Manaris, 1998).
Advantages: The combination of augmented
semantic grammars (ASGs) and SPNs in this
methodology provides significant capability in not
only capturing semantic meaning from text but
extracting contextual and other available information
to resolve ambiguities. The methodology suggested
in this paper shows how SPNs, used with ASGs, can
model a tremendous amount of interrelationships
that exist in both text and imagery. It provides significant potential for extended areas such as knowledge abstraction and representation.
Disadvantages: SPNs have existed for a long time.
However, the methodology presented in this paper
illustrates the potential for SPNs to model
technologies in ways that significantly enhance their
modeling capabilities compared to conventional
(main line) approaches in using SPNs.
2.6 Building Concept Representations
from Text
Ye et al. propose a concept lattice to represent text understanding, extract content from multiple documents, and generate an optimized summary. The concept lattice provides indexing of local topics within a hierarchy of topics (Ye et al., 2007).
Advantages: The document concept lattice approach provides an efficient way to account for all possible word senses without calculating them all online. This provides a significant improvement in accuracy without the computational complexity.
According to the authors, the approach reduces complexity from O(n²) to O(n), i.e., linear.
Disadvantages: WordNet is required for this approach. New tools adopting this approach may be restricted to using WordNet, depending on any implementation-dependent concerns.
Guo and Stylios investigate event indexing by
applying cognitive psychology to create clusters for
building concept representations from text. Their
methodology extracts the most prominent content by
lexical analysis at phrase and clause levels in
multiple documents (Guo and Stylios, 2005).
Advantages: Working at the phrase or clause level is an advantage over the word level; for example, it reduces the number of possible (phrase, sentence) pairs compared to (word, sentence) pairs. Multi-document capability is another plus for the user. Features such as actors, time/space displacements, causal chains, and intention chains add significantly more capability for detecting sentence similarities. Reducing all this potentially multi-dimensional vector data to two-dimensional index clustering is a significant savings in complexity, especially storage complexity.
Disadvantages: Dimension reduction can
sometimes hide important vector component data.
Cimiano et al. form concept hierarchies using formal concept analysis (FCA) through unsupervised learning. Their methodology automatically acquires (through learning) concept hierarchies from collections of text (corpora) (Cimiano et al., 2005).
Advantages: The automatic (unsupervised) learning approach is a big plus, reducing the traditional manual work to near zero. The concept similarity calculation uses more characteristics, which can result in greater accuracy of the output text. The authors state that "this is a first time approach." Similarity calculations are made at the concept and semantic level, using LSA.
Disadvantages: The approach appears to be tied to the LoPar parser implementation, but the benefits in capability are significant.
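For intuition, formal concepts can be derived from a tiny object-attribute context by naive enumeration, as in the sketch below (our own toy; Cimiano et al. build the context from parsed verb-argument relations and use a far more efficient algorithm):

    # Naive FCA sketch: a formal concept is a pair (extent, intent) where the
    # extent is exactly the set of objects sharing the intent's attributes.
    from itertools import chain, combinations

    context = {                      # object -> attributes (toy context)
        "hotel":     {"bookable", "has_rooms"},
        "apartment": {"bookable", "has_rooms", "rentable"},
        "car":       {"bookable", "rentable"},
    }

    def common_attrs(objs):
        sets = [context[o] for o in objs]
        return set.intersection(*sets) if sets else set()

    def objects_with(attrs):
        return {o for o, a in context.items() if attrs <= a}

    concepts = set()
    objs = list(context)
    for subset in chain.from_iterable(combinations(objs, r)
                                      for r in range(len(objs) + 1)):
        attrs = common_attrs(list(subset))
        extent = objects_with(attrs)
        intent = common_attrs(extent)       # close both sides of the pair
        concepts.add((frozenset(extent), frozenset(intent)))

    for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
        print(sorted(extent), "<->", sorted(intent))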
2.7 Learning Ontology from Text
Bendaould et al. use relational concept analysis (RCA) to formulate concepts for text-based ontology construction. This paper presents a semi-automatic methodology that builds an ontology from a set of terms extracted from resources consisting of text corpora, a thesaurus for a particular domain, and syntactic patterns representing a set of objects (Bendaould et al., no year given).
Advantages: This is a very methodical treatment of higher-level concept representation. The methodology is aimed more at building ontologies and less at capturing information from text, but it has significant capability.
Disadvantages: Based on the methodology description, the computation could have high complexity.
Valakos et al. use machine learning to build and maintain concept representations, called an allergens ontology. Building the ontology includes selecting concepts, specifying their attributes and relations (between concepts), and filling (populating) their properties with instances (Valakos et al., 2006).
Advantages: The authors' machine learning approach provides a way to capture new knowledge in the form of concepts, attributes, properties, and relations, and to maintain (or update) that knowledge against what has already been established. The approach includes lexical-to-semantic relations that transform lexical information into semantic information, which is a contribution toward validating concepts.
Disadvantages: Details about the extraction of the information used to form the concepts are not presented. The approach is specific to maintaining an ontology within the medical (allergen) domain, but its general principles could be applied to other applications.
Zhou and Su use machine learning to integrate internal (within-word) and external (contextual) evidence for named entity recognition. This method extracts and classifies text
elements into predefined categories of information
(Zhou and Su, 2005).
Advantages: This named entity recognition
approach provides significant and useful detail that
could be applied to information extraction from text.
Machine learning is applied to recognizing named
entities and is used with constraint recognition,
Hidden Markov Models to determine tags, and
mutual information to increase coverage of non-
redundant information.
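As a reference point for the HMM tagging component mentioned above, a tiny Viterbi decoder is sketched below (an illustrative toy with made-up probabilities, not Zhou and Su's model, which integrates many evidence sources):

    # Tiny Viterbi sketch for HMM tag decoding (illustrative only).
    def viterbi(obs, states, start, trans, emit):
        V = [{s: start[s] * emit[s].get(obs[0], 1e-9) for s in states}]
        back = []
        for o in obs[1:]:
            col, ptr = {}, {}
            for s in states:
                prob, prev = max((V[-1][p] * trans[p][s] * emit[s].get(o, 1e-9), p)
                                 for p in states)
                col[s], ptr[s] = prob, prev
            V.append(col)
            back.append(ptr)
        last = max(V[-1], key=V[-1].get)     # best final state
        path = [last]
        for ptr in reversed(back):           # trace back the best path
            path.append(ptr[path[-1]])
        return list(reversed(path))

    states = ["PER", "O"]
    start = {"PER": 0.3, "O": 0.7}
    trans = {"PER": {"PER": 0.5, "O": 0.5}, "O": {"PER": 0.2, "O": 0.8}}
    emit = {"PER": {"john": 0.8}, "O": {"met": 0.5, "mary": 0.1}}
    print(viterbi(["john", "met", "mary"], states, start, trans, emit))
    # -> ['PER', 'O', 'O']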
Disadvantages: This concept provides significant
capabilities on the theoretical level but appears to
need further development before product
information with metrics is available.
Shunsfard and Barforoush propose an automatic
ontology building approach, starting with a small
ontology kernel, and apply text understanding to construct the ontology. Their model can handle multiple viewpoints, is flexible to domain changes, and can build an ontology from scratch without a large knowledge base (Shunsfard and Barforoush, 2004).
Advantages: This system can create an ontology from scratch by learning from text, which significantly reduces the manual interaction needed to create and build ontologies. This methodology is based on an
integration of learning, clustering and splitting of
concepts, similarity measures, and several other
techniques that, together, form a unique capability
that shows promise.
Disadvantages: The current implementation and
testing has been limited to Persian text, but the
authors plan to expand the system to other
languages.
Hahn and Marko form concepts from text through
machine learning of both grammars and ontologies
and use evidence, or background knowledge, to steer
refinement of generated text. This methodology is an
integrated approach for learning lexical (syntactic)
and conceptual knowledge as it is applied to natural
text understanding (Hahn and Marko, 2002).
Advantages: Evidence within both lexical and conceptual hypotheses is used to bound the hypothesis search space to a manageable size. This refines the lexical and conceptual quality, thus increasing the accuracy of text understanding.
Disadvantages: The complexity of the approach can be high but remains tractable.
Loh et al. provide a text mining approach that forms concepts from phrases and analyzes their distributions throughout a document. The approach combines
categorization to identify concepts within text and
mining to discover patterns by analyzing and
relating concept distributions in a collection (Loh et
al., 2003).
Advantages: This approach captures concepts from
phrases, finds patterns from concept distributions,
and discovers themes within a document by
collecting concepts and generating centroids to
represent the collections. Together, these features
contribute to a knowledge discovery technique.
Disadvantages: This approach was developed for
decision support systems and may have some
features dedicated to that application.
Rajaraman and Tan construct a conceptual knowledge base, called a concept frame graph, for mining concepts from text. A learning algorithm, guided by the user via supervised learning, constructs the concept map (Rajaraman and Tan, 2002).
Advantages: The approach captures conceptual
knowledge from text by constructing a concept map
to produce a knowledge base. This provides a high
level representation including concepts, relations to
other concepts, and relations to synonyms. Such representations can be used to reduce redundancy at the higher, concept level. A clustering algorithm discovers word senses to reduce ambiguity.
Disadvantages: The supervised learning in this
approach may not be useful for applications
requiring automatic (unsupervised) learning. The word sense disambiguation depends on a WordNet tool, which may introduce some implementation dependency into the approach.
Pado and Lapata propose a general framework for
semantic models that determines context in terms of
semantic relations. Their algorithm constructs
semantic space models from text annotated with
syntactic dependency relations to provide a
representation that contains significant linguistic
information (Pado and Lapata, 2007).
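A skeletal illustration follows (our own sketch; Pado and Lapata's framework is far more general, with parameterized paths and weighting). It assumes pre-parsed (head, relation, dependent) triples and accumulates dependency-linked co-occurrence vectors:

    # Skeletal dependency-based semantic space: co-occurrence counts are
    # taken over dependency links rather than a flat word window.
    from collections import Counter, defaultdict

    # Assumed input: (head, relation, dependent) triples per parsed sentence.
    parsed = [
        [("eat", "subj", "dog"), ("eat", "obj", "bone")],
        [("chew", "subj", "dog"), ("chew", "obj", "bone")],
    ]

    space = defaultdict(Counter)
    for sentence in parsed:
        for head, rel, dep in sentence:
            space[dep][(rel, head)] += 1          # dependent sees its head
            space[head][(rel + "^-1", dep)] += 1  # and the inverse relation

    print(dict(space["dog"]))
    # -> {('subj', 'eat'): 1, ('subj', 'chew'): 1}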
Advantages: This methodology operates at the semantic level, finds context in terms of semantic relations, and captures significant linguistic information. The authors state that their model provides linear runtime performance. A GNU website is provided for a Java implementation of the general framework for semantic models.
Disadvantages: This proposed methodology will
need time to mature after implementation.
Maedche and Staab present a generic architecture for ontology learning consisting of four components:
ontology management (browse, validate, modify,
version, evolve), resource processing (discover,
import, analyze, transform input data), algorithm
library, and coordination (interaction with ontology
learning components for resource sharing and
algorithm library access) (Maedche and Staab,
2004).
Advantages: The methodology finds semantic patterns, structures, and concept pairs.
Disadvantages: As a new methodology, it will
require time to mature into a product.
Dahab et al. discuss a methodology for constructing an ontology from natural domain text using a semantic pattern-based approach. Their "TextOntoEx" tool
extracts candidate relations from text and maps them
to meaning representations to help construct an
ontology representation (Dahab et al., 2008).
Advantages: The tool provides semantic pattern formats for converting paragraphs into meaning representations.
Disadvantages: Manual editing is required for the
library of semantic patterns.
2.8 Redundancy Synthesis
Bourbakis et al. present a methodology for retrieving multimedia web documents and removing redundant information from text and images (Bourbakis et al., 1999).
Advantages: Out of the papers surveyed, this is the
only methodology that provides an integrated
similarity detection and redundancy removal of both
paragraphs of text and corresponding images. This
approach is also integrated with the authors’
developed query language, which includes Web page (text) and image similarity criteria, to yield more definitive returns closer to the user's intended query.
Disadvantages: Since the time of the article, other authors have created new features for similarity detection, and more text reduction should be possible with some of these newer features. Counts and histograms of text components can detect paragraph similarities only up to a certain point. By using approaches such as this as a baseline, future developments in capturing the meaning of multiple documents should advance similarity detection, resulting in less text redundancy in the synthesized document.
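For intuition, the counts-and-histograms comparison mentioned above might look like the following sketch (our own illustration, not the authors' implementation):

    # Illustrative paragraph similarity: compare normalized word-count
    # histograms with histogram intersection (1.0 = identical distributions).
    from collections import Counter

    def histogram_similarity(para_a, para_b):
        ha, hb = Counter(para_a.lower().split()), Counter(para_b.lower().split())
        na, nb = sum(ha.values()), sum(hb.values())
        return sum(min(ha[w] / na, hb[w] / nb) for w in set(ha) | set(hb))

    print(histogram_similarity("the court adjourned at noon",
                               "the court adjourned early"))   # -> 0.6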
Yang and Wang apply the hierarchical and redundancy-sharing characteristics of fractal theory to increase the performance of text summarization compared to non-hierarchical approaches (Yang and Wang, 2008).
Advantages: This hierarchical approach to
summarization provides multiple levels of
abstraction and takes advantage of fractal theory
capabilities in representing multiple levels of
hierarchy.
Disadvantages: More salient features could be
added to make this approach more accurate.
Hilberg proposes an approach to produce and store higher levels of abstraction that represent sequences of words and sentences in the higher (hidden) levels of a neural net (Hilberg, 1997).
Advantages: This proposal has some unique
possibilities for representing abstraction and
possibly extending it to paragraphs and documents.
Disadvantages: Getting this to work at a large enough scale (such as large or multiple documents) may be challenging. Learning a representative corpus of text may be too computationally hard to scale beyond the prototype stage.
2.9 Generating Semantically
Meaningful Text through
Coherence and Logical Order
Barzilay and Lapata, by representing and measuring local coherence, provide a framework to increase the readability and semantic meaning of automatically generated sentences, such as a summary of multiple documents. The goal is to order sentences in a way that maximizes local coherence (Barzilay and Lapata, 2008).
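A condensed sketch of the entity-grid idea follows (our own reduction of Barzilay and Lapata's model: entities are matched by surface token and syntactic roles are collapsed to present 'X' / absent '-'):

    # Condensed entity grid: sentences are rows, entities are columns, and
    # local coherence is read off the distribution of column transitions.
    from collections import Counter

    def entity_grid(sentences, entities):
        return [["X" if e in s.lower().split() else "-" for e in entities]
                for s in sentences]

    def transition_profile(grid):
        transitions = Counter()
        for row, nxt in zip(grid, grid[1:]):
            for a, b in zip(row, nxt):
                transitions[a + b] += 1
        total = sum(transitions.values())
        return {t: c / total for t, c in transitions.items()}

    sents = ["the judge entered", "the judge read the verdict", "silence fell"]
    print(transition_profile(entity_grid(sents, ["judge", "verdict"])))
    # -> {'XX': 0.25, '-X': 0.25, 'X-': 0.5}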
Advantages: This methodology provides a needed
capability to make generated text more coherent and
readable. This entity distribution approach provides
significant improvement in sentence meaning
representation which can result in improved,
automatically generated text. Results of testing
showed increased accuracy.
Disadvantages: New approaches like this will need
time to mature, but the benefits should be
significant.
Stein et al. provide a methodology that clusters documents, uses extraction to find main topics, and organizes the resulting information for a logical presentation of a summary of multiple documents. This interactive approach focuses on summarizing news documents (reducing text to 15%) (Stein et al., 2000).
Advantages: This methodology both summarizes
multi-document text and is designed to provide a
smooth flow of the summary to the reader. It clusters
single document representative summaries with
similar topics to reduce redundancy. It orders the generated summary for multiple documents based on paragraph similarity to minimize the abruptness of topic changes from paragraph to paragraph. The result is improved readability.
Disadvantages: The multi-document summarizer currently uses simple similarity scoring approaches, but the authors plan to replace them with better-performing ones.
Nomoto and Matsumoto provide a method to exploit
diversity of concepts in text in order to evaluate
information based on how well source documents
are represented in automatically generated
summaries (Nomoto and Matsumoto, 2003).
Advantages: This approach provides an
improvement in clustering on the information level.
The paper provides a detailed analysis of this approach versus other traditional approaches and reports favorable test results, including a favorable comparison with human summarization.
Disadvantages: Disadvantages have yet to be found.
The authors present this approach as novel, at least
in the 2003 timeframe.
Marco et al. improve the reading order of automatically generated text. The approach is
implemented in a system and is designed to analyze
heterogeneous documents (Marco et al., 2002).
Advantages: This approach is implemented in a
system that captures the physical and logical layout
of generic documents.
Disadvantages: Most of the discussion focuses on the physical portions of a document, and the reading order considers large chunks of what is on a page. The approach applies more to the big (mostly physical) view of a document and little to the actual knowledge or understanding level.
2.10 Text Generation Methodologies
Dalianis uses aggregation to eliminate redundant text in documents before they are paraphrased (generated) into natural language. This methodology provides aggregation at the syntax level (Dalianis, 1999).
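One such aggregation rule, subject grouping, can be sketched as follows (our own toy illustration, far simpler than Dalianis's rule set):

    # Toy syntactic aggregation: predicates sharing a subject are merged so
    # the subject is realized only once.
    def aggregate(facts):
        grouped = {}                     # subject -> list of predicates
        for subject, predicate in facts:
            grouped.setdefault(subject, []).append(predicate)
        sentences = []
        for subject, preds in grouped.items():
            if len(preds) == 1:
                sentences.append(f"{subject} {preds[0]}.")
            else:
                sentences.append(
                    f"{subject} {', '.join(preds[:-1])} and {preds[-1]}.")
        return " ".join(sentences)

    print(aggregate([("the house", "is white"),
                     ("the house", "has a door"),
                     ("the door", "is open")]))
    # -> "the house is white and has a door. the door is open."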
Advantages: This approach provides four types of aggregation with rules, which should provide more information for generating significantly less redundant summaries.
Disadvantages: An update to this paper could provide a more accurate indicator of the state of this methodology.
2.11 Document Processing
& Understanding
Aiello et al. present a methodology to capture the structural layout and logical order of text blocks within several documents and represent this information in connected graphs (Aiello et al., 2002).
Advantages: This document-level methodology captures the physical layout of partitioned text blocks spanning multiple documents with a complexity of O(n⁴).
Disadvantages: This provides only top-level information about a set of documents. Unless it is used in conjunction with other methodologies discussed in this survey, the information provided does not include information from within text blocks, a needed feature addressed by other methodologies.
2.12 Other Relevant Methodologies
Feldman et al. describe a natural language processing (NLP) system, called the LitMiner system, that uses semantic analysis to mine biomedical literature (Feldman et al., 2003).
Advantages: Although this paper addresses the biomedical domain, it is quite useful in laying out the various steps and different methodologies for text mining, and in describing in detail the specific system with good evaluation results for this type of system.
Disadvantages: Several of the key elements are interlinked with the biomedical domain. However, several of the methodologies presented appear to be applicable to other domains (different databases and tools would be needed). The system described requires some pre-processing and is a semi-automatic process with a visualization system.
Neustein uses sequence analysis to improve natural language understanding of conversations. A goal of this analysis of sequence packages (or frames) of speech is to uncover important information that might otherwise go unnoticed (Neustein, 2001).
Advantages: The proposed sequence analysis would
address context dependency in natural language,
especially in speech context. Success in this kind of
analysis should provide benefits toward reducing
ambiguity in natural language processing and
understanding.
Disadvantages: This discussion is essentially a proposed approach to a difficult problem area and did not appear to have been implemented at the time the paper was written; few details of the approach were presented.
Table 1: Comparing Key Capabilities/Approaches in Survey.

Capability definitions:
C1 Topics; C2 Concepts; C3 Relations; C4 Semantic; C5 Hierarchical; C6 Context; C7 Aggregation; C8 Overlap; C9 Clusters; C10 Query; C11 Chains (Lexical, Semantic, Concept); C12 Hierarchical; C13 Learning; C14 Detect Themes; C15 Answer Evaluation; C16 Statistical; C17 Word Sense; C18 Large Document; C19 Multi-Document.

The table has one column per capability (C1-C19) and one row per surveyed methodology, by first author: Aiello, Barzilay, Bendaould, Bourbakis1, Bourbakis2, Cimiano, Dahab, Dalianis, Feldman, Guo, Hahn-1, Hilberg, Ko, Liddy, Loh, Manabu, Marco, Maedche, Moens, Neustein, Nomoto, Pado, Radev, Rajaraman, Reeve, Shunsfard, Silber, Stein, Valakos, Yang, Ye, Yeh, Zhou, plus a Union All row. (The individual capability marks of the original table are not recoverable from this text.)
3 COMPARATIVE TABLE
OF METHODOLOGIES
AND APPROACHES
Table 1 captures some of the main features and approaches in a global comparison of the papers throughout the survey. The intent of this comparison is to provide a collective picture of the main capabilities that exist across the papers in this survey.
The table shows the capabilities of the various approaches. Intuitively, as more pertinent information is captured, higher quality (minimal redundancy and maximum information coverage) should result. However, most of the performance qualities are not addressed. This may be due to the overall maturity of the technical area, which is currently striving for accuracy as measured in the Document Understanding Conferences (DUC) that some of the authors reference. Performance time characteristics, other than computational complexity, appear to be a future effort.
4 CONCLUSIONS
This survey revealed very little commonality among the methodologies that were found; however, the methodologies could be categorized under some general headings. The papers covered in the survey did not include enough maturity information to support comparison. A resulting conclusion is that this area of natural language processing has not matured enough to provide this kind of product information.
Methodologies that were tested provided precision and recall results, and some included complexity; most were theoretical. According to a definition found on the Oracle web site, precision measures how well non-relevant information is screened out (not returned), and recall measures how well the information sought is found.
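In code form (a trivial sketch of these definitions):

    # Precision: fraction of returned items that are relevant.
    # Recall: fraction of relevant items that are returned.
    def precision_recall(returned, relevant):
        returned, relevant = set(returned), set(relevant)
        hits = returned & relevant
        return len(hits) / len(returned), len(hits) / len(relevant)

    p, r = precision_recall(returned={1, 2, 3, 4}, relevant={2, 3, 5})
    print(p, r)   # -> 0.5 0.666...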
A few of the most capable methodologies show promise in providing approximately optimal output: minimum redundancy with maximum information coverage. However, more research needs to be performed in natural language understanding before these methodologies can mature into high-volume commercial products. Normally, more capability to produce accurate text comes at a computational (time and space) complexity price, especially when heuristics are involved. Some of the concept-graph, chain, meta-chain, and hierarchical approaches provide impressive opportunities to compress and optimize the resulting text. Finding an efficient methodology that accomplishes all of this would be a significant step toward eventual technical maturity.
REFERENCES
Aiello, M., Monz, C., Todoran, L., Worring, M., 2002. Document understanding for a broad class of documents, International Journal on Document Analysis and Recognition.
Barzilay, R., Lapata, M., 2008. Modeling Local Coherence: An Entity-Based Approach, Association for Computational Linguistics, 34 pages.
Bendaould, R., Hacene, M. R., Toussaint, Y., Delecroix, B., Napoli, A., Text-based ontology construction using relational concept analysis (http://simbad.u-strasbg.fr/simbad/sim-fid).
Bourbakis, N., Manaris, B., 1998. An SPN-based Methodology for Document Understanding, IEEE International Conference on Tools with Artificial Intelligence, Taipei, Taiwan, pages 10-15.
Bourbakis, N., Meng, W., Zhang, C., Wu, Z., Salerno, N. J., Borek, S., 1999. Retrieval of Multimedia Web Documents and Removal of Redundant Information, International Journal on Artificial Intelligence Tools (IJAIT), Vol. 8, No. 1, pages 19-42, World Scientific Pubs.
Cimiano, P., Hotho, A. Staab, S., 2005. Learning Concept
Hierarchies from Text Corpora using Formal Concept
Analysis, Journal of Artificial Intelligence Research,
Vol. 24, pages 305-339.
Dahab, M. D., Hassan, H. A., Rafea, A., 2008.
TextOntoEx: Automatic ontology construction from
natural English text, Expert Systems with Applications,
Vol. 34, pages 1474-1480.
Dalianis, H., 1999. Aggregation in Natural Language Generation, Computational Intelligence, Vol. 15, No. 4, 31 pages.
Feldman, R., Regev, Y., Hurvitz, E., Finkelstein-Landau, M., 2003. Mining the biomedical literature using semantic analysis and natural language processing techniques, BIOSILICO, Vol. 1, No. 2, 12 pages.
Guo, Y., Stylios, G., 2005. An intelligent summarization system based on cognitive psychology, Information Sciences, Vol. 174, pages 1-36.
Hahn, U., Marko, K. G., 2002. An integrated, dual learner for grammars and ontologies, Data & Knowledge Engineering, Vol. 42, pages 273-291.
Hilberg, W., 1997. Neural networks in higher levels of abstraction, Biological Cybernetics, Vol. 76, pages 23-40.
Ko, Y., Seo, J., 2008. An effective sentence-extraction technique using contextual information and statistical approaches for text summarization, Pattern Recognition Letters, Vol. 29, pages 1366-1371.
Loh, S., De Oliveira, J., Gameiro, M., 2003. Knowledge Discovery in Texts for Constructing
Decision Support Systems, Applied Intelligence, 18,
pp. 357-366.
Manabu, O., Hajime, M., 2000. Query-Based Summarization Based On Lexical Chaining, Computational Intelligence, Vol. 16, No. 4, 8 pages.
Marco, A., Monz, C., Todoran, L., Worring, M., 2002. Document understanding for a broad class of documents, International Journal on Document Analysis and Recognition, Vol. 5, pages 1-16.
Maedche, A., Staab, S., 2004. Ontology Learning, Handbook on Ontologies, 18 pages.
Moens, M. F., Angheluta, R., Dumortier, J., 2005. Generic technologies for single- and multi-document summarization, Information Processing and Management, Vol. 41, pages 569-586.
Neustein, A., 2001. Using Sequence Package Analysis to
Improve Natural Language Understanding, Int.
Journal of Speech Technology, Vol. 4, pages 31-44.
Nomoto, T., Matsumoto, Y., 2003. The diversity-based approach to open-domain text summarization, Information Processing and Management, Vol. 39, pages 363-389.
Pado, S., Lapata, M., 2007. Dependency-Based Construction of Semantic Space Models, Association for Computational Linguistics, 40 pages.
Radev, D. R., Jing, H., Stys, M., Tam, D., 2004. Centroid-based summarization of multiple documents, Information Processing and Management, Vol. 40, pages 919-938.
Rajaraman, K., Tan, A.-H., 2002. Knowledge Discovery from Texts: A Concept Frame Graph Approach, CIKM 2002, 3 pages.
Reeve, L., Han, H., Brooks, A. D., 2006. BioChain: Lexical Chaining Methods for Biomedical Text Summarization, SAC 2006, ACM, 5 pages.
Shunsfard, M., Barforoush, A., 2004. Learning ontologies from natural language texts, International Journal of Human-Computer Studies, Vol. 60, pages 17-63.
Silber, H. G., McCoy, K., 2002. Efficiently Computed Lexical Chains as an Intermediate Representation for Automatic Text Summarization, Association for Computational Linguistics, 10 pages.
Stein, C. S., Strzalkowski, T., Wise, G. B., 2000. Interactive, Text-Based Summarization of Multiple Documents, Computational Intelligence, Vol. 16, No. 4, 8 pages.
Valakos, A. G., Karkaletsis, V., Alexopoulou, D., Papadimitriou, E., Spyropoulos, C. D., Vouros, G., 2006. Building an Allergens Ontology and Maintaining it using Machine Learning Techniques, Computers in Biology and Medicine Journal, 32 pages.
Yang, C. C., Wang, F. L., 2008. Hierarchical
Summarization of Large Documents, Journal of the
American Society for Information Science and
Technology, Vol. 59, Num. 6, pages 887-902.
Ye, Shiren, Chua, T-S, Kan, M-Y., Qiu, L., 2007.
Document concept lattice for text understanding and
summarization, Information Processing and
Management, Vol. 43, pages 1643-1662.
Yeh, J.-Y., Ke, H.-R., Yang, W.-P., Meng, I.-H., 2005. Text summarization using a trainable summarizer and latent semantic analysis, Information Processing and Management, Vol. 41, pages 75-95.
Zhou, G., Su, J., 2005. Machine learning-based named
entity recognition via effective integration of various
evidences, Natural Language Engineering, Vol. 11,
No. 2, pages 189-206.