relation generator to extract the possible relations for
each sentence; in a second step, the system maps each
relation into a feature vector and trains a Naïve Bayes
model. The Reverb system (Fader et al., 2011) is an evolution of these earlier works and is implemented as an extractor for verb-based relations that uses a logistic regression classifier trained with syntactic features. The extracted relations are filtered by two kinds of analysis, syntactic and lexical: the syntactic analysis requires the constituents of the relation to match a pre-defined set of POS tag patterns, while the lexical analysis filters out overly specific relations by looking at the frequency of their constituents over the input corpus.
Finally, the works (Rusu et al., 2007), (Atapattu Mudiyanselage et al., 2014) and (Ceran et al., 2012) propose a rule-based approach for the extraction of subject-verb-object triplets from unstructured texts: the parse tree of the input sentence is traversed; the subject is taken as the first noun found in the noun phrase, the verb is the deepest leaf found in the verb phrase, while the object is the first noun/adjective found in a phrase sibling of the verb phrase. Such an approach is very fast compared to its predecessors, as it does not require learning pre-defined relations; on the other hand, its precision is quite low (as shown in the experimental section), since it relies only on POS tagging information.
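As an illustration, the following is a minimal sketch of this rule-based baseline over an NLTK constituency tree; the helper names and the tag sets are our simplified assumptions, not the exact rules of the cited works.

```python
# Minimal sketch of the rule-based baseline described above, assuming
# `sentence_tree` is the S node of an NLTK constituency parse. Helper
# names and tag sets are illustrative simplifications, not the cited
# works' exact rules.
from nltk import Tree

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def first_match(tree, tags):
    # First leaf (left to right) whose POS tag belongs to `tags`.
    for preterminal in tree.subtrees(lambda t: t.height() == 2):
        if preterminal.label() in tags:
            return preterminal[0]
    return None

def deepest_verb(vp):
    # Deepest verb leaf found in the verb phrase.
    best, best_depth = None, -1
    for pos in vp.treepositions("leaves"):
        if vp[pos[:-1]].label() in VERB_TAGS and len(pos) > best_depth:
            best, best_depth = vp[pos], len(pos)
    return best

def extract_triplet(sentence_tree):
    np = next(sentence_tree.subtrees(lambda t: t.label() == "NP"), None)
    vp = next(sentence_tree.subtrees(lambda t: t.label() == "VP"), None)
    if np is None or vp is None:
        return None
    subj = first_match(np, NOUN_TAGS)
    verb = deepest_verb(vp)
    # Object: first noun/adjective in a phrase sibling of the verb
    # phrase, with a fallback to the phrases inside the VP itself.
    candidates = [c for c in sentence_tree
                  if isinstance(c, Tree) and c is not np and c is not vp]
    candidates += [c for c in vp if isinstance(c, Tree)]
    obj = None
    for cand in candidates:
        obj = first_match(cand, NOUN_TAGS | ADJ_TAGS)
        if obj is not None:
            break
    return (subj, verb, obj)
```

For example, on the parse Tree.fromstring('(S (NP (DT the) (NN cat)) (VP (VBZ eats) (NP (NN fish))))') the sketch returns ('cat', 'eats', 'fish').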
The main contribution of this paper is the development of a methodology to extract meaningful triplets from unstructured texts. We chose to build our methodology on top of the baseline algorithm developed by Rusu et al. (Rusu et al., 2007): compared to other state-of-the-art methods, such an algorithm has very low computational requirements, avoids learning and does not need hand-tagged examples. On the other hand, the baseline algorithm uses only syntactic information to extract relations, which may lead to low-quality triplets (poor precision/recall); to address this issue, we propose to integrate Latent Semantic Analysis (Landauer and Dumais, 1997; Deerwester et al., 1990), whose statistical foundation has recently been explained in (Pilato and Vassallo, 2015), in order to filter out low-quality triplets, thus improving precision/recall. At a preliminary stage, pairs of tightly semantically related words are extracted from the corpus; in a further stage, the sentences containing such words are parsed and syntactically analyzed to discover the linking relations. As this is a preliminary work, we focus only on the extraction of first-order relationships, i.e., triplets of the form (subject, verb, object).
2 THE PROPOSED APPROACH
The main purpose of this work is to extract a set of triplets of words in subject-verb-object form from a document corpus; the proposed algorithm does not require any pre-defined relation and tries to discover valid relations through the analysis of the semantics of words. First, the algorithm builds a semantic space by means of Latent Semantic Analysis (LSA) in order to reveal pairs of words which are somehow related to each other. Any pair of words in a tight semantic association (high values of cosine similarity in the semantic space) is then expanded into a triplet (if possible) by looking in the corpus for a verb binding that pair: i.e., the algorithm looks for s-v-o triplets.
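As a minimal sketch of this step, the fragment below builds an LSA space with scikit-learn's TruncatedSVD and scores word pairs by cosine similarity; the function name, the rank k and the 0.7 threshold are illustrative assumptions, not the parameters used in our experiments.

```python
# Minimal sketch of the pairs-extraction step: build a term-document
# matrix over the micro-documents, project the terms into a rank-k LSA
# space via truncated SVD, and keep the word pairs whose cosine
# similarity exceeds a threshold. The rank k and the 0.7 threshold are
# illustrative assumptions (k must be smaller than the number of
# micro-documents); the quadratic pair scan is only meant for small
# vocabularies.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

def related_pairs(micro_documents, k=100, threshold=0.7):
    vectorizer = CountVectorizer()
    docs_terms = vectorizer.fit_transform(micro_documents)
    # Transpose to a terms x documents matrix and reduce its rank.
    svd = TruncatedSVD(n_components=k)
    term_vectors = svd.fit_transform(docs_terms.T)
    # Normalize rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(term_vectors, axis=1, keepdims=True)
    term_vectors /= np.maximum(norms, 1e-12)
    sims = term_vectors @ term_vectors.T
    words = vectorizer.get_feature_names_out()
    return [(words[i], words[j], sims[i, j])
            for i in range(len(words))
            for j in range(i + 1, len(words))
            if sims[i, j] >= threshold]
```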
The triplet extraction task is accomplished by four main blocks: a) Micro-documents Extraction: extracts sentences from the corpus and records them as micro-documents; b) Word-Tags-Documents Generation: pre-processes the micro-documents and creates a list of unique words; each word is associated with a tagset representing its different part-of-speech (POS) tags as well as with its frequency counts in the extracted micro-document corpus; c) Pairs Extraction: builds the LSA semantic space and selects relevant pairs of semantically related words; d) Triplet Generation: tries to match the relevant pairs into s-v-o triplets extracted from the micro-documents.
2.1 Micro-documents Extraction
The aim of this block is to segment the input corpus into a sequence of sentences. As described in Figure 1, the sentences-extractor module reads the corpus and produces a sequence of sentences (i.e., sequences of words included between two periods). Case-folding is applied to the sentences. Each sentence is then saved in a text file.
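A minimal sketch of this block is given below; the period-based split mirrors the definition above, while the output directory and the file-naming scheme are illustrative assumptions.

```python
# Minimal sketch of the micro-document extraction block: split the
# corpus into sentences (sequences of words delimited by periods),
# apply case-folding, and save each sentence as its own text file.
# The output directory and file-naming scheme are assumptions.
import os

def extract_micro_documents(corpus_text, out_dir="micro_documents"):
    os.makedirs(out_dir, exist_ok=True)
    # Split on periods and apply case-folding to each sentence.
    sentences = [s.strip().lower()
                 for s in corpus_text.split(".") if s.strip()]
    for i, sentence in enumerate(sentences):
        path = os.path.join(out_dir, "sentence_%06d.txt" % i)
        with open(path, "w", encoding="utf-8") as f:
            f.write(sentence)
    return sentences
```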
2.2 Word-Tags-Documents Generation
As shown in Figure 2, the extracted micro-documents are tokenized, tagged and lemmatized in order to reduce the dimensionality of the term-document matrix that will be provided as input to LSA; stop-words are also removed. After such pre-processing steps, the dictionary-extractor module associates with each distinct word w_i ∈ W its related tagset tags_i ∈ T and the frequency counts docfreq_{i,j} of the word w_i in each micro-document d_j ∈ D, where W, T, D are respectively the sets of unique words, tags and micro-documents. The output of the block is a dictionary data structure WTD = {w_i, tags_i, docfreq_{i,j}}.
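A minimal sketch of this block, assuming NLTK for tokenization, POS tagging and lemmatization, is given below; the concrete WTD layout (a dictionary keyed by lemma, holding a tagset and per-document frequency counts) is our assumed realization of the structure described above.

```python
# Minimal sketch of the Word-Tags-Documents generation block, assuming
# NLTK for tokenization, POS tagging and lemmatization. The concrete
# WTD layout below is an assumption, not the paper's exact structure.
from collections import defaultdict
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def generate_wtd(micro_documents):
    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words("english"))
    # WTD[w_i] = {"tags": tags_i, "docfreq": {j: docfreq_{i,j}}}
    wtd = defaultdict(lambda: {"tags": set(), "docfreq": defaultdict(int)})
    for j, doc in enumerate(micro_documents):
        for token, tag in pos_tag(word_tokenize(doc)):
            # Remove stop-words and non-alphabetic tokens.
            if token in stop or not token.isalpha():
                continue
            lemma = lemmatizer.lemmatize(token)
            wtd[lemma]["tags"].add(tag)
            wtd[lemma]["docfreq"][j] += 1
    return wtd
```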