words is to use the TF-IDF measure. Given a docu-
ment in a corpus, a higher TF-IDF value for a term indicates that the term appears more frequently in that document and less frequently in the remaining documents. However, for a corpus with only a small number of documents, the TF-IDF value of a keyword is almost equal to its frequency.
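As a quick illustration of the measure, the sketch below uses the common tf × log(N/df) formulation; the function and its toy corpus are our own, not part of the systems discussed here:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each document by tf * log(N / df).

    docs: list of documents, each given as a list of tokens.
    Returns one {term: score} dict per document.
    """
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)               # term frequency within the document
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks fell on heavy trading".split(),
]
scores = tf_idf(docs)
# "mat" occurs in only one document, so in the first document it
# scores higher than "the", which occurs in two of the three.
assert scores[0]["mat"] > scores[0]["the"]
```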
The Word Co-occurrence (WCO) method (Mat-
suo and Ishizuka, 2004) is a better method, which ap-
plies to a single document without a corpus. WCO
first extracts frequent terms from the document, and
then collects a set of word pairs co-occurring in
the same sentences (sentences include titles and sub-
titles), where one of the words is a frequent term. If
term t co-occurs frequently with a subset of frequent
terms, then it is likely to be a keyword. The authors
of WCO showed that WCO offers comparable perfor-
mance to TF-IDF without the presence of a corpus.
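To illustrate the co-occurrence idea behind WCO, the following simplified sketch counts, for each non-frequent word, how often it shares a sentence with a frequent term; the original method ranks terms with a chi-squared statistic, and the `top_frac` cutoff below is our own assumption:

```python
from collections import Counter
from itertools import combinations

def wco_scores(sentences, top_frac=0.3):
    """Simplified sketch of WCO-style scoring.

    sentences: list of token lists.  The top `top_frac` fraction of
    distinct words by frequency are taken as frequent terms; every
    other word is scored by how often it co-occurs with a frequent
    term in the same sentence.  (The original method ranks words by
    a chi-squared statistic over this co-occurrence distribution.)
    """
    freq = Counter(w for s in sentences for w in s)
    k = max(1, int(top_frac * len(freq)))
    frequent = {w for w, _ in freq.most_common(k)}
    cooc = Counter()
    for s in sentences:
        for a, b in combinations(set(s), 2):
            if a in frequent and b not in frequent:
                cooc[b] += 1
            elif b in frequent and a not in frequent:
                cooc[a] += 1
    return cooc

sentences = [
    "title generation is hard".split(),
    "title generation needs syntax".split(),
    "automatic title generation uses syntax".split(),
]
cooc = wco_scores(sentences)
# "syntax" co-occurs with the frequent terms most often
assert cooc.most_common(1)[0][0] == "syntax"
```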
The Rapid Automatic Keyword Extraction
(RAKE) algorithm (Rose et al., 2010) is another
keyword extraction method based on word pair
co-occurrences. In particular, RAKE first divides
the document into words and phrases using prede-
termined word delimiters, phrase delimiters, and
positions of stop words. RAKE then computes a
weighted graph, where each word is a node, and a
pair of words are connected with weight n if they
co-occur n times. RAKE then assigns a score to each
keyword candidate, which is the summation of scores
of words contained in the keyword. Word scores
may be calculated by word frequency, word degree,
or ratio of degree to frequency. Keyword candidates
with top scores are then selected as keywords for
the document. RAKE is superior to WCO in the following aspects: it is simpler, and it achieves a higher precision rate with about the same recall rate. We will
use RAKE to extract keywords.
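A minimal sketch of RAKE-style scoring with the degree-to-frequency word score follows; the stopword list below is a toy subset of our own choosing, and real RAKE also splits candidates at punctuation-based phrase delimiters:

```python
import re
from collections import defaultdict

# A tiny illustrative stopword list; real RAKE uses a full stopword
# list plus word and phrase delimiters.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "over", "the", "to"}

def rake_scores(text, stopwords=STOPWORDS):
    """Sketch of RAKE: split text into candidate phrases at stopwords,
    then score each phrase by the sum of its words' degree/frequency."""
    words = re.findall(r"[a-z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in stopwords:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    freq = defaultdict(int)    # how often each word occurs in candidates
    degree = defaultdict(int)  # sum of lengths of phrases containing the word
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}

scores = rake_scores("Compatibility of systems of linear constraints "
                     "over the set of natural numbers")
# words in multi-word phrases accumulate degree, so
# "linear constraints" outscores the single word "compatibility"
assert scores["linear constraints"] > scores["compatibility"]
```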
Early title generation methods include Naive
Bayesian with limited vocabulary, Naive Bayesian
with full vocabulary, Term frequency and inverse doc-
ument frequency (TF-IDF), K-nearest neighbor, and
Iterative Expectation Maximization. These methods, however, only generate an unordered set of keywords as a title, without considering syntax. Using the F1 score metric as the basis for comparison, extensive experiments showed that the TF-IDF title generation method performs best among these methods (Jin and Hauptmann, 2001).
For practical purposes we would like to gener-
ate syntactically correct titles. Recent methods used
sentence trimming to convert a title candidate into a
shorter sentence or phrase, while trying to maintain
syntactic correctness. Sentence trimming has been
studied in recent years and has met with some success. For example, Knight and Marcu (Knight and
Marcu, 2002) and Turner and Charniak (Turner and
Charniak, 2005) used a language model (e.g., the tri-
gram model) to trim sentences. Vandegehinste and
Pan (Vandegehinste and Pan, 2004) used context-free
grammar (CFG) trees to trim a sentence to gener-
ate subtitle candidates with appropriate pronunciations for deaf and hard-of-hearing people. They
used a spoken Dutch corpus for evaluation. More re-
cently, Zhang et al. (Zhang et al., 2013) used sentence
trimming on Chinese text to generate titles, also through CFG trees. We note that CFGs are ambiguous and that the complexity of constructing CFG trees is high.
Dependency trees are a recent development in natu-
ral language processing with a number of advantages.
For example, dependency trees offer better syntactic
representations of sentences (Sylvain, 2012) and they
are easier to work with. These advantages motivated
us to explore automatic title generation using depen-
dency trees. We present the first such algorithm.
Our approach has the following major differences
from the previous approaches:
1. We use RAKE to generate keywords and define a
better measure to select central sentences.
2. We use dependency grammar to construct a de-
pendency tree for each title candidate for trim-
ming.
3. We construct a set of empirical rules to generate
titles.
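As a concrete illustration of point 2, the sketch below trims a toy dependency tree by deleting subtrees attached via a given relation; the `Node` structure and the choice of trimmable relations (e.g., the standard advmod label) are our own illustration, not DTATG's actual empirical rules:

```python
class Node:
    """A dependency-tree node: the word, its position in the original
    sentence, its grammatical relation to its head, and its dependents."""
    def __init__(self, word, pos, rel, children=()):
        self.word, self.pos, self.rel = word, pos, rel
        self.children = list(children)

def trim(node, trimmable):
    """Return a copy of the tree with every subtree whose relation
    to its head is in `trimmable` removed."""
    kept = [trim(c, trimmable) for c in node.children
            if c.rel not in trimmable]
    return Node(node.word, node.pos, node.rel, kept)

def collect(node):
    """Yield all nodes of a tree."""
    yield node
    for c in node.children:
        yield from collect(c)

def linearize(node):
    """Read the remaining words back off in sentence order."""
    return " ".join(n.word for n in sorted(collect(node),
                                           key=lambda n: n.pos))

# "the system quickly generates titles", with "quickly" attached
# to the verb as an adverbial modifier (advmod)
tree = Node("generates", 3, "root", [
    Node("system", 1, "nsubj", [Node("the", 0, "det")]),
    Node("quickly", 2, "advmod"),
    Node("titles", 4, "dobj"),
])
trimmed = trim(tree, {"advmod"})
assert linearize(trimmed) == "the system generates titles"
```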
We show through experiments that DTATG generates titles comparable to titles generated by human writers. In addition to this evaluation, we also evaluate the F1 scores and show that DTATG is superior to the TF-IDF method (see Section 4.3).
The remainder of this paper is organized as fol-
lows: In Section 2 we first provide an overview of
DTATG. We then explain DTATG in detail, including
extraction of central sentences, dependency parsing,
and the dependency tree compression model. In Sec-
tion 4 we describe our experimental evaluation setup
and present results from our experiments. We con-
clude the paper in Section 5.
2 DETAILED DESCRIPTION OF DTATG
The DTATG system framework to generate a title for
a given document is shown in Figure 1, where we as-
sume that each document already has a title for com-
parison. If a document does not have a title, this comparison is skipped. Given a document,
DTATG generates a title as follows:
DTATG: An Automatic Title Generator based on Dependency Trees