Given the scarcity of parallel corpora, i.e., corpora with pairs of sentences and their compressions, to train neural network-based models, (Filippova and Altun, 2013) developed a method for automatically generating a corpus by matching the dependency tree of the compressed sentence to a sub-tree of the dependency tree of the original sentence. (Filippova et al., 2015) used the method proposed in (Filippova and Altun, 2013) to create a corpus of millions of sentence pairs, making feasible the training of a neural network-based model inspired by advances in neural machine translation (NMT). The model of (Filippova et al., 2015) implements a stack of Long Short-Term Memory (LSTM) networks to classify each word in a sentence as retained or removed, and its authors evaluated the impact of some syntactic features on the model's performance.
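For illustration only, a minimal sketch of this deletion-based formulation is given below, assuming PyTorch; it is not the published model, and the vocabulary size, dimensions, and single-layer LSTM are arbitrary choices for the example.

```python
# Hedged sketch: a deletion-based compressor that scores each word as
# retained (label 1) or removed (label 0); all sizes are illustrative.
import torch
import torch.nn as nn


class DeletionTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)  # remove vs. retain

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        states, _ = self.lstm(self.embed(token_ids))
        return self.classifier(states)  # (batch, seq_len, 2) logits per word


if __name__ == "__main__":
    model = DeletionTagger(vocab_size=10000)
    logits = model(torch.randint(0, 10000, (1, 12)))  # one 12-word sentence
    keep_mask = logits.argmax(dim=-1)                 # 1 = keep the word
    print(keep_mask.shape)                            # torch.Size([1, 12])
```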
After (Cohn and Lapata, 2008), (Rush et al., 2015) was the first work to revisit sentence compression with a non-deletion-based approach. Inspired by the attention model of (Bahdanau et al., 2014), widely used in NMT, (Rush et al., 2015) was also the first to name the task sentence summarization. It proposes an attention-based encoder to learn a vector representation of a sentence and uses a neural language model to decode this encoded representation into a new sentence that is shorter but preserves the meaning of the original. (Chopra et al., 2016) extends the model of (Rush et al., 2015), replacing the neural language model with a recurrent neural network to generate the compressed sentence.
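Under simplified assumptions, the sketch below illustrates the attention idea at the core of these abstractive models: at each step, the decoder attends over the encoder states to build a context vector and predicts the next summary word from it. The GRU cell and all module sizes are illustrative choices, not the published configurations.

```python
# Hedged sketch of one attention-based decoding step (not the original models).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveDecoderStep(nn.Module):
    def __init__(self, vocab_size, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, prev_word_emb, dec_state, enc_states):
        # enc_states: (batch, src_len, hidden_dim) produced by any encoder
        dec_state = self.rnn(prev_word_emb, dec_state)
        scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)
        attn = F.softmax(scores, dim=1)                        # (batch, src_len)
        context = torch.bmm(attn.unsqueeze(1), enc_states).squeeze(1)
        logits = self.out(torch.cat([dec_state, context], dim=1))
        return logits, dec_state                               # next-word scores


# Example: one decoding step over a 15-word source sentence.
step = AttentiveDecoderStep(vocab_size=10000)
logits, state = step(torch.zeros(1, 128), torch.zeros(1, 128),
                     torch.randn(1, 15, 128))
```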
The works of (Jing, 2000), (Knight and Marcu, 2000), and (McDonald, 2006), as the first supervised methods, were concerned with the quality of their models when applied to sentences outside the domains on which they were trained. However, with the advancement of neural network-based models, this concern was gradually neglected. Since then, the work of (Wang et al., 2017) is the first to investigate how to ensure that a model trained on one domain generalizes the sentence compression task properly to domains on which it has not been trained. To do so, it extends the model of (Filippova et al., 2015) by using a bidirectional LSTM instead of an ordinary LSTM to better capture contextual information, and by using a set of specific syntactic features instead of merely the words, as done in (Filippova et al., 2015). Additionally, the work of (Wang et al., 2017), inspired by (Clarke and Lapata, 2008), uses integer linear programming to find the optimal combination of labels.
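As a simplified illustration of this decoding strategy, the sketch below selects the keep/remove labels with an ILP that rewards confident "keep" decisions under a length budget. The example probabilities, the single length constraint, and the use of the PuLP solver are assumptions made for the sketch; they do not reproduce the constraint set of (Wang et al., 2017) or (Clarke and Lapata, 2008).

```python
# Hedged sketch: ILP decoding over per-word "keep" probabilities (requires PuLP).
import math
import pulp

words = ["the", "president", "said", "on", "monday", "that", "talks", "failed"]
p_keep = [0.3, 0.9, 0.8, 0.2, 0.4, 0.3, 0.9, 0.9]   # made-up model outputs
max_len = 5                                          # illustrative length budget

prob = pulp.LpProblem("sentence_compression", pulp.LpMaximize)
x = [pulp.LpVariable(f"keep_{i}", cat="Binary") for i in range(len(words))]

# Reward keeping words the tagger is confident about (log-odds weighting).
prob += pulp.lpSum(math.log(p / (1 - p)) * xi for p, xi in zip(p_keep, x))
prob += pulp.lpSum(x) <= max_len                     # compression constraint

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([w for w, xi in zip(words, x) if xi.value() == 1])
```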
The problem of sentence compression has been addressed differently over the years, so it is hard to compare the proposed models and algorithms. The advancement of neural networks brought significant improvements to the field, but it also brought the well-known problems of data acquisition. Except for the model of (Wang et al., 2017), all the presented neural network-based models require large amounts of labeled data to achieve good performance. These models take advantage of such massive amounts of data to extract important features from the words by themselves, without using any syntactic information. Nevertheless, these massive amounts of data are not always open and publicly available, and with small amounts of data such models cannot extract, by themselves, the features they need to achieve better performance.
We propose a model that aims to overcome this problem by extracting some important features beforehand and by applying some pre-processing steps. Our model is an extension of (Wang et al., 2017), so the model proposed in (Wang et al., 2017) is our baseline.
6 EXPERIMENTS
In this section, we discuss the experimental evalua-
tion.
6.1 Dataset
The dataset used in these experiments is a set of 10,000 pairs of sentences (original and compressed) publicly released in (Filippova et al., 2015), used both for training and for model evaluation. These sentences were automatically extracted from Google News using the method developed in (Filippova and Altun, 2013). The 10,000 pairs were split into a training set of approximately 8,000 sentences, a validation set of around 1,000 sentences, and a test set of 1,000 sentences. We study how our model and the baseline perform with a small training set, so we consider three different samples of the training set: one with the full 8,000 sentences, another with the first 5,000 sentences, and the last with the first 2,000 sentences.
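A minimal sketch of this split, assuming the released pairs are already loaded in their original order:

```python
# Hedged sketch: split the 10,000 released pairs and derive the three
# training samples (2,000, 5,000, and 8,000 sentences) used in our study.
def make_splits(pairs):
    train, valid, test = pairs[:8000], pairs[8000:9000], pairs[9000:10000]
    train_samples = {2000: train[:2000], 5000: train[:5000], 8000: train}
    return train_samples, valid, test
```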
6.2 Experimental Setup
In the experiments, we trained all models using Adam (Kingma and Ba, 2014) as the optimizer, with the learning rate starting at 0.001, and cross-entropy as the loss function. The hidden layers of the LSTMs in all models were set to 100 units. The word embedding vectors were set to 100 dimensions; they were initialized with pre-trained GloVe vectors of the same dimension and were not updated during training. The embedding
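A minimal sketch of this training configuration, assuming PyTorch; the random tensor stands in for the real GloVe matrix, and the small bidirectional LSTM tagger is a placeholder for the actual models rather than a faithful reimplementation.

```python
# Hedged sketch: Adam with learning rate 0.001, cross-entropy loss,
# 100-unit LSTM hidden layers, and frozen 100-d embeddings.
import torch
import torch.nn as nn


class CompressionTagger(nn.Module):
    def __init__(self, pretrained_vectors, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.lstm = nn.LSTM(pretrained_vectors.size(1), hidden_dim,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, 2)  # keep vs. remove per word

    def forward(self, token_ids):
        states, _ = self.lstm(self.embed(token_ids))
        return self.out(states)


glove = torch.randn(10000, 100)          # placeholder for real GloVe vectors
model = CompressionTagger(glove)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# One illustrative update on random data (batch of 4 sentences, 20 words each).
token_ids = torch.randint(0, 10000, (4, 20))
labels = torch.randint(0, 2, (4, 20))
loss = criterion(model(token_ids).reshape(-1, 2), labels.reshape(-1))
loss.backward()
optimizer.step()
```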