4 EXPERIMENTS
4.1 Datasets
We conduct experiments on the E2E NLG dataset, where each instance is a pair of an MR and the corresponding human-written reference sentences. Since an MR can correspond to several references, we take all distinct MRs in the training set and, for each MR, randomly pick one of its corresponding references. The statistics of the data used are listed in Table 1. Compared with the 42061 training instances in the original E2E NLG dataset, only a small fraction of instances is used in our experiment.
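As an illustration, the per-MR reference sampling above can be sketched as follows (a minimal sketch; the function and variable names are ours, and the raw training set is assumed to be available as a list of (MR, reference) pairs):

    import random
    from collections import defaultdict

    def sample_one_reference_per_mr(pairs, seed=0):
        # Keep one randomly chosen reference for each distinct MR.
        # `pairs` is assumed to be an iterable of (mr, reference) tuples,
        # as in the raw E2E NLG training set.
        rng = random.Random(seed)
        refs_by_mr = defaultdict(list)
        for mr, ref in pairs:
            refs_by_mr[mr].append(ref)
        return [(mr, rng.choice(refs)) for mr, refs in refs_by_mr.items()]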
To validate the proposed masked hard coverage mechanism, we conduct experiments on the TV and Laptop NLG datasets. Each instance is a pair of a Dialogue Act (DA) and the corresponding natural language realization. An example instance is: DA: inform no match(type=television; pricerange=cheap; hdmiport=2; family=l2); Ref: There are no cheap televisions in the l2 family with 2 hdmi ports. In total, there are 14 different DAs in each dataset. To attach the DA information to the input, we take the DA string, such as inform no match, as the prompt input of the GPT-2 decoder. The remaining slot-value pairs and the references are used as the inputs of the encoder and decoder, respectively (a sketch of this preprocessing is given after Table 1). The statistics of these datasets are documented in Table 1.
Table 1: The statistics of three NLG datasets.

                     E2E    Laptop   TV
    Training set     4862   7944     4221
    Development set  547    2649     1407
    Test set         630    2649     1407
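As an illustration, the DA preprocessing described above could be sketched as follows; the exact string format and all names here are assumptions based on the example instance in the text:

    import re

    def parse_da(da_string):
        # Split a DA string such as
        # "inform_no_match(type=television;pricerange=cheap;hdmiport=2;family=l2)"
        # into the act name, used as the GPT-2 decoder prompt, and the
        # slot-value pairs, used as the encoder input.
        match = re.match(r"([^(]+)\((.*)\)\s*$", da_string.strip())
        act, slot_str = match.group(1).strip(), match.group(2)
        pairs = [p.split("=", 1) for p in slot_str.split(";") if p]
        return act, [(k.strip(), v.strip()) for k, v in pairs]

    act, slots = parse_da(
        "inform_no_match(type=television;pricerange=cheap;hdmiport=2;family=l2)")
    # act   -> "inform_no_match"  (decoder prompt)
    # slots -> [("type", "television"), ("pricerange", "cheap"), ...]  (encoder input)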
4.2 Evaluation Metrics
For the E2E NLG dataset, we compare with other models using several automatic n-gram overlap evaluation metrics, including BLEU, NIST, METEOR, ROUGE-L, and CIDEr.¹ For the Laptop and TV NLG datasets, we evaluate our generation results with BLEU, ERR,² and BERTScore (Zhang et al., 2019). ERR is computed as the number of erroneous slots in the generated utterances divided by the total number of slots in the given MRs. Unlike BLEU, which merely measures n-gram overlap, BERTScore computes token-wise similarity between candidate sentences and references using contextual embeddings, which correlates better with human judgments.

¹ The evaluation script is provided by https://github.com/tuetschek/e2e-metrics.
² The tool is from https://github.com/shawnwun/RNNLG.
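As an illustration, a minimal sketch of the ERR computation described above; the substring-based slot matching and all names are our simplifying assumptions, and the reported scores come from the RNNLG tool cited in footnote 2:

    def slot_error_rate(examples):
        # ERR = number of erroneous slots / total number of slots in the MRs.
        # `examples` is assumed to be a list of (slots, utterance) pairs,
        # where `slots` maps slot names to surface values. A substring check
        # only catches missing slots; the full tool also counts redundant
        # (repeated or hallucinated) slots as erroneous.
        erroneous, total = 0, 0
        for slots, utterance in examples:
            text = utterance.lower()
            for value in slots.values():
                total += 1
                if str(value).lower() not in text:
                    erroneous += 1
        return erroneous / total if total else 0.0

BERTScore itself can be computed with the authors' bert-score package, e.g. P, R, F1 = bert_score.score(candidates, references, lang="en").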
4.3 Results and Analyses
For the E2E NLG dataset, evaluation results are listed in Table 2. TGen (Dušek and Jurčíček, 2016) is a strong baseline system that no other single system has outperformed on all evaluation metrics. The best previous results on the different metrics are reported by Roberti et al. (2019), Dušek et al. (2018), Puzikov and Gurevych (2018), Zhang et al. (2018), and Gong (2018), respectively. “Original” is the framework (Chen et al., 2020) on which our mechanism is built. “Original+covloss” is the original model with the soft coverage mechanism (See et al., 2017).
Compared with the original framework, our model improves on most of the evaluation metrics. A further comparison with the soft coverage mechanism demonstrates the validity and the stronger copying constraint of our masked hard coverage mechanism. Considering that the best previous results are achieved separately by different models, our model yields fairly strong and balanced results across all metrics, and outperforms the baseline model TGen on BLEU, METEOR, ROUGE-L, and CIDEr.
We list two groups of sentences produced by the original framework and our model in Table 4. In the first example, over-generation occurs in the utterances generated by the original framework and by the original framework with the soft coverage mechanism (See et al., 2017); that is, the keywords “coffee shop” appear twice. In the second group, the other systems produce utterances that fail to cover the “Japanese” food information. In contrast, the sentences produced by our model include all the information. Apparently, over-generation and under-generation are eliminated in these examples.
To further validate the proposed coverage mechanism, experimental results on the Laptop and TV NLG datasets are displayed in Table 3. HDC (Wen et al., 2015a) is a handcrafted generator capable of covering all the slots. SCLSTM (Wen et al., 2015b) is a baseline model, a statistical language generator based on a semantically controlled LSTM structure. “Ori+Cov” is the original framework (Chen et al., 2020) with the soft coverage mechanism (See et al., 2017).
In terms of BLEU, SCLSTM surpasses all others; however, the comparable BERTScore results among the different models demonstrate that all of them generate sentences similar to the references. Regarding ERR, our model outperforms SCLSTM in the Laptop domain, which shows its potential for more applications.