3.4 Results and Analysis
Table 2 shows the results of each model on intent, act, and slot recognition. The last column, FrameAcc, reports the proportion of test utterances for which the intents, acts, and slots are all recognized correctly. The second row lists the test sets used; "Overall" denotes the combined test set of Sim-R and Sim-M.
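The FrameAcc metric described above can be sketched as follows; the field names are illustrative assumptions, not the actual format of our evaluation scripts:

```python
def frame_accuracy(predictions, references):
    """Fraction of utterances whose intent, act, and full slot tag
    sequence are all predicted correctly (exact match on each part)."""
    correct = sum(
        1 for pred, ref in zip(predictions, references)
        if pred["intent"] == ref["intent"]
        and pred["act"] == ref["act"]
        and pred["slots"] == ref["slots"]  # whole tag sequence must match
    )
    return correct / len(references)

# toy example: only the first of the two frames is fully correct
preds = [{"intent": "book", "act": "inform", "slots": ["B-date", "O"]},
         {"intent": "book", "act": "inform", "slots": ["O", "O"]}]
refs  = [{"intent": "book", "act": "inform", "slots": ["B-date", "O"]},
         {"intent": "find", "act": "inform", "slots": ["O", "O"]}]
print(frame_accuracy(preds, refs))  # → 0.5
```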
For MemNet and SDEN, we find that the models using randomly initialized word embeddings give better performance on the larger Sim-R dataset, whereas on the smaller Sim-M dataset the models with pre-trained word embeddings perform better.
For intent recognition, the NoContext model is significantly worse than all other models, which suggests that intent recognition depends heavily on context. Owing to the introduction of contextual information, all other models obtain high accuracy in intent recognition, with MSDU clearly achieving the best results.
For act recognition, the performance of NoContext is again lower than that of the other models, confirming that contextual information remains helpful. MSDU clearly outperforms the other models on this task, which indicates that MSDU has a stronger ability to capture the relationship between the context and the current user utterance.
For slot tagging, there is no significant performance difference among MemNet, SDEN, and NoContext. On the other hand, MSDU and its variant models achieve better results. We also find that MSDU-Concat performs nearly the same as MSDU on slot recognition, meaning that the concatenation step contributes little to slot recognition.
From the test results, we find that the MSDU model achieves about 5% higher FrameAcc than the MemNet and SDEN models.
It is interesting to note that SDEN does not obtain a better result than MemNet, even though the former uses a more complex context encoding method. MSDU-BERT-Concat and the above two models use the same randomly initialized word embeddings; the difference lies mainly in context encoding. MSDU-BERT-Concat uses a hierarchical GRU to encode context information, which is even simpler than the context encoding method used by MemNet, yet it obtains about 2% higher FrameAcc than MemNet and SDEN. This casts doubt on the necessity of an attention mechanism in context encoding.
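The hierarchical-GRU context encoding mentioned above can be sketched as follows: a token-level GRU compresses each history utterance into one vector, and a turn-level GRU then encodes the sequence of utterance vectors. This is an illustrative NumPy sketch with assumed dimensions and random parameters, not the model's actual implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru(inputs, d, params):
    """Run a single-layer GRU over a sequence of input vectors and
    return the final hidden state. params = (W, U, b) with shapes
    (3d, d_in), (3d, d), (3d,), stacking the z, r, and h~ blocks."""
    W, U, b = params
    h = np.zeros(d)
    for x in inputs:
        zr = sigmoid(W[:2 * d] @ x + U[:2 * d] @ h + b[:2 * d])
        z, r = zr[:d], zr[d:]
        h_tilde = np.tanh(W[2 * d:] @ x + U[2 * d:] @ (r * h) + b[2 * d:])
        h = (1 - z) * h + z * h_tilde  # standard GRU update
    return h

def make_params(d_in, d, rng):
    return (rng.standard_normal((3 * d, d_in)) * 0.1,
            rng.standard_normal((3 * d, d)) * 0.1,
            np.zeros(3 * d))

def encode_context(dialogue, d_emb, d_utt, d_ctx, rng):
    """Lower GRU: each utterance's token embeddings -> one vector.
    Upper GRU: sequence of utterance vectors -> one context vector."""
    token_params = make_params(d_emb, d_utt, rng)
    turn_params = make_params(d_utt, d_ctx, rng)
    utt_vecs = [gru(utt, d_utt, token_params) for utt in dialogue]
    return gru(utt_vecs, d_ctx, turn_params)

rng = np.random.default_rng(0)
# 3 history turns, each a sequence of random 8-dim token embeddings
dialogue = [rng.standard_normal((n, 8)) for n in (4, 6, 5)]
context_vec = encode_context(dialogue, d_emb=8, d_utt=16, d_ctx=16, rng=rng)
print(context_vec.shape)  # (16,)
```

Unlike MemNet's attention over memory slots, this encoder needs no alignment computation between the current utterance and each history turn, which is why we describe it as simpler.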
From the results of the MSDU variant models, we can also conclude that the concatenation procedure brings about 1.1% of improvement, the BERT module brings about 2.7%, and the combination of the two brings about 3.4%.
4 CONCLUSIONS AND FUTURE WORK
We have proposed the MSDU model for the recognition of intents, acts, and slots using the historical information of a multi-turn spoken dialogue, and evaluated it on different datasets and with several variant modifications. The test results show that the design of the MSDU model is effective and brings significant improvement.
For future work, we will study how to apply this new model architecture to higher-level dialogue understanding tasks, such as ontology-based slot recognition and the alignment of intents, acts, and slots. For the moment, we have not discussed the subordinate relationships among intents, acts, and slots, which are essential to dialogue understanding.
REFERENCES
Ankur Bapna, Gokhan Tür, Dilek Hakkani-Tür, and Larry Heck. 2017. Sequential Dialogue Context Modeling for Spoken Language Understanding. arXiv preprint arXiv:1705.03455.
Bing Liu and Ian Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv preprint arXiv:1609.01454.
Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning End-to-End Goal-Oriented Dialog. In Proceedings of the 2017 International Conference on Learning Representations (ICLR).
Yun-Nung Chen, Dilek Hakkani-Tür, Gokhan Tür, et al. 2016. End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding. In Proceedings of the 2016 Annual Conference of the International Speech Communication Association.
Dilek Hakkani-Tür, Gokhan Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-Domain Joint Semantic Frame Parsing using Bi-directional RNN-LSTM. In Proceedings of the 2016 Annual Conference of the International Speech Communication Association.
Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning Word Vectors for 157 Languages. In Proceedings of the 2018 International Conference on Language Resources and Evaluation (LREC).