CONCEPT DISCOVERY FOR LANGUAGE UNDERSTANDING IN
AN INFORMATION-QUERY DIALOGUE SYSTEM
Nathalie Camelin, Boris Detienne, Stéphane Huet, Dominique Quadri and Fabrice Lefèvre
LIA - University of Avignon, BP 91228, 84911 Avignon Cedex 09, France
Keywords:
Concept discovery, Language understanding, Latent Dirichlet analysis, Dialogue systems.
Abstract:
Most recent efficient statistical approaches for natural language understanding require a segmental
annotation of the training data. Such an annotation involves both determining the concepts in a
sentence and linking them to their corresponding word segments. In this paper we propose a two-step
alternative to the fully manual annotation of data: an initial unsupervised concept discovery, based
on latent Dirichlet allocation, is followed by an automatic segmentation using integer linear
optimisation. The relation between discovered topics and task-dependent concepts is evaluated on a
spoken dialogue task for which a reference annotation is available. Topics and concepts are shown to
be close enough to achieve a potential reduction of one half of the manual annotation cost.
1 INTRODUCTION
Generally, information-query spoken dialogue systems give users spoken access to a database.
Once a speech recogniser has transcribed the signal, the meaning of the user’s queries is
extracted by a Spoken Language Understanding (SLU) module. The very first step of this module
is the identification of literal concepts. These concepts are application-dependent and
fine-grained so as to provide efficient and usable information to the following reasoning
modules (e.g. the dialogue manager). In this respect they can also be composed in a global
tree-based structure to form the overall meaning of the sentence.
Several techniques are available to address the issue of concept tagging. Some of these
techniques, now classical, rely on probabilistic models, which can be either discriminant or
generative. To obtain good performance when probabilistic models are used in such systems,
field data have to be collected, transcribed and annotated at the semantic level. It is then
possible to train efficient models in a supervised manner. However, the annotation process is
costly and constitutes a real hindrance to a widespread use of the systems. Therefore any
means to avoid it would be highly valuable.
It seems out of reach to derive the concept definitions from a fully automatic procedure.
However, the process can be bootstrapped, for instance by induction of semantic classes such
as in (Siu and Meng, 1999) or (Iosif et al., 2006). Our assumption here is that the most
time-consuming parts of concept inventory and data tagging could be obtained in an
unsupervised way, even though a final (but hopefully minimal) manual procedure is still
required to tag the derived classes so as to correct the automatic annotation.
Unlike the attempts cited above, which developed ad-hoc approaches, in the work described
here we investigate the use of broad-spectrum knowledge discovery techniques. In this context
the notion most related to that of concept in SLU seems to be the topic, as used in
information retrieval systems. For a long time, the topic detection task was limited to the
association of a single topic to a document and thus did not fit our requirements. The
recently proposed latent Dirichlet allocation (LDA) technique has the capacity to derive a
probabilistic representation of a document as a mixture of topics. As such, LDA can consider
that several topics can co-occur inside a single document or sentence and that the same topic
can be repeated.
Given these favourable characteristics, we consider the application of LDA to concept
discovery for SLU. However, LDA does not take into account the sequentiality of the data (due
to the exchangeability assumption). It is then necessary to introduce constraints for a better
segmentation of the data: the assignment of topics proposed by LDA is modified to be more
coherent in a segmental way.
Camelin N., Detienne B., Huet S., Quadri D. and Lefevre F..
CONCEPT DISCOVERY FOR LANGUAGE UNDERSTANDING IN AN INFORMATION-QUERY DIALOGUE SYSTEM.
DOI: 10.5220/0003640500240029
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2011), pages 24-29
ISBN: 978-989-8425-79-9
Copyright © 2011 SCITEPRESS (Science and Technology Publications, Lda.)
The paper is organised as follows. Principles of
automatic induction of semantic classes are presented
in Section 2, followed by the presentation of an induc-
tion system based on LDA and the additional step of
segmentation using integer linear programming (ILP).
Then evaluations and results are reported in Section 3
on the French MEDIA dialogue task.
2 AUTOMATIC INDUCTION OF
SEMANTIC CLASSES AND
ANNOTATION
2.1 Context Modelization
The main idea of automatic induction of semantic classes is based on the assumption that
concepts often share the same context (syntactic or lexical) (Siu and Meng, 1999), (Pargellis
et al., 2001). While there may still be room for improvement in these techniques, we decided
instead to investigate general knowledge discovery approaches in order to evaluate their
potential.
To that end, a global two-step strategy is proposed: first, semantic classes (topics) are
induced; then topics are assigned to words with segmental constraints. The major interest of
the approach is to separate the tasks of detecting topics and aligning topics with words. It
is then possible to introduce additional constraints (such as locality, number and types of
segments, etc.) in the second task which would otherwise hinder topic detection in the first
place.
Several approaches are available for topic detec-
tion in the context of knowledge discovery and in-
formation retrieval. In this work we were motivated
by the recent development of a very attractive tech-
nique which has interesting distinct features such as
the detection of multiple topics in a single document.
LDA (Blei et al., 2003) is the first principled descrip-
tion of a Dirichlet-based model of mixtures of latent
topic variables. It generates a set of topics with prob-
abilities for each topic to be associated with a word
in a sentence. In our case this knowledge is thereafter
used to infer a segmentation of the sentence using in-
teger linear optimisation.
2.2 Implementation of an Automatic
Induction System based on LDA
Basically LDA is a generative probabilistic model
for text documents. LDA follows the assumption
that a set of observations can be explained by la-
tent variables. More specifically documents are rep-
resented by a mixture of topics (latent variables) and
topics are characterized by distributions over words.
The LDA parameters are $\{\alpha, \beta\}$. $\alpha$ represents the Dirichlet parameters of
the K latent topic mixtures, $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_K]$. $\beta$ is a
matrix representing a multinomial distribution in the form of a conditional probability table
$\beta_{k,w} = P(w \mid k)$. Based on this representation, LDA can estimate the probability of
a new document d of N words, $d = [w_1, w_2, \ldots, w_N]$, using the following procedure.
A topic mixture vector $\theta$ is drawn from the Dirichlet distribution (with parameter
$\alpha$). The corresponding topic sequence $\kappa = [k_1, k_2, \ldots, k_N]$ is generated
for the whole document according to a multinomial distribution (with parameter $\theta$).
Finally each word is generated by the word-topic multinomial distribution (with parameter
$\beta$, that is $p(w_i \mid k_i, \beta)$). After this procedure, the joint probability of
$\theta$, $\kappa$ and d is then:
$$p(\theta, \kappa, d \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{i=1}^{N} p(k_i \mid \theta)\, p(w_i \mid k_i, \beta) \qquad (1)$$
To obtain the marginal probability of d, a final inte-
gration over θ and a summation over all possible top-
ics considering a word is necessary:
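$$p(d \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{i=1}^{N} \sum_{k_i} p(k_i \mid \theta)\, p(w_i \mid k_i, \beta) \right) d\theta \qquad (2)$$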
The framework is comparable to that of probabilis-
tic latent semantic analysis, but the topic multinomial
distribution in LDA is assumed to be sampled from
a Dirichlet prior and is not linked to training docu-
ments.
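The generative procedure and the joint probability of equation (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function names, the representation of β as a list of per-topic dictionaries and the toy probabilities in the usage note are all hypothetical.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def sample_document(alpha, beta, n_words, rng):
    """Follow the generative story: theta first, then one topic and one word per position.

    beta is assumed to be a list of dicts, beta[k][w] = P(w | k).
    """
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(alpha)))
    topics, words = [], []
    for _ in range(n_words):
        k = rng.choices(topic_ids, weights=theta)[0]  # k_i drawn from theta
        vocab = list(beta[k])
        w = rng.choices(vocab, weights=[beta[k][v] for v in vocab])[0]  # w_i drawn from beta[k_i]
        topics.append(k)
        words.append(w)
    return theta, topics, words


def log_joint(theta, topics, words, alpha, beta):
    """Natural log of equation (1) for a given theta, topic sequence and word sequence."""
    # log of the Dirichlet density p(theta | alpha)
    log_p = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_p += sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
    # log of prod_i p(k_i | theta) p(w_i | k_i, beta)
    for k, w in zip(topics, words):
        log_p += math.log(theta[k]) + math.log(beta[k][w])
    return log_p
```

With a hypothetical two-topic table such as `beta = [{"du": 0.8, "oui": 0.2}, {"du": 0.1, "oui": 0.9}]` and `alpha = [1.0, 1.0]`, the Dirichlet density is constant, so `log_joint([0.5, 0.5], [0, 1], ["du", "oui"], alpha, beta)` reduces to the log of (0.5 × 0.8) × (0.5 × 0.9). Equation (2) additionally integrates over θ and sums over the topic of each word, which this sketch does not attempt.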
In recent years, several studies have been carried
out in language processing using LDA. For instance,
(Tam and Schultz, 2006) worked on unsupervised lan-
guage model adaptation, (Celikyilmaz et al., 2010)
ranked candidate passages in a question-answering
system and (Phan et al., 2008) implemented LDA to
classify short and sparse web texts.
We chose to use an implementation of LDA, the GIBBSLDA++ tool
(http://gibbslda.sourceforge.net/), to annotate each user utterance of a dialogue corpus with
topics. Each utterance of more than one word is included in the training set as its sequence
of words. Once the models are trained, inference on the data corpus assigns to each word in a
document its highest-probability topic. The [LDA] system executes 2,000 iterations with the
default parameters for α and β.
Note that single-word utterances are processed separately using prior knowledge. Names of
cities, months, days or short answers (e.g. “yes”, “no”, “yeah”) and numbers are parsed in
these utterances. Indeed, LDA cannot be applied to such utterances, where no co-occurrences
can be observed.
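The data preparation and the per-word topic choice described above can be sketched as follows. This does not call GIBBSLDA++; it assumes that a document-level topic mixture `theta` and a word-topic table `beta` are already available, and it scores a topic k for a word w by `theta[k] * beta[k][w]`, which is only one plausible reading of "highest-probability topic". All names are hypothetical.

```python
def split_utterances(utterances):
    """Separate multi-word utterances (LDA training set) from single-word ones.

    Single-word utterances are set aside for the separate, knowledge-based processing.
    """
    tokenised = [u.split() for u in utterances]
    multi = [words for words in tokenised if len(words) > 1]
    single = [words for words in tokenised if len(words) == 1]
    return multi, single


def best_topic_per_word(words, theta, beta):
    """Pick, for each word, the topic k maximising theta[k] * beta[k][w] (assumed score).

    beta is assumed to be a list of dicts with beta[k][w] = P(w | k); unseen words get 0.
    """
    n_topics = len(theta)
    return [
        max(range(n_topics), key=lambda k: theta[k] * beta[k].get(w, 0.0))
        for w in words
    ]
```

For example, `split_utterances(["oui", "du quinze au dix-huit"])` keeps only the second utterance for LDA training, and with the hypothetical table `beta = [{"du": 0.8, "oui": 0.2}, {"du": 0.1, "oui": 0.9}]` and a uniform `theta`, the word "du" goes to topic 0 and "oui" to topic 1.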
2.3 Alignment with Integer Linear
Programming (ILP)
The topic alignment problem we have to solve in this
paper can be considered as a combinatorial optimiza-
tion problem. ILP is a basic method to deal with such
problems.
ILP is a mathematical method for determining
the best way to optimize a given objective. More
specifically, an integer linear model aims at optimiz-
ing a linear objective function (for instance represent-
ing a cost, a benefit, a probability...) subject to lin-
ear equalities or inequalities (named the constraints).
Both the objective function and the constraints are ex-
pressed using integer unknown variables (called de-
cision variables). The coefficients of the objective
function and the constraints are the input data of the
model. Consequently, solving an ILP consists in as-
signing integer values to decision variables, such that
all constraints are satisfied and the objective function
is optimized (maximized or minimized). We refer the interested reader to (Chen et al., 2010)
for an introduction to the approaches and applications of ILP.
We propose an ILP formulation for solving the topic alignment problem for one document. The
input data are: an ordered set d of words (indexed from 1 to N), a set of K available topics
and, for each word $w_i \in d$ and topic $k = 1 \ldots K$, the natural logarithm of the
probability $p(w_i \mid k)$ that k is assigned to $w_i$ in the considered document. Model
[ILP] determines the highest-probability assignment of one topic to each word in the
document, such that at most $\chi_{max}$ different topics are assigned.
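$$[ILP]: \quad \max \sum_{i=1}^{N} \sum_{k=1}^{K} p(w_i \mid k)\, x_{ik} \qquad (3)$$
$$\sum_{k=1}^{K} x_{ik} = 1, \quad i = 1 \ldots N \qquad (4)$$
$$y_k - x_{ik} \ge 0, \quad i = 1 \ldots N,\ k = 1 \ldots K \qquad (5)$$
$$\sum_{k=1}^{K} y_k \le \chi_{max} \qquad (6)$$
$$x_{ik} \in \{0, 1\}, \quad i = 1 \ldots N,\ k = 1 \ldots K$$
$$y_k \in \{0, 1\}, \quad k = 1 \ldots K$$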
Decision variable $x_{ik}$ is equal to 1 if topic k is assigned to word $w_i$, and equal to 0
otherwise. Constraints (4) ensure that exactly one topic is assigned to each word. Decision
variable $y_k$ is equal to 1 if topic k is used. Constraints (5) force variable $y_k$ to take
a value of 1 if at least one variable $x_{ik}$ is not null. Moreover, constraint (6) limits
the total number of topics used. The objective function (3) merely states that we want to
maximize the total probability of the assignment. Through this model, our assignment problem
is identified as a p-centre problem (see e.g. (ReVelle and Eiselt, 2005) for a survey on such
location problems).
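To make the model concrete, the following Python sketch solves the same optimisation problem exactly on a toy instance by enumerating the allowed topic subsets instead of calling an ILP solver. It is an illustrative substitute, not the solver-based approach used in this work: the function name and the toy probabilities are hypothetical, the coefficients are taken to be the log-probabilities given as input data, and the enumeration is only practical for very small K.

```python
import itertools
import math


def align_topics(log_probs, chi_max):
    """Exact solution of the topic alignment problem by enumeration (toy sizes only).

    log_probs[i][k] is the natural log of p(w_i | k).
    Returns (best_score, assignment), where assignment[i] is the topic of word i.
    """
    n_topics = len(log_probs[0])
    # Allowing more topics never hurts, so subsets of maximal size are sufficient:
    # a subset plays the role of the y_k variables set to 1 (constraint 6).
    size = min(chi_max, n_topics)
    best_score, best_assignment = -math.inf, None
    for allowed in itertools.combinations(range(n_topics), size):
        # Exactly one topic per word (constraint 4), chosen among allowed ones (constraint 5).
        assignment = [max(allowed, key=lambda k: row[k]) for row in log_probs]
        score = sum(row[k] for row, k in zip(log_probs, assignment))
        if score > best_score:
            best_score, best_assignment = score, assignment
    return best_score, best_assignment
```

As a usage example with hypothetical probabilities `[[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]` for three words and three topics (converted with `math.log`), allowing three topics gives each word its own best topic, whereas limiting the assignment to two topics moves the middle word to the third topic, since 0.6 × 0.3 × 0.7 is the largest product reachable with two topics. The number of subsets grows combinatorially with K, which is why a dedicated solver is the natural choice for realistic numbers of topics.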
Since the number of instances considered in this work is small, [ILP] can be straightforwardly
solved to optimality using an ILP solver such as ILOG-CPLEX. Numerical results are reported in
Section 3. The considered system is denoted [LDA + ILP] since the $p(w_i \mid k)$ are given by
[LDA]. $\chi_{max}$ has been chosen according to the desired concept annotation. As a concept
support contains 2.1 words on average, $\chi_{max}$ is defined empirically according to the
number of words i in the utterance: for $i \in [[2, 4]]$, $\chi_{max} = i$; for
$i \in [[5, 10]]$, $\chi_{max} = i - 2$; and for utterances containing more than 10 words,
$\chi_{max} = i/2$.
3 EVALUATION AND RESULTS
3.1 MEDIA Corpus
The MEDIA corpus is used to evaluate the proposed
approach. MEDIA is a French corpus related to
the domain of tourism information and hotel reser-
vation (Bonneau-Maynard et al., 2005). 1257 dia-
logues were recorded from 250 speakers with a WoZ
technique (a human simulating an automatic phone
server). In our experiments we only consider the 17k
user utterances. This dataset contains 123,538 words,
for a total of 2470 distinct words.
The MEDIA data have been manually transcribed and semantically annotated. The semantic
annotation comprises 75 concepts (e.g. location, hotel-state, time-month, . . . ). Each
concept is supported by a sequence of words, the concept support. The null concept is used to
annotate every word segment that does not support any of the 74 other concepts. Concepts do
not appear with the same frequency, as shown in Table 1. For example, 33 concepts (44% of the
concepts) are supported by 100 occurrences at most, while 15 concepts (21% of the concepts)
present more than 1,000 occurrences (only null is above 9,000).
Table 1: Number of concepts according to their occurrence range.
Occurrence range      [1,100]   [100,500]   [500,1k]   [1k,9k]   [9k,15k]
Number of concepts    33        21          6          14        1 (null)
On average, a concept support contains 2.1 words,
3.4 concepts are included in a turn and 32% of the
utterances are single-word turns (generally yes or no).
3.2 Automatic Evaluation Protocol
As MEDIA reference concepts are very fine-grained, we introduce a high-level concept
hierarchy containing 18 clusters of concepts. For example, a high-level concept payment is
created, corresponding to the four concepts payment-meansOfPayment, payment-currency,
payment-total-amount and payment-approx-amount; a high-level concept location corresponds to
12 concepts (location-country, location-district, location-street, . . . ). Thus, two levels
of concepts are considered for the evaluation: the high-level (18 classes) and the fine-level
(75 classes).
To evaluate the unsupervised procedure in a fully automatic way, it is necessary to associate
each induced topic with a MEDIA concept. To that end, topics are aligned with concepts based
on their word support for each utterance according to the reference annotation. A
co-occurrence matrix is computed and each topic is associated with its most co-occurring
concept. Table 2 analyses this automatic association for two values of K, the number of
topics induced by LDA. Some concepts may not be associated with any topic; the CC column
(Concept Coverage) gives the percentage of concepts that are associated with a topic. The NNT
column (Not Null Topic) gives the percentage of topics not associated with the null concept.
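The association step and the two statistics of Table 2 can be sketched as follows, assuming that each word of the corpus carries both an induced topic and a reference concept. The function names and the `"null"` label are hypothetical, and whether null itself counts as a covered concept in CC is not stated here, so this sketch simply counts every concept of the list it is given.

```python
from collections import Counter, defaultdict


def associate_topics(word_topics, word_concepts):
    """Map each topic to its most co-occurring reference concept.

    word_topics and word_concepts are parallel sequences with one entry per word.
    """
    cooc = defaultdict(Counter)  # cooc[topic][concept] = number of shared words
    for topic, concept in zip(word_topics, word_concepts):
        cooc[topic][concept] += 1
    return {topic: counts.most_common(1)[0][0] for topic, counts in cooc.items()}


def coverage_stats(topic_to_concept, all_concepts, null_label="null"):
    """Return (CC, NNT) as percentages.

    CC: share of the given concepts associated with at least one topic.
    NNT: share of topics whose associated concept is not the null concept.
    """
    covered = set(topic_to_concept.values()) & set(all_concepts)
    cc = 100.0 * len(covered) / len(all_concepts)
    non_null = sum(1 for c in topic_to_concept.values() if c != null_label)
    nnt = 100.0 * non_null / len(topic_to_concept)
    return cc, nnt
```

For instance, if six words carry the topics `[13, 13, 13, 43, 43, 0]` and the concepts `["time-date", "time-date", "null", "answer", "answer", "null"]`, topic 13 is associated with time-date, topic 43 with answer and topic 0 with null; against a hypothetical inventory of four concepts that also contains location, three of the four concepts are covered and two of the three topics are not null.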
Table 2: Analysis of the automatic association between topics and concepts. CC: concept
coverage by the topics (%). NNT: % of topics not associated with null.
Level   System     K=50 topics     K=200 topics
                   CC     NNT      CC     NNT
high    LDA        61     60       72     51
high    LDA+ILP    67     62       82     57
fine    LDA        21     50       34     47
fine    LDA+ILP    24     58       39     53
Considering [LDA] and fine-level concepts, only one fifth of the MEDIA concepts are retrieved
for K = 50 and up to one third for K = 200. However, 72% of the high-level concepts are
retrieved with K = 200. Considering [LDA + ILP], this value even increases to 82%. This “lost
concept” phenomenon at high-level can be explained by the fact that 72% of the concepts are
supported by fewer than 500 concept supports, which seems a bit low for [LDA] to model them as
a topic. [LDA + ILP] helps to cover more concepts. Obviously, when more topics are induced,
more concepts are covered. It is also interesting to notice that about half of the topics are
associated with the null concept. When the number of topics is increased, more concepts are
discovered but more topics are also associated with null.
3.3 Generated Topic Observations
In Table 3, six topics generated by [LDA] are represented by their 8 highest-probability
words. For topic 13, it is interesting to note that the words have quite similar weights. The
most represented words are “du” (“from”) and “au” (“to”), and the other words are numbers or
months, which a priori leads to a “time-date” topic. For topic 43, the word “oui” (“yes”) is
given a 0.62 probability; other words are “absolument” (“absolutely”) or “d’accord” (“okay”),
leading to an a priori “answer-yes” topic.
To observe which MEDIA concept is associated with these topics, the list of the 3 most
co-occurring concepts and the number of co-occurrences are shown in Table 4. The first two
most co-occurring concepts in a topic are the same in [LDA] and [LDA + ILP]. However, the
number of co-occurrences is higher in [LDA + ILP] than in [LDA]. An entropy measure Ent(t) is
computed for each topic t in order to evaluate the reliability of the topic-concept
association over all the possible concepts. It is computed as follows:
$$Ent(t) = -\sum_{\text{concepts } c} p(c \mid t) \log p(c \mid t) \qquad (7)$$
with $p(c \mid t) = \frac{\#(c \cap t)}{\#t}$
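Equation (7) can be computed directly from the co-occurrence counts of one topic, as in the sketch below. The function name is hypothetical, and the base of the logarithm is not specified in the text, so the natural logarithm is used by default as an assumption, with the base left as a parameter.

```python
import math


def topic_entropy(concept_counts, base=math.e):
    """Entropy of the concept distribution of one topic, as in equation (7).

    concept_counts maps each concept c to #(c and t), the number of words of topic t
    annotated with c; their sum plays the role of #t.
    """
    total = sum(concept_counts.values())
    entropy = 0.0
    for count in concept_counts.values():
        if count > 0:               # concepts with a zero count contribute nothing
            p = count / total       # p(c | t)
            entropy -= p * math.log(p, base)
    return entropy
```

A topic whose words all fall under a single concept has an entropy of 0, and a topic split evenly between two concepts has an entropy of 1 when `base=2`; lower values therefore indicate a more reliable topic-concept association.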
The topic entropy is always smaller considering [LDA + ILP] than [LDA]. This indicates that
the re-assignment due to ILP alignment improves the reliability of the topic-concept
association. Entropies measured with high-level concepts are always lower than with
fine-level concepts, in particular because fewer classes are considered (18 instead of 75).
For topic 18, we can see that the high-level evaluation allows this topic to be considered as
a Location concept and not a null one, but the entropy is quite high. On the other hand,
topic 43 shows a low entropy, specifically in [LDA + ILP]. This shows that the word “yes” is
strongly associated with the concept “Answer”. Other topics representing the null concept can
show an entropy of 0.47, like the 6th topic (“there”, “is”, “what”, “how”, “does”, . . . ).
3.4 Results
The evaluation is presented in terms of F-measure, combining precision and recall measures.
The quality of topic assignment is also considered according to 2 levels:
- alignment corresponds to a full evaluation where each word is considered and associated
  with one topic,
- generation corresponds to the set of topics generated for a turn (no order, no
  word-alignment).
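As a worked illustration of the measure, the balanced F-measure (harmonic mean of precision and recall) reproduces the figures reported in this section, for example 61.6% from a precision of 59.4% and a recall of 64%. The sketch below also shows one possible way to score the generation level for a single turn by comparing bags of concepts; the exact counting protocol is not detailed in the text, so the multiset comparison and the function names are assumptions.

```python
from collections import Counter


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def turn_precision_recall(hyp_concepts, ref_concepts):
    """Generation-level scores for one turn: bags of concepts, no order, no alignment."""
    hyp, ref = Counter(hyp_concepts), Counter(ref_concepts)
    correct = sum((hyp & ref).values())  # multiset intersection
    precision = correct / sum(hyp.values()) if hyp else 0.0
    recall = correct / sum(ref.values()) if ref else 0.0
    return precision, recall
```

An alignment-level score would apply the same counting to (word position, concept) pairs rather than to bare concepts, so that a concept only counts as correct when it is attached to the right words.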
Table 3: Examples of topics discovered by LDA (K = 100).
Topic 0              Topic 13             Topic 18             Topic 35             Topic 33             Topic 43
information          time-date            sightseeing          politeness           location             answer-yes
words         prob.  words         prob.  words         prob.  words         prob.  words         prob.  words         prob.
d’            0.28   du            0.16   de            0.30   au            0.31   de            0.30   oui           0.62
plus          0.17   au            0.11   la            0.24   revoir        0.27   Paris         0.12   et            0.02
informations  0.16   quinze        0.08   tour          0.02   madame        0.09   la            0.06   absolument    0.008
autres        0.10   dix-huit      0.07   vue           0.02   merci         0.08   près          0.06   autre         0.008
détails       0.03   décembre      0.06   Eiffel        0.02   bonne         0.01   proche        0.05   donc          0.007
obtenir       0.03   mars          0.06   sur           0.02   journée       0.01   Lyon          0.03   jour          0.005
alors         0.01   dix-sept      0.04   mer           0.01   villes        0.004  aux           0.02   Notre-Dame    0.004
souhaite      0.003  nuits         0.04   sauna         0.01   bientôt       0.003  gare          0.02   d’accord      0.004
Table 4: Topic repartitions among the high or fine-level concepts for [LDA] and [LDA + ILP] (K = 100).
                   Topic 18 (sightseeing)             Topic 33 (location)                Topic 43 (answer-yes)
System   Level     #occ.  concept            Ent(t)   #occ.  concept            Ent(t)   #occ.  concept      Ent(t)
LDA      high      292    Location           2.25     571    Location           1.72     705    Answer       1.10
                   258    null                        156    null                        107    null
                   94     Name                        87     Comparative                 27     Location
LDA      fine      258    null               2.78     226    loc.-distanceRel.  2.57     705    answer       1.19
                   136    loc.-placeRel.              190    location-city               107    null
                   100    loc.-distanceRel.           156    null                        17     object
LDA+ILP  high      300    Location           2.19     661    Location           1.52     846    Answer       0.76
                   200    null                        123    null                        109    null
                   102    Name                        115    Comparative                 24     Location
LDA+ILP  fine      200    null               2.64     234    loc.-distanceRel.  2.44     846    answer       0.80
                   163    loc.-placeRel.              223    location-city               109    null
                   98     name-hotel                  129    loc.-placeRel.              16     name-hotel
Plots comparing the different systems imple-
mented w.r.t. the different evaluation levels in terms
of F-measure are reported in Figures 1 and 2.
[Figure 1 plot: F-measure (y-axis, 46 to 68) against the number of topics (x-axis, 50 to 200)
for [LDA+ILP] high, [LDA] high, [LDA+ILP] fine and [LDA] fine.]
Figure 1: F-measure of the concept generation as a function of the number of topics.
The [LDA] system generates topics which are well correlated with the high-level concepts. It
can be observed that the bag of 75 topics reaches an F-measure of 61.6% (Figure 1),
corresponding to a precision of 59.4% and a recall of 64%. When [LDA] is asked to generate
too few topics, the induced topics are not specific enough to fit the fine-grained concept
annotation of MEDIA.
[Figure 2 plot: F-measure (y-axis, 34 to 56) against the number of topics (x-axis, 50 to 200)
for [LDA+ILP] high, [LDA] high, [LDA+ILP] fine and [LDA] fine.]
Figure 2: F-measure of the concept alignment as a function of the number of topics.
On the other hand, Figure 2 shows that an excessive increase in the number of topics does not
affect the bag of high-level topics significantly but induces a substantial decrease of the
F-measure for the alignment evaluation. This effect can be explained by the automatic
alignment method chosen to transpose topics into reference concepts. When there are too many
topics, they co-occur with many concepts and are assigned to the most co-occurring one, even
when some other concepts co-occur only slightly less. In such situations, it is likely that
null is the most co-occurring concept and that the other concepts, being too scattered, are
not associated with enough topics. They therefore appear in the utterance but not on enough
words to be retained by the segmentation process.
From the high-level to the fine-level concept evaluation, results globally decrease by 10%. A
loss of 12% is observed from the generation to the alignment evaluation. In the fine-level
evaluation, a maximum F-measure of 52.5% is observed for the generation of 75 topics
(Figure 1), corresponding to 54.9% in precision and 50.3% in recall, whereas the F-measure
decreases to 41% (precision=46.7% and recall=36.7%) in the alignment evaluation (Figure 2).
To conclude on the [LDA] system, we can see that it generates topics having a good
correlation with the high-level concepts, seemingly the best representation level between
topics and concepts. An additional step is clearly needed to obtain a more accurate segmental
annotation, which is what the use of ILP is expected to provide.
[LDA + ILP] performs better whatever the level of evaluation. For instance, an F-measure of
66% is observed considering the high-level concept generation for 75 topics (Figure 1). As
for [LDA], the same losses are observed between high-level and fine-level concepts and
between the generation and alignment paradigms. Nevertheless, an F-measure of 54.8% is
observed at the high-level concept in alignment evaluation (Figure 2), which corresponds to a
precision of 56.2% and a recall of 53.5%; this is not so low considering a fully-automatic
high-level annotation system.
4 CONCLUSIONS
In this paper an approach has been presented for concept discovery and segmental semantic
annotation of user’s turns in an information-query dialogue system. An evaluation based on an
automatic association between generated topics and expected concepts has shown that topics
induced by LDA are close to high-level task-dependent concepts. The segmental annotation
process increases performance both for the generation and alignment evaluations. On the whole
these results confirm the applicability of the technique to practical tasks, with an expected
gain in data production.
Future work will investigate the use of n-grams to
extend LDA and to increase its accuracy for provid-
ing better hypotheses to the following segmentation
techniques. Also another technique for automatic re-
alignment, based on IBM models used in stochastic
machine translation, will be examined.
REFERENCES
Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet al-
location. The Journal of Machine Learning Research,
3:993–1022.
Bonneau-Maynard, H., Rosset, S., Ayache, C., Kuhn, A.,
and Mostefa, D. (2005). Semantic annotation of the
French MEDIA dialog corpus. In Proceedings of Eu-
rospeech.
Celikyilmaz, A., Hakkani-Tur, D., and Tur, G. (2010). LDA
based similarity modeling for question answering. In
Proceedings of the NAACL HLT 2010 Workshop on
Semantic Search, pages 1–9. Association for Compu-
tational Linguistics.
Chen, D., Batson, R. G., and Dang, Y. (2010). Applied
Integer Programming: Modeling and Solution. Wiley.
Iosif, E., Tegos, A., Pangos, A., Fosler-Lussier, E., and
Potamianos, A. (2006). Unsupervised combination of
metrics for semantic class induction. In Proceedings
of the IEEE Spoken Language Technology Workshop,
pages 86–89.
Pargellis, A., Fosler-Lussier, E., Potamianos, A., and Lee,
C. (2001). Metrics for measuring domain indepen-
dence of semantic classes. In Proceedings of Eu-
rospeech.
Phan, X., Nguyen, L., and Horiguchi, S. (2008). Learning
to classify short and sparse text & web with hidden
topics from large-scale data collections. In Proceeding
of the 17th international conference on World Wide
Web, pages 91–100. ACM.
ReVelle, C. S. and Eiselt, H. A. (2005). Location analysis:
A synthesis and survey. European Journal of Opera-
tional Research, 165(1):1–19.
Siu, K. and Meng, H. (1999). Semi-automatic acquisition of
domain-specific semantic structures. In Proceedings
of Eurospeech.
Tam, Y. and Schultz, T. (2006). Unsupervised language
model adaptation using latent semantic marginals. In
Proceedings of Interspeech, pages 2206–2209.