
queries of spoken documents and speech queries of
text documents, the choice of suitable subword units
for multimedia retrieval is important. An advantage
of subword units is that the transcript is readable by
humans and that text queries can be translated into
subword sequences suitable for SDR.
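The translation of a text query into a subword sequence can be illustrated with a minimal sketch. The lexicon entries, the symbols, and the names `TOY_LEXICON` and `text_query_to_subwords` are hypothetical and are not taken from the system described here; a real system would use a full pronunciation dictionary or a grapheme-to-phoneme converter.

```python
# Hypothetical pronunciation lexicon: word -> subword (phoneme-like) symbols.
# The entries and symbols are illustrative only, not taken from the paper.
TOY_LEXICON = {
    "spoken": ["s", "p", "o", "k", "@", "n"],
    "document": ["d", "o", "k", "j", "u", "m", "@", "n", "t"],
}


def text_query_to_subwords(query, lexicon):
    """Translate a text query into one flat subword sequence.

    Words missing from the lexicon fall back to a naive letter-by-letter
    spelling, standing in for a real grapheme-to-phoneme converter.
    """
    sequence = []
    for word in query.lower().split():
        sequence.extend(lexicon.get(word, list(word)))
    return sequence
```

Once the query is in this form, it can be compared with the subword transcript of a spoken document in the same symbol domain.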
The present authors have been developing an SDR
system in which retrieval is conducted by calculating
the distance between parts of subphonetic segment
(SPS) sequences obtained from the underlying speech
recognition. As the system is based
on matching SPS sequences directly, it is
not constrained in terms of vocabulary or grammar,
and is robust with respect to recognition errors (Tanaka,
2001) (Lee, 2002). Most existing SDR systems are
based on matching text, and speech recognition systems
usually integrate the likelihood values of acoustic
phoneme sequences given by top-down hypotheses.
Thus, it should be possible to carry out acoustic
and symbolic processing simultaneously. In this work,
the feasibility of subphonetic units for retrieval in
an SDR system is investigated.
The effect of varying the distance measure is
also examined in an attempt to improve the perfor-
mance of the shift continuous dynamic programming
(Shift-CDP) matching based on SPS sequences. Fi-
nally, SDR experiments are conducted to evaluate the
performance of the proposed system in both monolin-
gual and multilingual tasks.
2 SPOKEN DOCUMENT RETRIEVAL SYSTEM
The spoken document database is assumed to contain
a significantly high proportion of out-of-vocabulary
(OOV) words, such as names and places. Such words are
prone to misrecognition, which degrades retrieval
performance. Speech retrieval is similar to text
retrieval, except for a number of difficulties in
actual application, such as accurate detection of
word boundaries, recognition errors, and acoustic
mismatch. For this reason, existing SDR systems
perform retrieval using a
text-based database linked to multimedia material in
the speech-based database. The SDR system proposed
here aims to retrieve speech keyphrases directly
from the target multimedia database. In this system,
if the target multimedia database contains parts similar
to those in the input queries, the relevant data
can be retrieved using only the accumulated distance
between SPS sequences of arbitrary duration. Such
a scheme is suitable for an open-vocabulary system.
This function can be performed by applying Shift-
CDP for optimal matching between SPS sequences.
This is an essential difference from the conventional
speech processing methods. In the proposed system,
the input utterance is first encoded as acoustic
features. Then, the SPS sequence extracted by a recognizer
is passed to Shift-CDP (Tanaka, 2001). Figure 1
shows the overall block diagram of the proposed SDR
system.
Figure 1: Block diagram of proposed SDR system based on
subphonetic segments
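The idea of retrieving a query from anywhere inside a longer SPS sequence using only an accumulated distance can be sketched as follows. This is not Shift-CDP itself, whose details are given in (Tanaka, 2001) and are not reproduced here; it is a simplified stand-in, namely dynamic programming with free start and end points (approximate substring matching). The names `symbol_distance` and `spot_query`, the 0/1 local distance, and the `gap` penalty are assumptions made for illustration.

```python
def symbol_distance(a, b):
    """Hypothetical local distance: 0 for identical symbols, 1 otherwise."""
    return 0.0 if a == b else 1.0


def spot_query(query, document, dist=symbol_distance, gap=1.0):
    """Find the document span that best matches the whole query.

    Returns (normalized_distance, start, end), where document[start:end]
    is the matched span. The match may start and end anywhere in the
    document, so no word boundaries or vocabulary are required.
    """
    n, m = len(query), len(document)
    if n == 0:
        raise ValueError("query must be non-empty")
    # Row 0: an empty query prefix matches for free at every position j,
    # and a path that begins there has start index j.
    prev_cost = [0.0] * (m + 1)
    prev_start = list(range(m + 1))
    for i in range(1, n + 1):
        # Column 0: the first i query symbols are all skipped.
        cur_cost = [i * gap] + [0.0] * m
        cur_start = [0] * (m + 1)
        for j in range(1, m + 1):
            candidates = (
                # Align query[i-1] with document[j-1].
                (prev_cost[j - 1] + dist(query[i - 1], document[j - 1]),
                 prev_start[j - 1]),
                # Skip a query symbol.
                (prev_cost[j] + gap, prev_start[j]),
                # Skip a document symbol.
                (cur_cost[j - 1] + gap, cur_start[j - 1]),
            )
            cur_cost[j], cur_start[j] = min(candidates, key=lambda c: c[0])
        prev_cost, prev_start = cur_cost, cur_start
    # Free end point: take the best accumulated distance in the last row.
    end = min(range(m + 1), key=lambda j: prev_cost[j])
    return prev_cost[end] / n, prev_start[end], end
```

For example, `spot_query(list("abc"), list("xxabcxx"))` locates the span `document[2:5]` with zero distance. Replacing `symbol_distance` with a graded distance between SPS symbols is one way in which the effect of the distance measure could be explored in such a sketch.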
3 SUBWORD UNITS
In order to allow user-friendly queries of a multi-
media database, speech signals are converted into
words, phonemes, or other subword units, using
a speech recognition system. This work focuses
on an SPS-based approach, in which spoken documents
are recognized as SPS sequences and the retrieval
process is carried out based on the dynamic
programming matching scores of these transcriptions.
Although word-based approaches have consistently
outperformed phoneme-based approaches (Voorhees, 1998),
there are several compelling reasons for using SPS, as
mentioned above.
The present authors have been developing an ar-
chitecture for speech processing systems based on the
universal phonetic code (UPC) (Tanaka, 2001). All
of the speech data in these systems are first encoded
into UPC sequences, and the speech processing
systems, such as recognition, retrieval, and digestion,
are then constructed in the UPC domain. The International
Phonetic Alphabet (IPA) or the Extended Speech
Assessment Methods Phonetic Alphabet (XSAMPA) is
a candidate for the UPC set. Here, SAMPA is a
machine-readable phonetic alphabet. The SPS set is
derived from XSAMPA and refined in consideration
of acoustic-articulatory effects. For example,
XSAMPA (i.e., IPA) contains some categories that are
too detailed to be modeled in an engineering sense.
Therefore, only primary IPA symbols are adopted,