Figure 1: An extracted isosurface from raw MRI without
any preliminary modifications. Observe that the osseous
tissue is merged with the air volume.
which then allows for the application of statistical
analysis on numerical results.
The robustness of VT surface extraction has been
validated by applying the algorithm to a dataset of 3D
MRI from one male and one female test subject uttering
vowels, comprising 109 images.¹ This dataset is used as
the reference for the characteristics of
the MRI data in this article. We show two examples
of challenging VT geometries for the extraction. In
Figure 8, the resolution of the MRI data is insufficient
for resolving the piriform sinuses. In Figure 7
(which is not part of the validation data set), the earlier
version of the extraction algorithm (Aalto et al.,
2013) was unable to resolve the vicinity of the uvula,
whereas the proposed algorithm is able to reproduce the
opening as required.
Measurements were performed on a Siemens Magnetom
Avanto 1.5T scanner using a 3D VIBE MRI sequence
(Rofsky et al., 1999). Further details about
the acquisition of the MRI data are given
in (Aalto et al., 2014, Section 3).
2 BACKGROUND
Vocal tract segmentation from MRI is a long-standing
technical challenge in speech research. Semi-
automated algorithms have been developed since the
MRI resolution and the scanning time first became
practicable for capturing articulation (Niikawa et al.,
1996; Baer et al., 1987; Baer et al., 1991; Engwall and
Badin, 1999; Story et al., 1996; Story et al., 1998;
Takemoto et al., 2004; Aalto et al., 2014; Aalto et al.,
2011). When using 3D MRI, experiments involving
prolonged vowel utterances have been most common.
¹ The original dataset of the two test subjects contained
114 MR images, of which five were deemed failed scans
and hence excluded from the validation set of this article.
More generic software exists that can be
used for extracting VT geometries, but the particular
challenges related to the anatomy of the head and neck
area require a highly tailored approach. For example,
the Vascular Modeling Toolkit (Vascular Modeling
Toolkit, 2016), segmentation software for medical
data on blood vessels, can be used due to the tubular
shape of the VT. However, generic software, or
software mainly intended for other purposes, requires
user input to define initial configurations, etc., as well
as various parameters. These parameter values cannot
always be directly inferred from the data, and the
user must usually proceed by trial and error to
produce high-quality segmented data. Moreover,
sensitivity to parameter values is an issue,
and different parts of the anatomy may benefit
from different parameterisations. All this adds to the
amount of manual work, which easily becomes
prohibitively high for large-scale studies or commercial
applications.
Segmentation approaches based on an estimated
VT centreline have been proposed (Poznyakovskiy
et al., 2015). Such an approach reduces the 3D
segmentation task to two dimensions, where, e.g., active
contour methods (i.e., snakes) can be used. Such
methods require a multitude of parameters. Moreover,
generating the VT centreline, which lacks a unique
definition due to the complicated geometry, requires
special care and is a non-trivial task in
itself. It should be pointed out that VT centrelines and
intersection surface areas are also required in some of
the speech acoustic models. It is, however, a different
matter to derive such centrelines from a triangulated
VT surface model, compared to deriving them from
raw voxel data. In two-dimensional sections, some
parts of the VT (such as the piriform sinuses and the
valleculae) may appear disconnected even though they
are connected in three dimensions. This adds to the
complications when using an active contour method.
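The connectivity issue in two-dimensional sections can be illustrated with a minimal sketch. It is not part of the proposed algorithm: the U-shaped toy cavity, the function names, and the choice of face connectivity are all assumptions made for illustration. The code counts connected components of the same air volume, first in three dimensions and then in a single planar section.

```python
from collections import deque


def count_components(cells):
    """Count face-connected components in a set of integer coordinate tuples."""
    cells = set(cells)
    if not cells:
        return 0
    dim = len(next(iter(cells)))
    # Face neighbours differ by +/-1 along exactly one axis.
    steps = []
    for axis in range(dim):
        for delta in (-1, 1):
            step = [0] * dim
            step[axis] = delta
            steps.append(tuple(step))
    seen = set()
    count = 0
    for start in cells:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first flood fill of one component
            current = queue.popleft()
            for step in steps:
                neighbour = tuple(c + s for c, s in zip(current, step))
                if neighbour in cells and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return count


def make_u_shaped_cavity(height=4):
    """Hypothetical toy air volume as (z, y, x) voxels: two branches joined at z = 0."""
    voxels = {(0, 1, x) for x in (1, 2, 3)}
    for z in range(1, height):
        voxels.add((z, 1, 1))
        voxels.add((z, 1, 3))
    return voxels


def axial_section(voxels, z):
    """Return the 2D section of the volume at a fixed z as (y, x) pixels."""
    return {(y, x) for (vz, y, x) in voxels if vz == z}


air = make_u_shaped_cavity()
print("components in 3D:", count_components(air))
print("components in section z=2:", count_components(axial_section(air, 2)))
```

In this toy geometry, the two branches form a single component in 3D but two separate components in any section above the junction, which is the situation a contour-based 2D method would have to handle explicitly.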
An almost automatic segmentation technique was
presented in (Aalto et al., 2013), and the current work
is based on the lessons learned since then. The earlier
approach requires artefact model geometries for the
maxilla and mandible that have to be created for each
test subject separately. The artefact models are then
automatically aligned with the surface extracted from
the target data in three dimensions in order to mask
away the osseous structures that would interfere with
the air volume in the VT as shown in Fig. 1. Both the
surface models had to be represented as point clouds
BIOIMAGING 2017 - 4th International Conference on Bioimaging