network is an effective way to train an acoustic
We have presented a novel application of CNNs to
phoneme recognition. We have shown how the
TIMIT speech corpus can be used to produce labelled
spectrogram patches for CNN-AM training. The
results, whilst not surpassing the current state of the
art, are encouraging, and the usability and
transparency of the output processing indicate
that CNNs are a viable approach to speech
recognition. We have also run some initial
experiments with NTIMIT, which contains noise from
various telephone networks and, being telephone
speech, has a narrower frequency range [0, 3.3 kHz].
Typically, NTIMIT results are around 10% lower than
those for TIMIT. In our preliminary tests, however,
our NTIMIT results are within 1% of the TIMIT
network's performance, which suggests that the CNN
approach is comparatively robust to noise.
In the near future, we plan to develop strategies to
acquire large volumes of phonetic transcriptions for
training a more robust CNN-AM. We are also in the
process of training a sequence-to-sequence language
model to transform the phonetic output into text.
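
As an illustration of the labelled spectrogram patches referred to above, the following is a minimal sketch of how phone-level transcriptions might be turned into training examples. It relies only on the TIMIT conventions of 16 kHz audio and .phn files listing a start sample, an end sample and a phone label per line. The function names, the 25 ms window, the 10 ms hop, the 15-frame patch width and the synthetic input are all assumptions made for illustration and are not the settings used in this work. The optional max_freq_hz argument crops the spectrogram to a narrower band such as that of NTIMIT.

```python
import numpy as np

def parse_phn(text):
    """Parse TIMIT-style .phn lines: '<start_sample> <end_sample> <phone>'."""
    segments = []
    for line in text.strip().splitlines():
        start, end, phone = line.split()
        segments.append((int(start), int(end), phone))
    return segments

def spectrogram(signal, sample_rate, win_len=400, hop=160, max_freq_hz=None):
    """Log-magnitude STFT of shape (n_frames, n_bins).

    win_len and hop are illustrative defaults (25 ms / 10 ms at 16 kHz).
    If max_freq_hz is given, bins above that frequency are discarded.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack(
        [signal[i * hop: i * hop + win_len] * window for i in range(n_frames)]
    )
    spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)
    if max_freq_hz is not None:
        freqs = np.fft.rfftfreq(win_len, d=1.0 / sample_rate)
        spec = spec[:, freqs <= max_freq_hz]
    return spec

def labelled_patches(spec, segments, hop=160, patch_frames=15):
    """Cut one fixed-width patch around the midpoint of each phone segment."""
    half = patch_frames // 2
    patches, labels = [], []
    for start, end, phone in segments:
        centre = ((start + end) // 2) // hop  # frame index by frame start
        lo, hi = centre - half, centre + half + 1
        if lo < 0 or hi > spec.shape[0]:
            continue  # skip segments too close to the utterance edges
        patches.append(spec[lo:hi])
        labels.append(phone)
    return np.array(patches), labels

# Synthetic one-second "utterance" and a hypothetical transcription.
rng = np.random.default_rng(0)
sample_rate = 16000
signal = rng.standard_normal(sample_rate)
phn = "0 4000 h#\n4000 8000 sh\n8000 12000 iy\n12000 16000 h#"

spec = spectrogram(signal, sample_rate, max_freq_hz=3300)
patches, labels = labelled_patches(spec, parse_phn(phn))
print(patches.shape, labels)
```

Each patch is paired with the label of the phone at its centre, so the resulting arrays can be passed to any image-style classifier. Taking one patch per segment midpoint is only one possible sampling scheme.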
