Authors: Engin Bumbacher (1) and Vivienne Ming (2)
Affiliations: (1) EPFL, Switzerland; (2) U.C., United States
Keyword(s): Pitch perception, Gist, Sparse coding, Generative hierarchical models, Gaussian mixture models, Bayesian inference, Auditory processing, Speech processing.
Related Ontology Subjects/Areas/Topics: Applications; Audio and Speech Processing; Bayesian Models; Cardiovascular Imaging and Cardiography; Cardiovascular Technologies; Digital Signal Processing; Exact and Approximate Inference; Health Engineering and Technology Applications; Multimedia; Multimedia Signal Processing; Pattern Recognition; Perception; Signal Processing; Software Engineering; Telecommunications; Theory and Methods
Abstract:
The neural basis of pitch perception, our subjective sense of the tone of a sound, has been one of the great ongoing debates in neuroscience. Variants of the two classic theories, spectral Place theory and temporal Timing theory, continue to drive new experiments and debates (Shamma, 2004). Here we approach the question of pitch by applying a theoretical model based on the statistics of natural sounds. Motivated by gist research (Oliva and Torralba, 2006), we extended the nonlinear hierarchical generative model developed by Karklin and Lewicki (2003) with a parallel gist pathway. The basic model encodes higher-order structure in natural sounds by capturing variations in the underlying probability distribution. The secondary pathway provides a fast biasing of the model’s inference process based on the coarse spectrotemporal structure of sound stimuli on broader timescales. Adapting our extended model to speech demonstrates that the learned code describes a more detailed and broader range of statistical regularities, reflecting abstract properties of sound such as harmonics and pitch, than models without the gist pathway. The spectrotemporal modulation characteristics of the learned code are better matched to the modulation spectrum of speech signals than those of alternative models, and its higher-level coefficients capture information that not only effectively clusters related speech signals but also describes smooth transitions over time, encoding the temporal structure of speech signals. Finally, we find that the model produces a type of pitch-related density component that combines temporal and spectral qualities.