ance of the model to a variety of pitch phenomena to
the model.
However, the present implementation of the model hampers its ability to capture this distinct relationship between the low-frequency and the high-frequency units robustly and with high precision, for two reasons. First, the restricted number of sparse components inherently introduces a trade-off in their frequency resolution as well as in their capacity to encode the phase of the signal. Second, the model training follows a block-based approach to signal encoding, with the sound segments drawn at random from the set of available sentences, irrespective of the phase structure of the signals. The representation of the higher-order structure related to temporal pitch is therefore highly sensitive to the learning process. We plan to address these issues in future work.
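To make the block-based sampling concrete, the following is a minimal sketch assuming the training sentences are available as NumPy waveform arrays; the function name draw_training_block and its parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def draw_training_block(sentences, block_len, rng):
    """Draw one fixed-length training segment from a randomly chosen sentence,
    at a random offset, i.e. irrespective of the phase structure of the signal."""
    sentence = sentences[rng.integers(len(sentences))]
    start = rng.integers(len(sentence) - block_len + 1)
    return sentence[start:start + block_len]

# Example: a batch of 100 blocks of 512 samples each.
# sentences = [np.asarray(w) for w in waveforms]   # hypothetical corpus
# rng = np.random.default_rng(0)
# batch = np.stack([draw_training_block(sentences, 512, rng) for _ in range(100)])
```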
APPENDIX
Inference in the Fully Extended Density
Component Model
As described previously, the alterations of the original density component model affect the encoding and learning procedures, while the generative model itself remains the same. As the transformation from the sound to the higher-order representation $v_{dc}$ is fundamentally nonlinear, the optimal coefficient values for the representation cannot be expressed in closed form². To encode a given signal, the MAP estimates of the sparse component (SC) and density component (DC) coefficients are obtained as follows (illustrated in figure 6):
1. Choose a whitened sound extract $x_w$ of length $T$.
2. Generate the corresponding spectrogram $S_x$ of temporal length $T_S > T$ and frequency resolution $F_{res}$, using a logarithmic scaling of the frequencies.
3. Calculate the gist information:
   (a) Project $S_x$ into the space of the first 50 principal components, which explain approximately 95% of the variance (sketched in code below):

       $u_G = W^{S}_{pca} S_x$, with $W^{S}_{pca} = D_S^{-1/2} E_S^{T}$,     (12)
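A minimal sketch of step 3(a) and eq. (12) is given below, assuming the log-frequency spectrograms of step 2 have already been computed and flattened into column vectors; the corpus S_train, the function names, and the explicit centering step are illustrative assumptions rather than the authors' implementation. Fixing 50 components here simply mirrors the ~95% variance criterion stated above.

```python
import numpy as np

def fit_pca_whitening(S_train, n_components=50):
    """Estimate W_pca^S = D_S^{-1/2} E_S^T (eq. 12) from a corpus of flattened
    spectrograms S_train of shape (dim, n_examples)."""
    mean = S_train.mean(axis=1, keepdims=True)      # standard PCA centering,
                                                    # not shown explicitly in eq. (12)
    cov = np.cov(S_train)                           # (dim, dim) covariance
    evals, evecs = np.linalg.eigh(cov)              # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_components]  # keep the leading components
    D_S = evals[order]                              # top eigenvalues
    E_S = evecs[:, order]                           # corresponding eigenvectors
    W_pca = np.diag(1.0 / np.sqrt(D_S)) @ E_S.T     # W_pca^S = D_S^{-1/2} E_S^T
    return W_pca, mean

def gist_projection(S_x, W_pca, mean):
    """u_G = W_pca^S S_x for one flattened spectrogram S_x (eq. 12)."""
    return W_pca @ (S_x - mean.ravel())

# Usage:
# W_pca, mu = fit_pca_whitening(S_train)            # S_train: (dim, n_examples)
# u_G = gist_projection(S_x_flat, W_pca, mu)        # S_x_flat: (dim,)
```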
² Closed form means that the expression can be written analytically in terms of a bounded number of certain well-known functions (i.e., no infinite series, etc.).