Regularity structure of the BM vibration pattern. A, Vibration of the basilar membrane produced by a periodic sound S(x,t) (clarinet musical note), at places x tuned to different frequencies (modeled by band-pass filters). B, The vibration at one place is transformed into spikes produced by an auditory nerve fiber (bottom, poststimulus time histogram of spikes). In Licklider’s model, the fiber projects to a coincidence detector neuron through two axons with conduction delays differing by δ. The neuron fires maximally when the signal’s periodicity T equals δ. C, If the signal’s period T is smaller than the neuron’s refractory time, then the neuron must detect coincidences between spikes coming from different fibers. D, If the fibers originate from slightly different places x and y on the cochlea, then the neuron responds to similarities between BM vibrations at different places. E, Vibration pattern of the BM produced by a nonperiodic sound (noise): there is no regularity structure across place and time. F, Vibration pattern produced by a musical note: there are signal similarities across time (horizontal arrows) and place (oblique arrow).
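The coincidence principle in panel B can be illustrated with a minimal sketch. It is an abstraction, not the model used in the paper: the mean product of a signal with its delayed copy stands in for the rate of coincident spikes, and the sampling rate, harmonic amplitudes, delay range, and all function names are assumptions chosen for illustration.

```python
import math

def harmonic_signal(n_samples, f0_hz, fs_hz, amplitudes=(1.0, 0.5, 0.25)):
    """Periodic test signal: a sum of harmonics of f0 (illustrative amplitudes)."""
    return [
        sum(a * math.sin(2 * math.pi * (k + 1) * f0_hz * n / fs_hz)
            for k, a in enumerate(amplitudes))
        for n in range(n_samples)
    ]

def coincidence_score(signal, delay, start, length):
    """Mean product of the signal with its delayed copy over a fixed window."""
    return sum(signal[n] * signal[n - delay]
               for n in range(start, start + length)) / length

def best_delay(signal, delays, window):
    """Delay (in samples) that maximizes the coincidence score."""
    start = max(delays)  # keeps every delayed index n - delay non-negative
    return max(delays, key=lambda d: coincidence_score(signal, d, start, window))

FS_HZ = 22000            # assumed sampling rate
F0_HZ = 220              # period T = FS_HZ / F0_HZ = 100 samples
DELAYS = range(20, 151)  # candidate delays, in samples (assumed range)
WINDOW = 2000            # averaging window: 20 full periods

signal = harmonic_signal(max(DELAYS) + WINDOW, F0_HZ, FS_HZ)
delta = best_delay(signal, DELAYS, WINDOW)
print(delta, delta / FS_HZ)  # best delay in samples and in seconds
```

With these assumed settings the best delay coincides with the signal's period T, which is the behavior the caption attributes to the coincidence detector neuron.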
Harmonic resolvability and cross-channel structure. A, Amplitude and phase spectrum of two gammatone filters. Only a pure tone of frequency f (“Input” waveform) is attenuated in the same way by the two filters (red and blue waveforms: filter outputs). At that frequency, the delay between the outputs of the two filters is δ = Δφ/f. B, If several harmonic components fall within the bandwidths of the two filters, then the outputs of the two filters differ (no cross-channel similarity). C, Excitation pattern produced on the cochlea by a harmonic complex. Top, Amplitude versus center frequency of gammatone filters. Bottom, Spectrum of harmonic complex and of gammatone filters. Harmonic components are resolved when they can be separated on the cochlear activation pattern. Higher-frequency components are unresolved because cochlear filters are broader. D, Resolved components produce cross-channel similarity between many pairs of filters (as in A). Unresolved components produce little cross-channel structure (as in B). E, Thus, the vibration pattern produced by resolved components displays both within-channel and cross-channel structure (left), while unresolved components only produce within-channel structure (right).
Domain of existence of pitch. A, Within-channel structure produced by a periodic sound can be decoded if the sound’s period is smaller than the maximal neural delay δmax. When δmax = 4 ms, this holds for sounds with a fundamental frequency greater than 250 Hz. B, A pure tone or resolved harmonic produces cross-channel structure with arbitrarily small delays between channels, corresponding to the phase difference between the two filters at the sound’s frequency: here a 100 Hz tone produces two identical waveforms delayed by δ = 2 ms, while the sound’s period is 10 ms. C, A transposed tone with a high-frequency carrier (>4 kHz) modulated by a low-frequency envelope (<320 Hz) elicits a very weak pitch (Oxenham et al., 2004a) (top: f0 = 120 Hz). Such sounds produce only within-channel structure because they only have high-frequency content (middle). The structural theory of pitch predicts an absence of pitch when the envelope’s period is larger than δmax, which is consistent with psychophysics if δmax < 3 ms. D, A pure tone with the same fundamental frequency (f0 = 120 Hz) produces cross-channel structure with short delays. The structural theory of pitch predicts the existence of pitch in this case, consistent with psychophysical results (Oxenham et al., 2004a). E, Complex tones with f0 between 400 Hz and 2 kHz and all harmonics above 5 kHz elicit a pitch (Oxenham et al., 2011) (top, spectrum of a complex tone; middle, temporal waveform). Such tones produce only within-channel structure at high frequencies (bottom), and the structural theory of pitch predicts the existence of pitch if the sound’s period is smaller than δmax, which is consistent with psychophysics if δmax > 2.5 ms.
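The delay arithmetic behind panels A, C, and E can be checked with a short sketch. It only encodes the condition stated in the caption (period smaller than δmax); the function names are hypothetical, and the δmax values passed in are the examples quoted in the caption rather than measured quantities.

```python
def min_f0_within_channel(delta_max_s):
    """Lower bound (Hz) on f0: within-channel structure needs f0 above this."""
    return 1.0 / delta_max_s

def within_channel_decodable(f0_hz, delta_max_s):
    """True if the sound's period 1/f0 is smaller than the maximal delay."""
    return 1.0 / f0_hz < delta_max_s

# Panel A: delta_max = 4 ms gives a lower bound of about 250 Hz.
print(min_f0_within_channel(0.004))

# Panel C: a 120 Hz envelope (period about 8.3 ms) exceeds delta_max = 3 ms.
print(within_channel_decodable(120.0, 0.003))

# Panel E: f0 = 400 Hz (period 2.5 ms) fits once delta_max exceeds 2.5 ms.
print(within_channel_decodable(400.0, 0.004))
```

Taken together, the two psychophysical constraints quoted in panels C and E bracket δmax between 2.5 ms and 3 ms under this reading.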
Neural network model of pitch estimation using within- and cross-channel structure. A, Spectrogram of a trumpet sound showing the first two harmonics. Two neurons with CF around the first harmonic and input delay δ receive the same signal (red and blue rectangles and input signals below). As a result, the two neurons fire synchronously for all three neuron models used: biophysical model of chopper and octopus cells, and leaky integrate-and-fire model (voltage traces). B, Spectrogram of a rolling sea wave sound, which shows no regularity structure. In particular, the two neurons do not receive the same signals (input, shaded area: difference between the two signals) and thus do not fire synchronously. C, Spectrogram of a harpsichord sound with unresolved harmonics in high frequency. The inset shows the periodicity of the envelope. Two neurons fire synchronously if they receive inputs from the same place delayed by δ = 1/f0. D, In the same high-frequency region, the inharmonic sound of a sea wave does not produce within-channel structure and therefore the two neurons do not fire synchronously. E, Synaptic connections for a pitch-selective group tuned to f0 = 220 Hz. Harmonics are shown on the left (red comb) superimposed on auditory filters. Resolved harmonics (bottom) produce regularity structure both across and within channels: color saturation represents the amplitude of the filter output while hue represents its phase for different delays (horizontal axis) and characteristic frequencies (vertical axis). Neurons with the same color fire synchronously and project to a common neuron. Unresolved harmonics (top) produce regularity structure within channels only. Here two identical colors correspond to two identical input signals only when the neurons have identical CF (same row). F, Same as E for f0 = 261 Hz, producing a different regularity structure, corresponding to a different synchrony pattern in input neurons. Synchronous neurons project to another group of neurons, selective for this pitch.
Pitch recognition by a neural network model based on the structural theory. A, Top, Spectrogram of a sequence of sounds, which are either environmental noises (inharmonic) or musical notes of the chromatic scale (A3–A4) played by different instruments. Bottom, Firing rate of all pitch-specific neural groups responding to these sounds (vertical axis: preferred pitch, A3–A4). B, Distribution of firing rates of pitch-specific groups for instruments played at the preferred pitch (blue) and for noises (grey) for three different sound levels. C, Top, Pitch recognition scores of the model (horizontal axis: error in semitones) on a set of 762 notes between A2 and A4, including 41 instruments (587 notes) and five sung vowels (175 notes). Bottom, Firing rate of all pitch-specific groups as a function of the difference between presented f0 and preferred f0, for all sounds (solid black: average). Peaks appear at octaves (12 semitones) and perfect fifths (7 semitones). D, Impact of the number of frequency channels (top) and maximal delay δmax (bottom) on recognition performance.
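The semitone axis in panel C follows the usual equal-temperament convention, in which a frequency ratio r corresponds to 12·log2(r) semitones. The sketch below shows why octaves and perfect fifths fall at 12 and about 7 semitones; the function name is hypothetical and the example frequencies assume standard tuning (A3 = 220 Hz), which the caption does not state.

```python
import math

def semitone_difference(f_presented_hz, f_preferred_hz):
    """Signed distance in equal-tempered semitones between two frequencies."""
    return 12.0 * math.log2(f_presented_hz / f_preferred_hz)

A3_HZ = 220.0  # assumed standard tuning

octave = semitone_difference(2.0 * A3_HZ, A3_HZ)  # frequency ratio 2:1
fifth = semitone_difference(1.5 * A3_HZ, A3_HZ)   # frequency ratio 3:2
print(octave, round(fifth, 2))
```

A just perfect fifth (ratio 3:2) is very close to, but not exactly, 7 equal-tempered semitones, so it lands on the 7-semitone bin of the error axis.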
Pitch discriminability. A, Two neurons tuned to the same frequency (within-channel) but with delay mismatch δ = 1/f produce phase-locked spikes (red and blue crosses) in response to a tone (sine waves). When the tone frequency is f (left), the two input signals match and the difference of phases of spikes ΔΦ(f) between the two neurons is distributed around 0 (shaded curve). When the tone frequency is f + df (right), the two signals are slightly mismatched and the distribution of ΔΦ(f + df) is not centered on 0. B, Two neurons tuned to different frequencies (cross-channel) respond at different mean phases to tones (red and blue curves). C, The discriminability index d' is defined as the distance µ between the centers of the two phase difference distributions (ΔΦ(f) and ΔΦ(f + df)) relative to their standard deviation σ. D, The standard deviation of the phase distribution is related to the precision of phase locking, measured by the vector strength (dots: vector strength vs characteristic frequency for guinea pig auditory fibers; solid curve: fit). E, Mean phase of spikes produced by auditory nerve fibers of guinea pigs for different tone frequencies (data from Palmer and Shackleton, 2009), as a function of CF (crosses) with fits (solid lines). F, Weber fraction (Δf/f, where Δf is the just noticeable difference in frequency) as a function of tone frequency for cross-channel structure (colored curves) and within-channel structure (black curve). Colors represent different frequency spacings between the two channels (1–6 semitones). Dotted lines represent the limitations implied by a maximal delay δmax = 5 ms.
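The quantities in panels C and D can be sketched as follows. The vector strength is computed with its standard definition (length of the mean unit phase vector) and d' as µ/σ, as in the caption. The conversion from vector strength to phase standard deviation assumes a wrapped-normal phase distribution; that is an assumption made here for illustration, not necessarily the relation used in the paper, and all function names are hypothetical.

```python
import cmath
import math

def vector_strength(phases_rad):
    """Length of the mean unit vector of spike phases (1 = perfect locking)."""
    mean_vector = sum(cmath.exp(1j * p) for p in phases_rad) / len(phases_rad)
    return abs(mean_vector)

def sigma_from_vector_strength(vs):
    """Phase standard deviation (rad), ASSUMING a wrapped-normal distribution,
    for which vs = exp(-sigma**2 / 2). Requires 0 < vs <= 1."""
    return math.sqrt(-2.0 * math.log(vs))

def d_prime(mu, sigma):
    """Discriminability index: distance between distribution centers over sigma."""
    return mu / sigma

# Illustrative numbers only (not data from the paper).
print(vector_strength([0.0, 0.1, -0.1, 0.05]))
print(d_prime(0.2, sigma_from_vector_strength(0.8)))
```

Under this assumption, weaker phase locking (lower vector strength) gives a larger σ and therefore a smaller d' for the same shift µ.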