Speech consists of vibrations produced in the vocal tract. The vibrations themselves can be represented by speech waveforms. It is not possible to read the phonemes directly from a waveform, but if we break down the waveform into its frequency components, we obtain a spectrogram which can be deciphered. The quality of a sound such as a vowel depends upon the resonant frequencies it contains, the so-called formants, which result from the different shapes of the vocal tract. These formants are shown as dark horizontal bars on the spectrogram.
A spectrogram such as the one at the bottom of Figure 2.3 is
created by displaying all of the Linear Predictive Coding (LPC)
parameters computed from the speech waveform. The vertical axis in a
spectrogram represents frequency, spanning 0 to 8 kHz (from the bottom to
the top). All of the spectra computed by the Fourier transform are
displayed parallel to this vertical or y-axis. The horizontal axis
represents time; as we move right along the x-axis we shift forward in
time, traversing one spectrum after another. For reference, we
performed a spectral analysis on 25-msec sections of the waveform, taken at
intervals of 2 msec, using a broad analysis filter.
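
The analysis described above can be sketched as a short-time Fourier analysis: cut a 25-msec section, compute its spectrum, move forward 2 msec, and repeat. The sketch below is illustrative only and makes several assumptions not stated in the text: a 16 kHz sampling rate (chosen so that the top frequency bin falls at 8 kHz), a Hamming taper on each section, and a plain discrete Fourier transform rather than LPC parameters. The function name `spectrogram` and its parameters are hypothetical.

```python
import cmath
import math


def spectrogram(samples, sample_rate=16000, win_ms=25.0, hop_ms=2.0):
    """Return (frames, bin_hz): one magnitude spectrum per analysis section."""
    win = int(round(sample_rate * win_ms / 1000))  # 400 samples at 16 kHz
    hop = int(round(sample_rate * hop_ms / 1000))  # 32 samples at 16 kHz
    # Hamming taper (an assumed choice; the text does not name a window).
    taper = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win - 1))
             for n in range(win)]
    # Precomputed complex exponentials for a direct DFT.
    twiddle = [cmath.exp(-2j * math.pi * m / win) for m in range(win)]
    n_bins = win // 2 + 1  # bins from 0 Hz up to sample_rate / 2
    frames = []
    # Each step to the right on the time axis is one new section.
    for start in range(0, len(samples) - win + 1, hop):
        section = [samples[start + n] * taper[n] for n in range(win)]
        spectrum = []
        for k in range(n_bins):
            acc = sum(section[n] * twiddle[(k * n) % win] for n in range(win))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames, sample_rate / win


# Usage: 50 msec of a synthetic 1 kHz tone standing in for speech.
rate = 16000
tone = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(800)]
frames, bin_hz = spectrogram(tone, rate)
# Index of the strongest frequency bin in the first section.
peak_bin = max(range(len(frames[0])), key=frames[0].__getitem__)
peak_hz = peak_bin * bin_hz
```

Each inner list in `frames` corresponds to one vertical slice of the spectrogram, and plotting the slices side by side with darkness proportional to magnitude gives the familiar picture. With these assumed settings the bins are 40 Hz apart, so the 1 kHz test tone peaks in bin 25. The direct DFT is used only to keep the sketch dependency-free; a fast Fourier transform would normally replace it.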