The raw speech signal is first converted into a sequence of
acoustic vectors (the input pattern sequence) that represent
some of the speech dynamics (see
Figure 2.9). Many signal analysis techniques are available
that extract useful features while retaining the important
information. The most conventional method is Fourier analysis
(computed with the FFT), which yields discrete frequency components
over time. Frequencies are
often warped onto a Mel scale, which is approximately linear in the
low range but logarithmic in the high range [16]. In the
system, Mel Frequency Cepstral Coefficients (MFCCs),
i.e., MFCCs(12) + MFCCs(12) + Energy(1), were used
to parameterise the speech. All speech was sampled at 16 kHz with
16-bit resolution, and features were extracted by shifting a 25 ms
Hamming window in steps of 10 ms.
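
The framing parameters above can be illustrated with a minimal sketch in plain Python. At 16 kHz, a 25 ms window spans 400 samples and a 10 ms shift spans 160 samples. The function names (`hz_to_mel`, `hamming`, `frame_signal`), the 440 Hz test tone, and the particular Mel formula are assumptions made for illustration. The text does not state which Mel variant or which toolkit was used, and the sketch stops before the FFT and cepstral steps.

```python
import math

SAMPLE_RATE = 16000  # Hz, as stated in the text
WIN_MS = 25          # Hamming window duration, as stated in the text
HOP_MS = 10          # window shift, as stated in the text


def hz_to_mel(f_hz):
    """Map a frequency in Hz to Mels.

    Assumption: one widely used formula; the text does not say which
    Mel variant was applied. It is close to linear at low frequencies
    and logarithmic at high frequencies.
    """
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def hamming(n):
    """Return the n coefficients of a Hamming window."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))
            for i in range(n)]


def frame_signal(samples, sample_rate=SAMPLE_RATE,
                 win_ms=WIN_MS, hop_ms=HOP_MS):
    """Split samples into overlapping Hamming-weighted frames.

    Trailing samples that do not fill a whole window are dropped
    (no padding), which is a simplification chosen for this sketch.
    """
    win_len = sample_rate * win_ms // 1000  # 400 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000  # 160 samples at 16 kHz
    window = hamming(win_len)
    frames = []
    for start in range(0, len(samples) - win_len + 1, hop_len):
        chunk = samples[start:start + win_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# Hypothetical input: one second of a 440 Hz tone standing in for speech.
signal = [math.sin(2.0 * math.pi * 440.0 * t / SAMPLE_RATE)
          for t in range(SAMPLE_RATE)]
frames = frame_signal(signal)
```

Each resulting frame would then be passed through the FFT, a Mel filterbank, and a cepstral transform to obtain the MFCCs; those stages are omitted here.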