2.3.1 Signal Processing

The raw speech signal is first converted into a sequence of acoustic vectors (the input pattern sequence) that captures part of the speech dynamics (see Figure 2.9). Many signal analysis techniques can extract useful features without discarding important information. The most conventional is Fourier analysis, usually computed with the FFT, which yields the discrete frequency content of the signal over time. The frequencies are often warped onto a Mel scale, which is roughly linear in the low range and logarithmic in the high range [16]. In this system, Mel Frequency Cepstral Coefficients (MFCCs), i.e., MFCCs(12) + MFCCs(12) + Energy(1), were used to parameterise the speech. All speech was sampled at 16 kHz with 16-bit resolution, and features were extracted by shifting a 25 msec Hamming window in steps of 10 msec.
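The framing parameters above can be made concrete with a short sketch. At 16 kHz, a 25 msec window spans 400 samples and a 10 msec shift spans 160 samples. The code below is only an illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, and the Mel formula shown is one commonly used variant, since the text does not say which one was applied.

```python
import math

SAMPLE_RATE = 16000  # Hz, as stated in the text
WINDOW_MS = 25       # Hamming window duration, as stated in the text
SHIFT_MS = 10        # frame shift, as stated in the text


def hz_to_mel(freq_hz):
    """Map Hz to Mel with one common formula (assumed, not from the text)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def hamming(length):
    """Return Hamming window coefficients for a window of `length` samples."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(samples, sample_rate=SAMPLE_RATE,
                 window_ms=WINDOW_MS, shift_ms=SHIFT_MS):
    """Split `samples` into overlapping frames, each Hamming-weighted."""
    win_len = sample_rate * window_ms // 1000  # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000     # 160 samples at 16 kHz
    window = hamming(win_len)
    frames = []
    # Keep only complete frames; a trailing partial frame is dropped.
    for start in range(0, len(samples) - win_len + 1, shift):
        chunk = samples[start:start + win_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# Usage: one second of a synthetic 440 Hz tone stands in for real speech.
tone = [math.sin(2.0 * math.pi * 440.0 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
frames = frame_signal(tone)
print(len(frames), len(frames[0]))
```

Each windowed frame would then go through the FFT, a Mel filterbank, and a cepstral transform to produce the MFCCs; those steps are omitted here to keep the sketch short.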


Jo Chul-Ho
Wed Oct 13 17:59:27 JST 1999