3.4.1 Segmentation

An utterance to be recognized is more complex than a steady sound; a speech pattern almost always involves a sequence of short-time acoustic representations. A pattern-comparison technique for speech recognition therefore has to compare sequences of acoustic features. The difficulty of spectral sequence comparison for speech is that different acoustic renditions, or tokens, of the same speech utterance (e.g., word, phrase, sentence) are seldom realized at the same speaking rate across the entire utterance. Hence, when comparing different tokens of the same utterance, variation in speaking rate and duration should not contribute to the dissimilarity score. Speaking-rate fluctuations must therefore be normalized before the utterance comparison is meaningful and a recognition decision can be made [18].
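The time-normalization problem can be illustrated with a minimal dynamic-programming alignment in the style of dynamic time warping; this is only a one-dimensional sketch (the feature values and absolute-difference local distance are hypothetical), not the alignment method used in this system:

```python
def dtw_distance(seq_a, seq_b):
    """Accumulated distance along the best warping path between two
    sequences, so length/rate mismatch alone does not add cost."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # cost[i][j]: best accumulated distance aligning seq_a[:i] with seq_b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])  # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_a
                                 cost[i][j - 1],      # stretch seq_b
                                 cost[i - 1][j - 1])  # one-to-one match
    return cost[n][m]

# Two "tokens" of the same contour spoken at different rates:
slow = [1, 1, 2, 2, 3, 3]
fast = [1, 2, 3]
print(dtw_distance(slow, fast))  # → 0.0: rate difference adds no cost
```

A naive frame-by-frame comparison of these two tokens would be impossible (the lengths differ) or heavily penalized after crude resampling; the warping path absorbs the rate difference.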

For this reason, we adopted HMMs (Hidden Markov Models) to solve this time-alignment problem. To generate phonemic segmentations, we use Japanese phoneme HMMs with 3 states and 16 mixture components, trained on the ASJ (Acoustical Society of Japan) database of 132 male speakers (approximately 20K sentence utterances per speaker). Since HMMs are constructed only for phonemes, a word or sentence HMM is generated by concatenating phoneme HMMs according to the transcribed phonemic symbols. Viterbi alignment is then applied to the given utterance: the Viterbi algorithm finds the single best path through the trellis defined by the speech frames and HMM states, and thereby aligns the frames to phonemes. The algorithm also yields the log-likelihood of this best path as an HMM score, which represents the degree of match: the larger it is, the closer the utterance is to the standard phoneme models.
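A minimal sketch of this Viterbi forced alignment follows. It is not the actual system: a strictly left-to-right topology with one state per phoneme is assumed, single Gaussians stand in for the 16-component mixtures, and all parameter values are hypothetical. It shows how the best trellis path simultaneously gives the per-frame phoneme labels and the log-likelihood score:

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian (stand-in for a mixture emission)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_align(frames, states):
    """states: list of (phoneme, mean, var), concatenated left to right.
    Transitions (self-loop or advance) are taken as equally likely.
    Returns (best-path log-likelihood, per-frame phoneme labels)."""
    T, S = len(frames), len(states)
    NEG = float("-inf")
    delta = [[NEG] * S for _ in range(T)]   # best score ending in state s at t
    back = [[0] * S for _ in range(T)]      # backpointers for the path
    delta[0][0] = log_gauss(frames[0], states[0][1], states[0][2])
    for t in range(1, T):
        for s in range(S):
            # predecessors: stay in s, or advance from s - 1
            best_prev, best_val = s, delta[t - 1][s]
            if s > 0 and delta[t - 1][s - 1] > best_val:
                best_prev, best_val = s - 1, delta[t - 1][s - 1]
            if best_val == NEG:
                continue
            delta[t][s] = best_val + log_gauss(frames[t], states[s][1], states[s][2])
            back[t][s] = best_prev
    # Backtrace from the final state: the alignment must end there.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return delta[T - 1][S - 1], [states[s][0] for s in path]

# Toy utterance /a i/ with made-up 1-D features:
frames = [0.0, 0.1, 0.9, 1.0, 1.1]
states = [("a", 0.0, 0.1), ("i", 1.0, 0.1)]
score, labels = viterbi_align(frames, states)
print(labels)  # → ['a', 'a', 'i', 'i', 'i']
```

The label sequence read off the backtrace is exactly the phonemic segmentation, and `score` is the best-path log-likelihood used as the matching degree.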

Each phonemic symbol is labeled below the acoustic waveform window in Figure 3.3. We have evaluated the precision of our segmentation over utterances and confirmed that the central portion of each phoneme is located precisely, whereas an inconsistent gap (within 30 ms) remains at its boundaries. It is generally known that there are virtually no abrupt boundaries between phonemes; even experienced phoneticians, when asked to locate phoneme boundaries precisely, do not always agree with each other [22]. In this system, therefore, only the central portion of each phoneme is used to analyze the cause of articulation errors.
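Extracting the central portion can be sketched as trimming the unreliable boundary region from each segment; the 10-ms frame shift and the segment format below are assumptions for illustration, with only the 30-ms boundary uncertainty taken from the text:

```python
FRAME_SHIFT_MS = 10   # assumed analysis frame shift
TRIM_MS = 30          # boundary region with uncertain segmentation
TRIM_FRAMES = TRIM_MS // FRAME_SHIFT_MS

def central_portion(start_frame, end_frame):
    """Trim TRIM_FRAMES from each side of a phoneme segment; return
    the remaining (start, end) range, or None if nothing survives."""
    s = start_frame + TRIM_FRAMES
    e = end_frame - TRIM_FRAMES
    return (s, e) if s < e else None

# Hypothetical segments as (phoneme, start_frame, end_frame):
segments = [("a", 0, 12), ("k", 12, 16), ("i", 16, 30)]
for ph, s, e in segments:
    print(ph, central_portion(s, e))
# → a (3, 9) / k None / i (19, 27)
```

Very short phonemes can disappear entirely after trimming, so a practical system would need a fallback (e.g., keeping the middle frame) for such segments.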



Jo Chul-Ho
Wed Oct 13 17:59:27 JST 1999