6.4.9 SpectralMeanNormalization

6.4.9.1 Outline of the node

This node subtracts the mean of the features from the input acoustic features. For real-time processing, however, the mean of the current utterance is not available until the utterance ends, so it cannot be subtracted directly. The mean must therefore be estimated or approximated from other values.

6.4.9.2 Necessary file

No files are required.

6.4.9.3 Usage

When to use

This node is used to subtract the mean of acoustic features. It removes the mismatch in mean values between the recording environment of the audio data used for acoustic model training and that of the audio data used for recognition. Microphone properties often cannot be standardized across speech recording environments, and the environments used for acoustic model training and for recognition are not necessarily the same. Since different people are usually in charge of creating the speech corpus for training and of recording the audio data for recognition, it is difficult to arrange identical environments. It is therefore necessary to use features that do not depend on the recording environment. For example, the microphones used to acquire training data and those used for recognition usually differ. Differences in microphone properties appear as a mismatch in the acoustic features of the recorded sound, which degrades recognition performance. Because these differences do not change over time, they appear as a difference in the mean spectra. The components that depend only on the recording environment can therefore be removed from the features by subtracting the mean spectra.
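The effect described above can be illustrated with a small sketch. It assumes that the microphone mismatch acts as a constant additive offset in the feature domain (as it would for log-spectral features); the feature values, the offset, and the function name subtract_mean are hypothetical and chosen only for illustration.

```python
# Illustrative sketch: a time-invariant offset in the features (a stand-in
# for a microphone mismatch) vanishes after mean subtraction.

def subtract_mean(frames):
    """Subtract the per-dimension mean over all frames of one utterance."""
    n_frames = len(frames)
    n_dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / n_frames for d in range(n_dims)]
    return [[f[d] - mean[d] for d in range(n_dims)] for f in frames]

# Hypothetical 3-frame, 2-dimensional features recorded with microphone A.
mic_a = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]

# The same utterance through microphone B, modeled (as an assumption)
# as a constant offset added to every frame.
offset = [0.5, -1.0]
mic_b = [[f[d] + offset[d] for d in range(2)] for f in mic_a]

norm_a = subtract_mean(mic_a)
norm_b = subtract_mean(mic_b)
```

Under this assumption, norm_a and norm_b agree up to floating-point rounding, since the constant offset is absorbed entirely into the mean.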

Typical connection

\includegraphics[width=100mm]{fig/modules/SpectralMeanNormalization}
Figure 6.69: Connection example of SpectralMeanNormalization 

6.4.9.4 Input-output and property of the node

Table 6.62: Parameter list of SpectralMeanNormalization 

Parameter name | Type | Default value | Unit | Description
FBANK_COUNT | int | 13 | | Number of dimensions of the input feature parameter

Input

FBANK

: Map<int, ObjectRef>  type. A pair of the sound source ID and feature vector as Vector<float>  type data.

SOURCES

: Vector<ObjectRef> type. Sound source position.

Output

OUTPUT

: Map<int, ObjectRef>  type. A pair of the sound source ID and feature vector as Vector<float>  type data.

Parameter

FBANK_COUNT

: int type. The number of dimensions of the input feature parameter; the default is 13. The value must be 0 or a positive integer.

6.4.9.5 Details of the node

This node subtracts the mean of the features from the input acoustic features. For real-time processing, however, the mean of the current utterance is not available until the utterance ends, so it cannot be subtracted directly. The mean must therefore be estimated or approximated from other values. Real-time mean subtraction is realized by taking the mean of the previous utterance as an approximation and subtracting it in place of the mean of the current utterance.

This method must also take the sound source direction into account. Since transfer functions differ with the sound source direction, the mean of the previous utterance is a poor approximation of the mean of the current utterance when the two utterances arrive from different directions. In such a case, the mean of the most recent earlier utterance from the same direction is used as the approximation. Finally, once the current utterance ends, its mean is calculated and kept in memory as the mean for that direction, to be used in subsequent mean subtraction. When the sound source moves by more than 10 degrees during an utterance, the mean is calculated as if it were another sound source.
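The procedure above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the node's actual implementation: the class name DirectionalMeanNormalizer, its methods, the use of a single azimuth angle per source, the choice to subtract nothing when no earlier mean exists for a direction, and the replacement of older means for the same direction are all hypothetical.

```python
class DirectionalMeanNormalizer:
    """Hypothetical sketch of direction-dependent mean subtraction."""

    def __init__(self, dims=13, tolerance_deg=10.0):
        self.dims = dims
        self.tolerance_deg = tolerance_deg
        self.stored = []   # (azimuth_deg, mean_vector) of finished utterances
        self.active = {}   # source_id -> [azimuth_deg, sum_vector, frame_count]

    @staticmethod
    def _angle_diff(a, b):
        """Smallest absolute difference between two azimuths in degrees."""
        return abs((a - b + 180.0) % 360.0 - 180.0)

    def _lookup(self, azimuth):
        """Closest stored mean within the tolerance, or zeros if there is none."""
        best = None
        for az, mean in self.stored:
            diff = self._angle_diff(az, azimuth)
            if diff <= self.tolerance_deg and (best is None or diff < best[0]):
                best = (diff, mean)
        # Assumption: with no earlier utterance from this direction, subtract nothing.
        return best[1] if best else [0.0] * self.dims

    def end_utterance(self, source_id):
        """Compute the mean of the finished utterance and keep it for its direction."""
        az, total, count = self.active.pop(source_id)
        mean = [s / count for s in total]
        # Assumption: a newer mean replaces older ones for about the same direction.
        self.stored = [(a, m) for a, m in self.stored
                       if self._angle_diff(a, az) > self.tolerance_deg]
        self.stored.append((az, mean))

    def process(self, source_id, azimuth, feature):
        """Normalize one frame using the earlier mean for this direction."""
        state = self.active.get(source_id)
        if state is not None and self._angle_diff(state[0], azimuth) > self.tolerance_deg:
            # The source moved by more than the tolerance: treat it as another source.
            self.end_utterance(source_id)
            state = None
        if state is None:
            state = [azimuth, [0.0] * self.dims, 0]
            self.active[source_id] = state
        # Accumulate the raw feature so the mean of this utterance can be stored later.
        state[1] = [s + x for s, x in zip(state[1], feature)]
        state[2] += 1
        mean = self._lookup(state[0])
        return [x - m for x, m in zip(feature, mean)]


# Usage with hypothetical 2-dimensional features.
smn = DirectionalMeanNormalizer(dims=2)
first = [smn.process(0, 30.0, f) for f in ([1.0, 2.0], [3.0, 4.0])]
smn.end_utterance(0)
second = smn.process(1, 33.0, [2.0, 3.0])   # within 10 degrees of the first utterance
other = smn.process(2, 90.0, [2.0, 3.0])    # no earlier utterance from this direction
```

In this sketch the first utterance passes through unchanged because no mean has been stored yet, the second is normalized with the mean of the first because it arrives from nearly the same direction, and the utterance from 90 degrees is left unchanged because no mean exists for that direction.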