Comparison with native speakers is necessary to establish the norm ranges required for detecting a learner's errors. The scores are computed automatically using native acoustic models and represent the degree of match between the non-native speech and the native models: the higher the score, the better the speech fits the phoneme models and the better the perceived quality[23]. However, to detect pronunciation errors automatically, the degree of match that separates acceptable from erroneous pronunciation must be defined. We therefore investigate several techniques for automatic pronunciation error detection. Such techniques are usually based on empirically derived thresholds on native speakers' scores. Common variations arise from whether the thresholds are derived from a single native model speaker or from a group of native speakers. Some researchers have argued that an announcer, who is generally trained in the standard language, is desirable for defining norm-range scores. In the experiments, we tried several possible methods for each task. The effectiveness of each technique can be evaluated by its correlation with human judgement on the training speech of non-native learners.
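As an illustration of the evaluation step, the following minimal sketch correlates automatic error decisions with human judgements. The text does not say which correlation measure is used; the Pearson coefficient is assumed here, and the names `pearson`, `machine_flags` and `human_flags`, as well as the data, are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Hypothetical data: 1 = judged as an error, 0 = judged as acceptable.
machine_flags = [1, 0, 0, 1, 1, 0]  # automatic decisions
human_flags = [1, 0, 0, 1, 0, 0]    # human judgements (assumed labels)

agreement = pearson(machine_flags, human_flags)
print(f"correlation with human judgement: {agreement:.3f}")
```

With binary decisions on both sides, this value is the phi coefficient; graded human ratings could be passed in the same way.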
The general idea of the absolute (global) threshold technique is that a human listener can judge by ear whether a word or sentence is pronounced well, and this judgement can therefore be expressed as a threshold function. This technique was applied to the M-set task in the experiment.
Another possible technique is to use relative thresholds, i.e., to set a local threshold for each phoneme. For statistical analysis in particular, absolute thresholds on phoneme scores are of little use[24]. The thresholds are defined from the speech of one ideal model speaker. This method was applied to the T-set task in the experiment.
To detect pronunciation errors caused by linguistic disparity, the two methods above are combined. That is, the mean and standard deviation of the scores are calculated for each phoneme in the speech of native speakers and used as relative (local) thresholds. The absolute (global) threshold is defined as the average of all the local thresholds. In our preliminary experiments, the combined use of these two kinds of thresholds detected pronunciation errors more reliably than either the per-phoneme thresholds or an absolute threshold alone. A pronunciation is judged to be an error when its score exceeds both the global threshold and the threshold of its phoneme. In the experiment, this novel method was applied to the P-set task.
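The combined decision rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments. The text does not say how the mean and standard deviation are turned into a local threshold, so the form mean + k × standard deviation with a hypothetical factor `k` is assumed. The comparison direction ("exceeds") follows the rule as stated above. All function names and score values are hypothetical.

```python
from statistics import mean, stdev


def local_thresholds(native_scores, k=1.0):
    """Per-phoneme (local) thresholds from native speakers' scores.

    Assumption: threshold = mean + k * standard deviation. The factor k
    is hypothetical and is not given in the text.
    """
    return {ph: mean(s) + k * stdev(s) for ph, s in native_scores.items()}


def global_threshold(local):
    """Global threshold: the average of all local thresholds."""
    return mean(list(local.values()))


def is_error(phoneme, score, local, global_thr):
    """Flag an error only when the score exceeds BOTH thresholds."""
    return score > global_thr and score > local[phoneme]


# Hypothetical native scores for two phonemes.
native_scores = {"a": [1.0, 2.0, 3.0], "s": [3.0, 4.0, 5.0]}
local = local_thresholds(native_scores)   # local thresholds: a = 3.0, s = 5.0
global_thr = global_threshold(local)      # average of 3.0 and 5.0 = 4.0

print(is_error("a", 3.5, local, global_thr))  # above local only
print(is_error("a", 4.5, local, global_thr))  # above both thresholds
print(is_error("s", 4.5, local, global_thr))  # above global only
```

Requiring both conditions means that a phoneme with a low local threshold is not flagged unless its score is also above the global level, and a phoneme with a high local threshold is not flagged merely for passing the global one.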