3.6.4 Correlation between Human Ratings and Combined Thresholds

Machine-produced scores should be consistent with the ratings given by human judges, so it is important to verify the consistency between human and automatic scoring. To validate the automatic scoring method, we had five native speakers of Japanese judge the utterances of eight learners for the phonemes /j/ and /u:/. The judges used a binary rating, i.e., correct or incorrect, and together they covered every speech sample. Each judge listened to the speech material and assigned scores individually.

Since it was not possible to have every judge score every speech sample (this would have taken too much time and been too tiring for the judges), the samples of one consonant and one vowel were divided proportionally among the five judges. The scores assigned by the judges were then combined to compute correlations with the machine scores. We were particularly interested in how the automatic scores would relate to the judges' scores, and the agreement between the two confirmed the reliability of the automatic scoring method. These results are shown in Figure 3.7 below.

Figure 3.7: Agreement between automatic scoring and perceptual judgement
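
The combination of judge ratings and their comparison with the machine scores, as described above, might be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the judges' scores were combined or which correlation measure was used, so the majority vote, the point-biserial (Pearson) correlation, and all function names here are hypothetical choices. It also assumes that a higher log-likelihood score means a better pronunciation.

```python
from statistics import mean


def majority_vote(ratings):
    """Combine binary ratings (1 = correct, 0 = incorrect) for one sample.

    Assumption: a simple majority vote; ties count as incorrect.
    Works with any number of ratings, since not every judge
    rated every sample.
    """
    return 1 if sum(ratings) * 2 > len(ratings) else 0


def point_biserial(labels, scores):
    """Pearson correlation between binary labels and continuous scores."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    mean_l, mean_s = mean(labels), mean(scores)
    cov = sum((l - mean_l) * (s - mean_s) for l, s in zip(labels, scores))
    var_l = sum((l - mean_l) ** 2 for l in labels)
    var_s = sum((s - mean_s) ** 2 for s in scores)
    if var_l == 0 or var_s == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_l * var_s) ** 0.5


def agreement_rate(labels, scores, threshold):
    """Fraction of samples where thresholded scores match the human labels.

    Assumption: a score at or above the threshold counts as correct.
    """
    machine = [1 if s >= threshold else 0 for s in scores]
    return sum(h == m for h, m in zip(labels, machine)) / len(labels)
```

With per-sample lists of judge ratings and one machine score per sample, `majority_vote` yields the combined human label, which can then be passed to `point_biserial` or `agreement_rate` together with the scores.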

The phonemes that showed a significant degradation in score were also perceived as incorrect by most of the judges. Although a distinct correlation between automatic scoring and perceptual judgement is not observed overall, the two match for the extremely degraded samples, which enables automatic error detection. The results indicate that the human judgements correlate strongly with the HMM log-likelihood scores on the basis of the global threshold, but ambiguities remain for scores that fall between the global threshold and the local thresholds. The HMM log-likelihood score therefore appears to be a good predictor of the human ratings.
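
The ambiguous region between the global and local thresholds can be illustrated with a three-way decision rule. This is a hypothetical sketch, not the combination rule used in this work: the function name `classify_score`, the labels, and the assumption that higher log-likelihood scores indicate better pronunciation are all introduced here for illustration. The rule makes no assumption about which of the two thresholds is larger.

```python
def classify_score(score, global_threshold, local_threshold):
    """Three-way decision on an HMM log-likelihood score.

    Assumptions (not taken from the source): higher scores are better,
    and scores lying between the two thresholds are treated as ambiguous.
    """
    low, high = sorted((global_threshold, local_threshold))
    if score < low:
        return "incorrect"   # below both thresholds
    if score >= high:
        return "correct"     # at or above both thresholds
    return "ambiguous"       # between the two thresholds
```

Samples labelled "ambiguous" under such a rule correspond to the region where agreement with the human judges is least certain and could be set aside for manual review.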

Throughout the perceptual judgement, pronunciation errors were successfully detected by our statistical automatic scoring method, except for a few speech samples characterized by accent and rhythm [27]. The greater the segmental error, the easier it is to detect. Here, we need to discuss the distribution of vocalic errors observed in our data set, regardless of the L1 background of the non-native learners. We must first point out that a specific error does not necessarily receive the same rating, because the human judgement was influenced by the context in which the error occurred. The influence of context, and what exactly is meant by context, is still open to debate and cannot yet be investigated for lack of data. It is nevertheless an important issue, given that the task consists of matching automatic scoring with the evaluations of human judges.


Jo Chul-Ho
Wed Oct 13 17:59:27 JST 1999