Acoustic Analysis of Dialogue Speech
Shigeyoshi KITAZAWA, Satoshi KOBAYASHI, Takao MATSUNAGA,
Hidetsugu ICHIKAWA, and Junichi NISHIYAMA
Department of Computer Science, Shizuoka University
5-1, 3 Chome, Hamamatsu, 432, Japan
e-mail: kitazawa@cs.shizuoka.ac.jp
Nonverbal communication plays an important role in human dialogue,
where participants use natural, so-called spontaneous, speech.
The nonverbal information exchanged between people is called
"paralanguage". Paralanguage has a number of aspects.
Some of them are subjective impressions that are difficult to
measure. Others, however, are measurable as concrete
acoustic features. Here we describe how to measure the speech
rate. Another aspect of dialogue research is the transcription into
text of paralinguistic features such as voice loudness, tone, and
speech rate. We measured the validity and consistency
of descriptions made by different transcribers.
The speech rate is the number of moras per second. A mora is
a prosodic unit consisting of a consonant combined with a short
vowel. One way to measure the rate of speech is
through phoneme recognition: the time of each
phoneme is marked along the time axis, and the resulting phoneme (hence
mora) durations are averaged over some interval to give a number
of moras per second. This measurement, however, is difficult to
realize with automatic speech recognition technology and is practical
only with hand labeling. There are several possible hypotheses about
how the rate of speech is perceived. We estimate the tempo
independently of phonemes and do not use speech recognition technology, since
we assume that the rate of speech can be recognized without recognizing the
content of the speech. For example, we can perceive the utterance speed
from narrow-band filtered speech sound. This fact suggests that
our sense of tempo can be derived from the envelope of the
waveform.
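The moras-per-second definition can be illustrated with a minimal
sketch. It assumes hand-labeled mora onset times are available and
simply counts the onsets inside an interval, which is a simplification
of averaging mora durations. The function name mora_rate and the onset
values are hypothetical and are not taken from this study.

```python
def mora_rate(onsets, start, end):
    """Return moras per second in [start, end), given mora onset times in seconds."""
    if end <= start:
        raise ValueError("end must be later than start")
    # Count the labeled onsets that fall inside the analysis interval.
    count = sum(1 for t in onsets if start <= t < end)
    return count / (end - start)


# Hypothetical hand-labeled mora onsets (seconds); not data from this study.
onsets = [0.00, 0.14, 0.27, 0.41, 0.55, 0.70, 0.83, 0.96]
rate = mora_rate(onsets, 0.0, 1.0)  # eight onsets fall inside the one-second interval
```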
In the preceding study, rhythm was correlated with the interval
between the centers of energy of adjacent syllables.
Because Japanese is CV-syllable-timed, dips in the
envelope appear at every consonant segment at almost equal
intervals. The speech envelope changes dynamically from a consonant to a
vowel and then to the next consonant, forming peaks and valleys. The
intervals between peaks and valleys are expected to be approximately
equal, or periodic, because of the syllable-timed feature. If this is
true, then we can extract this periodicity by taking
the DFT (Discrete Fourier Transform) of the Hamming-windowed
envelope pattern. We employed a window size of about one second
throughout our experiment. The window includes local pauses and
non-lexical voicing due to nonverbal or paralinguistic expressions.
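The periodicity extraction can be sketched as follows: apply a Hamming
window to about one second of envelope, compute DFT bins, and take the
frequency of the largest magnitude as the rate estimate. This is only
an illustration under stated assumptions: the mean removal, the 1 to
20 Hz search range, and the name dominant_frequency are not specified
in the text, and the input here is a synthetic 7 Hz envelope.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    if n < 2:
        return [1.0] * n
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dominant_frequency(envelope, fs, f_min=1.0, f_max=20.0):
    """Frequency (Hz) of the largest DFT bin of the windowed envelope in [f_min, f_max]."""
    n = len(envelope)
    mean = sum(envelope) / n  # assumption: remove DC so it cannot dominate the peak search
    windowed = [(x - mean) * w for x, w in zip(envelope, hamming(n))]
    best_f, best_mag = None, -1.0
    for k in range(1, n // 2 + 1):
        f = k * fs / n  # center frequency of DFT bin k
        if f < f_min or f > f_max:
            continue
        acc = sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(windowed))
        if abs(acc) > best_mag:
            best_f, best_mag = f, abs(acc)
    return best_f


# Synthetic one-second envelope sampled at 200 Hz with a 7 Hz periodicity.
fs_env = 200
example = [0.5 + 0.5 * math.cos(2 * math.pi * 7 * i / fs_env) for i in range(fs_env)]
estimate = dominant_frequency(example, fs_env)
```

With a one-second window the bins are spaced 1 Hz apart, so the
estimate is quantized to whole moras per second in this sketch.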
To obtain an envelope of the speech waveform, we first
half-wave rectified the signal and then
low-pass filtered the result to obtain an approximate envelope. We designed a
low-pass filter with a cutoff frequency of 80 Hz to keep the envelope
details. This cutoff is about ten times the average number of moras per
second.
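The envelope extraction can be sketched as half-wave rectification
followed by a low-pass filter with an 80 Hz cutoff. The filter type is
not given in the text, so the first-order recursive filter below is an
assumption, as are the function names and the amplitude-modulated test
tone used as input.

```python
import math


def half_wave_rectify(samples):
    """Keep positive half-waves and set negative samples to zero."""
    return [max(x, 0.0) for x in samples]


def low_pass(samples, fs, cutoff_hz=80.0):
    """First-order (single-pole) low-pass filter; the filter type is an assumption."""
    dt = 1.0 / fs
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)  # move the state toward the current input
        out.append(y)
    return out


def envelope(samples, fs, cutoff_hz=80.0):
    """Approximate envelope: rectify, then smooth."""
    return low_pass(half_wave_rectify(samples), fs, cutoff_hz)


# Illustrative input: a 1 kHz tone sampled at 8 kHz, amplitude-modulated at 5 Hz.
fs = 8000
tone = [
    (0.5 + 0.5 * math.cos(2 * math.pi * 5 * n / fs)) * math.sin(2 * math.pi * 1000 * n / fs)
    for n in range(fs)
]
env = envelope(tone, fs)
```

A first-order filter leaves some carrier ripple in the envelope; a
steeper filter would smooth it further at the cost of more delay.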
The speaking rate is observed as a dominant spectral peak in the
frequency domain, so it can be represented visually in the
frequency-time plane, like the formant patterns of a spectrogram. The
frequency and time axes are scaled down to one two-hundredth of those of
a normal spectrogram at the 8 kHz sampling rate. We could observe
gray-scale patterns in the 20 Hz frequency region with a
one-second time window.
We also used bandpass-filtered speech from an auditory
model as the source of the speech envelope. For both the
bandpass-filtered and the full-band speech waves, the spectrogram of
the wave envelope showed concentrated spectral energy around the
frequency corresponding to the speech rate.
First, we tested the method on synthesized sounds: a
stationary half-wave rectified signal with short silent gaps
corresponding to consonant intervals, and then a signal with gradually
decreasing intervals between envelope peaks. These test signals were processed
according to the procedure described above to obtain spectrograms of the
envelope. We could observe dark bars corresponding to the speech
rates.
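Test signals of this kind can be generated with a short sketch. The
carrier frequency, gap length, and period values are assumptions
chosen for illustration, since the text does not give them, and
gapped_tone is a hypothetical name.

```python
import math


def gapped_tone(periods, gap=0.03, fs=8000, carrier_hz=200.0):
    """Half-wave rectified tone; each period (seconds) ends with a silent gap."""
    samples = []
    for period in periods:
        n_total = int(round(period * fs))
        n_gap = min(int(round(gap * fs)), n_total)
        start = len(samples)
        # Voiced part: positive half-waves of a sine carrier.
        for i in range(n_total - n_gap):
            t = (start + i) / fs
            samples.append(max(math.sin(2 * math.pi * carrier_hz * t), 0.0))
        # Silent gap standing in for a consonant interval.
        samples.extend([0.0] * n_gap)
    return samples


# Stationary signal: eight equal periods in one second.
stationary = gapped_tone([0.125] * 8)
# Accelerating signal: intervals shrink from 0.20 s to 0.11 s.
accelerating = gapped_tone([0.20 - 0.01 * k for k in range(10)])
```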
We then examined real dialogue speech taken from TV programs. The
actual speech rate was measured manually by segmenting individual
phonemes. We then computed DFTs of the envelope waves and took the
frequency of the peak energy as an estimate of the speech rate. The
manual and DFT estimates correlated with a coefficient of
0.57.
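A correlation coefficient of this kind can be computed as sketched
below. The sketch assumes a Pearson coefficient over paired
per-utterance rates, which the text does not state explicitly, and the
values are hypothetical, not the measurements from this study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


# Hypothetical rates in moras per second; not the measurements from this study.
manual = [6.5, 7.2, 8.0, 8.8, 9.5]
estimated = [7.0, 6.0, 9.0, 8.0, 10.0]
r = pearson(manual, estimated)
```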
Spectrograms of the speech envelope of real speech show a more
complicated texture than those of the test signals. Therefore, it was
difficult to recognize the speech rate as a unique dark bar pattern.
Text-encoded dialogue is a useful form of data for analysis;
however, inconsistencies between transcribers, omissions, and misleading
descriptions are unavoidable. We estimated these errors and inconsistencies by
cross-checking different transcribers. Their descriptions
follow the TEI encoding scheme for utterances and paralinguistic
descriptions of real dialogue. We found 92% agreement between
different transcribers for phonetic transcriptions and 50%
disagreement for nonverbal transcriptions. These results suggest
that some acoustic parameters would help make the description of
nonverbal features more consistent.
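An agreement figure between two transcribers can be sketched as the
fraction of aligned items that carry the same label. The text does not
say how agreement was computed, so this item-by-item comparison, the
function name, and the example labels are all assumptions.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of aligned positions where two transcribers gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("transcriptions must be aligned to the same length")
    if not labels_a:
        raise ValueError("transcriptions must not be empty")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


# Hypothetical paralinguistic labels for four aligned utterances.
transcriber_1 = ["loud", "fast", "soft", "laugh"]
transcriber_2 = ["loud", "slow", "soft", "laugh"]
agreement = agreement_rate(transcriber_1, transcriber_2)  # 3 of 4 labels match
```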
Keywords: dialogue speech, speech rate, TEI, evaluation of description