The audition system of a highly intelligent humanoid must localize sound sources and identify the meaning of each sound in the auditory scene. Active audition improves sound source tracking by integrating audition, vision, and motor movements. Given multiple sound sources in the auditory scene, "SIG the humanoid" actively moves its head to improve localization, aligning its microphones orthogonal to a sound source and capturing possible sound sources by vision. However, such active head movement inevitably creates motor noise, so the system must adaptively cancel the motor noise using motor control signals. Experimental results demonstrate that active audition, by integrating audition, vision, and motor control, enables sound source tracking in a variety of conditions.
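As a rough illustration of this kind of adaptive cancellation, the sketch below removes the part of the microphone signal that is linearly predictable from a motor-noise reference signal with a normalized LMS filter. The function name, the filter length, and the idea of deriving the reference directly from the motor command are assumptions made for illustration, not the implementation used on SIG.
\begin{verbatim}
import numpy as np

def nlms_cancel(primary, reference, n_taps=32, mu=0.5, eps=1e-8):
    """Remove the component of `primary` that is linearly predictable
    from `reference` (e.g. a motor-noise reference derived from the
    motor command) with a normalized LMS adaptive filter.  Returns
    the error signal, i.e. the input with the correlated noise removed."""
    w = np.zeros(n_taps)                 # adaptive filter weights
    out = np.array(primary, dtype=float)
    for n in range(n_taps - 1, len(primary)):
        x = reference[n - n_taps + 1:n + 1][::-1]   # latest reference samples
        y = w @ x                                   # estimated noise component
        e = primary[n] - y                          # cleaned sample
        w += mu * e * x / (x @ x + eps)             # NLMS weight update
        out[n] = e
    return out

# Toy check: a 440 Hz tone buried in filtered "motor" noise.
rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
speech = 0.5 * np.sin(2 * np.pi * 440 * t)
motor = rng.normal(size=fs)
noise = np.convolve(motor, [0.6, 0.3, 0.1])[:fs]    # causal noise path
cleaned = nlms_cancel(speech + noise, motor)
print(np.mean(noise ** 2), np.mean((cleaned - speech)[1000:] ** 2))
\end{verbatim}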
Real-Time Multiple Speaker Tracking System
Real-time processing is crucial for sensorimotor tracking tasks, and multiple-object tracking is crucial for real-world applications. Tracking multiple sound sources requires perceiving a mixture of sounds and cancelling the motor noise caused by body movements; however, real-time processing of this task had not been reported before. Real-time tracking is attained by fusing information obtained from sound source localization, multiple face recognition, speaker tracking, focus-of-attention control, and motor control. Auditory streams with sound source direction are extracted from 48 kHz sampled sound by the active audition system with motor noise cancellation. Visual streams with face ID and 3D position are extracted by combining skin-color extraction, correlation-based matching, and multiple-scale image generation from a single camera. The auditory and visual streams are associated by comparing their spatial locations, and the associated streams are used to control the focus of attention. Auditory, visual, and association processing run asynchronously on different PCs connected by a TCP/IP network. The resulting system, implemented on an upper-torso humanoid, can track multiple objects with a delay of 200\,msec, imposed by visual tracking and network latency.
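As an illustration of the association step, the following sketch pairs auditory streams (which carry only a direction) with visual face streams (which carry a face ID and a 3D position) when their azimuths are close. The data layout, the 10-degree threshold, and the head-centered coordinate convention are assumptions made for this example, not the system's actual data structures.
\begin{verbatim}
import math

def associate_streams(auditory, visual, max_angle_deg=10.0):
    """Associate auditory streams (azimuth only) with visual streams
    (face ID plus 3D position) when their directions are close.
    auditory: list of dicts {"id": ..., "azimuth": degrees}
    visual:   list of dicts {"face_id": ..., "xyz": (x, y, z)} in a
              head-centered frame (x forward, y left).
    Returns a list of (auditory id, face_id) pairs."""
    pairs, used = [], set()
    for a in auditory:
        best, best_diff = None, max_angle_deg
        for v in visual:
            if v["face_id"] in used:
                continue
            x, y, _ = v["xyz"]
            v_az = math.degrees(math.atan2(y, x))    # azimuth of the face
            diff = abs(a["azimuth"] - v_az)
            if diff <= best_diff:
                best, best_diff = v, diff
        if best is not None:
            used.add(best["face_id"])
            pairs.append((a["id"], best["face_id"]))
    return pairs

print(associate_streams(
    [{"id": "snd-1", "azimuth": 28.0}],
    [{"face_id": "alice", "xyz": (1.0, 0.5, 0.0)},
     {"face_id": "bob",   "xyz": (1.0, -0.8, 0.0)}]))
\end{verbatim}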
Three Simultaneous Speech Recognition
We are working on listening to three simultaneous talkers with a humanoid that has only two microphones. In such situations, sound separation and automatic speech recognition (ASR) of the separated speech are difficult, because the number of simultaneous talkers exceeds the number of microphones, the signal-to-noise ratio is quite low (around -3 dB), and the noise is non-stationary owing to the interfering voices. To improve recognition of three simultaneous speech signals, two key ideas are introduced: acoustic modeling of the robot head by scattering theory, and two-layered audio-visual integration of both name and location, that is, speech and face recognition, and speech and face localization. Sound sources are separated in real time by an active direction-pass filter (ADPF), which extracts sounds from a specified direction using the interaural phase and intensity differences estimated by scattering theory. Since the features of sounds separated by the ADPF vary with the sound direction, multiple direction- and speaker-dependent (DS-dependent) acoustic models are used. The system integrates the ASR results, using the sound direction and the speaker information from face recognition as well as the confidence measures of the ASR results, to select the best one. The resulting system shows around a 10\% improvement on average in recognizing three simultaneous speech signals, with the three talkers located 1 meter from the humanoid and separated from each other by 0 to 90 degrees at 10-degree intervals.
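The final selection stage can be pictured as below: several direction- and speaker-dependent recognizers each return a hypothesis with a confidence score, and the hypothesis most consistent with the localized direction and the face-recognition result is chosen. The dictionary fields, the 15-degree tolerance, and the fallback rule are illustrative assumptions rather than the system's actual interfaces.
\begin{verbatim}
def select_result(hypotheses, face_id=None, direction=None, tol_deg=15.0):
    """Pick the best hypothesis from several direction- and
    speaker-dependent recognizers.  Each hypothesis is a dict with
    "text", "confidence", "speaker", and "direction".  Hypotheses
    whose speaker or direction disagree with the face-recognition and
    localization results are discarded first."""
    candidates = []
    for h in hypotheses:
        if face_id is not None and h["speaker"] != face_id:
            continue
        if direction is not None and abs(h["direction"] - direction) > tol_deg:
            continue
        candidates.append(h)
    if not candidates:          # no consistent hypothesis: fall back to all
        candidates = hypotheses
    return max(candidates, key=lambda h: h["confidence"])

best = select_result(
    [{"text": "hello", "confidence": 0.62, "speaker": "alice", "direction": 30.0},
     {"text": "yellow", "confidence": 0.71, "speaker": "bob", "direction": -40.0}],
    face_id="alice", direction=28.0)
print(best["text"])   # -> hello
\end{verbatim}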
Audio-Visual Speech Recognition
Improving automatic speech recognition in the real world is a challenging topic. Audio-visual speech recognition, that is, the integration of ASR and lipreading, is one of the most promising approaches to improving the robustness and accuracy of real-world speech recognition. On a robot, however, lipreading is not always available because of occlusion and the low resolution of the robot's cameras. We are working on audio-visual speech recognition for the case where lip images and speech are only partially observed. Currently, we are developing a lipreading function.
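One common way to realize such integration is decision-level fusion, where audio and lipreading scores are combined with a weight that falls back to audio-only recognition when lip images are unavailable. The sketch below illustrates this idea with made-up scores and a fixed weight; it is not the method actually under development.
\begin{verbatim}
def av_fuse(audio_loglik, visual_loglik, visual_available, visual_weight=0.3):
    """Decision-level audio-visual fusion over candidate words.
    audio_loglik / visual_loglik map word -> log-likelihood from the
    audio recognizer and the lipreader.  When lip images are not
    available (occlusion, low resolution) the visual weight is dropped
    and the decision falls back to audio alone."""
    w_v = visual_weight if visual_available else 0.0
    w_a = 1.0 - w_v
    scores = {}
    for word, a in audio_loglik.items():
        s = w_a * a
        v = visual_loglik.get(word)
        if w_v > 0.0 and v is not None:
            s += w_v * v
        scores[word] = s
    return max(scores, key=scores.get)

# With the lip scores, the acoustically confusable "pat"/"bat" pair is
# disambiguated toward "pat"; without them, "bat" would win.
print(av_fuse({"pat": -4.0, "bat": -3.8},
              {"pat": -1.0, "bat": -6.0},
              visual_available=True))
\end{verbatim}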
Sound Source Separation
Sound source separation is essential for understanding general sounds, because we usually hear a mixture of sounds. We developed an active direction-pass filter (ADPF) that separates the sounds originating from a specified direction detected by a pair of microphones. The ADPF is thus based on directional processing, a process also used in vision. It is implemented by hierarchical integration of visual and auditory processing with hypothetical reasoning about the interaural phase difference (IPD) and interaural intensity difference (IID) for each sub-band. The resolution of sound localization and separation given by the ADPF depends on where the sound comes from: the resolving power is much higher for sounds coming directly from the front of the humanoid than for sounds coming from the periphery. This directional resolving property is similar to that of the eye, whereby the visual fovea at the center of the retina is capable of much higher resolution than the periphery of the retina.

To exploit the corresponding ``auditory fovea'', the ADPF controls the direction of the head. Human tracking and sound source separation based on the ADPF are implemented on the upper torso of the humanoid and run in real time using distributed processing on 5 PCs networked via gigabit Ethernet. The signal-to-noise ratio (SNR) and the noise reduction ratio of each sound separated by the ADPF from a mixture of two or three speech signals of the same volume were increased by about 2.2 dB and 9 dB, respectively.
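A minimal version of the direction-pass idea is sketched below: the two-channel signal is analyzed sub-band by sub-band, and only the time-frequency bins whose IPD matches the IPD expected for the target direction are passed. For simplicity the expected IPD here comes from a free-field two-microphone model, whereas the actual ADPF obtains it from scattering theory around the robot head; all parameter values are illustrative.
\begin{verbatim}
import numpy as np

def adpf_separate(left, right, fs, target_deg, mic_dist=0.18,
                  n_fft=512, hop=128, tol=0.35, c=343.0):
    """Keep the time-frequency bins whose interaural phase difference
    (IPD) matches the IPD expected for the target azimuth, and zero
    the rest, then resynthesize the separated signal by overlap-add."""
    win = np.hanning(n_fft)
    n_frames = (len(left) - n_fft) // hop + 1
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    # expected IPD (radians) per frequency for the target azimuth,
    # assuming a free-field delay between the two microphones
    delay = mic_dist * np.sin(np.radians(target_deg)) / c
    expected = 2.0 * np.pi * freqs * delay
    out = np.zeros(len(left))
    norm = np.zeros(len(left))
    for i in range(n_frames):
        s = i * hop
        L = np.fft.rfft(win * left[s:s + n_fft])
        R = np.fft.rfft(win * right[s:s + n_fft])
        ipd = np.angle(L * np.conj(R))
        # wrap the deviation from the expected IPD to (-pi, pi]
        diff = np.angle(np.exp(1j * (ipd - expected)))
        mask = (np.abs(diff) < tol).astype(float)
        frame = np.fft.irfft(mask * L, n_fft)
        out[s:s + n_fft] += win * frame
        norm[s:s + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)
\end{verbatim}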
As an application of robot audition, we study human-robot interaction. Internal states of the robot such as personality, the introduction of other sensory information such as tactile sensing, and new functions such as a directional speaker are studied below.
Personality in Robot
We are studying how to create social physical agents, i.e., humanoids, that perform actions empowered by real-time audio-visual tracking of multiple talkers. Social skills require complex perceptual and motor capabilities as well as communication capabilities. It is critical to identify the primary features when designing building blocks for social skills, because the performance of social interaction is usually evaluated for the system as a whole rather than for each component. We investigate the minimum functionalities required for social interaction, assuming that a humanoid is equipped with auditory and visual perception and simple motor control, but not with sound output. A real-time audio-visual multiple-talker tracking system is implemented on the humanoid SIG by using sound source localization, stereo vision, face recognition, and motor control. It extracts auditory and visual streams and associates them by their proximity in localization. Socially-oriented attention control makes the best use of personality variations classified by the Interpersonal Theory of psychology. It also provides task-oriented functions with a decaying belief factor for each stream. We demonstrate that the resulting behavior of SIG invites users to participate in the interaction and encourages them to explore SIG's behaviors. These demonstrations show that SIG behaves like a physical, non-verbal Eliza.
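The stream bookkeeping behind this attention control can be sketched as follows: each auditory, visual, or associated stream carries a belief that decays over time and is refreshed by new observations, and the focus of attention goes to the strongest surviving stream, with associated streams preferred. The class layout, the exponential half-life decay, and the preference ordering are assumptions for illustration only.
\begin{verbatim}
import time

class Stream:
    """A perceptual stream (auditory, visual, or associated) tracked
    by the attention controller."""
    def __init__(self, stream_id, kind, direction):
        self.id = stream_id
        self.kind = kind              # "audio", "visual", or "associated"
        self.direction = direction    # azimuth in degrees
        self.belief = 1.0
        self.last_update = time.monotonic()

    def observe(self, direction):
        """Refresh the stream with a new observation."""
        self.direction = direction
        self.belief = 1.0
        self.last_update = time.monotonic()

    def decayed_belief(self, half_life=2.0):
        """Belief decays exponentially while no observation arrives."""
        dt = time.monotonic() - self.last_update
        return self.belief * 0.5 ** (dt / half_life)

def focus_of_attention(streams, min_belief=0.1):
    """Choose the stream to attend to: associated (audio+visual)
    streams are preferred over single-modality ones, ties broken by
    the decayed belief.  Returns None when all streams have decayed."""
    rank = {"associated": 2, "audio": 1, "visual": 0}
    live = [s for s in streams if s.decayed_belief() >= min_belief]
    if not live:
        return None
    return max(live, key=lambda s: (rank[s.kind], s.decayed_belief()))
\end{verbatim}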
Tactile Sensing
Tactile sensing is one such addition of sensory information. We use piezo-sensor-based tactile sensing; the sensor detects the duration and intensity of stimuli. A mesh of piezo sensors is installed inside the soft skin of SIG2 so that stimuli of any kind can be localized. Human-robot interaction that integrates tactile sensing with the concept of distance between human and robot from ``proxemics'' is ongoing.
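As a simple picture of how a sensor mesh localizes contact, the sketch below thresholds the piezo intensities and takes the intensity-weighted centroid of the active cells. The threshold value and the centroid rule are illustrative assumptions; the actual processing on SIG2 may differ.
\begin{verbatim}
import numpy as np

def localize_touch(mesh, threshold=0.2):
    """Localize a tactile stimulus on a piezo-sensor mesh.
    `mesh` is a 2D array of sensor intensities; the location is the
    intensity-weighted centroid of the cells above `threshold`.
    Returns (row, col) in sensor coordinates, or None for no contact."""
    mesh = np.asarray(mesh, dtype=float)
    active = np.where(mesh >= threshold, mesh, 0.0)
    total = active.sum()
    if total == 0.0:
        return None
    rows, cols = np.indices(mesh.shape)
    return (float((rows * active).sum() / total),
            float((cols * active).sum() / total))

print(localize_touch([[0.0, 0.1, 0.0],
                      [0.1, 0.9, 0.4],
                      [0.0, 0.3, 0.1]]))
\end{verbatim}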
Ultrasonic Directional Speaker
A new communication method specialized for robots is being studied using an ultrasonic directional speaker. This speaker realizes sound spotting by exploiting its directivity. By combining an omni-directional (normal) speaker with the directional speaker installed in SIG2, we show preliminary demonstrations of multilingual announcements to a group of people.
Auditory Awareness by Using Missing Feature Theory
Auditory awareness is one of the critical technologies for realizing an intelligent robot that operates in daily environments. We are working on an auditory awareness system built around a new interface between sound source separation and automatic speech recognition (ASR). A mixture of speech signals captured with a pair of microphones installed at the ear positions is separated into individual speech signals by the active direction-pass filter (ADPF). The ADPF can extract a sound source from a specific direction in real time by using interaural phase and intensity differences. The separated speech is recognized by a speech recognizer based on missing feature theory (MFT). By using a missing feature mask, the MFT-based recognizer can cope with the distorted and missing features caused by speech separation. The ADPF and the ASR are interfaced by the missing feature mask: a missing feature mask for each separated speech signal is generated during separation and is sent to the ASR together with the separated speech. Thus, the new interface improves the performance of the auditory awareness system.
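The interface can be pictured with two small functions: one on the separation side that marks a spectral feature reliable when the separated energy sufficiently exceeds the estimated leakage, and one on the recognizer side that evaluates a frame likelihood using only the reliable features (marginalizing the rest under a diagonal-Gaussian model). The thresholding rule and the Gaussian form are illustrative assumptions, not the exact mask generation or decoder used.
\begin{verbatim}
import numpy as np

def make_mask(separated_spec, residual_spec, snr_db=0.0):
    """Generate a missing feature mask during separation: a spectral
    feature is marked reliable (1) when the separated energy exceeds
    the estimated leakage energy by `snr_db`, and unreliable (0)
    otherwise."""
    ratio = 10.0 * np.log10((separated_spec + 1e-10) /
                            (residual_spec + 1e-10))
    return (ratio > snr_db).astype(float)

def masked_loglik(features, mask, mean, var):
    """Frame log-likelihood under a diagonal-Gaussian acoustic model
    using only the features marked reliable by the missing feature
    mask (marginalizing over the unreliable dimensions).
    `features`, `mask`, `mean`, and `var` are length-D vectors."""
    m = mask.astype(bool)
    d = features[m] - mean[m]
    return float(-0.5 * np.sum(d * d / var[m] + np.log(2 * np.pi * var[m])))
\end{verbatim}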
The auditory awareness system using this new interface has been implemented on three humanoids, i.e., Honda ASIMO, and SIG and SIG2 of Kyoto Univ. As a result, the performance of automatic speech recognition improves by more than 50\% on every humanoid in the case of three simultaneous speech signals.