Description of Robot Audition
The ultimate goal of the project is to realize robot audition that
works in the real world in real time.
In real environments, the robot should use its own microphones to
cope with the following difficult situations:
- The robot should listen to a specific sound source in noisy environments. In humans, this capability is known as the "Cocktail Party Effect".
- The robot should listen to several speakers simultaneously.
This is required to cope with cases where someone, or something
playing sounds, interrupts a conversation; this is known as
"Barge-in" in spoken dialog systems.
We consider three issues for the realization of such robot
audition: active audition, multimodal integration, and general
sound understanding.
Active Audition
Active audition, which couples audition, vision, and the motor control
system, is critical. It can be implemented in various ways. To take
the most visible example, the system should be able to dynamically
align its microphone positions with respect to sound sources to obtain
better resolution. Consider a humanoid with a pair of microphones.
Given multiple sound sources in the auditory scene, the humanoid
should actively move its head to improve localization, separation, and
recognition by aligning the microphone baseline orthogonal to a sound source.
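As a concrete illustration, the sketch below shows how a binaural robot could estimate source azimuth from the interaural time difference (ITD) between its two microphones. The function names, the 18 cm microphone spacing, and the cross-correlation method are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature

def itd_by_cross_correlation(left, right, fs):
    """Estimate the interaural time difference (seconds) as the lag
    that maximizes the cross-correlation of the two channels."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)  # lag in samples
    return lag / fs

def azimuth_from_itd(itd, mic_distance=0.18):
    """Convert an ITD to a source azimuth (radians, 0 = straight ahead)
    using the free-field model itd = mic_distance * sin(theta) / c."""
    sin_theta = np.clip(itd * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return np.arcsin(sin_theta)
```

Because the ITD changes fastest with azimuth near zero, turning the head until the estimated azimuth vanishes places the source broadside to the microphone pair, where localization is most sensitive.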
Active audition requires movement of the components that mount the
microphone units. In many cases, such a mount is actuated by motors
that create considerable noise. In a complex robotic system, such as a
humanoid, motor noise is complex and often irregular, because a number
of motors may be involved in head and body movement. Removing
motor noise from the auditory system requires real-time information on
what kind of movement the robot is making. In other words, motor
control signals need to be integrated as one of the perception
channels. If dynamic canceling of motor noise fails, one may
reluctantly end up using the "stop-perceive-act" principle, so
that the audition system can receive sound without motor noise. To
avoid such an implementation, we implemented an adaptive noise
canceling scheme that uses the motor control signals to anticipate and
cancel motor noise.
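A minimal sketch of such a scheme, assuming a standard normalized-LMS (NLMS) adaptive filter and a noise reference derived from the motor control signals (names and parameters are illustrative, not the system's actual design):

```python
import numpy as np

def nlms_noise_cancel(primary, reference, order=64, mu=0.5, eps=1e-8):
    """Adaptive noise canceling with a normalized LMS filter.

    primary:   microphone signal (speech + motor noise)
    reference: noise-correlated reference, e.g. predicted from the
               motor control commands
    Returns the error signal, i.e. the noise-reduced output.
    """
    w = np.zeros(order)
    output = np.zeros(len(primary))
    for n in range(order, len(primary)):
        x = reference[n - order:n][::-1]  # most recent reference samples
        noise_est = w @ x                 # filter's estimate of the motor noise
        e = primary[n] - noise_est        # speech plus residual noise
        w += mu * e * x / (x @ x + eps)   # NLMS weight update
        output[n] = e
    return output
```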
Multimodal Integration
Multimodal integration is necessary in any case to improve the
performance of real-world applications. In particular, we focus on
Audio-Visual (AV) integration, because vision and audition are the two
main sensory modalities in human beings.
For sound source localization, the error in the direction determined
by a CASA application is about 10 degrees, which is similar to that of
a human, i.e., about 8 degrees. However, this is too coarse to
separate sound streams from a mixture of sounds. Visual processing has
its own problems, such as the narrow visual field of an ordinary
camera and occlusion between overlapping persons. These problems are
difficult to solve by visual or auditory processing alone; therefore,
audition and vision should be integrated for robust speaker
localization and tracking in a real environment.
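One simple way to realize such integration, assuming each modality reports an azimuth with Gaussian uncertainty, is inverse-variance fusion of the two estimates. The function and its parameters below are hypothetical:

```python
import math

def fuse_azimuths(theta_audio, var_audio, theta_vision, var_vision):
    """Fuse audio and visual azimuth estimates (radians) treated as
    independent Gaussian measurements. Audition is coarse but
    omnidirectional; vision is precise but limited to the camera's
    field of view, so pass var_vision=math.inf when the speaker is
    outside the frame or occluded."""
    w_a = 1.0 / var_audio
    w_v = 1.0 / var_vision
    theta = (w_a * theta_audio + w_v * theta_vision) / (w_a + w_v)
    return theta, 1.0 / (w_a + w_v)  # fused estimate and its variance
```

When the speaker leaves the field of view, the visual weight drops to zero and the fused estimate falls back to the auditory one, reflecting the complementary roles of the two modalities.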
Generally, AV integration in speech recognition uses visual speech,
that is, lip-reading, which is effective for improving speech
recognition. On a robot, however, lip-reading is not always available:
when a person is away from the robot, the resolution of the images
from the robot's camera is insufficient for detecting the lips.
Therefore, we also consider another form of AV integration for speech
recognition, namely the integration of speech and face recognition,
because a face, owing to its size, is generally easier to detect than
the lips.
Thus, AV integration is effective at various levels of information. We
also propose a hierarchical AV integration model that provides
inter-level integration of auditory and visual information.
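As a sketch of one such level, suppose both a speaker-identification module and a face-recognition module output per-person log-probabilities; they could be combined as a weighted sum (the function and weighting below are illustrative assumptions):

```python
def identify_person(speech_logprobs, face_logprobs, alpha=0.5):
    """Combine speaker identification and face recognition scores by a
    weighted sum of log-probabilities over the people known to both
    recognizers; alpha weights the speech stream."""
    candidates = set(speech_logprobs) & set(face_logprobs)
    return max(candidates,
               key=lambda p: alpha * speech_logprobs[p]
                             + (1 - alpha) * face_logprobs[p])
```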
General Sound Understanding
Usually we hear a mixture of sounds, not the sound of a single
source. Understanding general sounds therefore requires understanding
a mixture of sounds. In speaker tracking, for example, multiple
persons can speak simultaneously; in such cases, a robot has to
separate and localize the sound sources. Computational Auditory Scene
Analysis (CASA) is a framework for understanding a mixture of sounds,
and a great deal of work has been done in this area. The concepts of
CASA should be introduced so that they can be applied in real
environments.
Common techniques of sound source separation for understanding general
sounds are beamforming with a microphone array, independent
component analysis (ICA) for blind source separation, and the use of
psychological cues such as harmonic relationships, common frequency
and amplitude modulation, and onsets and offsets. However, these
techniques assume that the microphone setting is fixed. Some
techniques, such as ICA, assume that the number of microphones is
greater than or equal to the number of sound sources, an assumption
that may not hold in a real-world environment.
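For concreteness, here is a minimal frequency-domain delay-and-sum beamformer. It makes exactly the fixed-geometry assumption noted above (known, static microphone positions), and all names and parameters are illustrative:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Steer a microphone array toward `direction` (a unit vector from
    the array toward the source) by delaying each channel so a plane
    wave from that direction adds coherently, then averaging.

    signals:       array of shape (num_mics, num_samples)
    mic_positions: array of shape (num_mics, 3), in meters
    """
    num_mics, num_samples = signals.shape
    # Arrival-time advance of each microphone relative to the origin.
    delays = mic_positions @ direction / c
    delays -= delays.min()  # shift so all delays are non-negative
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    out = np.zeros(num_samples // 2 + 1, dtype=complex)
    for m in range(num_mics):
        spectrum = np.fft.rfft(signals[m])
        # Apply a fractional-sample delay as a linear phase shift.
        out += spectrum * np.exp(-2j * np.pi * freqs * delays[m])
    return np.fft.irfft(out / num_mics, n=num_samples)
```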
Therefore, we need to develop sound source separation methods for
mobile robots and moving sound sources that can be deployed in the
real world.