Description of Robot Audition
The ultimate goal of the project is to realize robot audition that
works in the real world in real time.
In real environments, the robot should use its own microphones to
cope with the following difficult situations:
- The robot should listen to a specific sound source in noisy environments. In humans, this capability is known as the "Cocktail Party Effect".
- The robot should listen to several speakers simultaneously.
This is required to cope with cases where someone, or something
playing sounds, interrupts a conversation; this is known as
"Barge-in" in spoken dialog systems.
We consider three issues for the realization of such robot
audition: active audition, multimodal integration, and general
sound understanding.
Active Audition
Active audition, which couples audition, vision, and the motor control
system, is critical. It can be implemented in various ways. To take
the most visible example, the system should be able to dynamically
align its microphone positions with respect to sound sources to obtain
better resolution. Consider a humanoid with a pair of microphones.
Given multiple sound sources in the auditory scene, the humanoid
should actively move its head to improve localization, separation, and
recognition by aligning the microphone baseline orthogonal to a sound source.
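As a concrete illustration, the sketch below shows how a binaural robot could estimate source azimuth from the interaural time difference (ITD) between its two microphones. The function names, the 18 cm microphone spacing, and the cross-correlation method are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature

def itd_by_cross_correlation(left, right, fs):
    """Estimate the interaural time difference (seconds) as the lag
    that maximizes the cross-correlation of the two channels."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)  # lag in samples
    return lag / fs

def azimuth_from_itd(itd, mic_distance=0.18):
    """Convert an ITD to a source azimuth (radians, 0 = straight ahead)
    using the free-field model itd = mic_distance * sin(theta) / c."""
    sin_theta = np.clip(itd * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return np.arcsin(sin_theta)
```

Because the ITD changes fastest with azimuth near zero, turning the head until the estimated azimuth vanishes places the source broadside to the microphone pair, where localization is most sensitive.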
Active audition requires movement of the components that mount the
microphone units. In many cases, such a mount is actuated by motors
that create considerable noise. In a complex robotic system, such as a
humanoid, motor noise is complex and often irregular, because a number
of motors may be involved in head and body movement. Removing
motor noise from the auditory system requires real-time information on
what kind of movement the robot is making. In other words, motor
control signals need to be integrated as one of the perception
channels. If dynamic canceling of motor noise fails, one may
reluctantly end up using the "stop-perceive-act" principle, so
that the audition system can receive sound without motor noise. To
avoid such an implementation, we implemented an adaptive noise
canceling scheme that uses the motor control signals to anticipate and
cancel motor noise.
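A minimal sketch of such a scheme, assuming a standard normalized-LMS (NLMS) adaptive filter and a noise reference derived from the motor control signals (names and parameters are illustrative, not the system's actual design):

```python
import numpy as np

def nlms_noise_cancel(primary, reference, order=64, mu=0.5, eps=1e-8):
    """Adaptive noise canceling with a normalized LMS filter.

    primary:   microphone signal (speech + motor noise)
    reference: noise-correlated reference, e.g. predicted from the
               motor control commands
    Returns the error signal, i.e. the noise-reduced output.
    """
    w = np.zeros(order)
    output = np.zeros(len(primary))
    for n in range(order, len(primary)):
        x = reference[n - order:n][::-1]  # most recent reference samples
        noise_est = w @ x                 # filter's estimate of the motor noise
        e = primary[n] - noise_est        # speech plus residual noise
        w += mu * e * x / (x @ x + eps)   # NLMS weight update
        output[n] = e
    return output
```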
Multimodal Integration
Multimodal integration is necessary in any case to improve the
performance of real-world applications. In particular, we focus on
Audio-Visual (AV) integration, because vision and audition are the two
main sensory modalities in human beings.
For sound source localization, the error in the direction determined
by a CASA application is about 10 degrees, which is similar to that of
a human, i.e., about 8 degrees. However, this is too coarse to
separate sound streams from a mixture of sounds. Visual processing has
its own problems, such as the narrow visual field of an ordinary
camera and occlusion between overlapping persons. These problems are
difficult to solve by visual or auditory processing alone; therefore,
audition and vision should be integrated for robust speaker
localization and tracking in a real environment.
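One simple way to realize such integration, assuming each modality reports an azimuth with Gaussian uncertainty, is inverse-variance fusion of the two estimates. The function and its parameters below are hypothetical:

```python
import math

def fuse_azimuths(theta_audio, var_audio, theta_vision, var_vision):
    """Fuse audio and visual azimuth estimates (radians) treated as
    independent Gaussian measurements. Audition is coarse but
    omnidirectional; vision is precise but limited to the camera's
    field of view, so pass var_vision=math.inf when the speaker is
    outside the frame or occluded."""
    w_a = 1.0 / var_audio
    w_v = 1.0 / var_vision
    theta = (w_a * theta_audio + w_v * theta_vision) / (w_a + w_v)
    return theta, 1.0 / (w_a + w_v)  # fused estimate and its variance
```

When the speaker leaves the field of view, the visual weight drops to zero and the fused estimate falls back to the auditory one, reflecting the complementary roles of the two modalities.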
Generally, AV integration in speech recognition uses visual speech,
that is, lip-reading, which is effective for improving speech
recognition. On a robot, however, lip-reading is not always available:
when a person is away from the robot, the resolution of the images
from the robot's camera is insufficient for detecting the lips.
Therefore, we also consider another form of AV integration for speech
recognition, namely the integration of speech and face recognition,
because a face, owing to its size, is generally easier to detect than
the lips.
Thus, AV integration is effective at various levels of information. We
also propose a hierarchical AV integration model that provides
inter-level integration of auditory and visual information.
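As a sketch of one such level, suppose both a speaker-identification module and a face-recognition module output per-person log-probabilities; they could be combined as a weighted sum (the function and weighting below are illustrative assumptions):

```python
def identify_person(speech_logprobs, face_logprobs, alpha=0.5):
    """Combine speaker identification and face recognition scores by a
    weighted sum of log-probabilities over the people known to both
    recognizers; alpha weights the speech stream."""
    candidates = set(speech_logprobs) & set(face_logprobs)
    return max(candidates,
               key=lambda p: alpha * speech_logprobs[p]
                             + (1 - alpha) * face_logprobs[p])
```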
General Sound Understanding
Usually we hear a mixture of sounds, not the sound of a single
source. Understanding general sounds therefore requires understanding
a mixture of sounds. In speaker tracking, for example, multiple
persons can speak simultaneously; in such cases, a robot has to
separate and localize the sound sources. Computational Auditory Scene
Analysis (CASA) is a framework for understanding a mixture of sounds,
and a great deal of work has been done in this area. The concepts of
CASA should be introduced so that they can be applied in real
environments.
Common techniques of sound source separation for understanding general
sounds are beamforming with a microphone array, independent
component analysis (ICA) for blind source separation, and the use of
psychological cues such as harmonic relationships, common frequency
and amplitude modulation, and onsets and offsets. However, these
techniques assume that the microphone setting is fixed. Some
techniques, such as ICA, assume that the number of microphones is
greater than or equal to the number of sound sources, an assumption
that may not hold in a real-world environment.
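For concreteness, here is a minimal frequency-domain delay-and-sum beamformer. It makes exactly the fixed-geometry assumption noted above (known, static microphone positions), and all names and parameters are illustrative:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Steer a microphone array toward `direction` (a unit vector from
    the array toward the source) by delaying each channel so a plane
    wave from that direction adds coherently, then averaging.

    signals:       array of shape (num_mics, num_samples)
    mic_positions: array of shape (num_mics, 3), in meters
    """
    num_mics, num_samples = signals.shape
    # Arrival-time advance of each microphone relative to the origin.
    delays = mic_positions @ direction / c
    delays -= delays.min()  # shift so all delays are non-negative
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    out = np.zeros(num_samples // 2 + 1, dtype=complex)
    for m in range(num_mics):
        spectrum = np.fft.rfft(signals[m])
        # Apply a fractional-sample delay as a linear phase shift.
        out += spectrum * np.exp(-2j * np.pi * freqs * delays[m])
    return np.fft.irfft(out / num_mics, n=num_samples)
```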
Therefore, we need to develop sound source separation methods for
mobile robots and moving sound sources that can be deployed in the
real world.