Research Activity

Active Audition

The audition system of a highly intelligent humanoid requires localization of sound sources and identification of the meaning of each sound in the auditory scene. Active audition focuses on improving sound source tracking by integrating audition, vision, and motor movements. Given multiple sound sources in the auditory scene, "SIG the humanoid" actively moves its head to improve localization by aligning its microphones orthogonal to a sound source and by capturing possible sound sources with vision. However, such active head movement inevitably creates motor noise, so the system must adaptively cancel the motor noise using motor control signals. Experimental results demonstrate that active audition, by integrating audition, vision, and motor control, enables sound source tracking in a variety of conditions.
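The adaptive motor-noise cancellation step can be illustrated with a standard LMS adaptive filter. The sketch below is not SIG's actual implementation; it only shows the general idea of subtracting a noise estimate predicted from a reference signal correlated with the motor noise (e.g. derived from motor control signals). All names and parameter values are illustrative:

```python
import numpy as np

def lms_cancel(mic, motor_ref, taps=32, mu=0.01):
    """Adaptively cancel motor noise from a microphone signal (LMS sketch).

    mic       -- microphone samples containing speech plus motor noise
    motor_ref -- reference signal correlated with the motor noise,
                 e.g. derived from motor control commands (assumed)
    Returns the error signal, i.e. the microphone signal with the
    adaptive noise estimate subtracted.
    """
    w = np.zeros(taps)                         # adaptive filter weights
    out = np.zeros_like(np.asarray(mic, float))
    for n in range(taps, len(mic)):
        x = motor_ref[n - taps + 1:n + 1][::-1]  # recent reference samples
        noise_est = w @ x                        # predicted motor noise
        e = mic[n] - noise_est                   # cleaned sample
        w += mu * e * x                          # LMS weight update
        out[n] = e
    return out
```

Because the speech is uncorrelated with the motor reference, the filter converges toward the noise path and the speech survives in the error signal.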

Multimodal Integration

Real-Time Multiple Speaker Tracking System

Real-time processing is crucial for sensorimotor tasks in tracking, and multiple-object tracking is crucial for real-world applications. Multiple sound source tracking requires perception of a mixture of sounds and cancellation of the motor noise caused by body movements; however, its real-time processing had not been reported before. Real-time tracking is attained by fusing information obtained from sound source localization, multiple face recognition, speaker tracking, focus-of-attention control, and motor control. Auditory streams with sound source direction are extracted from 48 kHz sampled sounds by the active audition system with motor noise cancellation capability. Visual streams with face ID and 3D position are extracted from a single camera by combining skin-color extraction, correlation-based matching, and multiple-scale image generation. These auditory and visual streams are associated by comparing their spatial locations, and the associated streams are used to control the focus of attention. Auditory, visual, and association processing are performed asynchronously on different PCs connected by a TCP/IP network. The resulting system, implemented on an upper-torso humanoid, can track multiple objects with a delay of 200 msec, imposed by visual tracking and network latency.
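The association of auditory and visual streams by spatial proximity could be sketched as a greedy matching on azimuth, as below. The angular threshold and the greedy closest-first rule are assumptions for illustration, not the system's actual association algorithm:

```python
def associate_streams(auditory, visual, max_diff_deg=10.0):
    """Pair auditory and visual streams whose azimuths are close.

    auditory / visual -- lists of (stream_id, azimuth_deg)
    Returns a list of (auditory_id, visual_id) pairs, greedily matching
    the closest directions first (hypothetical association rule).
    """
    # All candidate pairs, sorted by angular distance
    candidates = sorted(
        (abs(a_az - v_az), a_id, v_id)
        for a_id, a_az in auditory
        for v_id, v_az in visual
    )
    used_a, used_v, pairs = set(), set(), []
    for diff, a_id, v_id in candidates:
        if diff <= max_diff_deg and a_id not in used_a and v_id not in used_v:
            pairs.append((a_id, v_id))
            used_a.add(a_id)
            used_v.add(v_id)
    return pairs
```

A sound stream with no face within the threshold remains an unassociated, purely auditory stream.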

Recognition of Three Simultaneous Speeches

We are working on listening to three simultaneous talkers with a humanoid that has two microphones. In such situations, sound separation and automatic speech recognition (ASR) of the separated speech are difficult, because the number of simultaneous talkers exceeds the number of microphones, the signal-to-noise ratio is quite low (around -3 dB), and the noise is not stationary due to interfering voices. To improve recognition of three simultaneous speeches, two key ideas are introduced: acoustical modeling of the robot head by scattering theory, and two-layered audio-visual integration of both name and location, that is, speech and face recognition, and speech and face localization. Sound sources are separated in real time by an active direction-pass filter (ADPF), which extracts sounds from a specified direction by using the interaural phase/intensity differences estimated by scattering theory. Since the features of sounds separated by the ADPF vary according to the sound direction, multiple direction- and speaker-dependent (DS-dependent) acoustic models are used. The system integrates the ASR results by using the sound direction, speaker information from face recognition, and confidence measures of the ASR results to select the best one. The resulting system shows around 10% improvement on average in recognizing three simultaneous speeches, where the three talkers were located 1 meter from the humanoid and 0 to 90 degrees apart from each other at 10-degree intervals.
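The final integration step, selecting the best result across DS-dependent recognizers, can be sketched as a scoring rule over candidate hypotheses. The dictionary keys and the bonus value below are invented for illustration; the actual system's integration is more elaborate:

```python
def select_best_result(results):
    """Pick the most plausible recognition result (illustrative sketch).

    results -- list of dicts, one per direction- and speaker-dependent
               (DS-dependent) acoustic model, with hypothetical keys:
               "text"          recognized word string
               "conf"          ASR confidence measure in [0, 1]
               "model_speaker" speaker the acoustic model assumes
               "face_speaker"  speaker identified by face recognition
    A result whose face-recognition speaker matches the model's assumed
    speaker gets a bonus; the highest-scoring result wins.
    """
    def score(r):
        bonus = 0.2 if r.get("face_speaker") == r.get("model_speaker") else 0.0
        return r["conf"] + bonus
    return max(results, key=score)
```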

Audio-Visual Speech Recognition

Improving automatic speech recognition in the real world is a challenging topic. Audio-visual speech recognition, that is, the integration of ASR and lipreading, is one of the most promising methods to improve the robustness and accuracy of real-world speech recognition. However, in the case of a robot, lipreading is not always available because of occlusion and the low resolution of robot cameras. We are working on audio-visual speech recognition for when lip images and speech are only partially observed. Currently, we are developing a lipreading function.

Sound Source Separation

Sound source separation is essential for understanding general sounds because we usually hear a mixture of sounds. We developed an active direction-pass filter (ADPF) that separates sounds originating from a specified direction detected with a pair of microphones. The ADPF is thus based on directional processing, a strategy also used in visual processing. The ADPF is implemented by hierarchical integration of visual and auditory processing with hypothetical reasoning about the interaural phase difference (IPD) and interaural intensity difference (IID) for each sub-band. The resolution of sound localization and separation by the ADPF depends on where the sound comes from: the resolving power is much higher for sounds coming directly from the front of the humanoid than for sounds coming from the periphery. This directional resolving property is similar to that of the eye, whereby the visual fovea at the center of the retina is capable of much higher resolution than the periphery of the retina. To exploit the corresponding ``auditory fovea'', the ADPF controls the direction of the head. Human tracking and sound source separation based on the ADPF are implemented on the upper-torso humanoid and run in real time using distributed processing on 5 PCs networked via gigabit Ethernet. The signal-to-noise ratio (SNR) and the noise reduction ratio of each sound separated by the ADPF from a mixture of two or three speeches of the same volume were increased by about 2.2 dB and 9 dB, respectively.
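The core idea of passing only sub-bands whose IPD matches the target direction can be sketched in a few lines of spectral processing. This is a heavily simplified, IPD-only stand-in for the ADPF (no IID reasoning, no scattering-theory head model); the delay/tolerance parameterization is an assumption:

```python
import numpy as np

def direction_pass_filter(left, right, target_delay, fs=48000,
                          nfft=512, tolerance=0.5):
    """Pass sub-bands whose interaural phase difference (IPD) matches
    a target direction; zero out the rest (simplified ADPF sketch).

    left, right  -- one frame of samples from the two microphones
    target_delay -- expected inter-microphone delay in seconds for the
                    desired direction (positive: right channel delayed)
    tolerance    -- allowed IPD deviation in radians
    Returns the filtered left-channel frame.
    """
    L = np.fft.rfft(left, nfft)
    R = np.fft.rfft(right, nfft)
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    ipd = np.angle(L * np.conj(R))               # measured IPD per bin
    expected = 2 * np.pi * freqs * target_delay  # IPD predicted by delay
    # Wrap the IPD difference into [-pi, pi] before thresholding
    diff = np.angle(np.exp(1j * (ipd - expected)))
    mask = np.abs(diff) <= tolerance
    return np.fft.irfft(L * mask, nfft)
```

Sounds from the specified direction keep their energy, while sub-bands dominated by sources from other directions are suppressed.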

Human-Robot Interaction

As an application of robot audition, we study human-robot interaction. We investigate internal states of robots such as personality, the introduction of other sensory information, and new functions such as a directional speaker.

Personality in Robot

We are studying how to create social physical agents, i.e., humanoids, that perform actions empowered by real-time audio-visual tracking of multiple talkers. Social skills require complex perceptual and motor capabilities as well as communication capabilities. It is critical to identify the primary features in designing building blocks for social skills, because the performance of social interaction is usually evaluated for the system as a whole, not for each component. We investigate the minimum functionalities for social interaction, assuming that the humanoid is equipped with auditory and visual perception and simple motor control but not with sound output. A real-time audio-visual multiple-talker tracking system is implemented on the humanoid SIG by using sound source localization, stereo vision, face recognition, and motor control. It extracts auditory and visual streams and associates audio and visual streams by proximity in localization. Socially-oriented attention control makes the best use of personality variations classified by the Interpersonal Theory of psychology. It also provides task-oriented functions with a decaying belief factor for each stream. We demonstrate that the resulting behavior of SIG invites the users' participation in interaction and encourages the users to explore SIG's behaviors. These demonstrations show that SIG behaves like a physical, non-verbal Eliza.
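The decaying belief factor for each stream can be illustrated with a toy attention controller: each stream's belief decays every tick and is boosted when the stream is observed again, and attention goes to the stream with the strongest belief. The decay and boost values are made-up illustrations, not SIG's parameters:

```python
class AttentionController:
    """Toy attention control with per-stream belief decay (sketch)."""

    def __init__(self, decay=0.9, boost=1.0):
        self.decay = decay    # multiplicative forgetting per tick
        self.boost = boost    # belief added when a stream is observed
        self.belief = {}      # stream id -> current belief

    def tick(self, observed_streams):
        """Update beliefs for one time step and return the attended stream."""
        for sid in list(self.belief):
            self.belief[sid] *= self.decay            # forget old streams
        for sid in observed_streams:
            self.belief[sid] = self.belief.get(sid, 0.0) + self.boost
        if not self.belief:
            return None
        return max(self.belief, key=self.belief.get)  # strongest belief wins
```

With this rule, attention shifts to newly active talkers but does not instantly abandon a recently attended one.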

Tactile Sensing

Tactile sensing is one such additional sensory modality. We use tactile sensing based on piezo sensors, which detect the duration and intensity of stimuli. A mesh of piezo sensors is installed inside the soft skin of SIG2 so that any kind of stimulus can be localized. Human-robot interaction integrating tactile sensing with the concept of human-robot distance from ``proxemics'' is ongoing research.
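Localizing a stimulus on a sensor mesh can be sketched as finding the strongest reading over the grid. This is a minimal illustration, not the SIG2 implementation; the threshold is an assumed parameter:

```python
import numpy as np

def localize_touch(intensity, threshold=0.1):
    """Locate a touch on a mesh of piezo sensors (minimal sketch).

    intensity -- 2D array of sensor readings over the skin mesh
    Returns the (row, col) of the strongest stimulus, or None if no
    reading exceeds the threshold.
    """
    intensity = np.asarray(intensity, float)
    if intensity.max() < threshold:
        return None                                # nothing touched
    idx = np.unravel_index(np.argmax(intensity), intensity.shape)
    return tuple(int(i) for i in idx)
```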

Ultrasonic Directional Speaker

A new communication method specialized for robots is studied by using an ultrasonic directional speaker. This speaker realizes sound spotting by making use of its directivity. By combining an omni-directional (normal) speaker and a directional speaker installed in SIG2, we show preliminary demonstrations of multilingual announcements for a group of people.

Auditory Awareness by Using Missing Feature Theory

Auditory awareness for a robot is one of the critical technologies for realizing an intelligent robot operating in daily environments. We are working on an auditory awareness system that uses a new interface between sound source separation and automatic speech recognition (ASR). A mixture of speeches captured with a pair of microphones installed at the ear positions is separated into individual speeches by using an active direction-pass filter (ADPF). The ADPF can extract a sound source from a specific direction in real time by using interaural phase/intensity differences. Each separated speech is recognized by a speech recognizer based on missing feature theory (MFT). By using a missing feature mask, the MFT-based recognizer can cope with the distorted and missing features caused by speech separation. The ADPF and the ASR are interfaced by the missing feature mask; that is, a missing feature mask for each separated speech is generated during speech separation and sent to the ASR along with the separated speech. Thus, the new interface improves the performance of the auditory awareness system. The auditory awareness system using the new interface is implemented on three humanoids, i.e., Honda ASIMO, and SIG and SIG2 of Kyoto Univ. As a result, the performance of automatic speech recognition improves by over 50% on every humanoid in the case of three simultaneous speeches.
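The effect of a missing feature mask on acoustic scoring can be illustrated with the marginalization form of MFT: unreliable feature dimensions are simply dropped from a diagonal-Gaussian likelihood. This is a textbook-style sketch of the principle, not the recognizer's actual scoring code:

```python
import numpy as np

def mft_log_likelihood(features, mask, means, variances):
    """Log-likelihood of a feature vector under one diagonal-Gaussian
    acoustic state, marginalizing out unreliable features (MFT sketch).

    features, mask   -- feature vector and 0/1 reliability mask
    means, variances -- diagonal-Gaussian parameters of the state
    Only features with mask == 1 contribute; dimensions distorted by
    separation are excluded from the score.
    """
    x = np.asarray(features, float)
    keep = np.asarray(mask, bool)
    m = np.asarray(means, float)[keep]
    v = np.asarray(variances, float)[keep]
    x = x[keep]
    # Sum of per-dimension Gaussian log-densities over reliable features
    return float(np.sum(-0.5 * (np.log(2 * np.pi * v) + (x - m) ** 2 / v)))
```

Masking a dimension corrupted by separation prevents it from dragging down the score of the correct acoustic state.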