- SIG listens to two pure tones of 500 Hz and 600 Hz while in motion, canceling its own motor noise. The sound is captured by a pair of microphones in SIG.
[ MPEG1 74074ms 352x240 ]
Sound Module [ MPEG1 5976ms 720x480 ]
(Localize multiple harmonic sounds)
Face Module [ MPEG1 9984ms 720x480 ]
Localize and Recognize multiple faces
In case of Multiple Faces [ MPEG1 28536ms 720x480 ]
Stereo Vision Module [ MPEG1 41174ms 352x240 ]
Association Module [ MPEG1 21984ms 720x480 ]
Form auditory and visual streams and integrate them into an associated stream according to their proximity
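A minimal sketch of proximity-based association, assuming each stream is reduced to a single azimuth in degrees and that streams are paired greedily when they lie within a threshold. The names `associate_streams` and `max_gap_deg`, and the 10-degree default, are hypothetical; this page does not specify how SIG measures proximity.

```python
from itertools import product


def associate_streams(auditory, visual, max_gap_deg=10.0):
    """Greedily pair auditory and visual streams by angular proximity.

    auditory, visual: dicts mapping a stream name to an azimuth in degrees
    (assumed to lie in the frontal range, so no wrap-around is handled).
    Returns (pairs, unpaired_auditory, unpaired_visual).
    """
    # Rank every candidate pair by angular gap, closest first.
    candidates = sorted(
        (abs(a_dir - v_dir), a_id, v_id)
        for (a_id, a_dir), (v_id, v_dir) in product(auditory.items(), visual.items())
    )
    pairs, used_a, used_v = [], set(), set()
    for gap, a_id, v_id in candidates:
        if gap > max_gap_deg:
            break  # remaining candidates are even farther apart
        if a_id in used_a or v_id in used_v:
            continue  # each stream joins at most one associated stream
        pairs.append((a_id, v_id))
        used_a.add(a_id)
        used_v.add(v_id)
    unpaired_a = sorted(set(auditory) - used_a)
    unpaired_v = sorted(set(visual) - used_v)
    return pairs, unpaired_a, unpaired_v


# Hypothetical directions: one voice lines up with a face, the others do not.
pairs, only_audio, only_visual = associate_streams(
    {"voice1": -28.0, "voice2": 35.0},
    {"face1": -30.0, "face2": 80.0},
)
```

Streams left unpaired stay as audition-only or vision-only streams, which matches the separate tracking demos listed next.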
Tracking By Audition Only [ MPEG1 18072ms 720x480 ]
The sound stream (upper-left area) is not very accurate.
Tracking By Vision Only [ MPEG1 13056ms 720x480 ]
Tracking By Audio-Visual Integration [ MPEG1 12672ms 720x480 ]
Mutual disambiguation: visual information that is missing, for example because of occlusion, is compensated by auditory information, and
ambiguous auditory information is disambiguated by accurate visual information.
Tracking By Stereo Vision [ MPEG1 35280ms 720x480 ]
Tracking By Integration of 5 Modalities [ MPEG1 52080ms 720x480 ]
Using sound direction, face ID, face location, stereo vision, and speaker identification.
SIG as Participant In Conversation [ MPEG1 34968ms 720x480 ]
Even this kind of passive interaction creates the impression that
the robot is participating in the conversation.
SIG as Receptionist For Party
* Registered Visitor [ MPEG1 31968ms 720x480 ]
For a registered visitor, the robot completes the reception task
by integrating the visitor's face and voice.
* Unregistered Visitor [ MPEG1 26952ms 720x480 ]
For an unregistered visitor, the robot registers the visitor's face and name by association-based audio-visual (AV) integration.
Simultaneous Speaker Tracking [ MPEG1 28632ms 704x480 ]
The robot listens to several people speaking simultaneously and attends to one of them by
selecting the stream it is interested in.
Tracking Stereo Sound by R-L balance Control [ MPEG1 24864ms 720x480 ]
The robot tracks virtual stereo sounds created by two loudspeakers.
This is evidence that the robot's auditory processing is similar to
a human's.
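The link between right-left balance and perceived direction can be sketched with the stereophonic tangent law, a standard amplitude-panning model. This is an assumption made for illustration; the page does not say which balance rule the demo uses. The function name `phantom_azimuth_deg` and the 30-degree loudspeaker half-angle are hypothetical.

```python
import math


def phantom_azimuth_deg(gain_left, gain_right, speaker_half_angle_deg=30.0):
    """Estimate the direction of a virtual source between two loudspeakers.

    Tangent law: tan(theta) / tan(theta0) = (gL - gR) / (gL + gR),
    with the loudspeakers at +/- theta0 and non-negative gains.
    Positive angles point toward the left loudspeaker.
    """
    total = gain_left + gain_right
    if total <= 0:
        raise ValueError("at least one gain must be positive")
    ratio = (gain_left - gain_right) / total
    theta0 = math.radians(speaker_half_angle_deg)
    return math.degrees(math.atan(ratio * math.tan(theta0)))


# Sweep the balance from fully left to fully right.
for g_left in (1.0, 0.75, 0.5, 0.25, 0.0):
    print(g_left, round(phantom_azimuth_deg(g_left, 1.0 - g_left), 1))
```

Under this model, equal gains place the virtual source straight ahead, and shifting the balance moves it toward the louder loudspeaker, which is the kind of motion a tracking robot would follow.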
Looking Away [ MPEG1 36728ms 720x480 ]
When given a hostile personality, the robot looks away when a person speaks to it.
* Recognition of Three Simultaneous Speeches 1 [ MPEG1 23424ms 720x480 ]
* Recognition of Three Simultaneous Speeches 2 [ MPEG1 33720ms 720x480 ]
Disambiguation By Asking Again : Active Audition
* Recognition of Three Simultaneous Speeches 3 [ MPEG1 30216ms 720x480 ]
Face information makes the recognition faster and more accurate.
* Recognition of Three Simultaneous Speeches 4 [ MPEG1 27432ms 720x480 ]
The vocabulary contains 150 words in total, so a sound mixture of three fruit names can be recognized.
English Version
* Recognition of Three Simultaneous Speeches 1 [ MPEG2 34834ms 720x480 ]
By integrating face ID with speech recognition that uses speaker- and direction-dependent acoustic models, SIG recognizes simultaneous English speech.
* Recognition of Three Simultaneous Speeches 2 [ MPEG2 40073ms 720x480 ]
Using speech recognition alone, SIG recognizes simultaneous speech to some extent.
* Recognition of Three Simultaneous Speeches 3 [ MPEG2 48815ms 720x480 ]
Disambiguation By Asking Again : Active Audition
* Recognition of Three Simultaneous Human Talkers in 2005 [ MPEG2 48815ms 720x480 ]
* Faster Recognition of Three Simultaneous Human Talkers in 2006-2007 [ MPEG2 48815ms 720x480 ]
* Robot Referee of Rock-Paper-Scissors Sound Games in 2008
* Robot Steps by Recognizing the Beats of Music in 2007