Any active robot audition system, or any other automatic speech recognition (ASR) system for that matter, suffers from acoustic reverberation. Unless perfect recording conditions are provided, an ASR system needs to cope with the same sound source being picked up by the microphones multiple times as the sound reflects off various room surfaces. The problem is compounded when the room acoustics are perturbed by a change in the speaker's face orientation. The research team at HRI Japan, which I was part of at the time, saw an opportunity to improve the results of our dereverberation techniques by taking the speaker's face orientation into account, something no previous method had attempted.
My contribution to the dereverberation approach of R. Gomez et al. was providing the face orientation vector for the correct selection of the room transfer function (RTF), as well as the inputs for a more accurate gain correction scheme. According to our findings, this combination of visual and audio information outperforms state-of-the-art dereverberation techniques.
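To illustrate the idea of orientation-driven RTF selection, here is a minimal sketch: given a bank of RTFs measured at a few discrete face orientations, pick the one closest to the currently estimated face angle. The angle grid and the RTF representation below are hypothetical placeholders, not the actual bank or selection logic from the paper.

```python
import numpy as np

def select_rtf(face_angle_deg, rtf_bank):
    """Nearest-neighbour lookup: return the RTF measured at the
    orientation closest to the current face angle (degrees)."""
    angles = np.array(sorted(rtf_bank))
    nearest = angles[np.argmin(np.abs(angles - face_angle_deg))]
    return rtf_bank[nearest]

# Hypothetical bank of RTFs measured at a few discrete orientations.
rtf_bank = {-60: "rtf_-60", -30: "rtf_-30", 0: "rtf_0",
            30: "rtf_30", 60: "rtf_60"}
print(select_rtf(-38.2, rtf_bank))  # -38.2 deg is closest to the -30 deg measurement
```

In practice the bank would hold measured impulse responses rather than labels, but the selection step reduces to the same nearest-orientation lookup.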
In my work, I used Microsoft's Kinect V2 sensor to obtain the speaker's location in the room. Knowing the positions of the camera and the microphone, I could then calculate not only the speaker's position but also the speaker's face rotation relative to the microphone.
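The geometry behind this relative rotation is simple: given the speaker's position, the facing direction reported by the sensor's head-pose estimate, and the microphone's position, the face orientation relative to the microphone is the angle between the facing vector and the speaker-to-microphone vector. A minimal sketch (the coordinate values and function name are illustrative, not taken from the actual system):

```python
import numpy as np

def face_angle_to_mic(speaker_pos, face_dir, mic_pos):
    """Angle in degrees between the speaker's facing direction and the
    direction from the speaker to the microphone; 0 means the speaker
    faces the microphone directly."""
    to_mic = np.asarray(mic_pos, float) - np.asarray(speaker_pos, float)
    to_mic /= np.linalg.norm(to_mic)
    face = np.asarray(face_dir, float)
    face /= np.linalg.norm(face)
    cos_a = np.clip(np.dot(face, to_mic), -1.0, 1.0)  # guard against rounding
    return float(np.degrees(np.arccos(cos_a)))

# Speaker at the origin facing +x; microphone on the +y axis -> 90 deg off-axis.
print(face_angle_to_mic([0, 0, 0], [1, 0, 0], [0, 2, 0]))  # 90.0
```

All three vectors must be expressed in the same coordinate frame, which is why knowing the camera and microphone positions relative to each other is the prerequisite.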
While this was sufficient information for the dereverberation method, I went further and improved microphone noise filtering by tracking mouth movements. Intuitively, sound coming from the speaker should only be considered when the mouth is moving. Consequently, any sound arriving from the speaker's direction while no significant and continuous mouth movement was detected by the sensor was ignored and never sent to the ASR system.
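The gating described above can be sketched as follows: audio frames are kept only during runs of frames where the lip-opening change exceeds a threshold for long enough to count as "significant and continuous" movement. The thresholds, frame representation, and function name here are assumptions for illustration, not the actual parameters used.

```python
import numpy as np

def gate_audio(frames, lip_deltas, move_thresh=2.0, min_run=3):
    """Drop (set to None) audio frames unless the per-frame lip-opening
    change exceeds move_thresh for at least min_run consecutive frames."""
    active = np.abs(np.asarray(lip_deltas, float)) > move_thresh
    keep = np.zeros_like(active)
    run_start = None
    for i, a in enumerate(active):
        if a and run_start is None:
            run_start = i                      # a movement run begins
        elif not a and run_start is not None:
            if i - run_start >= min_run:       # run long enough: keep it
                keep[run_start:i] = True
            run_start = None
    if run_start is not None and len(active) - run_start >= min_run:
        keep[run_start:] = True                # run extends to the end
    return [f if k else None for f, k in zip(frames, keep)]

# Frames 'b'..'d' coincide with 3 consecutive movement frames and survive;
# the isolated movement at frame 'f' is too short and is gated out.
print(gate_audio(list("abcdefgh"), [0, 3, 3, 3, 0, 3, 0, 0]))
```

Requiring a minimum run length is what filters out isolated detector glitches, matching the "significant and continuous" criterion in the text.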
Our work was published at Interspeech 2015; the citation is given below.
Gomez, R., Ivanchuk, L., Nakamura, K., Mizumoto, T., & Nakadai, K. (2015). Dereverberation for active human-robot communication robust to speaker's face orientation. In Sixteenth Annual Conference of the International Speech Communication Association.