The goal of this project was to answer a simple question: what kind of visual information can we use to improve verbal human-robot communication? Humans have a remarkable ability to separate noise from actual speech and to identify the direction a sound comes from. For robots and other systems that rely on speech input, this task is much harder than it first appears, because sound information alone is often insufficient to distinguish noise from speech.
Therefore, I implemented a system that used the Microsoft Kinect V2 to determine the locations of humans around the robot. This information was then made available to the sound localization and separation system, which could automatically ignore noise and sound reflections coming from directions other than the position of an actual human.
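The core idea of this spatial gating can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name, the use of azimuth angles in degrees, and the tolerance value are all assumptions for the sake of the example.

```python
def filter_sound_sources(source_angles, human_angles, tolerance_deg=15.0):
    """Keep only sound-source azimuths (degrees) that lie within
    `tolerance_deg` of at least one tracked human position; anything
    else is treated as noise or a reflection.  Hypothetical sketch."""
    def angular_diff(a, b):
        # Smallest absolute difference between two angles, in degrees.
        return abs((a - b + 180.0) % 360.0 - 180.0)

    return [s for s in source_angles
            if any(angular_diff(s, h) <= tolerance_deg for h in human_angles)]

# Humans tracked at 10 and 250 degrees; candidate sources at 12, 90, 245.
print(filter_sound_sources([12.0, 90.0, 245.0], [10.0, 250.0]))
# → [12.0, 245.0]  (the 90-degree source is rejected as noise)
```

In a real system the human positions would come from the Kinect skeleton tracker and the candidate angles from the microphone-array localizer, but the gating logic itself reduces to this kind of angular comparison.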
In addition, the speaker's face and mouth were tracked. With this information, we could ignore any noise coming from behind the human speaker. Because we knew when the speaker was moving his or her mouth, we could restrict processing to only those sounds emitted during that movement. Although mouth tracking requires a clear line of sight between the camera and the human, it significantly improves the quality of noise filtering during sound source localization.
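The temporal side of this filtering, gating audio by mouth movement, can be sketched in the same spirit. Again this is an illustrative assumption, not the project's actual code: frame timestamps and mouth-movement intervals are hypothetical inputs standing in for the audio pipeline and the face tracker.

```python
def gate_audio_by_mouth(frames, mouth_intervals):
    """frames: list of (timestamp_sec, samples) pairs.
    mouth_intervals: list of (start, end) times during which the tracked
    mouth was moving.  Returns only the frames captured while the
    speaker's mouth was in motion.  Hypothetical sketch."""
    return [(t, s) for (t, s) in frames
            if any(start <= t <= end for (start, end) in mouth_intervals)]

frames = [(0.1, "a"), (0.5, "b"), (1.2, "c")]
speaking = [(0.4, 1.0)]  # mouth moved between 0.4 s and 1.0 s
print(gate_audio_by_mouth(frames, speaking))
# → [(0.5, 'b')]  (frames outside the speaking interval are dropped)
```

Combining both gates, directional and temporal, means the recognizer only ever sees audio that arrived from a tracked human while that human's mouth was moving.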