Utilizing visual cues in robot audition for sound source discrimination by Levko Ivanchuk


The goal of this project was to answer a simple question: what kind of visual information can we use to improve verbal human-robot communication? Humans have an excellent system for telling speech apart from noise and for identifying the direction a sound comes from. For robots and other systems that rely on speech input, this task is much harder than it first seems, because sound information alone is often not sufficient to distinguish noise from speech.

Therefore, I implemented a system that used a Microsoft Kinect V2 to determine the locations of humans around the robot. This information was then made available to the sound localization and separation system, which automatically ignored noise and sound reflections coming from any direction other than that of an actual human.
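
The direction filter described above can be illustrated with a minimal sketch. It assumes, for simplicity, that the camera and the microphone array share the same origin and axes, that human positions are given as (x, z) coordinates in metres, and that localized sound sources are reported as azimuth angles. The function names and the 15 degree tolerance are hypothetical and are not taken from the original system.

```python
import math

def azimuth_deg(x, z):
    """Horizontal angle of a point: 0 degrees straight ahead, positive to the right (assumed convention)."""
    return math.degrees(math.atan2(x, z))

def angular_gap(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def keep_human_sources(source_azimuths, human_positions, tolerance_deg=15.0):
    """Keep only the sound sources whose direction is close to a detected human."""
    human_azimuths = [azimuth_deg(x, z) for (x, z) in human_positions]
    return [
        s for s in source_azimuths
        if any(angular_gap(s, h) <= tolerance_deg for h in human_azimuths)
    ]

if __name__ == "__main__":
    humans = [(0.0, 2.0), (1.0, 1.0)]        # one person straight ahead, one at 45 degrees
    sources = [2.0, 50.0, -90.0, 170.0]      # localized sound directions in degrees
    print(keep_human_sources(sources, humans))
```

In this example, only the sources near 0 and 45 degrees survive; the other two are treated as noise or reflections because no human stands in those directions.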

In addition, the speaker's face and mouth were tracked. With this information, we were able to ignore any noise coming from behind the human speaker. Because we knew when the speaker was moving his or her mouth, we could process only those sounds that were emitted during the movement. Although mouth tracking requires a line of sight between the camera and the human, it significantly improves the quality of noise filtering during sound source localization.

Using Face tracking to assist Human-Robot Communication by Levko Ivanchuk

Overall system structure

Any active robot audition system, or any other automatic speech recognition (ASR) system for that matter, suffers from acoustic reverberation. Unless perfect recording conditions are provided, an ASR system needs to deal with the same sound source being picked up by the microphones multiple times as the sound reverberates from various room surfaces. The problem is compounded when the room acoustics are perturbed by a change in the speaker's face orientation. The research team at HRI Japan, which I was part of at the time, saw an opportunity to improve the results of our dereverberation techniques by taking the speaker's face orientation into account, something that no other method had attempted before.

My contribution to the dereverberation approach taken by R. Gomez et al. was to provide the face orientation vector used to select the correct room transfer function (RTF), as well as the correct inputs for a more accurate gain correction scheme. According to our findings, this combination of visual and audio information outperforms state-of-the-art dereverberation techniques.

In my work, I used Microsoft's Kinect V2 sensor to obtain the speaker's location in the room. Knowing the positions of the camera, microphone, and speaker, I was able to calculate not only the position of the speaker but also the speaker's face rotation relative to the microphone. 
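
As an illustration of the geometry involved, the sketch below computes the angle between the direction the speaker is facing and the direction from the speaker to the microphone. It assumes that all positions and the face direction vector are already expressed in one common 3D coordinate frame; the function name and the sample coordinates are hypothetical and do not come from the original implementation.

```python
import math

def angle_to_microphone_deg(speaker_pos, face_dir, mic_pos):
    """Angle in degrees between the face direction and the speaker-to-microphone vector.

    0 means the speaker faces the microphone directly; 180 means facing away.
    """
    to_mic = [m - s for m, s in zip(mic_pos, speaker_pos)]
    dot = sum(f * t for f, t in zip(face_dir, to_mic))
    norm_face = math.sqrt(sum(f * f for f in face_dir))
    norm_mic = math.sqrt(sum(t * t for t in to_mic))
    if norm_face == 0.0 or norm_mic == 0.0:
        raise ValueError("face direction and speaker-to-microphone vector must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_face * norm_mic)))
    return math.degrees(math.acos(cos_angle))

if __name__ == "__main__":
    speaker = (0.0, 0.0, 2.0)      # speaker two metres in front of the microphone
    microphone = (0.0, 0.0, 0.0)
    print(angle_to_microphone_deg(speaker, (0.0, 0.0, -1.0), microphone))  # facing the microphone
    print(angle_to_microphone_deg(speaker, (1.0, 0.0, 0.0), microphone))   # facing sideways
```

An angle of this kind is one plausible way to index a set of pre-measured room transfer functions by face orientation, although the selection scheme actually used is described in the publication rather than here.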

While this was sufficient information for the dereverberation method, I went further and improved microphone noise filtering by tracking mouth movements. Intuitively, sound coming from the speaker should only be considered when the mouth is moving. Hence, any sound arriving from the speaker's direction while the sensor detected no significant and continuous mouth movement was ignored and not sent to the ASR system at all.
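
The gating idea can be sketched as follows. This is a simplified illustration, not the original implementation: it assumes the sensor provides one mouth-opening measurement per audio frame, treats a frame as "moving" when the opening changes by more than a threshold, and requires a minimum run of consecutive moving frames before audio is passed on. The function names, threshold, and run length are all hypothetical.

```python
def mouth_activity_gate(mouth_openings, move_threshold=0.02, min_run=3):
    """Return one boolean per frame: True where mouth movement is significant and continuous."""
    if not mouth_openings:
        return []
    # A frame counts as moving when the opening changed noticeably since the previous frame.
    moving = [False] + [
        abs(b - a) > move_threshold
        for a, b in zip(mouth_openings, mouth_openings[1:])
    ]
    gate = [False] * len(moving)
    run_start = None
    # The trailing False acts as a sentinel that closes a run ending on the last frame.
    for i, is_moving in enumerate(moving + [False]):
        if is_moving and run_start is None:
            run_start = i
        elif not is_moving and run_start is not None:
            if i - run_start >= min_run:
                for j in range(run_start, i):
                    gate[j] = True
            run_start = None
    return gate

def gate_audio(audio_frames, gate):
    """Keep only the audio frames recorded while the gate was open."""
    return [frame for frame, is_open in zip(audio_frames, gate) if is_open]

if __name__ == "__main__":
    openings = [0.10, 0.10, 0.15, 0.05, 0.12, 0.12, 0.12, 0.16, 0.16]
    frames = list("abcdefghi")   # stand-ins for audio frames
    print(gate_audio(frames, mouth_activity_gate(openings)))
```

Requiring a minimum run length means that a single isolated twitch of the mouth does not open the gate, which matches the "significant and continuous" condition described above.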

Our work was published at Interspeech 2015; the citation is given below.


Gomez, R., Ivanchuk, L., Nakamura, K., Mizumoto, T., & Nakadai, K. (2015). Dereverberation for active human-robot communication robust to speaker's face orientation. In Sixteenth Annual Conference of the International Speech Communication Association.

SmartColor by Levko Ivanchuk

A video demonstrating SmartColor in action on a few real life scenarios.

During my internship at the University of Manitoba Human-Computer Interaction Lab, I worked primarily with Juan David Hincapié-Ramos on a project that aimed to improve color reproduction on head-worn transparent displays (HMDs).

Users of optical see-through head-mounted displays (OHMD) perceive color as a blend of the display color and the background. Color-blending is a major usability challenge as it leads to loss of color encodings and poor text legibility. Color correction aims at mitigating color blending by producing an alternative color which, when blended with the background, more closely approaches the color originally intended. To date, approaches to color correction do not yield optimal results or do not work in real-time. This paper makes two contributions. First, we present QuickCorrection, a real-time color correction algorithm based on display profiles. We describe the algorithm, measure its accuracy and analyze two implementations for the OpenGL graphics pipeline. Second, we present SmartColor, a middleware for color management of user-interface components in OHMD. SmartColor uses color correction to provide three management strategies: correction, contrast, and show-upon-contrast. Correction determines the alternate color which best preserves the original color. Contrast determines the color which best guarantees text legibility while preserving as much of the original hue as possible. Show-upon-contrast makes a component visible when a related component does not have enough contrast to be legible. We describe SmartColor’s architecture and illustrate the color strategies for various types of display content.
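
To make the idea of color correction concrete, here is a deliberately simplified sketch. It is not the QuickCorrection algorithm, which is based on display profiles; instead it assumes a naive additive blending model in RGB with channel values between 0 and 1, and it searches a coarse grid of candidate display colors by brute force. All function names and the grid size are hypothetical.

```python
from itertools import product

def blend(display, background):
    """Naive additive model (an assumption): perceived color = display + background, clamped to 1."""
    return tuple(min(1.0, d + b) for d, b in zip(display, background))

def distance_sq(a, b):
    """Squared Euclidean distance between two RGB colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def correct_color(target, background, steps=11):
    """Pick the display color whose blend with the background lands closest to the target."""
    levels = [i / (steps - 1) for i in range(steps)]
    return min(
        product(levels, repeat=3),
        key=lambda candidate: distance_sq(blend(candidate, background), target),
    )

if __name__ == "__main__":
    target = (0.8, 0.5, 0.2)        # color the interface intends to show
    background = (0.3, 0.3, 0.3)    # light coming through the display
    corrected = correct_color(target, background)
    print("uncorrected blend:", blend(target, background))
    print("corrected display color:", corrected)
    print("corrected blend:", blend(corrected, background))
```

The example also shows a limit of correction under this model: the blue channel cannot be corrected fully, because the display can only add light and the background already contributes more blue than the target asks for.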


J. David Hincapié-Ramos, L. Ivanchuk, S. K. Sridharan and P. Irani, "SmartColor: Real-time color correction and contrast for optical see-through head-mounted displays," 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Munich, 2014, pp. 187-194.
doi: 10.1109/ISMAR.2014.6948426
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6948426&isnumber=6948385

J. David Hincapié-Ramos, L. Ivanchuk, S. K. Sridharan and P. P. Irani, "SmartColor: Real-Time Color and Contrast Correction for Optical See-Through Head-Mounted Displays," in IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 12, pp. 1336-1348, Dec. 1 2015.
doi: 10.1109/TVCG.2015.2450745
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7138644&isnumber=7299343

Here is an example of SmartColor in action: 

From top left to bottom right: 1) not corrected, 2) corrected using fragment shader correction, 3) corrected using vertex shader correction, 4) corrected using vertex shader correction plus a voting mechanism.
