Our goal is to develop telepresence hardware and software that is both low-cost and high-quality. In our audio research group, we keep costs low by using a novel “zero-aperture” acoustic-vector-sensor microphone array, shown in the figure below.

The array has three orthogonally mounted pressure-gradient microphones, X, Y and Z, which have figure-eight patterns with their directions of maximum response oriented along the X, Y and Z axes, and one omnidirectional acoustic pressure microphone, O, which detects sounds from all directions with equal magnitude. All the microphones are standard hearing-aid microphones with useful directivity between 100 Hz and 20 kHz, which covers most sound sources. Each microphone measures only a few millimetres across and is placed about a centimetre from the others. We call it the XYZO array in our project. Whereas state-of-the-art techniques use large microphone arrays for immersive telepresence, we use the miniature XYZO microphone array for the same purpose. Our initial results show good performance of the XYZO array in 2D and 3D direction finding and in binaural 3D audio reconstruction. Our vision is to research techniques for applying the XYZO array to realistic 3D audio in immersive telepresence. Our research work is described below.
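As an illustration of the directional patterns described above, the following minimal sketch models the idealised gain of each channel for a far-field source. It assumes perfect figure-eight and omnidirectional responses and a particular axis convention (x forward, y left, z up); the function names are hypothetical and are not taken from the project's software.

```python
import numpy as np

def direction_vector(azimuth_deg, elevation_deg):
    """Unit vector towards the source (assumed convention: x forward, y left, z up)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])

def xyzo_gains(azimuth_deg, elevation_deg):
    """Idealised gains of the X, Y, Z and O channels for a plane wave.

    An ideal figure-eight microphone responds with the cosine of the angle
    between its axis and the source direction, which is simply the matching
    component of the unit direction vector. The omnidirectional channel has
    unit gain in every direction.
    """
    ux, uy, uz = direction_vector(azimuth_deg, elevation_deg)
    return np.array([ux, uy, uz, 1.0])
```

In this model a source on the positive X axis excites only the X and O channels, and a source behind the array produces the same magnitude on X with inverted sign, which is what makes the channel ratios informative about direction.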

Real-time 3D direction finding of multiple acoustic sources

In many cases, more than one participant speaks at the same time during a multi-participant conference. To identify the direction of each participant, we are exploring the sparseness of the acoustic sources in the time-frequency domain and their non-stationary characteristics. A coherence test is used to identify the sparse time-frequency bins of the acoustic sources, and the direction of each source can then be estimated from its corresponding sparse time-frequency bins. The challenges are to find the sparse time-frequency bins correctly, to determine how sparse the bins must be for accurate estimation, and to increase the angular resolution with which directions can be distinguished under noise and reverberation. Currently, most reliable results can be obtained only under very constrained conditions, such as low noise and short reverberation. In practice, however, ambient noise is usually high, and in an enclosed room the reverberation time due to reflections is also very long, which significantly degrades the performance of direction-finding systems. To address this, we investigate onset-based direction-finding methods that use the direct-path data of the acoustic sources.
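The idea of estimating a direction from each time-frequency bin can be sketched as follows. This is not the coherence test or the onset-based method mentioned above; it is a simpler stand-in that takes the direction of the acoustic intensity vector in each frequency bin, under the assumptions of ideal XYZO channels, far-field sources, no noise and no reverberation. All function names are hypothetical.

```python
import numpy as np

def direction_vector(azimuth_deg, elevation_deg):
    """Unit vector towards the source (assumed convention: x forward, y left, z up)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])

def mix_sources(signals, directions):
    """Idealised XYZO recording of several far-field sources.

    signals: list of equal-length 1D arrays, one per source.
    directions: list of (azimuth_deg, elevation_deg) pairs, one per source.
    """
    o = np.sum(signals, axis=0)  # the omnidirectional channel sums all sources
    xyz = sum(direction_vector(az, el)[:, None] * s[None, :]
              for s, (az, el) in zip(signals, directions))
    return xyz[0], xyz[1], xyz[2], o

def per_bin_directions(x, y, z, o):
    """Estimate azimuth and elevation (degrees) in every frequency bin of one frame."""
    X, Y, Z, O = (np.fft.rfft(c) for c in (x, y, z, o))
    # Active intensity: real part of conj(pressure) times each gradient channel.
    intensity = np.real(np.conj(O) * np.stack([X, Y, Z]))
    azimuth = np.degrees(np.arctan2(intensity[1], intensity[0]))
    elevation = np.degrees(np.arctan2(intensity[2],
                                      np.hypot(intensity[0], intensity[1])))
    energy = np.abs(O) ** 2  # used to select bins that actually contain a source
    return azimuth, elevation, energy

# Two tones that occupy different bins, i.e. perfectly sparse sources.
fs, n = 8000, 800
t = np.arange(n) / fs
tones = [np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 1200 * t)]
x, y, z, o = mix_sources(tones, [(30.0, 0.0), (-100.0, 20.0)])
azimuth, elevation, energy = per_bin_directions(x, y, z, o)
strongest = np.argsort(energy)[-2:]  # the two bins carrying the tones
```

Because each tone occupies its own bin in this toy example, the two strongest bins point at the two source directions. With real speech, bins shared by several talkers or dominated by reflections give misleading directions, which is why selecting reliable bins is the hard part of the problem.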

Real-time directional voice acquisition

For effective communication in telepresence, clean voice acquisition is very important, especially when there is interference and noise in the environment. A typical single-microphone approach usually lacks the capability to reduce such environmental interference and noise effectively. When audio is played back through loudspeakers, the echo caused by acoustic coupling between the microphone and the loudspeaker must also be addressed. To resolve these problems, microphone arrays have been widely adopted to enhance voice acquisition using adaptive beamforming and post-filtering techniques. However, existing microphone array approaches usually rely on widely spaced arrays, which limits their use in portable and low-cost devices. Our project explores microphone array techniques for forming ‘directional beams’ that capture the target voices using the miniature XYZO microphone array. We investigate the achievable performance of our approach under interference, background noise, echo and reverberation.
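To make the notion of a ‘directional beam’ concrete, the sketch below forms a fixed first-order beam by combining the omnidirectional channel with the gradient channels projected onto a look direction. It is only an illustration under the assumption of ideal, co-located XYZO channels; it is not the adaptive beamforming or post-filtering discussed above, and the function names and the `alpha` parameter are hypothetical.

```python
import numpy as np

def direction_vector(azimuth_deg, elevation_deg):
    """Unit vector towards the look direction (assumed convention: x forward, y left, z up)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])

def steer_beam(x, y, z, o, azimuth_deg, elevation_deg, alpha=0.5):
    """Fixed first-order beam pointed at (azimuth_deg, elevation_deg).

    The output is alpha * O + (1 - alpha) * (gradient along the look direction).
    With alpha = 0.5 the pattern is a cardioid: unit gain in the look
    direction and a null directly opposite it.
    """
    u = direction_vector(azimuth_deg, elevation_deg)
    gradient = u[0] * x + u[1] * y + u[2] * z
    return alpha * o + (1.0 - alpha) * gradient

# Talker in front (azimuth 0 degrees), interferer directly behind (azimuth 180 degrees).
rng = np.random.default_rng(0)
talker = rng.standard_normal(1000)
interferer = rng.standard_normal(1000)
x = talker - interferer        # the X channel sees the rear source with inverted sign
y = np.zeros_like(talker)      # neither source lies on the Y or Z axis
z = np.zeros_like(talker)
o = talker + interferer        # the O channel sees both sources equally
beam = steer_beam(x, y, z, o, azimuth_deg=0.0, elevation_deg=0.0)
```

In this idealised case the interferer falls exactly in the cardioid null, so the beam output equals the talker signal. Interferers from other directions are only attenuated, not removed, which is one reason adaptive methods are of interest.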

Real-time 3D audio capture and reproduction

In telepresence, one of the most useful effects is to give listeners good sound-image localization of the remote participants. By incorporating spatial sound, we expect meetings to become more efficient because of improved speaker identification and speech intelligibility. Spatial sound can be realized either by reconstructing the entire sound field of a particular environment or by reconstructing the desired sound field in the vicinity of each ear. Our research investigates 3D audio capture and reproduction based on the XYZO microphone array. The playback system can be either stereo headphones or a group of directional loudspeakers. The challenges are to improve the users’ spatial hearing perception and to solve the in-head sound problem and front-back confusion with limited measurements.


Our audio research group is currently working with the video and FPGA/GPU groups to integrate and implement a realistic real-time audio-video telepresence system.