Abstract

In object-based spatial audio systems, the positions of the audio objects (e.g., speakers/talkers or voices) present in the sound scene are important metadata attributes for object acquisition and reproduction. Binaural microphones are often used as a physical device to mimic human hearing and to monitor and analyze the scene, including localization and tracking of multiple speakers. A binaural audio tracker, however, is usually prone to errors caused by room reverberation and background noise. To address this limitation, we present a multimodal tracking method that fuses the binaural audio with depth information (from a depth sensor, e.g., Kinect). More specifically, the probability hypothesis density (PHD) filtering framework is first applied to the depth stream, and a novel clutter intensity model is proposed to improve the robustness of the PHD filter when an object is occluded by other objects or lost because of the limited field of view of the depth sensor. To compensate for misdetections in the depth stream, a novel gap filling technique is presented that maps audio azimuths obtained from the binaural audio tracker to 3D positions, using speaker-dependent spatial constraints learned from the depth stream. With the proposed method, both the errors in the binaural tracker and the misdetections in the depth tracker can be significantly reduced. Real-room recordings are used to show the improved performance of the proposed method in removing outliers and reducing misdetections.
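The gap filling idea (turning an audio azimuth into a 3D position when the depth sensor misses a speaker) can be illustrated with a minimal sketch. The abstract does not say which speaker-dependent spatial constraints are learned, so everything specific here is an assumption made for illustration and not the authors' method: the constraint is taken to be the speaker's mean horizontal range and mean height over past depth detections, the microphones and depth sensor are assumed to share one origin and coordinate frame, and the names `SpeakerConstraint`, `learn_constraint`, and `azimuth_to_position` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class SpeakerConstraint:
    """Hypothetical per-speaker constraint learned from depth detections."""
    horizontal_range_m: float  # mean distance from the sensor in the x-z plane
    height_m: float            # mean height (y coordinate)


def learn_constraint(depth_positions):
    """Average horizontal range and height over (x, y, z) depth detections.

    Assumption: x points right, y points up, z points forward from the sensor.
    """
    ranges = [math.hypot(x, z) for x, _, z in depth_positions]
    heights = [y for _, y, _ in depth_positions]
    n = len(depth_positions)
    return SpeakerConstraint(sum(ranges) / n, sum(heights) / n)


def azimuth_to_position(azimuth_rad, constraint):
    """Place the speaker at the learned range and height along the azimuth.

    Assumption: azimuth is measured in the horizontal plane from the
    forward (z) axis, positive towards +x.
    """
    x = constraint.horizontal_range_m * math.sin(azimuth_rad)
    z = constraint.horizontal_range_m * math.cos(azimuth_rad)
    return (x, constraint.height_m, z)


# Usage: learn from frames where depth tracking worked, then fill a gap.
history = [(0.0, 1.6, 2.0), (0.0, 1.6, 2.0)]
constraint = learn_constraint(history)
filled = azimuth_to_position(math.radians(30.0), constraint)
```

An azimuth alone fixes only a direction, so some extra constraint is needed to recover a full 3D point; a fixed range and height is simply the easiest such constraint to write down, and a real system would likely need something more flexible.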

Multiple Speaker Tracking in Spatial Audio via PHD Filtering and Depth-Audio Fusion

Keywords

room reverberation and background; metadata attributes; clutter intensity model; object-based spatial audio system; positions of the audio objects (e.g., speakers/talkers); multimodal tracking method; binaural audio with depth information; outliers; misdetections in the depth stream; limited field of view; analyze the scene, including localization and tracking of multiple speakers; gap filling technique; recordings; probability hypothesis density (PHD) filtering framework; speaker-dependent spatial constraints; hearing; sound scene; object acquisition; 3D positions; PHD filter

Cite this article

Qingju Liu, Wenwu Wang, Teófilo de Campos, Philip J. B. Jackson, Adrian Hilton. Multiple Speaker Tracking in Spatial Audio via PHD Filtering and Depth-Audio Fusion. 20(7), 1767-1780.
