IEEE Transactions on Multimedia | Vol. 20, Issue 7 | Pages 1767-1780
Multiple Speaker Tracking in Spatial Audio via PHD Filtering and Depth-Audio Fusion
In an object-based spatial audio system, positions of the audio objects (e.g., speakers/talkers or voices) presented in the sound scene are required as important metadata attributes for object acquisition and reproduction. Binaural microphones are often used as a physical device to mimic human hearing and to monitor and analyze the scene, including localization and tracking of multiple speakers. The binaural audio tracker, however, is usually prone to errors caused by room reverberation and background noise. To address this limitation, we present a multimodal tracking method that fuses the binaural audio with depth information (from a depth sensor, e.g., Kinect). More specifically, the probability hypothesis density (PHD) filtering framework is first applied to the depth stream, and a novel clutter intensity model is proposed to improve the robustness of the PHD filter when an object is occluded by other objects or falls outside the limited field of view of the depth sensor. To compensate for misdetections in the depth stream, a novel gap-filling technique is presented that maps audio azimuths obtained from the binaural audio tracker to 3D positions, using speaker-dependent spatial constraints learned from the depth stream. With the proposed method, both the errors in the binaural tracker and the misdetections in the depth tracker can be significantly reduced. Real-room recordings are used to show the improved performance of the proposed method in removing outliers and reducing misdetections.
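The gap-filling idea can be illustrated with a minimal sketch. The abstract does not specify the form of the speaker-dependent spatial constraints, so this example assumes, purely for illustration, that the constraint for one speaker is the mean horizontal range and mean height observed in that speaker's depth detections. It also assumes that the binaural microphones and the depth sensor share one origin and coordinate frame (x to the right, y up, z forward), with azimuth measured in radians from the z axis toward x. All function names are hypothetical and are not taken from the paper.

```python
import math


def learn_constraints(depth_positions):
    """Return (mean horizontal range, mean height) for one speaker.

    Assumed stand-in for the paper's speaker-dependent spatial constraints.
    depth_positions is a non-empty list of (x, y, z) tuples in metres.
    """
    n = len(depth_positions)
    mean_range = sum(math.hypot(x, z) for x, _, z in depth_positions) / n
    mean_height = sum(y for _, y, _ in depth_positions) / n
    return mean_range, mean_height


def azimuth_to_position(azimuth_rad, constraints):
    """Map an audio azimuth to a 3D point using the learned range and height."""
    mean_range, mean_height = constraints
    x = mean_range * math.sin(azimuth_rad)
    z = mean_range * math.cos(azimuth_rad)
    return (x, mean_height, z)


def fill_gaps(depth_track, audio_azimuths):
    """Replace missing depth detections (None) with azimuth-derived positions.

    depth_track and audio_azimuths are frame-aligned lists for one speaker.
    If the depth stream never detected the speaker, the track is returned
    unchanged because no constraint can be learned.
    """
    observed = [p for p in depth_track if p is not None]
    if not observed:
        return list(depth_track)
    constraints = learn_constraints(observed)
    return [
        p if p is not None else azimuth_to_position(a, constraints)
        for p, a in zip(depth_track, audio_azimuths)
    ]
```

For example, calling `fill_gaps([(0.0, 1.5, 2.0), None, (0.0, 1.5, 2.0)], [0.0, math.pi / 2, 0.0])` keeps the two depth detections and fills the middle frame with a point at the learned range and height in the direction given by the audio azimuth. A real system would likely need a richer constraint model and a calibrated transform between the two sensors.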