
Electronics Letters | Vol. 55, Issue 14 | Pages 816-819


TEnet: target speaker extraction network with accumulated speaker embedding for automatic speech recognition

Wenjie Li, Pengyuan Zhang, Yonghong Yan

Abstract

Automatic speech recognition is difficult when multiple people talk simultaneously. To address this problem, speaker-aware selective methods extract the speech of a target speaker, relying on auxiliary speaker characteristics provided by an anchor (a clean audio sample of the target speaker). However, extraction performance depends on the duration and quality of the anchors, which makes it unstable. To overcome this limitation, the authors propose a target speaker extraction network (TEnet), which uses a robust speaker embedding to extract the target speech from the speech mixture. To obtain more stable speaker characteristics during training, the speaker embeddings are accumulated over all the speech of each target speaker, rather than taken from a single anchor. At test time, very few anchors are enough to achieve decent extraction performance. Results show that the TEnet trained with accumulated embeddings achieves better performance and robustness than the single-anchored TEnet. Moreover, to exploit the potential of the speaker embedding, the authors propose feeding the extracted target speech back as the anchor and training a feedback TEnet, which outperforms the short-anchored baseline by 22.5% on word error rate and 15.5% on signal-to-distortion ratio.
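
The accumulation idea can be illustrated with a minimal sketch. The abstract does not say how the embeddings are combined, so the running mean below is an assumption, and the class name `EmbeddingAccumulator`, its methods, and the speaker ID `"spk1"` are hypothetical names chosen for illustration, not the authors' implementation. Each per-utterance embedding is treated as a plain list of floats.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


class EmbeddingAccumulator:
    """Running mean of per-utterance embeddings, keyed by speaker ID.

    Assumption: "accumulated" is modelled as a simple average; the
    abstract does not specify the actual combination rule.
    """

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self._sums: Dict[str, List[float]] = defaultdict(lambda: [0.0] * dim)
        self._counts: Dict[str, int] = defaultdict(int)

    def update(self, speaker_id: str, embedding: Sequence[float]) -> None:
        """Add one utterance-level embedding for the given speaker."""
        if len(embedding) != self.dim:
            raise ValueError(f"expected {self.dim} values, got {len(embedding)}")
        sums = self._sums[speaker_id]
        for i, value in enumerate(embedding):
            sums[i] += value
        self._counts[speaker_id] += 1

    def mean(self, speaker_id: str) -> List[float]:
        """Return the averaged embedding over all utterances seen so far."""
        count = self._counts.get(speaker_id, 0)
        if count == 0:
            raise KeyError(speaker_id)
        return [s / count for s in self._sums[speaker_id]]


# Two utterances from the same (hypothetical) speaker.
acc = EmbeddingAccumulator(dim=3)
acc.update("spk1", [1.0, 0.0, 2.0])
acc.update("spk1", [3.0, 2.0, 0.0])
print(acc.mean("spk1"))  # [2.0, 1.0, 1.0]
```

Averaging over many utterances smooths out the variation that a single short or noisy anchor would introduce, which is the stability argument the abstract makes for training with accumulated embeddings.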


