IEEE Journal of Selected Topics in Signal Processing

2022 Index IEEE Journal of Selected Topics in Signal Processing Vol. 16
Abstracts: Presents the 2022 author/subject index for this issue of the publication.
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, Yonghui Wu
Keywords: Automatic speech recognition; Training; Benchmark testing; Semisupervised learning; Speech recognition; semi-supervised learning (artificial intelligence); speech recognition; ASR task; data efficiency; dataset sizes; diverse unlabeled datasets; downstream tasks; giant automatic speech recognition models pre-trained; large-scale semisupervised; nonASR tasks; parameter pretrained conformer model; pre-trained networks; SoTA results; speech domains; training data; training set; Giant model; large-scale self-supervised learning; self-supervised learning; semisupervised learning; speech recognition
Abstracts: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei
Keywords: Predictive models; Self-supervised learning; Speech processing; Speech recognition; Convolution; Benchmark testing; learning (artificial intelligence); signal denoising; speaker recognition; speech processing; speech recognition; full stack speech processing; full-stack downstream speech tasks; information including speaker identity; input speech; Large-scale self-supervised pre-training; masked speech prediction; nonASR tasks; self-supervised learning; speech content modeling capability; speech processing tasks; speech recognition; speech signal; spoken content; training dataset; WavLM; Self-supervised learning; speech pre-training
Abstracts: Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising in pre-training. By this means, WavLM not only keeps the speech content modeling capability through masked speech prediction, but also improves its potential for non-ASR tasks through speech denoising. In addition, WavLM employs gated relative position bias for the Transformer structure to better capture the sequence ordering of input speech. We also scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks.
SAMU-XLSR: Semantically-Aligned Multimodal Utterance-Level Cross-Lingual Speech Representation
Sameer Khurana, Antoine Laurent, James Glass
Keywords: Training; Speech processing; Machine translation; Representation learning; language translation; learning (artificial intelligence); natural language processing; speech processing; speech recognition; text analysis; cross-lingual speech-text; cross-lingual speech-to-speech translation retrieval; cross-lingual speech-to-text translation retrieval; Language Agnostic BERT Sentence Embedding model; learned representation space; learning multimodal multilingual speech; multilingual acoustic frame-level speech representation; multilingual contextual speech; multilingual transcribed speech data; multimodal utterance-level cross-lingual speech representation learning framework; SAMU-XLSR; SAMU-XLSR speech; speech translation retrieval tasks; speech-speech associations; utterance-level multimodal multilingual speech; Cross-lingual speech representation learning; Language-agnostic speech embedding; zero-shot speech-to-text translation retrieval; zero-shot speech-to-speech translation retrieval
Abstracts: We propose SAMU-XLSR: the Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous works on speech representation learning, which learn multilingual contextual speech embeddings at the resolution of an acoustic frame (10–20 ms), this work focuses on learning multimodal (speech-text) multilingual speech embeddings at the resolution of a sentence (5–10 s) such that the embedding vector space is semantically aligned across different languages. We combine the state-of-the-art multilingual acoustic frame-level speech representation learning model XLSR with the Language Agnostic BERT Sentence Embedding (LaBSE) model to create an utterance-level multimodal multilingual speech encoder, SAMU-XLSR. Although we train SAMU-XLSR with only multilingual transcribed speech data, cross-lingual speech-text and speech-speech associations emerge in its learned representation space. To substantiate our claims, we use the SAMU-XLSR speech encoder in combination with a pre-trained LaBSE text sentence encoder for cross-lingual speech-to-text translation retrieval, and SAMU-XLSR alone for cross-lingual speech-to-speech translation retrieval. We highlight these applications by performing several cross-lingual text and speech translation retrieval tasks across several datasets.
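The translation retrieval described in this abstract can be pictured as a nearest-neighbour search in a shared embedding space. The sketch below is only illustrative: the abstract does not state the similarity measure, so cosine similarity is an assumption, and the function names and toy vectors are hypothetical stand-ins for real speech and text encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Return the index of the candidate embedding most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


# Hypothetical toy embeddings: one speech utterance and three candidate sentences.
speech_embedding = [0.9, 0.1, 0.0]
text_embeddings = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

best = retrieve(speech_embedding, text_embeddings)
print(best)  # prints 1
```

If the candidates are speech embeddings rather than text embeddings, the same search gives speech-to-speech retrieval.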
Self-Supervised Learning of Audio Representations From Audio-Visual Data Using Spatial Alignment
Shanshan Wang, Archontis Politis, Annamaria Mesaros, Tuomas Virtanen
Keywords: Self-supervised learning; Spatial audio; Representation learning; Training; Audio-visual systems; acoustic signal processing; audio coding; audio signal processing; feature extraction; image classification; image representation; learning (artificial intelligence); object detection; signal classification; video signal processing; acoustic content; ambisonic audio; audio content; audio input; audio representations; audio signal; audio-only downstream tasks; audio-visual correspondence; audio-visual data; audio-visual spatial alignment; AVSA; binaural audio; different audio formats; learnt audio feature representation; object detection; self-supervised representation; sophisticated alignment task; spatial audio features; spatial location; supervised learning; visual content; visual objects; Audio classification; audio-visual correspondence; audio-visual data; audio-visual spatial alignment; feature learning; self-supervised learning
Abstracts: Learning from audio-visual data offers many possibilities to express correspondence between the audio and visual content, similar to the human perception that relates aural and visual information. In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than the audio-visual correspondence (AVC). In addition to the correspondence, AVSA also learns from the spatial location of acoustic and visual content. Based on 360° video and Ambisonics audio, we propose selection of visual objects using object detection, and beamforming of the audio signal towards the detected objects, attempting to learn the spatial alignment between objects and the sound they produce. We investigate the use of spatial audio features to represent the audio input, and different audio formats: Ambisonics, mono, and stereo. Experimental results show a 10% improvement on AVSA for the first order ambisonics intensity vector (FOA-IV) in comparison with log-mel spectrogram features; the addition of object-oriented crops also brings significant performance increases for the human action recognition downstream task. A number of audio-only downstream tasks are devised for testing the effectiveness of the learnt audio feature representation, obtaining performance comparable to state-of-the-art methods on acoustic scene classification from ambisonic and binaural audio.
Keyword Localisation in Untranscribed Speech Using Visually Grounded Speech Models
Kayode Olaleye, Dan Oneaţă, Herman Kamper
Keywords: Training; Predictive models; Self-supervised learning; Automatic speech recognition; Keyword search; feature extraction; image classification; learning (artificial intelligence); natural language processing; object recognition; query processing; speech recognition; text analysis; vocabulary; arbitrary prediction model; extent keyword localisation; individual keyword localisation performance; input masking; keyword spotting; localisations capabilities; masked-based localisation; query keyword; reported localisation scores; saliency approach; speech utterance; training images; unordered bag-of-word-supervision; VGS model; visually grounded speech model; written keyword; Visually grounded speech models; keyword localisation; keyword spotting; self-supervised learning
Abstracts: Keyword localisation is the task of finding where in a speech utterance a given query keyword occurs. We investigate to what extent keyword localisation is possible using a visually grounded speech (VGS) model. VGS models are trained on unlabelled images paired with spoken captions. These models are therefore self-supervised: trained without any explicit textual label or location information. To obtain training targets, we first tag training images with soft text labels using a pretrained visual classifier with a fixed vocabulary. This enables a VGS model to predict the presence of a written keyword in an utterance, but not its location. We consider four ways to equip VGS models with localisation capabilities. Two of these (a saliency approach and input masking) can be applied to an arbitrary prediction model after training, while the other two (attention and a score aggregation approach) are incorporated directly into the structure of the model. Masked-based localisation gives some of the best reported localisation scores from a VGS model, with an accuracy of 57% when the system knows that a keyword occurs in an utterance and needs to predict its location. In a setting where localisation is performed after detection, an $F_1$ of 25% is achieved, and in a setting where a keyword spotting ranking pass is first performed, a localisation P@10 of 32% is obtained. While these scores are modest compared to the idealised setting with unordered bag-of-word-supervision (from transcriptions), these VGS models do not receive any textual or location supervision. Further analyses show that these models are limited by the first detection or ranking pass. Moreover, individual keyword localisation performance is correlated with the tagging performance from the visual classifier. We also show qualitatively how and where semantic mistakes occur, e.g. that the model locates "surfer" when queried with "ocean".
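Input masking, one of the two post-hoc localisation approaches named in this abstract, can be sketched as follows: mask one window of the input at a time, re-run the keyword detector, and report the window whose removal lowers the detection score the most. The abstract does not give the procedure in detail, so the window size, the zero masking value, and the toy scoring function below are all assumptions made for illustration.

```python
def localise_by_masking(frames, score, window=2):
    """Return the start index of the window whose masking most reduces the score.

    `score` stands in for any trained keyword detector that maps a sequence of
    frames to a detection score for one query keyword (hypothetical interface).
    """
    base = score(frames)
    best_start, best_drop = 0, float("-inf")
    for start in range(len(frames) - window + 1):
        # Replace the window with zeros (assumed masking value).
        masked = frames[:start] + [0.0] * window + frames[start + window:]
        drop = base - score(masked)
        if drop > best_drop:
            best_start, best_drop = start, drop
    return best_start


def toy_score(frames):
    """Toy detector: the score is simply the largest frame value."""
    return max(frames)


frames = [0.1, 0.2, 0.9, 0.8, 0.1]
start = localise_by_masking(frames, toy_score, window=2)
print(start)  # prints 2
```

Because the function only calls `score`, it can be applied to an already trained model without changing its structure, which is the property the abstract attributes to the masking approach.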
Momentum Pseudo-Labeling: Semi-Supervised ASR With Continuously Improving Pseudo-Labels
Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori
Keywords: Training; Predictive models; Deep learning; Semisupervised learning; Labeling; Automatic speech recognition; deep learning (artificial intelligence); speech recognition; supervised learning; ASR training; connectionist temporal classification-based model; continuously improving pseudolabels; deep neural network architecture; end-to-end ASR system; end-to-end automatic speech recognition; mean teacher method; model-building process; momentum pseudolabeling; MPL; offline model; online model parameters; seed model; semisupervised ASR; semisupervised learning approach; semisupervised methods; speech-text pairs; speech-to-text conversion; unlabeled data; Deep learning; end-to-end speech recognition; pseudo-labeling; self-training; semi-supervised learning
Abstracts: End-to-end automatic speech recognition (ASR) has become a popular alternative to traditional module-based systems, simplifying the model-building process with a single deep neural network architecture. However, the training of end-to-end ASR systems is generally data-hungry: a large amount of labeled data (speech-text pairs) is necessary to learn direct speech-to-text conversion effectively. To make the training less dependent on labeled data, pseudo-labeling, a semi-supervised learning approach, has been successfully introduced to end-to-end ASR, where a seed model is self-trained with pseudo-labels generated from unlabeled (speech-only) data. Here, we propose momentum pseudo-labeling (MPL), a simple yet effective strategy for semi-supervised ASR. MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method. The online model is trained to predict pseudo-labels generated on the fly by the offline model. The offline model maintains an exponential moving average of the online model parameters. The interaction between the two models allows better ASR training on unlabeled data by continuously improving the quality of pseudo-labels. We apply MPL to a connectionist temporal classification-based model and evaluate it on various semi-supervised scenarios with varying amounts of data or domain mismatch. The results demonstrate that MPL significantly improves the seed model by stabilizing the training on unlabeled data. Moreover, we present additional techniques, e.g., the use of Conformer and an external language model, to further enhance MPL, which leads to better performance than other semi-supervised methods based on pseudo-labeling.
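The exponential moving average that ties the offline model to the online model can be written in a few lines. This is a minimal sketch rather than the authors' implementation: parameters are represented as a plain dictionary of floats, and the momentum value `alpha` is a hypothetical choice made so the arithmetic is easy to follow.

```python
def ema_update(offline, online, alpha=0.999):
    """Return updated offline parameters: alpha * offline + (1 - alpha) * online."""
    return {
        name: alpha * offline[name] + (1.0 - alpha) * online[name]
        for name in offline
    }


# Toy parameters (hypothetical): the online weight stays at 1.0 for three steps.
offline = {"w": 0.0}
online = {"w": 1.0}
for _ in range(3):
    # In a real training loop the online model would take a gradient step here,
    # using pseudo-labels produced by the offline model.
    offline = ema_update(offline, online, alpha=0.5)

print(offline["w"])  # prints 0.875
```

With a momentum close to 1, the offline parameters change slowly, so the pseudo-labels they produce are more stable than those of the rapidly changing online model.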
Are Discrete Units Necessary for Spoken Language Modeling?
Tu Anh Nguyen, Benoit Sagot, Emmanuel Dupoux
Keywords: Feature extraction; Unsupervised learning; Modeling; Training; Predictive models; learning (artificial intelligence); linguistics; natural language processing; speech processing; speech recognition; text analysis; discrete bottleneck necessary; discrete units; discrete versus continuous representations; language model; language modeling performances; pseudotext; spoken language modeling; Discrete units; HuBERT; spoken language modeling
Abstracts: Recent work in spoken language modeling shows the possibility of learning a language in an unsupervised manner from raw audio without any text labels. The approach relies first on transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on such pseudo-text. Is such a discrete bottleneck necessary, potentially introducing irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we study the role of discrete versus continuous representations in spoken language modeling. We show that discretization is indeed essential for good results in spoken language modeling. We show that discretization removes linguistically irrelevant information from the continuous features, helping to improve language modeling performance. On the basis of this study, we train a language model on the discrete units of the HuBERT features, reaching new state-of-the-art results in the lexical, syntactic and semantic metrics of the Zero Resource Speech Challenge 2021 (Track 1 - Speech Only).
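The step of turning continuous features into discrete units can be illustrated with nearest-centroid quantisation. The abstract does not say how the units are obtained, so this sketch assumes a codebook of centroids (for example, from a clustering step) already exists; the centroid and feature values are hypothetical toy data.

```python
def quantise(features, centroids):
    """Map each feature frame to the index of its nearest centroid (squared distance)."""
    units = []
    for frame in features:
        distances = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(distances.index(min(distances)))
    return units


# Hypothetical codebook with two units, and four two-dimensional feature frames.
centroids = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.1], [1.1, 0.8]]

units = quantise(features, centroids)
print(units)  # prints [0, 1, 0, 1]
```

The resulting integer sequence plays the role of the pseudo-text on which a language model is trained; small variations within a cluster are discarded, which is one way to picture the removal of irrelevant detail that the abstract describes.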
Self-Supervised Graphs for Audio Representation Learning With Limited Labeled Data
Amir Shirian, Krishna Somandepalli, Tanaya Guha
Keywords: Training; Graph neural networks; Speech recognition; Self-supervised learning; Feature extraction; Emotion recognition; emotion recognition; graph theory; learning (artificial intelligence); speech recognition; acoustic event classification; audio domain; audio representation; audio sample; available training data; benchmark audio datasets; effective audio representations; fully supervised models; generalized audio representations; graph construction; graph node; high-quality manual labels; highly limited labelled data; labelled audio samples; large-scale databases; self-supervised graph approach; self-supervised graphs; self-supervision tasks; semisupervised model performs; speech emotion recognition; subgraph-based framework; unlabeled audio samples; Acoustic event classification; graph neural network; speech emotion recognition; self-supervised learning; semi-supervised learning; sub-graph construction
Abstracts: Large-scale databases with high-quality manual labels are scarce in the audio domain. We thus explore a self-supervised graph approach to learning audio representations from highly limited labelled data. Considering each audio sample as a graph node, we propose a subgraph-based framework with novel self-supervision tasks to learn effective audio representations. During training, subgraphs are constructed by sampling the entire pool of available training data to exploit the relationship between the labelled and unlabeled audio samples. During inference, we use random edges to alleviate the overhead of graph construction. We evaluate our model on three benchmark audio datasets spanning two tasks: acoustic event classification and speech emotion recognition. We show that our semi-supervised model performs better than or on par with fully supervised models and outperforms several competitive existing models. Our model is compact and can produce generalized audio representations robust to different types of signal noise.
Autoregressive Predictive Coding: A Comprehensive Study
Gene-Ping Yang, Sung-Lin Yeh, Yu-An Chung, James Glass, Hao Tang
Keywords: Automatic speech recognition; Predictive coding; Representation learning; Self-supervised learning; Autoregressive processes; learning (artificial intelligence); speaker recognition; speech processing; speech recognition; APC; automatic speech recognition; autoregressive predictive coding; common speech tasks; duration prediction; fine-grained tasks; frame classification; future frame; high-level speech information; learned representation; speaker verification; speech representation; Automatic speech recognition; predictive coding; representation learning; self-supervised learning; speaker verification
Abstracts: We review autoregressive predictive coding (APC), an approach to learning speech representations by predicting a future frame given the past frames. We present three different views of interpreting APC, and provide a historical account of the approach. To study the speech representation learned by APC, we use common speech tasks, such as automatic speech recognition and speaker verification, to demonstrate the utility of the learned representation. In addition, we design a suite of fine-grained tasks, including frame classification, segment classification, fundamental frequency tracking, and duration prediction, to probe the phonetic and prosodic content of the representation. The three views of the APC objective welcome various generalizations and algorithms to learn speech representations. Probing on the suite of fine-grained tasks suggests that APC makes a wide range of high-level speech information accessible in its learned representation.
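The idea of predicting a future frame from past frames can be made concrete with a small loss computation. This is a sketch under stated assumptions: the abstract does not name the loss, so the mean absolute (L1) error used here is an assumption, and `copy_last_frame` is a hypothetical stand-in for a trained autoregressive model.

```python
def apc_l1_loss(frames, predict, shift=1):
    """Mean absolute error between predictions from past frames and future frames.

    At each time step t, `predict` sees frames[0..t] only and its output is
    compared with the frame `shift` steps ahead.
    """
    total, count = 0.0, 0
    for t in range(len(frames) - shift):
        predicted = predict(frames[: t + 1])  # only past and current frames
        target = frames[t + shift]            # the future frame to predict
        total += sum(abs(p - y) for p, y in zip(predicted, target))
        count += len(target)
    return total / count


def copy_last_frame(past):
    """Trivial baseline predictor: repeat the most recent frame."""
    return past[-1]


# Toy sequence of four two-dimensional frames (hypothetical values).
frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
loss = apc_l1_loss(frames, copy_last_frame, shift=1)
print(loss)  # prints 0.5
```

A learned predictor would replace `copy_last_frame`, and its hidden states, rather than its predictions, would typically serve as the speech representation that the probing tasks examine.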