Papers shared by user song.
2 shared papers found.
song (2022-10-31 12:02):
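#paper Conditional Diffusion Probabilistic Model for Speech Enhancement, https://arxiv.org/abs/2202.05256# Standard diffusion models do not perform well on speech-related tasks. The reason is that they assume all noise is Gaussian, whereas in speech tasks only a small share of the noise is Gaussian (white noise); most of it is stationary or non-stationary noise of various kinds. This paper addresses the problem by conditioning both the diffusion and reverse processes not only on the previous step's output but also on the noisy speech, y. Each step, rather than applying pure Gaussian noise, applies Gaussian noise combined with the difference between the noisy speech and the speech at the current step. In this way the model learns the characteristics of the noisy speech (the non-Gaussian noise). The method solves the problem of applying diffusion models to non-Gaussian data. Speech enhancement is a special case, however: its datasets already come with paired clean and noisy speech, which makes the task well suited to this method, while other speech tasks do not necessarily have clean speech as input. Voice conversion, for example, lacks large amounts of target speech to serve as clean input, so there is room for further research building on this work.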
arXiv, 2022.
Speech enhancement is a critical component of many user-oriented audio applications, yet current systems still suffer from distorted and unnatural outputs. While generative models have shown strong potential in speech synthesis, they are still lagging behind in speech enhancement. This work leverages recent advances in diffusion probabilistic models, and proposes a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes. More specifically, we propose a generalized formulation of the diffusion probabilistic model named conditional diffusion probabilistic model that, in its reverse process, can adapt to non-Gaussian real noises in the estimated speech signal. In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models, and investigate the generalization capability of our models to other datasets with noise characteristics unseen during training.
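The conditioning on noisy speech described in this post can be sketched roughly as follows. This is a minimal illustration, assuming the diffusion step centers x_t on an interpolation between the clean speech x0 and the noisy speech y with weight m, and shrinks the Gaussian variance to compensate. The function name, the argument names, and the exact variance expression are assumptions made for illustration, not a transcription of the paper's equations.

```python
import math
import random


def conditional_diffuse(x0, y, alpha_bar, m, rng):
    """Draw x_t from clean speech x0 and noisy speech y (equal-length lists).

    alpha_bar: cumulative signal-retention factor in (0, 1].
    m: interpolation weight in [0, 1]; m=0 ignores y, m=1 centers x_t on y.
    """
    if not 0.0 <= m <= 1.0:
        raise ValueError("m must lie in [0, 1]")
    # Assumed variance: the usual (1 - alpha_bar) minus the share that is
    # already accounted for by mixing in the real noise carried by y.
    delta = (1.0 - alpha_bar) - m * m * alpha_bar
    if delta < 0.0:
        raise ValueError("m is too large for this alpha_bar")
    scale = math.sqrt(alpha_bar)
    std = math.sqrt(delta)
    return [
        scale * ((1.0 - m) * clean + m * noisy) + std * rng.gauss(0.0, 1.0)
        for clean, noisy in zip(x0, y)
    ]


rng = random.Random(0)
clean = [1.0, -2.0]
noisy = [3.0, 0.0]
# With alpha_bar = 1 and m = 0 nothing is diffused: x_t equals the clean signal.
start = conditional_diffuse(clean, noisy, 1.0, 0.0, rng)
# With m = 1 and alpha_bar = 0.5 the variance term vanishes: x_t is a scaled copy of y.
end = conditional_diffuse(clean, noisy, 0.5, 1.0, rng)
```

With m = 0 the sketch reduces to an ordinary Gaussian diffusion step, so the weight m is the only place where the non-Gaussian noise in y enters.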
song (2022-09-09 09:04):
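#paper https://doi.org/10.48550/arXiv.2206.13236 Pruned RNN-T for fast, memory-efficient ASR training. From Xiaomi's Next-gen Kaldi team. RNN-T is one of the mainstream paradigms for end-to-end speech recognition, and among streaming models it currently performs best and is the easiest to deploy in production. Its drawback is that training uses at least an order of magnitude more memory than other mainstream models. The reason is that, compared with models such as CTC and attention, its memory footprint carries an extra dimension: the decoder's output length, U, which typically ranges from a few dozen to a few hundred. This paper proposes a pruning method that reduces U without degrading model performance. The team first observed that not every node in the RNN-T loss computation actually contributes to it. The number of nodes is proportional to the output length U, so selecting and keeping only the nodes that matter for training reduces memory usage and speeds up training. In the gradient computation, only a contiguous band of nodes in the middle contributes to training; depending on the scenario, the width of this band, S, is 4 or 5. In the experiments, training time is about one sixteenth of the previous SOTA, memory usage is about one fifth, and model performance drops by only 0.05%. In my own trial, 4 V100 GPUs and a small amount of hyperparameter tuning were enough to fully reproduce and deploy it. This should greatly reduce the cost and effort for small and medium-sized companies to bring SOTA models into their products.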
The RNN-Transducer (RNN-T) framework for speech recognition has been growing in popularity, particularly for deployed real-time ASR systems, because it combines high accuracy with naturally streaming recognition. One of the drawbacks of RNN-T is that its loss function is relatively slow to compute, and can use a lot of memory. Excessive GPU memory usage can make it impractical to use RNN-T loss in cases where the vocabulary size is large: for example, for Chinese character-based ASR. We introduce a method for faster and more memory-efficient RNN-T loss computation. We first obtain pruning bounds for the RNN-T recursion using a simple joiner network that is linear in the encoder and decoder embeddings; we can evaluate this without using much memory. We then use those pruning bounds to evaluate the full, non-linear joiner network.
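To make the effect of the pruning in this post concrete, here is a small sketch. It assumes the joiner output has shape (batch, frames, symbols, vocab) and that pruning keeps a band of S consecutive symbol positions per frame. The helper names, the tensor sizes, and the per-frame start positions are hypothetical, and the count covers only the joiner output, not the memory of the whole model.

```python
def joiner_output_count(batch, frames, symbols, vocab):
    """Entries in a joiner output of shape (batch, frames, symbols, vocab)."""
    return batch * frames * symbols * vocab


def prune_band(lattice, starts, band):
    """Keep `band` consecutive symbol positions per frame, from starts[t]."""
    return [row[s:s + band] for row, s in zip(lattice, starts)]


# Illustrative sizes only: U = 100 symbol positions versus a band of S = 5.
full = joiner_output_count(batch=8, frames=500, symbols=100, vocab=500)
pruned = joiner_output_count(batch=8, frames=500, symbols=5, vocab=500)
print(full // pruned)  # ratio of joiner entries, equal to U / S here

# Toy lattice with 4 frames and 6 symbol positions; starts is hypothetical
# and would come from the pruning bounds in a real system.
lattice = [[(t, u) for u in range(6)] for t in range(4)]
kept = prune_band(lattice, starts=[0, 0, 1, 2], band=3)
print(kept[2])  # the three positions kept for frame 2
```

Because the band width is fixed, the size of the pruned joiner output no longer grows with U, which is why the saving matters most for long transcripts and large vocabularies.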