Papers shared by user 张浩彬.
33 paper shares found in total; this page shows items 1 - 20.
1.
张浩彬
(2024-10-30 10:19):
#paper
AdapterFusion: Non-Destructive Task Composition for Transfer Learning
https://doi.org/10.48550/arXiv.2005.00247
An improved version of adapters: AdapterFusion. In short, a separate adapter is trained for each task, and the adapters are then composed to achieve better knowledge fusion.
Abstract in brief: Sequential fine-tuning and multi-task learning aim to incorporate knowledge from multiple tasks, but they suffer from catastrophic forgetting and difficulties in dataset balancing. To address these shortcomings, the authors propose AdapterFusion, a new two-stage learning algorithm that leverages knowledge from multiple tasks. First, in a knowledge extraction stage, task-specific parameters called adapters are learned that encapsulate task-specific information. The adapters are then combined in a separate knowledge composition step. By separating the two stages, knowledge extraction and knowledge composition, the classifier can exploit the representations learned from multiple tasks in a non-destructive manner. AdapterFusion is evaluated empirically on 16 diverse NLU tasks and effectively combines various types of knowledge at different layers of the model, outperforming traditional strategies such as full fine-tuning and multi-task learning. The code and adapters are available at AdapterHub.ml.
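A minimal PyTorch sketch of the two pieces described above, a bottleneck adapter and an attention-style fusion over several task adapters; this is an illustrative reconstruction under assumed module names and sizes, not the released AdapterHub implementation.

```python
# Minimal sketch of adapters + AdapterFusion-style composition (illustrative only;
# the released implementation at AdapterHub.ml differs in detail).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

class AdapterFusion(nn.Module):
    """Attention over the outputs of several frozen, task-specific adapters."""
    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, h, adapter_outputs):
        # adapter_outputs: list of [batch, seq, d_model], one per task adapter
        stacked = torch.stack(adapter_outputs, dim=2)          # [B, S, N, D]
        q = self.query(h).unsqueeze(2)                         # [B, S, 1, D]
        k = self.key(stacked)                                  # [B, S, N, D]
        v = self.value(stacked)
        scores = (q * k).sum(-1) / (h.size(-1) ** 0.5)         # [B, S, N]
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)  # [B, S, N, 1]
        return h + (weights * v).sum(dim=2)                    # fused residual

# Stage 1: train one Adapter per task (backbone frozen).
# Stage 2: freeze all adapters and train only AdapterFusion on the target task.
adapters = [Adapter(768) for _ in range(3)]
fusion = AdapterFusion(768)
h = torch.randn(2, 16, 768)                      # hidden states from a frozen backbone
fused = fusion(h, [a(h) for a in adapters])
print(fused.shape)                               # torch.Size([2, 16, 768])
```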
arXiv,
2020-05-01T07:03:42Z.
DOI: 10.48550/arXiv.2005.00247
Abstract:
Sequential fine-tuning and multi-task learning are methods aiming to incorporate knowledge from multiple tasks; however, they suffer from catastrophic forgetting and difficulties in dataset balancing. To address these shortcomings, we propose AdapterFusion, a new two stage learning algorithm that leverages knowledge from multiple tasks. First, in the knowledge extraction stage we learn task specific parameters called adapters, that encapsulate the task-specific information. We then combine the adapters in a separate knowledge composition step. We show that by separating the two stages, i.e., knowledge extraction and knowledge composition, the classifier can effectively exploit the representations learned from multiple tasks in a non-destructive manner. We empirically evaluate AdapterFusion on 16 diverse NLU tasks, and find that it effectively combines various types of knowledge at different layers of the model. We show that our approach outperforms traditional strategies such as full fine-tuning as well as multi-task learning. Our code and adapters are available at AdapterHub.ml.
2.
张浩彬
(2024-09-30 17:03):
#paper DOI 10.48550/arXiv.1902.00751 Parameter-Efficient Transfer Learning for NLP. ICML 2019, Google. This paper introduced the Adapter and can be regarded as the opening work of the PEFT (parameter-efficient fine-tuning) line of methods. I have recently been compiling the classic PEFT papers for a course on large models, and this one is the most fitting opener.
Abstract in brief: Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, with many downstream tasks, fine-tuning is parameter-inefficient: an entire new model is required for every task. As an alternative, the authors propose transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. To demonstrate the effectiveness of adapters, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance while adding only a few parameters per task. On GLUE they come within 0.4% of full fine-tuning performance while adding only 3.6% parameters per task; by contrast, fine-tuning trains 100% of the parameters per task.
The paper notes that earlier domain adaptation approaches all require training a separate model, generally via one of two strategies: feature-based transfer and fine-tuning. In feature-based transfer, a pre-trained embedding model provides features that are then fed into a task-specific downstream model.
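As a quick illustration of the parameter-efficiency argument, the sketch below freezes a Transformer backbone and trains only small bottleneck modules, then counts trainable parameters. The backbone, adapter size and placement are illustrative assumptions, not the paper's exact BERT setup (which inserts adapters after the attention and feed-forward sub-layers).

```python
# Illustrative sketch of the adapter idea: freeze a pretrained backbone and train
# only small bottleneck modules per task. Sizes and placement are assumptions.
import torch.nn as nn

def bottleneck_adapter(d_model=768, r=48):
    return nn.Sequential(nn.Linear(d_model, r), nn.GELU(), nn.Linear(r, d_model))

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)
for p in backbone.parameters():        # pretrained weights stay fixed
    p.requires_grad = False

adapters = nn.ModuleList([bottleneck_adapter() for _ in range(12)])  # conceptually one per layer

total = sum(p.numel() for p in backbone.parameters())
trainable = sum(p.numel() for p in adapters.parameters())
print(f"trainable adapter params: {trainable:,} "
      f"({100 * trainable / (total + trainable):.1f}% of the combined model)")
```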
arXiv,
2019-02-02T16:29:47Z.
DOI: 10.48550/arXiv.1902.00751
Abstract:
Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we propose transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. To demonstrate adapter's effectiveness, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. By contrast, fine-tuning trains 100% of the parameters per task.
3.
张浩彬
(2024-08-16 20:33):
#paper SMDE: Unsupervised representation learning for time series based on signal mode decomposition and ensemble doi: https://doi.org/10.1016/j.knosys.2024.112369
This month I am reading my own just-published paper, partly as self-promotion. In this work we propose SMDE, a new contrastive learning framework for time series that, on top of instance-level contrast, is the first to bring mode-level contrast into contrastive learning, deepening the understanding of complex time series dynamics. We further propose pretext tasks tailored to the characteristics of time series, namely global signal consistency and local mode consistency, and on this basis introduce a new loss function, DE Circle loss. We achieve state-of-the-art results in extensive semi-supervised experiments. Honestly, although the fully supervised results are also strong, I personally think the semi-supervised setting is where our work stands out.
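As a rough illustration only, the sketch below combines an instance-level InfoNCE contrast with a crude mode-level contrast. It is not the SMDE implementation: the paper's signal mode decomposition, the global/local consistency proxy tasks and the DE Circle loss are not reproduced, and the "modes" here are approximated by FFT band-splitting, which is purely an assumption for illustration.

```python
# Toy illustration of combining instance-level and mode-level contrast (NOT SMDE).
import torch
import torch.nn.functional as F

def fft_band_modes(x, n_modes=3):
    """Split a batch of series [B, T] into n_modes frequency bands (toy 'modes')."""
    spec = torch.fft.rfft(x, dim=-1)
    bands = torch.chunk(torch.arange(spec.size(-1)), n_modes)
    modes = []
    for idx in bands:
        masked = torch.zeros_like(spec)
        masked[..., idx] = spec[..., idx]
        modes.append(torch.fft.irfft(masked, n=x.size(-1), dim=-1))
    return modes  # list of [B, T]

def info_nce(z1, z2, tau=0.1):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                      # [B, B]; diagonal entries are positives
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

encoder = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))

x = torch.randn(8, 128)
view1, view2 = x + 0.1 * torch.randn_like(x), x + 0.1 * torch.randn_like(x)  # two augmentations
loss = info_nce(encoder(view1), encoder(view2))       # instance-level contrast
for m1, m2 in zip(fft_band_modes(view1), fft_band_modes(view2)):
    loss = loss + info_nce(encoder(m1), encoder(m2))  # mode-level contrast (same band = positive)
print(loss.item())
```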
Knowledge-Based Systems,
2024-8.
DOI: 10.1016/j.knosys.2024.112369
Abstract:
No abstract available.
4.
张浩彬
(2024-07-29 13:18):
#paper DOI: https://doi.org/10.1038/s41586-024-07566-y AI models collapse when trained on recursively generated data. A Nature paper on synthetic corpora for large models. It discusses how the inclusion of synthetic text in the training data (possibly unintentionally, since the web already contains a large amount of LLM-generated text) leads to model collapse. That said, synthetic data remains an effective way to train large models, but it has to be filtered carefully.
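A toy numerical illustration of the intuition (not the paper's experiments): repeatedly refit a one-dimensional Gaussian to samples drawn from the previous generation's fit. Finite-sample refitting shrinks and perturbs the estimated spread, so the tails of the original distribution gradually disappear.

```python
# Toy "model collapse" sketch: each generation is trained only on samples from
# the previous generation's fitted model.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                         # generation 0: the "human" data distribution
n = 100                                      # small sample per generation makes the effect visible

for gen in range(1, 51):
    synthetic = rng.normal(mu, sigma, size=n)         # next model trains only on generated data
    mu, sigma = synthetic.mean(), synthetic.std()     # refit; E[sigma^2] shrinks by (n-1)/n per step
    if gen % 10 == 0:
        print(f"generation {gen:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
# sigma tends to drift downward over generations, so rare (tail) events become ever less likely.
```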
Abstract:
Stable diffusion revolutionized image creation from descriptive text. GPT-2 (ref. 1), GPT-3(.5) (ref. 2) and GPT-4 (ref. 3) demonstrated high performance across a variety of language tasks. ChatGPT introduced such language models to the public. It is now clear that generative artificial intelligence (AI) such as large language models (LLMs) is here to stay and will substantially change the ecosystem of online text and images. Here we consider what may happen to GPT-{n} once LLMs contribute much of the text found online. We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. We refer to this effect as 'model collapse' and show that it can occur in LLMs as well as in variational autoencoders (VAEs) and Gaussian mixture models (GMMs). We build theoretical intuition behind the phenomenon and portray its ubiquity among all learned generative models. We demonstrate that it must be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of LLM-generated content in data crawled from the Internet.
5.
张浩彬
(2024-06-30 10:34):
#paper https://doi.org/10.48550/arXiv.2403.10131 RAFT: Adapting Language Model to Domain Specific RAG
A paper I found quite inspiring. Pretraining large language models (LLMs) on large text corpora has become a standard paradigm. When these LLMs are used in downstream applications, new knowledge (e.g., time-critical news or private domain knowledge) is usually injected into the pretrained model either through RAG-based (Retrieval-Augmented Generation) prompting or through fine-tuning. However, how a model should best acquire such new knowledge remains an open question. This paper proposes Retrieval Augmented Fine-Tuning (RAFT): in short, fine-tune on the kind of material the RAG system will retrieve, and use chain-of-thought so the model gets familiar with the task. Of course, RAG and fine-tuning are in principle two separate approaches, and combining them feels somewhat like putting the cart before the horse; this is also the part I think the paper does not discuss clearly, though these open points are themselves new research space.
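A rough sketch of how a RAFT-style fine-tuning example could be assembled: the prompt mixes the oracle document with sampled distractor documents, and the target is a chain-of-thought answer that quotes the oracle verbatim. Field names, formatting and the oracle-dropout probability below are illustrative assumptions, not the paper's exact recipe.

```python
# Assemble a RAFT-style (question, documents, CoT answer) training example (illustrative).
import random

def build_raft_example(question, oracle_doc, doc_pool, n_distractors=3, p_drop_oracle=0.2, rng=random):
    distractors = rng.sample([d for d in doc_pool if d != oracle_doc], n_distractors)
    # some examples deliberately omit the oracle so the model also learns to answer without it
    docs = distractors if rng.random() < p_drop_oracle else distractors + [oracle_doc]
    rng.shuffle(docs)                                  # the model must learn to ignore distractors
    context = "\n\n".join(f"[Document {i+1}] {d}" for i, d in enumerate(docs))
    target = (f"Reasoning: the relevant passage states \"{oracle_doc}\", "
              f"which answers the question.\nAnswer: ...")   # CoT target citing the oracle
    return {"prompt": f"{context}\n\nQuestion: {question}", "target": target}

pool = ["Paris is the capital of France.", "The Nile is a river in Africa.",
        "Mount Everest is the highest mountain.", "Pandas eat bamboo."]
example = build_raft_example("What is the capital of France?", pool[0], pool)
print(example["prompt"])
```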
arXiv,
2024.
DOI: 10.48550/arXiv.2403.10131
Abstract:
Pretraining Large Language Models (LLMs) on large corpora of textual data is now a standard paradigm. When using these LLMs for many downstream applications, it is common to additionally bake in new knowledge (e.g., time-critical news, or private domain knowledge) into the pretrained model either through RAG-based-prompting, or fine-tuning. However, the optimal methodology for the model to gain such new knowledge remains an open question. In this paper, we present Retrieval Augmented FineTuning (RAFT), a training recipe that improves the model's ability to answer questions in a "open-book" in-domain settings. In RAFT, given a question, and a set of retrieved documents, we train the model to ignore those documents that don't help in answering the question, which we call, distractor documents. RAFT accomplishes this by citing verbatim the right sequence from the relevant document that would help answer the question. This coupled with RAFT's chain-of-thought-style response helps improve the model's ability to reason. In domain-specific RAG, RAFT consistently improves the model's performance across PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe to improve pre-trained LLMs to in-domain RAG. RAFT's code and demo are open-sourced at github.com/ShishirPatil/gorilla.
6.
张浩彬
(2024-05-31 07:31):
#paper doi:https://doi.org/10.48550/arXiv.2403.10131
RAFT: Adapting Language Model to Domain Specific RAG
A simple but effective idea. To adapt a general-purpose large model to a domain, we can fine-tune or use RAG, but the authors argue we should fine-tune for the RAG setting itself. RAFT is a general recipe for fine-tuning a pretrained large language model for domain-specific RAG. In domain-specific RAG, the model must answer questions based on a specific set of domain documents, e.g., private files within an enterprise. This differs from general RAG, where the model does not know in advance which domain it will be tested on. Put simply: fine-tuning is a closed-book exam answered from memory; RAG is an open-book exam where you can look things up even without memorizing; RAFT means skimming the textbook before the open-book exam, so that even without reading everything you roughly know what the questions look like, and that is fine because you can still look things up during the exam.
arXiv,
2024.
Abstract:
No abstract available.
7.
张浩彬
(2024-04-29 20:35):
#paper doi:
https://doi.org/10.48550/arXiv.2211.14730
A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
An ICLR 2023 paper proposing PatchTST. Inspired by the Vision Transformer, it brings the patching technique into time series problems, and it responds to an earlier paper arguing that Transformers are actually no better than simple linear models for time series ("Are Transformers effective for time series forecasting?", 2022), regaining state-of-the-art results. By late 2023, however, newer work appeared arguing that the key is not the Transformer itself but the patching technique.
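A minimal sketch of the patching idea: slice each univariate channel into overlapping subseries-level patches and project each patch to an embedding token for the Transformer. Patch length, stride and dimensions below are illustrative values, not the paper's tuned settings.

```python
# Patchify univariate channels of a multivariate series into Transformer input tokens.
import torch
import torch.nn as nn

patch_len, stride, d_model = 16, 8, 128
series = torch.randn(32, 7, 336)                    # [batch, channels, lookback]

# channel-independence: fold channels into the batch so every channel is its own series
x = series.reshape(-1, series.size(-1))             # [batch * channels, lookback]
patches = x.unfold(dimension=-1, size=patch_len, step=stride)  # [B*C, n_patches, patch_len]

embed = nn.Linear(patch_len, d_model)
tokens = embed(patches)                             # [B*C, n_patches, d_model] -> Transformer input
print(tokens.shape)                                 # torch.Size([224, 41, 128])
```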
arXiv,
2022.
DOI: 10.48550/arXiv.2211.14730
Abstract:
We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches which are served as input tokens to Transformer; (ii) channel-independence where each channel contains a single univariate time series that shares the same embedding and Transformer weights across all the series. Patching design naturally has three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend longer history. Our channel-independent patch time series Transformer (PatchTST) can improve the long-term forecasting accuracy significantly when compared with that of SOTA Transformer-based models. We also apply our model to self-supervised pre-training tasks and attain excellent fine-tuning performance, which outperforms supervised training on large datasets. Transferring of masked pre-trained representation on one dataset to others also produces SOTA forecasting accuracy. Code is available at: https://github.com/yuqinie98/PatchTST.
8.
张浩彬
(2023-06-30 11:45):
#paper The Capacity and Robustness Trade-off: Revisiting the Channel Independent Strategy for Multivariate Time Series Forecasting doi:
https://doi.org/10.48550/arXiv.2304.05206
The paper focuses on multivariate time series forecasting and examines the difference between channel-independent and joint (channel-dependent) prediction. It shows that, because of distribution shift, channel-independent prediction tends to work better, as it helps mitigate the shift and improves the model's generalization. It further shows that independent versus joint prediction is a trade-off between model capacity and model robustness. The paper then proposes ways to improve the accuracy of joint prediction, including regularization, low-rank factorization, using MAE instead of MSE, and adjusting the sequence length.
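A toy contrast of the two strategies with plain linear forecasters (shapes, models and data are illustrative assumptions, not the paper's setup): the channel-dependent (CD) model maps the full multivariate history to the multivariate future, while the channel-independent (CI) model shares one univariate mapping across channels.

```python
# CD vs CI forecasting strategies, illustrated with linear models on random data.
import torch
import torch.nn as nn

B, C, L, H = 64, 8, 96, 24                     # batch, channels, lookback, horizon
history = torch.randn(B, C, L)
future = torch.randn(B, C, H)

# CD: one model maps the flattened multivariate history to the multivariate future.
cd_model = nn.Linear(C * L, C * H)
cd_pred = cd_model(history.reshape(B, -1)).reshape(B, C, H)

# CI: a single shared model maps each channel's history to that channel's future;
# channels are treated as independent univariate series (folded into the batch).
ci_model = nn.Linear(L, H)
ci_pred = ci_model(history.reshape(B * C, L)).reshape(B, C, H)

loss = nn.functional.mse_loss
print(loss(cd_pred, future).item(), loss(ci_pred, future).item())
```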
arXiv,
2023.
DOI: 10.48550/arXiv.2304.05206
Abstract:
Multivariate time series data comprises various channels of variables. The multivariate forecasting models need to capture the relationship between the channels to accurately predict future values. However, recently, there has been an emergence of methods that employ the Channel Independent (CI) strategy. These methods view multivariate time series data as separate univariate time series and disregard the correlation between channels. Surprisingly, our empirical results have shown that models trained with the CI strategy outperform those trained with the Channel Dependent (CD) strategy, usually by a significant margin. Nevertheless, the reasons behind this phenomenon have not yet been thoroughly explored in the literature. This paper provides comprehensive empirical and theoretical analyses of the characteristics of multivariate time series datasets and the CI/CD strategy. Our results conclude that the CD approach has higher capacity but often lacks robustness to accurately predict distributionally drifted time series. In contrast, the CI approach trades capacity for robust prediction. Practical measures inspired by these analyses are proposed to address the capacity and robustness dilemma, including a modified CD method called Predict Residuals with Regularization (PRReg) that can surpass the CI strategy. We hope our findings can raise awareness among researchers about the characteristics of multivariate time series and inspire the construction of better forecasting models.
9.
张浩彬
(2023-05-30 11:48):
#paper DOI: 10.48550/arXiv.2010.04515
Principal Component Analysis using Frequency Components of Multivariate Time Series
The paper proposes a new spectral decomposition method for (second-order, weakly stationary) multivariate time series, so that the resulting subseries have non-zero spectral coherence within a group and zero spectral coherence across groups. In terms of writing it follows the classic structure of problem setup, method, asymptotic theory, simulation and empirical study, with a substantial amount of derivation.
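A rough NumPy sketch of the spectral flavor of the pipeline: form frequency components of a multivariate series by band-passing in the DFT domain, sum the covariance matrices of those components, and use an eigendecomposition of that sum as a linear (de)mixing transform. This only illustrates the flow of the idea; the paper's actual estimator, weighting and coherence test are considerably more involved.

```python
# Frequency components -> summed covariance -> eigendecomposition (illustrative pipeline).
import numpy as np

rng = np.random.default_rng(0)
T, p, n_bands = 1024, 4, 8
x = rng.standard_normal((T, p)) @ rng.standard_normal((p, p))   # linearly mixed series

spec = np.fft.rfft(x, axis=0)
bands = np.array_split(np.arange(spec.shape[0]), n_bands)

cov_sum = np.zeros((p, p))
for idx in bands:
    masked = np.zeros_like(spec)
    masked[idx] = spec[idx]
    component = np.fft.irfft(masked, n=T, axis=0)                # frequency component of the series
    cov_sum += np.cov(component, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(cov_sum)                       # symmetric, so eigh is appropriate
transformed = x @ eigvecs                                        # candidate lower-coherence components
print(np.round(np.cov(transformed, rowvar=False), 2))
```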
arXiv,
2020.
DOI: 10.48550/arXiv.2010.04515
Abstract:
Dimension reduction techniques for multivariate time series decompose the observed series into a few useful independent/orthogonal univariate components. We develop a spectral domain method for multivariate second-order stationary time series that linearly transforms the observed series into several groups of lower-dimensional multivariate subseries. These multivariate subseries have non-zero spectral coherence among components within a group but have zero spectral coherence among components across groups. The observed series is expressed as a sum of frequency components whose variances are proportional to the spectral matrices at the respective frequencies. The demixing matrix is then estimated using an eigendecomposition on the sum of the variance matrices of these frequency components and its asymptotic properties are derived. Finally, a consistent test on the cross-spectrum of pairs of components is used to find the desired segmentation into the lower-dimensional subseries. The numerical performance of the proposed method is illustrated through simulation examples and an application to modeling and forecasting wind data is presented.
10.
张浩彬
(2023-04-28 13:45):
#paper An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. DOI: 10.48550/arXiv.1803.01271
I have been giving a lot of talks on time series problems recently, so I read the original TCN paper carefully. Outside the RNN family, TCNs are still widely used. To obtain a large receptive field without adding too many layers, TCNs use dilated convolutions, and they avoid information leakage through padding and trimming (causal convolutions). A TCN block consists of two dilated causal convolutions with activation and normalization layers, plus a residual connection. The experiments show that TCNs are relatively insensitive to hyperparameters, but the kernel size k is key, and dropout and gradient clipping also help considerably.
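A compact PyTorch sketch of a TCN residual block: two dilated causal convolutions, with causality enforced by left padding. Weight normalization and other details of the original implementation are simplified, and the sizes below are illustrative.

```python
# Dilated causal convolution + residual block, the core of a TCN.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, c_in, c_out, k, dilation):
        super().__init__()
        self.pad = (k - 1) * dilation                 # pad only on the left (no future leakage)
        self.conv = nn.Conv1d(c_in, c_out, k, dilation=dilation)

    def forward(self, x):                             # x: [batch, channels, time]
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

class TCNBlock(nn.Module):
    def __init__(self, c_in, c_out, k=3, dilation=1, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(c_in, c_out, k, dilation), nn.ReLU(), nn.Dropout(p_drop),
            CausalConv1d(c_out, c_out, k, dilation), nn.ReLU(), nn.Dropout(p_drop),
        )
        self.downsample = nn.Conv1d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        return torch.relu(self.net(x) + self.downsample(x))   # residual connection

# Stacking blocks with dilations 1, 2, 4, ... grows the receptive field exponentially.
tcn = nn.Sequential(*[TCNBlock(1 if d == 1 else 32, 32, k=3, dilation=d) for d in (1, 2, 4, 8)])
print(tcn(torch.randn(8, 1, 100)).shape)             # torch.Size([8, 32, 100])
```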
arXiv,
2018.
DOI: 10.48550/arXiv.1803.01271
Abstract:
For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. To assist related work, we have made code available at this http URL .
11.
张浩彬
(2023-03-27 15:40):
#paper 10.1109/ijcnn52387.2021.9533426 Self-Supervised Pre-training for Time Series Classification
One of the relatively few transfer-learning papers for time series: it uses DTW distances to build a pretext task that constructs positive and negative samples for learning, with a Transformer as the encoder; not much novelty.
Abstract:
Recently, significant progress has been made in time series classification with deep learning. However, using deep learning models to solve time series classification generally suffers from expensive calculations and difficulty of data labeling. In this work, we study self-supervised time series pre-training to overcome these challenges. Compared with the existing works, we focus on the universal and unlabeled time series pretraining. To this end, we propose a novel end-to-end neural network architecture based on self-attention, which is suitable for capturing long-term dependencies and extracting features from different time series. Then, we propose two different self-supervised pretext tasks for time series data type: Denoising and Similarity Discrimination based on DTW (Dynamic Time Warping). Finally, we carry out extensive experiments on 85 time series datasets (also known as UCR2015 [2]). Empirical results show that the time series model augmented with our proposed self-supervised pretext tasks achieves state-of-the-art / highly competitive results.
12.
张浩彬
(2023-02-28 15:49):
#paper 10.5555/2503308.2188396 Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. Working through NCE. NCE mainly addresses one problem: when there are too many classes, computing the softmax normalization factor is too expensive, so the authors propose NCE as a substitute. They design a clever proxy task that turns the original classification problem into a binary classification problem of telling the target apart from noise samples, thereby avoiding the cost of computing the normalizing constant. They also show that as the noise sample size grows, NCE behaves like MLE.
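A toy sketch of the NCE idea for a 1-D unnormalized Gaussian model: discriminate data from noise with logistic regression, treating the log normalizing constant as just another learnable parameter. The model, noise distribution and hyperparameters are illustrative assumptions, not the paper's image experiments.

```python
# NCE as binary classification between data and noise, with a learned normalizer.
import torch

torch.manual_seed(0)
data = torch.randn(5000) * 0.5 + 1.0                       # true N(1, 0.5^2)
noise_dist = torch.distributions.Normal(0.0, 2.0)
noise = noise_dist.sample((5000,))                          # one noise sample per data point

mu = torch.zeros((), requires_grad=True)
log_prec = torch.zeros((), requires_grad=True)              # log(1 / sigma^2)
c = torch.zeros((), requires_grad=True)                     # learned log normalizing constant

def log_p_model(x):                                         # unnormalized log density + c
    return -0.5 * torch.exp(log_prec) * (x - mu) ** 2 + c

opt = torch.optim.Adam([mu, log_prec, c], lr=0.05)
for _ in range(500):
    # G(x) = log p_model(x) - log p_noise(x); data gets label 1, noise gets label 0
    logits = torch.cat([log_p_model(data) - noise_dist.log_prob(data),
                        log_p_model(noise) - noise_dist.log_prob(noise)])
    labels = torch.cat([torch.ones_like(data), torch.zeros_like(noise)])
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()

print(mu.item(), torch.exp(-0.5 * log_prec).item())         # should move toward 1.0 and 0.5
```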
Abstract:
We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities.
13.
张浩彬
(2023-01-30 13:34):
#paper https://doi.org/10.48550/arXiv.2202.01575 CoST: Contrastive Learning of Disentangled Seasonal-Trend Representations for Time Series Forecasting
1. The paper views a time series as the sum of three parts, a trend component, a seasonal component and an error term; what we want to learn are the trend and seasonal components.
2. Structurally, the raw series is mapped into a latent space by an encoder (a TCN), and two branches then disentangle the trend and seasonal components, each trained with its own contrastive objective:
a. For the trend component, the latent representation is fed into a mixture of auto-regressive experts for trend extraction, and a time-domain contrastive loss is applied, following MoCo.
b. For the seasonal component, a discrete Fourier transform maps the latent representation to the frequency domain, and the frequency-domain loss is defined on amplitude and phase (a toy sketch of this frequency-domain contrast follows this list).
3. The overall loss is the sum of the time-domain and frequency-domain losses.
4. Comparisons on five datasets against multiple baselines, including TS2Vec, TNC, MoCo, Informer, LogTrans and TCN, achieve SOTA results in most cases.
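The sketch below only gestures at the frequency-domain branch from item 2b: map latent sequence representations to the frequency domain with an FFT and contrast their amplitude and phase across instances. The exact CoST losses and the MoCo-style time-domain branch are not reproduced; all shapes and values are illustrative.

```python
# Toy frequency-domain contrast on amplitude and phase of latent representations.
import torch
import torch.nn.functional as F

def freq_contrast(z1, z2, tau=0.1):
    """z1, z2: two augmented views of latent sequences, shape [batch, time, dim]."""
    s1, s2 = torch.fft.rfft(z1, dim=1), torch.fft.rfft(z2, dim=1)
    loss = 0.0
    for a, b in ((s1.abs(), s2.abs()), (s1.angle(), s2.angle())):   # amplitude, then phase
        a = F.normalize(a.flatten(1), dim=-1)
        b = F.normalize(b.flatten(1), dim=-1)
        logits = a @ b.t() / tau                     # same instance across views = positive
        loss = loss + F.cross_entropy(logits, torch.arange(a.size(0)))
    return loss

z1, z2 = torch.randn(16, 64, 32), torch.randn(16, 64, 32)
print(freq_contrast(z1, z2).item())
```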
arXiv,
2022.
DOI: 10.48550/arXiv.2202.01575
Abstract:
Deep learning has been actively studied for time series forecasting, and the mainstream paradigm is based on the end-to-end training of neural network architectures, ranging from classical LSTM/RNNs to more recent TCNs and Transformers. Motivated by the recent success of representation learning in computer vision and natural language processing, we argue that a more promising paradigm for time series forecasting, is to first learn disentangled feature representations, followed by a simple regression fine-tuning step -- we justify such a paradigm from a causal perspective. Following this principle, we propose a new time series representation learning framework for time series forecasting named CoST, which applies contrastive learning methods to learn disentangled seasonal-trend representations. CoST comprises both time domain and frequency domain contrastive losses to learn discriminative trend and seasonal representations, respectively. Extensive experiments on real-world datasets show that CoST consistently outperforms the state-of-the-art methods by a considerable margin, achieving a 21.3% improvement in MSE on multivariate benchmarks. It is also robust to various choices of backbone encoders, as well as downstream regressors. Code is available at this https URL.
14.
张浩彬
(2022-12-31 23:07):
#paper doi:10.1145/3447548.3467401
A transformer-based framework for multivariate time series representation learning
1. The multiple attention heads of the Transformer can be mapped to the multiple periodicities of a time series.
2. In the generic framework, the raw input is first projected and positional information is added, giving the initial position-aware encoding.
3. Only the Transformer encoder is used for feature extraction, without the decoder, which makes the representation easier to adapt to various downstream tasks.
4. Because the Transformer is insensitive to ordering, positional encodings are also added to the input vectors.
5. Variable-length inputs are handled by padding with arbitrary values, and the attention scores at padded positions are set to a large negative value to force the model to ignore them (this mask is an initial value; could it later be updated to a non-negative value?).
6. The masking uses some practical tricks. Predicting the masked values effectively turns the task into an NLP-style fill-in-the-blank problem rather than a pure time series problem.
7. Pretraining: for a multivariate time series, a sub-span is masked independently and at random for each variable, and the loss is computed only on the masked segments (a small sketch of this masking scheme follows this list).
8. The final tasks are regression and classification, but the regression is not forecasting future time steps; it is more like predicting a building's energy consumption for the day from its air pressure, humidity and wind-speed measurements, trained with MSE. Classification uses cross-entropy.
9. The downstream head appears to be just a simple fully connected layer.
10. The baselines are ROCKET, LSTM and XGBoost, which makes the comparison somewhat underwhelming.
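A small sketch of the masked-reconstruction pretraining from item 7: mask a random span independently per variable and compute the MSE only on masked positions. The geometric span lengths and masking ratios of the paper are not reproduced; the span length here is fixed, and the encoder is a stand-in MLP rather than the Transformer.

```python
# Per-variable span masking + reconstruction loss restricted to masked positions.
import torch
import torch.nn as nn

B, T, V, span = 8, 100, 6, 12
x = torch.randn(B, T, V)

mask = torch.zeros(B, T, V, dtype=torch.bool)
for b in range(B):
    for v in range(V):
        start = torch.randint(0, T - span, (1,)).item()
        mask[b, start:start + span, v] = True            # independent span per variable

model = nn.Sequential(nn.Linear(V, 64), nn.ReLU(), nn.Linear(64, V))  # stand-in encoder
x_masked = x.masked_fill(mask, 0.0)                      # masked inputs set to a placeholder value
recon = model(x_masked)
loss = ((recon - x)[mask] ** 2).mean()                   # MSE only over masked positions
print(loss.item())
```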
Abstract:
We present a novel framework for multivariate time series representation learning based on the transformer encoder architecture. The framework includes an unsupervised pre-training scheme, which can offer substantial performance benefits over fully supervised learning on downstream tasks, both with but even without leveraging additional unlabeled data, i.e., by reusing the existing data samples. Evaluating our framework on several public multivariate time series datasets from various domains and with diverse characteristics, we demonstrate that it performs significantly better than the best currently available methods for regression and classification, even for datasets which consist of only a few hundred training samples. Given the pronounced interest in unsupervised learning for nearly all domains in the sciences and in industry, these findings represent an important landmark, presenting the first unsupervised method shown to push the limits of state-of-the-art performance for multivariate time series regression and classification.
15.
张浩彬
(2022-11-10 00:03):
#paper Momentum Contrast for Unsupervised Visual Representation Learning
doi:10.1109/cvpr42600.2020.00975
The famous MoCo. I had only skimmed it before; today I finally gave it a careful read. Influenced by NLP, computer vision has also entered a new round of competition over self-supervised methods.
Self-supervised learning is essentially unsupervised learning; the name mainly distinguishes it from earlier work. MoCo largely closes the gap between supervised and unsupervised pretraining, obtaining better results with unsupervised pretraining than with supervised pretraining for the first time. Manual labeling is expensive, so being able to pretrain a model without labels greatly relieves the performance bottleneck imposed by limited annotated data.
As for the technique itself, contrastive learning is now one of the mainstream self-supervised approaches (the other being generative). The keys in contrastive learning are: (1) the pretext task and (2) the loss function. The standout contribution of this paper, though, is the momentum update.
1. MoCo's pretext task is simple instance discrimination: for a sample x_i, two different augmentations produce the anchor and the positive, while the other samples serve as negatives for that anchor. The anchor is called the query (q), and the positive and negative samples are called keys (k).
2. The loss is InfoNCE, which is essentially softmax-like; since there are as many implicit classes as there are negatives, InfoNCE is used, with a temperature hyperparameter.
3. Next come the paper's two contributions, or rather the two constraints the authors identify in contrastive learning: (1) dictionary size and (2) dictionary consistency (the dictionary can be roughly understood as the negative set; in contrastive learning, a larger negative set is generally better. MoCo uses one positive, although other work shows multiple positives can help). (1) Dictionary size: in end-to-end methods such as SimCLR, each batch is effectively the dictionary, which guarantees consistency but limits the dictionary size; backpropagating through such a large dictionary requires a great deal of GPU memory, and optimization with very large batch sizes is also harder. (2) Dictionary consistency: the alternative to the end-to-end approach is a memory bank. We keep a large dictionary and, for each gradient update, sample k negatives from it, but then only those k entries get refreshed; over an epoch, the entries refreshed first and last are produced by quite different encoders, so the features in the dictionary become inconsistent.
4. To address this, the authors first use a queue as the dictionary: whenever new features are enqueued, the oldest features are dequeued.
5. Second comes the momentum update: the query encoder is still updated by gradient descent, but the key encoder is no longer updated by gradients; instead it is updated with momentum (at initialization the key encoder is a copy of the query encoder), with a very large momentum coefficient such as 0.999, so that each update changes it only slightly, preserving dictionary consistency (a minimal sketch of the queue and the momentum update follows this summary).
6. For downstream tasks, the backbone is frozen and only a linear classification protocol (a fully connected layer plus softmax) is used as the classifier, compared against other models pretrained on ImageNet; except for a few tasks, MoCo achieves SOTA results.
7. Another interesting detail: grid search finds that the best learning rate for MoCo's linear classifier is 30, which the authors take as evidence that self-supervised features really are quite different from supervised ones. In later comparisons, since grid search is inconvenient, the authors apply a normalization step instead.
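A minimal sketch of the two mechanics from points 4 and 5, the queue dictionary and the momentum-updated key encoder, using tiny linear stand-in encoders. The hyperparameter values and everything else below are illustrative, not the released implementation.

```python
# MoCo core mechanics: queue as dictionary + momentum update of the key encoder.
import torch
import torch.nn.functional as F

dim, K, m, tau = 32, 4096, 0.999, 0.07
encoder_q = torch.nn.Linear(128, dim)
encoder_k = torch.nn.Linear(128, dim)
encoder_k.load_state_dict(encoder_q.state_dict())          # key encoder starts as a copy
for p in encoder_k.parameters():
    p.requires_grad = False

queue = F.normalize(torch.randn(K, dim), dim=1)             # dictionary of negative keys

def moco_step(x_q, x_k):
    global queue
    q = F.normalize(encoder_q(x_q), dim=1)                  # query from augmentation 1
    with torch.no_grad():
        # momentum update: theta_k <- m * theta_k + (1 - m) * theta_q
        for pk, pq in zip(encoder_k.parameters(), encoder_q.parameters()):
            pk.mul_(m).add_(pq.detach(), alpha=1 - m)
        k = F.normalize(encoder_k(x_k), dim=1)              # positive key from augmentation 2
    l_pos = (q * k).sum(dim=1, keepdim=True)                # one positive logit per query
    l_neg = q @ queue.t()                                    # K negative logits from the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    loss = F.cross_entropy(logits, torch.zeros(q.size(0), dtype=torch.long))
    queue = torch.cat([k, queue])[:K]                        # enqueue newest keys, dequeue oldest
    return loss

x = torch.randn(16, 128)
print(moco_step(x + 0.1 * torch.randn_like(x), x + 0.1 * torch.randn_like(x)).item())
```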
Abstract:
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
16.
张浩彬
(2022-10-20 16:20):
#paper Unsupervised Scalable Representation Learning for Multivariate Time Series, https://doi.org/10.48550/arXiv.1901.10738
Key points of the paper: positive/negative sample construction, the triplet loss, and dilated causal convolutions.
Applicability: the unsupervised model works with variable-length sequences, both short and long.
Code: https://github.com/White-Link/UnsupervisedScalableRepresentationLearningTimeSeries
Positive/negative sample construction:
Given N series, for one series a sub-series ref of random length is constructed; within it, a sub-series is sampled as the positive pos, and K sub-series randomly sampled from other series (when available) serve as negatives neg, where K is a hyperparameter.
The encoder has three requirements: (1) it can extract sequence features; (2) it accepts variable-length inputs; (3) it saves time and memory (personally, I suspect these are mostly reasons found to justify using convolutions); hence exponentially dilated causal convolutions are used as the feature extractor instead of traditional RNNs/LSTMs.
A modified triplet loss (a toy sketch of the sampling and the loss follows this summary).
On time series classification the results show it outperforms existing unsupervised methods and is no worse than supervised ones; for forecasting tasks, not many comparisons are made.
For univariate classification, all time series classification tasks in the UCR archive are used.
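A toy sketch of the time-based sampling and a word2vec-style triplet loss in the spirit described above: a reference window, a positive sub-window drawn inside it, and negatives drawn from other series. The encoder is a stand-in, not the paper's dilated causal convolution stack, and the exact loss form and window-length rules are paraphrased assumptions.

```python
# Reference/positive/negative window sampling + log-sigmoid triplet-style loss.
import torch
import torch.nn.functional as F

def sample_windows(batch, k_neg=4):
    """batch: [N, T] univariate series. Returns ref, pos, and k_neg negative windows."""
    N, T = batch.shape
    ref_len = torch.randint(T // 4, T // 2, (1,)).item()
    start = torch.randint(0, T - ref_len, (1,)).item()
    ref = batch[0, start:start + ref_len]
    pos_len = torch.randint(ref_len // 4, ref_len // 2, (1,)).item()
    p0 = start + torch.randint(0, ref_len - pos_len, (1,)).item()
    pos = batch[0, p0:p0 + pos_len]                               # positive: inside the reference
    negs = [batch[torch.randint(1, N, (1,)).item()][:pos_len] for _ in range(k_neg)]
    return ref, pos, negs

# Stand-in encoder mapping a variable-length window to a fixed-size embedding.
encoder = torch.nn.Sequential(torch.nn.AdaptiveAvgPool1d(32), torch.nn.Flatten(0), torch.nn.Linear(32, 16))
embed = lambda w: encoder(w.view(1, 1, -1))

series = torch.randn(8, 200)
ref, pos, negs = sample_windows(series)
loss = -F.logsigmoid(embed(ref) @ embed(pos))                     # pull the positive closer
for n in negs:
    loss = loss - F.logsigmoid(-embed(ref) @ embed(n))            # push negatives away
print(loss.item())
```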
arXiv,
2019.
DOI: 10.48550/arXiv.1901.10738
Abstract:
Time series constitute a challenging data type for machine learning algorithms, due to their highly variable lengths and sparse labeling in practice. In this paper, we tackle this challenge by proposing an unsupervised method to learn universal embeddings of time series. Unlike previous works, it is scalable with respect to their length and we demonstrate the quality, transferability and practicability of the learned representations with thorough experiments and comparisons. To this end, we combine an encoder based on causal dilated convolutions with a novel triplet loss employing time-based negative sampling, obtaining general-purpose representations for variable length and multivariate time series.
17.
张浩彬
(2022-09-21 11:01):
#paper https://doi.org/10.48550/arXiv.2106.00750
Unsupervised Representation Learning for Time Series with Temporal Neighborhood Coding
An ICLR 2021 paper on contrastive learning for time series.
Code: https://github.com/sanatonek/TNC_representation_learning
The sampling idea: signals within a temporal neighborhood are similar, while signals outside it should be distinguished.
Positive samples: signals in the neighborhood are assumed to follow a Gaussian distribution centered at t*, with spread determined by the window size and neighborhood length; windows inside the neighborhood are positives. The neighborhood size is determined with an ADF test.
Negative samples: anything outside the neighborhood, although the authors refine this point in the loss function.
Loss function: the authors argue that samples outside the neighborhood should not all be treated as negatives, because time series are periodic; they should instead be treated as unlabeled samples (a mix of positives and negatives). Drawing on ideas from PU learning, a weight is applied to these "negatives" in the loss (a toy sketch of this weighting follows this summary).
Data: three datasets in total: a simulated dataset (4 classes, generated by an HMM), a clinical atrial fibrillation dataset (MIT-BIH; classes alternate over time, are highly imbalanced, and a few individuals have very long recordings), and a human activity dataset (UCI-HAR).
Downstream tasks: clustering and classification. Since the main goal is to compare representation learning as directly as possible, all models use the same simple encoder architecture for a given task; because the datasets differ in character, different tasks use different encoders.
Clustering uses plain k-means and classification uses plain kNN; TNC achieves the best results in both.
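A toy sketch of a TNC-style objective as described above: a discriminator scores whether two window representations come from the same temporal neighborhood, and non-neighboring windows are treated as unlabeled by giving them a PU-style weight w instead of a hard negative label. The encoder, discriminator, w, and the stand-in "neighbor"/"distant" windows are all illustrative assumptions.

```python
# Neighborhood discrimination with PU-style weighting of non-neighboring windows.
import torch
import torch.nn.functional as F

enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(40, 16))
disc = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))

def tnc_loss(anchor, neighbor, distant, w=0.2):
    z_a, z_n, z_d = enc(anchor), enc(neighbor), enc(distant)
    d_pos = disc(torch.cat([z_a, z_n], dim=-1)).squeeze(-1)    # should say "same neighborhood"
    d_neg = disc(torch.cat([z_a, z_d], dim=-1)).squeeze(-1)    # unlabeled, not a hard negative
    loss_pos = F.binary_cross_entropy_with_logits(d_pos, torch.ones_like(d_pos))
    loss_neg = (1 - w) * F.binary_cross_entropy_with_logits(d_neg, torch.zeros_like(d_neg)) \
               + w * F.binary_cross_entropy_with_logits(d_neg, torch.ones_like(d_neg))
    return loss_pos + loss_neg

x = torch.randn(16, 1, 40)                          # anchor windows
neighbor = x + 0.05 * torch.randn_like(x)           # stand-in for windows from the same neighborhood
distant = torch.randn(16, 1, 40)                    # stand-in for windows far away in time
print(tnc_loss(x, neighbor, distant).item())
```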
arXiv,
2021.
DOI: 10.48550/arXiv.2106.00750
Abstract:
Time series are often complex and rich in information but sparsely labeled and therefore challenging to model. In this paper, we propose a self-supervised framework for learning generalizable representations for non-stationary time series. Our approach, called Temporal Neighborhood Coding (TNC), takes advantage of the local smoothness of a signal's generative process to define neighborhoods in time with stationary properties. Using a debiased contrastive objective, our framework learns time series representations by ensuring that in the encoding space, the distribution of signals from within a neighborhood is distinguishable from the distribution of non-neighboring signals. Our motivation stems from the medical field, where the ability to model the dynamic nature of time series data is especially valuable for identifying, tracking, and predicting the underlying patients' latent states in settings where labeling data is practically impossible. We compare our method to recently developed unsupervised representation learning approaches and demonstrate superior performance on clustering and classification tasks for multiple datasets.
18.
张浩彬
(2022-08-24 09:56):
#paper doi: 10.1007/s11222-022-10130-1 Merlo, L., Maruotti, A., Petrella, L., & Punzo, A. (2022). Quantile hidden semi-Markov models for multivariate time series. Statistics and Computing, 32(4). https://doi.org/10.1007/s11222-022-10130-1
Model keywords:
Problem addressed: multivariate time series, quantile regression.
Techniques: a hidden semi-Markov model (addressing the bias that arises when sojourn times do not follow a geometric distribution; more distribution families can be chosen, making the model more flexible) and the Multivariate Asymmetric Laplace distribution (addressing the extension of ordinary quantile regression to higher dimensions).
Estimation: maximum likelihood via an EM algorithm.
Empirical study: forecasting air quality in Italy, especially estimation of extreme quantiles.
Abstract:
This paper develops a quantile hidden semi-Markov regression to jointly estimate multiple quantiles for the analysis of multivariate time series. The approach is based upon the Multivariate Asymmetric Laplace (MAL) distribution, which allows to model the quantiles of all univariate conditional distributions of a multivariate response simultaneously, incorporating the correlation structure among the outcomes. Unobserved serial heterogeneity across observations is modeled by introducing regime-dependent parameters that evolve according to a latent finite-state semi-Markov chain. Exploiting the hierarchical representation of the MAL, inference is carried out using an efficient Expectation-Maximization algorithm based on closed form updates for all model parameters, without parametric assumptions about the states' sojourn distributions. The validity of the proposed methodology is analyzed both by a simulation study and through the empirical analysis of air pollutant concentrations in a small Italian city.
19.
张浩彬
(2022-08-23 15:36):
#paper doi: 10.1080/10618600.2021.1909601 Moon, S. J., Jeon, J.-J., Lee, J. S. H., & Kim, Y. (2021). Learning Multiple Quantiles With Neural Networks. Journal of Computational and Graphical Statistics, 30(4), 1238–1248.
The paper proposes a neural network model for estimating multiple conditional quantiles that satisfies the non-crossing property. Traditional quantile regression can suffer from quantile crossing, e.g., the estimated 85th percentile exceeding the estimated 90th percentile. There are generally two remedies: (1) adjust the parameters of the fitted model, or (2) restrict the model space to non-crossing quantiles. This paper takes the second route: borrowing the strategy of linear non-crossing quantile regression (a strategy used in non-crossing SVR, whose drawback is potentially heavy computation), it proposes a non-crossing quantile neural network with inequality constraints, imposed on the network's hidden layer.
Beyond solving the crossing problem, the second contribution is computational efficiency: to allow first-order optimization, the paper develops a new fitting algorithm that gives a nearly optimal solution without the projected-gradient step that would require polynomial computation time. A toy sketch of the general non-crossing idea (not the paper's algorithm) is given below.
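The sketch below estimates several quantiles with the pinball (check) loss and enforces non-crossing by predicting the lowest quantile plus non-negative increments, so higher quantiles can never fall below lower ones. This monotone parameterization is a common workaround, not the inequality-constrained algorithm developed in the paper; all sizes and data are illustrative.

```python
# Multi-quantile network with pinball loss and a monotone (non-crossing) output head.
import torch
import torch.nn as nn

taus = torch.tensor([0.1, 0.5, 0.85, 0.9])

class NonCrossingQuantileNet(nn.Module):
    def __init__(self, d_in, n_q):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, n_q))

    def forward(self, x):
        raw = self.body(x)
        base = raw[:, :1]
        increments = nn.functional.softplus(raw[:, 1:])        # non-negative gaps between quantiles
        return torch.cat([base, base + increments.cumsum(dim=1)], dim=1)

def pinball_loss(pred, y, taus):
    diff = y.unsqueeze(1) - pred                               # [batch, n_quantiles]
    return torch.maximum(taus * diff, (taus - 1) * diff).mean()

x, y = torch.randn(256, 3), torch.randn(256)
model = NonCrossingQuantileNet(3, len(taus))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = pinball_loss(model(x), y, taus)
    loss.backward(); opt.step()
q = model(x)
print(bool((q[:, 1:] >= q[:, :-1]).all()))                     # quantiles never cross by construction
```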
Journal of Computational and Graphical Statistics,
2021.
DOI: 10.1080/10618600.2021.1909601
Abstract:
We present a neural network model for estimation of multiple conditional quantiles that satisfies the noncrossing property. Motivated by linear noncrossing quantile regression, we propose a noncrossing quantile neural network model with inequality constraints. In particular, to use the first-order optimization method, we develop a new algorithm for fitting the proposed model. This algorithm gives a nearly optimal solution without the projected gradient step that requires polynomial computation time. We compare the performance of our proposed model with that of existing neural network models on simulated and real precipitation data. Supplementary materials for this article are available online.
20.
张浩彬
(2022-08-11 16:10):
#paper 10.48550/arXiv.1901.10738
Unsupervised Scalable Representation Learning for Multivariate Time Series
Key points of the paper: positive/negative sample construction, the triplet loss, and dilated causal convolutions.
Applicability: the unsupervised model works with variable-length sequences, both short and long.
1. Positive/negative sample construction: for a given series, a sub-series of random length is constructed; within it, a sub-series is sampled as the positive, and sub-series sampled at random from other series serve as negatives.
2. A modified triplet loss.
3. Exponentially dilated causal convolutions as the feature extractor, instead of traditional RNNs/LSTMs.
The results show it outperforms existing unsupervised methods and is no worse than supervised methods.
arXiv,
2019.
DOI: 10.48550/arXiv.1901.10738
Abstract:
Time series constitute a challenging data type for machine learning algorithms, due to their highly variable lengths and sparse labeling in practice. In this paper, we tackle this challenge by proposing an unsupervised method to learn universal embeddings of time series. Unlike previous works, it is scalable with respect to their length and we demonstrate the quality, transferability and practicability of the learned representations with thorough experiments and comparisons. To this end, we combine an encoder based on causal dilated convolutions with a novel triplet loss employing time-based negative sampling, obtaining general-purpose representations for variable length and multivariate time series.