Papers shared by user 🐼太真实.
3 paper shares found in total.
1.
🐼太真实 (2024-02-29 10:04):
#paper ProPainter: Improving Propagation and Transformer for Video Inpainting. This paper presents ProPainter, a new video inpainting technique that achieves efficient and accurate inpainting through dual-domain propagation and a mask-guided sparse video Transformer. The paper details ProPainter's three key components: recurrent flow completion, dual-domain propagation, and the mask-guided sparse video Transformer, along with the corresponding technical details and experimental results.
Shangchen Zhou, Chongyi Li, Kelvin C. K. Chan, Chen Change Loy
Abstract:
Flow-based propagation and spatiotemporal Transformer are two mainstream mechanisms in video inpainting (VI). Despite the effectiveness of these components, they still suffer from some limitations that affect their performance. Previous propagation-based approaches are performed separately either in the image or feature domain. Global image propagation isolated from learning may cause spatial misalignment due to inaccurate optical flow. Moreover, memory or computational constraints limit the temporal range of feature propagation and video Transformer, preventing exploration of correspondence information from distant frames. To address these issues, we propose an improved framework, called ProPainter, which involves enhanced ProPagation and an efficient Transformer. Specifically, we introduce dual-domain propagation that combines the advantages of image and feature warping, exploiting global correspondences reliably. We also propose a mask-guided sparse video Transformer, which achieves high efficiency by discarding unnecessary and redundant tokens. With these components, ProPainter outperforms prior arts by a large margin of 1.46 dB in PSNR while maintaining appealing efficiency.
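The efficiency of the mask-guided sparse Transformer comes from a simple idea: only tokens in or near the masked (missing) region need to attend, so the rest can be discarded before attention. The sketch below illustrates that selection step only; it is a toy illustration, not ProPainter's actual implementation, and the function name, dilation scheme, and window parameter are all assumptions for exposition.

```python
import numpy as np

def select_sparse_tokens(tokens, mask, window=1):
    """Keep only tokens whose neighborhood overlaps the inpainting mask.

    tokens: (H, W, C) feature map to be flattened into tokens.
    mask:   (H, W) binary map, 1 = missing region to inpaint.
    window: dilation radius; tokens bordering the hole are also kept,
            since they supply the context the hole is filled from.
    Returns the kept tokens, (K, C), and their flat indices.
    """
    H, W = mask.shape
    # Dilate the mask so tokens adjacent to the hole survive the filter.
    # (np.roll wraps at borders; adequate for interior holes in a sketch.)
    dilated = mask.copy()
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            shifted = np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
            dilated = np.maximum(dilated, shifted)
    keep = dilated.reshape(-1).astype(bool)
    flat = tokens.reshape(H * W, -1)
    return flat[keep], np.nonzero(keep)[0]
```

Because self-attention cost grows quadratically in token count, shrinking a 64-token grid to the 16 tokens around a small hole cuts the attention cost by roughly 16x in this toy setting.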
2.
🐼太真实 (2024-01-30 21:45):
#paper: doi:2110.11316 The paper "CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP" introduces CLOOB (Contrastive Leave One Out Boost), a new self-supervised learning method. It combines modern Hopfield networks with the InfoLOOB objective (Leave One Out Bound) to improve contrastive learning, and it outperforms the earlier CLIP method at zero-shot transfer learning across all considered architectures and datasets. At its core, CLOOB uses modern Hopfield networks to enrich the co-occurrence and covariance structure of the data. Compared with classical Hopfield networks, these have higher storage capacity and faster retrieval; by using them, CLOOB reinforces the co-occurrence and covariance structure of features in the input samples, effectively extracting and strengthening the important features in the data. In addition, CLOOB adopts the InfoLOOB objective to avoid the saturation problem of the InfoNCE objective. InfoLOOB is a contrastive objective that handles the relationship between matched and unmatched pairs in a way that reduces saturation of the objective and makes learning more efficient.
Andreas Fürst, Elisabeth Rumetshofer, Johannes Lehner, Viet Tran, Fei Tang, Hubert Ramsauer, David Kreil, Michael Kopp, Günter Klambauer, Angela Bitto-Nemling, Sepp Hochreiter
Abstract:
CLIP yielded impressive results on zero-shot transfer learning tasks and is considered as a foundation model like BERT or GPT3. CLIP vision models that have a rich representation are pre-trained using the InfoNCE objective and natural language supervision before they are fine-tuned on particular tasks. Though CLIP excels at zero-shot transfer learning, it suffers from an explaining away problem, that is, it focuses on one or few features, while neglecting other relevant features. This problem is caused by insufficiently extracting the covariance structure in the original multi-modal data. We suggest to use modern Hopfield networks to tackle the problem of explaining away. Their retrieved embeddings have an enriched covariance structure derived from co-occurrences of features in the stored embeddings. However, modern Hopfield networks increase the saturation effect of the InfoNCE objective which hampers learning. We propose to use the InfoLOOB objective to mitigate this saturation effect. We introduce the novel "Contrastive Leave One Out Boost" (CLOOB), which uses modern Hopfield networks for covariance enrichment together with the InfoLOOB objective. In experiments we compare CLOOB to CLIP after pre-training on the Conceptual Captions and the YFCC dataset with respect to their zero-shot transfer learning performance on other datasets. CLOOB consistently outperforms CLIP at zero-shot transfer learning across all considered architectures and datasets.
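The structural difference between the two objectives is small but consequential: InfoLOOB excludes the positive pair from the denominator, so the loss is not bounded below by zero and does not saturate once the positive dominates. The sketch below shows one direction (image to text) of both losses; it is a minimal NumPy illustration under that reading of the paper, not the authors' training code, and the temperature value is an arbitrary assumption.

```python
import numpy as np

def info_nce(sim, tau=0.1):
    """Standard InfoNCE: the positive pair sits inside the denominator,
    so the per-sample loss is >= 0 and flattens out (saturates) once the
    positive similarity dominates the negatives."""
    logits = sim / tau
    pos = np.diag(logits)
    lse = np.log(np.exp(logits).sum(axis=1))
    return float(np.mean(lse - pos))

def info_loob(sim, tau=0.1):
    """Leave-one-out bound (InfoLOOB): the positive is excluded from the
    denominator. The log-sum-exp runs over negatives only, so the loss
    keeps decreasing as the positive improves instead of saturating.

    sim: (N, N) similarity matrix, sim[i, j] = similarity of image i and
    text j; diagonal entries are the matching pairs."""
    logits = sim / tau
    pos = np.diag(logits)
    neg = logits.copy()
    np.fill_diagonal(neg, -np.inf)   # drop the positive from the sum
    lse = np.log(np.exp(neg).sum(axis=1))
    return float(np.mean(lse - pos))
```

With near-perfect similarities on the diagonal, `info_nce` sits just above zero while `info_loob` goes well below it, which is the non-saturating behavior the summary above describes.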
3.
🐼太真实 (2023-12-28 20:39):
#paper https://doi.org/10.48550/arXiv.2312.03701 , Self-conditioned Image Generation via Generating Representations. This paper introduces Representation-Conditioned image Generation (RCG), a new image generation framework. RCG does not rely on human annotations; instead, it conditions on a self-supervised representation distribution. A pre-trained encoder maps the image distribution to a representation distribution, a representation diffusion model (RDM) samples from it, and a pixel generator then produces images conditioned on the sampled representation. On ImageNet 256×256, RCG delivers a substantial performance gain, reaching an FID of 3.31 and an IS of 253.4. The method not only significantly advances class-unconditional image generation but is also competitive with leading class-conditional methods, bridging the long-standing performance gap between the two tasks.
Tianhong Li, Dina Katabi, Kaiming He
Abstract:
This paper presents $\textbf{R}$epresentation-$\textbf{C}$onditioned image $\textbf{G}$eneration (RCG), a simple yet effective image generation framework which sets a new benchmark in class-unconditional image generation. RCG does not condition on any human annotations. Instead, it conditions on a self-supervised representation distribution which is mapped from the image distribution using a pre-trained encoder. During generation, RCG samples from such representation distribution using a representation diffusion model (RDM), and employs a pixel generator to craft image pixels conditioned on the sampled representation. Such a design provides substantial guidance during the generative process, resulting in high-quality image generation. Tested on ImageNet 256$\times$256, RCG achieves a Frechet Inception Distance (FID) of 3.31 and an Inception Score (IS) of 253.4. These results not only significantly improve the state-of-the-art of class-unconditional image generation but also rival the current leading methods in class-conditional image generation, bridging the long-standing performance gap between these two tasks. Code is available at https://github.com/LTH14/rcg.
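The inference pipeline the summary describes is a two-stage chain: sample a representation from the RDM, then decode it to pixels. The sketch below shows only that control flow with toy stand-ins; the function names, dimensions, and the "denoiser" (a fixed shrinkage toward a mean, standing in for a learned diffusion network) are all illustrative assumptions, not the actual RCG code from the linked repository.

```python
import numpy as np

REP_DIM = 256            # representation dimensionality (illustrative)
IMG_SHAPE = (3, 64, 64)  # toy image shape, not ImageNet 256x256

rng = np.random.default_rng(0)

def rdm_sample(n, steps=50):
    """Stand-in for the representation diffusion model (RDM): start from
    Gaussian noise in representation space and iteratively denoise.
    The real RDM applies a learned network at each step; here a fixed
    shrinkage toward a mean vector merely demonstrates the loop."""
    mean = np.zeros(REP_DIM)
    z = rng.normal(size=(n, REP_DIM))
    for _ in range(steps):
        z = z + 0.1 * (mean - z)   # toy denoising step
    return z

def pixel_generator(reps):
    """Stand-in for the pixel generator: decodes each sampled
    representation into an image via a fixed random projection."""
    proj = rng.normal(size=(REP_DIM, int(np.prod(IMG_SHAPE))))
    flat = np.tanh(reps @ proj)
    return flat.reshape((-1,) + IMG_SHAPE)

def rcg_generate(n):
    """RCG-style inference: representations first, pixels second.
    Note there are no class labels anywhere -- the conditioning signal
    is the sampled self-supervised representation itself."""
    reps = rdm_sample(n)
    return pixel_generator(reps)
```

The point of the structure is that the pixel generator always receives a rich conditioning vector, even though no human annotation enters the pipeline, which is how RCG narrows the gap to class-conditional generation.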