响马读paper (Xiangma Reads Papers)

An academic exchange community whose members are required to read at least one paper a month and check in.

2024, bioRxiv. DOI: 10.1101/2024.02.18.580107
FECDO-Flexible and Efficient Coding for DNA Odyssey
Fajia Sun, Long Qian
Abstract:
DNA has been pursued as a compelling medium for digital data storage during the past decade. While large-scale data storage and random access have been achieved in artificial DNA, the synthesis cost keeps hindering DNA data storage from popularizing into daily life. In this study, we proposed a more efficient paradigm for digital data compressing to DNA, while excluding arbitrary sequence constraints. Both standalone neural networks and pre-trained language models were used to extract the intrinsic patterns of data, and generated probabilistic portrayal, which was then transformed into constraint-free nucleotide sequences with a hierarchical finite state machine. Utilizing these methods, a 12%-26% improvement of compression ratio was realized for various data, which directly translated to up to 26% reduction in DNA synthesis cost. Combined with the progress in DNA synthesis, our methods are expected to facilitate the realization of practical DNA data storage.
2024-03-13 05:35:00
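#paper doi:10.1101/2024.02.18.580107, 2024, FECDO-Flexible and Efficient Coding for DNA Odyssey. This paper proposes a new encoding method for DNA data storage, FECDO (short for Flexible and Efficient Coding for DNA Odyssey), which aims to cut DNA synthesis cost through efficient data compression and a flexible encoding strategy, and thereby make DNA data storage more practical. The method first uses deep learning (the authors tried both standalone neural networks with no prior knowledge and pre-trained language models) to extract features from the data, converting the data to be stored from a one-hot encoded tensor into a sequence of marginal probabilities; this is the compression step. The probability sequence is then mapped to a sequence over the four bases (A, C, G, T), and a hierarchical finite state machine is used to exclude patterns unsuitable for DNA storage (such as runs of identical bases or particular secondary structures). On measured text and image data, the method improves compression efficiency by 12%-26% compared with bzip2. That gain should carry over to a marked reduction in DNA synthesis cost, which is a key issue for DNA storage. The authors also took the encoded output for one of the texts, synthesized it as actual DNA (for storage), amplified the target fragment by PCR, sequenced it with Nanopore sequencing, and decoded it back to the original data, validating the method end to end. Because the paper is currently only a bioRxiv preprint (submitted version v2), it provides just the main text and main figures, with no supplementary materials, methods description, or source code, so many implementation and result details are not yet public. Personally I am rather skeptical about the method's error tolerance and real-world performance, and the compression results for non-English text and images shown in the figures do not look great either. These questions will have to wait for answers until the paper is formally published.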
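To make the "state machine that avoids runs of identical bases" idea concrete, here is a minimal sketch in Python. It is not the paper's hierarchical finite state machine, whose details are not public. It is a much simpler, well-known stand-in: a rotating code in which the state is the previous base and each base-3 digit selects one of the three other bases. All function names, the 6-trits-per-byte packing, and the virtual start base are my own assumptions for illustration.

```python
"""Toy rotating code: base-3 digits -> nucleotides with no repeated base.

Illustrative only; NOT the FECDO hierarchical finite state machine.
"""

BASES = "ACGT"
TRITS_PER_BYTE = 6  # assumption: 3**6 = 729 >= 256, so 6 trits hold one byte


def bytes_to_trits(data: bytes) -> list[int]:
    """Expand each byte into 6 base-3 digits, most significant first."""
    trits = []
    for value in data:
        digits = []
        for _ in range(TRITS_PER_BYTE):
            digits.append(value % 3)
            value //= 3
        trits.extend(reversed(digits))
    return trits


def trits_to_bytes(trits: list[int]) -> bytes:
    """Inverse of bytes_to_trits."""
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for t in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + t
        out.append(value)
    return bytes(out)


def encode(data: bytes, start: str = "A") -> str:
    """State = previous base; each trit picks one of the 3 other bases."""
    prev = start  # virtual previous base, not emitted
    seq = []
    for t in bytes_to_trits(data):
        choices = [b for b in BASES if b != prev]
        prev = choices[t]
        seq.append(prev)
    return "".join(seq)


def decode(seq: str, start: str = "A") -> bytes:
    """Replay the same state machine to recover the trits, then the bytes."""
    prev = start
    trits = []
    for base in seq:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits_to_bytes(trits)


if __name__ == "__main__":
    strand = encode(b"DNA storage")
    print(strand)
    print(decode(strand))
```

This toy code spends 6 nucleotides per byte, so it compresses nothing. The point is only that a constraint such as "no two identical adjacent bases" can be enforced by construction in the encoder's state transitions, which is presumably where a learned probability model and a richer state machine would pay off.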