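On September 26, 2014, Professor Dongbo Bu (卜东波) of the Bioinformatics Research Group at the Institute of Computing Technology, Chinese Academy of Sciences, visited the center and gave a talk titled "ARCS: Assemble short-reads using combinatorial optimization in scaffolding". Professor Bu's research covers algorithm design and analysis, including the SAT problem, information retrieval, and bioinformatics.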
Abstract: Compared with the traditional Sanger genome sequencing technique, next-generation massively parallel sequencing techniques provide higher throughput at much lower cost; however, they usually generate shorter reads, e.g. ~25-75 bp with the Illumina sequencing technique. How to assemble short reads into a complete genome remains a challenge for next-generation sequencing.
Currently, one of the most popular genome assembly approaches is to construct a de Bruijn graph first and then extract the whole genome sequence from it. Existing assembly approaches usually employ heuristics or greedy strategies to untangle the de Bruijn graph, which is tangled by repetitive regions in the genome. In this study, we propose a novel assembly strategy that constructs a globally optimal genome framework. In particular, the unique regions of the underlying genome are detected first via a copy number estimation procedure and then optimally positioned according to the distance information among them. In this way, long inexact repeats are cut into shorter ones at the few regions where they differ, which makes the contigs extendable when pairwise distance information is applied. The whole genome is finally constructed via gap-filling, followed by a scoring procedure.
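To make the de Bruijn graph step concrete, here is a minimal Python sketch of how such a graph can be built from short reads. It is an illustration only, not the ARCS implementation: the function name `build_de_bruijn`, the toy reads, and the choice k = 4 are assumptions made for this example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph from reads (illustrative sketch, not ARCS code).

    Nodes are (k-1)-mers; each distinct k-mer in the reads contributes
    one edge from its (k-1)-prefix to its (k-1)-suffix.
    """
    # Collect every distinct k-mer that occurs in any read.
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    # One edge per distinct k-mer: prefix node -> suffix node.
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])

    # Sort successors so the result is deterministic.
    return {node: sorted(successors) for node, successors in graph.items()}


# Two overlapping toy reads (hypothetical data) drawn from ACGTACGG.
reads = ["ACGTAC", "GTACGG"]
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", graph[node])
```

In this toy input the 3-mer ACG occurs twice in the underlying sequence, so its node has two outgoing edges (to CGG and CGT) and the path TAC -> ACG leads back into it. This kind of branching is what repeats cause, and what the heuristics mentioned above try to untangle.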
Experiments show that on the E. coli genome, ARCS generates scaffolds with an N50 of 133 kbp, while the state-of-the-art assembly tool SOAP-denovo reports scaffolds with an N50 of 95 kbp. On the D9 and D12 genomes, ARCS also performs better than SOAP-denovo (N50 = 0.5k and 12k for ARCS versus 0.4k and 1.6k for SOAP-denovo, respectively). ARCS thus provides a novel strategy for partly solving the inexact repeat challenge in genome assembly.
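For readers unfamiliar with the N50 metric quoted above, it is the length L such that scaffolds of length at least L together cover at least half of the total assembled length. A minimal Python sketch follows; the function name `n50` and the toy scaffold lengths are assumptions for illustration and are not data from the talk.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold (or contig) lengths.

    Walk the lengths from longest to shortest and return the first
    length at which the running sum reaches half of the total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Hypothetical scaffold lengths: total is 300, so half is 150.
example = [100, 80, 60, 40, 20]
print(n50(example))  # 100 + 80 = 180 >= 150, so N50 is 80
```

A higher N50 means that more of the assembly sits in long scaffolds, which is why it serves as the comparison figure between ARCS and SOAP-denovo.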