Genomic Classification Using an Information-Based Similarity Index: Application to the SARS Coronavirus


JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 12, Number 8, 2005
Mary Ann Liebert, Inc.
Pp. 1103-1116

ALBERT C.-C. YANG, ARY L. GOLDBERGER, and C.-K. PENG

Cardiovascular Division and Margret and H.A. Rey Institute for Nonlinear Dynamics in Medicine, Beth Israel Deaconess Medical Center/Harvard Medical School, Boston, Massachusetts 02215.

ABSTRACT

Measures of genetic distance based on alignment methods are confined to studying sequences that are conserved and identifiable in all organisms under study. A number of alignment-free techniques based on either statistical linguistics or information theory have been developed to overcome the limitations of alignment methods. We present a novel alignment-free approach to measuring the similarity among genetic sequences that incorporates elements from both word rank order-frequency statistics and information theory. We first validate this method on the human influenza A viral genomes as well as on the human mitochondrial DNA database. We then apply the method to study the origin of the SARS coronavirus. We find that the majority of the SARS genome is most closely related to group 1 coronaviruses, with smaller regions of matches to sequences from groups 2 and 3. The information-based similarity index provides a new tool to measure the similarity between datasets based on their information content and may have a wide range of applications in the large-scale analysis of genomic databases.

Key words: Shannon entropy, SARS coronavirus.

INTRODUCTION

Genetic distance measures are indicators of similarity among species or populations and are useful for reconstructing phylogenetic relationships (Graur and Li, 1999). Measures of genetic distance are mainly derived from examining each pair of sequences aligned nucleotide-by-nucleotide and estimating the number of substitutions. Since the mechanism of genome evolution relies not only on point mutations but also on recombination or horizontal gene transfer from other species, the heterogeneity of gene segments will substantially degrade the accuracy of optimal sequence alignment methods, which are based on the estimation of nucleotide substitution. Therefore, alignment methods are confined to studying sequences that are conserved and identifiable in all organisms under study (Vinga and Almeida, 2003).

An alternative approach is to develop alignment-free sequence comparison methods. Current alignment-free sequence comparison methods can be classified into two categories (Vinga and Almeida, 2003): information theory-based (Li et al., 2001) and word statistics-based measures (Campbell et al., 1999; Qi et al., 2004; Chaudhuri and Das, 2002; Hao et al., 2003; Karlin and Burge, 1995; Stuart et al., 2002).

We have developed a new index adapted from linguistic analysis and information theory to measure the similarity between symbolic sequences (Yang et al., 2003a, 2003b). Our approach is based on the concept that the information content in any symbolic sequence is primarily determined by the repetitive usage of its basic elements. The novelty of this information-based similarity index is that it incorporates elements of both information-based and word statistics-based categories, since the rank order difference of each n-tuple (word statistics) is weighted by its information content using Shannon entropy (information theory) (Shannon, 1948). Furthermore, the composition of these basic elements captures both global information related to usage of repetitive elements in genetic sequences and local sequence order determined by the n-tuple nucleotides. Hence, our method provides a complementary approach to overcoming limitations of alignment methods and is capable of exploring genetic sequences with heterogeneous origins. The resulting measurement has been validated with respect to generic information-carrying symbolic sequences (Yang et al., 2003a, 2003b). Here we show the specific application of this method to genomic sequences.

METHODS

We have recently developed and validated a generic information-based similarity index to quantify the similarity between symbolic sequences. This method, which has been used for analysis of complex physiologic signals (Yang et al., 2003a) and literary texts (Yang et al., 2003b), can be readily adapted to genetic sequences by examining usages of n-tuple nucleotides (“words”). We first determine the frequencies for each n-tuple by applying a sliding window (moving one nucleotide/step) across the entire genome, and then rank each n-tuple according to its frequency in descending order. To compare the similarity between genetic sequences, we plot the rank number of each n-tuple in the first sequence against that of the second sequence. Figure 1 shows the comparison of 4-tuple nucleotide frequencies between the complete mitochondrial genome of two human lineages an
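
The counting and ranking step described in the Methods (a sliding window moving one nucleotide per step, then ranking n-tuples by descending frequency) might be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names `ntuple_ranks` and `rank_pairs` are hypothetical, the alphabetical tie-breaking rule is an assumption, and restricting the comparison to n-tuples present in both sequences is also an assumption, since the text here does not say how ties or absent n-tuples are handled.

```python
from collections import Counter


def ntuple_ranks(sequence, n=4):
    """Rank each n-tuple by descending frequency (rank 1 = most frequent)."""
    seq = sequence.upper()
    # Sliding window across the sequence, moving one nucleotide per step.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Assumption: ties in frequency are broken alphabetically.
    ordered = sorted(counts, key=lambda word: (-counts[word], word))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def rank_pairs(seq_a, seq_b, n=4):
    """Pair the ranks of n-tuples found in both sequences (rank-rank plot data)."""
    ranks_a = ntuple_ranks(seq_a, n)
    ranks_b = ntuple_ranks(seq_b, n)
    # Assumption: only n-tuples that occur in both sequences are compared.
    shared = sorted(set(ranks_a) & set(ranks_b))
    return [(word, ranks_a[word], ranks_b[word]) for word in shared]
```

Each tuple returned by `rank_pairs` gives one point of the rank-versus-rank plot: the n-tuple, its rank in the first sequence, and its rank in the second.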
