Foreign-language literature translation: A New Measurement to Quantity DNA Sequence and its Application


A New Measurement to Quantity DNA Sequence and its Application

Xie Xiaoli, Song Shide, Yuan Zhifa
College of Science, Northwest A&F University
Yangling, Shaanxi, 712100, China

Song John
Department of Animal & Avian Sciences, University of Maryland
College Park, MD 20742, USA
songj88@umd.edu

Supported by the 08 Special Talent Fund of Northwest A&F University, No. Z111020834.
978-1-4244-4713-8/10/$25.00 ©2010 IEEE

Abstract

Based on the distance distribution of nucleotides in a DNA sequence, a new distance measurement, the average distance of the four nucleotides, is proposed. The new index describes the regularity with which the four nucleotides are distributed in a given DNA sequence. Using the new measurement, 17 complete mammalian mitochondrial genomes were analyzed, followed by cluster analysis and principal component analysis. The results indicate that the new measurement is an efficient index for analyzing DNA sequences and organic evolution.

Keywords: distance distribution; DNA sequence analysis; cluster analysis; principal component analysis

I. INTRODUCTION

With the completion of many genome projects, a large number of genomic sequences are available in public databases. Analyzing this large volume of biological sequence data is one of the biggest challenges. Chemically, DNA consists of two long polymers of simple units called nucleotides; the two strands run in opposite directions to each other and are therefore anti-parallel. Each strand is built from four nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T). The nucleotide arrangement, namely the primary structure of the DNA sequence, can be regarded as a sequence over a set of four characters.

There are two types of methods for analyzing DNA sequences: sequence alignment [1] and non-alignment methods. Sequence alignment matches sequences by inserting or deleting nucleotides and adding empty characters. However, because DNA has a great molecular weight, generally tens of thousands to several million nucleotide pairs, and different species have sequences of different lengths, sequence alignment is extremely difficult to carry out. Non-alignment methods include complexity, correlation analysis and graphical representation of DNA sequences [2-5]. Many studies of complexity [6-8] are based on the concept of entropy. Organization complexity and information redundancy use only the probabilities of individual or adjacent nucleotides in the DNA sequence and do not consider the position of each nucleotide in the sequence. In addition, many researchers have studied graphical representations of DNA sequences in 2D, 3D and 4D and carried out similarity analysis by using mathematical invariants of the graph as numerical features of the DNA sequence. How to select a good invariant of the graph, however, remains an unsolved problem.

In this study, we assume that different living organisms have characteristic DNA sequence structures. We then propose a new index of DNA sequences, on which we base a cluster analysis and a principal component analysis.

II. METHODS AND MATERIALS

A. Distance distribution of nucleotides in a DNA sequence

Given a DNA sequence, we consider the distance distribution of the four nucleotides. Suppose nucleotide $i$ ($i = A, T, C, G$) has the distance distribution shown in Table 1. In the table, $l_j$ ($j = 1, 2, \ldots, k$; we suppose $l_1 < l_2 < \cdots < l_k$) denotes the distance between two adjacent occurrences of nucleotide $i$, that is, the number of other nucleotides between them, and $p_j$ is the probability that this distance equals $l_j$. Each nucleotide therefore has its own distance distribution.

TABLE 1 Distance distribution of a nucleotide

L    $l_1$   $l_2$   ...   $l_k$
P    $p_1$   $p_2$   ...   $p_k$

To make the distance distribution concrete, consider the DNA sequence GATCACAGGTCTATC (the first 15 nucleotides of the complete human mitochondrial genome). The corresponding distance distributions of the four nucleotides are given in Tables 2-5.

TABLE 2 Distance distribution of nucleotide A

$l_A$   1     2     5
P       1/3   1/3   1/3

TABLE 3 Distance distribution of nucleotide T

$l_T$   1     6
P       2/3   1/3

TABLE 4 Distance distribution of nucleotide C

$l_C$   1     3     4
P       1/3   1/3   1/3

TABLE 5 Distance distribution of nucleotide G

$l_G$   0     6
P       1/2   1/2

B. Numerical characteristics of DNA sequences and their properties

Based on the definition of the distance distribution, we obtain four discrete distributions for a given DNA sequence, and the correspondence between each nucleotide and its distribution is one-to-one. To extract information from the DNA sequence, we define a new measurement, the average distance of each nucleotide:

$$\bar{l}_i = \sum_{j=1}^{k} l_j p_j, \qquad i = A, T, C, G \qquad (1)$$

The index has direct biological significance: it shows the distribution regularity of the four nucleotides in a given DNA sequence. The measurement has the following properties.

(1) $l_1 \le \bar{l}_i \le l_k$. Equality holds if and only if the value of $L$ is $l_1$ or $l_k$.

(2) When the distance distribution is uniform, the average distance is the mean of the distance values, $\bar{l}_i = \frac{1}{k} \sum_{j=1}^{k} l_j$.

III. RESULTS AND ANALYSIS

A. Structure distance analysis

We chose 17 complete mitochondrial genomes and computed the average distance of the four nucleotides in each of the 17 species, so that each DNA sequence corresponds to a 4D vector (Table 6). Table 6 shows that in all 17 species the average distance of nucleotide G is larger than that of the other nucleotides. For the other three nucleotides, the number of other nucleotides between two adjacent A's is smaller than that between two adjacent T's or G's. To compare the three kinds of mammals, we calculated the average distance of the four nucleotides for each kind (Fig. 1). The average distances of nucleotides T and A in primates are larger than those in ferungulates and rodents, whereas the average distance of nucleotide C in primates is smaller than that in ferungulates and rodents.

TABLE 6 Accession numbers and average distances of the complete mitochondrial genomes of 17 species

Group          Species        Accession number   $\bar{l}_A$   $\bar{l}_T$   $\bar{l}_C$   $\bar{l}_G$
Primates       Human          V00662             2.2379        3.0508        2.2031        6.6334
Primates       Common chimp   D38116             2.1982        2.9635        2.2647        6.9002
Primates       Pygmy chimp    D38113             2.2194        2.9782        2.2546        6.7863
Primates       Gorilla        D38114             2.2404        2.9719        2.2655        6.5941
Primates       Orangutan      D38115             2.2810        3.2085        2.0863        6.5848
Primates       Gibbon         X99256             2.2733        3.1806        2.1536        6.3289
Primates       Baboon         Y18001             2.1850        3.0248        2.2779        6.6442
Ferungulates   Horse          X79547             2.1123        2.8683        2.5104        6.4732
Ferungulates   White rhino    Y00726             2.0000        2.8923        2.5839        6.7732
Ferungulates   Harbor seal    X63726             2.0366        2.9587        2.6511        6.0262
Ferungulates   Gray seal      X72204             2.0401        2.9555        2.6478        6.0263
Ferungulates   Cat            U20753             2.0740        2.6977        2.8254        6.0909
Ferungulates   Fin whale      X61145             2.0654        2.7478        2.6682        6.5428
Ferungulates   Blue whale     X72204             2.0582        2.7683        2.6280        6.6945
Ferungulates   Cow            V00654             1.9969        2.6823        2.8601        6.4528
Rodents        Rat            X14848             1.9445        2.6691        2.8073        7.0716
Rodents        Mouse          V00711             1.8966        2.4842        3.1026        7.1163

[Figure 1. Average distances of three kinds of mammals. Horizontal axis: different nucleotides (A, T, C, G); vertical axis: average distance; series: Primate, Ferungulates, Rodent.]

B. Cluster analysis

Using the average distances corresponding to each DNA sequence, we can carry out cluster analysis for many DNA sequences. First, we normalized the average distances of the nucleotides. Let $\bar{l}_{im}$ and $\bar{l}'_{im}$ ($i = A, T, C, G$) denote the average distance and the normalized average distance in the $m$th species:

$$\bar{l}'_{im} = \frac{\bar{l}_{im} - \text{mean of } \bar{l}_{im}}{\text{standard deviation of } \bar{l}_{im}}$$

The transformation removes the effect of the different variances of the average distances. Each species is regarded as a point in 4D, which gives 17 points corresponding to the 17 species. We calculated the Euclidean distance between each pair of species to obtain the distance matrix of the 17 species, and then carried out cluster analysis with the UPGMA algorithm. The result is shown in Fig. 2.

[Figure 2. Cluster tree of 17 species based on normalized average distance.]

Fig. 2 shows that the result agrees with the actual evolutionary regularity and that the relationship between primates and ferungulates is closer than that between primates and rodents. This conclusion agrees with the results of Cao et al. [11] and Reyes et al. [12], which are based on sequence alignment, and with that of Li et al. [10], which uses the LZ complexity index.

C. Principal Component Analysis (PCA)

We carried out principal component analysis on the average distances of the four nucleotides in the 17 species. Each principal component and its cumulative rate of contribution are listed in Table 7. The cumulative rate of contribution of the first two components reaches 98.68%, so the first two principal components are selected:

$$F_1 = -0.2898\,\bar{l}_A - 0.4972\,\bar{l}_T + 0.7156\,\bar{l}_C + 0.3959\,\bar{l}_G \qquad (2)$$

$$F_2 = 0.1163\,\bar{l}_A + 0.0918\,\bar{l}_T - 0.3916\,\bar{l}_C + 0.9081\,\bar{l}_G \qquad (3)$$

TABLE 7 Principal components and their cumulative rates of contribution

Component   Total    Rate of variance (%)   Cumulative rate (%)
1           0.1384   57.14                  57.14
2           0.1006   41.54                  98.68
3           0.0029   1.20                   99.88
4           0.0003   0.12                   100

Through the path analysis and decision-making analysis of PCA [13], we found that for the first principal component $F_1$ the decision abilities of $\bar{l}_A$, $\bar{l}_T$, $\bar{l}_C$ and $\bar{l}_G$ are 0.1593, 0.4289, 0.7041 and 0.1949 respectively, which shows that $F_1$ mainly stores the variation of $\bar{l}_T$ and $\bar{l}_C$. For the second principal component $F_2$, the decision abilities of $\bar{l}_A$, $\bar{l}_T$, $\bar{l}_C$ and $\bar{l}_G$ are 0.0259, 0.0157, 0.2850 and 1.5320 respectively, so $F_2$ mainly contains the variation of $\bar{l}_T$, $\bar{l}_C$ and $\bar{l}_G$. Taking $F_1$ and $F_2$ together, most of the variation comes from $\bar{l}_T$, $\bar{l}_C$ and $\bar{l}_G$. In the evolutionary tree constructed with this method, relationships are therefore mainly determined by these three average distances, and the evenly distributed nucleotide A does not play a leading role.

We computed the values of the two principal components for each species and plotted Fig. 3 with the first principal component on the x axis and the second principal component on the y axis. Fig. 3 shows that the main difference between primates and rodents is that the second principal component of primates is larger than that of rodents, whereas the $F_1$ value of rodents is the largest among the three kinds of mammals. That is to say, the variation of primates is mainly decided by the average distance of nucleotide G, while the variation of rodents is mainly determined by the average distances of nucleotides T and G. The variation of ferungulates, however, is mainly determined by the average distances of nucleotides T, C and G.

[Figure 3. Numerical marker of 17 mammals. x axis: first principal component; y axis: second principal component; groups: Rodents, Ferungulates, Primates.]

IV. CONCLUSIONS AND DISCUSSION

This paper puts forward a new measurement to quantify a DNA sequence and uses it to describe the characteristics of the sequence. The index overcomes a shortcoming of previous studies, which considered only the proportions of the nucleotides in the sequence and not their positions. Furthermore, the index is easy to calculate: to obtain the distribution regularity of the four nucleotides in a DNA sequence, we only need the average number of other nucleotides between two adjacent identical nucleotides. Based on the average distances of the four nucleotides in each DNA sequence, the paper analyzed the complete mitochondrial genomes of 17 species and obtained their nucleotide distributions and the average distances of the four nucleotides. Based on these average distances, we carried out cluster analysis and principal component analysis for the 17 species. The results show that the average distance of nucleotides is very effective in giving a quantitative description of the evolutionary relationships between sequences. The results of the principal component analysis further describe the reasons for the differences between species. The new quantitative index for DNA sequences provides a new and feasible tool for exploring sequence information and for understanding and analyzing genomes.

ACKNOWLEDGMENT

The authors would like to thank the members of the applied mathematics lab for helpful discussion. The work was supported by Dr. Yu Ying, Luo Jianzhong, Liang Liping and Dr. Zheng Lifei.

REFERENCES

[1] M. S. Waterman, "Introduction to Computational Biology: Maps, Sequences and Genomes," London: CRC Press, 1995.
[2] M. Randic, M. Vracko, N. Lers, "Novel 2-D graphical representation of DNA sequences and their numerical characterization," Chemical Physics Letters, 2003, 368, pp. 1-6.
[3] M. Randic, "On 3-D graphical representation of DNA primary sequences and their numerical characterization," J. Chem. Inf. Comput. Sci., 2000, 40, pp. 1235-1244.
[4] R. Chi, K. Ding, "Novel 4D numerical representation of DNA sequences," Chemical Physics Letters, 2005, 407, pp. 63-67.
[5] F. Bai, "Novel numerical characterization and analysis of similarity of DNA sequence" (in Chinese), Mathematics in Practice and Theory, 2007, 37(18), pp. 95-99.
[6] L. Luo, "Physics View of Evolution" (in Chinese), Shanghai: Science Technology Press, 2000.
[7] L. Luo, F. Ji, "Reconstruction of evolutionary tree based on information about short-range correlation of nucleotides in gene sequence," Acta Scientiarum Naturalium Universitatis NeiMongol, 2001, 32(1), pp. 32-41.
[8] H. Li, K. Ma, "The organization complexity of nucleic acid sequence and its relation with evolution" (in Chinese), Acta Scientiarum Naturalium Universitatis NeiMongol, 1991, 22(3), pp. 364-370.
[9] A. Lempel, J. Ziv, "On the complexity of finite sequences," IEEE Transactions on Information Theory, 1976, IT-22(1), pp. 75-81.
[10] B. Li, Y. Li, H. He, "LZ complexity distance of symbol sequences and its application" (in Chinese), Journal of Chinese Computer System, 2007, 28(5), pp. 850-854.
[11] Y. Cao, A. Janke, P. J. Waddell, "Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders," Journal of Molecular Evolution, 1998, 47, pp. 307-322.
[12] A. Reyes, C. Gissi, G. Pesole, "Where do rodents fit? Evidence from the complete mitochondrial genome of Sciurus vulgaris," Molecular Biology and Evolution, 2000, 17, pp. 979-983.
[13] L. Liu, L. Wang, M. Guo, "The decision-making analysis of the biggest variance of phenotype traits with principal component" (in Chinese), Jour. of Northwest Sci-Tech Univ. of Agri. and For., 2005, 33(10), pp. 97-99.
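
The distance distribution of Section II.A and the average distance of Section II.B can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `distance_distribution` and `average_distance` are hypothetical, and exact fractions are used only so that the output can be compared directly with Tables 2-5 for the example sequence GATCACAGGTCTATC.

```python
from collections import Counter
from fractions import Fraction


def distance_distribution(seq, base):
    """Return {distance: probability} for one nucleotide.

    The distance is the number of other nucleotides between two
    consecutive occurrences of `base` in `seq` (as in Section II.A).
    """
    positions = [i for i, ch in enumerate(seq) if ch == base]
    # Gap between neighbours at positions a < b holds b - a - 1 other nucleotides.
    gaps = [b - a - 1 for a, b in zip(positions, positions[1:])]
    counts = Counter(gaps)
    total = len(gaps)
    # Empty dict when the base occurs fewer than twice (no gap exists).
    return {d: Fraction(c, total) for d, c in sorted(counts.items())}


def average_distance(seq, base):
    """Average distance: sum over j of l_j * p_j (the paper's Eq. 1)."""
    return sum(d * p for d, p in distance_distribution(seq, base).items())


# Hypothetical usage on the example sequence from the text.
example = "GATCACAGGTCTATC"
for base in "ATCG":
    dist = distance_distribution(example, base)
    print(base, dict(dist), float(average_distance(example, base)))
```

For the example sequence the sketch reproduces the distributions of Tables 2-5 and gives an average distance of 8/3 for A, T and C and of 3 for G. Applying `average_distance` to each of the four nucleotides of a genome yields the 4D vector that the cluster analysis and PCA operate on.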
