Foreign-Literature Translation: An Experimental Evaluation of Corrected Inversion and DCJ Distance Metric through Simulations


An Experimental Evaluation of Corrected Inversion and DCJ Distance Metric through Simulations

Jian Shi
Dept. of Computer Science & Engineering
University of South Carolina
Columbia, SC 29208, USA

Jijun Tang
Dept. of Computer Science & Engineering
University of South Carolina
Columbia, SC 29208, USA

Abstract

The available data on the complete ordering of genes on each chromosome of organisms is increasing thanks to research in gene hunting and annotation. Researchers such as evolutionary biologists and computer scientists are greatly interested in gene order data because it makes it possible to resolve the ancient branches needed to fill in the tree of life.

The inversion-based distance metric and the so-called DCJ (double-cut-and-join) distance metric have been studied intensively over the years. However, both metrics consider only the smallest possible distance between two genomes, known as the edit distance. The edit distance underestimates the true evolutionary distance and thus makes the phylogenies less accurate. Correction techniques for inversion and DCJ have been proposed to address this problem. The first has been widely used in phylogenetic reconstructions, while the second has not seen significant use to date.

We design and conduct a series of experiments to estimate the quality of phylogenies inferred using inversion- and DCJ-based distance metrics in various aspects, with and without correction. Our main findings are threefold: first, correction techniques yield much more accurate phylogenies; second, the inversion and DCJ measures return similar results; last but not least, corrected DCJ (CDCJ) provides the most accurate distance metric for tree reconstruction.

I. INTRODUCTION

The complete ordering and strandedness of genes on each chromosome of organisms is now available due to modern laboratory techniques. As more genomic information becomes available, analysis methods for gene order data are increasingly needed. The most substantial step in analyzing such data is to estimate the amount of evolutionary change between two genomes, that is, the evolutionary distance.

Thanks to phylogenists, comparative genomicists and computational biologists, there is a variety of methods for reconstructing phylogenies, such as direct optimization, MCMC and distance-based methods. Among these approaches, distance-based methods do not require computation as intensive as the other two, yet still provide reasonably accurate results. In this paper, we focus on distance-based methods because the number of genomes we test exceeds the size that the other two methods can handle.

The inversion-based distance metric has been well studied for the past 20 years. DCJ [2] is a different model of genome rearrangement, under which various genomic rearrangement events (inversion, transposition, translocation, block interchange, and chromosomal fusion and fission) can all be represented by a single multichromosomal operation. Inversion- and DCJ-based distance methods produce very similar phylogenies, as shown by Kothari and Moret [3].

A problem with these commonly used methods is that they are bounded and reflect only the final state of an evolutionary process; thus they typically underestimate the true distance, especially for genomes that involve a large number of evolutionary events. Wang et al. proposed a correction method for the inversion distance metric and greatly improved the accuracy of the phylogenies inferred from it [7]. Lin and Moret proposed a novel approach to estimate the true distance under the DCJ model at the mathematical level [4].

In this paper, we focus on estimating the quality of the phylogenies inferred using these distance metrics and on comparing them.

II. BACKGROUND

A. Inversion distances

The inversion distance between two genomes is defined as the minimum number of inversion events needed to transform one into the other. Hannenhalli and Pevzner invented a polynomial-time algorithm to compute this distance [5], and Bader et al. later improved it into an optimal linear-time algorithm [6].

Wang et al. developed a statistical technique called EDE (Empirically Derived Estimator) for correcting inversion distances [7]. The formula for the corrected inversion distance is:

$$f^{-1}(d) = \max\left\{ d,\ \frac{-(b - cd) + \sqrt{(b - cd)^2 + 4bd(1 - d)}}{2(1 - d)} \right\},$$

where a, b and c are constants based on their experimental results. Given the (minimum) inversion distance d, we can estimate the true inversion distance by computing $f^{-1}(d)$.

B. DCJ distances

The DCJ model was proposed by Yancopoulos et al. [2] and then refined by Bergeron and his colleagues [1].
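Returning briefly to the EDE correction of Section II-A, the closed form can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that d is the minimum inversion distance divided by the number of genes (the factor 1 - d only makes sense for d at most 1), and the values b = 0.5956 and c = 0.4577 are not given in the text above; they are assumptions taken from the EDE literature. The function names are hypothetical.

```python
import math

# Assumed constants (not stated in the text above). Only b and c appear in
# the closed form, so no value for a is needed here.
B = 0.5956
C = 0.4577


def ede_correct(d, b=B, c=C):
    """Corrected distance for a normalized minimum inversion distance d in [0, 1]."""
    if not 0.0 <= d <= 1.0:
        raise ValueError("d must be normalized to the interval [0, 1]")
    if math.isclose(d, 1.0):
        # The closed form is the positive root of
        # (1 - d) x^2 + (b - c d) x - b d = 0, which degenerates to a
        # linear equation at d = 1, so solve that case directly.
        estimate = b * d / (b - c * d)
    else:
        root = math.sqrt((b - c * d) ** 2 + 4.0 * b * d * (1.0 - d))
        estimate = (-(b - c * d) + root) / (2.0 * (1.0 - d))
    # The max{d, .} in the formula: never report less than the observed distance.
    return max(d, estimate)


def corrected_inversion_events(min_inversions, n_genes):
    """Hypothetical helper: normalize, correct, and scale back to event counts."""
    return n_genes * ede_correct(min_inversions / n_genes)
```

With these assumed constants the correction leaves small distances unchanged (the max picks d) and only inflates distances as d approaches saturation, which matches the motivation given above: the edit distance underestimates most when many events have occurred.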

As we know, a gene is a stranded sequence of DNA that starts with a tail and ends with a head. An adjacency of two consecutive genes a and b, depending on their respective orientations, can be of four different types: {a_h, b_t}, {a_h, b_h}, {a_t, b_t}, {a_t, b_h}. A singleton set, such as {d_t} or {d_h}, is called a telomere.

U.S. Government work not protected by U.S. copyright

A DCJ operation makes a pair of cuts (which can be anywhere, even on different chromosomes) and then reconnects the cut ends.

The DCJ model can mimic inversion, fission, fusion, translocation and transposition through different combinations of one or more DCJ operations, and computations with DCJ are even simpler than computations with inversions alone.

The DCJ metric likewise considers only the initial and final states and thus underestimates the true evolutionary distance between two genomes. Lin and Moret [4] thoroughly considered the four possible cases of changes to adjacencies and telomeres, and derived a novel process to estimate the distance between two genomes G1 and G2 by computing every intermediate state, step by step, from G1 to G2 until they finally match or a predefined threshold is reached.

III. EXPERIMENTAL SETUP

A. Data preparation

We set out to test the accuracy of these distances in phylogenetic reconstruction using simulated data, where the true evolutionary histories are known.

In our experiments, we generate model tree topologies from the uniform distribution on binary trees, each with 20 leaves. On each tree, we evolve signed permutations of 100 genes using various evolutionary rates: letting r denote the expected number of evolutionary events along an edge of the true tree, we use values of r in the range of 2 to 32. The actual number of events along each edge is sampled from a uniform distribution on the set {r/2, ..., 3r/2}. For each combination of parameter settings, we run 100 datasets and average the results.

We use FastME to obtain phylogenies since it is fast and accurate with corrected inversion distances [7]. Other methods (GRAPPA and MGR) take a very long time for datasets with 20 genomes and large r values.

B. Comparison strategies

We measure the accuracy of a phylogeny using the Robinson-Foulds (RF) rate. Let T be the true tree and T' be the inferred tree. An edge e in T is "missing" in T' if T' does not contain an edge defining the same bipartition; such an edge is called a false negative (FN). The false negative rate is the number of false negative edges in T' with respect to T divided by the number of internal edges in T. The false positive (FP) rate is defined similarly, by swapping T and T'. The RF rate is the average of the FN and FP rates. Generally, an average RF rate lower than 5% is considered acceptable [8].

C. Confidence Assessment

To introduce a certain amount of perturbation into the original datasets and assess the quality of the inferred trees, we use a jackknife procedure.

Jackknifing is applied by removing some genes from each genome and obtaining a tree from the reduced genomes. This procedure is repeated many times and a consensus tree is constructed. From a set of experiments, we find that a jackknifing rate (percentage of genes removed) of 40% is a good turning point. Fig. 1 shows the RF rate for different jackknife rates at r = 8 and r = 28. One can observe that when more than 40% of the genes are deleted, the consensus trees become far from the true trees, indicating that too much disturbance is introduced. As a result, we use a rate of 40% in all our other experiments.

[Figure: two panels, (a) r = 8 and (b) r = 28, plotting RF rate (%) against Jackknife Rate (%) for inversion, inversion EDE, DCJ and corrected DCJ.]
Fig. 1. The RF rates for different jackknife rates.

When we compare FP and FN rates, we need to decide at what confidence values of the internal edges (perhaps the most valuable information obtained through the jackknife procedure) we should examine them.

The most important question is where to draw the threshold so that edges with confidence values higher than this threshold can be trusted, whereas edges with lower values can be discarded. When determining the best support threshold, one should realize that a high threshold can reduce FP branches, but it also increases the chance of discarding non-FP branches and thus increases the FN rate.

To find a point with a reasonably low FP rate (5% is generally acceptable) and as low an FN rate as possible (we want to keep as many branches uncontracted as we can), we conducted a set of experiments to find the best support threshold. Due to space limitations, we cannot show all the results, which ranged from 60% to 95%. Figs. 3 to 5 show three sets of FP and FN rates under thresholds of 65%, 75% and 85%, respectively. We observe that the FN rate dominates the RF rate; thus we pick 75% as the threshold for support values.

IV. RESULTS

In Fig. 2 we observe that the uncorrected methods, both inversion-based and DCJ-based, perform almost exactly the same. Corrected DCJ and EDE (corrected inversion) provide similar results when r [...]

Fig. 3. The FP and FN branch rates under different methods using a 65% threshold.
Fig. 4. The FP and FN branch rates under different methods using a 75% threshold.
Fig. 5. The FP and FN branch rates under different methods using an 85% threshold.
[Each figure has two panels, FP Branch Rate (%) and Average FN Branch Rate (%), plotted against the expected number of events (r) for DCJ, Corrected DCJ, inversion and EDE inversion.]
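The FN, FP and RF rates defined in Section III-B can be sketched as follows. This is an illustrative sketch only, not the code used in the experiments: it assumes each tree has already been reduced to the bipartitions induced by its internal edges, with every bipartition given as the set of leaves on one side of the edge. The function names and the convention of returning a rate of 0 for a tree with no internal edges are assumptions.

```python
def canonical(split, leaves):
    """Name a bipartition by the side that does not contain the reference leaf."""
    side = frozenset(split)
    reference = min(leaves)
    return frozenset(leaves) - side if reference in side else side


def rf_rates(true_splits, inferred_splits, leaves):
    """Return (FN rate, FP rate, RF rate) for two trees on the same leaf set.

    Each tree is an iterable of internal-edge bipartitions; either side of a
    bipartition may be supplied, since both are mapped to one canonical form.
    """
    leaves = frozenset(leaves)
    true_set = {canonical(s, leaves) for s in true_splits}
    inferred_set = {canonical(s, leaves) for s in inferred_splits}
    # FN: edges of the true tree missing from the inferred tree.
    fn = len(true_set - inferred_set) / len(true_set) if true_set else 0.0
    # FP: the same count with the roles of the two trees swapped.
    fp = len(inferred_set - true_set) / len(inferred_set) if inferred_set else 0.0
    return fn, fp, (fn + fp) / 2.0


# Toy example on five leaves: the trees share the edge AB|CDE and differ on the other.
true_tree = [{"A", "B"}, {"D", "E"}]
inferred_tree = [{"A", "B"}, {"C", "E"}]
print(rf_rates(true_tree, inferred_tree, "ABCDE"))  # (0.5, 0.5, 0.5)
```

Contracting edges whose jackknife support falls below the chosen threshold simply removes bipartitions from the inferred set, which is why a higher threshold can lower the FP rate while raising the FN rate.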
