外文资料--Searching Single Nucleotide Polymorphism Markers

资源描述

《外文资料--Searching Single Nucleotide Polymorphism Markers》由会员分享，可在线阅读，更多相关《外文资料--Searching Single Nucleotide Polymorphism Markers（4页珍藏版）》请在金锄头文库上搜索。

1、Searching Single Nucleotide Polymorphism Markers to Complex Diseases using Genetic Algorithm Framework and a BoostMode Support Vector Machine Khantharat Anekboon, Suphakant Phimoltares, and Chidchanok Lursinsap AVIC, Department of Mathematics, Chulalongkorn University, Bangkok, Thailand Khantharat.A

2、Student.chula.ac.th, suphakant.pchula.ac.th, and lchidchachula.ac.th Sissades Tongsima Genome Institute, National Center for Genetic Engineering and Biotechnology, Pathumtani, Thailand sissadesbiotec.or.th Suthat Fucharoen Thalassemia Research Center, Institute of Molecular Biosciences, Mahidol Univ

3、ersity, Salaya Campus, Nakhonpathom, Thailand grsfcmahidol.ac.th Abstract With the advent of large-scale high density single nucleotide polymorphism (SNP) arrays, case-control association studies have been performed to identify predisposing genetic factors that influence many common complex diseases

4、. These genotyping platforms provide very dense SNP coverage per one chip. Much research has been focusing on multivariate genetic model to identify genes that can predict the disease status. However, increasing the number of SNPs generates large number of combined genetic outcomes to be tested. Thi

5、s work presents a new mathematical algorithm for SNP analysis called IFGA that uses a “BoostMode” support vector machine (SVM) to select the best set of SNP markers that can predict a state of complex diseases. The proposed algorithm has been applied to test for the association study in two diseases

6、, namely Crohns and severity spectrum of 0/Hb E Thalassemia diseases. The results revealed that our predicted SNPs can respectively best classify both diseases at 71.57% and 71.06% accuracy using 10-fold cross validation comparing with the optimum random forest (ORF) and classification and regressio

7、n trees (CART) techniques. Keywords- Single Nucleotide Polymorphism; Support Vector Machine; Genetic Algorithm I. INTRODUCTION Scientists have long been interested in identifying genetic factors that influence the occurrence of complex diseases. With the advent of parallel genotyping technology, cos

8、t and time in finding SNPs are not out of reach. Large case-control cohorts generated from very dense SNP arrays (DNA chip contains dense array of SNPs) challenging researchers to search for SNPs that are associated with the diseases. In contrast to the single gene disorders, the state of complex di

9、seases could be triggered from multiple genes when exposing to certain environmental factors 1, 2. However, searching for multiple marker interactions from a large pool of SNPs imposes high computational and memory complexity. A technique of selecting subset of relevant features, named Feature Selec

10、tion 3, has been widely used in almost fields, including bioinformatics. This technique provides more effective way to improve learning accuracy to understand the importance of the features by removing irreverent or redundant ones. II. THE PROPOSED IFGA METHOD In this section, we introduce a new enc

11、oding method called IFGA. Fig. 1 demonstrates the summary of the IFGA method. The first population is constructed by our proposed integer encoding approach. The data in the chromosome (in Genetic Algorithm (GA) context) are represented by a set of selected features. After the population is generated

12、, each chromosome is evaluated by a fitness score. This score is obtained by using the BoostMode-SVM approach. Then, the IFGA re-generates the next population by IFGA selection, IFGA cross-over, and IFGA mutation until a termination criterion is satisfied. A. The Integer Encoding Method Utilizing GA

13、 to perform feature selection can be done by converting input data using binary encoding 4. The length of 978-1-4244-4713-8/10/$25.00 2010 IEEE Figure 1. The overall IFGA flow char. a chromosome equals a number of all features. The size of encoded chromosome corresponds directly to the number of inp

14、ut features. This, however, presents a problem due to two reasons. First, the running time highly depends on the length of chromosome. Second, a general binary encoding does not fix a number of selected features. It fixes only the length of the chromosome. The IFGA integer encoding method is propose

15、d to solve these problems. Assume that a case-control data used in this study have m number of genotypes. Let Qi be the ith chromosome processed in the algorithm. The length of Qi, denoted by |Qi|, is set to a constant less than or equal to m. Then, random |Qi| numbers, represent the locations to se

16、lect the corresponding genotypes from a given feature sequence. During the IFGA, the length of each chromosome is not necessarily identical. For example, suppose m = 7, the chromosome size (|Qi|) is set to 3, and the randomly selected locations are 1, 5, and 6. So, the chromosome Qi = 1, 5, 6. B. IF

17、GA Selection Each individual chromosome is selected based on its fitness score into a mating pool by a stochastic universal sampling method (SUS) 5. The IFGA also uses an elitism technique 6, in which the next generation chromosome derives from the best chromosome in a current generation. C. IFGA Cr

18、oss-Over The cross-over function of traditional GA randomly selects the recombination point and swaps the two chromosomes flanking this point. Cross-over from the original GA, however, cannot be applied to the IFGA approach because all chromosomes must have the same size and features from the same l

19、oci cannot be on the same chromosome. We must devise an IFGA cross-over technique to overcome this problem. Assume that, parent1 and parent2 are the parental chromosomes where each locus is the position of selected feature. Either number of parent1s or parent2s locus must be more than 1. Number of b

20、oth parents loci (parent1 and parent2) must be greater than or equal to one. Outputs from this algorithm are offspring1s and offspring2. 1: x 2: y 3: tmp1 parent1 4: for i = 0 to |parent1| do 5: v |tmp1| 6: sel random(1, 2, ., v) 7: x x sel 8: tmp1 tmp1 parent1sel?9: end for 10: tmp2 parent2 11: for

22、e value of a specified locus. It hardly occurs when comparing with the cross-over process. IFGA mutation is presented here. Let m denote the length of a given genotype sequence, input_chrom is a chromosome that will be mutated, and output_chrom is a mutated chromosome. Each element in a chromosome i

23、s a selected feature. 1: pos_out random1, |input_chrom| 2: pos_in random1, m 3: for i = 1 to |input_chrom| do 4: if i = pos_out then 5: output_chromi pos_in 6: else 7: output_chromi input_chromi 8: end if 9: end for E. Generating a Population There are two kinds of population, the initial population

24、 and the next generation population. To generate the initial population with P chromosomes, where P is a user-defined number of chromosomes in the population, the algorithm repeatedly generates the chromosomes by integer encoding method and adds them into the set of population until the number of th

25、e chromosomes in the population is equal to P. On the other hand, the population in the next generation consists of the chromosome b, the best fitness score from the current generation, e groups of features from evolution, cross-over and mutation, and r groups of the features from the new re-selecte

26、d features. After adding b and e to the next generation, those chromosomes are checked for redundancy. Each chromosome must be identical in the next generation. Duplicated chromosomes will be removed. If the number of chromosomes in the next generation is less than the number of chromosomes in the c

27、urrent generation then a new subsets of features, r, will be randomly created and added to the next generation. F. Termination This IFGA algorithm consists of a set of recursive steps for generating the population, evaluation by a BoostMode-SVM, IFGA selection, IFGA cross-over, and IFGA mutation. Th

28、ese steps are executed until the number of the best results remains constant in the next 300 iterations. III. THE PROPOSED BOOSTMODE-SVM METHOD The goal of SVM 7 is to find a maximal separating hyperplane: either for (1) linearly separable case or (2) the nonlinearly separable case. Noted that, wT i

29、s a transpose vector of weight, xi is an input vector, is a mapping function, and b is a bias value. yi = sign (w xi + b) (1) yi = sign (w (xi) + b) (2) These equations face the same problem occurred when the input data are imbalanced. The learned separating hyperplane from imbalanced data set may s

30、hift too much in the direction towards the smaller group compared with the true separating hyperplane 8. To solve this problem, the decision hyperplane should be adjusted. It can be seen from (1) and (2) that the parameter w effects the classification output. So, modifying w will adjust the decision

31、 hyperplane, which may improve the classifier. A. BoostMode-SVM A new technique of oversampling for nominal feature is proposed to improve the performance of the SVM. The BoostMode-SVM (Fig. 1) generates two SVMs, namely SVM1 and SVM2. The SVM1 is constructed for generating the score of the training

32、 data set whereas the SVM2 is the final SVM model for classification the test set. First, only the training set is used to construct the SVM1 and to find the BoostMode. This BoostMode is the indicator vector of the minority data set. It is brought to test with the SVM1. Two scoring methods, an Unbia

33、sed Scoring (US) and a Bias Scoring (BS), are proposed to find the scoring value. The US method is performed when the SVM1 correctly classifies the BoostMode, otherwise the BS method is performed. After that, a Scoring Over-Sampling approach (SOS) is processed for adding artificial data to minority

34、group by sampling the data of the minority group until a number of data of both groups are equal. The minority group in this paper means the group of data having fewer elements. The new SVM2 is constructed for the classification by the previous training data set and new set of data from the SOS tech

35、nique. Finally, the test set is run in the SVM2 for the evaluation. The error rate for the test set is the fitness score value using in the IFGA section above. B. Finding the BoostMode To balance the size of data from both classes, some additional data in the minority group must be generated. The se

36、lected generating method (either US or BS) will depend upon a BoostMode vector. The following procedure describes how to compute the BoostMode vector. Let nminor be the number of data in the minority group. Boostrap sampling with replacement is applied on the minority group to generate t data sets,

37、i.e. BoostGroup1, ., BoostGroupt. Each BoostGroupi contains nminor data. 1: for i = 1 to t do. 2: allmodei mode(BoostGroupi ) 3: end for 4: BoostMode mode(allmodei ) i C. The Unbiased Scoring Method This technique is processed when the SVM1 classifies the BoostMode correctly. All data points have eq

38、ual chances (equal scoring values) to be selected for the over-sampling technique. The following algorithm describes the process of finding the scoring value by the US technique. The scoreVal is an output from this algorithm. 1: for i = 1 to nminor do 2: scoreVali = 1/nminor 3: end for D. The Bias S

39、coring Method The BS technique is run when the SVM1 incorrectly classifies by the BoostMode. The scoring value is calculated from the distance of its point to the decision hyperplane by (3) for linear separability or (4) for nonlinear separability. distancei = w xi + b (3) distancei = w (xi) + b (4)

40、 The data point that is correctly classified has lesser chance (less scoring value) to be selected for the oversampling process than the one that is wrongly classified. Therefore, increasing in number of incorrect classifications would influence the higher chance of samples to be chosen and vice ver

41、sa. The scoring value for the BS method is described by the following algorithm. Let distance be a set of distances of all data points in the minority group. The output from this algorithm is a set of scoreVal. 1: sumSV1 0 2: minVal min(distancei ) i 3: addVal = absolute(minVal ) + 1 4: for i = 1 to

42、 nminor do 5: tmpi = distancei + addVal 6: sumSV1 = sumSV1 + tmpi 7: end for 8: if the minority group is the control group then 9: for i = 1 to nminor do 10: tmpi = 2 tmpi 11: end for 12: end if 13: for i = 1 to nminor do 14: sumSV2 = 0 15: for j = 1 to i do 16: sumSV2 = sumSV2 + tmpj 17: end for 18

43、: scoreVali sumSV2/sumSV1 19: end for E. The Scored Over-Sampling Method The objective of the SOS algorithm is to select data from the minority group depending on the scoreVal, computed by either US algorithm or BS algorithm. Let MDi denote data ith, for 1 ith nminor. The number of data in minority

44、group and majority groups are nminor and nmajor, respectively. An output of this algorithm is a set of additional data added to the minority group, samp_data. 1: z = nmajor nminor 2: for i = 1 to |scoreVal | do 3: sumSV1 = sumSV1 + scoreVali 4: end for 5: for i = 1 to |scoreVal | do 6: sumSV2 = 0 7:

45、 for j = 1 to i do 8: sumSV2 = sumSV2 + scoreValj 9: end for 10: mapScorei = sumSV2/sumSV1 11: end for 12: for i = 1 to z do 13: selectPos = rand(1) 14: if selectPos 0 and selectPos mapScore1 then 15: samp_datai = MD1 16: else 17: for j = 2 to |scoreVal| do 18: if selectPos mapScorej1 and selectPos

46、mapScorej then 19: samp_datai = MDj 20: end if 21: end for 22: end if 23: end for IV. EXPERIMENTS AND RESULTS Table II shows the comparison of the IFGA-BoostMode-SVM, ORF 9, and CART 10 by 10-fold cross validation of Thalassemias and Crohns diseases. Our IFGA-BoostMode-SVM performs better classifica

47、tion than the standard ORF and CART methods. Note that, no feat., acc., sen., and spec. in Table II are the number of features, accuracy, sensitivity, and specificity, respectively. Thalassemia data set (503 patients with 835 SNPs) were obtained from the Thalassemia Research Center, Mahidol Universi

48、ty and the Crohn data set (357 patients with 103 SNPs) are obtained from 11. Missing data were inferred by 2SNP phasing method 12. For SVM, a soft margin RBF kernel function with = 0.5 was deployed to analyze both Crohns and Thalasemias dataset. Dummy encoding is applied for SVM as vectors 1 0 0, 0

49、1 0, and 0 0 1 where a genotype is major homozygote, minor homozygote, and heterozygote, respectively. In IFGA, each chromosome size is varied from 1 to 10. Therefore, feature selection from 1 feature to 10 features is processed. Parameters in the IFGA were set as follows: the number of chromosomes

50、is 1000, the cross-over rate is 0.7 for Thalassemias and 0.8 for Crohns diseases, and the mutation rate is 0.035 for Thalassemias and 0.001 for Crohns diseases. TABLE I. THE EXPERIMENTAL RESULTS Dataset + Algorithm no feat. acc. (%) sen. (%) spec. (%) Thal.+ IFGA-BoostMode-SVM 6 71.57 76.39 64.14 Th

51、al.+ ORF 6 54.27 69.84 30.30 Thal.+ CART 6 69.38 76.07 59.09 Crohn+ IFGA-BoostMode-SVM 8 71.06 64.58 74.90 Crohn+ ORF 8 57.88 20.14 80.25 Crohn+ CART 8 63.31 23.61 86.83 V. CONCLUSION A new IFGA with BoostMode-SVM was proposed to identify the susceptible loci from the case-control association studie

52、s. The IFGA technique encodes chromosomes as different integer sizes. The SOS technique samples the minority data set by two scoring approaches (US and BS) are proposed. This method can very well be applied in the case-control association studies. The experimental results from two real data sets: Cr

53、ohns and Thalassemias diseases show that feature selection and classification by the IFGA with BoostMode-SVM outperforms the standard ORF, and CART techniques. REFERENCES 1 J. Marchini, P. Donnelly, and L. R. Cardon, “Genome-wide strategies for detecting multiple loci that influence complex diseases

54、,” Nature Genetics, vol. 37, pp. 413417, March 2005. 2 D. J. Weatherall, “Science, medicine, and the future: Single gene disorders or complex traits: Lessons from the thalassaemias and other monogenic diseases,” BMJ, vol. 321, pp. 11171120, November 2000. 3 Y. Saeys, I. Inza, and P. Larranaga, “A re

55、view of feature selection techniques in bioinformatics,” Bioinformatics, vol. 23, pp. 25072517, October 2007. 4 X.-P. Zeng, Y.-M. Li, and J. Qin, “A dynamic chain-like agent genetic algorithm for global numerical optimization and feature selection,” Neurocomputing, vol. 72, pp. 12141228, January 200

56、9. 5 J. E. Baker, “Reducing bias and inefficiency in the selection algorithm,” in Proc. the Second International Conference on Genetic Algorithm and their Application, Hillsdale, NJ, USA, 1987, pp. 1421. 6 A. K. Bhatia and S. K. Basu, “Implicit elitism in genetic search,” in Proc. ICONIP 13th Int. C

57、onf., Hong Kong, China, 2006, pp. 781 788. 7 C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273297, September 1995. 8 R. Akbani, S. Kwek, and N. Japkowicz, “Applying support vector machines to imbalanced datasets,” in Proc. ECML the 15th European Conf. on Machine

58、Learning, Pisa, Italy, 2004, pp. 3950. 9 W. Mao and S. Kelly, “An optimum random forest model for prediction of genetic susceptibility to complex diseases,” in Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining, Nanjing, Chaina, 2007, pp. 193204. 10 A. Gerger et al., “A mulitgenic

59、approach to predict breast cancer risk,” Epidemiology, vol. 104, pp. 159164, August 2007. 11 M. J. Daly, J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander, “High resolution haplotype structure in the human genome,” Nature Genetics, vol. 29, pp. 229232, October 2001. 12 D. Brinza and A. Zelikovsky, “2snp: Scalable phasing method for trios and unrelated individuals,” Journal of IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 5, pp. 313318, April 2008.

展开阅读全文

外文资料--Searching Single Nucleotide Polymorphism Markers

最新文档