外文翻译--On the Complexity and Completeness of Robust Biclustering Algorithm ROBA

资源描述

《外文翻译--On the Complexity and Completeness of Robust Biclustering Algorithm ROBA》由会员分享，可在线阅读，更多相关《外文翻译--On the Complexity and Completeness of Robust Biclustering Algorithm ROBA（4页珍藏版）》请在金锄头文库上搜索。

1、On the Complexity and Completeness of Robust Biclustering Algorithm (ROBA) Muhammad Ibrahim Nasimul Noman Hitoshi Iba Dept. of Computer Science and Dept. of Computer Science and Dept. of Electrical Engineering and Engineering Engineering Information Systems University of Dhaka, Bangladesh University

2、 of Dhaka, Bangladesh University of Tokyo, Japan nomanunivdhaka.edu ibaiba.t.u-tokyo.ac.jp Absract A biclustering algorithm named ROBA has been used in a number of recent works. We present a time and space efficient implementation of ROBA that reduces the time and space complexity by an order of L

3、where L is the number of distinct values present in the data. Our implementation runs almost 11 times faster than the existing implementation on Yeast gene expression dataset. We also improve ROBA and then use it to present an iterative algorithm that can find all perfect biclusters with constant va

4、lues, constant values on rows and constant values on columns. Though our algorithm may take exponential time in the worst case, we use some subtle observations to reduce computational time and space. Experimental result reveals that our algorithm runs in reasonable time on Yeast gene expression data

5、set and finds almost 10 times more biclusters than ROBA. Keywords biclustering, Robust Biclustering Algorithm, gene expression data. I. INTRODUCTION Biclustering techniques perform clustering process simultaneously along rows and columns of a matrix in order to find which rows are clustered under wh

6、ich columns. A bicluster of a gene expression matrix is a subset of rows (genes) that exhibits similar behavior under a subset of columns (conditions). Basically four types of biclusters have been reported in the literature 1. They are: biclusters with constant values, biclusters with constant value

7、s on rows or columns, biclusters with coherent values, biclusters with coherent evolution. So the problem at a glance is: given a two dimensional matrix, our target is to find out as many maximal biclusters satisfying some homogeneity criteria as possible. Biclustering problem has applications in DN

8、A microarray data analysis, collaborative filtering, market research, information retrieval, text mining, electoral trends, exchange analysis etc. 2. II. BACKGROUND STUDY A lot of work on biclustering has been reported so far. To review some of these works, the reader can go through 1. In this study

9、, we confine our discussion to a comparatively recent method, named Robust Biclustering Algorithm (ROBA) 2, 3. Improving ROBAs performance should be significant to all of those works 4 -10 that implement ROBA. A. Description of ROBA After dissolving missing value and noise problem using any suitable

10、 method, ROBA decomposes the N x M data matrix A into L number of N x M elementary matrices (E.M.s) and performs elementary row and column operations to find, as the authors claim, all perfect biclusters. Let an input data matrix A having L distinct values is given. A can be written as, A = 1A1 + 2A

11、2 + .+ LAL where k is the kth distinct value among L distinct values of A, Ak is the kth E.M. where Akij is 1 if Aij is k, otherwise 0. After performing this task, 2 proposed three almost similar algorithms to find perfect biclusters with constant values, perfect biclusters with constant values on r

12、ows and perfect biclusters with constant values on columns. After that, 2 used these algorithms to find biclusters with coherent values and biclusters with coherent evolution. III. PROPOSED EFFICIENT IMPLEMENTATION As far as ROBA is concerned, if we deal with only biclusters with constant values on

13、rows, it suffices for the other four types of biclusters mentioned earlier. The reader is requested to go through the original paper 2 in order to be convinced further. So throughout this paper, we deal with only biclusters with constant values on rows. The gist of ROBA is: every distinct non-zero r

14、ow of all of the E.M.s represents a distinct bicluster. We call this distinct nonzero row pivot row and all other rows candidate rows because pivot row decides if a candidate row should be included in its bicluster. To clarify further: we can represent the columns of a bicluster Bk = (G, C) where C

15、= c1, c2, , c|C| by a binary row vector q where qci = 1 (1 i |C|) and all other entries of q are 0. A candidate row, rc will be a member of the bicluster represented by a pivot row, rp if rp (bitwise AND) rc = rp. In other words, for each distinct value k, it records in a row of E.M.s the positions

16、i.e. column numbers of a row where k is present. Then using the bitwise AND operation, it searches for any other row(s) where at least in these positions the candidate row has a constant value l. It includes all the rows that satisfy this bitwise AND operation criterion in the bicluster represented

17、by the pivot row i.e. the bicluster where the columns are the positions of the pivot row where k is present. This process is carried out for each k. 978-1-4244-4713-8/10/$25.00 2010 IEEE We develop a time and space efficient implementation of ROBA. We elicit some subtle properties of the base princi

18、ple of ROBA to achieve these efficiencies. We reduce both time and space requirements by a factor of L where L is the number of distinct values present in the N x M input data matrix. Key points of our algorithm (figure 1) are: 1. Performing computation directly on the rows of the data matrix instea

19、d of constructing E.M.s 2. Detecting biclusters simultaneously instead of one after another. The comparison of time and space requirement between the existing implementation and our implementation is summarized in Table I. Figure 1. Proposed efficient implementation of ROBA (Algorithm 1) A. Experime

20、ntal Resutls We implemented our algorithm in Matlab on a computer having 1.8 GHz Dual Core processor, 1 GB RAM and running Windows Vista operating system. We used the much-used Yeast Saccharomyces Cerevisiae gene expression dataset 11 to compare the performance between the existing implementation 2

21、and our algorithm. The data are organized in a 2884 x 17 matrix where rows and columns correspond to genes and conditions respectively. The entries are integer numbers in a range from -1 to 595 with 207 distinct values. -1 entry indicates missing value. Result Analysis Like 2, we quantized the data

22、matrix according to the technique suggested in 2 which reduces the numbers of distinct values from 207 to 113. (Of course we dont need to quantize the data whereas 2 needs it much because, otherwise, 2 takes huge amount of time.) Then we have run the implementation of 2 and our algorithm to compare

23、the performance (see Table II). We have found that our algorithm is almost 11 times faster than the implementation of 2. We did not perform duplicate bicluster checking so there can be duplicate biclusters in the result. However, we did compare both of the outputs i.e. set of biclusters found by the

24、 two implementations and noticed no difference. Of course, the order of biclusters produced by the two implementations may be different. As 2 reported that their implementation of ROBA outperformed some classic works such as Cheng and Church 12 and Wang et al 13 in terms of time requirement, therefo

25、re our implementation outperforms them by even more degree. IV. PROPOSED ALGORITHM A. On the Completeness of ROBA We argue that ROBA may not find all perfect biclusters with constant values, constant values on rows and constant values on columns as its authors claimed in 2. Lets consider the followi

26、ng example first. Example 1: Let L = 2, |G| = 4, |C| = 4. Two examples of the values of the two elementary matrices are shown in figures 2(a) and 2(b). In figure 2(a) we have 7 distinct nonzero rows out of 8, so ROBA finds 7 biclusters and indeed there are exactly 7 biclusters (including trivial one

27、s). Now consider a slightly different E.M.s as shown in figure 2(b). Here also we have 7 distinct nonzero rows out of 8, so ROBA finds 7 biclusters. But actually there are 8 biclusters. The bicluster that ROBA fails to find is G = 1, 2, 4 and C = 1, 2. 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 1 0 0 0 0 1

28、 1 1 1 0 0 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 (a) (b) Figure 2. Two sets of E.M.s The pitfall of ROBA is formalized in lemma 1. Lemma 1: Let there is a bicluster Bk = (G, C) where C = c1, c2, , c|C| and let ri be any row of E.M.s; ri = ri1, ri2, , riM. If there is

29、no row in E.M.s such that all of rc1, rc2, , rc|C| are set to 1 i.e. rc1 = rc2 = = rc|C| = 1 and all other entries of ri are 0, then Bk will not be detected by ROBA. Proof: ROBA considers each row ri of E.M.s as pivot row rp and all other rows as candidate row rc. Then if rp AND rc = rp, it includes

30、 rc in the bicluster represented by rp. So ROBA considers each nonzero row as a bicluster. Now if there is a bicluster where the q vector (mentioned earlier) of a bicluster is not in any of the E.M.s, then ROBA cannot detect it because, as we said, ROBA can find only the biclusters that are represen

31、ted in the E.M.s In example 1, there is an undetected bicluster (G = 1, 2, 4 and C = 1, 2) because there was no row in figure 2(b) where r1 and r2 was set to 1 and all other ri are 0 i.e. 1 1 0 0. But this same bicluster was detected in figure 2(a) because the just-mentioned row 1 1 0 0 is present t

32、here. Now that we have identified a pitfall of ROBA, how can we alleviate it? The problem of finding all maximal biclusters is NP-hard 12. So in this study, we shall try to come up with a heuristic solution relating to ROBA; we shall try to use some observations to avoid exhaustive search for biclus

33、ters. At first, lets introduce a definition below. Dummy Row Construction Rule for Two Rows: Let ri and rj be any two rows of E.M.s If there are at least two rows ri and rj such that there are at least two positions p and q such that rip = 1, rjp = 0 and riq = 0, rjq = 1. Then a dummy row for ri and

34、 rj, d(ri, rj) is constructed like this: dk(ri, rj) = 1 Input: Two dimensional data matrix Output: Biclusters with constant values on rows found by ROBA 1. For each row rp /pivot row 2. Compute frequencies of occurrences and positions of occurrences of each value of rp 3. For each row rc /candidate

35、row 4. For each value vp of rp 5. Compute the set of positions sp where vp is present in rp. 6. Compute the set of values sv present in the positions sp of rc. 7. If there is only a single value present in sv, then include rc as a member of the bicluster represented by the positions of vp of rp.only

36、 if rik = rjk = 1; otherwise dk(ri, rj) = 0. Simply put, d(ri, rj) will be bitwise AND operation of ri and rj. We incorporate the above idea in algorithm 2 below and find some biclusters that have not been found by ROBA. Now the next question is: once we have done this, can there be more undetected

37、biclusters? Example 3 shows that still there can be some unearthed biclusters. Figure 3. ROBA with pair-wise dummy rows (Algorithm 2) Example 2: Here there is a bicluster involving rows 1, 2, 3 and columns 1, 2, 3. But even after running the algorithm 2, it remains undetected. The reason is: there w

38、ill be dummy rows, namely, d(1, 2) = 1 1 1 0 0 0 1 0 0, d(1, 3) = 1 1 1 0 0 0 0 1 0 and d(2, 3) = 1 1 1 0 0 0 0 1. None of these dummy rows represents the above mentioned bicluster. Hence it will not be found by algorithm 2. Bearing example 2 in mind, we can write the following conjecture. Conjectur

39、e 1: If we construct dummy rows involving every k rows, then some maximal biclusters involved in at least k + 1 rows may be missing. Here comes the question of accuracy we want. If the application at hand does not mind sparing some maximal biclusters involving at least k +1 rows of the E.M.s, then w

40、e should construct dummy rows using only k rows at a time. Now how can we construct dummy rows taking k rows at a time? We have shown in dummy construction rule for two rows how to make the dummy row using 2 rows. A similar rule below guides us to make it using k rows at a time. Dummy Row Constructi

41、on Rule for k Rows: The dummy row regarding k rows will be bitwise AND operation of all of the k rows. Our final strategy is: we have to construct dummy rows gradually: first we shall run the ROBA and get some biclusters. Then we shall build dummy rows based on every 2 rows of E.M.s; so there will b

42、e LNC2 dummy rows. We shall construct set D consisting of every distinct dummy row after discarding those which are already in E.M.s and then shall run step 3 of algorithm 2. Then we shall construct dummy rows based on every 3 rows of E.M.s to get LNC3 dummy rows and construct set D with them after

43、discarding those which are already in E.M.s and which were produced in the previous run when LNC2 dummy rows were generated; and then run step 3 of algorithm 2. This process will continue until termination condition is satisfied. B. The Algorithm We now present our algorithm which we call algorithm

44、3 that incorporates all the concepts discussed so far. Here convergence means some terminating criteria like total number of biclusters found or a particular bicluster found with specified set of rows/columns etc. is met. Figure 4. Finding all perfect biclusters (Algorithm 3) Time and Space Complexi

45、ty Analyses If we run algorithm 3 with higher values of k, then its time complexity becomes exponential because subset generation (to select every k rows of E.M.s) is involved. Of course, we adopted some tricky techniques in the implementation of algorithm 3 to reduce time and space requirement such

46、 as avoiding construction of E.M.s altogether etc. C. Experimental Resutls We implemented a truncated version one run of the repeat loop at line 3 of algorithm 3 in Matlab on the same computer mentioned earlier in this paper. Result Analysis. We tested algorithm 3 on an excerpt of first 500 rows of

47、the 2884x17 and also on the whole dataset of Yeast gene expression data 11. The excerpt and the whole dataset contain, after data quantization, 87 and 113 distinct values respectively. The result of both of the algorithms is summarized in Table III. We find that algorithm 3 discovered a significant

48、number of biclusters that ROBA could not discover. For the whole dataset, the average entries, rows and columns of the biclusters found by algorithm 3 is 166, 28.43 and 6.61 respectively. Comparison with Genetic Algorithm. We developed a typical genetic algorithm (GA) algorithm that uses the biclust

49、ers found by ROBA as seeds to compare performance of algorithm 3. As the result shows in table III that our algorithm outperforms GA in terms of number of biclusters found when both of the programs are run in roughly same amount of time. We did not run GA on the whole dataset because experiment on 5

50、00x17 excerpt shows that GAs performance is much worse than algorithm 3. We did perform duplicate and non-maximal bicluster checking for all of the algorithms so there is no duplicate or non-maximal bicluster in the result. 1 1 1 0 0 1 1 1 0 1 1 1 0 1 0 1 0 1 1 1 1 1 0 0 0 1 1 Step 1: Run ROBA. Step

51、 2: For each pair of rows ri and rj of E.M.s, generate a dummy rows and put it in a set D. Step 3: Treat the rows of the set D as pivot rows but not as candidate rows (i.e, the rows of D will be treated as the rows of E.Ms thereby will work as pivot rows and they will treat all rows of the E.M.s as

52、candidate rows but not themselves) and run ROBA again based on this idea. 1. Run ROBA (Our implementation is recommended to use) 2. k := 2 3. Repeat until convergence 4. Construct set D consisting of all the dummy rows using every k rows of E.M.s So there will be LNCk dummy rows. 5. Discard a dummy

53、row if (1) it is already in E.M.s, (2) it is in the set of dummy rows built in the previous iterations, (3) it has all 0, (4) it fails to comply with the minimum size of columns of a bicluster etc. 6. Run step 3 of algorithm 2 using E.M.s and D. 7. k := k + 1 8. End repeatSignificance of the Biclust

54、ers Found. The biological significance of the biclusters found in ROBA is discussed in 2. Whats more, ROBA has been successfully used in a number of works 4 -10. This underlines the importance of ROBA. However, we did not perform any analysis yet on the biological significance of the additional bicl

55、usters discovered by our algorithm but not discovered by ROBA. The authors of ROBA analyzed in 2 that ROBA outperforms some classic works like Cheng and Church 12 and Yang et al 14 in terms of quality and number of biclusters found. Our algorithm finds all of the biclusters found by ROBA plus some m

56、ore. So our algorithm also outperforms the algorithms outperformed by ROBA. Biclusters with constant values and constant values on rows have their own use, for example, when potential biomarkers are looked for. V. CONCLUSION We have proposed a time and space efficient implementation of a recent bicl

57、ustering algorithm named ROBA and then we have presented an improved version of ROBA. ROBA has been used in a number of research works 4-10. We have tested the efficient implementation and our improved algorithm on Yeast gene expression dataset. Our efficient implementation has run almost 11 times f

58、aster than the existing implementation and our improved version has found almost 10 times more biclusters than ROBA which are revealed in experimental results. ACKNOWLEDGMENT Thanks to A.B. Tchagang for providing us with information about the application of constant biclusters. REFERENCES 1 Madeira,

59、 S.C., and Oliveira A.L. Biclustering algorithms for biological data analysis: a survey. IEEE Trans Computational Biology and Bioinformatics,1, 1 (2004), 24-48. 2 Tchagang A.B., and Tewfik A.H. DNA Microarray data analysis: A novel biclustering algorithm approach. EURASIP Journal on Applied Signal P

60、rocessing, Hindawi Publishing Corporation, Volume 2006, Article ID 59809: 112. 3 Tchagang A.B., and Tewfik A.H. Robust Biclustering Algorithm (ROBA) for DNA microarray data analysis. 13th IEEE Workshop on Statistical Signal Processing, 2005. 984-989. 4 Mahfouz M.A., and Ismail M.A. BIDENS: Iterative

61、 density based biclustering algorithm with application to gene expression analysis. In Proceedings of World Academy of Science, Engineering and Technology, 37, (January 2009), 342-348 5 Tchagang A.B., and Tewfik A.H. Distributed Robust Biclustering Algorithm for gene expression analysis. IEEE Intern

62、ational Workshop on Genomic Signal Processing and Statistics, 2007, 1-4. 6 Tchagang A.B., Tewfik A.H., and Benos P.V. Biological evaluation of biclustering algorithms using gene ontology and chip-chip data. IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, 637 640. 7 Tc

63、hagang A.B., Tewfik A.H., DeRycke M.S., Skubitz K.M., and Skubitz A.P.N. Early detection of ovarian cancer using group biomarkers. Molecular Cancer Therapeutics, 7, 1(2008), 2737. 8 Tchagang A.B., Tewfik A.H., Skubitz A.P.N., and Skubitz K.M. Uncovering potential biomarkers in ovarian carcinoma via

64、biclustering of DNA microarray data. IEEE International Workshop on Genomic Signal Processing and Statistics, 2006, 107 108. 9 Tchagang A.B., Tewfik A.H., and Skubitz A.P.N. Analysis of order preserving genes biclusters. IEEE International Workshop on Genomic Signal Processing and Statistics, 2006,

65、101 102. 10 Vertatschitsch L., Tewfik A.H., and Tchagang A.B. Biological significance of a novel biclustering technique on genetic expression data. In Proceedings of the 5th IEEE International Symposium on Signal Proc. and Information Technology, 2005, 35-39. 11 Yeast data set. http:/arep.med.harvar

66、d.edu/biclustering/yeast.matrix (Last accessed on August 1, 2009) 12 Cheng Y., and Church G.M. Biclustering of gene expression data. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology 2000, 93-103. 13 Wang H., Wang W., Yang J., Yu P.S., Clustering by pattern

67、similarity in large data sets, Proc. ACM SIGMOD, 394405, 2002. 14 Yang J., Wang H., Wang W., and Yu P.S. Enhanced biclustering on expression data. In Proceedings of 3rd IEEE Symposium on Bioinformatics and Bioengineering (BIBE 03) March 2003, 321327, 2009.TABLE I: COMPARISON OF COMPLEXITIES BETWEEN

68、EXISTING AND OUR IMPLEMENTATION OF ROBA TABLE II: PERFORMANCE COMPARISON BETWEEN THE IMPLEMENTATION OF 2 AND OUR ALGORITHM TABLE III: COMPARISON BETWEEN ROBA, ALGORITHM 3 and GA. Dimension of Dataset Mininum # of rows Minimum # of columns Total # of biclusters found Gain of algorithm 3 over ROBA in

69、finding # of biclusters ROBA Algorithm 3 GA 500x17 3 2 964 4531 2817 4531/964 = 4.7 500x17 4 3 828 4394 2121 4394/828 = 5.31 500x17 5 4 443 3657 1115 3657/443 = 8.26 2884x17 5 4 2848 28112 - 28112/2848 = 9.87 Normal In terms of Nb(# of biclusters found) Existing implemenation Our implementation Exis

70、ting implementatino Our implementation Time O(L2N2M) O(N2M) O(LNbNM) O(NbNM) Space O(LNM) O(MN+LM) O(LNM) O(MN+LM) # of distinct values in data matrix Mininum # of rows Minimum # of columns Total number of biclusters found Time taken (in seconds) Gain of our implementation over 2 Implementation of 2 Our implementation 113 3 2 10807 17144 1601 17144/10807 =10.71

展开阅读全文

外文翻译--On the Complexity and Completeness of Robust Biclustering Algorithm ROBA

最新文档