Temporal Probabilistic Concepts from Heterogeneous Data Sequences

Sally McClean, Bryan Scotney, Fiona Palmer
School of Information & Software Engineering, University of Ulster

Background: Gene Expression

Scientists have now sequenced the entire human genome, approximately 30,000 genes. Each of
these genes, when active, results in the production of a protein; proteins have a variety of functions. To understand the function of the genes and their related proteins, scientists want to determine where and when the genes are active.

Figure: the steps involved in producing a protein from a gene: Gene (DNA) -> RNA -> Protein.

Background: Gene Expression Results

The DNA microarray is a microscope slide that enables scientists to determine the activity, or expression, of genes. On each spot of the microarray, scientists place an extract of the cells along with an extract from a reference sample. The more RNA produced, the more active the gene (green for the sample and red for the reference). The fluorescence of the spot is then measured to give the expression of the gene relative to the reference.

Background: The Gene Expression Data Set

The gene expression data set analysed describes the expression
of 112 genes in the rat cervical spinal cord at 9 time points through the development of the rat from embryo to adult. Only specific genes, those considered important in the development of the rat central nervous system, were analysed.

Time points: E11, E13, E15, E18, E21, P0, P7, P14, A
(E = embryo, days since conception; P = postnatal, days since birth; A = adult)
Figure: the temporal nature of the gene expression data.

Clustering: Mutual Information

Clustering is usually based on a distance metric; in this case, mutual information. Before clustering, the continuous gene expression values were discretised by partitioning them into 3 equal-sized bins.

Gene      E11  E13  E15  E18  E21  P0   P7   P14  A
nAChRa2   0    0    0    1    2    2    2    2    1
mAChR2    0    0    0    2    2    2    2    2    1
mAChR3    0    0    0    1    2    2    1    1    0
nAChRa3   0    0    2    1    2    2    1    1    0
EGFR      0    1    0    1    2    2    2    2    0
NFL       0    1    1    1    1    2    2    1    0
nAChRa7   1    0    0    2    2    2    2    2    1
MK2       2    2    2    1    1    1    1    1    0
PDGFR     2    2    2    0    0    0    0    0    1
Gene expression sequences for cluster 3 (columns are time points).

The Clusters

In this paper we mainly use data from cluster 2.

The Process

1. Cluster
2. Learn mappings
3. Map sequences
4. Learn local temporal concepts
5. Learn temporal probabilistic concepts
Figure: the steps used to learn the temporal semantics of sequences. The diagram labels the input as a set of sequences, the intermediate results as clusters and mappings and as homogenised sequences, and the final output as a characterisation of the cluster.

An Example: The Problem

Sequence 1   0 0 1 1 1 1 1 1 0
Sequence 2   0 0 0 1 1 1 1 2 2
Sequence 3   0 0 0 1 1 1 1 2 2
Sequence 4   0 0 0 2 2 1 1 1 1
Gene expression data.

The sequences are heterogeneous in the sense that they represent different attributes. The codes (0, 1, or 2) should be regarded as symbolic; we relabel them to emphasise this.

Sequence 1   A A B B B B B B A
Sequence 2   C C C D D D D E E
Sequence 3   F F F H H H H G G
Sequence 4   I I I K K J J J J
Relabelled gene expression data.

Mapping: Mappings

The schema mappings are between each sequence (the local ontologies) and the hidden variable, with values L, M and N. We represent the underlying concept (the global ontology) via a temporal probabilistic concept model.

Sequence 1   A -> L or N   B -> M
Sequence 2   C -> L        D -> M   E -> N
Sequence 3   F -> L        G -> N   H -> M
Sequence 4   I -> L        J -> N   K -> M
Schema mappings.

Figure: correspondence graph for the cluster containing sequences 1 and 2, linking the symbols of Sequence 1 (A, B) and Sequence 2 (C, D, E) to the values of the hidden concept (L, M, N).

Sequence 1   L/N L/N M   M   M   M   M   M   L/N
Sequence 2   L   L   L   M   M   M   M   N   N
Sequence 3   L   L   L   M   M   M   M   N   N
Sequence 4   L   L   L   M   M   N   N   N   N
Mapped sequences.

Mapping: The Mapping Algorithm

Choose one of the sequences whose number of symbols is maximal (S*, say); these symbols act as a proxy for the values of the global ontology. For each remaining sequence Si, of length L, determine the mapping of the rth value of Si onto one of the values of the global ontology so as to maximise the number of co-occurrences. Repeat for each r and i. In the ith sequence, the value r is then mapped to a set of values (a partial value) if the maximising mapping is not unique.

Concept Learning: Concept Definitions

The concepts we are concerned with may be thought of as symbolic objects described in terms of discrete-valued features, e.g. the features expression level and function, with respective domains {low, medium, high} and {growth, control}. An example concept is C1 = {expression level = high; function = growth}.

Probabilistic concepts have been used to extend the definition of a concept to uncertain situations, where we must associate a probability with the values of each feature vector, e.g. C3
= {expression level = high: 0.8, expression level = medium: 0.2, function = growth: 1.0}.

Concept Learning: Local & Temporal Probabilistic Concepts

A local probabilistic concept (LPC) is defined on a time interval. For example, in the time interval S = [t1, t2] we have the local probabilistic concept C4 = {Time = S, expression level = high: 0.8, expression level = medium: 0.2}; that is, during time interval S the expression level is high with probability 0.8 and medium with probability 0.2.

A temporal probabilistic concept (TPC) is defined in terms of a time attribute with domain T = {t1, ..., tk} and discrete-valued
features Xj, where Xj has domain Dj = {vj1, ...}.

Concept Learning: Learning Local Probabilistic Concepts

The algorithm for learning LPCs takes account of the fact that the schema mappings may map a value in the local ontology onto a set of global values (a partial value). We use the EM algorithm to learn
LPCs with values that are expressed as local probabilistic concepts.

Sequence 1   C  C  D  D  D  D    D    D    C
Sequence 2   C  C  C  D  D  D    D    E    E
Sequence 3   C  C  C  D  D  D    D    E    E
Sequence 4   C  C  C  D  D  D/E  D/E  D/E  D/E

Then, for example, using only the data at the eighth time point (column 9 of Table 5), we obtain an update equation, and iterating it yields the solution. [Equations not reproduced in the slide text.]

Concept Learning: Learning Temporal Probabilistic Concepts

Once we have learned the local probabilistic concepts, the next task is to learn the TPC. This is carried out using temporal clustering, based on log-likelihood ratios and chi-squared tests.

In the four mapped sequences shown above, the values for the first two time points (columns) are identical, so the distance d12 is zero and we combine LPC1 and LPC2 to form LPC12. We must now decide whether LPC12 should be combined with LPC3 or whether LPC3 is part of a new LPC. The distance between LPC12 and LPC3 is 1.193. Since this value is inside the chi-squared threshold, we combine LPC12 and LPC3, and so on.

Figures: Cluster 2, rat gene sequence data; Cluster 2, mapped rat data.

Cluster 2: The LPCs and TPC

There are 4 LPCs, represented by the 4 colours in the figure. These clusters are then characterised by the local probabilistic concepts:

E11, E13:               (0.961, 0, 0.039)
E15:                    (0.28, 0, 0.72)
E18, E21, P0, P14, A:   (0, 0, 1)
P7:                     (0, 0.154, 0.846)

Conclusion

We have described a methodology for describing and learning temporal concepts from heterogeneous sequences that have the same underlying temporal pattern. The data are heterogeneous with respect to classification schemes. However, because the sequences relate to the same underlying concept, the mappings between values may be learned. On the basis of these mappings we use statistical learning methods to describe the local probabilistic concepts. A temporal probabilistic concept that describes the underlying pattern is then learned. This concept may be matched with known genetic processes and pathways.

Further Work

For the moment we have not considered performance issues, since the problem we have identified is both novel and complex. Our focus,
therefore, has been on defining terminology and providing a preliminary methodology. In addition to addressing such performance issues, future work will also investigate the related problem of associating clusters with explanatory data; for example, our gene expression sequences could be related to the growth process.
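
Illustration: discretisation and mutual information

The clustering slide says only that expression values were split into 3 equal-sized bins and that mutual information was used as the distance. The sketch below shows one way this could look. It assumes equal-width bins (equal-frequency bins would also fit the wording) and the standard plug-in estimate of mutual information in bits. The function names `discretise` and `mutual_information` are illustrative and do not come from the slides; the three sequences are copied from the cluster 3 table.

```python
from collections import Counter
from math import log2


def discretise(values, n_bins=3):
    """Equal-width binning (an assumption): map each value to 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits for two equal-length sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


# Discretised sequences taken from the cluster 3 table.
sequences = {
    "nAChRa2": "000122221",
    "mAChR2": "000222221",
    "PDGFR": "222000001",
}

names = list(sequences)
for i, g in enumerate(names):
    for h in names[i + 1:]:
        mi = mutual_information(sequences[g], sequences[h])
        print(f"I({g}; {h}) = {mi:.3f} bits")
```

Because mutual information depends only on how symbols co-occur, not on their numeric values, a sequence such as PDGFR that mirrors the others with different codes can still score highly. This is consistent with the slides' remark that the codes should be treated as symbolic.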
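
Illustration: EM with partial values at a single time point

The slide on learning LPCs states that EM is used because a mapped value may be a set of global values, but the update equation itself did not survive in the text. The sketch below is a generic EM for a categorical distribution with set-valued observations, not necessarily the authors' exact formulation. The function name `em_partial`, the uniform starting point and the fixed iteration count are assumptions. The data are the eighth time point of the four mapped sequences: D, E, E and the partial value D/E.

```python
def em_partial(observations, domain, n_iter=50):
    """Estimate P(value) when each observation is a set of candidate values.

    A singleton set is an exact observation; a larger set is a partial value.
    """
    p = {v: 1.0 / len(domain) for v in domain}  # uniform start (assumption)
    for _ in range(n_iter):
        counts = {v: 0.0 for v in domain}
        # E-step: share each observation among its candidates,
        # in proportion to the current probabilities.
        for obs in observations:
            total = sum(p[v] for v in obs)
            for v in obs:
                counts[v] += p[v] / total
        # M-step: renormalise the expected counts.
        p = {v: counts[v] / len(observations) for v in domain}
    return p


# Eighth time point of the mapped sequences: D, E, E, D/E.
time_point_8 = [{"D"}, {"E"}, {"E"}, {"D", "E"}]
lpc = em_partial(time_point_8, domain=("C", "D", "E"))
print({v: round(prob, 3) for v, prob in lpc.items()})
```

In this sketch the partial value D/E is divided between D and E according to the current estimate, so the estimate for D satisfies p = (1 + p) / 4 at convergence, giving about 1/3 for D and 2/3 for E. That figure is a property of this sketch; the solution shown on the original slide is not available for comparison.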
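
Illustration: log-likelihood ratio distance for temporal clustering

The temporal clustering slide quotes a distance of 1.193 between LPC12 and LPC3 without giving the formula. One reading that reproduces this number is the log-likelihood ratio sum of O ln(O/E) over a 2 x k contingency table of value counts, sketched below. The function name `llr_distance` is illustrative. Comparing twice the distance with the 5% chi-squared critical value for k - 1 = 2 degrees of freedom (5.991) is an assumption, since the slides do not state the threshold. Partial values are not handled here; the first three columns contain none.

```python
from collections import Counter
from math import log


def llr_distance(group_a, group_b):
    """Sum of O * ln(O / E) over the 2 x k table of value counts."""
    ca, cb = Counter(group_a), Counter(group_b)
    na, nb = len(group_a), len(group_b)
    n = na + nb
    d = 0.0
    for counts, size in ((ca, na), (cb, nb)):
        for value, observed in counts.items():
            # Expected count if both groups shared one distribution.
            expected = size * (ca[value] + cb[value]) / n
            d += observed * log(observed / expected)
    return d


# Columns 1-3 of the four mapped sequences (sequence 1 to sequence 4).
col1 = ["C", "C", "C", "C"]
col2 = ["C", "C", "C", "C"]
col3 = ["D", "C", "C", "C"]

d12 = llr_distance(col1, col2)
d12_3 = llr_distance(col1 + col2, col3)
CHI2_5PCT_DF2 = 5.991  # assumed threshold: 5% level, 2 degrees of freedom

print(f"d12 = {d12:.3f}")
print(f"d(12, 3) = {d12_3:.3f}")
print("merge" if 2 * d12_3 < CHI2_5PCT_DF2 else "start a new LPC")
```

With identical first and second columns the distance is zero, and pooling them against the third column gives a value that rounds to 1.193, matching the figure quoted in the slides.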