Data Mining in Bioinformatics


Peter Bajcsy, PhD
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
pbajcsy@ncsa.uiuc.edu
January 31, 2002

Data Mining in Bioinformatics

Outline
- Introduction
- Overview of Microarray Problem
- Image Analysis
- Data Mining
- Validation
- Summary

Introduction: Recommended Literature
1. Bioinformatics: The Machine Learning Approach by P. Baldi & S. Brunak, 2nd edition, The MIT Press, 2001
2. Data Mining: Concepts and Techniques by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001
3. Pattern Classification by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001

Introduction: Microarray Problem in Bioinformatics Domain
Problems in the bioinformatics domain:
- Data production at the levels of molecules, cells, organs, organisms, populations
- Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data
- Prediction of molecular function and structure
- Computational biology: synthesis (simulations) and analysis (machine learning)

Microarray Problem: Major Objective
- Major objective: discover a comprehensive theory of life's organization at the molecular level
- The major actors of molecular biology: the nucleic acids, DeoxyriboNucleic Acid (DNA) and RiboNucleic Acid (RNA)
- The central dogma of molecular biology
- Proteins are very complicated molecules built from 20 different amino acids

Input and Output of Microarray Data Analysis
- Input: laser image scans (data) and underlying experiment hypotheses or experiment designs (prior knowledge)
- Output: conclusions about the input hypotheses, or knowledge about the statistical behavior of measurements
- The theory of biological systems learnt automatically from data (machine learning perspective)
- Model fitting, inference process

Overview of Microarray Problem
[Diagram labels: Biology Application Domain; Experiment Design and Hypothesis; Microarray Experiment; Image Analysis; Data Analysis; Data Mining; Artificial Intelligence (AI); Knowledge Discovery in Databases (KDD); Data Warehouse; Validation]

Artificial Intelligence (AI) Community
Issues:
- Prior knowledge (e.g., invariance)
- Model deviation from true model
- Sampling distributions
- Computational complexity
- Model complexity (overfitting)
[Diagram: Design Cycle of Predictive Modeling: collect data, choose features, choose model, train classifier, evaluate classifier]

Knowledge Discovery in Databases (KDD) Community
[Diagram label: Database]

Data Mining and Image Analysis Steps
- Image analysis
  - Normalization
  - Grid alignment
  - Feature construction (selection and extraction)
- Data mining
  - Statistics
  - Machine learning
  - Pattern recognition
  - Database techniques
  - Optimization techniques
  - Visualization
  - Prior knowledge
- Validation
  - Issues
  - Cross validation techniques?

IMAGE ANALYSIS

Image Analysis: Normalization
[Figure panels: Red Band; Green Band; Dynamic range of red band; Dynamic range of green band]
- Solution: reference points with reference values

Image Analysis: Grid Alignment
- Solution: manual, semi-automatic and fully automatic alignment based on fiducials and/or global grid fitting

Image Analysis: Feature Selection
- Features: mean, median, standard deviation, ratios
- Area: sensitive to background noise

Image Analysis: Feature Extraction
- Area is determined by image thresholding and used during feature extraction
[Figure labels: Dist: 2004; Box: 902; Plane: 26321102 (digit grouping unclear in the source)]

DATA MINING

Why Data Mining? Sequence Example
Biology: language and goals
- A gene can be defined as a region of DNA.
- A genome is one haploid set of chromosomes with the genes they contain.
- Perform competent comparison of gene sequences across species, and account for inherently noisy biological sequences due to random variability amplified by evolution.
- Assumption: if a gene has high similarity to another gene, then they perform the same function.
Analysis: language and goals
- A feature is an extractable attribute or measurement (e.g., gene expression, location).
- Pattern recognition tries to characterize data patterns (e.g., similar gene expressions, equidistant gene locations).
- Data mining is about uncovering patterns, anomalies and statistically significant structures in data (e.g., find two similar gene expressions with confidence x).

Data Mining Techniques
- Visualization

Statistics
- Inductive statistics: Are two sample sets identically distributed? Make forecasts and inferences.
- Descriptive statistics: Describe data.

Machine Learning
- Supervised: examples
- Unsupervised: “natural groupings”
- Reinforced

Pattern Recognition
- Linear correlation and regression
- Neural networks
  - NN representation and gradient-based optimization
  - NN representation and genetic-algorithm-based optimization
- Statistical models
- Decision trees
- Locally weighted learning
  - k-nearest neighbors, support vectors

Database Techniques
- Database design and modeling (tables, procedures, functions, constraints)
- Database interface to data mining system
- Efficient import and export of data
- Database data visualization
- Database clustering for access efficiency
- Database performance tuning (memory usage, query encoding)
- Database parallel processing (multiple servers and CPUs)
- Distributed information repositories (data warehouse)

Optimization Techniques
- Highly nonlinear search space (global versus local maxima)
- Gradient-based optimization
- Genetic-algorithm-based optimization
- Optimization with sampling
- Large search space
- Example: a genome with N genes can encode 2^N states (active or inactive states; regulated states are not considered). Human genome: 2^30,000 patterns; nematode genome: 2^20,000 patterns.

Visualization
- Data: 3D cubes, distribution charts, curves, surfaces, link graphs, image frames and movies, parallel coordinates
- Results: pie charts, scatter plots, box plots, association rules, parallel coordinates, dendrograms, temporal evolution
[Figure panels: Pie chart; Parallel coordinates; Temporal evolution]

Prior Knowledge from Experiment Design
Complexity levels of microarray experiments:
1. Compare a single gene in a control situation versus a treatment situation
   - Example: Is the level of expression (up-regulated or down-regulated) significantly different in the two situations? (drug design application)
   - Methods: t-test, Bayesian approach
2. Find multiple genes that share common functionalities
   - Example: Find related genes that are dependent.
   - Methods: clustering (hierarchical, k-means, self-organizing maps, neural network, support vector machines)
3. Infer the underlying gene and protein networks that are responsible for the patterns and functional pathways observed
   - Example: What is the gene regulation at system level?
   - Directions: mining regulatory regions, modeling regulatory networks on a global scale
Goal of future experiment designs: understand biology at the system level, e.g., gene networks, protein networks, signaling networks, metabolic networks, immune system and neuronal networks.

Types of Expected Data Mining and Analysis Results
Hypothetical examples:
- Binary answers using tests of hypotheses: Drug treatment is successful with a confidence level x.
- Statistical behavior (probability distribution functions): A class of genes with functionality X follows a Poisson distribution.
- Expected events: As the amount of treatment increases, the gene expression level decreases.
- Relationships: Expression level of gene A is correlated with expression level of gene B under varying treatment conditions (genes A and B are part of the same pathway).
- Decision trees: Classification of a new gene sequence by a “domain expert”.

VALIDATION

Why Validation?
Validation type:
- Within the existing data
- With newly collected data
Errors and uncertainties:
- Systematic or random errors
- Unknown variables: number of classes
- Noise level: statistical confidence due to noise
- Model validity: error measure, model over-fit or under-fit
- Number of data points: measurement replicas
Other issues:
- Experimental support of general theories
- Exhaustive sampling is not feasible

Cross Validation: Example
One-tier cross validation
- Train on different data than the test data.
Two-tier cross validation
- The score from one-tier cross validation is used by the bias optimizer to select the best learning algorithm parameters (# of control points).
- The more you optimize, the more you over-fit. The second tier measures the level of over-fit (unbiased measure of accuracy).
- Useful for comparing learning algorithms with control parameters that are optimized.
- Number of folds is not optimized.
- Computational complexity: #folds of top tier x #folds of bottom tier x #control points x CPU of algorithm

Summary
- Microarray problem
  - Computational biology
  - Major objective of microarray technology
  - Input and output of data analysis
- Data mining and image analysis steps
  - Image normalization, grid alignment, feature construction
  - Data mining techniques
  - Prior knowledge
  - Expected results of data mining
- Validation
  - Issues
  - Cross validation techniques
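
The feature selection and feature extraction slides say that the spot area is found by image thresholding and then used to compute features such as mean, median, standard deviation and ratios. The sketch below illustrates that idea under several assumptions that are not in the slides: a spot is a small 2D list of red and green pixel intensities, the threshold is a fixed hypothetical cutoff on the summed intensity, and the function name `spot_features` and the sample values are made up for illustration.

```python
from statistics import mean, median, pstdev


def spot_features(red, green, threshold):
    """Return summary features for one microarray spot.

    red, green: equally sized 2D lists of pixel intensities (assumed layout).
    threshold: hypothetical cutoff separating spot pixels from background.
    """
    # The spot area is the set of pixels whose summed intensity exceeds the threshold.
    mask = [
        (r, c)
        for r, row in enumerate(red)
        for c, _ in enumerate(row)
        if red[r][c] + green[r][c] > threshold
    ]
    if not mask:
        raise ValueError("no pixels above threshold")
    red_vals = [red[r][c] for r, c in mask]
    green_vals = [green[r][c] for r, c in mask]
    return {
        "area": len(mask),
        "red_mean": mean(red_vals),
        "red_median": median(red_vals),
        "red_std": pstdev(red_vals),
        "green_mean": mean(green_vals),
        # Red/green ratio of mean intensities inside the thresholded area.
        "ratio": mean(red_vals) / mean(green_vals),
    }


# Made-up 3x3 spot: bright pixels in the lower-right corner, dim background elsewhere.
red = [[1, 2, 1], [2, 90, 110], [1, 100, 100]]
green = [[1, 1, 2], [1, 45, 55], [2, 50, 50]]
features = spot_features(red, green, threshold=50)
print(features)
```

Because only pixels inside the thresholded area contribute, the background pixels do not pull the mean down, which is one way to read the slide's remark that area is sensitive to background noise: a poorly chosen threshold changes which pixels are counted.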
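
The optimization slide's example, 2^N states for N genes, is easier to appreciate with the magnitudes written out. The sketch below counts the decimal digits of 2^N using logarithms instead of building the integer; the helper name is hypothetical, and the gene counts are the ones quoted on the slide.

```python
from math import floor, log10


def decimal_digits_of_power_of_two(n_genes):
    """Number of decimal digits in 2**n_genes, computed via log10."""
    return floor(n_genes * log10(2)) + 1


# Gene counts as quoted on the slide (on/off states only).
for name, n_genes in [("human", 30_000), ("nematode", 20_000)]:
    digits = decimal_digits_of_power_of_two(n_genes)
    print(f"{name}: 2^{n_genes} has {digits} decimal digits")
```

The human figure comes out as a number with 9,031 decimal digits and the nematode figure as one with 6,021, which is why the slide pairs a large search space with sampling-based and heuristic optimization rather than exhaustive search.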
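
For the first complexity level (one gene, control versus treatment), the slides name the t-test as a method. The sketch below computes Welch's unequal-variance form of the t statistic, which is one common variant and an assumption here, since the slides do not say which form was intended. The expression values are invented, and turning the statistic into a p-value would additionally need a t-distribution table or a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(control, treatment):
    """Welch's t statistic and approximate degrees of freedom."""
    n1, n2 = len(control), len(treatment)
    v1, v2 = variance(control), variance(treatment)  # sample variances
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (mean(treatment) - mean(control)) / sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Invented replicate expression levels for a single gene.
control = [1.0, 1.2, 0.9, 1.1]
treatment = [2.0, 2.3, 1.9, 2.2]
t, df = welch_t(control, treatment)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A large absolute t relative to the critical value for the computed degrees of freedom would suggest the gene is up- or down-regulated under treatment; with only a few replicates per condition, as is typical for microarrays, the test has limited power.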
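
The two-tier cross validation described on the "Cross Validation: Example" slide can be sketched as nested loops: the bottom tier scores each candidate parameter by one-tier cross validation, and the top tier measures accuracy on data that played no part in that choice. Everything concrete below is an assumption made for illustration: the learner is a one-dimensional k-nearest-neighbor classifier, its `k` stands in for the slide's "control points", and the data are synthetic.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k):
    """Fraction of test points classified correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def folds(data, n_folds):
    """Yield (train, test) splits; fold i holds out every n_folds-th item."""
    for i in range(n_folds):
        test = data[i::n_folds]
        train = [p for j, p in enumerate(data) if j % n_folds != i]
        yield train, test


def one_tier_score(data, k, n_folds):
    """One-tier cross validation: mean accuracy over held-out folds."""
    scores = [accuracy(tr, te, k) for tr, te in folds(data, n_folds)]
    return sum(scores) / len(scores)


def two_tier_score(data, ks, top_folds, bottom_folds):
    """Top tier scores the parameter chosen by the bottom tier."""
    top_scores = []
    for train, test in folds(data, top_folds):
        # Bottom tier: pick k using the top-tier training portion only.
        best_k = max(ks, key=lambda k: one_tier_score(train, k, bottom_folds))
        # Top tier: evaluate that choice on data it has never seen.
        top_scores.append(accuracy(train, test, best_k))
    return sum(top_scores) / len(top_scores)


# Synthetic two-class data: class 0 centered at 0, class 1 centered at 3.
rng = random.Random(0)
data = [(rng.gauss(0, 1), 0) for _ in range(30)]
data += [(rng.gauss(3, 1), 1) for _ in range(30)]
rng.shuffle(data)

ks = [1, 3, 5]
score = two_tier_score(data, ks, top_folds=5, bottom_folds=4)
# Cost in bottom-tier evaluations: top folds x bottom folds x control points.
fits = 5 * 4 * len(ks)
print(f"two-tier accuracy estimate: {score:.2f} using {fits} bottom-tier evaluations")
```

The `fits` count mirrors the slide's complexity expression, with the cost of one training-and-test run of the learner playing the role of "CPU of algorithm". As on the slide, the numbers of folds are fixed rather than optimized.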
