Beyond Probabilistic and Statistical Methods

Uploaded by: wm****3 Document ID: 52130465 Upload date: 2018-08-18 Format: PPT Pages: 45 Size: 194.50 KB

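Language Theory and Bioinformatics
Bailin Hao
T-Life Research Center, Fudan University
Institute of Theoretical Physics, Academia Sinica
http:/

Analysis of DNA Sequences
A first and necessary step in any analysis:
  Frequency of appearance of strings
  Correlations of letters and strings
  1D and 2D DNA walks vs. random walk

Summary in two lines according to Luo Liao-fu:
  1. DNA sequences are not random.
  2. Their characteristics are close to randomness.
Hint: Statistical methods alone are not powerful enough to amplify the difference between DNA and random sequences, or the differences among DNA sequences themselves.
Hence the need for new "deterministic" approaches.

Beyond Probabilistic and Statistical Methods
Probability and statistics are the basic skills:
  Frequencies and correlations, Markov chains and hidden Markov chains
  Neural network models
  Bayesian statistics, "prior" distributions
Are random sequences a good frame of reference?
  Sufficiently long symbolic sequences have unavoidable "regularities"
  Are genome sequences long enough?
Random motion with definite outcomes:
  Causality and teleology
  Stochastic differential equations determined by the final-value distribution
  Beyond Langevin: other formulations of stochastic differential equations
  Molecular motors, motion along the cytoskeleton
Linguistic methods: syntax and semantics
  Semantic problems, the genetic "dictionary"
  Gnomics: A DNA Dictionary (1986)
  At present:
    5000 transcription factor binding sites
    300 restriction endonuclease recognition sites
    Various repeat sequences, satellites, microsatellites

Language Metaphor in Biology
  Transcription
  Translation
  Editing
  Modification

Words
As landmarks, e.g., recognition sites for:
  Restriction endonucleases (REBASE)
  Methylases (REBASE)
  Transcription factors (TRANSFAC)
As components of "sentences":
  Promoters (EPD), enhancers
  Silencers, insulators, terminators
  Splicing sites

Sentences
  enhancer  silencer  enhancer  promoter  (exon intron)^k  exon  terminator
Essays/Articles
  genes, "junk", ...
Encyclopedia
  Complete genome of a species
Reference Library
  Kingdom Monera, ..., Kingdom Animalia

Natural Language and Genetic Language
Similarities:
  Polysemy
  Redundancy
  Error tolerance and error correction
  Long-range correlations
  Both are based on discrete combinatorial systems
  Both have some grammar, but cannot be generated by it completely
  Dialects, individual variation
  Evolution, mutation, extinction
  Historical "junk", archaisms, "fossils"
  Loanwords, horizontal exchange
Differences:
  Punctuation and spacing differ
  Interaction between the two languages
  Two- and three-dimensional interactions
  Number and role of repeated sequences

Linguistic Methods (the study of language, not philology)
Statistical linguistics:
  Frequencies and correlations of "words"
  Zipf's law
Algebraic linguistics:
  Generative grammar and grammatical complexity
  Sequential generation: the Chomsky system
  Parallel generation: Lindenmayer systems (from developmental biology)
  Factorizable languages
Fuzzy linguistics:
  The formal generalization is not difficult: Z. G. Yu (2001)
How to bring in biological knowledge quantitatively:
  Consensus sequences and weight matrices
  Stochastic grammars
  Hidden Markov chain = stochastic regular grammar
  Higher-order stochastic grammars?

Consensus Sequences (subscripts give the percentage occurrence of each base)
  TATAAT (Pribnow or -10 box): $T_{80}A_{95}T_{45}A_{60}A_{50}T_{96}$
  TTGACA (-35 box): $T_{82}T_{84}G_{78}A_{65}C_{54}A_{45}$
  CAAT (CAAT or -75 box): GGYCAATCT
  TATA (TATA or Goldberg-Hogness box): TATAWAW
  CATG (Transcription startpoint)
  However, in Aful: ATG 76%, GTG 22%, TTG 2%

An Observation
  u d c s b t: charge, mass, flavor, charm, ...
  p n e: charge, mass, spin, magnetic moment, ...
  H C N O P: atomic number, ionic radius, valence, affinity, ...
  H2O NO CO2: molecular weight, polarity, ...
  a c g t
  A D E F G H W Y V
  BRCA1 PDGFA

PROGRAMME:
  Coarse-Grained Description of Nature
  Use of Symbols and Symbolic Strings
  Language
  Grammar and Complexity (Chomsky, Lindenmayer, etc.)
So far this programme has been best realized in the study of dynamics, by using Symbolic Dynamics.
There have been preliminary attempts at analyzing biological sequences.

"It may not be a coincidence that the two systems in the universe that most impress us with their open-ended complex design, life and mind, are based on discrete combinatorial systems. Many biologists believe that if inheritance were not discrete, evolution as we know it could not have taken place."
S. Pinker, The Language Instinct (1995)

Simple Examples
At the level of words:
  DOG  GOD
At sentence level:
  Dog bites Man
  Man bites Dog

[Figure: domain combinations of several serine proteases, each drawn from the N-terminus to the C-terminus: EGF (epidermal growth factor), chymotrypsin, urokinase (UK), Factor IX (coagulation factor IX, Christmas antihemophilic factor), plasminogen. Legend as extracted: Ca-binding protein; contains 3 -S-S-; GC. Source: B. Alberts et al., Molecular Biology of the Cell, 3rd ed., 1994, p. 123.]

Grammatical Complexity
Alphabet $\Sigma$:
  Example 1. $\Sigma = \{a, c, g, t\}$
  Example 2. $\Sigma = \{A, C, D, \ldots, W, Y\}$
  Example 3. $\Sigma = \{a, \ldots, z, A, \ldots, Z, +, -, \ldots\}$
$\Sigma^*$: the set of all strings formed from the letters of the alphabet (including the empty string)
Any subset $L \subseteq \Sigma^*$ is a language over $\Sigma$
Grammar = {alphabet, start letter, production rules}
The language based on that grammar

Classification of Formal Languages
  Chomsky Hierarchy: sequential production rules
  Lindenmayer Systems: parallel production rules

Generative Grammar
  S: Sentence
  NP: Noun Phrase
  VP: Verb Phrase
  Adj: Adjective
  Art: Article
  S → if S then S
  S → either S or S
Non-Terminal and Terminal Symbols
  N → boy | girl | scientist | ...
  V → sees | believes | loves | eats | ...
  Adj → young | good | beautiful | ...
  Art → a | one | the
  S → NP VP
  VP → V NP
  NP → (Art) Adj* N

Chomsky Hierarchy of Grammars
Grammar $G = (N, T, P, S)$
  $N$: set of non-terminal letters (working symbols)
  $T$: set of terminal letters
  $S \in N$: start letter
  $P$: set of production rules $x \to y$, where $x$ and $y$ are strings
Different restrictions on $x$ and $y$ lead to different grammars:
  Type 0 grammar: $x \in (N \cup T)^* N (N \cup T)^*$, $y \in (N \cup T)^*$; $x$ contains at least one non-terminal letter
  Type 1 grammar (context-sensitive grammar): $x = t_1 a t_2$, with $t_1, t_2 \in T^*$ and $a \in N$
  Type 2 grammar (context-free grammar): $x = a \in N$
  Type 3 grammar (regular grammar): $x = a$, $y = b$ or $bc$, with $a, c \in N$ and $b$ either empty or $b \in T$

A, B, ...: Non-terminals (NT); Terminals (T)
Regular Grammar: one symbol on the LHS; at most one NT, at the right end of the RHS. [Production rules not recoverable from the extracted text.]
Context-Free Grammar: A ...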
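The restrictions on $x$ and $y$ listed above can be checked mechanically. The following is a minimal Python sketch, not the author's method: the names `rule_type` and `grammar_type` are hypothetical, symbols are represented as strings inside tuples, and the Type 1 condition on $y$ (that $y$ keeps the context $t_1$, $t_2$ and replaces $a$ by a non-empty string) is the usual textbook condition, assumed here because the slide text as extracted states no condition on $y$. The toy rules expand `NP → (Art) Adj* N` into two plain rules purely for illustration.

```python
def rule_type(x, y, nonterminals):
    """Return the most restrictive Chomsky type (3, 2, 1 or 0) of one rule x -> y.

    x and y are sequences of symbols; `nonterminals` is the set N.
    Every symbol not in N is treated as a terminal.
    """
    x, y = tuple(x), tuple(y)
    nt_positions = [i for i, s in enumerate(x) if s in nonterminals]
    if not nt_positions:
        raise ValueError("left-hand side must contain at least one non-terminal")

    if len(x) == 1:
        # x = a in N: the rule is at least context-free (Type 2).
        # Type 3 additionally needs y = b or bc, with b empty or a terminal
        # and c a non-terminal at the right end.
        body = y[:-1] if y and y[-1] in nonterminals else y
        if len(body) <= 1 and all(s not in nonterminals for s in body):
            return 3
        return 2

    if len(nt_positions) == 1:
        # x = t1 a t2 with terminal context t1, t2 (the slide's Type 1 form).
        i = nt_positions[0]
        t1, t2 = x[:i], x[i + 1:]
        # Assumption (textbook condition, not in the extracted slide text):
        # y = t1 w t2 with w non-empty, so the context is preserved.
        keeps_context = y[:len(t1)] == t1 and y[len(y) - len(t2):] == t2
        if len(y) >= len(x) and keeps_context:
            return 1

    return 0


def grammar_type(rules, nonterminals):
    """A grammar is only as restricted as its least restricted rule."""
    return min(rule_type(x, y, nonterminals) for x, y in rules)


# Toy fragment of the slide's generative grammar (illustrative expansion).
TOY_NONTERMINALS = {"S", "NP", "VP", "N", "V", "Adj", "Art"}
TOY_RULES = [
    (("S",), ("NP", "VP")),
    (("VP",), ("V", "NP")),
    (("NP",), ("Art", "Adj", "N")),
    (("NP",), ("N",)),
    (("N",), ("boy",)),
    (("V",), ("sees",)),
    (("Adj",), ("young",)),
    (("Art",), ("the",)),
]

if __name__ == "__main__":
    print("toy grammar type:", grammar_type(TOY_RULES, TOY_NONTERMINALS))
```

Under this reading the toy grammar is context-free but not regular, because `S → NP VP` places a non-terminal before the right end of the right-hand side. Taking the minimum over the rules is a simplification: it ignores edge cases such as empty productions, which textbooks treat separately when relating Type 2 to Type 1.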
