Research on Ontology-Based Web Information Extraction and Integration


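Shanghai Jiao Tong University doctoral dissertation
Research on Ontology-Based Web Information Extraction and Integration
Name: Song Hui (宋晖)
Degree applied for: Doctor
Major: Computer Architecture
Advisor: Ma Fanyuan (马范援)
2004-06-01

Dissertation submitted for the doctoral degree of Shanghai Jiao Tong University

Research on Ontology-Based Web Information Extraction and Integration

Abstract

The Internet is one of the most significant technological developments of the late twentieth century. It gives people in every corner of the world a platform for connecting with one another, sharing knowledge, and exchanging information. People's reliance on the Web has made the information on it grow explosively, and finding information on the WWW is no longer easy. Because of the autonomy of the Web and the Internet, information published on the Web is freely maintained and managed by individuals or organizations and lacks a uniform representation. To a computer program, information on the Web is unintelligible, so automatic information search and exchange are difficult.

How to use Web resources effectively has long been an important research topic. Search engines let people retrieve Web pages by keyword, but because they lack any understanding of the information, they return large numbers of irrelevant results, and those results cannot be used directly by other applications. Many studies have turned to Web mining in search of effective ways to use Web information. Web information extraction and integration attempts to extract valuable data automatically from semi-structured Web pages, and then to integrate the data scattered across different pages, according to a knowledge model, into a uniform knowledge representation. Web information extraction and integration can give users knowledge-based Web retrieval, and can also serve information and knowledge systems by supporting automatic information exchange among them.

Studies show that pages on the Web currently exist mainly as dynamic pages, which account for more than 90% of the total. A dynamic page is generated on the fly by a program when a user requests it: it uses a fixed display template into which data from a back-end database is embedded. This data has been organized and curated by professionals and is therefore of higher value. This dissertation therefore focuses on extraction and integration techniques for dynamic pages. The main challenges are that the representation of a Web page does not specifically mark the valuable information or its semantics, and that the data extracted from pages has no complete structure, which makes the problem more complex than traditional heterogeneous information integration.

The Hidden Web is the Web data that search engines cannot index directly, and it is also an important source of dynamic Web pages. Hidden Web information can usually be obtained only by manually filling in Web forms, that is, Hidden Web portals. If Hidden Web portals could be collected automatically and their forms filled in by a program to obtain the back-end data, Web information integration would be greatly facilitated. However, the parameters of each site's Hidden Web portal are not given explicitly; they are merely embedded in Web pages. The portals offered by different sites also differ widely and use inconsistent parameters, which makes automatic retrieval by programs difficult.

Building on previous research, this dissertation focuses on information extraction techniques and algorithms for dynamic Web pages (including the Hidden Web) and on problems such as Schema matching in Web information integration. Based on the resulting algorithms, an intelligent information agent platform was implemented and successfully applied to the Natural Science Foundation project "Ontology-based Web music knowledge retrieval system". The main research and results are as follows:

1. This dissertation proposes an information extraction and annotation algorithm for dynamically generated Web pages, based on a tree-structure representation of Web pages. The algorithm arbitrarily selects two or more pages from a page set as samples and, without manual labeling, automatically derives the page template (Wrapper) and the data schema. The concepts newly introduced in the algorithm, such as the minimal extraction tree and the pure-text template unit, improve the accuracy of Web page template recognition and reduce the cost of the Wrapper generation algorithm. Semantic annotation of page data directly uses intermediate results of the Wrapper generation process. Experiments on a large number of pages downloaded from real Web sites show that the algorithm performs well on extraction and annotation for two different types of dynamic pages.

2. The Schema of data obtained from the Web lacks the complete definition of a traditional relational database Schema. This dissertation proposes a clustering-based Schema matching algorithm for Web information. The algorithm combines instance matching and Schema name matching, gives a method for computing the distance between objects in the clustering algorithm, and avoids both the 1-1 matching restriction of typical Schema matching algorithms and their requirements on the Schema definition. Experimental data show that the algorithm is effective.

3. This dissertation proposes a new approach to automatically collecting, indexing, and querying Hidden Web portal information, and gives its key algorithms. The approach automatically extracts Hidden Web access portals from Web pages, uses Ontology techniques to select the portals that correspond to an application domain, and converts them into uniformly defined Ontology concepts. Because the query parameters of Hidden Web portals are represented with uniform concepts, this provides a basis for machines to query back-end information automatically.

4. Using the Web information extraction algorithm, the Schema matching algorithm, and the Hidden Web indexing techniques developed in this research, this dissertation designs and implements an intelligent information agent platform. The platform collects information from the Web for information systems and integrates knowledge according to a domain model. The agent has been successfully applied in the Ontology-based Web music knowledge retrieval system, a subsystem of the Chinese Folk Music Digital Library (中国民族音乐数字图书馆), which is a major international cooperation project of the Natural Science Foundation; there it collects and integrates music knowledge from the Web. By replacing the domain model definition, the agent can be applied conveniently and effectively to different information systems.

Keywords: Web information extraction, dynamic Web pages, Schema matching, Ontology, Hidden Web portal indexing, information agent

RESEARCH ON ONTOLOGY-BASED WEB INFORMATION EXTRACTION AND INTEGRATION

ABSTRACT

The WWW is one of the most significant technologies developed at the end of the 20th century. It provides a platform for sharing and exchanging information all over the world. The knowledge on the Web is growing explosively, so it is not easy to search and reuse. Because of the autonomous nature of the Web and the Internet, information on the Web is maintained and managed by individuals and groups, with no uniform representation of knowledge. People can explore Web pages for detailed information, but that information can hardly be captured and exchanged by applications.

How to make efficient use of Web resources is an interesting research area. Search engines were the first achievement, providing a way to retrieve Web pages by keyword. Lacking semantic information about the text on the Web, search engines return too many irrelevant pages. Moreover, the valuable information embedded in Web pages cannot be reused by applications. Much research has turned to Web mining for Web resource utilization, and Web information extraction and integration is one area of it. It tries to extract useful information from semi-structured Web pages and then integrate it into a uniform knowledge representation according to a knowledge model. Web extraction and integration can provide knowledge-based retrieval for users and also an automatic information exchange approach for knowledge services.

Statistics indicate that about 90% of Web pages are dynamically generated when they are requested. They are produced from a fixed template with embedded data selected from back-end databases. The data on these pages is more valuable because it has been cleaned up by Web site designers. This dissertation therefore focuses on automatic methods for dynamic Web information extraction and integration. The challenges arise in the following ways: no explicit information identifies the valuable text on a Web page, and there is no semantic information about that text; and, without a complete structure definition, integrating the data extracted from Web pages is more complex than the traditional heterogeneous information integration problem.

The Hidden Web is the part of the Web that cannot be directly indexed by search engines. It is also a source of dynamic Web pages. Hidden Web content can only be retrieved by manually filling in the forms (Hidden Web portals) on Web pages. Web information integration would be facilitated if programs could access Hidden Web information automatically. However, the Hidden Web portals of Web sites are embedded in Web pages and differ from one another, so they are difficult for programs to retrieve. This dissertation mainly studies the dynamic information extraction algorithm, the schema-matching problem in Web information integration, and Hidden Web information extract…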
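
The first contribution listed in the abstract derives a page template (Wrapper) and a data schema from two or more sample pages without manual labeling. The Python sketch below illustrates only that general idea and is not the dissertation's algorithm: it assumes a flat token alignment using the standard-library `difflib.SequenceMatcher` in place of the tree-based method, and it ignores the minimal extraction tree, the pure-text template unit, and semantic annotation. The names `SLOT`, `tokenize`, and `infer_template`, and the two sample pages, are made up for illustration.

```python
import re
from difflib import SequenceMatcher

SLOT = "#DATA#"  # hypothetical placeholder marking a data slot in the template


def tokenize(html):
    """Split a page into tag tokens and whitespace-separated text tokens."""
    return re.findall(r"<[^>]+>|[^<\s]+", html)


def infer_template(page_a, page_b):
    """Align two sample pages; shared tokens form the template, the rest is data.

    Returns (template, data_a, data_b), where each data list holds one string
    per slot, in page order.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    template, data_a, data_b = [], [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # Tokens present in both pages are assumed to come from the template.
            template.extend(a[i1:i2])
        else:
            # Any differing run is treated as one data slot filled from the database.
            template.append(SLOT)
            data_a.append(" ".join(a[i1:i2]))
            data_b.append(" ".join(b[j1:j2]))
    return template, data_a, data_b


if __name__ == "__main__":
    # Two made-up pages that share one display template.
    page_1 = "<html><body><h1>Song One</h1><p>Artist: Alice</p></body></html>"
    page_2 = "<html><body><h1>Tune Two</h1><p>Artist: Bob</p></body></html>"
    template, data_1, data_2 = infer_template(page_1, page_2)
    print(" ".join(template))
    print(data_1)
    print(data_2)
```

On these two pages the label `Artist:` stays in the template because both pages contain it, while the title and the artist name fall into data slots. A flat alignment like this breaks down on repeated or optional page regions, which is one reason a tree-structured representation is a more natural fit for real dynamic pages.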
