Master's Thesis: Research on a Text-Based Web Image Search Engine



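Master's degree thesis
Title: Research on a Text-Based Web Image Search Engine

Abstract (Chinese original, translated)

This thesis addresses the application setting of Web image search engines. With the goal of building a large-scale Web image search engine, it proposes a system design based on text retrieval. It introduces and studies a series of techniques related to Web image search engines, including web crawling, relevance ranking (VSM and LSI), information extraction and indexing, all of which are applied in the proposed design. The focus is on how to extract image-related information from HTML documents so that image retrieval is both efficient and accurate. Based on experiments and analysis on real data, several key techniques are proposed for the system design, summarized as follows:

1) The structural characteristics of the following parts of an HTML file are analyzed in detail: the [...] tag, the [...] tag, the page title, the page's hyperlink (anchor) text, the image URL, the [...] tag, the associated [...] and [...] structures, the [...] structure, and the text surrounding the image. Validated by experiments on real data, nine extraction patterns are summarized for extracting image-related information from these structures, so that the extracted information is highly relevant to the image. Three concrete extraction methods are studied: the DOM-based method, the string-based method and the Wrapper-based method.

2) A method for filtering useless images is proposed, which raises the proportion of usable images in the system. An image is discarded as useless if its file size is below a threshold, its width or height is below a threshold, its aspect ratio exceeds a threshold, or it is referenced through the [...] tag more than a threshold number of times within the same page.

3) Statistical analysis reveals some latent regularities in HTML files, such as the difference between JPG and GIF, the different meanings of the [...] and [...] tags, and the different meanings of image reference counts. The conclusions are: JPG images are more important than GIF images; images that originate from the [...] tag are more important than images from the [...] tag; images with higher reference counts are more important, but images with high reference counts must first pass the filter before their importance can be relied on.

4) A preliminary exploration is made of applying the LSI algorithm to an image search engine in order to integrate textual and content information, and its effect is checked with a simple experiment.

5) A text-based Web image search engine is designed and implemented. The overall architecture of the system is given, and six workflows are described in detail: fetching web pages, extracting information, crawling images and checking dead links, generating thumbnails, building the index, and serving queries. Finally, the effectiveness and performance of the system are briefly evaluated.

Keywords: Web image search engine, image retrieval, text-based, content-based, information extraction

Abstract (author's English version)

In this thesis, we present a design for a large-scale Web image search engine that relies mainly on text-based techniques. We introduce and study a series of techniques related to Web image search engines, such as crawling, relevance ranking (VSM and LSI), information extraction and indexing. These techniques are used in our system design.

We concentrate on how to extract information relevant to images from HTML documents more effectively and precisely. Based on experiments and analysis on real data, we propose the following key techniques for designing the system:

1) We carefully analyze the structure of HTML components, including the [...] tag, the [...] tag, the title of the web page, the anchor text of the web page, the URL of the image, the [...] tag, the [...] tag, the text surrounding the [...] tag, etc., and summarize nine extraction patterns for fetching information relevant to images. We also study three extraction methods: the DOM-based method, the string-based method and the Wrapper-based method.

2) We propose methods to filter useless images according to file size, image width and height, and the number of times an image is referenced by [...] tags.

3) From statistics over a large number of HTML documents, we derive some latent rules, such as the difference between JPG and GIF, the difference between the [...] tag and the [...] tag, and the difference between images with different reference counts.

4) We briefly study how LSI can be applied to integrate high-level and low-level information about images.

5) We design and implement a text-based Web image search engine. The overall structure of the system and the relations among its components are introduced, and some components are described in detail in terms of function and implementation. Finally, a simple evaluation of search effectiveness and performance is given.

Keywords: Web image search engine, text-based, content-based, information extraction

Contents

Chapter 1 Introduction
1.1 Background
1.2 Overview of image retrieval systems
1.2.1 Application areas
1.2.2 User query modes
1.2.3 System evaluation
1.3 State of research
1.4 Survey of existing image retrieval systems
1.5 Main work of this thesis
Chapter 2 Related technologies
2.1 Web crawling
2.1.1 Basic principles
2.1.2 Problems of large-scale spiders
2.2 Relevance ranking
2.2.1 VSM
2.2.2 An improvement on VSM: LSI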
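
The useless-image filter described in item 2 of the abstract can be illustrated with a minimal sketch. The abstract states only that thresholds are applied to file size, width and height, aspect ratio, and the per-page reference count; it gives no values. The threshold constants, the `ImageInfo` record and the function names below are therefore hypothetical, chosen only for illustration, and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class ImageInfo:
    """Hypothetical record for one image found on a web page."""
    url: str
    file_size: int     # size of the image file in bytes
    width: int         # width in pixels
    height: int        # height in pixels
    refs_in_page: int  # how many times the same page references this image


# Assumed threshold values; the abstract does not state the real ones.
MIN_FILE_SIZE = 2048     # bytes
MIN_SIDE = 60            # pixels
MAX_ASPECT_RATIO = 4.0   # longer side divided by shorter side
MAX_REFS_IN_PAGE = 3     # references within a single page


def is_useless(img: ImageInfo) -> bool:
    """Return True if the image matches any of the four discard rules."""
    # Guard against missing or invalid dimensions before dividing.
    if img.width <= 0 or img.height <= 0:
        return True
    # Rule 1: file too small (typically icons, bullets, spacers).
    if img.file_size < MIN_FILE_SIZE:
        return True
    # Rule 2: width or height too small.
    if img.width < MIN_SIDE or img.height < MIN_SIDE:
        return True
    # Rule 3: extreme aspect ratio (typically banners and separator bars).
    ratio = max(img.width, img.height) / min(img.width, img.height)
    if ratio > MAX_ASPECT_RATIO:
        return True
    # Rule 4: referenced too often within the same page (decorative images).
    return img.refs_in_page > MAX_REFS_IN_PAGE


def filter_images(images):
    """Keep only the images that pass every rule."""
    return [img for img in images if not is_useless(img)]
```

In practice such thresholds would be tuned on real crawl data, as the abstract indicates the author did; the four rules are kept as separate checks so that each can be adjusted or disabled independently.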
