A Method to Query Document Database by Content and Structure

WANG Xiao-Ling1+, WEN Ji-Rong2, LUAN Jin-Feng1, MA Wei-Ying2, DONG Yi-Sheng1

1(Department of Computer Science and Engineering, Southeast University, Nanjing 210096, China)
2(Microsoft Research Asia, Beijing 100080, China)

+ Corresponding author: Phn: 86-25-6896725, E-mail: http:/

Received 2002-04-04; Accepted 2002-10-17

Wang XL, Wen JR, Luan JF, Ma WY, Dong YS. A method to query document database by content and structure. Journal of Software, 2003,14(5):976~983. http://www.jos.org.cn/1000-9825/14/976.htm

Abstract: Structured documents are made up of a few logical components, such as titles, sections, subsections, and paragraphs. The components in each structured document can be represented by an ordered tree model, which can also be viewed as a hierarchical concept relationship. To meet users' requirements for more precise and concentrated search results, retrieval techniques should allow the user to retrieve document components with varying granularity. This paper presents a method to query document database by content and structure. The key idea is to construct a more comprehensive similarity function by taking advantage of the inherent hierarchical structure in documents. This work combines information retrieval techniques, semi-structured data query, and proximate search for documents. The proposed method is evaluated on the Encarta encyclopedia
document set, and the experimental results show that it can provide more accurate and focused answers than traditional document retrieval methods.

Key words: document database; information retrieval; passage retrieval; structured document

Abstract (translated from the Chinese): Documents have a certain logical structure; concepts such as titles, sections, and paragraphs form a document's inherent logic. Different users have different needs when searching documents, and how a retrieval system can provide meaningful information has always been a central research task. Combining the structure and content of documents, this paper proposes a new method of computing similarity for the retrieval of structured documents. The method supports retrieval of document content at multiple granularities, from words and phrases to paragraphs or sections. A question-answering system was implemented on the basis of this method, with Microsoft's encyclopedia as the test set. Comparison with traditional methods shows that retrieving article fragments with this method is more reasonable and more effective.

* This work was performed while the first author was a visiting student at Microsoft Research Asia. WANG Xiao-Ling was born in 1975. She is a Ph.D. candidate at the Department of Computer Science, Southeast University. Her current research interests include database theory and XML. WEN Ji-Rong is a researcher in Microsoft Research, China. His research interests are database theory and information retrieval. LUAN Jin-Feng was born in 1974. His research interests are artificial intelligence and communication. MA Wei-Ying is a researcher in Microsoft Research, China. His research interests are data mining and multimedia management theory. DONG Yi-Sheng was born in 1940. His current research interests include database theory and information processing.

Document databases have received more and more attention because of their multiple applications in areas such as digital libraries, dictionaries, and encyclopedias. With the wide use of XML[1], which is a standard format for WWW data exchange and transformation, the whole web can also be viewed as a large document database. Traditional document retrieval techniques normally concentrate on the content part, and various word-matching[2] approaches are used to obtain relevant documents according to user queries. How to employ structure information to enhance document retrieval is a new challenge for researchers.

On the other side, traditional information retrieval techniques treat each document as an atomic unit and return whole documents to the user. But, in many cases, only a part of the document is relevant to the user's information need. The user has to scan each (usually very long) document
to look for relevant answers. Passage retrieval is one of the techniques aiming to retrieve and return more compact and shorter answers to the user. Most previous work suggests using roughly fixed-length passages, which may decrease retrieval performance due to [...] In recent years many models have been [...] How to retrieve the volume of structured documents more efficiently and return a more compact and precise answer to users gains more and more attention.

In this paper, we propose a novel method to retrieve components of structured documents more accurately, which can be viewed as a question-answer system. Compared with document retrieval, the main task of a question-answer system is to provide a short and direct answer to a user query. Our work focuses on helping users to locate the most matching answer from the underlying structured document collection. The experimental results show that our method can produce more accurate results and shorter answers than traditional document retrieval and, at the same time, provide much more related context information about fuzzy questions so that users can understand the answer better.

This paper is organized as follows: Section 1 is about related work on structured document retrieval. Section 2 defines some data structure



