Assignment 3: Maximum Matching Word Segmentation


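Chinese word segmentation: at first I had no idea how to approach this. I only got it working after reading some references and asking classmates.

My approach: the method used here is maximum matching. Prepare a word list (cizu.txt) and the text to be segmented (input.txt). Scan each sentence from left to right and compare candidate words from the sentence against the words in cizu.txt, longest candidate first; the first candidate that matches is output as a word. Each output word is therefore the longest possible one relative to the given, fixed word list. If none of the multi-character candidates matches any word in the list, the single character is output as the segmentation result.

Place the file to be segmented at d:\output\input.txt and run the program. Afterwards the folder d:\output contains a new file, output.txt, which is the segmented version of input.txt.

Algorithm: the text is first split into sentences at punctuation marks and other natural delimiters. Each sentence is then segmented by forward maximum matching with a maximum word length of MAX_CWORD_LEN=18 bytes (9 Chinese characters). The key step in sentence splitting is deciding where the string breaks into an independent sentence; text that mixes English and Chinese is taken into account.

Data structures: the dictionary is stored in a two-dimensional array of pointers. Each word is located by its first byte and its last byte, and words that share both bytes are chained together with pointers. This makes dictionary lookup much faster: when matching a candidate, the first and last bytes lead directly to the right slot, or the word is found after following a few pointers. The input text and the output text are each held in a one-dimensional array, which needs more memory but avoids repeated file I/O.

The main files are the test file input.txt, the output file output.txt, and a dictionary, ziguang.txt, which lists the words that may follow each character, for example 人名, 人民 and 人生.

Part of the source code follows:

// Words that are not indexed
char *arrayEnglishStop[] = {
    "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
    "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
    "1", "2", "3", "4", "5", "6", "7", "8", "9", "0",
    "about", "above", "after", "again", "all", "also", "am", "an", "and",
    "any", "are", "as", "at", "back", "be", "been", "before", "behind",
    "being", "below", "but", "by", "can", "click", "do", "does", "done",
    "each", "else", "etc", "ever", "every", "few", "for", "from",
    "generally", "get", "go", "gone", "has", "have", "hello", "here",
    "how", "if", "in", "into", "is", "just", "keep", "later", "let",
    "like", "lot", "lots", "made", "make", "makes", "many", "may", "me",
    "more", "most", "much", "must", "my", "need", "no", "not", "now",
    "of", "often", "on", "only", "or", "other", "others", "our", "out",
    "over", "please", "put", "so", "some", "such", "than", "that", "the",
    "their", "them", "then", "there", "these", "they", "this", "try",
    "to", "up", "us", "very", "want", "was", "we", "well", "what", "when",
    "where", "which", "why", "will", "with", "within", "you", "your",
    "yourself"
};

// Characters and words that need no index entry when the dictionary is indexed
char *arrayChineseStop[] = {
    "的", "吗", "么", "啊", "说", "对", "在", "和", "是", "被", "最", "所", "那",
    "这", "有", "将", "会", "与", "於", "于", "他", "她", "它", "您", "为", "欢迎"
};

// Punctuation marks, including Chinese punctuation. Note the symbols + and -
// (the source says three symbols but shows only these two): search uses them
// for condition tests such as XOR.
char arrayAsciiSymbol[] = {
    '!', '*', '(', ')', '-', '_', '+', '=', ':', ';', '.', '?', '/', '|',
    '#', '$', '%', '&'
};

// Chinese dictionary
typedef struct _WORD_NODE {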

    char strWord[MAX_CWORD_LEN + 1];
    // todo: separate arrays for two-, three-, four- and five-character words would make lookup faster
    struct _WORD_NODE *nextWord;
} WORD_NODE;  // node type

struct _CH_DICT {
    WORD_NODE *lstWord;
} CH_DICT[MAX_CDIM][MAX_CDIM];  // two-dimensional array of structs: the dictionary

char SEG_LIST[MAX_WORD];  // holds the segmented sentence text
int WNUM_IN = 0;
int WNUM_OUT = 0;

char *strTrim(char str[])
{
    int firstchar = 0;
    int endpos = 0;
    int i;
    int firstpos = 0;
    for (i = 0; str[i] != 0; i++) {
        if (str[i] == ' ' || str[i] == '\r' || str[i] == '\n' || str[i] == '\t') {
            if (firstchar == 0) firstpos++;
        } else {
            endpos = i;
            firstchar = 1;
        }
    }
    for (i = firstpos; i /* ... lost in the source: the rest of strTrim and the start of addDictWord ... */

    /* ... */ ->strWord, strWord);
    newWord->nextWord = NULL;
    firstChar -= 161;
    lastChar -= 161;
    // store words in order of the first and last byte of the word string
    curLst = CH_DICT[firstChar][lastChar].lstWord;
    if (curLst == NULL) {
        // reinit list
        CH_DICT[firstChar][lastChar].lstWord = newWord;
        return 0;
    }
    curTmp = curLst;
    // the slot at curLst already holds a word, so follow the pointers to the end
    while (curTmp->nextWord != NULL) {
        curTmp = curTmp->nextWord;
    }
    curTmp->nextWord = newWord;
    return 0;
}

void addSegWord(unsigned char *strWord, int len)
{
    const char *cstrWord = (const char *)(char *)strWord;
    strcat(SEG_LIST, cstrWord);
    strcat(SEG_LIST, "|");
    WNUM_OUT = WNUM_OUT + len + 1;
}

void freeDict()
{
    int i, j;
    WORD_NODE *curLst, *curTmp, *tmp;
    for (i = 0; i /* ... lost in the source: the loop bounds and the start of the list walk ... */
                /* ... */ ->nextWord;
                free(tmp);
                tmp = (WORD_NODE *)NULL;
            }
            CH_DICT[i][j].lstWord = (WORD_NODE *)NULL;
        }
    }
}

int freeSeg()
{
    FILE *fpOut = NULL;
    fpOut = fopen("d:\\output\\output.txt", "w");
    if (fpOut == (FILE *)NULL) {
        printf("output.txt cannot be created\n");
        return -1;
    }
    fwrite(SEG_LIST, WNUM_OUT, 1, fpOut);
    memset(SEG_LIST, 0x00, sizeof(SEG_LIST));
    fclose(fpOut);
    return 0;
}

BOOL searchWord(unsigned char *strWord, int len)
{
    WORD_NODE *curLst, *curTmp;
    unsigned char firstChar, lastChar;
    firstChar = strWord[0];
    lastChar = strWord[len - 1];
    firstChar -= 161;
    lastChar -= 161;
    curLst = CH_DICT[firstChar][lastChar].lstWord;
    curTmp = curLst;
    while (curTmp != NULL) {
        if (strcmp((char *)strWord, curTmp->strWord) == 0) {
            return TRUE;
        }
        curTmp = curTmp->nextWord;
    }
    return FALSE;
}

int segWord(unsigned char *strText, int iWordLen, BOOL bChinese)
{
    int i = 0, j = 0, k = 0, l = 0, temp = 0;
    unsigned char strChar[MAX_CWORD_LEN + 1], strChar1[5], strChar2[5], strChar3[7];
    BOOL bFound = FALSE;
    i = iWordLen;
    if (FALSE == bChinese) {  // English
        // check whether the word is in the stop array
        addSegWord(strText, iWordLen);
        return 0;
    }

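    while (i > 1) {  // i is the number of bytes of the sentence not yet matched
        for (j = MAX_CWORD_LEN; j >= 2; j -= 2) {  // maximum word length, in Chinese characters
            if (i /* ... lost in the source: the matching loop body ... */ 0
    return 0;
}

void initSegList()
{
    int i;
    for (i = 0; i /* ... lost in the source: the rest of initSegList and the start of the dictionary loader ... */

        /* ... */ > MAX_CWORD_LEN)
            printf("%s error!\n", sLine);
        addDictWord(sLine, len);
    }
    fclose(fpDict);
    return 0;
}

BOOL isEnglishStop(unsigned char *strWord)
{
    // arrayEnglishStop
    return FALSE;
}

inline BOOL isAsciiSymbol(char cChar)
{
    int i = 0;
    for (i = 0; i /* ... lost in the source ... */ cChar)  // English character
    /* consecutive spaces do not count as a separator ... (the listing breaks off here) */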
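Because parts of the listing are missing, the dictionary layout described under "Data structures" may be easier to follow as a small standalone sketch. The version below is not the author's code: the names WordNode, dict_add, dict_contains and dict_free are hypothetical, and MAX_CDIM = 95 is an assumption (bytes 0xA1 through 0xFF mapped to 0 through 94 by subtracting 161, as the listing does). It assumes a two-byte encoding such as GB2312, where both bytes of a character are at least 0xA1.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CWORD_LEN 18  /* bytes: 9 two-byte characters, as in the text */
#define MAX_CDIM 95       /* assumption: bytes 0xA1..0xFF map to 0..94 */
#define BYTE_BASE 161     /* 0xA1, the offset the listing subtracts */

typedef struct WordNode {
    char word[MAX_CWORD_LEN + 1];
    struct WordNode *next;
} WordNode;

/* One bucket per (first byte, last byte) pair; static storage starts out all NULL. */
static WordNode *dict[MAX_CDIM][MAX_CDIM];

/* Map a byte to a bucket index, or -1 if it is below 0xA1. */
static int bucket_index(unsigned char b)
{
    return b >= BYTE_BASE ? b - BYTE_BASE : -1;
}

/* Append a word to the end of its bucket's chain. Returns 0 on success, -1 on failure. */
int dict_add(const char *word)
{
    assert(word != NULL);
    size_t len = strlen(word);
    if (len < 2 || len > MAX_CWORD_LEN) return -1;
    int r = bucket_index((unsigned char)word[0]);
    int c = bucket_index((unsigned char)word[len - 1]);
    if (r < 0 || c < 0) return -1;

    WordNode *node = malloc(sizeof *node);
    if (node == NULL) return -1;
    strcpy(node->word, word);
    node->next = NULL;

    WordNode **tail = &dict[r][c];
    while (*tail != NULL) tail = &(*tail)->next;
    *tail = node;
    return 0;
}

/* Return 1 if the first len bytes of text are a dictionary word, else 0. */
int dict_contains(const unsigned char *text, size_t len)
{
    if (len < 2 || len > MAX_CWORD_LEN) return 0;
    int r = bucket_index(text[0]);
    int c = bucket_index(text[len - 1]);
    if (r < 0 || c < 0) return 0;
    for (const WordNode *n = dict[r][c]; n != NULL; n = n->next) {
        if (strlen(n->word) == len && memcmp(n->word, text, len) == 0) return 1;
    }
    return 0;
}

/* Release every chain and reset all buckets to empty. */
void dict_free(void)
{
    for (int r = 0; r < MAX_CDIM; r++) {
        for (int c = 0; c < MAX_CDIM; c++) {
            WordNode *n = dict[r][c];
            while (n != NULL) {
                WordNode *next = n->next;
                free(n);
                n = next;
            }
            dict[r][c] = NULL;
        }
    }
}
```

Unlike searchWord in the listing, dict_contains compares an explicit byte length, so a candidate does not have to be copied into a NUL-terminated buffer before lookup. It also rejects bytes below 0xA1 instead of letting the subtraction wrap around.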
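The matching loop of segWord is largely missing from the listing, so here is a minimal sketch of forward maximum matching as the text describes it: take the longest candidate that fits, shrink it by one two-byte character at a time, and fall back to a single character when nothing matches. The names fmm_segment and in_word_list are hypothetical, and the linear word-list scan is only a stand-in for the real dictionary lookup. As in the listing, each output word is followed by a '|'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CWORD_LEN 18  /* longest candidate in bytes (9 two-byte characters) */

/* Hypothetical stand-in for the dictionary: linear scan over a small word list. */
static int in_word_list(const char *const *words, size_t nwords,
                        const char *text, size_t len)
{
    for (size_t i = 0; i < nwords; i++) {
        if (strlen(words[i]) == len && memcmp(words[i], text, len) == 0) return 1;
    }
    return 0;
}

/*
 * Forward maximum matching over a sentence made only of two-byte characters.
 * Writes each word followed by '|' into out and returns the number of words,
 * or -1 if the byte count is odd or out (capacity out_cap) is too small.
 */
int fmm_segment(const char *sentence, const char *const *words, size_t nwords,
                char *out, size_t out_cap)
{
    assert(sentence != NULL && out != NULL);
    size_t remaining = strlen(sentence);
    size_t pos = 0, used = 0;
    int count = 0;

    if (remaining % 2 != 0 || out_cap == 0) return -1;
    out[0] = '\0';

    while (remaining > 0) {
        /* Start with the longest candidate that fits in what is left. */
        size_t len = remaining < MAX_CWORD_LEN ? remaining : MAX_CWORD_LEN;
        /* Shrink by one two-byte character until a word matches or one character is left. */
        while (len > 2 && !in_word_list(words, nwords, sentence + pos, len)) {
            len -= 2;
        }
        /* Room is needed for the word, the '|' separator and the terminating NUL. */
        if (used + len + 2 > out_cap) return -1;
        memcpy(out + used, sentence + pos, len);
        used += len;
        out[used++] = '|';
        out[used] = '\0';
        pos += len;
        remaining -= len;
        count++;
    }
    return count;
}
```

With the GB2312 bytes of 中国 ("\xD6\xD0\xB9\xFA") and 人民 ("\xC8\xCB\xC3\xF1") in the word list, the sentence 中国人民 comes out as 中国|人民|. If 中国人 is also in the list, the greedy match takes it first and the result is 中国人|民|, which shows how the output depends on the word list.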
