An example of extracting the main content of a web page with PHP


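The difficulty lies in identifying and keeping the article portion of a page while removing all the other useless information, and in doing so generically. The rules cannot be tailored to each target site the way 火车头 collection rules are, because search engine results contain every kind of page.

Once a page has been fetched, how do you match the main content? On the way home from work, Zheng Xiao came up with two ideas:

1. Extract the body tag, remove all links, remove all scripts and comments, remove all empty tags (including tags whose content contains no Chinese), and take what remains as the result.
2. Directly match the Chinese text that is not inside a link and that sits inside div, p, or h tags?

Either way, a lot of extraneous information would still remain, such as footer text. How should that be handled? Does anyone have ideas or suggestions?

The class reproduced below, found online, is a PHP implementation of an algorithm for extracting the main content of a web page.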
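Before the class itself, here is a minimal sketch of the first idea above (strip links, scripts, and comments, then drop elements that contain no Chinese). It is only an illustration under assumptions: the function name `extractBodyText` is hypothetical, the input is assumed to be UTF-8, and "contains Chinese" is approximated with the `\p{Han}` Unicode script class. It is not the algorithm used by the class.

```php
<?php
// Hypothetical helper illustrating idea 1; not part of the Readability class.
function extractBodyText(string $html): string
{
    $dom = new DOMDocument();

    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix hints UTF-8 to the parser (assumption: input is UTF-8).
    $dom->loadHTML('<?xml encoding="utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $xpath = new DOMXPath($dom);

    // Remove links, scripts, styles, and comments.
    // iterator_to_array() takes a snapshot so removal does not disturb iteration.
    $junk = iterator_to_array($xpath->query('//a | //script | //style | //comment()'));
    foreach ($junk as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Remove every element whose text contains no Chinese character.
    $elements = iterator_to_array($xpath->query('.//*', $body));
    foreach ($elements as $element) {
        if (!preg_match('/\p{Han}/u', $element->textContent) && $element->parentNode !== null) {
            $element->parentNode->removeChild($element);
        }
    }

    // Collapse runs of whitespace in what is left.
    return preg_replace('/\s+/u', ' ', trim($body->textContent));
}

$html = '<html><body><div><a href="/x">首页</a><script>var a = 1;</script>'
      . '<!-- 注释 --><p>这是正文。</p><p> </p><span>footer 2024</span></div></body></html>';
echo extractBodyText($html), "\n";
```

As the text notes, a filter like this still lets through any footer or sidebar text that happens to contain Chinese, which is why a scoring approach is attractive.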

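Zheng Xiao also tested the class locally and found its accuracy very high. The code is as follows:

    <?php
    class Readability {
        // Name of the attribute used to store each element's score
        const ATTR_CONTENT_SCORE = "contentScore";

        // The DOM parser currently only supports UTF-8
        const DOM_DEFAULT_CHARSET = "utf-8";

        // Message shown when no content can be determined
        const MESSAGE_CAN_NOT_GET = "Readability was unable to parse this page for content.";

        // DOM parser (built into PHP 5)
        protected $DOM = null;

        // Source code to be parsed
        protected $source = "";

        // List of the paragraphs' parent elements
        private $parentNodes = array();

        // Tags to remove
        // Note: added extra tags from [URL missing in source]
        private $junkTags = Array("style", "form", "iframe", "script", "button", "input", "textarea",
                                  "noscript", "select", "option", "object", "applet", "basefont",
                                  "bgsound", "blink", "canvas", "command", "menu", "nav", "datalist",
                                  "embed", "frame", "frameset", "keygen", "label", "marquee", "link");

        // Attributes to remove
        private $junkAttrs = Array("style", "class", "onclick", "onmouseover", "align", "border", "margin");

        /**
         * Constructor
         * @param $input_char Encoding of the input string. Defaults to utf-8 and may be omitted.
         */
        function __construct($source, $input_char = "utf-8") {
            $this->source = $source;

            // The DOM parser can only handle UTF-8 characters.
            // Note: the HTML-ENTITIES target is deprecated as of PHP 8.2.
            $source = mb_convert_encoding($source, 'HTML-ENTITIES', $input_char);

            // Preprocess the HTML: strip redundant tags and so on
            $source = $this->preparSource($source);

            // Create the DOM parser
            $this->DOM = new DOMDocument('1.0', $input_char);
            try {
                //libxml_use_internal_errors(true);
                // There will be some error messages, but that is fine :)
                if (!@$this->DOM->loadHTML('<?xml encoding="'.Readability::DOM_DEFAULT_CHARSET.'">'.$source)) {
                    throw new Exception("Parse HTML Error!");
                }

                foreach ($this->DOM->childNodes as $item) {
                    if ($item->nodeType == XML_PI_NODE) {
                        $this->DOM->removeChild($item); // remove hack
                    }
                }

                // insert proper
                $this->DOM->encoding = Readability::DOM_DEFAULT_CHARSET;
            } catch (Exception $e) {
                // ...
            }
        }

        /**
         * Preprocess the HTML so that the DOM parser can handle it correctly
         *
         * @return String
         */
        private function preparSource($string) {
            // Strip the redundant HTML charset declaration to avoid parse errors
            preg_match("/charset=([\w|\-]+);?/", $string, $match);
            if (isset($match[1])) {
                $string = preg_replace("/charset=([\w|\-]+);?/", "", $string, 1);
            }

            // Replace all doubled-up <BR> tags with <P> tags, and remove fonts.
            $string = preg_replace("/<br\/?>[ \r\n\s]*<br\/?>/i", "</p><p>", $string);
            $string = preg_replace("/<\/?font[^>]*>/i", "", $string);

            // @see [URL missing in source]
            //   - from [URL missing in source]
            $string = preg_replace("#<script(.*?)>(.*?)</script>#is", "", $string);

            return trim($string);
        }

        /**
         * Remove every $TagName tag from the DOM element
         *
         * @return DOMDocument
         */
        private function removeJunkTag($RootNode, $TagName) {
            $Tags = $RootNode->getElementsByTagName($TagName);

            // Note: always index 0, because removing a tag removes it from the results as well.
            while ($Tag = $Tags->item(0)) {
                $parentNode = $Tag->parentNode;
                $parentNode->removeChild($Tag);
            }

            return $RootNode;
        }

        /**
         * Remove an unwanted attribute from every element
         */
        private function removeJunkAttr($RootNode, $Attr) {
            $Tags = $RootNode->getElementsByTagName("*");

            $i = 0;
            while ($Tag = $Tags->item($i++)) {
                $Tag->removeAttribute($Attr);
            }

            return $RootNode;
        }

        /**
         * Find the box holding the page's main content, based on scores
         * The scoring algorithm comes from: [URL missing in source]
         * (Reposted here by Zheng Xiao's blog)
         *
         * @return DOMNode
         */
        private function getTopBox() {
            // Get all paragraphs on the page
            $allParagraphs = $this->DOM->getElementsByTagName("p");

            // Study all the paragraphs and find the chunk that has the best score.
            // A score is determined by things like: Number of <p>'s, commas, special classes, etc.
            $i = 0;
            while ($paragraph = $allParagraphs->item($i++)) {
                $parentNode   = $paragraph->parentNode;
                $contentScore = intval($parentNode->getAttribute(Readability::ATTR_CONTENT_SCORE));

                // Fragment 12 of the source is missing. The lines from here to the end of
                // the class-name pattern are ASSUMED, inferred from the variables used
                // later and from the parallel ID check below.
                $className = $parentNode->getAttribute("class");
                $id        = $parentNode->getAttribute("id");

                // Look for a special class name
                if (preg_match("/(comment|meta|footer|footnote)/i", $className)) {
                    $contentScore -= 50;
                } else if (preg_match(
                    "/(^|\s)(post|hentry|entry-?(content|text|body)?|article-?(content|text|body)?)(\s|$)/i",
                    $className)) {
                    $contentScore += 25;
                }

                // Look for a special ID
                if (preg_match("/(comment|meta|footer|footnote)/i", $id)) {
                    $contentScore -= 50;
                } else if (preg_match(
                    "/^(post|hentry|entry-?(content|text|body)?|article-?(content|text|body)?)$/i",
                    $id)) {
                    $contentScore += 25;
                }

                // Add a point for the paragraph found
                // Add points for any commas within this paragraph
                if (strlen($paragraph->nodeValue) > 10) {
                    $contentScore += strlen($paragraph->nodeValue);
                }

                // Save the parent element's score
                // (the source is cut off after the first argument; $contentScore is inferred)
                $parentNode->setAttribute(Readability::ATTR_CONTENT_SCORE, $contentScore);

                // The source preview ends here; the rest of getTopBox() and of the class is not shown.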
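Since the listing breaks off inside getTopBox(), the following standalone sketch shows one way the per-parent scores could be turned into a result. It is an illustration, not the original code: the names `scoreHint` and `pickTopBox` are hypothetical, one word-boundary pattern is used for both class and id as a simplification, and choosing the parent with the highest score is an assumption about the part of the method that the source does not show.

```php
<?php
// Hypothetical standalone sketch; scoreHint() and pickTopBox() are not from the source.

// Bonus or penalty for a class or id value, using the keywords visible in the listing.
function scoreHint(string $value): int
{
    if (preg_match('/(comment|meta|footer|footnote)/i', $value)) {
        return -50;
    }
    $good = '/(^|\s)(post|hentry|entry-?(content|text|body)?|article-?(content|text|body)?)(\s|$)/i';
    if (preg_match($good, $value)) {
        return 25;
    }
    return 0;
}

// Score the parent of every <p>, then return the best-scoring parent (assumed rule).
function pickTopBox(DOMDocument $dom): ?DOMElement
{
    // Pass 1: accumulate a score on each paragraph's parent, stored as an attribute
    // in the same way as the listing does.
    foreach ($dom->getElementsByTagName('p') as $paragraph) {
        $parent = $paragraph->parentNode;
        if (!$parent instanceof DOMElement) {
            continue;
        }
        $score = intval($parent->getAttribute('contentScore'));
        $score += scoreHint($parent->getAttribute('class'));
        $score += scoreHint($parent->getAttribute('id'));
        if (strlen($paragraph->nodeValue) > 10) {
            $score += strlen($paragraph->nodeValue);
        }
        $parent->setAttribute('contentScore', (string) $score);
    }

    // Pass 2: pick the element with the highest stored score.
    $best = null;
    $bestScore = PHP_INT_MIN;
    foreach ($dom->getElementsByTagName('*') as $element) {
        if (!$element->hasAttribute('contentScore')) {
            continue;
        }
        $score = intval($element->getAttribute('contentScore'));
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $element;
        }
    }
    return $best;
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body>'
    . '<div id="footer"><p>Copyright notice, all rights reserved.</p></div>'
    . '<div id="article"><p>This is the long main paragraph of the article body text.</p></div>'
    . '</body></html>'
);
$top = pickTopBox($dom);
echo $top->getAttribute('id'), "\n";
```

Note that, as in the listing, the class and id adjustment is applied once per paragraph, so a container with many paragraphs and a favourable id pulls ahead quickly, while a footer with a single short line ends up with a negative score.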
