Fetching Network Resources in Python over a Persistent HTTP Connection


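1. Introduction to Python

Python is a very approachable language: for beginners and for everyday tasks it is simple and easy to use. A small web crawler that pulls text and images, for example, removes a lot of repetitive work and gets you network resources quickly. Saving a site's images or page text by hand, one page at a time, or batch-downloading them with a download manager is not very convenient. With your own code you control what is downloaded, the download strategy, the sleep interval between requests, and whether a persistent connection is kept alive, which is more flexible than relying on someone else's tool. Resuming interrupted downloads of large files is more trouble, so this program only targets scraping web pages, novels, and images.

Forum browsing is a typical case. You search for a keyword, the results are poor, and a thread you clearly remember containing that keyword never shows up. You can instead scrape several boards of the forum, possibly tens of thousands of pages, and build an index with Python so the material is easy to search, since many forums' built-in search is weak. I usually scrape the forums I read regularly onto my own server, where I can ssh in at any time to update the code or the scraping strategy. The code below is deliberately simple; the more complex versions cannot be shared publicly, and they are best worked out on your own.

Code example

The complete code first. It targets Python 2 (httplib, urllib2, print statements):

    import urllib
    import urllib2
    import os
    import time
    import sys
    import httplib
    import traceback

    def getHtml(conn_obj, nextUri):
        try:
            # Reuse the open connection and ask the server to keep it alive.
            conn_obj.request("GET", nextUri, "", {"Connection": "Keep-Alive"})
            response = conn_obj.getresponse()
            html = response.read()
            # The pages are GBK-encoded; convert them to UTF-8.
            return html.decode("gbk", "ignore").encode("utf-8") + "\r\n"
        except Exception, e:
            print Exception, ":", e
            return None

    def interstr(src, begin, end):
        index1 = src.find(begin)
        if index1 == -1:  # compare with ==, not "is"
            return None
        index1 += len(begin)
        tmp = src[index1:]
        index2 = tmp.find(end)
        if index2 == -1:
            return None
        dst = tmp[:index2]
        return dst

    # The marker strings in the three functions below are site-specific;
    # adjust them to the HTML of the target site.
    def getTitle(html):
        title = interstr(html, "bookname=", "/")
        if title is None:
            return None
        title = interstr(title, "tilename=", "/")
        if title is None:
            return None
        return title

    def getNextPage(html):
        pageNum = interstr(html, "next_page = ", "/")
        bookID = interstr(html, "bookid = ", "/")
        if pageNum is None or bookID is None:
            return None
        nextPage = "/book/%s/%s.html" % (bookID, pageNum)
        return nextPage

    def getContent(html):
        data = interstr(html, "content", "/")
        if data is None:
            return None
        return data + "\n"

    def forstr(src, begin, end):
        tmpSrc = src
        strList = []
        while True:
            indexBegin = tmpSrc.find(begin)
            if indexBegin == -1:
                break
            indexBegin += len(begin)
            # Search for the end marker only after the begin marker.
            indexEnd = tmpSrc.find(end, indexBegin)
            if indexEnd == -1:
                break
            tmpString = tmpSrc[indexBegin:indexEnd]
            strList.append(tmpString)
            # Continue after the end marker that was just consumed.
            tmpSrc = tmpSrc[indexEnd + len(end):]
        return strList

    if __name__ == "__main__":
        book = sys.argv[1]
        page = sys.argv[2]
        host = sys.argv[3]
        print book
        s = time.time()
        # One connection for the whole run: host, port, strict, timeout (seconds).
        conn = httplib.HTTPConnection(host, 80, True, 10)
        nextPage = "/book/%s/%s.html" % (book, page)
        path = "/mnt/j/%s.txt" % book
        output = open(path, "a+")
        while True:
            print nextPage
            html = getHtml(conn, nextPage)
            if html is None:
                # The connection failed or timed out: reconnect and retry.
                time.sleep(1)
                conn.close()
                conn = httplib.HTTPConnection(host, 80, True, 10)
                print "waiting timeout for html"
                html = getHtml(conn, nextPage)
                if html is None:
                    continue
            title = getTitle(html)
            if title is not None:
                output.write(title)
                print title
            else:
                print "title none"
                time.sleep(1)
                html = getHtml(conn, nextPage)
                title = getTitle(html) if html is not None else None
                if title is None:
                    continue
                output.write(title)
            data = getContent(html)
            if data is not None:
                output.write(data)
                print data
            else:
                print "content none"
            nextPage = getNextPage(html)
            if nextPage is None:
                break
        e = time.time()
        print e - s
        output.close()
        conn.close()

2. Code walkthrough

    def getHtml(conn_obj, nextUri):
        try:
            conn_obj.request("GET", nextUri, "", {"Connection": "Keep-Alive"})
            response = conn_obj.getresponse()
            html = response.read()
            return html.decode("gbk", "ignore").encode("utf-8") + "\r\n"
        except Exception, e:
            print Exception, ":", e
            return None

This function downloads the page nextUri from the host that conn_obj points to. conn_obj is a persistent HTTP connection created with the httplib library, and nextUri is a path relative to the site root. The function requests the chapter URL and downloads the page; the downloaded data is the page source. Novel sites usually use GBK encoding, so the Chinese characters in the source have to be converted to UTF-8.

    def interstr(src, begin, end):
        index1 = src.find(begin)
        if index1 == -1:
            return None
        index1 += len(begin)
        tmp = src[index1:]
        index2 = tmp.find(end)
        if index2 == -1:
            return None
        dst = tmp[:index2]
        return dst

This function extracts a substring: it returns the data in src that lies between the begin and end strings. The approach is basic and not very efficient, but the idea is simple and everything is implemented by hand.

    def getTitle(html):

This function extracts the chapter title. It is also simple, but the extraction rule differs from site to site. html is the input parameter, the novel's content page; interstr matches the characteristic strings before and after the chapter title and extracts the title.

    def getNextPage(html):

This function gets the URL of the next chapter. A page usually contains the current chapter's content along with the addresses of the previous and next chapters. Extracting those addresses lets the loop fetch the next chapter automatically.

    def getContent(html):

This function parses the novel text out of the downloaded source. The rules differ between sites. The matching scheme is simple and uses only string operations; regular expressions are an option, but they actually look less intuitive than this.

That covers the helper functions. The most important part is in the main block: maintaining the persistent connection. The program sets up one HTTP connection to the server at the start and, as long as the network connection stays up, uses it to download every page instead of establishing a new connection per page. That makes it comparatively faster and lowers the risk of being kicked off by the server. The main effect is on crawl speed: with urllib, which opens one connection per download, a novel of 800 chapters takes an estimated hour, and the server readily refuses connections because the requests are too frequent. With a persistent connection it generally takes about ten minutes. With small changes the program can also fetch images, PDF documents, and other resources, which greatly improves the efficiency of gathering resources.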
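
The connection reuse described above can be checked locally without touching a real site. The sketch below is an illustration under stated assumptions, not the author's program: it is written for Python 3, where httplib is named http.client, and it starts a throwaway server on 127.0.0.1. The names DemoHandler, fetch_all, run_demo, and seen_ports are hypothetical. The server records the client port of every request; if all requests arrive from the same port, they travelled over a single TCP connection.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

seen_ports = []  # client source port of every request the demo server receives


class DemoHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 is required for the server side to keep the connection open.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        seen_ports.append(self.client_address[1])
        body = ("page " + self.path).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_all(host, port, paths):
    """Fetch every path over one persistent connection and return the bodies."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    bodies = []
    try:
        for path in paths:
            conn.request("GET", path, headers={"Connection": "keep-alive"})
            response = conn.getresponse()
            # The body must be read completely before the connection is reused.
            bodies.append(response.read().decode("utf-8"))
    finally:
        conn.close()
    return bodies


def run_demo():
    """Start the local demo server, fetch three pages, and shut it down."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return fetch_all("127.0.0.1", port, ["/1.html", "/2.html", "/3.html"])
    finally:
        server.shutdown()
        server.server_close()
```

Calling run_demo() returns the three page bodies, and afterwards seen_ports holds one entry per request. Against a real site the same fetch_all loop applies, with the reconnect-and-retry step from the main block added around conn.request for the case where the server drops the connection.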

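The walkthrough mentions regular expressions as an alternative to the hand-written marker search. As a rough comparison, the sketch below puts a Python 3 version of interstr next to regex equivalents of interstr and forstr. The names interstr_re and forstr_re and the sample HTML fragment are made up for illustration; real marker strings depend on the target site.

```python
import re


def interstr(src, begin, end):
    """Return the text between the first `begin` and the next `end`, or None."""
    start = src.find(begin)
    if start == -1:
        return None
    start += len(begin)
    stop = src.find(end, start)
    if stop == -1:
        return None
    return src[start:stop]


def interstr_re(src, begin, end):
    """Same result as interstr, written as a non-greedy regular expression."""
    # re.escape keeps characters such as "." or "(" in the markers literal.
    pattern = re.escape(begin) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, src, re.DOTALL)  # DOTALL: content may span lines
    return match.group(1) if match else None


def forstr_re(src, begin, end):
    """Return every non-overlapping begin...end span as a list of strings."""
    pattern = re.escape(begin) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, src, re.DOTALL)


# Hypothetical page fragment, not taken from any real site.
sample = '<h1 id="title">Chapter 1</h1><div id="content">Line A\nLine B</div>'
print(interstr(sample, '<div id="content">', '</div>'))
print(interstr_re(sample, '<h1 id="title">', '</h1>'))
print(forstr_re("<a>x</a><a>y</a>", "<a>", "</a>"))
```

Which form reads better is a matter of taste, as the author notes. The regex version is shorter once the markers are escaped, while the find-based version needs no knowledge of regex syntax and makes the two search steps explicit.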