Basic Data Statistics with Python

Basic data processing of Python

Department of Computer Science and Technology
Department of University Basic Computer Teaching
Nanjing University

A simple data processing workflow

    1. Data collection
    2. Data wrangling
    3. Data description
    4. Data analysis

Convenient data acquisition (Data Processing Using Python)

Getting data with Python: how do we get local data?

Open the file, read or write it, and close it. The steps are: open file, read file or write file, close file.

Getting data with Python: how do we get network data?

Fetch the web page, then parse its content. Python 2 libraries for this: urllib, urllib2, httplib, httplib2.

Yahoo Finance data

http:/ t

Getting Yahoo Finance data with the urllib library

    # Filename: dji.py
    import urllib
    import re

    # The URL and the HTML tags inside the pattern are cut off in the source.
    dStr = urllib.urlopen('http:/...').read()
    m = re.findall(' (.*?) (.*?).*?(.*?).*?', dStr)
    if m:
        print m
        print '\n'
        print len(m)
    else:
        print 'not match'

Data format: the dji data consists of multiple strings

    AXP, American Express Company, 86.40
    BA, The Boeing Company, 122.24
    CAT, Caterpillar Inc., 99.44
    CSCO, Cisco Systems, Inc., 23.78
    CVX, Chevron Corporation, 115.91

Convenient network data

Is there a simple, convenient, and fast way to get the historical stock data of the listed companies on Yahoo Finance?

    # Filename: quotes.py
    from matplotlib.finance import quotes_historical_yahoo
    from datetime import date
    import pandas as pd

    today = date.today()
    start = (today.year-1, today.month, today.day)
    quotes = quotes_historical_yahoo('AXP', start, today)
    df = pd.DataFrame(quotes)
    print df

Contents of quotes: date, opening price, closing price, highest price, lowest price, volume.

Convenient network data: the Natural Language Toolkit (NLTK)

NLTK ships with the Gutenberg corpus, the Brown corpus, the Reuters corpus, and web and chat text.

    >>> from nltk.corpus import gutenberg
    >>> import nltk
    >>> print gutenberg.fileids()
    [u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt',
     u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt',
     u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt',
     u'chesterton-brown.txt', u'chesterton-thursday.txt',
     u'edgeworth-parents.txt', u'melville-moby_dick.txt',
     u'milton-paradise.txt', u'shakespeare-caesar.txt',
     u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt',
     u'whitman-leaves.txt']
    >>> texts = gutenberg.words('shakespeare-hamlet.txt')
    >>> texts
    [u'[', u'The', u'Tragedie', u'of', u'Hamlet', u'by', ...]

Data preparation (Data Processing Using Python)

Data forms

Logical structure of the data for the 30 component stocks (dji): company code, company name, last trade price.

Logical structure of the detailed stock data for American Express (quotes): date, opening price, closing price, highest price, lowest price, volume.

Data wrangling
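As a rough illustration of giving the dji records named fields (company code, company name, last trade price), the sketch below parses the five sample lines from the dji listing earlier in the slides. It uses Python 3 syntax and only the standard library. The names `Stock` and `parse_record` are hypothetical and do not come from the slides, and the slides themselves use pandas rather than a namedtuple.

```python
from collections import namedtuple

# Hypothetical record type; the field names follow the dji logical structure.
Stock = namedtuple("Stock", ["code", "name", "lasttrade"])

# Sample records copied from the dji listing earlier in the slides.
lines = [
    "AXP, American Express Company, 86.40",
    "BA, The Boeing Company, 122.24",
    "CAT, Caterpillar Inc., 99.44",
    "CSCO, Cisco Systems, Inc., 23.78",
    "CVX, Chevron Corporation, 115.91",
]


def parse_record(line):
    """Split one 'code, name, price' line into a Stock record."""
    # A company name may itself contain a comma ("Cisco Systems, Inc."),
    # so take the code from the front and the price from the back
    # instead of splitting on every comma.
    code, rest = line.split(", ", 1)
    name, price = rest.rsplit(", ", 1)
    return Stock(code, name, float(price))


dji = [parse_record(line) for line in lines]
for stock in dji:
    print(stock.code, stock.name, stock.lasttrade)
```

Splitting once from each end keeps names such as "Cisco Systems, Inc." intact, which a plain split on every comma would break apart.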

Adding column names to the quotes data

    # Filename: quotesproc.py
    from matplotlib.finance import quotes_historical_yahoo
    from datetime import date
    import pandas as pd

    today = date.today()
    start = (today.year-1, today.month, today.day)
    quotes = quotes_historical_yahoo('AXP', start, today)
    fields = ['date', 'open', 'close', 'high', 'low', 'volume']
    quotesdf = pd.DataFrame(quotes, columns = fields)
    print quotesdf

dji data with column names: the columns are code, name, and lasttrade, with rows such as AXP, BA, CAT, and XOM.

quotes data with column names: the columns are date, open, close, high, low, and volume. The date column holds ordinal values such as 735190.0, 735191.0, 735192.0, and 735551.0.

Using 1, 2, ... as the index

    # Default index
    quotesdf = pd.DataFrame(quotes, columns = fields)
    # Index starting at 1
    quotesdf = pd.DataFrame(quotes, index = range(1,len(quotes)+1), columns = fields)

If date could be used directly as the index, can the times in quotes be converted to a conventional form?

    >>> from datetime import date
    >>> firstday = date.fromordinal(735190)
    >>> lastday = date.fromordinal(735551)
    >>> firstday
    datetime.date(2013, 11, 18)
    >>> lastday
    datetime.date(2014, 11, 14)

Time series

    # Filename: quotesproc.py
    from matplotlib.finance import quotes_historical_yahoo
    from datetime import date
    from datetime import datetime
    import pandas as pd

    today = date.today()
    start = (today.year-1, today.month, today.day)
    quotes = quotes_historical_yahoo('AXP', start, today)
    fields = ['date', 'open', 'close', 'high', 'low', 'volume']
    list1 = []
    for i in range(0, len(quotes)):
        x = date.fromordinal(int(quotes[i][0]))    # convert to a regular date
        y = datetime.strftime(x, '%Y-%m-%d')       # convert to a fixed format
        list1.append(y)
    quotesdf = pd.DataFrame(quotes, index = list1, columns = fields)
    quotesdf = quotesdf.drop(['date'], axis = 1)   # delete the original date column
    print quotesdf

Creating a time series

    >>> import pandas as pd
    >>> dates = pd.date_range('20141001', periods=7)
    >>> dates
    [2014-10-01, ..., 2014-10-07]
    Length: 7, Freq: D, Timezone: None
    >>> import numpy as np
    >>> dates = pd.DataFrame(np.random.randn(7,3), index=dates, columns = list('ABC'))
    >>> dates
                       A         B         C
    2014-10-01  1.302600 -1.214708  1.411628
    2014-10-02 -0.512343  2.277474  0.403811
    2014-10-03 -0.788498 -0.217161  0.173284
    2014-10-04  1.042167 -0.453329 -2.107163
    2014-10-05 -1.628075  1.663377  0.943582
    2014-10-06 -0.091034  0.335884  2.455431
    2014-10-07 -0.679055 -0.865973  0.246970

    [7 rows x 3 columns]

Data display (Data Processing Using Python)

Data display: djidf and quotesdf

    >>> djidf.index
    Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
                18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], dtype='int64')
    >>> djidf.columns
    Index([u'code', u'name', u'lasttrade'], dtype='object')
    >>> djidf.values
    array([['AXP', 'American Express Company', 90.67],
           ['BA', 'The Boeing Company', 128.86],
           ...,
           ['XOM', 'E
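To make the three attributes above concrete, here is a minimal sketch in Python 3 syntax that builds a two-row stand-in for djidf and reads its index, columns, and values. It assumes pandas is installed. The two rows reuse the figures shown in the djidf.values output, and the helper names `index_labels`, `column_names`, and `values` are illustrative only. Newer pandas versions report the default index as a RangeIndex rather than an Int64Index, so the sketch compares plain lists instead of relying on the printed form.

```python
import pandas as pd

# Stand-in for djidf: two rows in the (code, name, lasttrade) layout,
# using the figures shown in the djidf.values output.
rows = [
    ("AXP", "American Express Company", 90.67),
    ("BA", "The Boeing Company", 128.86),
]
djidf = pd.DataFrame(rows, columns=["code", "name", "lasttrade"])

index_labels = list(djidf.index)      # default row labels: 0, 1
column_names = list(djidf.columns)    # the three column names
values = djidf.values.tolist()        # the data as nested Python lists

print(index_labels)
print(column_names)
print(values)
```

Converting to plain lists keeps the example independent of how a given pandas version prints its index and array objects.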
