Key Points for Using R in Gene Chip (Microarray) Data Processing


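1. Installing R
Download and install R from the official website: http://www.r-project.org/

2. Required packages
2.1 affy, the package for processing the array data. Paste the following into R:

    source("http://bioconductor.org/biocLite.R")
    biocLite("affy")

(In current Bioconductor releases, biocLite() has been replaced by BiocManager::install().)

2.2 gplots, the package used for heat maps:

    install.packages("gplots")

3. Obtaining the gene expression data
3.1 Read the microarray data (CEL files):

    library(affy)
    the.filter <- matrix(c("CEL file (*.cel)", "*.cel", "All (*.*)", "*.*"), ncol = 2, byrow = TRUE)
    # choose.files() is only available on Windows
    cel.files <- choose.files(caption = "Select CEL files", multi = TRUE, filters = the.filter, index = 1)
    raw.data <- ReadAffy(filenames = cel.files)

3.2 Look at the original sample names first to see what pattern they follow:

    sampleNames(raw.data)

3.3 Rename the samples and record the treatments:

    pat <- ".*MT-([0-9A-Z]+).*"    # regular expression that matches the sample name
    sampleNames(raw.data) <- gsub(pat, "\\1", cel.files)    # gsub() does regular-expression search and replace
    (samples <- sampleNames(raw.data))
    pData(raw.data)$treatment <- rep(c("0h", "1h", "24h", "7d"), each = 2)    # two replicates per treatment

Preprocess with the RMA method:

    eset.rma <- rma(raw.data)
    # Background correcting
    # Normalizing
    # Calculating Expression

4. Computing expression values

    emat.rma.log2 <- exprs(eset.rma)
    class(emat.rma.log2)
    head(emat.rma.log2, 1)
    # Average the two replicates of each treatment (RMA values are already on the log2 scale)
    results.rma <- data.frame((emat.rma.log2[, c(1, 3, 5, 7)] + emat.rma.log2[, c(2, 4, 6, 8)]) / 2)
    # Fold changes relative to 0h, computed as differences of log2 means
    results.rma$fc.1h  <- results.rma[, 2] - results.rma[, 1]
    results.rma$fc.24h <- results.rma[, 3] - results.rma[, 1]
    results.rma$fc.7d  <- results.rma[, 4] - results.rma[, 1]
    head(results.rma, 2)

5. t-test

    p.value <- apply(emat.rma.log2, 1, function(x) t.test(x[7:9], x[10:12])$p.value)

The indices 7:9 and 10:12 assume twelve arrays with three replicates per group. They must be adjusted to the columns of the two groups being compared; with the eight arrays used above, columns 9 to 12 do not exist and the call fails.

6. Exporting the data

    write.csv(results.rma, file = "C:/users/suntao/desktop/data.csv")

7. Selecting target genes
Identify the probes at http://www.plexdb.org/index.php, select the corresponding data, collect them in an Excel sheet, and save it in CSV format.

8. Heat map

    cipk <- read.csv("c:/users/suntao/desktop/TaCIPK affx arry log.csv")
    row.names(cipk) <- cipk$genename
    cipk

Installing the impute package:

    source("http://bioconductor.org/biocLite.R")
    biocLite("impute")

impute is an R package dedicated to filling in missing values with the KNN (k-nearest neighbours) method. Set the working directory first (on Windows through the R menu bar: File, then Change dir; on Linux with setwd()), then enter the following code in the R console:

    library(impute)    # load the impute package
    raw <- read.table("raw_data_3_replicates.txt", header = TRUE)
    rawexpr <- raw[, -1]    # drop the first column (the ID column)
    # Required: without this line an error occurs; the reason is unknown to the author, advice welcome
    if (exists(".Random.seed")) rm(.Random.seed)
    # impute.knn() takes a matrix as its first argument; the other arguments are the default values
    imputed <- impute.knn(as.matrix(rawexpr), k = 10, rowmax = 0.5, colmax = 0.8, maxp = 1500, rng.seed = 362436069)
    # write.table() saves the data to a file in the working directory, named by file=; this step is optional
    write.table(imputed$data, file = "imputed_data.txt")
    imputeddata <- imputed$data    # the matrix that holds the imputed data in R

Now type imputeddata in R to see the filled-in data matrix: are all the NA values gone?

Detailed documentation for the impute package is at http://bioconductor.fhcrc.org/packages/release/bioc/html/impute.html
All data files: http:/ (URL truncated in the source); http://azaleasays.com/tag/r/

Gene chip data analysis with R and BioConductor (3): computing the median

Continued from the previous post: http:/ rchive/2012/12/05/2803144.html

We already know that the data to be analysed contain 3 replicate measurements per gene, and after missing-value imputation every gene has 3 usable values. This step is simple: take the median of these 3 values. There are many ways to do it. In Excel you can use the MEDIAN function; in R, use the following code:

    # A simple function: median of the three replicate rows of gene i in column j
    get_median <- function(i, j) {
      num_vec <- c(imputeddata[i * 3 - 2, j], imputeddata[i * 3 - 1, j], imputeddata[i * 3, j])
      median(num_vec)
    }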
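The replicate-median step can be sketched end to end on a toy matrix. The sketch assumes, as get_median() does, that each gene occupies three consecutive rows of imputeddata. The toy values and the names n_genes and median_data are illustrative and do not come from the original post, which breaks off after defining the function.

```r
# Toy stand-in for imputed$data: 2 genes x 3 replicate rows each, 2 arrays.
# (Hypothetical values; matrix() fills column by column.)
imputeddata <- matrix(
  c(1, 5, 3, 10, 30, 20,   # array1: gene 1 replicates, then gene 2 replicates
    2, 2, 8,  7,  9,  8),  # array2: same layout
  ncol = 2,
  dimnames = list(NULL, c("array1", "array2"))
)

# Median of the three replicate rows of gene i in column j.
get_median <- function(i, j) {
  rows <- (i * 3 - 2):(i * 3)
  median(imputeddata[rows, j])
}

# Number of genes, assuming exactly three rows per gene.
n_genes <- nrow(imputeddata) / 3

# Evaluate get_median for every gene/column pair.
# Vectorize() is needed because outer() passes whole index vectors to the function.
median_data <- outer(
  seq_len(n_genes), seq_len(ncol(imputeddata)),
  Vectorize(get_median)
)
colnames(median_data) <- colnames(imputeddata)
median_data
```

The result has one row per gene and one column per array. Checking first that nrow(imputeddata) is a multiple of 3 would guard against a gene with a missing replicate row.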
