Lecture Notes: Introduction to Next-Generation Sequencing Data Analysis

Uploaded by: 今*** | Document ID: 106962894 | Uploaded: 2019-10-17 | Format: PPT | Pages: 57 | Size: 4.37 MB

Introduction to Next-Generation Sequencing Data Analysis
Tong Chunfa (童春发), 2013.12.23

Outline
  Principles and workflow of resequencing
  Data structure and quality assessment
  The SRA database and data retrieval
  Using Bowtie2, BWA, and SAMtools

Principles and workflow of resequencing

Data structure and quality assessment
  FASTQ format
  FastQC

FASTQ format
  Source: http://en.wikipedia.org
  A FASTQ file containing a single sequence might look like this:

    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Illumina sequence identifiers
    @HWUSI-EAS100R:6:73:941:1973#0/1
  Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag.
  With Casava 1.8 the format of the line has changed:
    @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

Quality
  A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
  Phred quality score: $Q_{\text{sanger}} = -10 \log_{10} p$
  The Solexa pipeline (i.e., the software delivered with the Illumina Genome Analyzer) earlier used a different mapping, encoding the odds p/(1-p) instead of the probability p: $Q_{\text{solexa}} = -10 \log_{10} \frac{p}{1-p}$

Quality: Encoding
  Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126.
  Illumina's newest pipeline version, CASAVA 1.8, directly produces FASTQ in Sanger format.
  Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII 59 to 126.
  Starting with Illumina 1.3 and before Illumina 1.8, the format encoded a Phred quality score from 0 to 62 using ASCII 64 to 126.
  Starting in Illumina 1.5 and before Illumina 1.8, the Phred scores 0 to 2 have a slightly different meaning.

American Standard Code for Information Interchange (ASCII)

FastQC
  http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  Double-click "run_fastqc.bat" to run FastQC.
  The analysis reports results for 11 modules:
    Green tick: normal
    Orange triangle: slightly abnormal
    Red cross: very unusual

Basic Statistics
  Filename            NHS066-47_L4_1.fq.gz
  File type           Conventional base calls
  Encoding            Sanger / Illumina 1.9
  Total Sequences     3992798
  Filtered Sequences  0
  Sequence length     100
  %GC                 37

Per Base Sequence Quality
  The central red line is the median value.
  The yellow box represents the inter-quartile range (25-75%).
  The upper and lower whiskers represent the 10% and 90% points.
  The blue line represents the mean quality.

Per Sequence Quality Scores
  A warning is raised if the most frequently observed mean quality is below 27, which equates to a 0.2% error rate.
  An error is raised if the most frequently observed mean quality is below 20, which equates to a 1% error rate.

Per Base Sequence Content
  This module issues a warning if the difference between A and T, or G and C, is greater than 10% in any position.
  This module will fail if the difference between A and T, or G and C, is greater than 20% in any position.

Per Base GC Content
  This module issues a warning if the GC content of any base strays more than 5% from the mean GC content.
  This module will fail if the GC content of any base strays more than 10% from the mean GC content.

Per Sequence GC Content
  A warning is raised if the sum of the deviations from the normal distribution represents more than 15% of the reads.
  This module will indicate a failure if the sum of the deviations from the normal distribution represents more than 30% of the reads.

Per Base N Content
  This module raises a warning if any position shows an N content greater than 5%.
  This module will raise an error if any position shows an N content greater than 20%.

Sequence Length Distribution
  This module will raise a warning if all sequences are not the same length.
  This module will raise an error if any of the sequences have zero length.

Duplicate Sequences
  This module will issue a warning if non-unique sequences make up more than 20% of the total.
  This module will issue an error if non-unique sequences make up more than 50% of the total.

Overrepresented Sequences
  Sequence                                            Count  Percentage  Possible source
  AATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCG  65311  1.636       TruSeq Adapter, Index 10 (97% over 36bp)
  ATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT  6464   0.162       TruSeq Adapter, Index 10 (97% over 36bp)
  AATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT  4633   0.116       TruSeq Adapter, Index 10 (97% over 36bp)
  AATTAGTCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT  4463   0.112       TruSeq Adapter, Index 10 (97% over 34bp)
  AATTATGGATAATTAAAGTATTCCCCCCTTTTTTTTATGATATTTTTGAC  3994   0.100       No Hit
  Warning: any sequence exceeding 0.1% of the total.
  Failure: any sequence exceeding 1% of the total.

Overrepresented Kmers
  This module will issue a warning if any k-mer is enriched more than 3 fold overall, or more than 5 fold at any individual position.
  This module will issue an error if any k-mer is enriched more than 10 fold at any individual base position.

Saving a Report
  NHS066-47_L4_1.fq_fastqc.zip

The SRA database and data retrieval
  Viewing and downloading SRR576183

Converting SRA files to FASTQ format with fastq-dump
    fastq-dump --split-files -DQ "+" ./SRR576183.sra
    fastq-dump --split-files -DQ "+" --gzip ./SRR576183.sra

Downloading FASTQ data directly
    ftp://ftp.era.ebi.ac.uk/vol1/fastq/SRR576/SRR576183

Aligning reads to a reference sequence
  BWA, Bowtie2, SOAP, SAMtools

BWA
    http://bio-
    wget
    tar -xjvf bwa-0.7.5a.tar.bz2
    cd bwa-0.7.5a
    make
  Download test.tar.gz from ftp://202.119.214.193

BWA
    /bwa-0.7.5a/bwa index
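
Returning to the FASTQ format and Quality slides: the four-line record layout and the Phred relation Q = -10 log10(p) can be illustrated with a short Python sketch. It assumes Sanger (Phred+33) encoding and uses the single-record Wikipedia example. The function names are hypothetical and chosen only for illustration; this is not how FastQC or any particular tool implements the decoding.

```python
import math


def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (identifier, sequence, quality)."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record: expected '@' and '+' marker lines")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return header[1:], seq, qual


def phred33_to_scores(quality):
    """Decode a Sanger (Phred+33) quality string into integer Phred scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q):
    """Invert Q = -10 * log10(p) to recover the base-call error probability."""
    return 10 ** (-q / 10)


# The single-sequence example record (assumed Sanger / Phred+33 encoded).
record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]

name, sequence, quality = parse_fastq_record(record)
scores = phred33_to_scores(quality)
print(name, len(sequence), min(scores), max(scores))
```

Under the same relation, Q = 20 corresponds to a 1% error probability and Q = 27 to roughly 0.2%, which matches the thresholds quoted on the Per Sequence Quality Scores slide.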
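
The ASCII ranges listed on the Quality encoding slide (33 to 126 for Sanger, 59 to 126 for Solexa/Illumina 1.0, 64 to 126 for Illumina 1.3 up to 1.8) suggest a simple way to guess which encoding a file uses: look at the lowest character code that appears. The sketch below is only a heuristic under that assumption, and `guess_quality_encoding` is a hypothetical helper name. A Sanger file containing only high-quality bases would be misclassified, so a real check should sample many reads.

```python
def guess_quality_encoding(quality_strings):
    """Guess the FASTQ quality encoding from the lowest ASCII code observed.

    Heuristic only: codes below 59 can occur only in Sanger (Phred+33) data,
    codes 59 to 63 only in Sanger or Solexa+64 data, and a minimum of 64 or
    more is consistent with Illumina 1.3+ (Phred+64).
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    if lowest < 33:
        raise ValueError("character below ASCII 33 is not valid in FASTQ quality")
    if lowest < 59:
        return "Sanger / Illumina 1.8+ (Phred+33)"
    if lowest < 64:
        return "Solexa / Illumina 1.0 (Solexa+64)"
    return "Illumina 1.3+ (Phred+64)"


# '!' is ASCII 33, so this string can only be Phred+33.
print(guess_quality_encoding(["!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF"]))
```

The ordering of the checks matters: the test for the smallest offset comes first because any character below 59 rules out both of the +64 encodings.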
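
The two identifier layouts shown on the Illumina sequence identifiers slide can be split into named fields. The sketch below assumes the field order documented for these formats (instrument, lane, tile, x, y, index, pair member for the older layout; instrument, run, flowcell, lane, tile, x, y, then read, filter flag, control number, index for Casava 1.8). The function name and dictionary keys are illustrative choices, not part of any Illumina software.

```python
def parse_illumina_id(identifier):
    """Split an Illumina read identifier into named fields.

    Handles the pre-1.8 layout (instrument:lane:tile:x:y#index/pair) and the
    Casava 1.8 layout (seven colon-separated fields, a space, then four more).
    """
    identifier = identifier.lstrip("@")
    if " " in identifier:
        # Casava 1.8 and later: location fields, then read-level fields.
        left, right = identifier.split(" ", 1)
        instrument, run, flowcell, lane, tile, x, y = left.split(":")
        read, is_filtered, control, index = right.split(":")
        return {
            "format": "casava-1.8", "instrument": instrument, "run": run,
            "flowcell": flowcell, "lane": int(lane), "tile": int(tile),
            "x": int(x), "y": int(y), "read": int(read),
            "is_filtered": is_filtered == "Y", "control": int(control),
            "index": index,
        }
    # Older pipeline versions: index after '#', pair member after '/'.
    left, pair = identifier.rsplit("/", 1)
    coords, index = left.split("#", 1)
    instrument, lane, tile, x, y = coords.split(":")
    return {
        "format": "pre-1.8", "instrument": instrument, "lane": int(lane),
        "tile": int(tile), "x": int(x), "y": int(y),
        "index": index, "read": int(pair),
    }


old = parse_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
new = parse_illumina_id("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(old["instrument"], old["lane"], new["flowcell"], new["index"])
```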
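
The warn/fail thresholds quoted for the FastQC modules are simple comparisons once the per-position statistic is known. As a rough illustration for the Per Base N Content module, the sketch below computes the percentage of N calls at each position and applies the 5% and 20% cut-offs. It assumes a non-empty list of reads held in memory, the helper names are hypothetical, and it is not FastQC's own implementation.

```python
def n_content_by_position(reads):
    """Return the percentage of N base calls at each read position."""
    longest = max(len(read) for read in reads)
    percentages = []
    for pos in range(longest):
        # Only reads long enough to cover this position are counted.
        bases = [read[pos] for read in reads if len(read) > pos]
        n_calls = sum(1 for base in bases if base in ("N", "n"))
        percentages.append(100.0 * n_calls / len(bases))
    return percentages


def classify_n_content(percentages, warn=5.0, fail=20.0):
    """Apply the warning (>5%) and failure (>20%) thresholds to the worst position."""
    worst = max(percentages)
    if worst > fail:
        return "fail"
    if worst > warn:
        return "warn"
    return "pass"


# Toy input: one N among four reads at two positions gives 25% there.
toy_reads = ["ACGT", "ANGT", "ACGT", "ACGN"]
toy_percentages = n_content_by_position(toy_reads)
print(toy_percentages, classify_n_content(toy_percentages))
```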
