Intel Memory: Key Technologies Explained


Independent Channel Mode

In Independent Channel Mode, all four channels may be populated in any order and have no matching requirements. All channels must run at the same interface frequency, but individual channels may run at different DIMM timings (RAS latency, CAS latency, and so forth).

Lockstep Channel Mode

In Lockstep Channel Mode, each memory access is a 128-bit data access that spans either Channel 0 and Channel 1, or Channel 2 and Channel 3. Lockstep Channel Mode is the only RAS mode that allows SDDC for x8 devices. It requires that Channel 0 and Channel 1, and Channel 2 and Channel 3, be populated identically with regard to size and organization. DIMM slot populations within a channel do not have to be identical, but the same DIMM slot location across Channel 0 and Channel 1, and across Channel 2 and Channel 3, must be populated the same.

Mirrored Channel Mode

In Mirrored Channel Mode, the memory contents are mirrored between Channel 0 and Channel 2 and between Channel 1 and Channel 3. As a result of the mirroring, the total physical memory available to the system is half of what is installed. Mirrored Channel Mode requires that Channel 0 and Channel 2, and Channel 1 and Channel 3, be populated identically with regard to size and organization. DIMM slot populations within a channel do not have to be identical, but the same DIMM slot location across Channel 0 and Channel 2, and across Channel 1 and Channel 3, must be populated the same.

Rank Sparing Mode

In Rank Sparing Mode, one rank serves as a spare for the other ranks on the same channel. The spare rank is held in reserve and is not available as system memory. It must have the same or larger capacity than every other rank (the sparing source ranks) on that channel. After sparing, the sparing source rank is lost.

With memory hot sparing, the spare memory is not used during normal operation, so the system does not see its capacity. In each memory channel, one DIMM is left unused and reserved as the hot spare. The chipset holds an error threshold for memory ECC errors, expressed as the number of errors per unit of time. When the errors on a working module reach this threshold, the system starts writing every update twice, once to the primary memory and once to the spare. After it confirms that the two copies match, the spare takes over from the primary memory and the failing memory is disabled. The spare has then replaced the failing memory, which prevents data loss or a system crash caused by the memory fault. The spare's capacity should be greater than or equal to that of the largest module in its channel, so that it can hold all of the migrated data.

Memory Scrubbing

Each memory location should be checked periodically, and often enough that multiple-bit errors are unlikely to build up in the same word: with the usual (as of 2008) ECC memory modules, single-bit errors can be corrected, but multiple-bit errors cannot.

To avoid disturbing regular memory requests from the CPU and thus degrading performance, scrubbing is usually done only during idle periods. Because scrubbing consists of normal read and write operations, it may increase memory power consumption compared with non-scrubbing operation. Therefore, scrubbing is performed periodically rather than continuously. On many servers, the scrub period can be configured in the BIOS setup program.

Normal memory reads issued by the CPU or DMA devices are checked for ECC errors. However, because of data locality, these reads may be confined to a small range of addresses and leave other memory locations untouched for a very long time. Those locations can accumulate more than one soft error, whereas scrubbing guarantees that the whole memory is checked within a bounded time.

Key info:
1) Soft errors: an important reason for memory scrubbing
2) Error detection and correction: the general theory behind memory scrubbing

ECC Technology

In the early 1990s, memory systems used parity checking. Parity memory adds one extra bit to every byte (8 bits) for error detection. The parity logic adds up the data bits written to memory and stores the result in the parity bit. For example, if a byte holds 10011110, its bits sum to 1+0+0+1+1+1+1+0 = 5, which is odd, so 1 is stored in the parity bit. When the CPU reads the data back, the 8 stored bits are summed again and the result is compared with the parity bit. If the two differ, the system reports an error. Parity can only roughly detect memory errors and cannot correct them.

Another technique is ECC (Error-Correcting Code). It also adds extra bits to the data bits, and these extra bits are used to reconstruct erroneous data. In an ECC scheme, N bytes of data (N a power of two) need log2(N) + 5 extra ECC bits. For example, 64-bit data (N = 8 bytes) needs log2(8) + 5 = 8 ECC bits. When a single bit is in error, ECC corrects it automatically. When two bits are in error, ECC can detect the error but cannot correct it. This behavior is called Single Error Correction/Double Error Detection (SEC/DED). When more than two bits in one access are in error, a SEC/DED scheme may no longer detect the error, and data integrity is compromised. In memory built this way, when an uncorrectable multi-bit error is detected, the system reports a fatal fault and then crashes.

X4/X8 SDDC (Single Device Data Correction)

As RAM chips become more densely integrated and memory capacities grow, the probability of memory errors rises as well. The SEC/DED memory architecture, considered very reliable a few years ago, is no longer sufficient today. Many vendors have long aimed for memory architectures that can correct multi-bit errors. The most severe RAM device failure is one in which all of the device's data bits are wrong. Correcting this kind of error has to be addressed in the chip and system hardware design; it cannot be achieved through a software upgrade.

In this scheme, one ECC bit is added for each byte of memory to form ECC words. If the memory system's data width is 32 bytes (256 bits), the actual stored width is 256 + 32 = 288 bits. In addition, bits that share a chip are placed in separate ECC words. Figure 1 illustrates how this works. The memory system consists of 4 DIMM modules. The 32 bytes (256 bits) of data are divided into 4 ECC words, each containing 8 bytes (64 bits) of data and 8 ECC bits. An ECC word is therefore 64 + 8 = 72 bits long, and the total stored width is 72 × 4 = 288 bits.

Figure 1: Chipkill memory error-correction principle

The memory controller splits each ECC word into 4 segments of 18 bits and stores them in the 4 DIMMs, so each DIMM holds 4 segments, each from a different ECC word. The 18 bits of each segment are then stored in different RAM chips. As a result, each DRAM chip holds only one bit of any given ECC word. If a RAM chip fails and every bit it holds is corrupted, each ECC word suffers at most a single-bit error. Because every ECC word has SEC/DED capability, these errors are corrected automatically and all data can be recovered.

What is LR-DIMM or LRDIMM?

Today, using RDIMMs, a typical server system can …
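The population rules in the channel-mode sections above can be checked mechanically. Below is a minimal sketch. It assumes a hypothetical model in which each channel is a list of slots and each populated slot is a (size in GB, organization) tuple. `usable_capacity_gb` and `PAIRS` are illustrative names, not an Intel or BIOS interface. The sketch models only the slot-by-slot matching rule and the halving of capacity under mirroring; interface frequency and timing rules are left out.

```python
from typing import List, Optional, Tuple

# Hypothetical model: a slot is None (empty) or (size_gb, organization),
# e.g. (16, "2Rx4"); a channel is a list of slots in slot order.
Slot = Optional[Tuple[int, str]]

# Channel pairs that must be populated identically, slot position by slot position.
PAIRS = {
    "lockstep": [(0, 1), (2, 3)],   # 128-bit access spans Ch0+Ch1 or Ch2+Ch3
    "mirrored": [(0, 2), (1, 3)],   # Ch0 mirrors Ch2, Ch1 mirrors Ch3
}

def usable_capacity_gb(channels: List[List[Slot]], mode: str) -> int:
    """Return the memory visible to the system, or raise if the population is invalid."""
    total = sum(slot[0] for ch in channels for slot in ch if slot is not None)
    if mode == "independent":
        return total                 # any order, no matching requirements
    if mode not in PAIRS:
        raise ValueError(f"unknown mode: {mode}")
    for a, b in PAIRS[mode]:
        # Slots within one channel may differ, but the same slot position
        # must hold the same size/organization in both channels of a pair.
        if channels[a] != channels[b]:
            raise ValueError(f"channels {a} and {b} are not populated identically")
    # Mirroring keeps two copies of everything, so only half is usable.
    return total // 2 if mode == "mirrored" else total
```

For example, four channels that each hold one 16 GB DIMM in slot 0 pass all three modes. They yield 64 GB in Independent or Lockstep mode and 32 GB in Mirrored mode.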

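The parity rule in the ECC section (store 1 when the eight data bits sum to an odd number) is even parity: the nine stored bits always contain an even number of 1s. The small sketch below uses hypothetical function names. It shows why parity detects a single flipped bit but cannot say which bit flipped, and why it misses a double flip entirely.

```python
def even_parity_bit(byte: int) -> int:
    """Return 1 if the 8 data bits contain an odd number of 1s, else 0."""
    return bin(byte & 0xFF).count("1") % 2

def parity_check(byte: int, stored_parity: int) -> bool:
    """Re-sum the data bits on read and compare with the stored parity bit."""
    return even_parity_bit(byte) == stored_parity

value = 0b10011110                # 1+0+0+1+1+1+1+0 = 5, odd
parity = even_parity_bit(value)   # so the parity bit stored is 1

one_flip = value ^ 0b00000100     # a single bit flipped in memory
two_flips = value ^ 0b00000110    # two bits flipped in memory

print(parity)                           # 1
print(parity_check(one_flip, parity))   # False: error detected, but not located
print(parity_check(two_flips, parity))  # True: double error goes unnoticed
```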

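An extended Hamming code, a common textbook construction, reproduces both the SEC/DED behavior and the log2(N) + 5 check-bit count from the ECC section. Real memory controllers may use different codes (for example Hsiao codes), so treat this as an illustration rather than Intel's implementation. The sketch also adds a `scrub` pass in the spirit of the Memory Scrubbing section: it reads every word, corrects single-bit errors, and writes the corrected word back before a second error can accumulate. All function names are hypothetical.

```python
from __future__ import annotations

def check_bits(k: int) -> int:
    """SEC/DED check bits for k data bits: Hamming bits r with 2**r >= k + r + 1, plus 1 overall parity bit."""
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    return r + 1

def encode(data: list[int]) -> list[int]:
    """Extended Hamming encode. Index 0 = overall parity; power-of-two indices = Hamming check bits."""
    n = len(data) + check_bits(len(data)) - 1    # highest Hamming position
    code = [0] * (n + 1)
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):                       # not a power of two -> data bit
            code[pos] = next(bits)
    p = 1
    while p <= n:
        # Check bit p makes the parity of all positions with bit p set even.
        code[p] = sum(code[i] for i in range(1, n + 1) if i & p) % 2
        p <<= 1
    code[0] = sum(code[1:]) % 2                   # overall parity enables double-error detection
    return code

def decode(code: list[int]) -> tuple[str, list[int]]:
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos                       # XOR of the positions holding a 1
    overall = sum(code) % 2                       # 1 -> an odd number of bits flipped
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome <= n:
        fixed[syndrome] ^= 1                      # single error; syndrome 0 = overall parity bit
        status = "corrected"
    else:
        status = "uncorrectable"                  # e.g. two flips: syndrome != 0, parity even
    data = [fixed[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data

def scrub(memory: list[list[int]]) -> dict[str, int]:
    """Walk every stored word, write back corrected words, and count the outcomes."""
    counts = {"ok": 0, "corrected": 0, "uncorrectable": 0}
    for addr, word in enumerate(memory):
        status, data = decode(word)
        counts[status] += 1
        if status == "corrected":
            memory[addr] = encode(data)           # write the corrected word back
    return counts

data = [1, 0, 1, 1] * 16                          # 64 data bits (N = 8 bytes)
word = encode(data)
print(len(word), check_bits(64))                  # 72 8
```

With 64 data bits the encoder produces a 72-bit word, which matches the 64 + 8 layout described in the text.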

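The Chipkill layout in the X4/X8 SDDC section can be checked by counting how many bits of each ECC word a single chip holds. The sketch uses the geometry from that example: 4 ECC words of 72 bits across 4 DIMMs, in 18-bit segments. The figure of 18 chips per DIMM, and therefore 4 bits per chip, is an assumption inferred from that arithmetic (288 / 72) rather than stated in the text. `naive` is a hypothetical non-interleaved layout included for contrast. With interleaving, a whole-chip failure costs each word at most one bit, which SEC/DED can correct. Without it, one word loses four bits.

```python
from collections import Counter

# Geometry from the example: 4 ECC words x 72 bits (64 data + 8 ECC) = 288 bits over
# 4 DIMMs. 18 chips per DIMM is an assumption inferred from the 18-bit segments.
WORDS, WORD_BITS, DIMMS, CHIPS = 4, 72, 4, 18
SEGMENT = WORD_BITS // DIMMS                           # 18 bits per segment
BITS_PER_CHIP = WORDS * WORD_BITS // (DIMMS * CHIPS)   # 288 / 72 = 4

def interleaved(word: int, bit: int) -> tuple:
    """Chipkill-style layout: segment s of every word goes to DIMM s, and the
    18 bits of a segment go to 18 different chips (word index is unused)."""
    dimm, chip = divmod(bit, SEGMENT)
    return dimm, chip

def naive(word: int, bit: int) -> tuple:
    """Hypothetical non-interleaved layout: words stored back to back,
    BITS_PER_CHIP consecutive bits in each chip."""
    chip_index = (word * WORD_BITS + bit) // BITS_PER_CHIP
    return divmod(chip_index, CHIPS)

def worst_bits_lost_per_word(layout) -> int:
    """Most bits any single ECC word loses when one whole chip fails."""
    worst = 0
    for dimm in range(DIMMS):
        for chip in range(CHIPS):
            lost = Counter(w for w in range(WORDS) for b in range(WORD_BITS)
                           if layout(w, b) == (dimm, chip))
            worst = max(worst, max(lost.values(), default=0))
    return worst

print(worst_bits_lost_per_word(interleaved))  # 1 -> correctable by SEC/DED
print(worst_bits_lost_per_word(naive))        # 4 -> beyond SEC/DED
```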