Reliability Testing Techniques


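Dependability: Basic Concepts and Taxonomy of Dependable and Secure Computing

Dependability
- Availability: readiness for correct service.
- Reliability: continuity of correct service.
- Safety: absence of catastrophic consequences on the user(s) and the environment.
- Integrity: absence of improper system alterations.
- Maintainability: ability to undergo modifications and repairs.

- Fault prevention: to prevent the occurrence or introduction of faults.
- Fault tolerance: to avoid service failures in the presence of faults.
- Fault removal: to reduce the number and severity of faults.
- Fault forecasting: to estimate the present number, the future incidence, and the likely consequences of faults.

- Fault: the adjudged or hypothesized cause of an error. It can be internal or external to a system.
- Error: the part of the total state of the system that may lead to its subsequent service failure.
- Failure: an event that occurs when the delivered service deviates from correct service.

Fault
- Development faults: all fault classes occurring during development.
- Physical faults: all fault classes that affect hardware.
- Interaction faults: all external faults.

Failure
- Content failures: the content of the information delivered at the service interface (i.e., the service content) deviates from implementing the system function.
- Timing failures: the time of arrival or the duration of the information delivered at the service interface (i.e., the timing of service delivery) deviates from implementing the system function.
- Halt failure: the service is halted (the external state becomes constant, i.e., system activity, if there is any, is no longer perceptible to the users). A special case of halt is silent failure, or simply silence, when no service at all is delivered at the service interface (e.g., no messages are sent in a distributed system).
- Erratic failures: a service is delivered (not halted), but is erratic (e.g., babbling).

Fault Tolerance
- Fault prevention and fault removal are grouped into fault avoidance, i.e., how to aim for fault-free systems.
- Fault tolerance and fault forecasting are grouped into fault acceptance, i.e., how to live with systems that are subject to faults.

Fault Removal:
- Development phase: code walkthroughs and the various testing activities.
- Operation phase: maintenance activities such as monitoring, inspection, replacement, and isolation.

Fault Forecasting:
- Qualitative evaluation: models such as FMEA, RBD, and FTA.
- Quantitative evaluation: reliability growth models, Markov chains, and so on.

How a system becomes unreliable
In reliability engineering, the mechanism can be described as a chain: error -> defect -> fault -> failure.
1. Error: an undesired or unacceptable human mistake made during the system's life cycle; its result is the creation of a defect. An error is therefore a human process and, relative to the system itself, an external action.
2. Defect: an undesired or unacceptable deviation that exists in the system (documents, data, programs, hardware, etc.), such as a missing comma or an extra statement. Its result is that a fault appears when the system runs under a particular condition (the defect is then said to be activated).
3. Fault: an undesired or unacceptable internal state that arises while the system is running. For example, when software is executing a superfluous loop, we say the software has a fault. If no appropriate measure (fault tolerance) handles it in time, a software failure results. A fault is clearly a dynamic behavior.
4. Failure: an undesired or unacceptable external behavior produced while the system is running.

Reliability from the customer's perspective
- The system has no problems. If errors and defects are all eliminated, no fault occurs and therefore no failure occurs. Code review and the various routine testing activities all work toward this.
- Problems occur but do not affect the service. For each kind of fault, the system needs a matching fault tolerance mechanism to keep the service continuous. The usual methods are fault management (fault detection, localization, reporting, recovery, etc.) and redundant design. This is the core of reliability design (goal: the system has no single point of failure, and a single-point fault does not affect the service), and it is also the focus of reliability testing.
- Problems affect the service, but the service recovers quickly. Automatic fault tolerance is always limited. It mainly handles single-point fault scenarios and may be helpless in multi-point fault scenarios or scenarios that involve hardware damage (for example, several replicas in a storage system failing at the same time). Manual intervention is then needed to recover from the fault, and the core requirement here is speed.

Main reliability analysis and design methodologies
- (S)FMEA: (Software) Failure Mode and Effects Analysis, a bottom-up fault analysis methodology.
- FTA: Fault Tree Analysis, a top-down fault analysis methodology.
- (S)FIT: (Software) Fault Injection Testing, which simulates the faults identified by FMEA and FTA to verify that the fault tolerance mechanisms work correctly.

[Figure: quadrant chart of fault versus scenario. Quadrants: known fault, known scenario; unknown fault, known scenario; unknown fault, unknown scenario; known fault, unknown scenario. Test types and related activities: long-duration stability testing, extreme stress testing, flow-control testing, fault injection testing, reliability prediction and modeling.]

Fault Injection Testing
[Figure labels: fault injection; fault-handling code; product functional code; product code.]
Concept: functional testing of the fault-handling code, which tests the interface between the fault-handling code and the product functional code.
Method:
- Faults are simulated by modifying the information held in objects at each level of the system. These objects can be externally invisible information, such as the contents of CPU registers, the contents of memory, data in transit on the network, or data on disk. They can also be visible objects, such as processes, network cards, files, servers, and disks.
- Invisible objects are usually modified at random, as the industry tool NFTAPE does. Visible objects are usually modified precisely, as the industry tool Chaos Monkey does. When an internal invisible object has to be modified precisely, this is generally done by function hooking and replacement (changing a function's return value, parameters, variables, etc.).

Industry information: Chaos Monkey
- Chaos Monkey is a service that organizes systems into groups and randomly terminates some of the systems in a group.
- Chaos Monkey is a member of Netflix's Simian Army.
- The Simian Army is a set of software tools for testing AWS infrastructure. The software is open source, and users of other cloud services can use it for the same kind of testing.
- Chaos Monkey's principle: the main way to avoid most failures is to fail often.
- Chaos Monkey: randomly shuts down instances in the production environment to make sure the site can withstand faults without affecting customers' normal use.
- Latency Monkey: introduces artificial delays into RESTful service calls to simulate service degradation, and measures whether upstream services respond appropriately. Introducing very long delays can also simulate a node, or even an entire service, being unavailable.
- Conformity Monkey: finds instances that do not follow best practices and shuts them down. For example, an instance that does not belong to an auto-scaling group should be shut down so that the service owner can relaunch it properly.
- Doctor Monkey: a tool that finds unhealthy instances. In addition to the health checks that run on each instance, it monitors external health signals and removes an instance from service once it is found to be unhealthy.
- Janitor Monkey: finds resources that are no longer needed and reclaims them, which reduces the waste of cloud resources to some extent.
- Security Monkey: an extension of Conformity Monkey that checks the system for security vulnerabilities and also makes sure SSL and DRM certificates are still valid.
- 10-18 Monkey: checks localization and internationalization configuration to make sure users in different regions, with different languages and character sets, can use Netflix normally.
- Chaos Gorilla: an upgraded Chaos Monkey that simulates the outage of an entire Amazon Availability Zone, to verify that the service rebalances across availability zones automatically, without affecting users and without manual intervention.

Industry information: Linux Fault-Injection
The Linux kernel integrates a practical facility, "Fault-injection", that helps with fault injection and makes it possible to construct common kernel exception scenarios. It can simulate slab allocation failures, page allocation failures, disk I/O errors, disk I/O timeouts, futex errors, and I/O errors specific to MMC. Users can also build on this mechanism to add the fault injection they need.
Fault-injection implements six injection types by default: failslab, fail_page_alloc, fail_futex, fail_make_request, fail_io_timeout, and fail_mmc_request. Their functions are as follows:
1) failslab: injects memory allocation errors in the slab allocator, mainly kmalloc(), kmem_cache_alloc(), and so on.
2) fail_page_alloc: injects page allocation errors, mainly alloc_pages(), get_free_pages(), and so on (lower level than failslab).
3) fail_futex: injects futex deadlock and uaddr errors.
4) fail_make_request: injects disk I/O errors. It injects faults into the generic_make_request() function in the block core layer.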
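The function replacement approach mentioned under the fault injection method (changing what a function returns so that the fault-handling code is exercised) can be illustrated in user space with a minimal Python sketch. Everything here is hypothetical and only for illustration: read_block stands in for product functional code, read_with_retry stands in for fault-handling code, and inject_io_errors is an assumed helper that temporarily swaps the function. It is not how NFTAPE, Chaos Monkey, or the Linux kernel facility is implemented.

```python
from contextlib import contextmanager


def read_block(device: str, block: int) -> bytes:
    """Hypothetical product functional code: pretend to read one 512-byte block."""
    return b"\x00" * 512


def read_with_retry(device: str, block: int, retries: int = 3) -> bytes:
    """Hypothetical fault-handling code: retry on I/O errors, then give up."""
    last_error = None
    for _ in range(retries):
        try:
            # read_block is looked up at call time, so a replacement takes effect.
            return read_block(device, block)
        except OSError as exc:
            last_error = exc
    raise RuntimeError(f"read failed after {retries} attempts") from last_error


@contextmanager
def inject_io_errors(failures: int):
    """Assumed helper: replace read_block so its first `failures` calls raise OSError."""
    original = globals()["read_block"]
    remaining = failures

    def faulty(device: str, block: int) -> bytes:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise OSError("injected I/O error")
        return original(device, block)

    globals()["read_block"] = faulty
    try:
        yield
    finally:
        # Always restore the real function, even if the test body raises.
        globals()["read_block"] = original


if __name__ == "__main__":
    # Two injected errors are absorbed by the three-attempt retry loop.
    with inject_io_errors(2):
        data = read_with_retry("sda", 0)
        print("recovered, bytes read:", len(data))
```

The injection is precise rather than random: the test decides exactly how many calls fail, so both the case where the fault is tolerated (fewer failures than retries) and the case where it is not can be checked deterministically.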
