Research on Data De-duplication Technology in Network Backup


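Huazhong University of Science and Technology (华中科技大学) Doctoral Dissertation
Research on Data De-duplication Technology in Network Backup
Name: Yang Tianming (杨天明)
Degree applied for: Doctor
Major: Computer Architecture
Supervisor: Feng Dan (冯丹)
2010-08

Abstract (Chinese)

The rapid advance of technology and productivity is accelerating the generation of large volumes of high-value data, and the storage and backup demand for this data can reach the petabyte scale (PB, 10^15 bytes). Although data is growing explosively, studies show that duplicate data is abundant at every stage of information processing and storage, for example in file systems, e-mail attachments, web objects, operating systems, and application software. Traditional data protection techniques such as periodic backup, versioning file systems, snapshots, and continuous data protection further accelerate the growth of duplicate data, leading to shortages of network bandwidth and storage space and a rapid rise in data management costs. To curb excessive data growth, improve resource utilization, and reduce cost, data de-duplication has become a closely watched research topic.

Sustained data growth and the high continuity of applications place ever higher demands on backup performance. Implementing de-duplication in a large-scale network backup system must improve storage space efficiency while still ensuring good performance and scalability. This dissertation therefore focuses on de-duplication performance and scalability, studies large-scale de-duplication system architecture, metadata management, index maintenance, and high-performance data backup and restore, and reports the corresponding results.

Existing de-duplication techniques use a single-server architecture, scale poorly, and struggle to meet the needs of large-scale distributed backup. To address this, a hierarchical de-duplication system architecture for network data backup, based on centralized management, is proposed. In this architecture a single master server manages the whole system and multiple backup servers work in parallel. Data flows from the backup clients through the backup servers into the back-end storage nodes, which effectively separates control flow from data flow. Multi-layer data indexing separates logical data from the underlying physical data and supports high-performance hierarchical de-duplication as well as dynamic expansion of both the backup-server layer and the storage-node layer, giving the system good performance, manageability, and scalability.

Existing de-duplication techniques look up fingerprints globally while data is being written to the back-end storage system in order to eliminate duplicates. As the amount of backed-up data grows, the in-memory data structures used to accelerate fingerprint lookup consume more and more space, so the system scale is ultimately limited by server memory. To address this, a fingerprint filter based on small-scope detection is designed to pre-filter data during backup, eliminating the duplicate data produced by periodic backups, saving network bandwidth, and improving backup efficiency. The technique restricts fingerprint lookup to a job chain, so the memory overhead of a backup is independent of system scale. It also collects fingerprints during backup, which lets the system process the data centrally with a high-performance post-processing de-duplication algorithm and removes the impact of disk index lookups and updates on the application system. Experiments show that the technique eliminates most of the duplicate data in the backup stream, which both saves network bandwidth and reduces the amount of data that needs further processing in the background, improving overall system performance.

A post-processing de-duplication algorithm is proposed to process backup data centrally. The algorithm scans the disk index sequentially and processes a large number of fingerprints in one batch, which effectively eliminates the random disk I/O bottleneck of fingerprint lookup and index update. It uses fixed-size storage containers to preserve the logical order of new data blocks, supporting high-performance data restore, and uses a stateless routing algorithm to distribute the containers to the back-end storage nodes, supporting load balancing, data migration, and dynamic expansion of those nodes. Experiments show that, compared with current mainstream de-duplication techniques, the algorithm supports a larger physical system capacity at the same memory overhead and, more importantly, supports parallel operation across multiple servers and scales well.

Because the post-processing algorithm scans the chunk index (a disk index) sequentially to perform batched fingerprint lookup and index update, keeping the chunk index small at a given system scale is critical to system performance. No related work on the space utilization of the chunk index has been found so far. A disk hash table based on prefix mapping is therefore designed as the chunk index, which ensures good index scalability, and the overflow probability and space utilization of the chunk index are studied in particular. The study shows that index buckets of an appropriate size avoid excessive in-bucket fingerprint lookup cost while lowering the index overflow probability and raising the space utilization of the chunk index, which effectively reduces index storage overhead and improves index scan performance.

Keywords: data backup, data de-duplication, disk index, fingerprint lookup, index update, post-processing

Abstract

Today, the ever-growing volume and value of digital information have raised a critical and mounting demand for large-scale, high-performance data protection. The massive data needing backup and archiving has amounted to several petabytes and may soon reach tens, or even hundreds, of petabytes. Despite the explosive growth of data, research shows that a large amount of duplicate data exists in all aspects of information processing and storage, such as file systems, e-mail attachments, web objects, operating systems, and application software. Traditional data protection technologies such as periodic backup, versioning file systems, snapshots, and continuous data protection magnify this duplication by storing all of this redundant data over and over again. Due to the unnecessary data movement, enterprises are often faced with backup windows that roll into production hours, network constraints, and too much storage under management. To restrain the excessive growth of data, improve resource utilization, and reduce costs, data de-duplication has become a hot research topic.

Due to the continued growth of data and the high continuity of applications, it is very important to ensure that the system has good performance and scalability while performing data de-duplication in a large-scale network backup system to improve storage space efficiency. Therefore, our work mainly focuses on data de-duplication performance and scalability. A distributed hierarchical data de-duplication architecture based on centralized management is presented, and then metadata management, index maintenance, and scalable, high-performance data de-duplication technology are studied in detail. The main contributions of this dissertation include:

To overcome the shortcomings of existing de-duplication solutions, which obtain high backup performance but suffer from poor scalability in large-scale, distributed backup environments because of their single-server architecture, we present a distributed hierarchical data de-duplication architecture based on centralized management for network backups. The architecture supports a cluster of backup servers performing data de-duplication in parallel, and uses a master server, which handles job scheduling, metadata management, and load balancing, to improve the system's scalability. The data stream is transferred directly from the client to the backup server, de-duplicated in batch, and then sent to the back-end storage nodes, which effectively separates the control flow from the data flow. Multi-layer data indexing supports high-performance hierarchical data de-duplication and dynamic expansion of both backup servers and storage nodes, which gives the system good performance, manageability, and scalability.

Existing de-duplication technologies look up fingerprints across the whole system to eliminate duplicates while writing data to the back-end storage. As the amount of data grows, the memory overhead of accelerating fingerprint lookup keeps growing, and thus the system's physical capacity is inevitably limited by the amount of physical memory ...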
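The job-chain fingerprint filter described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the fixed-size chunking, the SHA-1 fingerprint, and the names `JobChainFilter`, `split_fixed`, and `fingerprint` are all assumptions made for illustration. The only point it tries to show is that the lookup scope is one job chain, so memory use follows the chain rather than the whole system, while every fingerprint is still collected for later post-processing.

```python
import hashlib


def fingerprint(chunk: bytes) -> bytes:
    """SHA-1 digest used as the chunk fingerprint (an assumption)."""
    return hashlib.sha1(chunk).digest()


def split_fixed(data: bytes, size: int = 4096):
    """Fixed-size chunking, chosen only to keep the sketch short."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class JobChainFilter:
    """Hypothetical filter whose lookup scope is a single job chain."""

    def __init__(self):
        # Fingerprints seen by earlier jobs in this chain only, so memory
        # use depends on the chain, not on the total system size.
        self.chain_fingerprints = set()

    def backup(self, data: bytes):
        """Return (chunks to transfer, fingerprints collected for this job)."""
        to_send = []
        collected = []
        seen_now = set()
        for chunk in split_fixed(data):
            fp = fingerprint(chunk)
            collected.append(fp)  # kept for later post-processing
            if fp in self.chain_fingerprints or fp in seen_now:
                continue  # duplicate within the chain: skip the transfer
            seen_now.add(fp)
            to_send.append(chunk)
        self.chain_fingerprints |= seen_now
        return to_send, collected


# Two periodic backups of nearly identical data in the same job chain.
chain = JobChainFilter()
monday = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
tuesday = b"A" * 4096 + b"B" * 4096 + b"D" * 4096
sent_1, fps_1 = chain.backup(monday)
sent_2, fps_2 = chain.backup(tuesday)
print(len(sent_1), len(sent_2))
```

In this toy run the second backup transfers only the one chunk that changed, while still recording all three fingerprints for the job. Duplicates that exist outside the chain are deliberately not detected here; in the abstract's design those are left to the post-processing stage.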

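The batched, sequential index scan and the prefix-mapped chunk index mentioned in the abstract can be sketched in the same spirit. Everything concrete below is assumed rather than taken from the dissertation: the in-memory list of buckets stands in for the on-disk hash table, `PREFIX_BITS` and `BUCKET_CAPACITY` are arbitrary small values, and the overflow set is just one simple way to handle a full bucket.

```python
import hashlib

PREFIX_BITS = 4        # 2**4 = 16 buckets; tiny on purpose
BUCKET_CAPACITY = 8    # fingerprints per bucket (assumed value)


def bucket_of(fp: bytes) -> int:
    """Map a fingerprint to a bucket by its leading PREFIX_BITS bits."""
    return int.from_bytes(fp[:4], "big") >> (32 - PREFIX_BITS)


class ChunkIndex:
    """In-memory stand-in for an on-disk, prefix-mapped hash table."""

    def __init__(self):
        self.buckets = [set() for _ in range(1 << PREFIX_BITS)]
        self.overflowed = set()  # fingerprints that found their bucket full

    def batch_dedup(self, fingerprints):
        """Resolve a whole batch in one pass over the buckets, in index order.

        Returns the fingerprints that were new to the index.
        """
        # Group the batch by bucket so each bucket is visited exactly once.
        pending = {}
        for fp in set(fingerprints):
            pending.setdefault(bucket_of(fp), []).append(fp)

        new = []
        for number, bucket in enumerate(self.buckets):  # the sequential scan
            for fp in pending.get(number, ()):
                if fp in bucket or fp in self.overflowed:
                    continue  # duplicate chunk, already indexed
                new.append(fp)
                if len(bucket) < BUCKET_CAPACITY:
                    bucket.add(fp)
                else:
                    self.overflowed.add(fp)
        return new


# Two batches whose fingerprints overlap for inputs 10..19.
index = ChunkIndex()
batch_1 = [hashlib.sha1(bytes([i])).digest() for i in range(20)]
batch_2 = [hashlib.sha1(bytes([i])).digest() for i in range(10, 30)]
new_1 = index.batch_dedup(batch_1)
new_2 = index.batch_dedup(batch_2)
print(len(new_1), len(new_2))
```

Grouping the batch by bucket number is what turns many random lookups into a single ordered pass. The bucket capacity is the knob the abstract refers to: larger buckets overflow less often but cost more to search, which is the trade-off the dissertation studies.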