History of massive-scale sorting experiments at Google [translation]


Original link: https:/ sorting-experiments-at-google
Author: Marian Dvorsky, Software Engineer, Google Cloud Platform

History of massive-scale sorting experiments at Google
Thursday, February 18, 2016

We've tested MapReduce by sorting large amounts of random data ever since we created the tool. We like sorting, because it's easy to generate an arbitrary amount of data, and it's easy to validate that the output is correct.

Even the original MapReduce paper reports a TeraSort result. Engineers run 1TB or 10TB sorts as regression tests on a regular basis, because obscure bugs tend to be more visible at a large scale. However, the real fun begins when we increase the scale even further. In this post I'll talk about our experience with some petabyte-scale sorting experiments we did a few years ago, including what we believe to be the largest MapReduce job ever: a 50PB sort.

These days, GraySort is the large-scale sorting benchmark of choice. In GraySort, you must sort at least 100TB of data (as 100-byte records with the first 10 bytes being the key), lexicographically, as fast as possible. The site sortbenchmark.org tracks official winners for this benchmark. We never entered the official competition.

MapReduce happens to be a good fit for solving this problem, because the way it implements reduce is by sorting the keys. With the appropriate (lexicographic) sharding function, the output of MapReduce is a sequence of files comprising the final sorted dataset.

Once in a while, when a new cluster in a datacenter came up (typically for use by the search indexing team), we in the MapReduce team got the opportunity to play for a few weeks before the real workload moved in. This is when we had a chance to "burn in" the cluster, stretch the limits of the hardware, destroy some hard drives, play with some really expensive equipment, learn a lot about system performance, and win (unofficially) the sorting benchmark.

Figure 1: Google Petasort records over time.

2007 (1PB, 12.13 hours, 1.37 TB/min, 2.9 MB/s/worker)

We ran our very first Petasort in 2007. At that time, we were mostly happy that we got it to finish at all, although there were some doubts that the output was correct (we didn't verify the correctness to know for sure). The job wouldn't finish unless we disabled the mechanism which checks that the output of a map shard and its backup are the same. We suspect that this was a limitation of GFS (Google File System), which we used for storing the input and output. GFS didn't have enough checksum protection, and sometimes returned corrupted data. Unfortunately, the text format used for the benchmark doesn't have any checksums embedded for MapReduce to notice (the typical use of MapReduce at Google uses file formats with embedded checksums).

2008 (1PB, 6.03 hours, 2.76 TB/min, 11.5 MB/s/worker)

2008 was the first time we focused on tuning. We spent several days tuning the number of shards, various buffer sizes, prefetching/write-ahead strategies, page cache usage, etc. We blogged about the result here. The bottleneck ended up being writing the three-way replicated output to GFS, which was the standard we used at Google at the time. Anything less would create a high risk of data loss.

2010 (1PB, 2.95 hours, 5.65 TB/min, 11.8 MB/s/worker)

For this test, we used the new version of the GraySort benchmark that uses incompressible data. In the previous years, while we were reading/writing 1PB from/to GFS, the amount of data actually shuffled was only about 300TB, because the ASCII format used in the previous years compresses well.
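The GraySort setup described above is easy to reproduce in miniature: fixed-size random records are trivial to generate, and sortedness of the output is trivial to check. A minimal sketch of that idea (my own illustration, not Google's actual test harness):

```python
import os

RECORD_SIZE = 100  # GraySort records are 100 bytes each
KEY_SIZE = 10      # the first 10 bytes are the sort key

def generate_records(n):
    """Generate n random fixed-size records, GraySort-style."""
    return [os.urandom(RECORD_SIZE) for _ in range(n)]

def key(record):
    """The sort key is the record's first 10 bytes."""
    return record[:KEY_SIZE]

def is_sorted(records):
    """Validate output: keys must be in lexicographic (byte-wise) order."""
    return all(key(a) <= key(b) for a, b in zip(records, records[1:]))

records = generate_records(1000)
output = sorted(records, key=key)   # stand-in for the distributed sort
assert is_sorted(output)
```

At benchmark scale the `sorted` call is of course replaced by the distributed MapReduce job; the generate-and-validate pattern is what makes sorting such a convenient stress test.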

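As the post notes, reduce sorts within each shard, so it is the lexicographic (order-preserving) sharding function that makes the concatenation of shard output files globally sorted. A toy sketch of that property, assuming uniformly distributed key bytes and hypothetical example keys (the partitioners used in the actual runs are not described in the post):

```python
def shard_for_key(key: bytes, num_shards: int) -> int:
    """Order-preserving sharding by leading key byte: if k1 <= k2
    lexicographically, then shard_for_key(k1) <= shard_for_key(k2)."""
    return key[0] * num_shards // 256

# Distribute some keys (hypothetical example data), reduce-sort each
# shard, then concatenate the shard outputs in shard order.
keys = [bytes([b]) + b"-rest" for b in (200, 13, 129, 64)]
num_shards = 4
shards = {i: [] for i in range(num_shards)}
for k in keys:
    shards[shard_for_key(k, num_shards)].append(k)

output = []
for i in range(num_shards):
    output.extend(sorted(shards[i]))  # reduce sorts within its shard

assert output == sorted(keys)  # the concatenation is globally sorted
```

A first-byte split like this only balances load if keys are uniform; real partitioners typically sample the key distribution to pick shard boundaries.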