Introduction to Algorithms, Lecture 10: Huffman Coding (lecture slides)

Uploader: E**** | Document ID: 91095368 | Uploaded: 2019-06-21 | Format: PPT | Pages: 43 | Size: 439.50 KB

Chapter 4: Greedy Algorithms. Topics: Huffman coding; single-source shortest paths.

Huffman Codes

Data compression via Huffman coding. Motivation: limited network bandwidth and limited disk space. Huffman codes are used for data compression, both to reduce the time to transmit large files and to reduce the space required to store them on disk or tape. Huffman codes are a widely used and very effective technique for compressing data; savings of 20% to 90% are typical, depending on the characteristics of the data.

Problem

Suppose that you have a file of 100K characters. To keep the example simple, suppose that each character is one of the 6 letters from a through f. Since we have just 6 characters, we need just 3 bits to represent a character, so the file requires 300K bits to store. Can we do better? Suppose that we have more information about the file: the frequency with which each character appears.

Solution

The idea is to use a variable-length code instead of a fixed-length code (3 bits for each character), with fewer bits to store the common characters and more bits to store the rare characters.

Example of Data Compression

For example, suppose that the characters appear with the following frequencies and are assigned the following codes. (The slide's frequency/code table is not preserved in this transcript.) Then the variable-length coded version will take not 300K bits but 45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4 = 224K bits to store, a 25% saving. In fact this is the optimal way to encode the 6 characters present, as we shall see.

Characteristics of an Optimal Code

An optimal code can be represented as a binary tree whose leaves are the given characters. In an optimal code, each non-leaf node has two children.

Prefix-Free Codes

In a prefix code, no codeword is a prefix of another codeword. This makes both encoding and decoding easy. To encode, we need only concatenate the codes of consecutive characters in the message. To decode, we have to decide where each code begins and ends; this is easy, since no codes share a prefix. For example, the string 110001001101 parses uniquely as 1100-0-100-1101, which decodes to FACE.

Example of Huffman Codes

(Two figure slides; the diagrams are not preserved in this transcript.)

Algorithm of Huffman Coding

The greedy algorithm for computing the optimal Huffman coding tree T is as follows. It starts with a forest of one-node trees representing each c ∈ C, and merges them in a greedy style, using a priority queue Q sorted by smallest frequency. (The pseudocode slide is not preserved in this transcript.)

Huffman Coding Example, Step I

Assume that the relative frequencies are A: 40, B: 20, C: 10, D: 10, R: 20 (simpler numbers than the real frequencies). The smallest numbers are 10 and 10 (C and D), so connect those.

Huffman Coding Example, Step II

C and D have already been used, and the new node above them (call it C+D) has value 20. The smallest values are now B, C+D, and R, all of which have value 20. Connect any two of these.

Huffman Coding Example, Step III

The smallest value is R's 20, while A and B+C+D both have value 40. Connect R to either of the others.

Huffman Coding Example, Step IV

Connect the final two nodes.

Huffman Coding Example, Step V

Assign 0 to left branches and 1 to right branches. Each encoding is a path from the root: A = 0, B = 100, C = 1010, D = 1011, R = 11. Each path terminates at a leaf. Do you see why encoded strings are decodable?

Correctness of Huffman Coding

Definition: the cost of the tree T. For each character c in the alphabet C, let f(c) denote the frequency of c in the file and let d_T(c) denote the depth of c's leaf in the tree. Note that d_T(c) is also the length of the codeword for character c. Given a tree T corresponding to a prefix code, it is then a simple matter to compute the number of bits required to encode a file:

    B(T) = Σ_{c ∈ C} f(c) d_T(c).    (1)

Lemma 1. Let C be an alphabet in which each character c ∈ C has frequency f(c). Let x and y be two characters in C having the lowest frequencies. Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit.

Proof. The idea of the proof is to take the tree T representing an arbitrary optimal prefix code and modify it to make a tree representing another optimal prefix code such that the characters x and y appear as sibling leaves of maximum depth in the new tree. If we can do this, then their codewords will have the same length and differ only in the last bit.

Let a and b be two characters that are sibling leaves of maximum depth in T. Without loss of generality, assume that f(a) ≤ f(b) and f(x) ≤ f(y). Since f(x) and f(y) are the two lowest leaf frequencies, in order, and f(a) and f(b) are two arbitrary frequencies, in order, we have f(x) ≤ f(a) and f(y) ≤ f(b). As shown in the figure (not preserved in this transcript), we exchange the positions in T of a and x to produce a tree T′, and then we exchange the positions in T′ of b and y to produce a tree T″. By equation (1), the difference in cost between T and T′ is

    B(T) − B(T′) = (f(a) − f(x)) (d_T(a) − d_T(x)) ≥ 0,

since f(a) ≥ f(x) and a is a leaf of maximum depth, so d_T(a) ≥ d_T(x). Exchanging b and y likewise does not increase the cost, so T″ is also optimal, and in T″ the characters x and y are sibling leaves of maximum depth, as required.
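The slide's arithmetic can be checked directly. A minimal sketch, assuming the letter-to-frequency assignment of the standard textbook example (a:45, b:13, c:12, d:16, e:9, f:5, in thousands); the slide's own table is not preserved in this transcript, so that assignment is an assumption:

```python
# Frequencies (in thousands of characters) and codeword lengths matching the
# slide's sum 45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4. The letter assignment
# is assumed, not read from the slide.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
code_len = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}

fixed = 3 * sum(freq.values())                       # 3 bits per character
variable = sum(freq[c] * code_len[c] for c in freq)  # variable-length cost
saving = 1 - variable / fixed
print(fixed, variable)  # 300 224, i.e. roughly a 25% saving
```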
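The decoding rule for prefix-free codes (scan left to right; the first codeword match is always correct, because no codeword is a prefix of another) can be sketched as follows. The four-entry code table is inferred from the slide's parse 1100-0-100-1101 → FACE and is only a fragment of a full code:

```python
def decode(bits, code):
    """Decode a prefix-free bitstring by scanning left to right and
    emitting a character whenever the buffered bits match a codeword."""
    inverse = {codeword: ch for ch, codeword in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    assert buf == "", "input ended in the middle of a codeword"
    return "".join(out)

# Codewords inferred from the slide's worked parse (an assumption beyond
# those four letters).
code = {"A": "0", "C": "100", "E": "1101", "F": "1100"}
print(decode("110001001101", code))  # FACE
```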
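The greedy merge procedure can be sketched with Python's heapq standing in for the priority queue Q. This is an illustrative implementation, not the slides' pseudocode; heap tie-breaking may yield a different tree than the slides' A = 0, B = 100, C = 1010, D = 1011, R = 11 assignment, but every Huffman tree for these frequencies has the same total cost:

```python
import heapq

def huffman_depths(freqs):
    """Greedy Huffman construction: start with a forest of one-node trees
    (one per character) keyed by frequency, and repeatedly merge the two
    cheapest trees. Returns each character's leaf depth, i.e. its codeword
    length. A running tie counter keeps heap entries comparable."""
    heap = [(f, i, {c: 0}) for i, (c, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging pushes every leaf in both subtrees one level deeper.
        merged = {c: d + 1 for tree in (left, right) for c, d in tree.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# The worked example from the slides: A:40 B:20 C:10 D:10 R:20.
freqs = {"A": 40, "B": 20, "C": 10, "D": 10, "R": 20}
depths = huffman_depths(freqs)
cost = sum(freqs[c] * depths[c] for c in freqs)
print(cost)  # 220 bits per 100 characters, same as the slides' tree
```

The slides' tree (A at depth 1, R at 2, B at 3, C and D at 4) also costs 40*1 + 20*2 + 20*3 + 10*4 + 10*4 = 220, which is why "connect any two of these" is safe when frequencies tie.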
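The exchange step in the proof of Lemma 1 can be sanity-checked numerically with the cost function of equation (1). The frequencies and depths below are made-up illustrations, not from the slides: x has the lowest frequency but sits on a shallow leaf, while a is a deepest leaf.

```python
def cost(freq, depth):
    """B(T) = sum over c of f(c) * d_T(c), equation (1) from the slides."""
    return sum(freq[c] * depth[c] for c in freq)

freq = {"x": 5, "a": 20, "b": 30}        # hypothetical frequencies
depth_T = {"x": 1, "a": 3, "b": 3}       # before exchanging x and a
depth_T1 = {"x": 3, "a": 1, "b": 3}      # after exchanging x and a
diff = cost(freq, depth_T) - cost(freq, depth_T1)
# diff equals (f(a) - f(x)) * (d_T(a) - d_T(x)) = (20-5)*(3-1) = 30 >= 0,
# so the exchanged tree is no more expensive than the original.
print(diff)  # 30
```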
