字符串的编码

2015-02-10

字符串的编码

字符串的编码是一个生活中随处都要使用到的问题，本篇文章将从最基本的编码方式说起。
最简单的办法可在每个单词前纪录单词长度，并用特殊字符分割即可。

1	def encode1(words): res = "" for word in words: res += str(len(word)) + ':' + word return res

有时可能找不到一个合理的分隔符，同时此编码方式长度部分并没有表示字符串本身值，实际上是浪费的。

为了解决上面两个问题，使用两个特殊字符来标示单词之间的分割，如Linux采用\t\n分割。
当字符中出现\t时，需在其后面再补上\t。

def encode2(words):
    res = ""
    for word in words:
        for c in word:
            if c != '\t':
                res += c
            else:
                res += '\t\t'
        res += '\t\n'
    return res


def decode2(s):
    word, words = "", []
    i = 0
    while i < len(s):
        if s[i] != '\t':
            word += s[i]
            i += 1
        else:
            if s[i+1] == '\t':
                word += '\t'
            elif s[i+1] == '\n':
                words.append(word)
                word = ""
            else:
                return []  # invalid
            i += 2
    return words

上述编码还可进行一个简单的优化，将重复的字符用数字表示。
如aacccd -> a2c3d

# compress repeat word, say aaabbc -> a3b2c
def encode3(words):
    res = ""
    for word in words:
        i, start, n = 1, 0, len(word)
        for i in range(1, n+1):
            if i == n or word[i] != word[start]:
                if i-start > 1:
                    res += str(i-start)
                if word[start] != '\t':
                    res += word[start]
                else:
                    res += '\t\t'
                start = i
        res += '\t\n'
    return res


def decode3(s):
    i, n, words = 0, len(s), []
    word = ""
    while i < n:
        num = 0
        while i < n and s[i].isdigit():
            num = num*10+int(s[i])
            i += 1
        if num == 0:
            num = 1  # single character
        if s[i] != '\t':
            word += s[i] * num
            i += 1
        else:
            if s[i+1] == '\t':
                word += '\t' * num
            else:  # '\t\n'
                words.append(word)
                word = ""
            i += 2
    return words

此前的编码方式使得每个字符占用了8个Bit位，我们可以对字符进行压缩。
可以证明使用Huffman对字符进行编码，每个字符的平均编码长度是最短的。
Huffman编码的步骤为：

统计文章每个出现字符词频
建立字符的huffman tree
根据树上的位置，算出字符编码
编码原字符串

let paragraph = abaaccd
1. count the frequency of each character
a -> 6, c -> 3, b->1, d->1

2. build huffman tree

                    (11, ((b, d), c)), a)  
            (5, (b,d), c)               (6, a)           
    (2, b, d)   (3, c)
(1, b)    (1, d)

3. calculate the code for each character
calculate code from root to bottom
1) split root to  ((b, d), c), a => a -> 1 
2) split((b,d), c) to (b,d), c => c -> 01
3) split(b, d) to b, d => b -> 000, d -> 001
4. encode the origin paragraph
1 000 1 1 01 01 001

解码时，逐一读取编码，检查当前编码是否在词表中，如果在编码解码，否则继续读。

1	def decode(codes, dict): paragraph, code = "", "" for c in codes: code += c if code in dict: paragraph = dict[code] code = "" return paragraph

具体代码
在日常使用中，我们还经常看到gz, lz, sz编码，这些编码方式不再是针对单个字符进行编码，而是针对词组和段落进行压缩。
它们的思想是缓存历史中出现字符段落，寻找当前串中是否存在历史中出现，如果是则用标记标示，从而对一个子串进行压缩。
历史的段保存得越多，编码压缩率越高，但占用的内存越大，编码解码速度也越慢。
具体介绍可参考wiki