It’s been almost a year since I had to code in anger. I wanted to do some compression on an XML document by shortening element and attribute names to the bare minimum. The documents I’m compressing contain an unknown number of element names (could be 5, could be 30,000).
To do this, I decided to use a Huffman encoding for the element names, then convert them to an allowable name. I’m not going to go in to the Huffman bit, so for the sake of simplicity let’s call the first element “1”, the second element “2”, the third element “3”, and so on.
The problem is that name elements in this way doesn’t work very well for two reasons:
- You can’t start an XML element name with a number.
- Once you’ve hit your tenth element, you’re now in to two character element names; i.e. we’ve not used all the available characters.
So, I decided to encode them in Base 52, using the “digits” ABCDEFGHIJKLMNOPQRSTUVWYXZabcdefghijklmnopqrstuvwxyz. So the first element would be “A”, then “B”, then the 27th would be “a”, the 53rd “AA”, then “AB”, etc.
Of course, I could be more clever and allow digits or valid punctuation characters in positions other than the first, but that would be too much for a Tuesday afternoon. Also, I guessed at the valid starting characters for an XML element. When I go and look it up and find I made a mistake, what then? What if I wanted to change it? Or move to a completely different base, like binary, for another project?
Anyway, without further ado, the method is as follows:
public string GetEncoding(long numberToEncode, string vocab) { int numberBase = vocab.Length; long total = numberToEncode; StringBuilder encodedNumber = new StringBuilder(); int columnNumber = 0; while (Math.Pow(numberBase,columnNumber+1) <= total) columnNumber++; while (total > 0 || columnNumber>=0) { int colValue = (int) Math.Pow(numberBase, columnNumber); encodedNumber.Append(vocab[((int)total / colValue)]); total -= ((total / colValue) * colValue); columnNumber-- ; } return encodedNumber.ToString(); }
This can be used for any base you like, using any single character representations for digits. For example:
GetEncoding(0, "01"); GetEncoding(1, "01"); GetEncoding(2, "01"); GetEncoding(3, "01"); GetEncoding(4, "01"); GetEncoding(5, "01"); GetEncoding(5, "0123456789"); // Yes, I'm aware how boring this is, but it's a good test GetEncoding(27, "0123456789"); GetEncoding(546, "0123456789"); GetEncoding(16, "0123456789ABCDEF"); // Hexadecimal GetEncoding(32, "0123456789ABCDEF"); GetEncoding(45243256, "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
Rather than attempt to explain how to use it, here are some sample invocations:
… and the outputs:
0 1 10 11 100 101 5 27 546 10 20 GJnyg