Persistence and Structure Sharing
Hash Tables and Tries
- Mutable Hash tables are often the natural choice for retrieval based data structures, but do not play well with persistent collections — linked data structures are needed, so that the new versions can share structure with the prior versions.
- In Clojure, the 3 basic collection types are all implemented using the idea of structure sharing across a link-based trie – a trie is simply a tree structure with values at leaves and keys label paths from root to leaf.
Bagwell’s Hash Array Mapped Tries
A trie has all the values stored in its leaves. Picking the right branch is done by using parts of the key as a lookup. Consequently, a trie may have more than two branches.
Clojure’s Persistent Vector is implemented as trie where the indices of elements are used as keys. Of course we must split up the index integers in some way– this is done using digit partitioning or its faster sibling, bit partitioning. In Clojure implementation we have as many as 32 branches at each node (ref: Bagwell’s HAMT, see http://lampwww.epfl.ch/papers/idealhashtrees.pdf)
Digit Partitioning for Tries
- Digit partitioning means that we split up the key into digits, which we then use as a basis for populating a trie. For instance, we can split up the key 9128 to [9, 1, 2, 8], and put an element into a trie based on that. We may have to pad with zeroes at the front of the list, if the depth of the trie is larger than the size of the list.
- We can also use whatever base we would like, not just base 10. We would then have to convert the key to the base we wanted to use, and use the digits from the conversion. As an example, consider 9128 yet again. 9128 is 35420 in base 7, so we would have to use the list [3, 5, 4, 2, 0] for lookup/insertion in the trie.
Bit Partitioning in HAMT
- Digit-partitioned tries would generally have to do a couple of integer divisions and modulo operations. Doing this is on every branch we must take is a bit time consuming.
- In Bit-partitioned tries all digit-partitioning is a power of two (usually 32) . With some knowledge of bit manipulation, we can remove those costly arithmetic operations, as well as compress an array of link pointers.
- We split keys into chunks of bits with some predefined size. For 32-way branching tries, we need 5 bits in each chunk. Finding the link for a particular chunk means finding the associated bit in an integer bit map and counting the number of preceeding 1s –
(one CountPopulation instruction)
Extensions to Persistent Trees
- Trees are not a standard collection type in Clojure, however many applications make explicit use of tree structures, eg. XML.
- A zipper is a popular way to modify a tree in a persistent and functional way. Idea is to “drill down” a branch in an existing tree and perform a local edit, without having to tediously rebuild the upper areas of the tree again.
- Assume the tree is stored as nested vector. We can “drill down” the zipper with “down” and “right” to reach the node we want to edit. We call “root” to convert the zipper back into nested v.
- Common practice to use the -> macro, which recursively inserts each expression as the first argument of the next expression, as follows:
Discussion
No comments yet.