More rdf bits
I've spent a little more time on quad-store, and found some more realistic performance figures from Franz Inc and the ESW wiki.
There's now a reasonably fast in-memory sorting (using radix 256), around 2.2s/10 million triples, which should help indexing small stores. I also have had a look at MPI for linking nodes together.
I think the premise may well hold - as the limiting factor for growth is IO and memory rates, not CPU, so trade CPU for compression and limit the number of values a store can hold, so making the store's representation smaller, reducing IO. Structure for locality and divide the store into nodes when the limits are exceeded. I still haven't found a large enough dataset to push the 230 values it can hold. And 230 values is up to 290 triples, as each triple has 3 values. 2 bits of the 32 are reserved to indicate whether it's a compressed URI, or a hash table URI, a blank or an literal.
OpenLink Virtuoso also seem to be using something similar in their 'bitmap indexes', though limiting to 32 bit values gives you a set of sorted b-trees of blocks of quads, rather than indices - in a pure index you have to look up the value, so incur an indirection.
I'm not yet bored with playing with this, as I am learning quite a bit about how bandwidth limited systems behave, whereas most of my previous work has been cpu limited numerical stuff, distributed services, or km and visualisation apps.
Pete
Labels: hpc, programming, quad-store, rdf