I can’t go into too many specifics, but basically I need to make an index for a large amount of data based on multiple dimensional requirements. Most of these dimensions have multiple ranges which may overlap within that dimension. The index itself is static as far as the system is concerned. It is built outside of the system, and will not change when being used as part of the system. Due to this, any multi-dimension tree based off of B-trees is probably a waste of space since there will be no need to optimize insertions and deletions. The index will be used only for searching. The data “point” that will be queried will consist of discrete values only, no ranges, so the tree doesn’t need to worry about optimizing ranged quiries, only value queries.

So to sum up what I’m looking for is:

Multiple Dimensions
Each dimension will use a range for it’s key value rather than a discrete value
Insertions and Deletions are not a consideration in performance, only Searching
Searches will be performed with discrete values only
The more compact the indexing system, the better.

Anyone got any recommendations of trees I should consider? I’ve looked at a few various multi-dimensional trees (e.g. SH-trees, KD-trees, SKD-trees) but many seem to be based off of a B-tree, which means there’s going to be wasted space.

Not sure what your definition of “large” is and whether you are open to commercial options, but a product I’ve used and have been extremely happy with in the past is Pivotlink from Seatab software. It’s an analytical tool, but I’ve used it at various companies for it’s searching capabilities in addition to traditional OLAP summarization stuff.

They have a highly compressed indexing scheme allowing for complete indexing on all dimensions over hundreds of millions (and I believe billions) of denormalized rows with sub-second response time. All of this with no building of cubes, etc.

Ranges were not a problem when I’ve used the product in the past.

I am a bit rusty on data structures but if you’re not going to be deleting and inserting, why can’t you just prune a B-tree after it’s created. If I remember correctly, the wasted spaces with B-trees comes from pre-allocation for higher order trees, right? So why not store the size in the node and compress the tree?

Unfortunately COTS products are not an option. I’m going to be programming all the algorithms for building and searching the index myself (well, me and another guy will be). And by large, I mean “a number you probably couldn’t count to within your lifespan” large.

AFAIK, the wasted space in B-tree is designed to allow automatic balancing upon insertion and deletion. And B-trees themselves only work for one dimension. An ISAM uses the identical search algorithm to a B-tree. It essentially is what you would get when you prune the wasted space from a B-tree, without having to prune it.

But ISAM is only one-dimension as well.

I’ve yet to find any multi-dimensional B-tree derivative that includes pseudocode for a pruning algorithm. I wouldn’t know where to even start pruning one.

Are you using the term “dimension” the same as in data warehousing where it basically maps to a field? I was a little confused by your statement that ISAM only uses one dimension, which to me implies the index is built on 1 field only, unless you are considering a specific combination of fields to represent one dimension?

Sorry about that. Yes, each dimension is essentially the equivalent of a field in a relational DBMS. I was making the point that pruning a B+ tree essentially gives you an ISAM tree with no overflow pages. The result may be different in mulidimensional/spatial trees, but I’ve never worked with them before, so I don’t know.

Well I’m going to be no help to you on this problem as I’ve been googling and I see that there is quite a bit of research in an area that I naively assumed to be solvable by simply taking the intersection of querying multiple independent indexes (each dimension has it’s own index).

Can you help me understand what a query into a multi-dimensional index looks like (again, I keep picturing multiple ranges or selection criteria each related to a specific dimension with the intersection of the results as the final result)?

Is the multi-dimensional index designed to achieve the same goal as many 1 dimensional indexes but without the requirement to intersect the results at the end?