The World's Most Advanced Lexicon-Data-Structure

Hello,

Over the past few years, I've conducted some rather thorough R&D in the field of lexicon-data-structure optimization.

A Trie is a good place to start, followed by a traditional DAWG.

Smaller means faster, but a traditional DAWG encoding operates as a Boolean graph: it can answer whether a word is in the lexicon, but it cannot map a word to an index within it.
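
For concreteness, here is a minimal sketch of that Boolean behavior. The node layout (a full child array per node) is a simplification for illustration, not any particular DAWG encoding:

#include <stdbool.h>

#define ALPHABET 26

struct DawgNode {
    struct DawgNode *child[ALPHABET]; /* NULL where no edge exists */
    bool end_of_word;                 /* marks an accepted word    */
};

/* Membership only: the traversal can say yes or no, but it has no
 * way to report which word number was matched. */
static bool dawg_contains(const struct DawgNode *node, const char *word)
{
    for (; *word; ++word) {
        node = node->child[*word - 'a'];  /* assumes lowercase a-z */
        if (node == NULL)
            return false;                 /* path falls off the graph */
    }
    return node->end_of_word;
}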

It became clear to me that the world's most powerful lexicon data structure would incorporate postfix compression while eliminating the need to scan through child lists in alphabetical order. Further, the graph would operate as an incremental (perfect and complete) hash function, mapping each word to a unique index.
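
One standard way to get such an index out of a word graph is to store, at every node, a count of the words its subgraph accepts, and to sum the counts of the branches skipped during traversal. A sketch of that counting trick follows; the field names are illustrative, not the CWG's actual encoding:

#include <stdbool.h>

#define ALPHABET 26

struct HashNode {
    struct HashNode *child[ALPHABET]; /* NULL where no edge exists       */
    unsigned words_below;             /* words accepted in this subgraph */
    bool end_of_word;
};

/* Returns a unique index in [0, N) for a member word, or -1 if absent.
 * The index is the word's position in alphabetical order, because we
 * count exactly the words that sort before it.  Assumes lowercase a-z. */
static long word_to_index(const struct HashNode *node, const char *word)
{
    long index = 0;
    for (; *word; ++word) {
        if (node->end_of_word)
            index++;                  /* a word ending here sorts first */
        for (int c = 0; c < *word - 'a'; ++c)
            if (node->child[c])
                index += node->child[c]->words_below; /* skipped branches */
        node = node->child[*word - 'a'];
        if (node == NULL)
            return -1;
    }
    return node->end_of_word ? index : -1;
}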

After a great deal of deep thinking, and many sessions of careful reckoning, I put together exactly that. I call it the Caroline Word Graph, or CWG, and have published the documentation on a web page (the DAWG page has been updated as well):

CWG
DAWG

Please inform me if you have encountered a similar construct.

All the very best,

JohnPaul Adamovsky

Some of the early natural-language packages for C used compression exploiting the null-terminated string: find short strings that are suffixes of other strings, so "1234" might be stored while "234", "34", "4", and "" were just offset pointers into "1234". While not that great for compressing long strings, it was great for sets with many short strings.
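
For illustration, a minimal sketch of the trick with a single stored string: because of the null terminator, a pointer into the middle of the stored bytes is itself a valid C string, so the suffixes cost no extra storage:

#include <stdio.h>

int main(void)
{
    static const char pool[] = "1234";  /* one stored copy            */

    const char *s1234 = pool;           /* "1234"                     */
    const char *s234  = pool + 1;       /* "234", no new bytes        */
    const char *s34   = pool + 2;       /* "34"                       */
    const char *s4    = pool + 3;       /* "4"                        */
    const char *empty = pool + 4;       /* "", just the terminator    */

    printf("%s %s %s %s [%s]\n", s1234, s234, s34, s4, empty);
    return 0;
}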

I was working on a high-performance container a while back and came up with a byte-tree, where the first byte was a lookup into an array of pointers, or a similar structure, to quickly traverse an invariant tree one byte of key at a time. Various alternate node types dealt with compression: a 'next-n-bytes-must-be' node to swallow invariant regions of a key; a truncated array of fewer than 256 cells, with a base and size; a dumb list lookup leveraging strchr(), pairing a string of the live key letters with a like-length array of pointers; and an N-copies-of node for duplicates.

The advantages: quick insert, sorted access, no rebalancing, quick access. Linear hashing is cute, but if you are not sure of the data's key distribution, it is dicey to go all the way to one key per bucket, so how much linear search are you willing to accept?
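
A sketch of two of those node shapes, under assumed names: the full 256-way fan-out node, and the strchr()-based list node that pairs a string of live key bytes with a like-length pointer array:

#include <string.h>

enum NodeKind { FULL_FANOUT, LIST_LOOKUP };

struct Node {
    enum NodeKind kind;
    union {
        struct {                     /* one slot per possible byte   */
            struct Node *child[256];
        } full;
        struct {                     /* sparse node: few live bytes  */
            const char   *keys;      /* e.g. "aqz", NUL-terminated   */
            struct Node **children;  /* strlen(keys) pointers        */
        } list;
    } u;
};

/* Descend one byte of the key, or return NULL if absent. */
static struct Node *step(const struct Node *n, unsigned char byte)
{
    if (n->kind == FULL_FANOUT)
        return n->u.full.child[byte];

    /* strchr() also matches the terminating NUL, so guard against a
     * zero key byte producing a false hit at the end of the string. */
    const char *hit = byte ? strchr(n->u.list.keys, byte) : NULL;
    return hit ? n->u.list.children[hit - n->u.list.keys] : NULL;
}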