Ankola
BAN USERdef histogram(s):
fm = {}
for c in s:
fm[c] = fm.get(c, 0) + 1
return fm
def find_ana_indices(hays, needle):
fneedle = histogram(needle)
for i in range(len(hays) - len(needle) + 1):
if histogram(hays[i:i+len(needle)]) == fneedle:
print i
find_ana_indices('abcdbacb', 'abc')
Each node of a trie would have 26 pointers, so 26 * 8 < 2^5 * 2^3 = 2^8 bytes. Assuming the length of the longest word you'll have to index is 32 (Wiki has longer "coined" words but we can ignore them for this discussion). The max number of nodes we'll have in the full non-compressed trie is 2^6 - 1. So the total mem consumption is 2^14 bytes. You'll also have to store the positions of the word in the trie leaf node. Now if you have millions of words (1M < 2^20), you need to have a list of ints (2^4 bytes) to store the indices. Assuming that a word appears 1000s of times, you have the list length of 2^7
This scheme has the following mem footprint:
- 2^8*2^6*2^4*2^7 = 2^25 = 32G
Which is not a lot. However it can be sigificantly reduced if we notice that a major contributor to this is the storage of the list of ints. This can be taken off this datastructure.
The alternate scheme is, we split the "index" and the "occurence" into two bits. Index is a trie that points into the occurrences list. The pointer points to a linked list stored in an implicit datastructure like a contigous array (i.e something like allocate a block of 32, 128, 512.. ints, the last one is used to point to the next linked list, this block of 32, 128... ints are the indices of the occurence of the word).
The scheme above has the advantage that the "index" is sort of fixed size and depends on your dictionary, the occurence grows ammortized linear as the length of array increases. You could theoretically scale the occurence datastructure horizontally and have some service etc to look up the same. You should also use a bloom filter to quickly return empty arrays instead of searching the trie.
The scheme sounds elaborate but lucene uses similar idea to mantain its index (cant find source, actually not even very sure, but I vaguely remember that it does). It also uses delta encoding for the ints, (i.e. occurence of something is say 100, 128, 132, 158, 1158... is encoded as 100, 28, 4, 26, 1000 and so on to save some space).
- Ankola December 09, 2017