alexis.tejeda
BAN USERTaken in count that a username is mostly in ASCII (4 byte per char) with a max length of seven (standard), this is for a billion:
4 * 7 * 1000000000 = 26.077 GB
In a 32 bit architecture system with more than 4 GB, e.g.: 32 and a OS Gnu/Linux based using kernel pae, split in chunks of 2 GB (13 portions of the file / external sort) to processed in parallel by 13 (not forked) process, MapReduce is one approach.
Trie-tree might be one option to "compress" these files, e.g.: that the tree root is a->b->c->...->z (using just from a, z) and the so on (recursively)..
Based on that the username can use any of the 24 chars:
As a 24-char-tree, 0, 24^1, 24*24..., total number of nodes is nodes = (k^(h+1))/(k-1), with h = 6 (using 7 chars) and k = 24, space = bytes2GB(nodes * 4) = 3.46 GB as max, also can't be used in a 32 bit architecture (max allocable is around 3.2)... not entirely true, due is just using 7 chars it will never reach the 3.4GB. The complexity to search a username is based on the tree height.
Another approach can be by using bit manipulation and bit vectors (an int) to identify the char to be used, e.g.: 0000 0000 0000 0000 0001 0101, a,c,e are used, this will improve the max memory used up to 3 bytes per node, but is using a integer will use 4 bytes anyway, and store them in an array int[10^9] using 3.7 GB, having in mind to navigate the bits over the array as a tree.
IMHO
It is an union-find problem, use the union-find algorithm.
- alexis.tejeda October 08, 2014