Google Interview Question
Software Engineer / DevelopersCountry: United States
Interview Type: In-Person
Cant agree with you on the distribution of tries, you can have like aa to machine 1, ab to machine 2 and so on. That gives you 26*26 machines and each can know abt the other just like hashmaps. (Ofcourse, the special case of the word 'a' etc that can be stored specifically in say the first machine aa)
A trie stores only one 'instance' of a common prefix of several strings. Thus it saves memory that is consumed by 'common' prefixes of strings.
What do you mean by 'instance'? For instance, for input strings abc and abcde, common prefix a->b->c, this sequence of nodes will only store once, is that what you mean? But if the input size is extremely large, it will eventually run out of memory, won't it?
Implicit in your question is the premise that a hash table somehow handles running out of memory. A trie handles this the same way a hash table does -- it doesn't.
@GZ: Let's review what it means to run out of memory. It means that we need additional space to store some data we need to store, but we just don't have that extra memory. We're out of physical resources.
Your question is like asking "I have a warehouse and I need to put another box in it, but the warehouse is completely full already. How can I handle this case?" You're physically out of space. Your options when you run out of RAM are analogous to the options you have here:
1. Get a bigger warehouse (buy new RAM) or annex a new section to your warehouse (add a stick of RAM). Then you can store your box (data) and possibly many more boxes.
2. Compress the boxes (data) already in the warehouse (memory), if they are compressible. Just like physical goods, not all data is very compressible. Doing this means you might spend extra time compressing and decompressing boxes (data) when putting them in and when retrieving them. Depending on how you compress, it might also make your warehouse less organized, making it harder and/or slower to locate items.
3. Discard some of the old items. This only works if you have something that can be thrown away. If there's a requirement to keep all the old stuff around, there's nothing you can do here.
4. Put whatever you can't fit into your warehouse (RAM) into secondary storage (hard disk). The secondary location may be far away, so placing goods there might take much longer than it would to place them in your warehouse. Retrieving items from secondary storage might also take longer for the same reason. Disk is very slow compared to RAM.
actually I am not so understand the question.
does this question is ask how to make trie tree become scalability if there has so much long string?
trie structure can handle huge string input with short length. but if the string length is large, which will make trie become buge,
so in this case, we can build sub trie on sub machine,
First, if the dictionary needs to support Range Query, like: "fetch all words starts from a*", the HashTable is not sufficient.
So, let's say the dictionary only needs to support Exist or Not.
Second,
Case 1: The index can be kept in main memory.
Trie: Average Time Cost: O(n), n is the average length of words. Space: O(N * n), N is the number of words.
HashTable: Average Time Cost: O(1). Space: O(N), N is the number of words.
(Both supports Parallel Query)
Case 2: The index cannot be kept in main memory, but can kept in 1 PC's Disc.
Then, Disc read is the major source of time cost.
Trie: O(n) - k, (k is the level of trie that can keep in memory)
HashTable: O(1)
Case 3: The index cannot be kept in main memory, but can kept in a distributed memory on K machines. (assume that there is 1 master machine)
Trie: O(n)
HashTable: O(1) + O(1)
Case 3: The index cannot be kept in main memory, but can kept in a distributed file system on K machines. (assume that there is 1 master machine)
Trie: O(n)
HashTable: O(1)
In summary, in all the cases, HashTable is more scalable than Trie, if only Exist() should be supported.
You can easily distribute hash table into hundreds/thousands machines by assigning each machine a key range. Each machine will know what other machine is responsible for the give key. So any key can be looked up with at most two machine forwards.
- AK February 09, 2014You cannot easily distribute Trie, since there is no easy way of splitting data and looking up efficiently. If you want to specify range by first letter, then you have only 26 options (assuming only letters) and it doesn't handle cases when data fits only in thousands machines. You may have to use first 10 letters to locate a machine, then each machine has to keep this big machine table in memory which further reduces memory availability.