Amazon Interview Question
Software Engineer / DevelopersCountry: United States
Interview Type: In-Person
How does Trie solve the follow-up issues? Like how does it handle sclability and memory issues? Also, regarding Trie's space complexity, for words share common prefix like abc and abcde, even though they share common prefix in trie, doesn't it still store two values for both abc and abcde?
Tries requires a lot more memory because of indexing arrays in every node. It is not so if a trie is really dense and every node has close to the alphabet links. In that case we just use all redundant space.
Well, each trie node just keeps an array of size 26 (one for each letter) or a hashtable<Character, TrieNode> whose size is 26 or less. I don't think it will require any additional space? In the case of hashtable implement, for input abc, abcd, abcde, abcdef, abcdefg, the common prefix abc is stored five times, whereas Trie will only store abc once.
You can use a function like crc32("abc") which will give you an integer value for your word.
You may chose a server to store or lookup your word as crc32("abc") % (number of your servers)
However, there is a problem with this approach. If you add new servers to your app, you have to rearrange your words by servers again...
Store dictionary using B+ tree in a file (same as B+ tree index built in Database). Each page of a file will represent a node of a B+ tree. This will have high fan out of the Tree (for example if page size of a file is 4 KB and if we use 32 bit number to represent page number( pointer ) each node will have 4KB/ 4 B = 1024 fan out ). and hence very low depth of the tree.
B+ tree can be preferred over HashTable when size of hashtable is larger than RAM size. If size of hashtable is larger than RAM size it will increase threshing due to random access pattern of hashtable.
In this case B+ tree gives benefit of locality of reference by storing nodes which are at lower level in RAM (Cache)
I think he questioned about memory since you are storing everything in the hash table. Here is the implementation with using B-tree.
public class Dictionary
{
public Node head = new Node();
private class Node{
public Node parent;
public char index;
public String definition;
public HashMap<Character, Node> children;
public Node(){
parent = null;
index = '\u0000';
definition = null;
children = new HashMap<>();
}
public Node(Node parent, char index, String definition){
this.parent = parent;
this.index = index;
this.definition = definition;
children = new HashMap<>();
}
public Node(Node parent, char index){
this.parent = parent;
this.index = index;
this.definition = null;
children = new HashMap<>();
}
private boolean hasDefinition(){
if(definition != null)
return true;
else
return false;
}
}
public void insert(String word, String definition){
Node currentNode = head;
for(int i = 0; i < word.length(); i++){
char index = word.charAt(i);
if(currentNode.children.containsKey(index)){
currentNode = currentNode.children.get(index);
}
else{
Node newChild = new Node(currentNode, index);
currentNode.children.put(index, newChild);
currentNode = newChild;
}
}
currentNode.definition = definition;
}
public String lookup(String word){
Node currentNode = head;
String returnDef = "";
for(int i = 0; i < word.length(); i++){
char index = word.charAt(i);
if(currentNode.children.containsKey(index)){
currentNode = currentNode.children.get(index);
}
else{
return "";
}
}
if(currentNode.hasDefinition()){
returnDef = currentNode.definition;
}
return returnDef;
}
}
You could use MD5 function to generate string for each input and use the resulting string as key in your HashMap, for example:
- md5("abc") = "900150983cd24fb0d6963f7d28e17f72"
- md5("abcd") = "e2fc714c4727ee9395f324cd2e7f331f"
you would have less probability of key clashes on this case.
Since the interviewer sated on 2) you are out of memory, I believe Distributed HashTable would solve the issue, you would use a Hash function of the key String and with the resulting String you could index a range of machines that could answer the request, using consistent hashing would avoid the bucket collision issues pointed above, and you would be able to scale to several machines.
Trie is better from HashTable with respect to space complexity. If you have to words with common prefix e.g abc abcde you still have to create to separate keys in HT whereas in trie the words share common prefix part.
- thelineofcode February 05, 2014But if access time is the most important factor the HT is better. So I think you should have explained both approaches and compare their pros and cons.