Amazon Interview Question
Software Engineer / Developer
Team: Kindle
Country: United States
Interview Type: In-Person
A second pass is not required if we maintain the current top 3 as we go through the book hashing words; inside the hash map, every word also carries its count. (A quick sketch of this follows below.)
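For what it's worth, here is a minimal one-pass sketch of that idea in Java. The word array and the indexOf helper are made up for illustration; a real solution would tokenize the book's text:

import java.util.HashMap;
import java.util.Map;

public class TopThreeOnePass {
    public static void main(String[] args) {
        // Placeholder input standing in for the book's words.
        String[] words = {"the", "cat", "the", "hat", "the", "cat"};
        Map<String, Integer> counts = new HashMap<>();
        String[] top = new String[3]; // top[0] = most frequent so far
        for (String w : words) {
            int c = counts.merge(w, 1, Integer::sum); // updated count of w
            for (int i = 0; i < top.length; i++) {
                if (top[i] == null || counts.get(top[i]) < c) {
                    // w now ranks at position i: drop its old slot (if any)
                    // and shift the displaced entries down by one.
                    int old = indexOf(top, w);
                    for (int j = (old == -1 ? top.length - 1 : old); j > i; j--)
                        top[j] = top[j - 1];
                    top[i] = w;
                    break;
                } else if (w.equals(top[i])) {
                    break; // w already ranks at least this high
                }
            }
        }
        for (String w : top)
            if (w != null) System.out.println(w + " -> " + counts.get(w));
    }

    private static int indexOf(String[] a, String w) {
        for (int i = 0; i < a.length; i++) if (w.equals(a[i])) return i;
        return -1;
    }
}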
.. where by 'tire' you mean 'trie', right? Also, you don't need any particular structure to keep the top 3; they could just be separate variables. But an array of 3 is probably more convenient.
Not optimal in which sense?
Theorem:
Every correct algorithm for this problem runs in Ω(N) time.
Proof:
Assume not. That is, assume there exists an algorithm O* that, for every input I, finds the top 3 most frequently used words in the text file in o(N) time, i.e., asymptotically faster than linear.
Then, for every input, O* cannot examine every single word in the file; that is, it decides the top-3 property while leaving at least one word w unread. It is then easy to build an instance K in which all words are tied and the unread word w is the tie-breaker: changing w changes the correct answer, yet O* outputs the same result either way, so it is wrong on at least one of the two instances. This contradicts the assumption that O* is correct.
What was the running time of your algorithm?
And what about map-reduce? Wouldn't this be a good fit?
You are correct, but only in the sense that you are considering time complexity alone. The hashing technique would give exponential, O(26^N), space complexity, which is clearly unacceptable. In fact, a two-trie is even better than a trie for this case, as it has the time complexity of a trie but better space complexity.
- Use a list of pairs<string,int>, where the int holds the count of the particular word. Then sort by the second element (see the sketch below).
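A rough sketch of this suggestion, with a placeholder word array standing in for the file's contents:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortPairsByCount {
    public static void main(String[] args) {
        // Placeholder input standing in for the file's words.
        String[] words = {"the", "cat", "the", "hat", "the", "cat"};
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        // The list of pairs<string,int>: each entry holds a word and its count.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>(counts.entrySet());
        // Sort by the second element (the count), descending.
        pairs.sort((a, b) -> b.getValue() - a.getValue());
        for (int i = 0; i < Math.min(3, pairs.size()); i++)
            System.out.println(pairs.get(i).getKey() + " -> " + pairs.get(i).getValue());
    }
}

Note that the sort makes this O(M log M) in the number of distinct words M, while keeping a running top 3 (as in the first comment) stays O(N).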
Actually the interviewer was right that it is not optimal.
The reason is that you have to hash a word (first pass) and then, after the lookup, compare it with the already existing entry (second pass).
The optimal solution is to use a trie and add the words to it. When you reach the end leaf, increase the counter and compare it to the current top 3 (to the last, least popular item). If it is already in the top 3, then just increase that array entry and check it against the next item. If it's larger, swap them.
The top-3 array could be replaced with a linked list that always drops the last item, but that requires a lot of garbage collection, so a fixed array is preferable.
Generally, hash tables are really good, but keep in mind that their average collision rate with strings is high and they require constant resizing.
On the other hand, tries are static; there is no need to resize them.
Actually the interviewer is not right.
It doesn't matter whether you do one or two passes over the data; the number of passes is a constant factor, so the interviewer is NOT correct.
Both O(2N) and O(N) are O(N).
The solution the OP provided is asymptotically as good as the so-called 'optimal' solution from the interviewer.
Errrr... There is time complexity and space/memory complexity. So you are right that there are quite a few O(N) algorithms (time!) but we should consider memory usage as well.
Without knowing the overall number/distinct number/distribution/etc of the words it is hard to tell more.
The funny side of it is that most probably the answer is always "a", "the", "in" for books written in English (or something like that) :-) and, if I am not mistaken, a very similar list of the 3 most common words can be compiled for every language.
You are absolutely right.
That is why I asked the OP optimal in which sense.
I personally admit I would have answered in a similar manner. I am certainly not an expert, but I think that as the file size increases, a map-reduce approach could be a sweet solution; this problem seems tailored for some kind of parallel processing. (A rough sketch follows.)
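To show the shape of such a computation without an actual cluster, here is a sketch using Java parallel streams as a stand-in for map-reduce; the sample lines are placeholders, and a real job would run on something like Hadoop:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelWordCount {
    public static void main(String[] args) {
        // Placeholder input; a real job would read and shard the file.
        List<String> lines = Arrays.asList("the cat sat", "the hat", "the cat");
        // "Map" phase: split each line into words, in parallel.
        // "Reduce" phase: merge per-word counts into one concurrent map.
        Map<String, Long> counts = lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingByConcurrent(w -> w, Collectors.counting()));
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(3)
                .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}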
For god's sake, stop saying "max" heap. How are you planning to find out which node to remove when something bigger than the root comes? Are you going to traverse the whole heap to find the minimum?
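To spell that out: for top-k you want a MIN-heap of size k, where the root is the weakest of the current top k, so a newcomer only has to beat the root. A small sketch with java.util.PriorityQueue, assuming the word counts have already been gathered:

import java.util.Map;
import java.util.PriorityQueue;

public class TopKWithMinHeap {
    public static void main(String[] args) {
        // Assume the counts were already built in a single pass over the file.
        Map<String, Integer> counts = Map.of("the", 3, "cat", 2, "hat", 1, "sat", 1);
        int k = 3;
        // Min-heap by count: the root is the LEAST frequent of the current top k,
        // so deciding whether a newcomer belongs is a single comparison.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll(); // evict the weakest entry
        }
        while (!heap.isEmpty()) System.out.println(heap.poll()); // ascending by count
    }
}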
Here is the code I wrote using a prefix tree. This program prints the n most frequently used words in a file. For simplicity I have not read from a file; I just read from an array. If you want to read from a file, read it line by line, convert each line into an array of words, and add each word with this program.
package example.concepts;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
public class TrieTreeExample {
public static void main(String[] args) {
int n = 3; // number of most frequently occurring words to track
TriTree tree = new TriTree(n);
String[] words = {"prashant","pras","pras","pras","prashant","pras","abc","test","test","prashant","abc","hhh","bbb","bbb","pras","bbb","prashant","prashant","ab","ab","ab","pras"};
for (int i = 0; i < words.length; i++)
tree.insertWord(words[i]);
// tree.printTree(0, new Character[50], tree.root);
tree.printTree(0, new StringBuffer(), tree.root);
System.out.println("Is exist :" + tree.findWord("pras"));
System.out.println("Max occured word is : " + tree.maxWord1 + " Count is :" + tree.maxWordCount1);
System.out.println("Max occured word is : " + tree.maxWord2 + " Count is :" + tree.maxWordCount2);
System.out.println("Max occured word is : " + tree.maxWord3 + " Count is :" + tree.maxWordCount3);
System.out.println("Max Word :" + Arrays.asList(tree.maxWord));
System.out.print("Max Count : [ ");
for(int i =0; i < tree.maxCount.length;i++)
{
System.out.print(tree.maxCount[i] + " ");
}
System.out.println(" ]");
}
}
class TriTree
{
TriNode root;
int maxCount[];
String maxWord[];
TriTree(int n)
{
root= new TriNode();
maxCount = new int[n];
maxWord = new String[n];
}
// Insert a word, creating trie nodes as needed, and update the running top-n.
void insertWord(String word)
{
if (root==null)
{
System.out.println("Tree does not exist...");
return;
}
boolean isExist=true;
TriNode current= root;
Character character;
for(int i=0; i < word.length();i++)
{
character = word.charAt(i);
if (current.links.get(character) == null)
{
current.links.put(character, new TriNode(character, i == word.length()-1, 1));
isExist=false;
}
current = current.links.get(character);
}
// A node is created with countOfChar = 1, which already accounts for the
// first occurrence of a word ending at it, so only repeat hits increment.
if(!current.fullWord)
{
isExist=false;
current.fullWord=true;
}
else if (isExist && current.fullWord)
{
current.countOfChar++;
}
findMaxWord(current, word);
}
// Maintain the top-n arrays: remove the word's old entry (if any), then
// insert it at its new rank, so no word ever appears twice in the list.
void findMaxWord(TriNode current, String word)
{
int n = maxCount.length;
for (int i=0; i < n; i++)
{
if (word.equals(maxWord[i]))
{
for (int j=i; j < n-1; j++)
{
maxWord[j] = maxWord[j+1];
maxCount[j] = maxCount[j+1];
}
maxWord[n-1] = null;
maxCount[n-1] = 0;
break;
}
}
for (int i=0; i < n; i++)
{
if (current.countOfChar > maxCount[i])
{
for (int j=n-1; j > i; j--)
{
maxWord[j] = maxWord[j-1];
maxCount[j] = maxCount[j-1];
}
maxWord[i] = word;
maxCount[i] = current.countOfChar;
break;
}
}
}
// Look up a word; print its count and return true only if it is a full word.
boolean findWord(String word)
{
if (root==null)
{
System.out.println("Tree does not exist...");
return false;
}
TriNode current= root;
for(int i=0; i < word.length();i++)
{
TriNode next = current.links.get(word.charAt(i));
if (next == null)
{
return false; // the path breaks: word is not in the trie
}
current = next;
}
if (!current.fullWord)
{
return false; // the path exists only as a prefix of longer words
}
System.out.println("Count of word '" + word + "' is: " + current.countOfChar);
return true;
}
// Depth-first print of all full words, collecting path characters in branch[].
void printTree(int level, Character branch[], TriNode root)
{
if (root==null)
{
return;
}
Set<Character> setKeys = root.links.keySet();
for (Character key: setKeys)
{
branch[level]= root.value;
printTree(level+1, branch, root.links.get(key));
}
if (root.fullWord)
{
branch[level]=root.value;
for (int i=1; i <=level;i++)
{
System.out.print(branch[i]);
}
System.out.println();
}
}
// Same traversal, collecting path characters in a StringBuffer instead.
void printTree(int level, StringBuffer buf, TriNode root)
{
if (root==null)
{
return;
}
Set<Character> setKeys = root.links.keySet();
for (Character key: setKeys)
{
buf.insert(level, root.value);
printTree(level+1, buf, root.links.get(key));
}
if (root.fullWord)
{
buf.insert(level, root.value);
System.out.print("Word :");
for (int i=1; i <=level;i++)
{
System.out.print(buf.charAt(i));
}
System.out.print(" Count : " + root.countOfChar);
System.out.println();
}
}
}
// A single trie node: one character, its outgoing links, whether a full word
// ends here, and how many times that word has occurred.
class TriNode
{
Character value;
boolean fullWord;
Map<Character,TriNode> links;
int countOfChar;
TriNode()
{
value ='\0';
fullWord=false;
countOfChar=0;
links = new HashMap<Character,TriNode>();
}
TriNode(Character value, boolean isFullWord, int countOfChar)
{
this.value = value;
fullWord=isFullWord;
this.countOfChar=countOfChar;
links = new HashMap<Character,TriNode>();
}
}