LinkedIn Interview Question
Software Engineer / Developers

The min heap is a really nice way to get the 100 most frequently occurring words. For counting each word, I have an idea: use a trie (prefix tree). It gives constant-time insertion and lookup per word, assuming the length of every word is bounded by a constant. It can also be more memory-efficient and have a better worst-case running time than a hash table. Then, assuming there are n words in the document and m distinct words, the total computation time will be O(n + m*log(100)). Can you check it? Thanks.
Tries are not memory-saving unless they are implemented in some compressed form like a DAWG or a Patricia tree. Moreover, with a trie (or its compact form) you will have to traverse the whole trie; if the words in the document are very diverse and the input is large, I think the trie is a poor choice.
Not sure why a min heap is used here... shouldn't that be a max heap?
Hash table for the initial word count: O(n)
Max heap for yielding the top 100 elements: O(n log n) to insert all counts + O(log n) per extraction
Total cost: O(n log n)
Wouldn't insertions into a max heap of 100 elements cost n * log(100), since that's the depth of the tree, for each of the n insertions? So the total is O(n log m), where m is the number of most-frequent elements we're trying to find.
Anyway, you need to store some kind of information about every word seen; you can optimize it further in the following ways.
1. For words in the dictionary, maintain an array and increment a score there. (Note that you are not storing the word itself here.)
2. Create an index for non-dictionary words, maintain it in a hashmap, and update the count in the array as above.
Now build a heap to find the top 100 occurrences. Note that a heap is better than a full sort, because we don't care about the order of the remaining elements.
O(n log n)
Count the words using a map, then insert the counts into a heap:
#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
using namespace std;

ifstream fin ("test.in");

// Order pairs by count, descending; break ties by word so that
// two different words with equal counts are not collapsed by the set.
struct classcomp {
    bool operator() (const pair<string,int>& lhs, const pair<string,int>& rhs) const {
        if (lhs.second != rhs.second) return lhs.second > rhs.second;
        return lhs.first < rhs.first;
    }
};

void getCount() {
    map<string, int> count;
    string in;
    while (fin >> in)   // stops cleanly at EOF, so the last word is not read twice
        count[in]++;

    // The ordered set acts as the "heap": iterating it yields counts
    // in descending order.
    set<pair<string,int>, classcomp> heap;
    for (map<string, int>::iterator i = count.begin(); i != count.end(); ++i)
        heap.insert(make_pair(i->first, i->second));

    int c = 0;
    for (set<pair<string,int>, classcomp>::iterator i = heap.begin();
         c < 100 && i != heap.end(); ++c, ++i)
        cout << i->first << " " << i->second << endl;
}
1. Get count of each word.
2. Create a MIN heap of word counts with the first 100 elements.
3. Now, for each remaining word count: if the count is smaller than (or equal to) the root of the min heap, ignore it; otherwise replace the root with the new, greater count and heapify.
At the end, the elements in the heap will be the 100 most frequently occurring words.
- Anurag Singh February 08, 2011