Microsoft Interview Question Software Engineer / Developers
0of 0 votesYou have stream of words coming, which data structure should be used to store the words efficiently and to answer word queries and maintain top k high frequency words.
I feel a hashtable and a min heap will do a better job.
As the words are coming maintain the hash, and a min heap of size k.
Suppose a new word comes, then add it to hash and if already present increase its count value to say x.
Now suppose the top element in heap is having a count y.
If x > y
then replace y with x and rebuild the heap.
else do nothing
here time complexity after reading n words = O(nlogk)
yes, I also think PAT-tree is the most appropriate here: due to storage efficiency (one doesn't need to store all common prefixes) in contrast to hash tables as well as performance issues due possible hash conflicts. Additionaly, we need to store word frequencies in leaf nodes to maintain the heap of k most frequent ones.
While I agree that PAT-tree would be the best in storage and probably in the searching of the new string that is coming.
however I do not agree that storing the count in the leaf node is going to help. How would you search for the K most common words? In this way you would have to traverse the whole tree.
I think we have to maintain two data-structures.."Height balanced BST(AVL)" and "Min Heap"
Struct of BST contains extra count variable..which incremented on every occurence of duplicated string...
After insertion in BST insert the same string in min heap.
Create Min heap wrt updated count value associated with each string...
Heap size is "a[K]"
If heap is full delete the first element from heap(minimum count value)...
At the end u have "k" frequently occurring nodes.
Time complexity..
O(nlogn) + O(nlogk) = O(nlogn)
I think by high frequency, it means that the number of times the word is queried the most and not the number of times it is coming. I think to store the words efficiently, it should be stored first in an AVL tree as the words are coming. The node should contain an extra info about the number of time the word is queried. Now upon every query, the number of corresponding node is incremented and then based upon this number the subtree of its parent should be rebalanced such that if the number is greater than the corresponding query number of its parent, this node should be pushed up as far as it satisfies. Hence this way its both kind of working as a BST wrt wrod as well as heap wrt number. This way usually a high frequency number should relatively be high on the level of tree as well since it will be a BST, the worst case search would be O(n).

I guess we will have to maintain 2 data structures, one can be a self balancing tree and the other a hash table. The first tree will store the word as they come in sorted order so that the words can be inserted and extracted in O(lgn) time. Along with that whenever a word is inserted its count is incremented. So then in order to select the topmost k high frequency words we have to just select the top k largest counts from the hashtable which can be done using Randomized Quicksort in O(n) time. Instead of a hashtable we can also use a self balancing tree whose nodes will have a count variable and size variable where size will denote the total no of nodes beneath that node including the node itself. And then we find the kth Largest element in using the tree and thus obtains pointers to top k high frequency words.
- gauravk.18 on April 09, 2008 Edit | Flag ReplyAnyone with a better Solution????