Morgan Stanley Interview Question
Technical Support Engineers
Country: India
Interview Type: In-Person
I answered: I'd use a trie data structure to store words with their counts.
Each node in the structure has a character, child pointers, and a count variable.
I'd traverse the book character by character, and in the trie I'd keep moving from the root node toward the node for the corresponding character. Whenever the book has a whitespace character (a space, tab, etc., marking a word boundary), the counter of the last character's node is incremented. So in the end, any node's count gives the frequency of the word spelled out by the characters on the path from the root to that node.
Once the structure is built, traverse the tree to find the maximum count, then traverse again from the root to recover that word.
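The trie approach above can be sketched in Python roughly as follows (the node layout and function names are my own illustration, not from the original post):

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> child TrieNode
        self.count = 0      # how many times the word ending at this node occurred

def count_words(text):
    """Build a trie of word counts by scanning the text character by character."""
    root = TrieNode()
    node = root
    for ch in text + " ":          # trailing space flushes the final word
        if ch.isspace():
            if node is not root:   # word boundary: bump the last node's counter
                node.count += 1
                node = root
        else:
            node = node.children.setdefault(ch.lower(), TrieNode())
    return root

def most_frequent(node, prefix=""):
    """DFS over the trie; return (count, word) with the highest count."""
    best = (node.count, prefix)
    for ch, child in node.children.items():
        cand = most_frequent(child, prefix + ch)
        if cand[0] > best[0]:
            best = cand
    return best

root = count_words("the cat and the hat and the bat")
print(most_frequent(root))  # -> (3, 'the')
```

Space usage is proportional to the total number of distinct prefixes, which is why the thread goes on to debate hash maps and external sorting for very large books.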
Please post alternative and better solutions.
I had the same solution too. A hash table could also be a solution, but since the number of words would be huge and the range of possible words is effectively unbounded, a hash table is not in the picture.
What would be the character at the root? Are you proposing multiple trees or a single tree?
You can do an external merge sort on the complete book (the merge logic depends on the RAM available, since the book cannot be loaded into memory completely).
After the sorting is done, you need to read through the sorted output again and check for the highest-occurrence word.
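After an external sort, equal words sit next to each other, so the final step is a single linear pass that tracks the longest run. A sketch of that pass (the external sort itself is omitted; the function name is mine):

```python
def most_frequent_sorted(words):
    """Given an already-sorted iterable of words, return (word, count)
    for the longest run of equal adjacent words."""
    best_word, best_count = None, 0
    cur_word, cur_count = None, 0
    for w in words:
        if w == cur_word:
            cur_count += 1
        else:
            cur_word, cur_count = w, 1
        if cur_count > best_count:
            best_word, best_count = cur_word, cur_count
    return best_word, best_count

print(most_frequent_sorted(sorted("to be or not to be".split())))  # -> ('be', 2)
```

Because the pass only needs the current and best runs, it uses O(1) memory regardless of the book's size, which is the point of sorting externally first.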
This is a tricky question and is expensive to solve using traditional methods.
One alternative method is to use a large array, where each element is a pointer to a string.
Now read each string from the book. Take a random number in the range 0 to Max_Array_Size and replace the existing Array[randomNo] with the recently read string.
After reading all words from the book, check the array and find the string which occurs the maximum number of times.
I was asked a similar question and argued that the solution would not work. The interviewer explained the following:
1. The algorithm works most of the time. I forgot the algorithm's name.
2. The array size has to be chosen according to a formula, which I don't remember.
3. The algorithm is very fast and efficient.
4. It can be used in search-type applications like Google/Bing, where you don't need 100% accurate results.
5. If a word occurs many times in the book, there is a higher probability that the word is present in multiple places in the array.
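I can't name the exact algorithm either, and the sizing formula isn't reproduced in the post, so treat the following as a hedged illustration of the sampling idea as described: frequent words tend to occupy many slots of the fixed-size array, so the array's most common entry approximates the book's most frequent word.

```python
import random
from collections import Counter

def approx_most_frequent(words, array_size=1000, seed=0):
    """Probabilistic estimate of the most frequent word in a stream.

    Each incoming word overwrites a uniformly random slot; the more often
    a word occurs, the more slots it is likely to hold at the end."""
    rng = random.Random(seed)
    slots = [None] * array_size
    for w in words:
        slots[rng.randrange(array_size)] = w   # overwrite a random slot
    filled = [w for w in slots if w is not None]
    return Counter(filled).most_common(1)[0][0] if filled else None
```

For example, on a stream where "the" makes up half of all words, `approx_most_frequent` returns "the" with high probability, but it is only an estimate, matching point 4 above about applications that tolerate inexact answers.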
Since it is a very big book, internal sorting or an internal data structure may not work.
My solution is to split the whole book into K buckets and handle the buckets one by one.
For the first bucket, take the top 20% of words by frequency to build a max-heap.
When the next bucket comes in, maintain the max-heap.
After reading all buckets, the words at the top of the max-heap are the most frequent words.
I do not know if it works; I would be very happy to be corrected.
Another point to this answer: we can use the MapReduce model to optimize the execution.
Split the book into chunks;
assign different machines/threads to walk through their chunk, logging all words and their counts in a HashMap;
then summarize the results, tracking the running maximum as the per-worker maps are merged, so no extra iteration over the final map is needed.
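A single-machine sketch of that MapReduce idea (the chunking and function names are illustrative; a real deployment would run the map step on separate machines or threads):

```python
from collections import Counter

def map_chunk(chunk):
    """Map step: count words within one chunk (one worker per chunk)."""
    return Counter(chunk.lower().split())

def reduce_counts(partials):
    """Reduce step: merge per-chunk counters while tracking the running
    maximum, so no extra pass over the merged map is needed afterwards."""
    total = Counter()
    best = (None, 0)
    for part in partials:
        total.update(part)
        for word in part:
            if total[word] > best[1]:
                best = (word, total[word])
    return best

chunks = ["the quick fox", "the lazy dog", "the end"]
print(reduce_counts(map(map_chunk, chunks)))  # -> ('the', 3)
```

The map steps are independent and embarrassingly parallel; only the reduce step needs to see all partial counters.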
For natural-language words it is better to use a hash map (unordered_map in C++11) than a binary tree, as it is possible to create a good hash function that follows language rules (i.e., which vowels follow which consonants and vice versa in English), so the probability of collisions will be small.
How about a B-tree which stores the number of occurrences of each word?
- Ady, September 19, 2011
http://en.wikipedia.org/wiki/B-tree