LinkedIn Interview Question
Software Engineer / Developers

The min heap is a really nice way to get the 100 most frequently occurring words. For counting each word, I have an idea: use a trie (prefix tree). It gives constant-time insertion and lookup per word, assuming the length of every word is bounded by a constant. It can also be more memory-efficient and have a better worst-case running time than a hash table. Then, assuming there are n words in the document and m distinct words, the total computation time will be O(n + m*log(100)). Can you check it? Thanks.
Tries are not memory-saving unless they are implemented in some compressed form like a DAWG or a Patricia tree. Moreover, with a trie (or its compact form) you will have to traverse the whole trie; if the words in the document are very diverse and the input is large, I think the trie is a poor choice.
Not sure why a min heap is used here... shouldn't that be a max heap?
Hash table for the initial word count: O(n)
Max heap for yielding the top 100 elements: O(n log n) to insert all counts + O(log n) per extraction
Total cost: O(n log n)
Wouldn't insertions into a max heap of 100 elements cost n * log(100), since that's the depth of the tree, for each of the n insertions? So the total is O(n log m), where m is the number of most-frequent elements we're trying to find.
Anyway, you need to store some kind of information about every word seen; you can optimize it further in the following ways.
1. For words in the dictionary, maintain an array and increment a score there. (Note that you are not storing the word itself here.)
2. Create an index for non-dictionary words, maintain it in a hashmap, and update the count in the array as above.
Now build a heap to find the top 100 occurrences. Note that a heap is better than a full sort, because we don't care about the order of the remaining elements.
O(n log n)
Count the words using a map, then insert the counts into a heap:
#include <fstream>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
using namespace std;

ifstream fin ("test.in");

// Order pairs by count, descending; break ties by word so that
// two different words with equal counts are not collapsed by the set.
struct classcomp {
    bool operator() (const pair<string,int>& lhs, const pair<string,int>& rhs) const {
        if (lhs.second != rhs.second) return lhs.second > rhs.second;
        return lhs.first < rhs.first;
    }
};

void getCount() {
    map<string, int> count;
    string in;
    while (fin >> in)   // stops cleanly at EOF, so the last word is not read twice
        count[in]++;

    // The ordered set acts as the "heap": iterating it yields counts
    // in descending order.
    set<pair<string,int>, classcomp> heap;
    for (map<string, int>::iterator i = count.begin(); i != count.end(); ++i)
        heap.insert(make_pair(i->first, i->second));

    int c = 0;
    for (set<pair<string,int>, classcomp>::iterator i = heap.begin();
         c < 100 && i != heap.end(); ++c, ++i)
        cout << i->first << " " << i->second << endl;
}
1. Get count of each word.
2. Create a MIN heap of word counts with the first 100 elements.
3. Now, for each remaining word count: if the count is smaller than (or equal to) the root of the min heap, ignore it; otherwise replace the root with the new, greater count and heapify.
At the end, the elements in the heap will be the 100 most frequently occurring words.
- Anurag Singh February 08, 2011