Microsoft Interview Question
Software Engineer / DevelopersTeam: MS Office Platform
Country: India
Interview Type: In-Person
Wouldn't the min heap option take only O(k) space - rather than O(n + k) - since we're only storing the top 10 items?
Here ya go dude
import java.util.*;
import java.io.*;
public class FrequentWords {
public static void main(String args[]) throws FileNotFoundException {
FrequentWords mine = new FrequentWords();
mine.getFrequent("Devil.txt");
}
public void getFrequent(String file) throws FileNotFoundException {
Scanner input = new Scanner(new FileInputStream(file));
HashMap<String, Integer> map = new HashMap<String, Integer>();
while(input.hasNext()) {
String line = input.next();
if (map.containsKey(line))
map.put(line, map.get(line) + 1);
else
map.put(line, 1);
}
input.close();
Comparator<WFile> minComp = new minComparator();
PriorityQueue<WFile> minHeap = new PriorityQueue<WFile>(minComp);
for (String key: map.keySet()) {
WFile word = new WFile(key, map.get(key));
minHeap.offer(word);
}
for(int i = 0; i < 10; i++) {
WFile word = minHeap.poll();
System.out.println(word.word + " " + word.amount);
}
}
private class WFile {
String word;
int amount;
public WFile(String word, int amount) {
this.word = word;
this.amount = amount;
}
}
private class minComparator implements Comparator<WFile> {
@Override
public int compare(WFile a, WFile b) {
if (a.amount < b.amount) return 1;
if (a.amount > b.amount) return -1;
return 0;
}
}
}
Tries - How to get word frequency any fancy implementation ? Besides tries will take a lot of memory thats why they came with a variation called Burst tries.
Did the question poster actually meant a MAX-HEAP with word frequency count being the key value? Can you please explain the min-heap plus list approach ?
Lets say we have N words and K = 100.
- Miguel Oliveira May 20, 2014First, I would use a hash map to count the number of occurrences of each word. A Trie is another option but it consumes a ton of memory.
Then, we have 2 options:
1) Process all <word, frequency> pairs. Use a min-heap to keep the current 10 most frequent words. If the current word has a higher frequency than the minimum of the heap, remove the minimum and add this word. Finally, extract the 10 words from the heap. This will take O(N log K) time and O(N + K) space.
2) Use the median of medians selection algorithm (search wikipedia "Median_of_medians") to find the 10th highest frequency. As a result, we find the 10th highest and the top 9 are the words to the right of the 10th word.
This will take O(N) time and O(N) space.