Google Interview Question
Software Engineer | Country: United States
Interview Type: Phone Interview
It looks like the order of the words matters in this question; otherwise the n-grams for doc1 would have been 'This is a', 'is a dog', 'This is dog', and 'This a dog'.
With that assumption, the algorithm is incorrect.
I would propose a modified version of KMP-style string comparison, replacing characters with words. Scan a document word by word and fill a hash table with hash(doc_word[i]...doc_word[i+n]) so that order matters (similar to KMP). Do the same for the next document and check whether the hash table already contains an element with that key.
public Set<String> getIntersection(Set<String> a, Set<String> b) {
    Set<String> intersection = new HashSet<String>();
    for (String ngram : a) {
        if (b.contains(ngram)) {
            intersection.add(ngram);
        }
    }
    return intersection;
}
public Set<String> getNGrams(String document, int n) {
    String[] words = document.split(" ");
    Set<String> ngrams = new HashSet<String>();
    for (int i = 0; i <= words.length - n; i++) {
        StringBuilder ngram = new StringBuilder();
        for (int j = i; j < i + n; j++) {
            ngram.append(words[j]).append(' ');
        }
        ngrams.add(ngram.toString().trim());
    }
    return ngrams;
}
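The hash-based scan described in the comment above could also be sketched with a word-level rolling hash (a Rabin-Karp-style variant; the class and method names here are illustrative, not from the original post). Hash hits are confirmed by comparing the actual windows, so hash collisions cannot produce false positives:

```java
import java.util.*;

public class NGramRollingHash {
    private static final long B = 1_000_003L; // polynomial base (illustrative choice)

    // Map each n-word window's rolling hash to the window's start indices.
    private static Map<Long, List<Integer>> windowHashes(String[] w, int n) {
        long top = 1;                          // B^(n-1): weight of the word leaving the window
        for (int k = 1; k < n; k++) top *= B;
        Map<Long, List<Integer>> byHash = new HashMap<>();
        long h = 0;
        for (int i = 0; i < w.length; i++) {
            if (i >= n) h -= top * w[i - n].hashCode(); // drop the word that left the window
            h = h * B + w[i].hashCode();                // add the new word
            if (i >= n - 1)
                byHash.computeIfAbsent(h, k -> new ArrayList<>()).add(i - n + 1);
        }
        return byHash;
    }

    // n-grams present in both documents, in O((M+N)*n) expected time.
    public static Set<String> commonNGrams(String doc1, String doc2, int n) {
        String[] a = doc1.split(" "), b = doc2.split(" ");
        Set<String> result = new HashSet<>();
        if (a.length < n || b.length < n) return result;
        Map<Long, List<Integer>> first = windowHashes(a, n);
        long top = 1;
        for (int k = 1; k < n; k++) top *= B;
        long h = 0;
        for (int i = 0; i < b.length; i++) {
            if (i >= n) h -= top * b[i - n].hashCode();
            h = h * B + b[i].hashCode();
            if (i < n - 1) continue;
            for (int start : first.getOrDefault(h, Collections.emptyList()))
                if (sameWindow(a, start, b, i - n + 1, n))
                    result.add(String.join(" ", Arrays.copyOfRange(b, i - n + 1, i + 1)));
        }
        return result;
    }

    private static boolean sameWindow(String[] a, int i, String[] b, int j, int n) {
        for (int k = 0; k < n; k++)
            if (!a[i + k].equals(b[j + k])) return false;
        return true;
    }
}
```

For example, commonNGrams("This is a dog", "here is a dog too", 3) yields the single common trigram "is a dog".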
In n-gram problems, n is often very small compared to M and N, where M and N are the sizes of the documents in words.
Ignoring the case where words are separated by multiple spaces, I've written simple code that runs in O((M+N)n) time:
public class Interview {
    private void process(String file, HashMap<String, Boolean> cmap, int n) {
        String[] words = file.split(" ");
        if (words.length < n)
            return;
        for (int i = 0; i <= words.length - n; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < n + i; j++)
                sb.append(words[j]).append(' ');
            String key = sb.toString();
            if (cmap.containsKey(key))
                cmap.put(key, Boolean.TRUE);
            else
                cmap.put(key, Boolean.FALSE);
        }
    }

    public List<String> intersectionNGram(String file1, String file2, int n) {
        List<String> result = new ArrayList<>();
        HashMap<String, Boolean> cmap = new HashMap<>();
        process(file1, cmap, n);
        if (cmap.size() == 0)
            return result;
        process(file2, cmap, n);
        for (String key : cmap.keySet()) {
            if (cmap.get(key))
                result.add(key);
        }
        return result;
    }
}
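A quick self-contained driver (reimplementing the two-pass logic above so it runs standalone; the inputs are made up) illustrates the behavior the comments below point out: an n-gram repeated within the first document alone gets reported as an intersection.

```java
import java.util.*;

public class DuplicateDemo {
    // Mirrors the two-pass logic above: any key seen a second time --
    // even within the same document -- is flipped to TRUE.
    static List<String> intersectionNGram(String file1, String file2, int n) {
        Map<String, Boolean> cmap = new HashMap<>();
        process(file1, cmap, n);
        process(file2, cmap, n);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : cmap.entrySet())
            if (e.getValue()) result.add(e.getKey().trim());
        return result;
    }

    static void process(String file, Map<String, Boolean> cmap, int n) {
        String[] words = file.split(" ");
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < i + n; j++) sb.append(words[j]).append(' ');
            String key = sb.toString();
            cmap.put(key, cmap.containsKey(key)); // repeat sighting marks TRUE
        }
    }

    public static void main(String[] args) {
        // "a b" repeats inside doc1; doc2 shares nothing with doc1,
        // yet "a b" is still reported as an intersection.
        System.out.println(intersectionNGram("a b a b", "x y z", 2)); // prints [a b]
    }
}
```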
Nice answer, but I think there might be a problem.
If I understand your code correctly, you would count an n-gram repeated within the first document as an intersection.
That's a nice solution.
However, when I look at your code I think there is a small mistake.
If I understand correctly, you would count a duplicated n-gram in the first document as an intersection.
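One way to address the duplicate issue described in the comments above (a sketch, not the original poster's code; the class name is illustrative) is to keep the two-pass HashMap design but only flip keys to TRUE during the second pass, never inserting new ones:

```java
import java.util.*;

public class NGramIntersectFixed {
    // Pass 1 (mark=false): record doc1's n-grams, overwriting duplicates with FALSE.
    // Pass 2 (mark=true): flip to TRUE only keys doc1 already owns; keys that
    // appear only in doc2 are not inserted, so duplicates in either doc are harmless.
    static void process(String file, Map<String, Boolean> cmap, int n, boolean mark) {
        String[] words = file.split(" ");
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < i + n; j++) sb.append(words[j]).append(' ');
            String key = sb.toString();
            if (!mark)
                cmap.put(key, Boolean.FALSE);
            else if (cmap.containsKey(key))
                cmap.put(key, Boolean.TRUE);
        }
    }

    public static List<String> intersectionNGram(String file1, String file2, int n) {
        Map<String, Boolean> cmap = new HashMap<>();
        process(file1, cmap, n, false);
        process(file2, cmap, n, true);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : cmap.entrySet())
            if (e.getValue()) result.add(e.getKey().trim());
        return result;
    }
}
```

With this change, intersectionNGram("a b a b", "x y z", 2) returns an empty list, while genuinely shared n-grams are still found.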
#include <algorithm>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

void split(const string& s, char delim, vector<string>& out)
{
    istringstream iss(s);
    for (string tok; getline(iss, tok, delim); )
        if (!tok.empty()) out.push_back(tok);
}

set<string> process(const string& file, int n)
{
    vector<string> words;
    split(file, ' ', words);
    set<string> ngrams;
    if (words.size() < static_cast<size_t>(n)) // guard against size_t underflow
        return ngrams;
    for (size_t i = 0; i + n <= words.size(); i++)
    {
        ostringstream oss;
        for (int j = 0; j < n; j++) // build an n-word gram (was hard-coded to 3 words)
        {
            if (j) oss << ' ';
            oss << words[i + j];
        }
        ngrams.emplace(oss.str());
    }
    return ngrams;
}

set<string> intersectionNgram(const string& file1, const string& file2, int n)
{
    set<string> result;
    set<string> ngrams1 = process(file1, n);
    if (ngrams1.empty())
        return result;
    set<string> ngrams2 = process(file2, n);
    set_intersection(ngrams1.begin(), ngrams1.end(),
                     ngrams2.begin(), ngrams2.end(),
                     inserter(result, result.begin()));
    return result;
}
Creates a set of the n-grams in document1 and cross-checks the n-grams from document2 against it.
- Phoenix June 20, 2015