Google Interview Question
Software Engineer / Developers
Interview Type: In-Person
1) For each keyword, extract the vector of document-ids; each vector is sorted.
2) For the k lists extracted, perform a modified k-way merge algorithm.
3) The k-way merge is recursive, like merge sort: compute the intersection of the first half of the lists and of the second half, then merge the two partial results into the final result.
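A minimal Python sketch of the recursive scheme above (names are mine, not from the post); the two-list step is a linear merge-style scan that keeps only common elements:

```python
def intersect_two(a, b):
    """Intersect two sorted lists with a linear merge-style scan."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_k(lists):
    """Recursively intersect k sorted lists, merge-sort style:
    solve each half, then combine the two partial results."""
    if len(lists) == 1:
        return lists[0]
    mid = len(lists) // 2
    return intersect_two(intersect_k(lists[:mid]), intersect_k(lists[mid:]))
```

As in merge sort, each of the O(log k) levels scans every element once, so the total work is O(S log k) for S total elements.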
This is straightforward to solve with map-reduce. The extracted lists are distributed across n nodes. Each node runs a map operation that emits a pair <value, list_id> for every document-id. The reducer collects all pairs with the same value and counts how many distinct lists the value came from; if that count equals k, it outputs the value, i.e. the document-id.
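A single-machine simulation of the map/shuffle/reduce phases just described (a sketch, not real MapReduce API code):

```python
from collections import defaultdict

def map_reduce_intersect(lists):
    """Simulate map, shuffle, and reduce on one machine."""
    k = len(lists)
    # Map phase: each list emits (doc_id, list_id) pairs.
    pairs = [(doc_id, list_id)
             for list_id, docs in enumerate(lists)
             for doc_id in docs]
    # Shuffle phase: group the pairs by doc_id.
    groups = defaultdict(set)
    for doc_id, list_id in pairs:
        groups[doc_id].add(list_id)
    # Reduce phase: a doc_id seen in all k lists is in the intersection.
    return sorted(d for d, ids in groups.items() if len(ids) == k)
```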
How to improve if there are only two arrays and one is much smaller compared to the other?
-- Say the sizes are m (A[]) and n (B[]), with m << n.
Sort the smaller array A. For every element of B, binary search for it in A. Complexity: O(n log m + m log m) = O((n + m) log m); if m << n, this is almost O(n). (Assuming the arrays are not sorted to begin with.)
If they are already sorted as given:
O(n log m). We can use a modified binary search, where the portion of A still to be searched keeps shrinking past index x, where x is the index of the last found intersection, with a termination condition: stop as soon as A's last element is smaller than B[i]. Very close to O(n).
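A sketch of that shrinking-window binary search (my code; it assumes document-ids within each list are unique, which holds for posting lists):

```python
from bisect import bisect_left

def intersect_shrinking(a, b):
    """Intersect sorted a (small) with sorted b (large), binary-searching
    each element of b in a suffix of a that shrinks as matches are found."""
    out = []
    lo = 0  # left edge of the remaining search window in a
    for x in b:
        if lo >= len(a) or a[-1] < x:
            break  # nothing left in a can match any later element of b
        i = bisect_left(a, x, lo)
        if i < len(a) and a[i] == x:
            out.append(x)
            lo = i + 1  # later elements of b are larger, skip a[:i+1]
    return out
```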
How to distribute this on a set of different machines?
--Map Reduce.
Divide the input data (total size, n) into k parts. -> k mappers -> Every mapper will find intersection of n/k sorted arrays.
Reducer will then combine the results from these k mappers and produce final output.
Basically you take advantage of parallelism. It is only recommended for large data sets, since it involves network latency. For small data sets, this might actually increase the total running time, because of several latencies involved
Wouldn't it make more sense to do the reverse and do a binary search in B for every element in A?
That way the time complexity goes down to O(m log n), which is much better than O(n log m) since n >> m.
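That reversed direction is a one-liner with the standard library (sketch, my naming):

```python
from bisect import bisect_left

def intersect_small_in_big(a, b):
    """For each element of the small sorted array a, binary search the
    large sorted array b: O(m log n) with m = len(a), n = len(b)."""
    return [x for x in a
            if (i := bisect_left(b, x)) < len(b) and b[i] == x]
```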
1) How to find intersection across 'N' sorted arrays:
-- mytestaccount2, September 21, 2013
We can make use of a min-heap.
Insert A1[0], A2[0], ..., An[0] into the min-heap.
While the min-heap is not empty, keep removing the minimum document-id from it,
then push the next element of the array Ai the removed element came from.
Whenever the same document-id pops out 'N' times in a row,
it is present in all the documents' lists.
The idea is similar to merging 'N' sorted arrays.
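The heap approach above can be sketched with `heapq` (my code; again assuming ids within each list are unique, so "N pops in a row" really means all N lists contain the id):

```python
import heapq

def intersect_heap(lists):
    """Intersect N sorted lists using a min-heap of
    (value, list_index, element_index) tuples."""
    n = len(lists)
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    if len(heap) < n:
        return []  # any empty list makes the intersection empty
    heapq.heapify(heap)
    out = []
    prev, run = None, 0
    while heap:
        val, li, ei = heapq.heappop(heap)
        run = run + 1 if val == prev else 1
        prev = val
        if run == n:          # popped from all N lists in a row
            out.append(val)
        if ei + 1 < len(lists[li]):
            # advance the list the popped element came from
            heapq.heappush(heap, (lists[li][ei + 1], li, ei + 1))
    return out
```

This runs in O(S log N) for S total elements, versus O(S log k) for the recursive merge; its advantage is streaming the lists with only O(N) extra memory.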
2) Improvements when one array is small and the other is relatively big:
Binary search each element of the smaller array in the bigger one.
Complexity: O(m log N), where m << N.
3) Distribute computation across set of machines:
Assuming there are 'M' machines available:
assign N/M arrays (provided M < N) to each of them to intersect.
Now wait for each machine's intersection list.
Repeat the same distribution steps, until you get one final list i.e. list of document ids.
Each document id in this list will contain all the search terms.
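The repeated rounds of distribution can be simulated on one machine (a sketch with made-up names; each group of lists stands in for the work sent to one machine):

```python
from functools import reduce

def distributed_intersect(lists, machines):
    """Simulate rounds of distributed intersection: each round splits the
    current lists among the machines, each machine intersects its share,
    and the partial results feed the next round until one list remains."""
    def intersect_two(a, b):
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    while len(lists) > 1:
        # ceil(len/machines) lists per machine, at least 2 so we progress
        size = max(2, -(-len(lists) // machines))
        lists = [reduce(intersect_two, lists[s:s + size])
                 for s in range(0, len(lists), size)]
    return lists[0]
```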