Google Interview Question
Software Engineer / Developers
Interview Type: In-Person
1) For each keyword, extract the vector of document-ids; each vector is sorted.
2) For the k lists extracted, perform a modified k-way merge algorithm.
3) The k-way merge is recursive, like merge sort: compute the intersection of the first half of the lists and of the second half, then merge the two partial results into the final result.
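A minimal Python sketch of the recursive scheme above (names are mine, not from the post); the two-list step is a linear merge-style scan that keeps only common elements:

```python
def intersect_two(a, b):
    """Intersect two sorted lists with a linear merge-style scan."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_k(lists):
    """Recursively intersect k sorted lists, merge-sort style:
    solve each half, then combine the two partial results."""
    if len(lists) == 1:
        return lists[0]
    mid = len(lists) // 2
    return intersect_two(intersect_k(lists[:mid]), intersect_k(lists[mid:]))
```

As in merge sort, each of the O(log k) levels scans every element once, so the total work is O(S log k) for S total elements.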
This is straightforward to solve with map-reduce. The extracted lists are distributed across n nodes. Each node runs a map operation that emits a pair <value, list_id> for every document-id. The reducer collects all pairs with the same value and counts how many distinct lists the value came from; if that count equals k, it outputs the value, i.e. the document-id.
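A single-machine simulation of the map/shuffle/reduce phases just described (a sketch, not real MapReduce API code):

```python
from collections import defaultdict

def map_reduce_intersect(lists):
    """Simulate map, shuffle, and reduce on one machine."""
    k = len(lists)
    # Map phase: each list emits (doc_id, list_id) pairs.
    pairs = [(doc_id, list_id)
             for list_id, docs in enumerate(lists)
             for doc_id in docs]
    # Shuffle phase: group the pairs by doc_id.
    groups = defaultdict(set)
    for doc_id, list_id in pairs:
        groups[doc_id].add(list_id)
    # Reduce phase: a doc_id seen in all k lists is in the intersection.
    return sorted(d for d, ids in groups.items() if len(ids) == k)
```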
How to improve if there are only two arrays and one is much smaller compared to the other?
-- Say the sizes are m (A[]) and n (B[]), with m << n.
Sort the smaller array A. For every element of B, binary search for it in A. Complexity: O(n log m + m log m) = O((n + m) log m); if m << n, this is almost O(n). (Assuming the arrays are not sorted to begin with.)
If they are already sorted as given:
O(n log m). We can use a modified binary search, where the portion of A still to be searched keeps shrinking past index x, where x is the index of the last found intersection, with a termination condition: stop as soon as A's last element is smaller than B[i]. Very close to O(n).
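A sketch of that shrinking-window binary search (my code; it assumes document-ids within each list are unique, which holds for posting lists):

```python
from bisect import bisect_left

def intersect_shrinking(a, b):
    """Intersect sorted a (small) with sorted b (large), binary-searching
    each element of b in a suffix of a that shrinks as matches are found."""
    out = []
    lo = 0  # left edge of the remaining search window in a
    for x in b:
        if lo >= len(a) or a[-1] < x:
            break  # nothing left in a can match any later element of b
        i = bisect_left(a, x, lo)
        if i < len(a) and a[i] == x:
            out.append(x)
            lo = i + 1  # later elements of b are larger, skip a[:i+1]
    return out
```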
How to distribute this on a set of different machines?
--Map Reduce.
Divide the input data (total size, n) into k parts. -> k mappers -> Every mapper will find intersection of n/k sorted arrays.
Reducer will then combine the results from these k mappers and produce final output.
Basically you take advantage of parallelism. It is only recommended for large data sets, since it involves network latency. For small data sets, this might actually increase the total running time, because of several latencies involved
Wouldn't it make more sense to do the reverse and do a binary search in B for every element in A?
That way the time complexity goes down to O(m log n), which is much better than O(n log m) since n >> m.
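That reversed direction is a one-liner with the standard library (sketch, my naming):

```python
from bisect import bisect_left

def intersect_small_in_big(a, b):
    """For each element of the small sorted array a, binary search the
    large sorted array b: O(m log n) with m = len(a), n = len(b)."""
    return [x for x in a
            if (i := bisect_left(b, x)) < len(b) and b[i] == x]
```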
1) How to find intersection across 'N' sorted arrays:
-- mytestaccount2, September 21, 2013
We can make use of a min-heap.
Insert A1[0], A2[0], ..., An[0] into the min-heap.
While the min-heap is not empty, keep removing the minimum document-id from it,
then push the next element of the array Ai the removed element came from.
Whenever the same document-id pops out 'N' times in a row,
it is present in all the documents' lists.
The idea is similar to merging 'N' sorted arrays.
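The heap approach above can be sketched with `heapq` (my code; again assuming ids within each list are unique, so "N pops in a row" really means all N lists contain the id):

```python
import heapq

def intersect_heap(lists):
    """Intersect N sorted lists using a min-heap of
    (value, list_index, element_index) tuples."""
    n = len(lists)
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    if len(heap) < n:
        return []  # any empty list makes the intersection empty
    heapq.heapify(heap)
    out = []
    prev, run = None, 0
    while heap:
        val, li, ei = heapq.heappop(heap)
        run = run + 1 if val == prev else 1
        prev = val
        if run == n:          # popped from all N lists in a row
            out.append(val)
        if ei + 1 < len(lists[li]):
            # advance the list the popped element came from
            heapq.heappush(heap, (lists[li][ei + 1], li, ei + 1))
    return out
```

This runs in O(S log N) for S total elements, versus O(S log k) for the recursive merge; its advantage is streaming the lists with only O(N) extra memory.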
2) Improvements when one array is small and the other is relatively big:
Binary search each element of the smaller array in the bigger one.
Complexity: O(m log N), where m << N.
3) Distribute computation across set of machines:
Assuming there are 'M' machines available:
assign N/M arrays (provided M < N) to each of them to intersect.
Now wait for each machine's intersection list.
Repeat the same distribution steps, until you get one final list i.e. list of document ids.
Each document id in this list will contain all the search terms.
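The repeated rounds of distribution can be simulated on one machine (a sketch with made-up names; each group of lists stands in for the work sent to one machine):

```python
from functools import reduce

def distributed_intersect(lists, machines):
    """Simulate rounds of distributed intersection: each round splits the
    current lists among the machines, each machine intersects its share,
    and the partial results feed the next round until one list remains."""
    def intersect_two(a, b):
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    while len(lists) > 1:
        # ceil(len/machines) lists per machine, at least 2 so we progress
        size = max(2, -(-len(lists) // machines))
        lists = [reduce(intersect_two, lists[s:s + size])
                 for s in range(0, len(lists), size)]
    return lists[0]
```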