| Given a document and a query o... | |||||||
|
30 Day Risk-Free Guarantee:
100% money back if you're unsatisfied. Book (308 Pages):
![]() Video (One Hour):
![]() Resume Review (24 - 48hr)
All Products / Services
|
|||||||
Given a document and a query of K words, how do u find the smallest window that covers all the words at least once in that document? (given you know the inverted lists of all K words, that is, for each word, you have a list of all its occurrrences). This one is really hard. Could someone propose an algorithm in O(n)?
36
please at least leave the idea. What you said is the information I give u.
Hint:
1. use k pointers to iterate through each array
2. at a single step, you can only increment one pointer
3. so, which one? you should be able to find some clue if you draw a simple sketch of multiple arrays on several parallel number axises
geniusxsy,
please give the pseudo code
geniusxsy,
please give the pseudo code
The interviewer asked for a O(n) solution. You guys should find better approach.
By the way, I heard this was (also) an MS interview problem?
I was asked this question google more than 4yrs back.
I have a solution:
1. first, combine the inverted lists of all K words into a array A(length l), and sort it, and we suppose the initial smallest window is [A[0], A[l-1]],
2. next, use a cursor on the beginning point of the array, one by one, check whether the array element is the last occurrence for specific Word, e.g. if A[0] is one of occurrences of Word W1, check if it is the last occurence, if not, move the cursor to the next element, else stop at pos1
3. Do the similar job like Step 2, use another cursor from the last element of the array. Stop at pos2
4. The smallest window is [pos1, pos2]
Improved:
1. let K pointers point to the start point of K arrays(which each contain all the occurrences of specific word), get the minimum value of all this K values, check whether the min value is the last element of the containing array. If yes, stop at this point, the position of this occurrences is pos1.If not, move this pointer to the next, and do the same job again.
2. let K pointers pointing to the last element of K arrays, get the maximum value ,check whether it is the first one of the array whose element's value bigger than pos1. If not, move to previous element, else stop. Get the position, pos2.
We get the smallest window is [pos1, pos2]
@alex
I am still not able to visualize the solution. Could you please try explaining it taking an example or something
alex, does your method work for below example?
w1: 1, 1000, 2500
w2: 400, 1001, 2000
w3: 500, 1002, 1500
alex, does your method work for below example?
w1: 1, 1000, 2500
w2: 400, 1001, 2000
w3: 500, 1002, 1500
Suppose N(i) gives the occurrence count of i-th word, i going from 1 to K words.
The total number of possible window sizes is the product P(N(i) for i= 1..K).
Go through each of the possible windows and find the one(s) where the difference
of the largest offset and smallest offset is the lowest.
So the algorithm depends on the number of times the K words occur, and not the document size. It can stop once it finds a window size that equals K, or continue to find other windows of the same size.
Example: Suppose the document is Z A Y B C X Z D E F Y G, and you want to find smallest window of XYZ. The inverted index list is X = {5}, Y = {2, 10}, Z = {0, 6}
The total number of windows is the product 1 x 2 x 2 = 4. They are (5, 2, 0)-> size 6,
(5, 2, 6) -> size 5, (5, 10, 0) -> size 6, (5, 10, 6) -> size 6. The smallest window is (5, 2, 6) i.e. from 2 (Y) to 6 (Z) -> "Y B C X Z". I think this is correct.:-,
the size of (5,10,0) -> 11 not 6
That solution takes O(<Avg. Occurances>^K)...cleanly in the NP arena. Probably not a workable solution. What optimizations can you make?
One optimization I can think of but not sure is as follows:
For K words, the minimum window size is K. So find the shortest occurrence list, then
for each offsets O, look for offsets in others words (K - 1) to be within 0 - K < N < O + K. That is a 2K window size. If you find an occurrence of all the words in that window,
then eliminate the rest, and find the set with lowest window size. So in the
above example, K = 3, X is the shortest occurring word. For O = 5, find offsets of Y and Z between 2 (5 - 3) and 8 (5 + 3). The only set, in this example, is Y = 2, and Z = 6. So the answer is X = 5, Y = 2, Z = 6, with window size of 5, which is K + 2. If no offsets are found in the range, increase the bound to O - 2K < N < O + 2K..and so on.
for given ex:
w1: 1, 1000, 2500
w2: 400, 1001, 2000
w3: 500, 1002, 1500
keep 2 arrays negdev (negative deviation from base) and posdev(positive deviation from base) of size 3 (same as size of base array) along with base array which is same as w1:
base=[1, 1000,2500] same as w1 ()
negdev=[0,0,0] and posdev=[0,0,0] (zero initialization)
now calculate both arrays traversing through the k array of occurances of given words , here w1,w2 and w3
negdev(i)= maximum negative deviation from base[i] to find all k words
posdev(i)= maximum positive deviation from base[i] to find all k words
so after traversal we get
posdev=[499,2,0]
and negdev=[0,0,1000]
now calculate range array range(i)=posdev(i)+negdev(i)
so range=[499,2,1000]
so minvalue of range will be the answer.
time complexity : O(n)
space complexity: 2 arrays of size of w1 (if you want to minimize this, then instead of taking w1 as base first find base in set of k array , word which has minimum occurances)
I don't see you can do it in time O(n). For each occurrence in w1, you have go through all occurrences in w2 and w3 to get max deviation. You can sort, then binary search the max deviation, but O(n) is impossible.
I was wrong. You were right. O(n) is the one.
Still don't see how this is linear. To me it seems that, as you rightfully pointed out, while scanning through w1 is indeed linear, you need to do O(n^(m-1)) operations (worst case) to exhaustively search for minimal spans.
can be solved in O(nlogK).
Use dynamic programming, and it can be solved with O(nk) time and O(k) space.
here is my take and quite certain its not accomplished in O(nlogn) but worth a shot.
create a kxk graph where the edges signify the minimum distance from each word to the other. ( this part needs the optimization)
once the graph is generated, compute the minimum spanning tree and the resultant tree will be your desired positions.
what now needs to be done is figure out how to efficiently generate the KxK graph.
cheers
I don't understand what your MST is supposed to accomplish, even if I give you the matrix. How does a tree in your scheme even translate into a window, let alone a minimum one?
let me try to explain this.
you have a list of words and the list of distance from start.
abc - 2 , 3 , 15, 19 , 24
def - 4 , 5, 10 , 28
ghi - 12,33,55
random values
etc.
these are distances from the start. let each word represent a node. let each node have an edge to each other node. the number of edges from one node to another node will be more than 1 (because of multiple occurrence of the words in the text).
compute the difference for each word distance with regard to every other word distance and this difference will be the distance on the edge for that word (for optimality we can take the minimum distance for two words and just add one edge between them) .
as a result you have a graph where all nodes are connected with each other with the relative distance. with this we have the information on how close the words are to each other.
now to find the smallest window, we need to find the edges that would connect all the nodes(words) such that the weight of the edges will be the smallest. this translates as finding an mst for the graph.
Ok let's use the 3-node example you have given:
min_dist between abc and def is between 3 and 4 = 1. min_dist between abc and ghi is between 12 and 15 = 3. min_dist between def and ghi is between 12 and 10 = 2. The MST in this case will of course be a length two path going from abc to def to ghi i.e. sum 1+2 = 3. But what is the window then? Recall that abc to def is (3,4) and ghi to def is (10,12). So I'm at a loss as to what your window is. Surely it can't be (3,12)?
Maybe I'm missing something, but your latest explanation still does not make clear how MST gives the desired minimum window, or even a window at all.
Here is a solution I have come up with:
Let a1, a2, a3 and so on be the array that holds the position of occurrence of each word.
k1, k2, k3 and so on be the pointers to the start of each array.
Initialize the minimum_window = N, where N is the number of words in the input
Follow the steps until the pointer reach the end of each array.
1. Compute the value: largest-smallest+1 for each pointer. This is the current window. If this value is less than the minimum_window then update the minimum window and temp_result = current positions of the pointers.
2. Check if all the pointers pointing to the last position in respective array. If yes, minimum_window is the smallest window and If not , increment the pointer with the minimum value.
Result: current positions of the pointers in the temp_result and size is minimum_window.
This works.
I don't see how it's O(n). Suppose you have k words and the each word occurs m times, to find the "largest" and "smallest", you have to scan k arrays. Then, perhaps you have to move pointers m times, then it's at least O(k*m).
Please clarify.
This is for generating a snippet of a document.
As all the inverted lists are sorted, lets suppose there are k of them.
Now build a min-heap with the first index of all the k lists and also keep track of the max element. Now check the value of max - min and store it.
Now keep reading the indexes from each list, and check if its index > min value, then insert it into the heap and pop the min element. Keep doing this till you exhaust elements in one of the lists.
Imagine the entries in the lists as points in 2-D planes, where Y axis = the word and X axis = position . Imagine the posting lists for I am one
I- > 1 5 7 11
am -> 3 6 9 12
one -> 7 8 99
Sort all the entries - so u get
1 3 5 6 7 8 9 11 12 99 in o(n) time as the lists are already sorted
Now, we just have to come up with a window having 3 non-similar elements. As there can be max n-k (where k=no of term we are looking it, here 3), looking at those windows and determining the min window shouldn't take more than o(n) time
Does this seem too easy? Am I missing something?
This seems good. You can use hash table to find which index corresponds to which word
come on, this is not o(N), it is o(NlogK)...
very easy. I hope I will have such easy questions in my interview.
You can do it in one scan of K arrays, O(KN). Each array records the integer positions of every occurrence of a word.