Given a document and a query o

Google Interview Question for Software Engineer / Developers

2

of 2 votes

46
Answers
Given a document and a query of K words, how do u find the smallest window that covers all the words at least once in that document? (given you know the inverted lists of all K words, that is, for each word, you have a list of all its occurrrences). This one is really hard. Could someone propose an algorithm in O(n)?
- Anonymous November 11, 2009 | Report Duplicate | Flag | PURGE
Google Software Engineer / Developer Algorithm

Email me when people comment.

An error occurred in subscribing you.

More Questions from This Interview

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 0 vote

Suppose N(i) gives the occurrence count of i-th word, i going from 1 to K words.
The total number of possible window sizes is the product P(N(i) for i= 1..K).
Go through each of the possible windows and find the one(s) where the difference
of the largest offset and smallest offset is the lowest.
So the algorithm depends on the number of times the K words occur, and not the document size. It can stop once it finds a window size that equals K, or continue to find other windows of the same size.

- Anonymous November 15, 2009 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

It's rather inefficient if we have to consider all possible windows. For instance, suppose K=3 and we have the following occurrences:
{1}
{2}
{3, 4, 5, 6, 7}

If we hold the occurrence of the first two words constant (ie. 1 & 2), we shouldn't have to consider all those 4,5,6,7 once we have considered 3 for the last word...

- Sunny December 23, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

Example: Suppose the document is Z A Y B C X Z D E F Y G, and you want to find smallest window of XYZ. The inverted index list is X = {5}, Y = {2, 10}, Z = {0, 6}
The total number of windows is the product 1 x 2 x 2 = 4. They are (5, 2, 0)-> size 6,
(5, 2, 6) -> size 5, (5, 10, 0) -> size 6, (5, 10, 6) -> size 6. The smallest window is (5, 2, 6) i.e. from 2 (Y) to 6 (Z) -> "Y B C X Z". I think this is correct.:-,

- Anonymous November 15, 2009 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

the size of (5,10,0) -> 11 not 6

- anony December 21, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

That solution takes O(<Avg. Occurances>^K)...cleanly in the NP arena. Probably not a workable solution. What optimizations can you make?

- trey November 15, 2009 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

One optimization I can think of but not sure is as follows:
For K words, the minimum window size is K. So find the shortest occurrence list, then
for each offsets O, look for offsets in others words (K - 1) to be within 0 - K < N < O + K. That is a 2K window size. If you find an occurrence of all the words in that window,
then eliminate the rest, and find the set with lowest window size. So in the
above example, K = 3, X is the shortest occurring word. For O = 5, find offsets of Y and Z between 2 (5 - 3) and 8 (5 + 3). The only set, in this example, is Y = 2, and Z = 6. So the answer is X = 5, Y = 2, Z = 6, with window size of 5, which is K + 2. If no offsets are found in the range, increase the bound to O - 2K < N < O + 2K..and so on.

- Anonymous November 17, 2009 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

for given ex:
w1: 1, 1000, 2500
w2: 400, 1001, 2000
w3: 500, 1002, 1500

keep 2 arrays negdev (negative deviation from base) and posdev(positive deviation from base) of size 3 (same as size of base array) along with base array which is same as w1:
base=[1, 1000,2500] same as w1 ()
negdev=[0,0,0] and posdev=[0,0,0] (zero initialization)
now calculate both arrays traversing through the k array of occurances of given words , here w1,w2 and w3
negdev(i)= maximum negative deviation from base[i] to find all k words
posdev(i)= maximum positive deviation from base[i] to find all k words
so after traversal we get
posdev=[499,2,0]
and negdev=[0,0,1000]
now calculate range array range(i)=posdev(i)+negdev(i)
so range=[499,2,1000]
so minvalue of range will be the answer.
time complexity : O(n)
space complexity: 2 arrays of size of w1 (if you want to minimize this, then instead of taking w1 as base first find base in set of k array , word which has minimum occurances)

- Anonymous November 22, 2009 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

I don't see you can do it in time O(n). For each occurrence in w1, you have go through all occurrences in w2 and w3 to get max deviation. You can sort, then binary search the max deviation, but O(n) is impossible.

- Anonymous December 03, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

I was wrong. You were right. O(n) is the one.

- Anonymous December 03, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Still don't see how this is linear. To me it seems that, as you rightfully pointed out, while scanning through w1 is indeed linear, you need to do O(n^(m-1)) operations (worst case) to exhaustively search for minimal spans.

- DMB December 04, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

i dont think this solution will work when the frequency of occurrence of words isn't the same. Can you account for that?

- dazzeled2011 December 24, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

can be solved in O(nlogK).

- hero December 08, 2009 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Use dynamic programming, and it can be solved with O(nk) time and O(k) space.

- nec December 08, 2009 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

here is my take and quite certain its not accomplished in O(nlogn) but worth a shot.

create a kxk graph where the edges signify the minimum distance from each word to the other. ( this part needs the optimization)

once the graph is generated, compute the minimum spanning tree and the resultant tree will be your desired positions.

what now needs to be done is figure out how to efficiently generate the KxK graph.

cheers

- fragzilla December 11, 2009 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

I don't understand what your MST is supposed to accomplish, even if I give you the matrix. How does a tree in your scheme even translate into a window, let alone a minimum one?

- WTF January 02, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

let me try to explain this.

you have a list of words and the list of distance from start.

abc - 2 , 3 , 15, 19 , 24
def - 4 , 5, 10 , 28
ghi - 12,33,55

random values
etc.

these are distances from the start. let each word represent a node. let each node have an edge to each other node. the number of edges from one node to another node will be more than 1 (because of multiple occurrence of the words in the text).

compute the difference for each word distance with regard to every other word distance and this difference will be the distance on the edge for that word (for optimality we can take the minimum distance for two words and just add one edge between them) .

as a result you have a graph where all nodes are connected with each other with the relative distance. with this we have the information on how close the words are to each other.

now to find the smallest window, we need to find the edges that would connect all the nodes(words) such that the weight of the edges will be the smallest. this translates as finding an mst for the graph.

- fragzilla January 07, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Ok let's use the 3-node example you have given:

min_dist between abc and def is between 3 and 4 = 1. min_dist between abc and ghi is between 12 and 15 = 3. min_dist between def and ghi is between 12 and 10 = 2. The MST in this case will of course be a length two path going from abc to def to ghi i.e. sum 1+2 = 3. But what is the window then? Recall that abc to def is (3,4) and ghi to def is (10,12). So I'm at a loss as to what your window is. Surely it can't be (3,12)?

Maybe I'm missing something, but your latest explanation still does not make clear how MST gives the desired minimum window, or even a window at all.

- WTF January 07, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

Here is a solution I have come up with:

Let a1, a2, a3 and so on be the array that holds the position of occurrence of each word.
k1, k2, k3 and so on be the pointers to the start of each array.
Initialize the minimum_window = N, where N is the number of words in the input
Follow the steps until the pointer reach the end of each array.
1. Compute the value: largest-smallest+1 for each pointer. This is the current window. If this value is less than the minimum_window then update the minimum window and temp_result = current positions of the pointers.
2. Check if all the pointers pointing to the last position in respective array. If yes, minimum_window is the smallest window and If not , increment the pointer with the minimum value.

Result: current positions of the pointers in the temp_result and size is minimum_window.

- bbb January 10, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

This works.

- Mahesh April 14, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

I don't see how it's O(n). Suppose you have k words and the each word occurs m times, to find the "largest" and "smallest", you have to scan k arrays. Then, perhaps you have to move pointers m times, then it's at least O(k*m).

Please clarify.

- peng April 16, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

This is for generating a snippet of a document.

- Anonymous February 22, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

As all the inverted lists are sorted, lets suppose there are k of them.
Now build a min-heap with the first index of all the k lists and also keep track of the max element. Now check the value of max - min and store it.
Now keep reading the indexes from each list, and check if its index > min value, then insert it into the heap and pop the min element. Keep doing this till you exhaust elements in one of the lists.

- Messi April 18, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Imagine the entries in the lists as points in 2-D planes, where Y axis = the word and X axis = position . Imagine the posting lists for I am one

I- > 1 5 7 11
am -> 3 6 9 12
one -> 7 8 99

Sort all the entries - so u get
1 3 5 6 7 8 9 11 12 99 in o(n) time as the lists are already sorted

Now, we just have to come up with a window having 3 non-similar elements. As there can be max n-k (where k=no of term we are looking it, here 3), looking at those windows and determining the min window shouldn't take more than o(n) time

Does this seem too easy? Am I missing something?

- godlIkeNoob April 25, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

This seems good. You can use hash table to find which index corresponds to which word

- KC May 24, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

come on, this is not o(N), it is o(NlogK)...

- Anonymous August 06, 2010 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

You cannot sort the lists in O(n) time, can you if so please give the algo.

- krishna June 27, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

^ given that the arrays are already sorted, you can merge sort them in O(n) time.

- rlee June 13, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

down vote My solution:
a. Create a hash table with m keys, one for each value in B. Each key in H maps to a dynamic array of sorted indices containing indices in A that are equal to B[i]. This takes O(n) time. We go through each index j in A. If key A[i] exists in H (O(1) time) then add an value containing the index j of A to the list of indices that H[A[i]] maps to.
At this point we have 'binned' n elements into m bins. However, total storage is just O(n).
b. The 2nd part of the algorithm involves maintaining a ‘left’ index and a ‘right’ index for each list in H. Lets create two arrays of size m called L and R that contain these values. Initially in our example,
We also keep track of the “best” minimum window.
We then iterate over the following actions on L and R which are inherently greedy: i. In each iteration, we compute the minimum and maximum values in L and R. For L, Lmax - Lmin is the window and for R, Rmax - Rmin is the window. We update the best window if one of these windows is better than the current best window. We use a min heap to keep track of the minimum element in L and a max heap to keep track of the largest element in R. These take O(m*log(m)) time to build. ii. From a ‘greedy’ perspective, we want to take the action that will minimize the window size in each L and R. For L it intuitively makes sense to increment the minimum index, and for R, it makes sense to decrement the maximum index.
We want to increment the array position for the minimum value until it is larger than the 2nd smallest element in L, and similarly, we want to decrement the array position for the largest value in R until it is smaller than the 2nd largest element in R.
Next, we make a key observation:
If L[i] is the minimum value in L and R[i] is less than the 2nd smallest element in L, ie, if R[i] were to still be the minimum value in L if L[i] were replaced with R[i], then we are done. We now have the “best” index in list i that can contribute to the minimum window. Also, all the other elements in R cannot contribute to the best window since their L values are all larger than L[i]. Similarly if R[j] is the maximum element in R and L[j] is greater than the 2nd largest value in R, we are also done by setting R[j] = L[j]. Any other index in array i to the left of L[j] has already been accounted for as have all indices to the right of R[j], and all indices between L[j] and R[j] will perform poorer than L[j].
Otherwise, we simply increment the array position L[i] until it is larger than the 2nd smallest element in L and decrement array position R[j] (where R[j] is the max in R) until it is smaller than the 2nd largest element in R. We compute the windows and update the best window if one of the L or R windows is smaller than the best window. We can do a Fibonacci search to optimally do the increment / decrement. We keep incrementing L[i] using Fibonacci increments until we are larger than the 2nd largest element in L. We can then perform binary search to get the smallest element L[i] that is larger than the 2nd largest element in L, similar for the set R. After the increment / decrement, we pop the largest element from the max heap for R and the minimum element for the min heap for L and insert the new values of L[i] and R[j] into the heaps. This is an O(log(m)) operation.
Step ii. would terminate when Lmin can’t move any more to the right or Rmax can’t move any more to the left (as the R/L values are the same). Note that we can have scenarios in which L[i] = R[i] but if it is not the minimum element in L or the maximum element in R, the algorithm would still continue.
Runtime analysis: a. Creation of the hash table takes O(n) time and O(n) space. b. Creation of heaps: O(m*log(m)) time and O(m) space. c. The greedy iterative algorithm is a little harder to analyze. Its runtime is really bounded by the distribution of elements. Worst case, we cover all the elements in each array in the hash table. For each element, we perform an O(log(m)) heap update.
Worst case runtime is hence O(n*log(m)) for the iterative greedy algorithm. In the best case, we discover very fast that L[i] = R[i] for the minimum element in L or the maximum element in R…run time is O(1)*log(m) for the greedy algorithm!
Average case seems really hard to analyze. What is the average “convergence” of this algorithm to the minimum window. If we were to assume that the Fibonacci increments / binary search were to help, we could say we only look at m*log(n/m) elements (every list has n/m elements) in the average case. In that case, the running time of the greedy algorithm would be m*log(n/m)*log(m).
Total running time Best case: O(n + m*log(m) + log(m)) time = O(n) assuming m << n Average case: O(n + m*log(m) + m*log(n/m)*log(m)) time = O(n) assuming m << n. Worst case: O(n + n*log(m) + m*log(m)) = O(n*log(m)) assuming m << n.
Space: O(n + m) (hashtable and heaps) always.
Edit: Here is a worked out example:
A[5, 1, 1, 5, 6, 1, 1, 5] B[5, 6]
H: { 5 => {1, 4, 8} 6 => {5} }
Greedy Algorithm:
L => {1, 1} R => {3, 1}
Iteration 1: a. Lmin = 1 (since H{5}[1] < H{6}[1]), Lmax = 5. Window: 5 - 1 + 1= 5 Increment Lmin pointer, it now becomes 2.
L => {2, 1}
Rmin = H{6}[1] = 5, Rmax = H{5}[3] = 8. Window = 8 - 5 + 1 = 4. Best window so far = 4 (less than 5 computed above). We also note the indices in A (5, 8) for the best window.
Decrement Rmax, it now becomes 2 and the value is 4.
R => {2, 1}
b. Now, Lmin = 4 (H{5}[2]) and the index i in L is 1. Lmax = 5 (H{6}[1]) and the index in L is 2. We can't increment Lmin since L[1] = R[1] = 2. Thus we just compute the window now.
The window = Lmax - Lmin + 1 = 2 which is the best window so far.
Thus, the best window in A = (4, 5).

- Arnab December 08, 2010 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

public class Solution{

public int smallWindow(ArrayList<ListNode<Integer>> arr){
Comparator<ListNode> com = new Comparator<ListNode>() {
@Override
public int compare(ListNode n1 , ListNode n2)
{
if(n1.val < n2.val) return -1;
if(n1.val > n2.val ) return 1;
return 0;
}
};
PriorityQueue<ListNode> qe = new PriorityQueue<ListNode>(arr.size(),com);
int max = 0 , window_len = Integer.MAX_VALUE;
for(ListNode a : arr){
qe.offer(a);
max = Math.max(a.val , max);
}
while( qe.peek().next !=null ){
ListNode min_node = qe.poll();
window_len = Math.min(window_len, max - min_node.val);
qe.offer(min_node.next);
max = Math.max(min_node.next.val , max);
}
return window_len;
}
}

- huihancarl September 02, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

My idea is:
1 keep a min heap and the max value,
2 each time choose the smallest one to cal the window size and update it to its next
3 O(n)time the min windows size is come out when the small one don't have next appearance.
Am I right? I don't think it for many times

- huihancarl September 02, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Can be done in O(NlogK). Where N = total number of frequencies combined and K is the number of words.

For example:
W1: [4, 10, 15, 24, 26]
W2: [0, 3, 9, 12, 20]
W3: [5, 18, 22, 30]

The result is [3,5]

1. Use a minHeap and apply the similar logic of merging K sorted arrays. The structure of heap node is as follows.

class HeapNode {
	private int wordArrayIndex;
	private int data;
	private int currentPosInArray;
}

2. Maintain a currMin and currMax to identify the current minimum and maximum values in the minHeap. Note that, as every node in the heap stores frequency of a unique word, which means all words will lie between [currMin,currMax]
3. currMin = heap[0].data and currMax = max(arr[wordArrayIndex][currentPosInArray],currMax)
4. In every loop check if(currMax-currMin) is minimum or not.

Dry Run:

W1: [4, 10, 15, 24, 26]
W2: [0, 3, 9, 12, 20]
W3: [5, 18, 22, 30]
W4: [11, 21]

The result is [20,24]

1. Heap = {0,4,5,11}, currMin=0, currMax=11 result = [0,11]
2. Heap = {3,4,5,11}, currMin=3, currMax=11 result = [3,11]
3. Heap = {4,9,5,11}, currMin=4, currMax=11 result = [4,11]
3. Heap = {5,9,10,11}, currMin=5, currMax=11 result = [5,11]
4. Heap = {9,11,10,18}, currMin=9, currMax=18 result = [5,11]
5. Heap = {10,11,12,18}, currMin=10, currMax=18 result = [5,11]
6. Heap = {11,15,12,18}, currMin=11, currMax=18 result = [5,11]
7. Heap = {12,15,18,21}, currMin=12, currMax=21 result = [5,11]
7. Heap = {15,20,18,21}, currMin=15, currMax=21 result = [3,5]
7. Heap = {20,22,21,24}, currMin=20, currMax=24 result = [20,24]

Loop breaks as heap[w1].currentPosInArray >= w1.length

Result = [20,24]

The entire solution can be viewed here https://github.com/sohamchakravarty/InterviewPreparation/tree/master/rangeOfFrequencyOfAllWords

- coderBon May 12, 2016 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Take all the word references in an array or ArrayList, sort it. O(nlongn). Loop through each element of the array, and find the smallest window (i.e find the first references, that makes up all the words, put the difference between the max (last reference found and the first) as min distance, save the start and end point of the min distance. Return it, if no better window can be found.

- Joy June 24, 2016 | Flag Reply

Comment hidden because of low score. Click to expand.

-1

of 1 vote

very easy. I hope I will have such easy questions in my interview.

You can do it in one scan of K arrays, O(KN). Each array records the integer positions of every occurrence of a word.

- geniusxsy November 11, 2009 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

please at least leave the idea. What you said is the information I give u.

- Anonymous November 11, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Hint:

1. use k pointers to iterate through each array
2. at a single step, you can only increment one pointer
3. so, which one? you should be able to find some clue if you draw a simple sketch of multiple arrays on several parallel number axises

- geniusxsy November 11, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

geniusxsy,
please give the pseudo code

- Anonymous November 12, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

geniusxsy,
please give the pseudo code

- Anonymous November 12, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

The interviewer asked for a O(n) solution. You guys should find better approach.

- Anonymous November 12, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

geniusxsy is the dumbest person to visit this forum.

- RSS January 06, 2011 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

You are given the occurrences of each of those K words. So essentially you have K variable-size arrays (ArrayList in Java). The first valid window we can construct is the min & max among the words' first occurrences.

So say K=3 and the occurrences are:
{1, 7}
{8, 12}
{9}

The first window will be [1, 9]. To construct the next window, we can either use the 7 from first word or 12 from second word. Since our goal is to find the smallest window, we should pick whatever word is at the start of the current window (in this case the first word) and consider its next occurrence (in this case 7). Our new window is now [7, 9] and you get the idea.

- Sunny December 23, 2012 | Flag

Comment hidden because of low score. Click to expand.

-1

of 1 vote

By the way, I heard this was (also) an MS interview problem?

- geniusxsy November 11, 2009 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

I was asked this question google more than 4yrs back.

- Anonymous November 12, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

I have a solution:
1. first, combine the inverted lists of all K words into a array A(length l), and sort it, and we suppose the initial smallest window is [A[0], A[l-1]],
2. next, use a cursor on the beginning point of the array, one by one, check whether the array element is the last occurrence for specific Word, e.g. if A[0] is one of occurrences of Word W1, check if it is the last occurence, if not, move the cursor to the next element, else stop at pos1
3. Do the similar job like Step 2, use another cursor from the last element of the array. Stop at pos2
4. The smallest window is [pos1, pos2]

- alex November 12, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Improved:
1. let K pointers point to the start point of K arrays(which each contain all the occurrences of specific word), get the minimum value of all this K values, check whether the min value is the last element of the containing array. If yes, stop at this point, the position of this occurrences is pos1.If not, move this pointer to the next, and do the same job again.
2. let K pointers pointing to the last element of K arrays, get the maximum value ,check whether it is the first one of the array whose element's value bigger than pos1. If not, move to previous element, else stop. Get the position, pos2.
We get the smallest window is [pos1, pos2]

- alex November 12, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@alex
I am still not able to visualize the solution. Could you please try explaining it taking an example or something

- amit November 14, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

alex, does your method work for below example?

w1: 1, 1000, 2500
w2: 400, 1001, 2000
w3: 500, 1002, 1500

- Anonymous November 15, 2009 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

alex, does your method work for below example?

w1: 1, 1000, 2500
w2: 400, 1001, 2000
w3: 500, 1002, 1500

- Anonymous November 15, 2009 | Flag

CareerCup

Google Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews