Google Interview Question
Software Engineer / Developer, Country: United States
1. How are you certain you did not pass by the optimal solution? Just asking; it is not immediately obvious to me.
2. Knowing Google, they would probably ask something like: what if some arrays are much larger than the others? Potentially, the smallest list can be walked and the larger ones can be binary-search-jumped.
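In case it helps, here is a rough C++ sketch (mine, with made-up names) of that binary-search-jump idea layered on the usual k-pointer sweep: instead of advancing the smallest pointer one step at a time, jump it with lower_bound to the last position that is still below the second-smallest current value; every skipped position gives a strictly worse range, so nothing is lost. Assumes every list is non-empty and sorted.

#include <algorithm>
#include <climits>
#include <vector>
using namespace std;

int leastDistanceGalloping(const vector<vector<int> >& lists) {
    size_t k = lists.size();
    if (k < 2) return 0;
    vector<size_t> ptr(k, 0);
    int best = INT_MAX;
    while (true) {
        int minVal = INT_MAX, second = INT_MAX, maxVal = INT_MIN;
        size_t minIdx = 0;
        for (size_t w = 0; w < k; ++w) {
            int v = lists[w][ptr[w]];
            maxVal = max(maxVal, v);
            if (v < minVal) { second = minVal; minVal = v; minIdx = w; }
            else second = min(second, v);
        }
        best = min(best, maxVal - minVal);
        const vector<int>& L = lists[minIdx];
        // binary-search-jump: last index whose value is still below the second-smallest value
        ptr[minIdx] = (lower_bound(L.begin() + ptr[minIdx] + 1, L.end(), second) - L.begin()) - 1;
        best = min(best, maxVal - L[ptr[minIdx]]);   // range at the jump target
        if (++ptr[minIdx] >= L.size()) break;        // the minimum's list is exhausted
    }
    return best;
}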
My attempt to prove that this algorithm cannot skip the optimal answer. Assume for contradiction that we never checked the minimal range starting at i; then the range i->k that we did check is larger than the optimal range i->k', where k' is still bigger than i (because we are speaking about ranges starting at i). But in that case we would never have moved from k' to k: both k' and k are bigger than i, and we only ever advance the smallest of the three pointers, so the pointer could not have left k' while i was still the minimum. This is a contradiction.
Pretty simple solution for n lists that don't have the same length.
Walkthrough with this example:
5->9->17
4->13->18->20
8->19->21->24
iteration #1: Pointers will point to 5,4,8 so:
minVal: 4, maxVal: 8, indexMinLeft: 1 (pointing at list starting with 4)
minDifference = 8-4=4
iteration #2: Pointers will point to 5,13,8 so:
minVal: 5, maxVal: 13, indexMinLeft: 0 (pointing at list starting with 5)
minDifference = 13-5=8
and so on and so on. The "indexMinLeft" comes into play when the list holding the smallest value is already at its last item. In that case we can't advance that list, so instead we take the smallest value among the lists that still have a next item and advance that list. Eventually every pointer points to the last element of its list, in which case we break out of the loop and return the answer :)
public static int leastDistance(Node<Integer>[] words) {
    int minDifference = Integer.MAX_VALUE;
    while (true) {
        int minVal = Integer.MAX_VALUE;   // smallest current value over all lists
        int minLeft = Integer.MAX_VALUE;  // smallest current value among lists that can still advance
        int maxVal = 0;                   // largest current value over all lists
        int indexMinLeft = -1;            // list holding minLeft
        for (int i = 0; i < words.length; i++) {
            if (words[i].value < minVal) {
                minVal = words[i].value;
            }
            if (words[i].next != null && words[i].value < minLeft) {
                minLeft = words[i].value;
                indexMinLeft = i;
            }
            if (words[i].value > maxVal) {
                maxVal = words[i].value;
            }
        }
        minDifference = Math.min(minDifference, maxVal - minVal);
        if (indexMinLeft == -1) break;    // every list is at its last element
        words[indexMinLeft] = words[indexMinLeft].next;
    }
    return minDifference;
}
You are repeatedly searching for the min and max across all n lists. Your solution could be optimized by maintaining a BST.
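For what it's worth, a minimal C++ sketch (mine, not the poster's code) of that suggestion: keep the current values in an ordered set so the min and max are available in O(log k) instead of a linear rescan. This is the standard variant that always advances the overall minimum and stops when its list is exhausted.

#include <climits>
#include <set>
#include <utility>
#include <vector>
using namespace std;

int leastDistanceBst(const vector<vector<int> >& positions) {  // positions[w] sorted, non-empty
    set<pair<int, int> > window;                 // (current value, word index), acts as the BST
    vector<size_t> nextIdx(positions.size(), 1); // next unread index per word
    for (size_t w = 0; w < positions.size(); ++w)
        window.insert(make_pair(positions[w][0], (int)w));
    int best = INT_MAX;
    while (true) {
        best = min(best, window.rbegin()->first - window.begin()->first);
        int w = window.begin()->second;          // word holding the smallest current value
        if (nextIdx[w] >= positions[w].size())
            return best;                         // can't advance the minimum any further
        window.erase(window.begin());
        window.insert(make_pair(positions[w][nextIdx[w]++], w));
    }
}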
// A has N rows; each row holds the positions of a different word, and size[i]
// is the number of positions for word i.
#include <limits>
#include <vector>
using namespace std;

double calculate_distance(const vector<vector<double> > &A, const vector<int> &index, int N);
bool anyIndexExceedsBounds(const vector<int> &index, const vector<int> &size);
bool isIndexValMinimum(int currIndex, const vector<int> &index, const vector<vector<double> > &A);

double MinimalDistance(const vector<vector<double> > &A, int N, const vector<int> &size)
{
    vector<int> index(N, 0);        // current position index for each word
    vector<int> opt_indexes(N, 0);  // indexes of the best selection found so far
    double min_value = numeric_limits<double>::infinity();
    double current_val;
    while (!anyIndexExceedsBounds(index, size))
    {
        current_val = calculate_distance(A, index, N);
        if (current_val < min_value)
        {
            min_value = current_val;
            for (int i = 0; i < N; i++)
                opt_indexes[i] = index[i];
        }
        // advance every index whose current value is the overall minimum
        for (int i = 0; i < N; i++)
        {
            if (isIndexValMinimum(i, index, A) && index[i] < size[i])
                index[i]++;
        }
    }
    return min_value;
}

double calculate_distance(const vector<vector<double> > &A, const vector<int> &index, int N)
{
    double max_val = -numeric_limits<double>::infinity();
    double min_val = numeric_limits<double>::infinity();
    for (int i = 0; i < N; i++)
    {
        if (A[i][index[i]] > max_val)
            max_val = A[i][index[i]];
        if (A[i][index[i]] < min_val)
            min_val = A[i][index[i]];
    }
    return max_val - min_val;
}

bool anyIndexExceedsBounds(const vector<int> &index, const vector<int> &size)
{
    for (size_t i = 0; i < size.size(); i++)
        if (index[i] >= size[i])
            return true;
    return false;
}

bool isIndexValMinimum(int currIndex, const vector<int> &index, const vector<vector<double> > &A)
{
    for (size_t i = 0; i < index.size(); i++)
        if ((int)i != currIndex && A[i][index[i]] < A[currIndex][index[currIndex]])
            return false;
    return true;
}
I can think of a shortest-path solution. Consider a graph whose vertices are arranged in rows: row r has Nr vertices, where Nr is the number of locations of the r-th word.
Vertices of each row are connected to vertices of the row immediately after it.
In other words, the j-th vertex of row i is connected to those vertices of row i + 1 whose positions come after the position of the j-th vertex of row i.
Add a super vertex at row 0 and another at row r + 1.
The super vertex at row 0 is connected to every vertex of row 1, and every vertex of row r is connected to the super vertex at row r + 1.
The total number of vertices equals the total number of given positions + 2.
The total number of edges is O(n^2) in the worst case, where n is the total number of positions.
The weight of an edge is the difference between the positions of its endpoints.
The edges leaving row 0 and entering row r + 1 have weight 0.
The shortest path from row 0 to row r + 1 can be found in O(n^2).
Example for above:
Vij : vertex at row i, column j (i = 1, 2, 3; j = 1, 2, 3)
Edge list: (start, end, weight)
(V11, V22, 13 - 5)
(V11, V23, 18 - 5)
(V12, V22, 13 - 9)
(V12, V23, 18 - 9)
(V13, V23, 18 - 17)
(V21, V31, 8 - 4)
(V21, V32, 19 - 4)
(V21, V33, 21 - 4)
(V22, V32, 19 - 13)
(V22, V33, 21 - 13)
(V23, V32, 19 - 18)
(V23, V33, 21 - 18)
Then add V0 and edges (V0, V11, 0), (V0, V12, 0), and (V0, V13, 0).
Also add V4 and edges (V31, V4, 0), (V32, V4, 0), (V33, V4, 0).
Finally, find the shortest path from V0 to V4. In this case, it will go through V13, V23, V32.
Overall complexity is quadratic in the total number of positions (regardless of the number of rows).
It might be possible to find a faster approach due to the limited structure of the graph. I am assuming Dijkstra or something similar is used.
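To make the construction concrete, here is a small C++ sketch (mine; names are made up) that relaxes the distances row by row. Because the graph is a layered DAG, a simple row-by-row relaxation does the job Dijkstra would do and matches the O(n^2) bound above; on the example it returns 2 via 17 -> 18 -> 19.

#include <climits>
#include <vector>
using namespace std;

int layeredShortestPath(const vector<vector<int> >& rows) {   // rows[i] holds the positions of word i
    // the super source reaches every vertex of the first row at distance 0
    vector<int> dist(rows[0].size(), 0);
    for (size_t r = 0; r + 1 < rows.size(); ++r) {
        vector<int> nextDist(rows[r + 1].size(), INT_MAX);
        for (size_t j = 0; j < rows[r].size(); ++j) {
            if (dist[j] == INT_MAX) continue;                 // unreachable vertex
            for (size_t t = 0; t < rows[r + 1].size(); ++t)
                if (rows[r + 1][t] > rows[r][j])              // edges only go to later positions
                    nextDist[t] = min(nextDist[t], dist[j] + rows[r + 1][t] - rows[r][j]);
        }
        dist = nextDist;
    }
    // every last-row vertex connects to the super sink with weight 0
    int best = INT_MAX;
    for (size_t j = 0; j < dist.size(); ++j) best = min(best, dist[j]);
    return best;   // INT_MAX means no strictly increasing selection exists
}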
struct Elem {
int array;
int index;
Elem() { array = 0; index = 0; };
Elem(int _array, int _index) : array(_array), index(_index) {}
};
vector<int> minDistPostion(vector<vector<int>>& wordsArray) {
int n = wordsArray.size();
int dist;
int min_dist;
vector<int> result(n, 0);
map<int, Elem> selection;
map<int, Elem> candidate;
for (int i = 0; i < n; i++) {
    result[i] = wordsArray[i][0];   // seed the result with the initial selection
    selection[wordsArray[i][0]] = Elem(i, 0);
    if (wordsArray[i].size() > 1) {
        candidate[wordsArray[i][1]] = Elem(i, 1);
    }
}
min_dist = selection.rbegin()->first - selection.begin()->first;
while (!candidate.empty()) {
Elem elem = candidate.begin()->second;
candidate.erase(candidate.begin());
int array = elem.array;
int index = elem.index;
if (index < wordsArray[array].size() - 1)
candidate[wordsArray[array][index + 1]] = Elem(array, index + 1);
int val_prev = wordsArray[array][index-1];
int val_curr = wordsArray[array][index];
selection.erase(val_prev);
selection[val_curr] = elem;
dist = selection.rbegin()->first - selection.begin()->first;
if (dist < min_dist) {
min_dist = dist;
Elem tmp;
for (auto num : selection) {
tmp = num.second;
result[tmp.array] = wordsArray[tmp.array][tmp.index];
}
}
}
return result;
}
C++ 11
struct OpenSolution {
set<int> foundSoFar;
bool contains(int i) {return foundSoFar.find(i) != foundSoFar.end();};
int startIndex;
};
int findMinDistance(vector<vector<int>> indexes) {
vector<OpenSolution> openSolutions;
int idx[3]= {0,0,0};
int min_length = INT_MAX;
while(idx[0] < indexes[0].size() ||
idx[1] < indexes[1].size() ||
idx[2] < indexes[2].size() ) {
int wordIndex = 0;
int min = INT_MAX;
for(int i = 0; i < 3; i++) {
if(idx[i] < indexes[i].size() && indexes[i][idx[i]] < min) {
min = indexes[i][idx[i]];
wordIndex = i;
}
}
bool createNewSolution = true;
for(int i = 0; i < openSolutions.size(); i++) {
auto &sol = openSolutions[i];
if(sol.contains(wordIndex)) {
if(sol.foundSoFar.size() == 1) {
sol.startIndex = min;
createNewSolution = false;
}
} else {
sol.foundSoFar.insert(wordIndex);
if(sol.foundSoFar.size() == 3) {
if(min - sol.startIndex < min_length) {
min_length = min - sol.startIndex;
}
openSolutions.erase(openSolutions.begin()+i);
i--;
}
}
}
if(createNewSolution) {
OpenSolution s;
s.startIndex = min;
s.foundSoFar.insert(wordIndex);
openSolutions.push_back(s);
}
idx[wordIndex]++;
}
return min_length;
}
The following code is implemented with a min-heap. I have tested it on some cases.
struct node
{
int val;
int idx;
int offset;
node(int _val = 0 , int _idx = -1, int _offset = -1):val(_val), idx(_idx),offset(_offset){}
};
struct cmp
{
    bool operator()(const node &lhs, const node &rhs) const
    {
        return lhs.val > rhs.val;   // min-heap ordered by val
    }
};
typedef pair<int,int> P;
P minInteval(const vector<vector<int> > &A)
{
const int m = A.size();
if(m < 2) return P(0,0);
vector<int> capa;
for(int i=0;i<m;++i) capa.push_back(A[i].size());
priority_queue<node,vector<node> ,cmp> pq;
int curMax = INT_MIN;
for(int i=0; i < m;++i )
{
pq.push(node(A[i][0],i,0));
curMax = max(curMax,A[i][0]);
}
int ans = curMax - pq.top().val;
int l = pq.top().val, r = curMax;
while(1)
{
node cur = pq.top();pq.pop();
int val = cur.val, idx = cur.idx, off = cur.offset;
if(curMax - val < ans)
{
ans = curMax-val;
l = val;
r = curMax;
}
if(off >= capa[idx]-1) return P(l,r);
int next = A[idx][off+1];
if(next > curMax) curMax = next;
node tmp(next,idx,off+1);
pq.push(tmp);
}
return P(l,r);
}
It looks like a search / optimization problem to me.
If I interpret it correctly, we need to find the shortest range in the sequence of words that contains all the words in our list.
We can use a best-first search strategy.
The first step in the search is to choose the first word, so you can think of the search tree as branching out to all the locations of the word we chose to be the first word.
We should probably choose the smallest array for this, as the cost of the remaining search is proportional to it. They are all the same size in this example, so our search front consists of 3 items.
Each search item consists of a range and the list of words that have been added so far; each node also has a score which is derived from the range. We could compute the score on the fly, but I am writing it out below. We could represent the list of included words by the index of the next word to be included, but I am going to write it out explicitly as well. The score is the length of the range: end - start.
(start = 5, end = 5, score = 0, words = job)
(start = 9, end = 9, score = 0, words = job)
(start = 17, end = 17, score = 0, words = job)
All the scores are 0 at this point.
The next step is to choose a second word. We could consider all combinations of the first word's appearances and the second word's appearances, but this would be pointless. Rather, we need only choose the two occurrences of the second word nearest the current word. Additionally, we can eliminate cases where expanding the range would include a second copy of the first word, since you would do better just using that second copy. You can construct these in O(log(n)) time for each search node: binary search for the position of the last added word in the next word's sorted location list and take the occurrence to its right and the occurrence to its left (and of course drop ranges that add an additional instance of the last word).
(start = 4, end = 5, score = 1, words = job, in)
(start = 5, end = 13, score = 8, words = job, in): using the first job with the second in is pointless, as you would do better using the second job with the second in
(start = 4, end = 9, score = 5, words = job, in): this is also dismissed for the same reason
(start =9, end = 13, score = 4, words = job, in)
(start = 13, end = 17, score = 4, words = job, in)
(start = 17, end = 18, score = 1, words = job, in)
Now we push all of these onto a min-heap (lowest score first).
Then we take the best one and try to expand it.
We take out (start = 4, end = 5, score = 1, words = job, in)
And expand it to the right and the left
Going left to 8 "crosses over" values in 'job' and 'in', so we dismiss it. Now that I think about it, the algorithm may be nearly as efficient without checking for this kind of thing: search nodes that overreach will get high scores and be ignored. On the other hand, having a good data structure for this might help; something like a set of merged lists would do the trick. Thinking even further, the cross-over rule should not be applied under many circumstances. Consider the string:
B A A C
The search node that comprises the B and the first A will be expanded to contain C, and in doing so will also include the second A, so maybe the rule only applies to the last word added and no special data structure is needed for this.
Expanding this search node gets us
(start = 4, end = 8, score = 4, words = job, in, google)
If this were a larger problem, we might want to purge the heap of entries with higher scores and set up a sentinel on the heap so that nothing with a score greater than 4 can get in. The former reduces the cost of inserting new entries but has some cost of its own. There may be an optimal policy that cleans the heap only occasionally, perhaps only when the best score drops by a significant fraction of the heap's score range. You might even consider keeping an integrated histogram, but that seems a bit excessive.
This first complete result is stored into the best result variable.
As there are search nodes that have better (lower) scores than this result, we pop another one off the heap and expand it.
(start = 17, end = 18, score = 1, words = job, in) -> (start = 17, end = 19, score = 2, words = job, in, google)
This is not only better than the existing best result (start = 4, end = 8, score = 4, words = job, in, google)
But this is better than anything in the priority queue assuming it was cleaned out when the first result came in. So we are done.
All this is a bit of overkill for the example of 3 sets of 3 words. I imagine the lists would be very much larger for Google Book Search, where you would have many hits within a 500-page manuscript and you choose books from amongst millions based on having a small clique of the search words.
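To make that concrete, here is a condensed C++ sketch (mine) of the search described above, without the cross-over pruning: a node is (score, range start, range end, number of words placed), words are added in a fixed order, and each expansion binary-searches the next word's sorted position list for the nearest occurrence on either side of the current range. Since adding a word can only grow the range, the first complete node popped from the min-heap is optimal.

#include <algorithm>
#include <climits>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>
using namespace std;

// node = (score, range start, range end, number of words already placed)
typedef tuple<int, int, int, int> Node;

int bestFirstMinRange(const vector<vector<int> >& pos) {   // pos[w] = sorted positions of word w
    priority_queue<Node, vector<Node>, greater<Node> > heap;
    for (size_t i = 0; i < pos[0].size(); ++i)
        heap.push(make_tuple(0, pos[0][i], pos[0][i], 1));
    while (!heap.empty()) {
        Node cur = heap.top(); heap.pop();
        int start = get<1>(cur), end = get<2>(cur), placed = get<3>(cur);
        if (placed == (int)pos.size())
            return get<0>(cur);                            // first complete node is optimal
        const vector<int>& occ = pos[placed];
        // nearest occurrence of the next word at or after the range start
        vector<int>::const_iterator it = lower_bound(occ.begin(), occ.end(), start);
        if (it != occ.end())
            heap.push(make_tuple(max(end, *it) - start, start, max(end, *it), placed + 1));
        // nearest occurrence of the next word at or before the range end
        it = upper_bound(occ.begin(), occ.end(), end);
        if (it != occ.begin())
            heap.push(make_tuple(end - min(start, *(it - 1)), min(start, *(it - 1)), end, placed + 1));
    }
    return INT_MAX;                                        // some word never occurs
}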
This problem can be transformed into another problem. Let's say every word has an ID of 1, 2, 3, ..., n, and we sort all the positions together with their IDs. With word IDs job=1, in=2, google=3 we get the sorted pairs (4,2), (5,1), (8,3), (9,1), (13,2), (17,1), (18,2), (19,3), (21,3). Now our goal is to find a minimum window that contains every ID from 1 to n and
minimizes maxNumber - minNumber. This is like the shortest-abstract problem. We use two pointers, left and right; whenever the window contains a full set of IDs, we check whether Max - Min got smaller. Since the pairs are sorted, Max is at right and Min is at left. So we have O(n log n) time complexity
and O(n) space complexity for tracking which IDs are in the window.
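A minimal C++ sketch of that two-pointer window (mine, assuming the merged, sorted (position, id) pairs described above): grow the right end until the window contains every id, then shrink from the left while it still does, recording position[right] - position[left] whenever the window is full.

#include <algorithm>
#include <climits>
#include <unordered_map>
#include <utility>
#include <vector>
using namespace std;

int minWindowOverPairs(vector<pair<int, int> > events, int numIds) {   // (position, word id) pairs
    sort(events.begin(), events.end());                 // sort by position
    unordered_map<int, int> cnt;                         // id -> occurrences inside the window
    int covered = 0, best = INT_MAX, left = 0;
    for (int right = 0; right < (int)events.size(); ++right) {
        if (cnt[events[right].second]++ == 0) ++covered;
        while (covered == numIds) {                      // window currently contains every id
            best = min(best, events[right].first - events[left].first);
            if (--cnt[events[left].second] == 0) --covered;
            ++left;
        }
    }
    return best;                                         // INT_MAX if some id never appears
}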
Generalizing the problem for arbitrary N, IMHO it looks like a max-flow min-cost problem. Let's take the example from the problem statement.
Let every position of every word be a vertex of the graph. We draw an edge from one word to another (different) word, set its capacity to 1, and set its cost to
|pos[i] - pos[j]|
. Say we connect "job" at 5 with "in" at 4; then we draw the edge "job" -> "in" with cost 1 and capacity 1. Add fictitious vertices for the source and sink. Now push a flow of 1 from source to sink. The algorithm will return the path with minimum cost, using every word only once.
//For simplicity let's assume that we have N words to track, each mapped to a unique id - 0, 1, 2, ...; this can be done with a simple map.
typedef pair<int, int> loc; //first - word_id, sec - location
int N = 3; //number of unique words to track
int M = 9; //total number of occurrences of N words.
//Calculating distance - O(0.5 * N^2) ~ O(N^2)
int GetDistance(const vector<int> &locs)
{
//locs.size() = N
//Check we have all words. Missing word -> loc = -1
for (int i=0; i<locs.size(); i++)
if (locs[i] == -1)
return -1;
//Calc distances and return max one.
int max_dist = 0;
for (int i=0; i<locs.size(); i++)
{
for (int j=i; j<locs.size(); j++)
{
int dist = abs(locs[i]-locs[j]);
max_dist = max(dist, max_dist);
}
}
return max_dist;
}
//Calculating least distance with complexity O(N*M + M*N^2) = O(M*N^2)
int LeastDistance()
{
//With a simple merge sort we can merge all tracked words' locations into one sorted array of pairs - O(N*M) time complexity.
loc input[] = {loc(1,4),loc(0, 5),loc(2,8),loc(0,9),loc(1,13),loc(0,17),loc(1,18),loc(2,19),loc(2,21)};
vector<int> last_occurances(N, -1); //last occurrences
vector<int> least_dist_occurances(N); //not required, just for tracking.
int min_dist = INT_MAX;
//O(M)
for (int i=0; i<M; i++)
{
last_occurances[input[i].first] = input[i].second;
int dist = GetDistance(last_occurances);
if (dist >= 0 && dist < min_dist)
{
min_dist = dist;
copy(last_occurances.begin(), last_occurances.end(), least_dist_occurances.begin());
}
}
return min_dist;
}
It's like the merge step of merge sort, with trackers for the last added position of each query word. Each iteration also updates the minimum distance between all words.
Complexity is O(nk), where n is the total number of input positions (potentially up to the number of words in the paragraph, but really the sum of the occurrence counts of the query words) and k is the number of query words.
It may be possible to make this O(n log k) by using a priority queue intelligently, e.g. to speed up the last min(...) statement.
import copy
D = {'b': [5, 9, 17],
'in': [4, 13, 18],
'google': [8, 19, 21]}
def min_distance(D):
    prev_added = dict([(k, None) for k in D])
    merge_cursors = dict([(k, 0) for k in D])
    min_state = (100000, None)
    while True:
        min_key, L = min(D.items(), key=lambda (k, v): v[merge_cursors[k]]
                         if merge_cursors[k] < len(v) else 100000)
        if merge_cursors[min_key] >= len(L):
            return min_state
        addval = L[merge_cursors[min_key]]
        prev_added[min_key] = addval
        merge_cursors[min_key] += 1
        min_token = min(prev_added.values())
        if min_token:
            min_state = min(min_state,
                            ((addval - min_token), copy.copy(prev_added)),
                            key=lambda x: x[0])
        print addval, prev_added, min_state

print min_distance(D)