Google Interview Question
SDE-1s | Country: United States
Interview Type: Phone Interview
" and do O(n logn) sorting..." or use quickselect for O(n) running time on average.
The last case: an exact solution is possible with a distributed system, using binary search. Worker nodes store sorted segments of the data; the master drives the binary search: at each iteration, given the current (lo, hi) boundaries, each node reports how many of its items fall within them.
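That counting interaction can be simulated in a few lines of Python (a single-process sketch; `segments` stands in for the per-node sorted shards, and the integer value range [lo, hi] is assumed to be known):

```python
import bisect

def percentile_via_counts(segments, k, lo, hi):
    """Return the k-th smallest value (1-indexed) across all segments,
    assuming integer values in [lo, hi].  Each 'node' only ever reports
    a count; no data is shipped to the master."""
    while lo < hi:
        mid = (lo + hi) // 2
        # Each node counts how many of its items are <= mid
        # (a binary search within its own sorted shard).
        count = sum(bisect.bisect_right(seg, mid) for seg in segments)
        if count >= k:
            hi = mid          # the k-th smallest is <= mid
        else:
            lo = mid + 1      # it must be larger than mid
    return lo
```

For the 90th percentile of n values, k would be ceil(0.9 * n).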
Agree with you on using quickselect.
For "distributed system": you assume you have enough external space, or "distributed space". Interesting solution, I never thought of.
I am not clear on how you store segments of sorted data on the slave nodes. Can you give a bit more detail?
Good solution and explanation. The people blubbering about wrong solutions such as a heap need to read this: you cannot discard any item based on the present distribution, because that number might be the required answer after future items change the distribution.
I don't see where the problem statement says that the memory limit can be compensated for by multiple machines.
You can use quicksort-style partitioning for that. Pick a random pivot and partition the numbers around it. Based on the final index of the pivot element you will know which partition the required number lies in. Then recurse into that partition, and so on. This is quickselect, with expected time complexity O(n), not O(log n), since each pass scans the remaining partition.
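A minimal quickselect sketch in Python (the partitions are copied here for clarity; an in-place version would swap elements around the pivot instead):

```python
import random

def quickselect(arr, k):
    """Return the k-th smallest element (1-indexed) of arr.
    Expected O(n) time: each pass scans a geometrically
    shrinking partition."""
    pivot = random.choice(arr)
    less = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    if k <= len(less):
        return quickselect(less, k)
    if k <= len(less) + len(equal):
        return pivot
    greater = [x for x in arr if x > pivot]
    return quickselect(greater, k - len(less) - len(equal))

# 90th percentile of n numbers: k = ceil(0.9 * n)
```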
The above assumes you have enough memory; otherwise you can use an external sort and get the 90th percentile.
That works for the first part of the question. The challenge is the second part which states there is not enough memory.
That's what I am saying: since we don't have enough memory, we have to do an external sort, which has time complexity O(n log n), and then find the 90th-percentile element.
No, the upper bound is not the problem. With limited memory we can shrink the stream by discarding the part we know we don't need.
I can get the middle element by going through the numbers, make it the pivot, partition as in quicksort, and get rid of the half that cannot contain the answer; each time we keep track of how many numbers we threw away.
When the amount of remaining numbers is small enough, follow your algorithm.
Quick select seems to be the appropriate algorithm for the first part.
And for the follow-up, we can scale up quickselect just the way we do with merge sort to get External Merge Sort. Let's call this External Quick Select. Following is an implementation in C++. I have taken the stream from a file named "numlist". I have also assumed that the numbers fit in a long long unsigned int.
#include <iostream>
#include <fstream>
#include <sstream>
#include <cstdio>   // remove()
#include <cstdlib>  // rand(), srand()
#include <ctime>    // time()
using namespace std;

typedef unsigned long long int llu;

// Temporary partition files get pseudo-random names. Note that
// rand()%100000 can collide; a production version should use
// something like mkstemp() instead.
string get_filename(){
    stringstream ss;
    ss << rand() % 100000;
    return ss.str();
}

// Returns the k-th smallest number (1-indexed) in the file behind fptr.
// Works like quickselect, except each partition is streamed out to its
// own temporary file instead of being held in memory.
llu external_quick_select(fstream &fptr, llu k, string fname){
    llu pivot, num;
    fptr >> pivot;  // the file's first number serves as the pivot
    llu left_count = 0, center_count = 0;

    string left_name = get_filename();
    fstream left(left_name.c_str(), fstream::in | fstream::out | fstream::trunc);
    string right_name = get_filename();
    fstream right(right_name.c_str(), fstream::in | fstream::out | fstream::trunc);

    // One pass: split the remaining numbers around the pivot.
    while(fptr >> num){
        if(num < pivot){
            ++left_count;
            left << num << '\n';
        }else if(num > pivot){
            right << num << '\n';
        }else{
            ++center_count;  // duplicates of the pivot need no file
        }
    }

    fptr.close();
    if(fname != "numlist") remove(fname.c_str());  // keep the original input

    if(k <= left_count){
        // The answer lies in the left partition.
        right.close();
        remove(right_name.c_str());
        left.close();
        left.open(left_name.c_str());
        return external_quick_select(left, k, left_name);
    }else if(k > left_count + center_count + 1){
        // The answer lies in the right partition; adjust the rank.
        left.close();
        remove(left_name.c_str());
        right.close();
        right.open(right_name.c_str());
        return external_quick_select(right, k - left_count - center_count - 1, right_name);
    }else{
        // Ranks left_count+1 .. left_count+center_count+1 are all the pivot.
        left.close();
        remove(left_name.c_str());
        right.close();
        remove(right_name.c_str());
        return pivot;
    }
}

int main(){
    srand(time(NULL));
    fstream fptr("numlist", fstream::in);
    llu k;
    cin >> k;
    cout << external_quick_select(fptr, k, "numlist") << endl;
    return 0;
}
I like the first answer best. Let me see if I can elaborate on it and enhance it a little.
Create a heap of size 10^6/10 = 10^5; this goes into an in-place array.
Actually we might do well to have a 10^5 + 1 sized heap.
This heap keeps the smallest element at the top (a min-heap).
Read 10^5 elements off the disk into the heap.
Now peek at the top element of the heap; call this the cutoff.
Loop start: Now for the rest of the elements.
If it is below the cutoff, go to loop start, because that number is not in the top 10% at this point, and it will only get harder to get into the top 10%.
Otherwise, if the new value off the disk is larger than the smallest one in the heap, put it into the 10^5+1 position in the heap and let it bubble up, for a cost of about log2(10^5)/2 comparisons on average, I would guess.
Now remove the element at the top of the heap and let things bubble for another log2(10^5)/2 operations.
Update the cutoff by peeking at the now-updated top of the heap.
For a random file: as you progress through the file, the probability of a new value making it into the ever more exclusive club of the top 10^5 keeps shrinking, so fewer and fewer values have to go through the log2(10^5) heap insert + delete. (Anyone care to compute this?)
In the worst case of a file sorted in ascending order, every value will in fact have to go through the heap.
You can think of it as a brainiacs' club with membership set at 10% of the population. As each person takes the IQ test (a very precise one), they can get into the club. If the club is full, the person with the lowest IQ is kicked out (how humiliating) and the new person gets in.
If you allow duplicates then you will have to define a policy for what to do when duplicates are found at 10^5 and 10^5+1 and then modify things appropriately.
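The loop described above can be sketched with Python's heapq (assuming the stream length, and hence k = 10^5, is known in advance; duplicates are handled here by a simple greater-than comparison):

```python
import heapq

def top_k_min(stream, k):
    """Track the k largest items seen so far in a min-heap of size k;
    the heap's root is the running cutoff.  With k = 10% of the known
    stream length, the final root is the 90th-percentile element."""
    heap = []
    for num in stream:
        if len(heap) < k:
            heapq.heappush(heap, num)
        elif num > heap[0]:
            # Beats the cutoff: evict the current minimum in O(log k).
            heapq.heapreplace(heap, num)
        # else: below the cutoff, skip (it can never re-enter the top k)
    return heap[0]
```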
If you do not even have the memory for the 10% (10^5) numbers then you will have to run through the file a few times.
Say you had enough memory for 10^5/2
First run the process as before.
This will give you the minimum value needed to be in the top 5%.
Now repeat the process, filter out the top 5%, and just use your array to compute the next 5%.
If you had a really fast disk then you could use a process of subdivision and only use a few bytes of memory.
Choose a number and count how many are above it and how many below, then adjust your guess and repeat. This would cost log2(10^6) ≈ 20 file reads (of the whole file each time), but you would not need much memory.
If you had a bit more memory say enough for say 10^4 numbers you could set up a bunch of buckets representing a count for a particular range.
Then for each element you get off the disk you increment one of the buckets if it falls within its range.
This will give you a lot more information for each run through the file. Then you can set the cut off values (things that are obviously in the lower 90% and things that are obviously in the top 10%) and reassign your buckets to the range in between and repeat.
If you had some expectations about the distribution you might even be able to come up with a better initial bucket distribution.
Finally when you get to within 10x of your memory limit run the heap algorithm with your limits set to the range where you know the break point to be.
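One narrowing pass of the bucket idea might look like this in Python (`read_stream` is a hypothetical callable that re-opens the stream; equal-width integer buckets are assumed):

```python
def narrow_with_buckets(read_stream, k, lo, hi, nbuckets=10**4):
    """One pass over the stream: histogram the integer values in
    [lo, hi] into equal-width buckets, then return the narrowed
    (bucket_lo, bucket_hi) range that must contain the k-th smallest
    value overall, plus the count of values below that range."""
    width = (hi - lo + nbuckets) // nbuckets  # ceil division, >= 1
    counts = [0] * nbuckets
    below = 0
    for num in read_stream():
        if num < lo:
            below += 1
        elif num <= hi:
            counts[min((num - lo) // width, nbuckets - 1)] += 1
    running = below
    for i, c in enumerate(counts):
        if running + c >= k:
            return lo + i * width, min(lo + (i + 1) * width - 1, hi), running
        running += c
    raise ValueError("fewer than k values in range")
```

Repeat with the narrowed range until it fits in memory, then finish with the heap (or a plain sort) on just that slice.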
I would use a min-Heap as follows:
1- Read a number.
2- Store it in the heap.
3- If heap.size() > 10^5, extract-Min() (drop the current minimum).
At the end we have exactly the 10^5 largest elements; the heap's minimum is the 90th percentile.
Assuming that we can read the stream multiple times, there is an O(1)-memory solution using binary search on the value range: about 20 full passes, i.e., O(n log(range)) time.
Initialize low = 0, high = 10^6
Repeat (around 20 times)
1- Let mid = (low + high) / 2.
2- Read all numbers and count how many are above mid.
3- Is count > targeted count (i.e., 10^5)?
YES- low = mid;
NO- if equal, return mid; otherwise, high = mid.
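A sketch of this rescanning binary search in Python (`read_stream` is a hypothetical callable that restarts the stream; it counts items <= mid, a slight variant of counting those above, which makes the invariant easier to state):

```python
def kth_by_rescanning(read_stream, k, lo, hi):
    """O(1) memory: binary search on the value, re-reading the whole
    stream once per step.  Returns the smallest value whose rank is
    >= k, i.e. the k-th smallest (1-indexed).  Assumes integer values
    in [lo, hi]."""
    while lo < hi:
        mid = (lo + hi) // 2
        count = sum(1 for num in read_stream() if num <= mid)
        if count >= k:
            hi = mid
        else:
            lo = mid + 1
    return lo
```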
def top_90(lst):
    top = []
    for i, num in enumerate(lst):
        expected_size = int((i + 1) / 10.0)
        if expected_size == len(top) and top:
            top.pop(0)
        if expected_size > 0:
            top.append(num)
    return top
If you use a queue for the top variable, this should be optimal
Number are between 1 and 10^6 = 1000000
I would use a map to store the number of occurrences of each number.
Then it would be easy to compute the 90th percentile.
Apart from the above,
I would suggest storing each number digit by digit in an ordered linked list, and then performing the relevant tasks.
E.g.:
10000000000
1->0->0->0->0->0->0....
This does not exceed the memory.
The followup question is hard.
Let's re-formulate the question as follows.
The hard part is that the answer (the p-th percentile) depends on the whole sequence of numbers and may come at any point. Since you don't know when it comes, if you throw it away (i.e., don't store any information about it), you never get it back.
- ninhnnsoc October 19, 2014
Thus, to get an exact answer, we need to store some information about each number.
In case k is small, like k = 10^6, and n >> k:
We can use O(k) memory to store the occurrence count of each number, as in counting sort. Using these counts, we can find the p-th percentile once the stream is finished. We need to know the range k beforehand, but not n, the number of numbers in the stream.
This solution takes O(n) time, O(k) space.
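A minimal Python sketch of this counting approach (using collections.Counter in place of a raw O(k) array; `percentile_by_counting` is a hypothetical helper name):

```python
from collections import Counter

def percentile_by_counting(stream, p, k_max):
    """O(k) space: tally occurrences of each value in [0, k_max]
    (counting-sort style), then walk the tally up to the p-th
    percentile rank.  n need not be known up front; it falls out
    of the tally."""
    counts = Counter()
    n = 0
    for num in stream:
        counts[num] += 1
        n += 1
    if n == 0:
        raise ValueError("empty stream")
    target = max(1, (p * n + 99) // 100)  # ceil(p*n/100), 1-indexed rank
    running = 0
    for value in range(k_max + 1):
        running += counts[value]
        if running >= target:
            return value
```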
If n << k and we have O(n) memory: just store all numbers and do O(n log n) sorting... Or, if we know n beforehand, we can store just min{p, 100-p}/100 * n numbers in a heap. (EDITED: a better time complexity is to store all numbers and run quickselect in O(n) time -- thanks to norbert.madarasz.)
For the follow-up question, when we don't know k, don't know n, and don't have enough memory anyway: I think we can't find the exact answer, only an approximate one.
If we know the distribution of numbers in the stream, we can have some good approximation. For example, if we know numbers are equally distributed, reservoir sampling may be a good solution.
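A reservoir-sampling sketch in Python (Algorithm R; `sample_size` is a tunable assumption, and the result is approximate by design):

```python
import random

def approx_percentile(stream, p, sample_size=1000):
    """Keep a uniform random sample of fixed size (reservoir sampling),
    then read the p-th percentile off the sorted sample.  This is an
    approximation only; accuracy improves with sample_size."""
    reservoir = []
    for i, num in enumerate(stream):
        if i < sample_size:
            reservoir.append(num)
        else:
            j = random.randint(0, i)   # keep item with prob size/(i+1)
            if j < sample_size:
                reservoir[j] = num
    reservoir.sort()
    idx = min(len(reservoir) - 1, (p * len(reservoir)) // 100)
    return reservoir[idx]
```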
Other approximation algorithms?