Google Interview Question for SDE1s

Country: United States
Interview Type: Phone Interview


The follow-up question is hard.
Let's reformulate the question as follows:

"Find the p-th percentile of a stream of n numbers, where each number is in the range 1 to k."
Note that "stream" means you can read each input number only once.

The difficulty is that the answer (the p-th percentile) depends on the whole sequence of numbers and may arrive at any point. Since you don't know when it comes, if you throw it away (i.e., don't store any information about it), you can never get it back.

Thus, to get an exact answer, we need to store some information about each number.

In case k is small, like k = 10^6, and n >> k:
We can use O(k) memory to store the occurrence count of each number, as in Counting Sort. Using these counts, we can find the p-th percentile once the stream is exhausted. We need to know the range k beforehand, but not n, the count of numbers in the stream.
This solution takes O(n) time and O(k) space.
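A minimal sketch of this counting approach (the function name and the rank-rounding convention are mine):

```python
def percentile_from_counts(stream, k, p):
    """p-th percentile of a stream of integers in 1..k, using O(k) counts."""
    counts = [0] * (k + 1)            # counts[v] = occurrences of v
    n = 0
    for v in stream:                  # single pass over the stream
        counts[v] += 1
        n += 1
    target = max(1, (p * n + 99) // 100)   # rank of the percentile, rounded up
    seen = 0
    for v in range(1, k + 1):
        seen += counts[v]
        if seen >= target:            # cumulative count reaches the target rank
            return v
```

Note that n never needs to be known in advance; it is simply accumulated while reading.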

If n << k, and we have O(n) memory: just store all the numbers and do an O(n log n) sort... Or, if we know n beforehand, we can store just min{p, (100-p)}/100 * n numbers in a heap. (EDITED: A better solution is to store all the numbers and run quickselect in O(n) time. -- thanks to norbert.madarasz)
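The quickselect mentioned in the edit can be sketched as follows (names mine; this simple functional version builds sublists, whereas an in-place partition would avoid the extra allocations):

```python
import random

def quickselect(a, r):
    """Return the r-th smallest element of a (1-based), expected O(n) time."""
    pivot = random.choice(a)
    less = [x for x in a if x < pivot]
    equal_count = sum(1 for x in a if x == pivot)
    if r <= len(less):
        return quickselect(less, r)            # answer is left of the pivot
    if r <= len(less) + equal_count:
        return pivot                           # answer is the pivot itself
    return quickselect([x for x in a if x > pivot],
                       r - len(less) - equal_count)
```

The p-th percentile of n stored numbers is then quickselect at rank roughly p/100 * n.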


For the follow-up question, when we don't know k, don't know n, and don't have enough memory anyway: I think we can't find the exact answer, only an approximate one.

If we know the distribution of numbers in the stream, we can get a good approximation. For example, if we know the numbers are uniformly distributed, reservoir sampling may be a good solution.
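A minimal sketch of reservoir sampling (function name mine): it maintains a uniform random sample of fixed size from a stream of unknown length, from which a percentile can then be estimated.

```python
import random

def reservoir_sample(stream, m):
    """Uniform random sample of m items from a stream of unknown length."""
    sample = []
    for i, x in enumerate(stream):
        if i < m:
            sample.append(x)              # fill the reservoir first
        else:
            j = random.randrange(i + 1)   # keep x with probability m/(i+1)
            if j < m:
                sample[j] = x             # evict a random current member
    return sample
```

The p-th percentile of the stream is then approximated by the p-th percentile of the sorted sample.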


Other approximation algorithms?

- ninhnnsoc October 19, 2014 | Flag Reply

"and do O(n log n) sorting..." or use quickselect for O(n) running time on average.

The last case: an exact solution is possible with a distributed system, using binary search. Slaves store sorted segments of the data; the master drives a binary search: at each iteration, for a given (lo, hi) boundary, the nodes report the count of items falling within the boundaries.
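A single-process sketch of this master/slave scheme (all names mine; in-memory sorted lists stand in for the slaves' sorted segments, and the count query is a local binary search on each segment):

```python
import bisect

def distributed_rank_search(nodes, r, lo, hi):
    """Find the r-th smallest value overall; each 'node' holds a sorted segment.
    The master binary-searches on the value; per iteration, every node reports
    how many of its items are <= mid."""
    while lo < hi:
        mid = (lo + hi) // 2
        count = sum(bisect.bisect_right(seg, mid) for seg in nodes)
        if count >= r:
            hi = mid        # the r-th smallest is <= mid
        else:
            lo = mid + 1
    return lo
```

Each iteration costs one count query per node; the loop runs about log2(hi - lo) times.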

- madno October 19, 2014 | Flag

Agree with you on using quickselect.

For the "distributed system": you assume you have enough external space, or "distributed space". An interesting solution I hadn't thought of.

I am not clear on how you store segments of sorted data on the slave nodes. Can you give a bit more detail?

- ninhnnsoc October 20, 2014 | Flag

Good solution and explanation. The people blubbering about wrong solutions such as a heap need to read this: you cannot discard any item based on the present distribution, because that item might become the required answer after future items change the distribution.

- gen-y-s October 21, 2014 | Flag

I don't see where the problem statement says that the memory limit can be compensated for by multiple machines.

- CT October 22, 2014 | Flag

When the data doesn't fit into memory, you could use an external merge sort and still find the exact solution instead of an approximation.

- damluar July 06, 2015 | Flag

Can we use a min-heap to keep the top 10 percent?

- Victor October 16, 2014 | Flag Reply

I would think so. Please provide your algorithm.

- addy October 17, 2014 | Flag

Don't think so. If a bunch of numbers smaller than your current 90th percentile are added, the 90th percentile would decrease, but you wouldn't know what value to decrease it to.

- Tom October 25, 2014 | Flag

You can use quicksort-style partitioning for that. Pick a random pivot and partition your numbers. Based on the index of the pivot element, you will know which partition the required number lies in. Then you can recurse on that partition, and so on. This will have time complexity O(log n).

The above assumes you have enough memory; otherwise you can use external sort and get the 90th percentile.

- mikewhity October 17, 2014 | Flag Reply

That works for the first part of the question. The challenge is the second part which states there is not enough memory.

- addy October 17, 2014 | Flag

That's what I am saying. Since we don't have enough memory, we have to do an external sort, which has time complexity O(n log n), and then find the 90th-percentile element.

- mikewhity October 17, 2014 | Flag

No, an unknown upper bound is not a problem; with limited memory we can try to reduce the stream by getting rid of the unwanted part.
I can get the middle element by going through the numbers, then make it the pivot and partition, discarding the first half of the numbers. Each time, we keep track of how many numbers we threw away.
When the amount of remaining numbers is small enough, follow your algorithm.

- rahman.kdd October 18, 2014 | Flag

It is not O(log n) but O(n log n).

- mbriskar May 25, 2015 | Flag

Quickselect seems to be the appropriate algorithm for the first part.

For the follow-up, we can scale up quickselect just the way we scale up merge sort to get external merge sort. Let's call this "external quickselect". Following is an implementation in C++. I have taken the stream from a file named "numlist". I have also assumed that the numbers fit in the long long unsigned int data type.

#include<iostream>
#include<cstring>
#include<cstdio>
#include<deque>
#include<queue>
#include<utility>
#include<vector>
#include<climits>
#include<algorithm>
#include<stack>
#include<fstream>
#include<ctime>
#include<sstream>
using namespace std;
bool debug=false;
typedef long long int lld;
typedef unsigned long long int llu;
// Random scratch-file name for a partition (collisions possible; fine for a sketch).
string get_filename(){
	stringstream ss;
	ss<<rand()%100000;
	return ss.str();
}
// Return the k-th smallest (1-based) number in the file behind fptr.
llu external_quick_select(fstream &fptr , llu k , string fname){
	llu pivot , num;
	fptr>>pivot;	// the file's first number serves as the pivot
	llu left_count=0 , right_count=0 , center_count=0;
	
	fstream left;
	string left_name = get_filename();
	left.open(left_name , fstream::in | fstream::out | fstream::trunc);
	
	fstream right;
	string right_name = get_filename();
	right.open(right_name , fstream::in | fstream::out | fstream::trunc);

	while(fptr>>num){
		if(num<pivot){
			++left_count;
			left<<num<<endl;
		}else if(num>pivot){
			++right_count;
			right<<num<<endl;
		}else{
			++center_count;
		}
	}
	
	if(k<=left_count){
		fptr.close();
		if(fname!="numlist")remove(fname.c_str());
		
		right.close();
		remove(right_name.c_str());
		
		left.close();
		left.open(left_name);
		
		return external_quick_select(left , k , left_name);
	}else if(k>left_count+center_count+1){
		fptr.close();
		if(fname!="numlist")remove(fname.c_str());
		
		left.close();
		remove(left_name.c_str());
		
		right.close();
		right.open(right_name);
		
		return external_quick_select(right , k-left_count-center_count-1,right_name);
	}else{
		fptr.close();
		if(fname!="numlist")remove(fname.c_str());
		
		left.close();
		remove(left_name.c_str());
		
		right.close();
		remove(right_name.c_str());
		
		return pivot;
	}
}
int main(int argc , char **argv)
{
	if(argc>1 && strcmp(argv[1],"DEBUG")==0) debug=true;
	srand(time(NULL));
	fstream fptr;
	fptr.open("numlist");
	llu k;
	cin>>k;	// 1-based rank of the element to select
	
	cout<<external_quick_select(fptr,k,"numlist")<<endl;
	
	return 0;
}

- anantpushkar009 October 26, 2014 | Flag Reply

I like the first answer best. Let me see if I can elaborate on it and enhance it a little.

Create a heap of size 10^6/10 = 10^5; this goes into an in-place array.

Actually, we might do well to have a heap sized 10^5 + 1.

This heap will keep the smallest element at the top.

Read 10^5 elements off the disk into the heap.

Now peek at the top element of the heap; call this the cutoff.

Loop start: now for the rest of the elements.

If the element is below the cutoff, go to loop start, because that number is not in the top 10% at this point, and it will only get harder to get into the top 10%.

Otherwise, if the new value off the disk is better than the smallest one in the heap, put it into the 10^5+1 position in the heap and let it bubble up, for a cost of log2(10^5)/2 (on average, I would guess).

Now remove the element at the top of the heap and let things bubble for another log2(10^5)/2 operations.

Update the cutoff by peeking at the now-updated top of the heap.

For a random file: as you progress through the file, the probability of a new element making it into the ever more exclusive club of the top 10^5 keeps dropping, so fewer and fewer values from the disk go through the log2(10^5) cost of the heap insert + delete. (Anyone care to compute this?)

In the worst case of a sorted file, you will in fact have to do this for every element.

You can think of it as a brainiacs' club with membership capped at 10% of the population. As each person takes the IQ test (a very precise one), they can get into the club. If the club is full, then the person with the lowest IQ is kicked out (how humiliating) and the new person gets in.

If you allow duplicates, then you will have to define a policy for what to do when duplicates are found at positions 10^5 and 10^5+1, and modify things appropriately.
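The heap loop described above can be sketched as follows (names mine; heapreplace merges the "insert at position 10^5+1, then delete the top" step into a single rebalance):

```python
import heapq

def top_fraction_heap(stream, capacity):
    """Keep the largest `capacity` items seen so far in a min-heap.
    The heap root is the running cutoff for membership in the top set."""
    heap = []
    for x in stream:
        if len(heap) < capacity:
            heapq.heappush(heap, x)        # still filling the club
        elif x > heap[0]:                  # beats the current cutoff
            heapq.heapreplace(heap, x)     # evict the min and admit x in one step
    return heap
```

With n = 10^6 elements and capacity = 10^5, the root of the returned heap is the 90th-percentile cutoff of the data seen so far.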

If you do not even have the memory for the 10% (10^5 numbers), then you will have to run through the file a few times.

Say you had enough memory for 10^5/2 numbers.

First, run the process as before.

This will give you the minimum value needed to be in the top 5%.

Now repeat the process, filter out the top 5%, and use your array to compute the next 5%.

If you had a really fast disk, you could use a process of subdivision and use only a few bytes of memory.

Choose a number and count how many elements are above it and how many below, then adjust your split point and repeat. This would cost log2(10^6) file reads (of the whole file each time), but you would not need much memory.

If you had a bit more memory, say enough for 10^4 numbers, you could set up a bunch of buckets, each holding a count for a particular range.

Then, for each element you get off the disk, you increment the bucket whose range it falls within.

This gives you a lot more information per run through the file. Then you can set the cutoff values (things that are obviously in the lower 90% and things that are obviously in the top 10%), reassign your buckets to the range in between, and repeat.

If you had some expectations about the distribution, you might even be able to come up with a better initial bucket layout.

Finally, when you get to within 10x of your memory limit, run the heap algorithm with your limits set to the range where you know the break point to be.
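One pass of the bucket-counting refinement might look like this (names and the return convention are mine; read_stream() re-reads the data each call, standing in for a pass over the disk file):

```python
def narrow_with_buckets(read_stream, lo, hi, target_rank, nbuckets):
    """One pass: histogram of nbuckets equal-width ranges over [lo, hi].
    Returns the sub-range containing the element of rank target_rank,
    plus the count of elements strictly below that sub-range."""
    width = max(1, (hi - lo + 1 + nbuckets - 1) // nbuckets)
    counts = [0] * nbuckets
    below = 0
    for x in read_stream():
        if x < lo:
            below += 1                     # already known to be under the range
        elif x <= hi:
            counts[min((x - lo) // width, nbuckets - 1)] += 1
    for b, c in enumerate(counts):
        if below + c >= target_rank:       # the target rank falls in bucket b
            return lo + b * width, min(lo + (b + 1) * width - 1, hi), below
        below += c
```

Each pass shrinks the candidate range by a factor of about nbuckets, so only a few passes are needed before the range fits in memory.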

- Dr A.D.M. October 17, 2014 | Flag Reply

I would use a min-heap as follows:

1- Read a number.
2- Store it in the heap.
3- If heap.size() > 10^5, extract-min() (drop the current min).

At the end we have exactly the 10^5 largest elements.

Assuming that we can read the stream multiple times, there is an O(1)-memory, O(n log n)-time solution using binary search:

Initialize low = 0, high = 10^6.
Repeat (around 20 times):
1- Let mid = (low + high) / 2.
2- Read all the numbers and count how many are above mid.
3- Is the count greater than the targeted count (i.e., 10^5)?
YES- low = mid;
NO- if equal, return mid; otherwise, high = mid.
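A sketch of this multi-pass counting search (names mine; it counts items <= mid and narrows toward the smallest value reaching the target rank, which sidesteps the tie-handling in step 3):

```python
def percentile_by_passes(read_stream, lo, hi, r):
    """Binary-search the value range; each iteration re-reads the stream and
    counts items <= mid, using O(1) extra memory and ~log2(hi-lo) passes."""
    while lo < hi:
        mid = (lo + hi) // 2
        count = sum(1 for x in read_stream() if x <= mid)
        if count >= r:
            hi = mid          # the rank-r element is <= mid
        else:
            lo = mid + 1
    return lo
```

For the 90th percentile, r is 90% of the stream length (which one extra counting pass can determine if n is unknown).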

- Ehsan October 17, 2014 | Flag Reply

I think you are making an assumption that there are 10^6 numbers. The problem does not state this; it states the numbers are in the range from 1 to 10^6. If you assume each number shows up exactly once, then the problem is trivial (the answer is 9*10^5).

- Seitz.Will October 17, 2014 | Flag

{{{
def top_90(lst):
    top = []
    for i, num in enumerate(lst):
        expected_size = int((i + 1) / 10.0)
        if expected_size == len(top) and top:
            top.pop(0)
        if expected_size > 0:
            top.append(num)
    return top
}}}

If you use a queue for the top variable, this should be optimal

- Anonymous October 18, 2014 | Flag Reply

The idea is basically the same as reservoir sampling

- Anonymous October 18, 2014 | Flag

@Anon: Reservoir sampling works only if the data is uniformly distributed. That was not mentioned.

- addy October 18, 2014 | Flag

You can do it in O(n):
1) Use select() to get the k-th ranked element -- this is O(n). select = find the median of medians (recursively) -> use it as a pivot to eliminate a large set of values -> recurse to the k-th ranked element.
2) You know what k has to be -- just plug it in and it works out.

- jovi October 18, 2014 | Flag Reply

Numbers are between 1 and 10^6 = 1000000.
I would use a map to store the number of occurrences of each number.
Then it would be easy to compute the 90th percentile.

- Guillaume October 19, 2014 | Flag Reply

A map does internal sorting if it is an ordered map, and it is not as efficient as an unordered map.

- jovi October 19, 2014 | Flag

Apart from the above,
I would suggest storing each number digit by digit in an ordered linked list... and then performing the relevant tasks.
e.g.:
10000000000
1->0->0->0->0->0->0...
This does not exceed the memory.

- Johnson October 20, 2014 | Flag Reply

Also, it should have a header node... for the n numbers.

- Johnson October 20, 2014 | Flag

For the second part, we can keep a count of every number in an array (assuming the memory is big enough to hold 10^6 counts). Then we can find the 90th percentile by using these counts.

- praveen November 02, 2014 | Flag Reply

How about using a prefix tree (trie) to store the numbers and then walking this trie to locate the 90th-percentile number?

- andy March 15, 2015 | Flag Reply

