Google Interview Question
Software Engineer / Developer
Country: United States
I don't quite understand how to use two heaps to find the median. Can you explain it in detail for me? Thank you.
I have the same question and will watch the thread... So far the reply by Asaf Nachmany seems to be the most accurate.
The median is the middle element of a sorted array of numbers.
In other words, if you partition the sorted array into two sets of equal size, the elements at the border between the sets determine the median.
Now, when an element arrives, we insert it into the left set or the right set in such a way that both sets stay sorted and keep an equal number of elements.
When an element less than the max of the left set arrives, it goes into the left set; when an element greater than the min of the right set arrives, it goes into the right set. To keep the sets equal in size, the max element of the left set may move across the border into the right set, or the min element of the right set may move across the border into the left set.
So we need an efficient way to insert an element into a sorted collection and to extract its min or max.
A heap qualifies for exactly that purpose.
The left set is modeled with a max-heap and the right set with a min-heap.
MaxHeap max; // stores the left (smaller) half
MinHeap min; // stores the right (larger) half
Array array;

max.Insert( array.GetNext() );
while( array.HasNext() )
{
    int data = array.GetNext();
    // every element goes into exactly one heap
    if( data <= max.GetMax() )
        max.Insert( data );
    else
        min.Insert( data );

    // rebalance so the heap sizes differ by at most one
    int num_left = max.GetNumElements() - min.GetNumElements();
    if( num_left < -1 )
        max.Insert( min.PopMin() );
    if( num_left > 1 )
        min.Insert( max.PopMax() );
}

int num_left = max.GetNumElements() - min.GetNumElements();
if( num_left < 0 )
    median = min.GetMin();
else if( num_left > 0 )
    median = max.GetMax();
else
    median = ( max.GetMax() + min.GetMin() ) / 2; // even count: average the two middles
@Arun: "Median is the middle element in the sorted array of numbers" - having a sorted array is quite a weak assumption.
See my answer that begins with "Let's assume memory and disk are both constrained...". It elaborates on the dual-heap approach in the context of large datasets.
Ok, we have an O(1) check, but we do an insert for every element at O(log(n)), so the overall complexity is O(n*log(n)), which equals the basic quicksort solution. Where am I wrong? Sorry for the stupid question.
Ok guys, all problems resolved. I think I finally figured out what Abhi actually meant...
Basically there are two heaps and one integer holding the median value:
if data > median
    insert into minheap
else
    insert into maxheap
if abs(sizediff) > 1
    adjust: pop an element from the larger heap into the median variable and push the erstwhile median onto the other heap, thus balancing the heaps...
The actual median can be obtained from the tops of the heaps plus the median variable, because we may need to average the two middle values when the total size is even.
Here's the actual code...
int median()
{
    int median;
    ifstream fin("inp.txt");
    MinHeap M;   // holds values >= the running median
    MaxHeap N;   // holds values <  the running median
    int temp, sizediff;
    fin >> temp;
    median = temp;
    while( fin >> temp )   // avoids the classic !fin.eof() extra-read bug
    {
        if( temp >= median )
            M.insert(temp);
        else
            N.insert(temp);
        sizediff = M.size() - N.size();
        if( sizediff > 1 )
        {
            temp = median;
            median = M.pop();
            N.insert(temp);
        }
        else if( sizediff < -1 )
        {
            temp = median;
            median = N.pop();
            M.insert(temp);
        }
    }
    sizediff = M.size() - N.size();
    if( sizediff == 1 )
        return (M.top() + median) / 2;   // even count: average of the two middles
    else if( sizediff == -1 )
        return (N.top() + median) / 2;
    else
        return median;
}
Sometimes, an interviewer will ask you to describe an algorithm to identify the kth
smallest element in an array of n elements. To do this, you select a random pivot
and partition the array as you would in the quicksort algorithm. Then, based on the
index of the pivot element, you know which half of the array the desired element lies
in. For example, say k = 15 and n = 30, and after you select your pivot and partition
the array, the first half has 10 elements (the half before the pivot). You know that
the desired element is the 4th smallest element in the larger half. To identify the
element, you partition the second half of the array and continue recursively. The
reason that this is not O(n log n) is that the recursive partition call is only on one
half of the array, so the expected running time is n + (n/2) + (n/4) + (n/8) + ... =
O(n).
Note that finding the median of an array is a special case of this where k = n / 2.
This is a very important point, as an interviewer will often ask you to find a way to
get the median of an array of numbers.
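For illustration, a randomized quickselect along these lines might look like the sketch below (the function name `quickselect` and the Hoare-style partition are choices of this sketch, not prescribed by the text above):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>
using namespace std;

// Return the kth smallest element (k is 0-based) by repeatedly
// partitioning around a random pivot and recursing into one side.
int quickselect(vector<int> a, int k) {
    int lo = 0, hi = (int)a.size() - 1;
    while (lo < hi) {
        // random pivot, then a Hoare-style partition
        int pivot = a[lo + rand() % (hi - lo + 1)];
        int i = lo, j = hi;
        while (i <= j) {
            while (a[i] < pivot) ++i;
            while (a[j] > pivot) --j;
            if (i <= j) { std::swap(a[i], a[j]); ++i; --j; }
        }
        // only the side containing index k is processed further,
        // giving the n + n/2 + n/4 + ... = O(n) expected time
        if (k <= j)      hi = j;
        else if (k >= i) lo = i;
        else             return a[k];   // k landed between the partitions
    }
    return a[lo];
}
```

The median is then `quickselect(a, a.size() / 2)`.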
Are you sure this implementation will work and you will not miss the true median? I was once asked to find the 4 largest elements in an array and I provided an O(4n) solution: find the maximum, remove it, find the 2nd maximum, and so on. In the case of the median, that approach would be O(n^2).
@aalimian: Google "kth selection algorithm" or "quickselect algorithm". There's also a guaranteed-O(n) algorithm that is a little more complicated to understand, called median-of-medians selection. Basically, quickselect partitions n elements, then n/2, then n/4, etc., to find the top k elements. This sums to 2n on average.
Write all the elements to disk in a first O(N) pass, and randomly capture about 99 of them into memory. Sort the 99 elements to create 100 ranges for the billion elements to be bucketed into. Each range will be a file on disk. For each of the billion integers, do a binary search on your 99 randomly sampled integers to determine which file they go to next. Keep track of how many integers are in each file. At the end of the second pass you will have a hundred files, each with about ten million elements each. You'll know how many elements are in each file, and you'll be able to easily compute which file contains the median, and where the median ranks within that file. At that point, use an in-memory solution to find the nth element in the file that you know contains the median. Sorting ten million integers is pretty fast, so you can probably just sort all of them without having to go to a fancy heap-based approach. The total cost here is roughly 2 billion disk writes and roughly 11 billion integer comparisons.
This will get the job done easily for most data sets, but you can refine it by having provisions for one range taking an unexpectedly large number of values. Also, after a certain amount of time, you'll know that it will be impossible for certain ranges to ever contain the median, so you can short circuit writing values to their files and simply update their counts.
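In miniature, the sample-and-bucket idea might be sketched as below, with in-memory vectors standing in for the hundred range files on disk (the pivot count and the function name are illustrative assumptions):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>
using namespace std;

// Sketch of the sample-and-bucket passes. `pivots` plays the role of
// the ~99 randomly captured values; each vector in `buckets` stands
// in for one range file on disk. Assumes non-empty input.
int bucketedMedian(const vector<int>& data, int numPivots) {
    // first pass: randomly sample pivots, then sort them into ranges
    vector<int> pivots;
    for (int i = 0; i < numPivots; ++i)
        pivots.push_back(data[rand() % data.size()]);
    sort(pivots.begin(), pivots.end());

    // second pass: binary-search each value into its range bucket
    vector< vector<int> > buckets(pivots.size() + 1);
    for (size_t i = 0; i < data.size(); ++i) {
        size_t b = upper_bound(pivots.begin(), pivots.end(), data[i]) - pivots.begin();
        buckets[b].push_back(data[i]);
    }

    // the per-bucket counts tell us which bucket holds the median and
    // the median's rank within it; only that bucket is sorted in memory
    size_t target = data.size() / 2, seen = 0;
    for (size_t b = 0; b < buckets.size(); ++b) {
        if (seen + buckets[b].size() > target) {
            sort(buckets[b].begin(), buckets[b].end());
            return buckets[b][target - seen];
        }
        seen += buckets[b].size();
    }
    return -1; // unreachable for non-empty input
}
```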
This is much better than the two-heap approach; I am not even sure the heap approach can actually work... An upgrade to this algorithm would be to keep splitting until we get partitions of size at most 10^7. Also, while writing to a file we could track the max and min seen and divide the intervals based on those instead of on random sampling.
After we get a partition of about 10^7 elements, we could use the ith-order-statistic method (similar to quicksort's partition) to get the median in O(n) time instead of sorting and taking O(N*log(N)) time.
But with infinite input I don't think it is possible to retrieve the median in O(1) time at any given moment.
Let's assume memory and disk are both constrained, but you can assume that your dataset is ordered in a fairly random way. Use the first element as the provisional median. Maintain two heaps, one for numbers lower than the median and one for numbers higher than the median. For each new number, decide which heap to update and keep track of the size of both heaps. Whenever you get two elements in a row that go to the same heap, you pop off the top element from the large heap, compare it to the most recent number, and decide the new provisional median. Put the other number in the smaller heap.
Since memory and disk are constrained, you will want to limit the size of the heaps. This means that numbers that are "extreme" will get tossed. This is risky, because the "extreme" number might actually be a non-extreme number that's only "extreme" relative to early numbers in the dataset. For real world problems, this is often an acceptable risk, as you're effectively treating early values as outliers. You can detect the situation by keeping track of the max number discarded for being too small and keeping track of the min number discarded for being too big.
If you want to compute the *exact* median, then you have to store the first half billion integers *somewhere*, since every one of the first half billion integers could theoretically still be the median without any knowledge of the second half of the sample.
For a finite input (1 billion integers), a simple and efficient way is to sort the array (e.g. merge sort) and return the middle item: the median.
Efficiency is O(n) in space and O(n*log(n)) in time, where n = 1,000,000,000.
For the infinite input, the best way would be to hold a self-balancing binary search tree (BST). With each integer from the infinite stream, the value is inserted into the BST at a cost of O(log(n)) in time, where n is the current number of elements in the tree.
For the infinite stream, the two-heaps approach described by Abhi is much better. With the self-balancing BST you have to do O(log(N)) work to get the median every time it is asked for, while with the heap approach you get it in O(1), the rest of the complexities remaining the same for both algorithms.
Assume that the integers are 32-bit integers and that you can read the data stream four times (either by asking for it again or by persisting it to disk). You can solve this in four passes. In the first pass, shift each number right by 24 bits and increment one of 256 buckets. After the first pass, iterate through the buckets to find out which one contains the median value and what the median's rank within it will be. On the second pass, mask out the highest eight bits to filter values, then use the next eight bits to populate 256 buckets again. Repeat for the third and fourth passes. After the fourth pass you will have your result.
P.S. After the first pass you will already have an estimate of the median that is within 16 million of the exact median. On the second pass you will have an estimate that is within 64k of the precise median. After the third pass you will be at most 256 off. After the 4th pass you'll know the precise integer median. If you're dealing with 64-bit numbers, you will do eight passes.
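Assuming non-negative 32-bit values, and with an in-memory array standing in for the re-readable stream, the byte-at-a-time passes might be sketched like this (to get the median, call it with target = n/2; the function name is an assumption of this sketch):

```cpp
#include <cstddef>
#include <vector>
using namespace std;

// Find the value of rank `target` (0-based) in four byte-wide passes.
// Each pass fixes eight more of the answer's high bits.
// Precondition: target < data.size().
unsigned radixRankSelect(const vector<unsigned>& data, size_t target) {
    unsigned prefix = 0;              // high bits decided so far
    for (int pass = 0; pass < 4; ++pass) {
        int shift = 24 - 8 * pass;
        // only values whose already-decided high bits match `prefix`
        // participate in this pass (no filter on the first pass)
        unsigned mask = (pass == 0) ? 0u : (~0u << (shift + 8));
        size_t count[256] = {0};
        for (size_t i = 0; i < data.size(); ++i) {
            if ((data[i] & mask) == prefix)
                ++count[(data[i] >> shift) & 0xFF];
        }
        // walk the buckets to find the one containing rank `target`
        unsigned b = 0;
        while (target >= count[b]) {
            target -= count[b];
            ++b;
        }
        prefix |= b << shift;         // another byte of the answer is fixed
    }
    return prefix;
}
```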
As for the first question, I think the "select" or "median of medians" algorithm solves it perfectly, as the time complexity is O(n) to find the kth order statistic. For the second one, I think Abhi's idea is good, and I think an AVL-style search tree could also do the job: the root divides the elements into two subtrees, the left subtree less than the root and the right subtree larger, and the difference between the subtrees' sizes should be at most 1 (strictly, this requires balancing by subtree size rather than height, which a standard AVL tree does not guarantee). The median can then be obtained in three cases:
1) the balance factor of the root is 0: the root is the median, which is O(1);
2) the balance factor of the root is -1, meaning the right subtree has one more element than the left subtree: the root is still the median;
3) the balance factor of the root is 1: this case may be tricky, because the root is the upper median instead of the lower median (the number of elements is definitely even), and to get the lower median we need the rightmost element of the left subtree, which takes O(logn).
The insertion time complexity of the AVL search tree is O(logn), of course.
Full solution in C++ using two heaps. Running time is O(n*log n):
#include <iostream>
#include <vector>
using namespace std;
#define LeftChild(i) (2 * (i) + 1)
#define RightChild(i) (2 * (i) + 2)
#define Parent(i) (((i) - 1) / 2)
template <typename T>
class Heap {
public:
Heap(int (*cmp)(T, T)) {
this->cmp = cmp;
}
void push(T value) {
this->data.push_back(value);
this->siftUp();
}
T pop() {
// precondition: the heap must be non-empty;
// popping from an empty heap is undefined here
T ret = this->data[0];
this->data[0] = this->data.back();
this->data.pop_back();
this->siftDown();
return ret;
}
T peak() {
return this->data[0];
}
int size() {
return this->data.size();
}
void print() {
for (typename vector<T>::iterator it = this->data.begin(); it != this->data.end(); ++it) {
cout << *it << ", ";
}
cout << endl;
}
private:
vector<T> data;
int (*cmp)(T, T);
void siftDown() {
int i = 0;
// walk down from the root, always swapping with whichever child
// wins the comparison, until the heap property holds again
while (LeftChild(i) < (int)this->data.size()) {
int j = LeftChild(i);
if (RightChild(i) < (int)this->data.size() && cmp(this->data[RightChild(i)], this->data[j]) > 0) {
j = RightChild(i);
}
if (cmp(this->data[i], this->data[j]) >= 0) {
break;
}
this->swap(i, j);
i = j;
}
}
void siftUp() {
int i = this->data.size() - 1;
while (i != 0 && cmp(this->data[i], this->data[Parent(i)]) > 0) {
this->swap(i, Parent(i));
i = Parent(i);
}
}
void swap(int i, int j) {
T tmp = this->data[i];
this->data[i] = this->data[j];
this->data[j] = tmp;
}
};
int infComp(int a, int b) {
return b - a;
}
int supComp(int a, int b) {
return a - b;
}
void computeMedian(vector<int> L) {
Heap<int> minheap(&infComp), maxheap(&supComp);
for (vector<int>::iterator it = L.begin(); it != L.end(); ++it) {
cout << "New item: " << *it << endl;
if (minheap.size() == 0 || *it > minheap.peak()) {
minheap.push(*it);
} else {
maxheap.push(*it);
}
// rebalance so the heap sizes differ by at most one
if (minheap.size() > maxheap.size() + 1) {
maxheap.push(minheap.pop());
} else if (maxheap.size() > minheap.size() + 1) {
minheap.push(maxheap.pop());
}
// median: top of the larger heap, or the average of both tops
if (minheap.size() > maxheap.size()) {
cout << "Median is " << minheap.peak() << endl;
} else if (maxheap.size() > minheap.size()) {
cout << "Median is " << maxheap.peak() << endl;
} else {
cout << "Median is " << (maxheap.peak() + minheap.peak()) / 2.0 << endl;
}
maxheap.print();
minheap.print();
}
}
int main(int argc, char* argv[]) {
vector<int> L;
L.push_back(1);
L.push_back(8);
L.push_back(2);
L.push_back(15);
L.push_back(7);
L.push_back(9);
L.push_back(23);
L.push_back(15);
computeMedian(L);
return 0;
}
@Abhi... there are billions of numbers, and if we use the two-heaps concept then we need O(n) space. Is it possible to reduce the space complexity?
If the integers can range from 0 to 2^32 (approximately 4 billion), then you still have scaling issues. Assume the first half of your dataset ranges from 1 billion to 1.5 billion. You need to track every number, since the second half of the dataset could all be less than 1 billion or all be more than 1.5 billion.
It's a good question, though. If you constrain the range of the numbers more, then the "bucket" approach becomes practical.
For a constant-length array, the select algorithm works well.
For a continuous stream, it is unknown how much to allocate for the array ahead of time, and repeated array reallocation at runtime is expensive. Hence, use a BST and keep a count of the total number of elements N. Then, when the median is needed, do an in-order traversal to the N/2-th element; worst case O(N), best case O(logN).
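As a sketch of that idea (an unbalanced BST here for brevity; a self-balancing variant would keep the insert at O(log N) in the worst case; all names are illustrative):

```cpp
#include <cstddef>
using namespace std;

// Minimal (unbalanced) BST: O(height) insert, median recovered by an
// in-order walk that stops at the N/2-th element.
struct Node {
    int value;
    Node *left, *right;
    Node(int v) : value(v), left(NULL), right(NULL) {}
};

Node* insert(Node* root, int v) {
    if (root == NULL) return new Node(v);
    if (v < root->value) root->left = insert(root->left, v);
    else                 root->right = insert(root->right, v);
    return root;
}

// In-order walk: k counts down at each visited node; 0 means "found".
bool kth(Node* root, int& k, int& out) {
    if (root == NULL) return false;
    if (kth(root->left, k, out)) return true;
    if (k-- == 0) { out = root->value; return true; }
    return kth(root->right, k, out);
}

int medianOf(Node* root, int n) {
    int k = n / 2, out = 0;   // upper median when n is even
    kth(root, k, out);
    return out;
}
```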
The max/min-heap approach does not work when the stream of integers is infinite, since it would need to record all of the numbers. Since the count is already in the billions, and assuming each integer value is less than 2^32, I would use an array of 2^32 counters to count the occurrences of each value in the infinite stream and track the median value within that array.
My approach is quite the same, and maybe dumb... but a billion is ~2^30. Since these numbers are integers, you have a range of 2^32 possibilities. If each possibility is a 32-bit counter (which gives you massive capacity for the endless stream, despite hotspots), you can use an int array to count the integers. Since this counter is indexed, the counting process is O(n) and scales with the stream.
Further, you can use a double (or a BigInteger) to count the TOTAL number of entries; then, to find the median, just halve the total and walk a pointer, summing counts until you reach that rank.
Issues: the median is not real-time, and the stream would need to stop.
I would go with the heap or BST options rather than mine... but it was just a new idea...
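A minimal sketch of the counting idea, with an artificially small value range (a full 2^32-entry array of 32-bit counters would itself occupy 16 GB, so RANGE here only illustrates the mechanics; the struct name is an assumption):

```cpp
#include <vector>
using namespace std;

const int RANGE = 1024;           // stand-in for the full 2^32 range

// Count-array median for values in [0, RANGE): O(1) work per stream
// element; the median is recovered by a prefix-sum walk over counts.
struct CountingMedian {
    vector<unsigned long long> count;
    unsigned long long total;
    CountingMedian() : count(RANGE, 0), total(0) {}
    void add(int value) { ++count[value]; ++total; }
    int median() const {          // upper median when total is even
        unsigned long long half = total / 2, seen = 0;
        for (int v = 0; v < RANGE; ++v) {
            seen += count[v];
            if (seen > half) return v;
        }
        return -1;                // empty stream
    }
};
```

Each median query costs an O(RANGE) walk over the counters, which is the "not real-time" caveat noted above.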
I thought of the same solution. But, it won't work!
How many numbers would you keep track of, at any instant?
Consider the worst case, where many numbers are inserted on one particular side of the median (i.e. to the left or right of the median). Assume that you keep track of 4 numbers:
60, 70, 80, 90 --> median = 70
I have lost track of other numbers like 50, 55, 95, 100, since we are keeping only the nearest neighbors of the median, 4 of them here.
Now I insert 20 numbers with values between 20 and 40. How would you get the median? You have lost the values 50, 55, etc., which you should have considered. This is where it would fail.
So the correct solution would be to use a self-balancing tree such as an AVL tree, with the root of the tree near the median.
Use two heaps, a MAX heap and a MIN heap. This lets us decide in O(1) time which heap the new number goes into. If the heaps fall out of balance, rebalance them by moving one item in O(log N).
- Abhi February 16, 2013