Google Interview Question
Software Engineer / Developer
Country: United States
I don't quite understand how to use two heaps to find the median. Can you explain it in detail for me? Thank you.
I have the same question and will watch the thread... So far the reply by Asaf Nachmany seems to be the most accurate.
The median is the middle element of a sorted array of numbers.
In other words, if you partition the sorted array into two sets of equal size, the elements at the border between the sets determine the median.
Now, when an element arrives, we insert it into the left set or the right set in such a way that both sets stay sorted and keep an equal number of elements.
When an element less than the max of the left set arrives, it goes into the left set; when an element greater than the min of the right set arrives, it goes into the right set. To keep the sets equal in size, the max element of the left set may move across the border into the right set, or the min element of the right set may move across the border into the left set.
So we need an efficient way to insert an element into a sorted collection and to extract its min or max.
A heap qualifies for exactly that purpose.
The left set is modeled with a max-heap and the right set with a min-heap.
MaxHeap max; // stores the left (smaller) half
MinHeap min; // stores the right (larger) half
Array array;

max.Insert( array.GetNext() );
while( array.HasNext() )
{
    int data = array.GetNext();
    // every element goes into exactly one heap
    if( data <= max.GetMax() )
        max.Insert( data );
    else
        min.Insert( data );

    // rebalance so the heap sizes differ by at most one
    int num_left = max.GetNumElements() - min.GetNumElements();
    if( num_left < -1 )
        max.Insert( min.PopMin() );
    if( num_left > 1 )
        min.Insert( max.PopMax() );
}

int num_left = max.GetNumElements() - min.GetNumElements();
if( num_left < 0 )
    median = min.GetMin();
else if( num_left > 0 )
    median = max.GetMax();
else
    median = ( max.GetMax() + min.GetMin() ) / 2; // even count: average the two middles
@Arun: "Median is the middle element in the sorted array of numbers" - having a sorted array is quite a weak assumption.
See my answer that begins with "Let's assume memory and disk are both constrained...". It elaborates on the dual-heap approach in the context of large datasets.
Ok, we have an O(1) check, but we do an insert for every element at O(log(n)), so the overall complexity is O(n*log(n)), which equals the basic quicksort solution. Where am I wrong? Sorry for the stupid question.
Ok guys, all problems resolved. I think I finally figured out what Abhi actually meant...
Basically there are two heaps and one integer holding the median value:
if data > median
    insert into minheap
else
    insert into maxheap
if abs(sizediff) > 1
    adjust: pop an element from the larger heap into the median variable and push the erstwhile median onto the other heap, thus balancing the heaps...
The actual median can be obtained from the tops of the heaps plus the median variable, because we may need to average the two middle values when the total size is even.
Here's the actual code...
int median()
{
    int median;
    ifstream fin("inp.txt");
    MinHeap M;   // holds values >= the running median
    MaxHeap N;   // holds values <  the running median
    int temp, sizediff;
    fin >> temp;
    median = temp;
    while( fin >> temp )   // avoids the classic !fin.eof() extra-read bug
    {
        if( temp >= median )
            M.insert(temp);
        else
            N.insert(temp);
        sizediff = M.size() - N.size();
        if( sizediff > 1 )
        {
            temp = median;
            median = M.pop();
            N.insert(temp);
        }
        else if( sizediff < -1 )
        {
            temp = median;
            median = N.pop();
            M.insert(temp);
        }
    }
    sizediff = M.size() - N.size();
    if( sizediff == 1 )
        return (M.top() + median) / 2;   // even count: average of the two middles
    else if( sizediff == -1 )
        return (N.top() + median) / 2;
    else
        return median;
}
Sometimes, an interviewer will ask you to describe an algorithm to identify the kth
smallest element in an array of n elements. To do this, you select a random pivot
and partition the array as you would in the quicksort algorithm. Then, based on the
index of the pivot element, you know which half of the array the desired element lies
in. For example, say k = 15 and n = 30, and after you select your pivot and partition
the array, the first half has 10 elements (the half before the pivot). You know that
the desired element is the 4th smallest element in the larger half. To identify the
element, you partition the second half of the array and continue recursively. The
reason that this is not O(n log n) is that the recursive partition call is only on one
half of the array, so the expected running time is n + (n/2) + (n/4) + (n/8) + ... =
O(n).
Note that finding the median of an array is a special case of this where k = n / 2.
This is a very important point, as an interviewer will often ask you to find a way to
get the median of an array of numbers.
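For illustration, a randomized quickselect along these lines might look like the sketch below (the function name `quickselect` and the Hoare-style partition are choices of this sketch, not prescribed by the text above):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>
using namespace std;

// Return the kth smallest element (k is 0-based) by repeatedly
// partitioning around a random pivot and recursing into one side.
int quickselect(vector<int> a, int k) {
    int lo = 0, hi = (int)a.size() - 1;
    while (lo < hi) {
        // random pivot, then a Hoare-style partition
        int pivot = a[lo + rand() % (hi - lo + 1)];
        int i = lo, j = hi;
        while (i <= j) {
            while (a[i] < pivot) ++i;
            while (a[j] > pivot) --j;
            if (i <= j) { std::swap(a[i], a[j]); ++i; --j; }
        }
        // only the side containing index k is processed further,
        // giving the n + n/2 + n/4 + ... = O(n) expected time
        if (k <= j)      hi = j;
        else if (k >= i) lo = i;
        else             return a[k];   // k landed between the partitions
    }
    return a[lo];
}
```

The median is then `quickselect(a, a.size() / 2)`.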
Are you sure this implementation will work and you will not miss the true median? I was once asked to find the 4 largest elements in an array and I provided an O(4n) solution: find the maximum, remove it, find the 2nd maximum, and so on. In the case of the median, that approach would be O(n^2).
@aalimian: Google "kth selection algorithm" or "quickselect algorithm". There's also a guaranteed-O(n) algorithm that is a little more complicated to understand, called median-of-medians selection. Basically, quickselect partitions n elements, then n/2, then n/4, etc., to find the top k elements. This sums to 2n on average.
Write all the elements to disk in a first O(N) pass, and randomly capture about 99 of them into memory. Sort the 99 elements to create 100 ranges for the billion elements to be bucketed into. Each range will be a file on disk. For each of the billion integers, do a binary search on your 99 randomly sampled integers to determine which file they go to next. Keep track of how many integers are in each file. At the end of the second pass you will have a hundred files, each with about ten million elements each. You'll know how many elements are in each file, and you'll be able to easily compute which file contains the median, and where the median ranks within that file. At that point, use an in-memory solution to find the nth element in the file that you know contains the median. Sorting ten million integers is pretty fast, so you can probably just sort all of them without having to go to a fancy heap-based approach. The total cost here is roughly 2 billion disk writes and roughly 11 billion integer comparisons.
This will get the job done easily for most data sets, but you can refine it by having provisions for one range taking an unexpectedly large number of values. Also, after a certain amount of time, you'll know that it will be impossible for certain ranges to ever contain the median, so you can short circuit writing values to their files and simply update their counts.
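In miniature, the sample-and-bucket idea might be sketched as below, with in-memory vectors standing in for the hundred range files on disk (the pivot count and the function name are illustrative assumptions):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>
using namespace std;

// Sketch of the sample-and-bucket passes. `pivots` plays the role of
// the ~99 randomly captured values; each vector in `buckets` stands
// in for one range file on disk. Assumes non-empty input.
int bucketedMedian(const vector<int>& data, int numPivots) {
    // first pass: randomly sample pivots, then sort them into ranges
    vector<int> pivots;
    for (int i = 0; i < numPivots; ++i)
        pivots.push_back(data[rand() % data.size()]);
    sort(pivots.begin(), pivots.end());

    // second pass: binary-search each value into its range bucket
    vector< vector<int> > buckets(pivots.size() + 1);
    for (size_t i = 0; i < data.size(); ++i) {
        size_t b = upper_bound(pivots.begin(), pivots.end(), data[i]) - pivots.begin();
        buckets[b].push_back(data[i]);
    }

    // the per-bucket counts tell us which bucket holds the median and
    // the median's rank within it; only that bucket is sorted in memory
    size_t target = data.size() / 2, seen = 0;
    for (size_t b = 0; b < buckets.size(); ++b) {
        if (seen + buckets[b].size() > target) {
            sort(buckets[b].begin(), buckets[b].end());
            return buckets[b][target - seen];
        }
        seen += buckets[b].size();
    }
    return -1; // unreachable for non-empty input
}
```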
This is much better than the two-heap approach; I am not even sure the heap approach can actually work... An upgrade to this algorithm would be to keep splitting until we get partitions of size at most 10^7. Also, while writing to a file we could track the max and min seen and divide the intervals based on those instead of on random sampling.
After we get a partition of about 10^7 elements, we could use the ith-order-statistic method (similar to quicksort's partition) to get the median in O(n) time instead of sorting and taking O(N*log(N)) time.
But with infinite input I don't think it is possible to retrieve the median in O(1) time at any given moment.
Let's assume memory and disk are both constrained, but you can assume that your dataset is ordered in a fairly random way. Use the first element as the provisional median. Maintain two heaps, one for numbers lower than the median and one for numbers higher than the median. For each new number, decide which heap to update and keep track of the size of both heaps. Whenever you get two elements in a row that go to the same heap, you pop off the top element from the large heap, compare it to the most recent number, and decide the new provisional median. Put the other number in the smaller heap.
Since memory and disk are constrained, you will want to limit the size of the heaps. This means that numbers that are "extreme" will get tossed. This is risky, because the "extreme" number might actually be a non-extreme number that's only "extreme" relative to early numbers in the dataset. For real world problems, this is often an acceptable risk, as you're effectively treating early values as outliers. You can detect the situation by keeping track of the max number discarded for being too small and keeping track of the min number discarded for being too big.
If you want to compute the *exact* median, then you have to store the first half billion integers *somewhere*, since every one of the first half billion integers could theoretically still be the median without any knowledge of the second half of the sample.
For a finite input (1 billion integers), a simple and efficient way is to sort the array (e.g. merge sort) and return the middle item: the median.
Efficiency is O(n) in space and O(n*log(n)) in time, where n = 1,000,000,000.
For the infinite input, the best way would be to hold a self-balancing binary search tree (BST). With each integer from the infinite stream, the value is inserted into the BST at a cost of O(log(n)) in time, where n is the current number of elements in the tree.
For the infinite stream, the two-heaps approach described by Abhi is much better. With the self-balancing BST you have to do O(log(N)) work to get the median every time it is asked for, while with the heap approach you get it in O(1), the rest of the complexities remaining the same for both algorithms.
Assume that the integers are 32-bit integers and that you can read the data stream four times (either by asking for it again or by persisting it to disk). You can solve this in four passes. In the first pass, shift each number right by 24 bits and increment one of 256 buckets. After the first pass, iterate through the buckets to find out which one contains the median value and what the median's rank within it will be. On the second pass, mask out the highest eight bits to filter values, then use the next eight bits to populate 256 buckets again. Repeat for the third and fourth passes. After the fourth pass you will have your result.
P.S. After the first pass you will already have an estimate of the median that is within 16 million of the exact median. On the second pass you will have an estimate that is within 64k of the precise median. After the third pass you will be at most 256 off. After the 4th pass you'll know the precise integer median. If you're dealing with 64-bit numbers, you will do eight passes.
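Assuming non-negative 32-bit values, and with an in-memory array standing in for the re-readable stream, the byte-at-a-time passes might be sketched like this (to get the median, call it with target = n/2; the function name is an assumption of this sketch):

```cpp
#include <cstddef>
#include <vector>
using namespace std;

// Find the value of rank `target` (0-based) in four byte-wide passes.
// Each pass fixes eight more of the answer's high bits.
// Precondition: target < data.size().
unsigned radixRankSelect(const vector<unsigned>& data, size_t target) {
    unsigned prefix = 0;              // high bits decided so far
    for (int pass = 0; pass < 4; ++pass) {
        int shift = 24 - 8 * pass;
        // only values whose already-decided high bits match `prefix`
        // participate in this pass (no filter on the first pass)
        unsigned mask = (pass == 0) ? 0u : (~0u << (shift + 8));
        size_t count[256] = {0};
        for (size_t i = 0; i < data.size(); ++i) {
            if ((data[i] & mask) == prefix)
                ++count[(data[i] >> shift) & 0xFF];
        }
        // walk the buckets to find the one containing rank `target`
        unsigned b = 0;
        while (target >= count[b]) {
            target -= count[b];
            ++b;
        }
        prefix |= b << shift;         // another byte of the answer is fixed
    }
    return prefix;
}
```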
As for the first question, I think the "select" or "median of medians" algorithm solves it perfectly, as the time complexity is O(n) to find the kth order statistic. For the second one, I think Abhi's idea is good, and I think an AVL-style search tree could also do the job: the root divides the elements into two subtrees, the left subtree less than the root and the right subtree larger, and the difference between the subtrees' sizes should be at most 1 (strictly, this requires balancing by subtree size rather than height, which a standard AVL tree does not guarantee). The median can then be obtained in three cases:
1) the balance factor of the root is 0: the root is the median, which is O(1);
2) the balance factor of the root is -1, meaning the right subtree has one more element than the left subtree: the root is still the median;
3) the balance factor of the root is 1: this case may be tricky, because the root is the upper median instead of the lower median (the number of elements is definitely even), and to get the lower median we need the rightmost element of the left subtree, which takes O(logn).
The insertion time complexity of the AVL search tree is O(logn), of course.
Full solution in C++ using two heaps. Running time is O(n*log n):
#include <iostream>
#include <vector>
using namespace std;
#define LeftChild(i) (2 * (i) + 1)
#define RightChild(i) (2 * (i) + 2)
#define Parent(i) (((i) - 1) / 2)
template <typename T>
class Heap {
public:
Heap(int (*cmp)(T, T)) {
this->cmp = cmp;
}
void push(T value) {
this->data.push_back(value);
this->siftUp();
}
T pop() {
// precondition: the heap must be non-empty;
// popping from an empty heap is undefined here
T ret = this->data[0];
this->data[0] = this->data.back();
this->data.pop_back();
this->siftDown();
return ret;
}
T peak() {
return this->data[0];
}
int size() {
return this->data.size();
}
void print() {
for (typename vector<T>::iterator it = this->data.begin(); it != this->data.end(); ++it) {
cout << *it << ", ";
}
cout << endl;
}
private:
vector<T> data;
int (*cmp)(T, T);
void siftDown() {
int i = 0;
// walk down from the root, always swapping with whichever child
// wins the comparison, until the heap property holds again
while (LeftChild(i) < (int)this->data.size()) {
int j = LeftChild(i);
if (RightChild(i) < (int)this->data.size() && cmp(this->data[RightChild(i)], this->data[j]) > 0) {
j = RightChild(i);
}
if (cmp(this->data[i], this->data[j]) >= 0) {
break;
}
this->swap(i, j);
i = j;
}
}
void siftUp() {
int i = this->data.size() - 1;
while (i != 0 && cmp(this->data[i], this->data[Parent(i)]) > 0) {
this->swap(i, Parent(i));
i = Parent(i);
}
}
void swap(int i, int j) {
T tmp = this->data[i];
this->data[i] = this->data[j];
this->data[j] = tmp;
}
};
int infComp(int a, int b) {
return b - a;
}
int supComp(int a, int b) {
return a - b;
}
void computeMedian(vector<int> L) {
Heap<int> minheap(&infComp), maxheap(&supComp);
for (vector<int>::iterator it = L.begin(); it != L.end(); ++it) {
cout << "New item: " << *it << endl;
if (minheap.size() == 0 || *it > minheap.peak()) {
minheap.push(*it);
} else {
maxheap.push(*it);
}
// rebalance so the heap sizes differ by at most one
if (minheap.size() > maxheap.size() + 1) {
maxheap.push(minheap.pop());
} else if (maxheap.size() > minheap.size() + 1) {
minheap.push(maxheap.pop());
}
// median: top of the larger heap, or the average of both tops
if (minheap.size() > maxheap.size()) {
cout << "Median is " << minheap.peak() << endl;
} else if (maxheap.size() > minheap.size()) {
cout << "Median is " << maxheap.peak() << endl;
} else {
cout << "Median is " << (maxheap.peak() + minheap.peak()) / 2.0 << endl;
}
maxheap.print();
minheap.print();
}
}
int main(int argc, char* argv[]) {
vector<int> L;
L.push_back(1);
L.push_back(8);
L.push_back(2);
L.push_back(15);
L.push_back(7);
L.push_back(9);
L.push_back(23);
L.push_back(15);
computeMedian(L);
return 0;
}
@Abhi... there are billions of numbers, and if we use the two-heaps concept then we need O(n) space. Is it possible to reduce the space complexity?
If the integers can range from 0 to 2^32 (approximately 4 billion), then you still have scaling issues. Assume the first half of your dataset ranges from 1 billion to 1.5 billion. You need to track every number, since the second half of the dataset could all be less than 1 billion or all be more than 1.5 billion.
It's a good question, though. If you constrain the range of the numbers more, then the "bucket" approach becomes practical.
For a constant-length array, the select algorithm works well.
For a continuous stream, it is unknown how much to allocate for the array ahead of time, and repeated array reallocation at runtime is expensive. Hence, use a BST and keep a count of the total number of elements N. Then, when the median is needed, do an in-order traversal to the N/2-th element; worst case O(N), best case O(logN).
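As a sketch of that idea (an unbalanced BST here for brevity; a self-balancing variant would keep the insert at O(log N) in the worst case; all names are illustrative):

```cpp
#include <cstddef>
using namespace std;

// Minimal (unbalanced) BST: O(height) insert, median recovered by an
// in-order walk that stops at the N/2-th element.
struct Node {
    int value;
    Node *left, *right;
    Node(int v) : value(v), left(NULL), right(NULL) {}
};

Node* insert(Node* root, int v) {
    if (root == NULL) return new Node(v);
    if (v < root->value) root->left = insert(root->left, v);
    else                 root->right = insert(root->right, v);
    return root;
}

// In-order walk: k counts down at each visited node; 0 means "found".
bool kth(Node* root, int& k, int& out) {
    if (root == NULL) return false;
    if (kth(root->left, k, out)) return true;
    if (k-- == 0) { out = root->value; return true; }
    return kth(root->right, k, out);
}

int medianOf(Node* root, int n) {
    int k = n / 2, out = 0;   // upper median when n is even
    kth(root, k, out);
    return out;
}
```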
The max/min-heap approach does not work when the stream of integers is infinite, since it would need to record all of the numbers. Since the count is already in the billions, and assuming each integer value is less than 2^32, I would use an array of 2^32 counters to count the occurrences of each value in the infinite stream and track the median value within that array.
My approach is quite the same, and maybe dumb... but a billion is ~2^30. Since these numbers are integers, you have a range of 2^32 possibilities. If each possibility is a 32-bit counter (which gives you massive capacity for the endless stream, despite hotspots), you can use an int array to count the integers. Since this counter is indexed, the counting process is O(n) and scales with the stream.
Further, you can use a double (or a BigInteger) to count the TOTAL number of entries; then, to find the median, just halve the total and walk a pointer, summing counts until you reach that rank.
Issues: the median is not real-time, and the stream would need to stop.
I would go with the heap or BST options rather than mine... but it was just a new idea...
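A minimal sketch of the counting idea, with an artificially small value range (a full 2^32-entry array of 32-bit counters would itself occupy 16 GB, so RANGE here only illustrates the mechanics; the struct name is an assumption):

```cpp
#include <vector>
using namespace std;

const int RANGE = 1024;           // stand-in for the full 2^32 range

// Count-array median for values in [0, RANGE): O(1) work per stream
// element; the median is recovered by a prefix-sum walk over counts.
struct CountingMedian {
    vector<unsigned long long> count;
    unsigned long long total;
    CountingMedian() : count(RANGE, 0), total(0) {}
    void add(int value) { ++count[value]; ++total; }
    int median() const {          // upper median when total is even
        unsigned long long half = total / 2, seen = 0;
        for (int v = 0; v < RANGE; ++v) {
            seen += count[v];
            if (seen > half) return v;
        }
        return -1;                // empty stream
    }
};
```

Each median query costs an O(RANGE) walk over the counters, which is the "not real-time" caveat noted above.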
I thought of the same solution. But, it won't work!
How many numbers would you keep track of, at any instant?
Consider the worst case, where many numbers are inserted on one particular side of the median (i.e. to the left or right of the median). Assume that you keep track of 4 numbers:
60, 70, 80, 90 --> median = 70
I have lost track of other numbers like 50, 55, 95, 100, since we are keeping only the nearest neighbors of the median, 4 of them here.
Now I insert 20 numbers with values between 20 and 40. How would you get the median? You have lost the values 50, 55, etc., which you should have considered. This is where it would fail.
So the correct solution would be to use a self-balancing tree such as an AVL tree, with the root of the tree near the median.
Use two heaps, a MAX heap and a MIN heap. This lets us decide in O(1) time which heap the new number goes into. If the heaps fall out of balance, rebalance them by moving one item in O(log N).
- Abhi February 16, 2013