IBM Interview Question
Software Engineer / Developer

I am not sure whether there is a subtle way to take the median of each chunk and, with some bookkeeping, get to the solution.
So I would go with the naive way: sort the entire file and then read off the median. Let's say we have M memory available and the file size is N (assuming N and M are in the same units).
1) Number of chunks C = N / M;
2) Load each chunk into memory, sort it, and flush it to disk into a new file F'.
3) So, now do a merge sort over the chunks:
That is, bring in the first half of chunk 1 and the first half of chunk 2, merge them, and keep writing the correctly ordered output to the new file (just as we create a new array during merge sort).
If you visualize it, you are building a tree from the leaves to the root (the leaves being the chunks, which are gradually merged until you get one big sorted file).
Yeah, the disadvantage is that you use extra disk space for the intermediate files each time (but we can delete them from disk later).
Once you have the sorted file on disk, bring chunk C/2 into memory and take its middle element.
PS: Yes, we will have to work out the ceiling/floor issues, but that is not the interesting part.
Hope it helps. Thanks!
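A minimal sketch of the scheme above in Python, assuming one integer per line in a text file; `chunk_size` plays the role of M, and the ceiling/floor issues are ignored as noted:

```python
import heapq
import os
import tempfile

def external_median(path, chunk_size):
    """External merge sort: sort chunks in memory, spill them to disk,
    k-way merge the sorted runs, and stop at the middle element."""
    runs = []
    with open(path) as f:
        while True:
            # read the next chunk (at most chunk_size lines) into memory
            chunk = [int(line) for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            chunk.sort()
            tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
            tmp.write("\n".join(map(str, chunk)) + "\n")
            tmp.close()
            runs.append(tmp.name)
    with open(path) as f:
        total = sum(1 for _ in f)       # one extra pass to learn N
    files = [open(r) for r in runs]
    merged = heapq.merge(*[(int(line) for line in fh) for fh in files])
    median = None
    for i, value in enumerate(merged):
        if i == total // 2:             # middle element of the merged order
            median = value
            break
    for fh in files:
        fh.close()
    for r in runs:
        os.remove(r)
    return median
```

`heapq.merge` streams the runs, holding only one element per run in memory at a time, which is the whole point of the exercise.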
The file is huge; the whole file cannot be loaded into memory! Moreover, why would you ever solve it in O(n log n) (a full sort) when you only have to find the median?
Think about doing a one-sided binary search, since you do not know the size of the list at first. If you are not familiar with it, google it! I would use the same partition mechanism as in the selection strategy and incorporate one-sided binary search into it. I can imagine this problem being solved in two passes, i.e. O(n).
But that would involve a lot of block transfers: each iteration of the partition mechanism needs k block transfers, where k is the number of blocks used to store the integers.
Besides, although the algorithm you suggested has O(n) asymptotic complexity, the constants involved are pretty large, so in practice it may not work well.
I have a solution: first take 3 numbers and find the median; now add the next number and find the median of those 4 numbers; now add one more, and continue until the numbers end. It works; this is called a decrease-and-conquer algorithm.
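For the record, the standard way to maintain a median incrementally like this is with two heaps (a max-heap for the lower half, a min-heap for the upper half); note, though, that it keeps every element in memory, so it does not address the huge-file constraint of this question. A sketch in Python (`heapq` is a min-heap, so the lower half is stored negated):

```python
import heapq

class RunningMedian:
    """Maintain the median of a stream with two heaps:
    lo, a max-heap (stored negated) for the lower half, and
    hi, a min-heap for the upper half."""
    def __init__(self):
        # invariant: len(lo) == len(hi) or len(lo) == len(hi) + 1
        self.lo, self.hi = [], []

    def add(self, x):
        # push into the lower half, then rotate its max into the upper half
        heapq.heappush(self.lo, -x)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        # rebalance so lo never has fewer elements than hi
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]                     # odd count
        return (-self.lo[0] + self.hi[0]) / 2      # even count
```

Each `add` is O(log n), so processing the whole stream is O(n log n) with the median available at any point.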
Find the Range
Build a histogram of 10 bins; that will give you a crude estimate.
Keep refining the estimate till you converge to one value.
Complexity: log(Range)*N
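A sketch of this refinement in Python, for integer data: each pass histograms the current candidate range into 10 bins, then the search narrows to the bin containing the k-th smallest element. `read_pass` is a hypothetical callable that re-reads the file on every call.

```python
def histogram_median(read_pass, lo, hi, k):
    """Find the k-th smallest value (0-based) among integers in [lo, hi]
    by repeatedly histogramming the current range into 10 bins.
    read_pass() yields every number in the data (one full pass per call)."""
    below = 0                          # elements known to be strictly below lo
    while lo < hi:
        width = (hi - lo) // 10 + 1    # bin width; guarantees <= 10 bins
        counts = [0] * 10
        for x in read_pass():
            if lo <= x <= hi:
                counts[(x - lo) // width] += 1
        # walk the bins to find the one containing the k-th element
        for b, c in enumerate(counts):
            if below + c > k:
                lo, hi = lo + b * width, min(hi, lo + (b + 1) * width - 1)
                break
            below += c
    return lo
```

The range shrinks by roughly a factor of 10 each pass, giving the log(Range) * N complexity stated above; the median is the case k = n // 2.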
There is a nice solution at http://discuss.joelonsoftware.com/default.asp?interview.11.794214 (credit to Calvin):
-----------------COPY&PASTE----------------
1. Use a max-heap with k elements to go through all the N elements, which will get the biggest k elements. Record the smallest in heap, say W. Complexity: O(N*logk)
2. Do step 1 again, but filter out elements bigger than W. This time the smallest element in heap is the 2kth biggest element. Do this step N/(2*k) times.
3. Output the heap in sorted order. The N/2 - floor(N/(2*k)) * k is the result.
------------END OF COPY&PASTE----------------
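A hedged sketch of the quoted idea in Python, with two tweaks: the top k of a stream is naturally kept in a *min*-heap of size k (its root is the smallest of the k biggest, i.e. W), and the sketch assumes distinct values, since the threshold test `x >= W` would otherwise drop duplicates of W. `read_pass` is a hypothetical callable that re-reads the data on each call.

```python
import heapq

def kth_largest_multipass(read_pass, k, rank):
    """Return the rank-th largest element (1-based) using repeated passes,
    holding only k elements in memory. Assumes all values are distinct."""
    W = None        # threshold: everything >= W is already accounted for
    seen = 0        # how many of the largest elements lie above W
    while True:
        heap = []   # min-heap of the k largest elements strictly below W
        for x in read_pass():
            if W is not None and x >= W:
                continue                     # already counted in a prior pass
            if len(heap) < k:
                heapq.heappush(heap, x)
            elif x > heap[0]:
                heapq.heapreplace(heap, x)
        if seen + len(heap) >= rank:
            # the target is inside this heap: sort descending and index in
            return sorted(heap, reverse=True)[rank - seen - 1]
        seen += len(heap)
        W = heap[0]

def median_multipass(read_pass, n, k):
    # the median (0-based sorted index n//2) is the (n - n//2)-th largest
    return kth_largest_multipass(read_pass, k, n - n // 2)
```

Each pass costs O(N log k) and covers k more elements, matching the quoted N/(2k)-passes analysis for reaching the middle.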
Use the definition: the median minimizes f(X) = sum(|xi - X|).
A possible solution of O(N log N) complexity:
1. Run through the file and find min(x) and max(x): N ops.
2. Do a standard binary search to minimize f(X) = sum(|xi - X|), with the initial X set to (min(x)+max(x))/2 and the boundaries set to min(x) and max(x). Each iteration requires going through the whole file to calculate f(X), and it takes log N iterations to reach a given precision: N log N ops in total.
If the file contains 5*10^10 numbers with value -MAXINT, 5*10^10 numbers with value MAXINT, and one 0, then 0 is the median. In this example, sum(|xi - X|) can cause an arithmetic overflow.
Valid point, but the problem is easily fixed: instead of sum(), calculate the corresponding average with an iterative approach to avoid overflow.
In particular, define
A(X,1) = |x(1) - X|,
A(X,n+1) = n/(n+1) * (A(X,n) + |x(n+1) - X|/n)
Then minimize A(X,N) as a function of X as described above.
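Since f(X) = sum(|xi - X|) is convex and piecewise linear, its slope at X is (#elements <= X) minus (#elements > X), so minimizing f by binary search amounts to counting elements on each pass. A sketch for integer data, which converges exactly rather than to a precision; `read_pass` is a hypothetical callable that re-reads the file on every call:

```python
def median_by_binary_search(read_pass, lo, hi, n):
    """Binary-search the value axis [lo, hi] for the (lower) median of
    n integers: the smallest X with at least ceil(n/2) elements <= X,
    which is exactly where the slope of f(X) changes sign."""
    need = (n + 1) // 2
    while lo < hi:
        mid = (lo + hi) // 2
        le = sum(1 for x in read_pass() if x <= mid)   # one full pass
        if le >= need:
            hi = mid          # median is mid or below
        else:
            lo = mid + 1      # median is strictly above mid
    return lo
```

Counting instead of summing also sidesteps the overflow issue raised above, since the counts never exceed n.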
Take the first element as the pivot. Run through the whole file to check how many elements are less than the pivot and how many are greater. This tells you whether the median is greater or less than the current pivot.
Now take the next element in the file as the pivot (using the previous results to bound the range of values that can still be candidates for the median), and continue until, for some pivot, equal numbers of elements are found on the larger and smaller sides.
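This pivot-narrowing idea is essentially quickselect run over repeated passes of the file. A sketch with a randomly sampled pivot rather than "the next element" (reservoir sampling keeps memory at O(1)); `read_pass` is a hypothetical callable that re-reads the data on every call:

```python
import random

def file_quickselect(read_pass, k):
    """Return the k-th smallest element (0-based) by repeated counting
    passes: each round picks a pivot inside the current candidate range,
    counts elements around it, and narrows the range."""
    lo, hi = None, None       # open bounds on the answer (None = unbounded)
    while True:
        # pass 1: reservoir-sample a pivot from the candidate range
        pivot, seen = None, 0
        for x in read_pass():
            if (lo is None or x > lo) and (hi is None or x < hi):
                seen += 1
                if random.randrange(seen) == 0:
                    pivot = x
        # pass 2: count elements below / equal to the pivot
        less = equal = 0
        for x in read_pass():
            if x < pivot:
                less += 1
            elif x == pivot:
                equal += 1
        if k < less:
            hi = pivot        # answer is strictly below the pivot
        elif k < less + equal:
            return pivot      # the pivot itself is the answer
        else:
            lo = pivot        # answer is strictly above the pivot
```

With random pivots this takes O(log n) rounds in expectation, i.e. O(n log n) element reads but only O(1) memory; the block-transfer cost per round raised earlier still applies.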
What you have suggested is the median-of-medians strategy used for pivot selection in quicksort, but I don't think it will always be the actual median of the list.
What I would suggest is: take a constant k and divide the list into k groups. For every group, create a min-heap of it (choose k such that a group's n/k elements fit into memory). Complexity so far: O(k * n/k) = O(n).
Now create a heap out of the k elements at the roots of those heaps. Remove the root, bring in the next element from the group the removed root belonged to, and heapify; do this n/2 times and you get the median. Overall complexity: O(n + (n/2)*log k) = O(n log k).
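A sketch of this two-level heap in Python. For brevity the per-group min-heaps are modeled as sorted runs (popping a min-heap yields the same ascending order), and the heap of group fronts is popped n//2 times:

```python
import heapq

def median_kway(groups, n):
    """Median of n elements split across k groups: keep a heap of each
    group's current front element, pop to the middle of the merged order."""
    # each group's min-heap, modeled here as a sorted iterator over the group
    its = [iter(sorted(g)) for g in groups]
    heads = []                                  # heap of (value, group index)
    for i, it in enumerate(its):
        v = next(it, None)
        if v is not None:
            heapq.heappush(heads, (v, i))
    for _ in range(n // 2):                     # pop up to the middle element
        v, i = heapq.heappop(heads)
        nxt = next(its[i], None)                # refill from the same group
        if nxt is not None:
            heapq.heappush(heads, (nxt, i))
    return heads[0][0]                          # the n//2-th smallest (0-based)
```

Only the k front elements live in the top-level heap at any time; in the external setting each `next()` would be a buffered read from that group's file on disk.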
Divide the list into many parts.
Now take 2 lists and find the median of each.
Now take the numbers that lie between the two medians; the median of those gives the median of the two lists combined.
Do the same for the rest of the parts.
- DashDash August 07, 2010