Google Interview Question
Country: United States
Interview Type: Phone Interview
By using maxHeap (containing bottom 50% of the numbers we read), and minHeap(containing top 50% of the number we read) it would only cost O(log(n)) to keep updating median. as you keep reading the nextNumber, c if the
if (number > minHeap.peek())
minHeap.push(number);
balance();
else
maxHeap.push(number);
balance();
where balance method is supposed to keep the balance between min and max tree by comparing their sizes, (size difference shouldnt be bigger than 1) and update the median, accordingly (median = minHeap.peek() if minHeap.size() > maxHeap.size(), and so on)
// insert current element to the left or right heap and get median so far
public int getMedian(final int current, final int med, final MaxHeap<Integer> left, final MinHeap<Integer> right) {
final int balance = left.size() - right.size();
int median = med;
// both heaps are of equal size.
if (balance == 0) {
// need to insert in left
if (current < median) {
left.insert(current);
median = left.top();
}
// need to insert in right
else {
right.insert(current);
median = right.top();
}
}
// left heap is larger
else if (balance > 0) {
// need to insert in left
if (current < median) {
right.insert(left.extract());
left.insert(current);
}
// need to insert in left
else {
right.insert(current);
}
median = (left.top() + right.top()) / 2;
}
// right heap is larger
else if (balance < 0) {
// need to insert in left
if (current < median) {
left.insert(current);
}
// need to insert in left
else {
left.insert(right.extract());
right.insert(current);
}
median = (left.top() + right.top()) / 2;
}
return median;
}
public int getStreamMedian(final int[] stream) {
int median = 0;
final MaxHeap<Integer> left = new MaxHeap<Integer>();
final MinHeap<Integer> right = new MinHeap<Integer>();
for (int i = 0; i < stream.length; i++) {
median = getMedian(stream[i], median, left, right);
}
return median;
}
Huh?? What am I missing here? Call readNextNumber() till it hits null; while reading push data to a vector (talking C++). Sort the vector, divide size by 2 and you have the median. Sorting takes the longest time - O(nLNn)
You are not getting data at once. Its a stream of data. Consider someone sending the data intermittently over the network. So as soon as you get the some data, you need to find the median.
You will repeatedly sorting the data and finding the median. But since you already have sorted results from the previous iteration, when you add a new number to the list you, you should not be sorting the whole list again. That's where the heaps come.
The self-balancing BST (height balanced BST) will not work because the numbers of nodes in the left subtree can be more than night. We need to use the min and max heap combination. All the numbers less than median will be in max heap and the ones greater than median in the min heap. So when you add a new number, if it is less than median you add to max heap. Take out the top element from max heap, make it the median and move the previous median to the min heap. Visa versa if the number being added is less than median.
Time complexity wise it will be Log(n) as insertion and deletion for a binary heap is Log(n) which is way better than n Log(n)
@don - A self-balancing BST will absolutely work if you augment each node in the tree with a "size" attribute which holds the number of nodes in the subtree. Inserting new elements only affects the "size" attributes of nodes "local" to the insertion path. Thus updating the "size" attribute does not change the asymptotic time complexity of the insert operation (this holds also for delete). Each insertion is O(log n). Finding ANY order statistic at any time is O(log n).
You have a couple of choices. One is O(n) but will only tell you the median of an accumulated set of numbers. The other uses a minheap and maxheap - this will be O(1) to find the median once you are asked for it but requires O(log n) each time a new number is added to the set.
For the first approach you use the partition subroutine from quicksort to recursively partition the set - ideally roughly in two halves per recursion. To achieve an even split, select a pivot value using a linear time median finding function, such as "median of medians" (if you only need O(n) "expected" runtime then you can use a randomised pivot selection which is simpler to implement in an interview). At each level of recursion you have, in O(1) time, the order statistic of the pivot value. You compare this to the desired order statistic and either return the pivot value as a match, or enter the next level of recursion into the partition containing your desired element. This method can be used to find ANY order statistic, and as a byproduct also yields the set of values above and below that statistic - eg the 10 largest numbers in the set. The algorithm is O(n) because you have at most O(log n) recursions and at each level you have to scan half as many elements as the previous level.
A Python example implementation of approach two follows - we define the "median" of an even number of elements to be the greater of the two medial values:
def find_median(line):
seq = (int(i) for i in line.strip().split(' '))
left_heap = Heap(MAX)
right_heap = Heap(MIN)
left_heap.push(-(2 ** 63))
right_heap.push(2 ** 63 - 1)
for num in seq:
if right_heap.size > left_heap.size:
if num < right_heap.peek():
left_heap.push(num)
else:
left_heap.push(right_heap.replace(num))
else:
if num > left_heap.peek():
right_heap.push(num)
else:
right_heap.push(left_heap.replace(num))
return right_heap.peek()
You could also use an Order Statistic Tree, which is an augmented self-balancing binary tree, such as a RedBlack Tree. Implementation of this in an interview could be a bit much as you can see below - I haven't even bothered implementing "delete":
RED = "red"
BLACK = "black"
NIL = None
class Node(object):
def __init__(self, key, value=None):
super(Node, self).__init__()
self.key = key
self.value = value
self.parent = NIL
self.left = NIL
self.right = NIL
self.colour = BLACK
self.size = 0
def __str__(self):
return "NIL" if self == NIL else str(self.key)
# noinspection PyRedeclaration
NIL = Node(None)
class OSTree(object):
def __init__(self):
super(OSTree, self).__init__()
self.root = NIL
def os_select(self, i):
node = self.root
while node != NIL:
rank = node.left.size + 1
if i == rank:
return node
if i < rank:
node = node.left
else:
i -= rank
node = node.right
return node
@staticmethod
def os_rank(node):
rank = node.left.size + 1
while node != NIL:
if node == node.parent.right:
rank += node.parent.left.size + 1
node = node.parent
return rank
def search(self, key):
node = self.root
while node != NIL and node.key != key:
if key < node.key:
node = node.left
else:
node = node.right
return node
def insert(self, key, value=None):
node = Node(key, value)
node.size = 1
x = self.root
while x != NIL:
node.parent = x
x.size += 1
if node.key < x.key:
x = x.left
else:
x = x.right
if node.parent == NIL:
self.root = node
else:
if node.key < node.parent.key:
node.parent.left = node
else:
node.parent.right = node
node.colour = RED
self._insert_fix(node)
def _rotate_left(self, x):
y = x.right
y.parent = x.parent
if y.parent == NIL:
self.root = y
elif y.parent.left == x:
y.parent.left = y
else:
y.parent.right = y
x.right = y.left
if x.right != NIL:
x.right.parent = x
y.left = x
x.parent = y
y.size = x.size
x.size = x.left.size + x.right.size + 1
def _rotate_right(self, x):
y = x.left
y.parent = x.parent
if y.parent == NIL:
self.root = y
elif y.parent.left == x:
y.parent.left = y
else:
y.parent.right = y
x.left = y.right
if x.left != NIL:
x.left.parent = x
y.right = x
x.parent = y
y.size = x.size
x.size = x.left.size + x.right.size + 1
def _insert_fix(self, node):
while node.parent.colour == RED:
if node.parent == node.parent.parent.left:
uncle = node.parent.parent.right
if uncle.colour == RED:
uncle.colour = BLACK
node.parent.colour = BLACK
node.parent.parent.colour = RED
node = node.parent.parent
else:
if node == node.parent.right:
self._rotate_left(node.parent)
node = node.left
node.parent.colour = BLACK
node.parent.parent.colour = RED
self._rotate_right(node.parent.parent)
else:
uncle = node.parent.parent.left
if uncle.colour == RED:
uncle.colour = BLACK
node.parent.colour = BLACK
node.parent.parent.colour = RED
node = node.parent.parent
else:
if node == node.parent.left:
self._rotate_right(node.parent)
node = node.right
node.parent.colour = BLACK
node.parent.parent.colour = RED
self._rotate_left(node.parent.parent)
self.root.colour = BLACK
def delete(self, node):
pass
/*
* Algorithm:
*For the first two elements add smaller one to the maxHeap on the left, and bigger one to the minHeap on the right. Then process stream data one by one,
*
*Step 1: Add next item to one of the heaps
*
* if next item is smaller than maxHeap root add it to maxHeap,
* else add it to minHeap
*
* Step 2: Balance the heaps (after this step heaps will be either balanced or
* one of them will contain 1 more item)
*
* if number of elements in one of the heaps is greater than the other by
* more than 1, remove the root element from the one containing more elements and
* add to the other one
*
* Then at any given time you can calculate median like this:
* If the heaps contain equal elements;
* median = (root of maxHeap + root of minHeap)/2
* Else
* median = root of the heap with more elements
*/
Will the stream contains duplicate integers?
- minglotus6 August 29, 2015