showell30@yahoo.com
BAN USER- 0of 0 votes
AnswersCode up a simple Bloom Filter for four letter words. Assume all words are lowercase, exactly four letters long, and only use the letters a to z. An example set of test words might be able, band, bend, card, darn, form, grab, harm, look, make, mate, peer, and team, but assume the actual list of words is much bigger.
- showell30@yahoo.com in United States| Report Duplicate | Flag | PURGE
Amazon Software Engineer / Developer Data Structures
The question is poorly written. It's not clear whether the goal is to enumerate all the nodes in the range or simply find any matching node.
- showell30@yahoo.com February 23, 2013The question doesn't specifically mention storage constraints, but you might want to upvote "Storage-efficient C Sieve" anyway. :)
- showell30@yahoo.com February 23, 2013Storage-efficient C Sieve.
This code conserves memory (for large N) by applying the sieve to only 1000 numbers at a time. It also skips even numbers.
#include <stdlib.h>
#include <stdio.h>
struct prime {
int n;
int max_multiple;
};
void sieve_out_multiples(
int *sieve,
struct prime *primes,
int i,
int n_start,
int n_end)
{
int offset;
while (primes[i].max_multiple < n_end) {
offset = (primes[i].max_multiple - n_start) / 2;
sieve[offset] = 1;
primes[i].max_multiple += 2 * primes[i].n;
}
}
int count_odd_primes(n) {
if (n % 2 == 0) --n;
int chunk_size = 1000; // number of odds
int n_start = 3;
int n_end = n_start + 2 * chunk_size;
if (n_end > n+2) n_end = n+2;
int cnt = 0;
chunk_size = (n_end - n_start) / 2;
int alloc_size = chunk_size; // enough to get started
struct prime *primes = malloc(alloc_size * sizeof(*primes));
int *sieve = malloc(chunk_size * sizeof(int));
int i;
while (n_start <= n) {
n_end = n_start + 2 * chunk_size;
if (n_end > n+2) n_end = n+2;
chunk_size = (n_end - n_start) / 2;
for (i = 0; i < chunk_size; ++i) {
sieve[i] = 0;
}
// mark all composities from primes so far
for (i = 0; i < cnt; ++i) {
sieve_out_multiples(sieve, primes, i, n_start, n_end);
}
for (i = n_start; i < n_end; i += 2) {
int offset = (i - n_start) / 2;
if (sieve[offset] == 0) {
// we found a prime!
// printf("prime %d\n", i);
if (cnt >= alloc_size) {
alloc_size += chunk_size / 2 + 1;
primes = realloc(primes, alloc_size * sizeof(*primes));
}
primes[cnt].n = i;
primes[cnt].max_multiple = 3 * i;
sieve_out_multiples(sieve, primes, cnt, n_start, n_end);
++cnt;
}
}
n_start = n_end;
if (n_start % 2 == 0) ++n_start;
}
return cnt;
}
int count_primes(int n) {
if (n <= 1) return 0;
if (n == 2) return 1;
return count_odd_primes(n) + 1;
}
int main(int argc, char **argv) {
printf("%d\n", count_primes(100));
printf("%d\n", count_primes(1000000));
return 0;
}
Yep, it "stops" being O(1) as soon as the number of bits in the binary number can't be stored in a word. While not theoretically O(1), the algorithm does allow you to increment numbers up to 2 ** (2**32) very quickly, whereas a 32-bit machine only increments numbers up to 2**32 quickly.
- showell30@yahoo.com February 23, 2013Recursive single-pass Python solution:
def diameter_height(tree):
if tree is None: return (0,0)
left, val, right = tree
ld, lh = diameter_height(left)
rd, rh = diameter_height(right)
d = max([ld, rd, lh+rh+1])
h = max([lh+1, ld+1])
return d, h
def diameter(tree):
d, h = diameter_height(tree)
return d
assert 0 == diameter(None)
assert 1 == diameter((None, 1, None))
three_node_tree = ((None, 0, None), 1, (None, 2, None))
assert 3 == diameter(three_node_tree)
seven_node_tree = (three_node_tree, 'foo', three_node_tree)
assert 5 == diameter(seven_node_tree)
imbalanced_tree = (seven_node_tree, 'bar', None)
assert 5 == diameter(imbalanced_tree)
Python sort () and sorted() have supported a key argument for a while, which allows you to specify a key for sorting. Python also sorts tuples lexographically. So, it's easy to compose a key function from two other key functions:
def sorted2(lst, key1, key2):
def f(v): return (key1(v), key2(v))
return sorted(lst, key=f)
The problem with inorder traversal is that it's O(N). See "Divide and Conquer Solution" for a simple approach that is much more efficient for large trees.
- showell30@yahoo.com February 23, 2013Divide and Conquer Solution:
If the tree is empty, return not found. If min <= root <= max, then return root. If root < min, recursively search the right subtree. Otherwise, recursively search the left subtree.
Assuming the tree is balanced, this runs in O(log N) time, vs. O(N) time for an inorder traversal.
Can you determine which bit to flip in constant time? I assume the answer is yes.
- showell30@yahoo.com February 23, 2013P.P.S. I said all "even numbers are composite," but of course I meant all even numbers >= 4. A real world implementation would special case for small values of N.
- showell30@yahoo.com February 23, 2013It's only lazy with respect to B values. Yes, of course you'd have to sort the As, which is at least 50% of the work. I didn't mean to overstate the advantage.
- showell30@yahoo.com February 23, 2013P.S. This solution is less time-efficient than a simple Sieve (although not drastically so), but it's much better in terms of storage.
- showell30@yahoo.com February 23, 2013If storage is constained, see my solution "Maintain an ordered list of upcoming prime multiples." My solution requires storing a tuple for every prime less than N, which is of course better than O(N).
- showell30@yahoo.com February 23, 2013Maintain an ordered list of upcoming prime multiples. When n=20, your list will look like this:
(21, 3)
(21, 7)
(22, 2)
(22, 11)
(25, 5)
(26, 13)
(34, 17)
(38,19)
Pop off the top of the list, and now you know that 21 is composite. Push (24,3) on to the list:
(21, 7)
(22, 2)
(22, 11)
(24, 3)
(25, 5)
(26, 13)
(34, 17)
(38,19)
Pop off the top of the list again. 21 is a duplicate, so no new info on composites/primes here, but you can update the list:
(22, 2)
(22, 11)
(24, 3)
(25, 5)
(26, 13)
(28,7)
(34, 17)
(38,19)
Pop off (22,2) and push (24,2):
(22, 11)
(24, 2)
(24, 3)
(22, 11)
(25, 5)
(26, 13)
(28,7)
(34, 17)
(38,19)
Pop off (22,11) and push (33,11)
(24, 2)
(24, 3)
(25, 5)
(26, 13)
(28,7)
(33, 11)
(34, 17)
(38,19)
Next pop off (24,2). You know your last composite was 22, and your newest composite is 24, so you just found a prime! Now you push two values into the list:
(24, 3)
(25, 5)
(26, 2) # next multiple of 2
(26, 13)
(28,7)
(33, 11)
(34, 17)
(38,19)
(46, 23) # next multiple of 23
Hopefully you get the idea by now. An easy optimization is to only deal with odd numbers, since all even numbers are composite. When excluding evens, your list would look like this after 30:
(33, 3)
(33, 11)
(35, 5)
(35, 7)
(39, 13)
(51, 17)
(57, 19)
(69, 23)
(87, 29)
Three approaches come to mind.
1) Sort on A, then scan the list for runs of elements with the same value for A, then sort each sublist for field B.
2) Sort on A/B simultaneously by using a compare function or key function that uses both fields.
3) Sort the list on B, then stable-sort the list on A.
A nice feature of approach #1 is that you can do it lazily, i.e. only sort the sublists when the second field is queried.
Nax, you need to follow the process. Start with a. The suggested solution says that b, c, and d stand to the right of him (in some order) and then e is to left. In the recursive step, you arrange b, c, and d. There's no way you'll ever have e standing next to d if you follow the instructions.
- showell30@yahoo.com February 23, 2013P.S. To avoid off-by-one errors, make sure you understand how you define height and diameter for segments. I define these lengths in terms of the number of nodes, so a segment from A to B to C has length 3 by my definition, not 2, despite 2 making more sense from a physical/intuitive perspective. Likewise, a tree with a single node has height 1 by definition.
- showell30@yahoo.com February 22, 2013Write a function that returns a tuple/struct of two values for a given tree. The first value is the height, and the second value is the diameter length.
The terminating case is a an empty tree, which returns 0,0.
The recursive calculations are these:
height = max(height(left), height(left)) + 1
diameter = max(diameter(left), diameter(right), height(left)+height(right)+1)
To fully complete the exercise, have a helper function that calls the other function and returns only the diameter as the final answer.
The reason there's a gap in the representation for 4 (the None as the second value) is to facilitate the transformation to 5 without having to shift values. For small numbers this seems overkill, but think about incrementing b1011111001 to b1011111010. When the number of runs change, you don't want to shift over all the runs, because that would be O(N).
- showell30@yahoo.com February 22, 2013If you are asked this, try to make your first answer as simple as possible: "I would use a set to keep track of integers that I already encountered." Then engage the interviewer in dialog. Does she have any special constraints? Is time scarce, storage scarce, or both? Also, make sure that the interviewer knows that you know the difference between interface and implementation. There are a zillion ways to implement a set, but the most important feature of a set is that you can put a value into it, and the set will remember that it's part of the, um, set. The actual implementation can be based on a hash, but be clear that it's an implementation detail.
- showell30@yahoo.com February 22, 2013I believe this should work.
- showell30@yahoo.com February 22, 2013If you have to do an O(N) algorithm to find the rotation point, then you might as well just do a straightforward O(N) search as well, since O(N) + O(log N) is O(N).
- showell30@yahoo.com February 22, 2013This is the way to do it. Even though the underlying representation is only binary in an abstract sense, it's totally trivial to serialize the number in binary form from the compressed representation that the OP suggests. See my Python implementation on this page for more details.
- showell30@yahoo.com February 22, 2013This is not for the faint of heart. :)
# In this completely contrived problem, we represent a binary
# number as a list of runs of zeros and ones, and we minimize
# bit flipping and storage insertion/deletion by having the
# number of bits in a run represent the number of bits to
# jump in the array. Incrementing a binary number, no matter
# how large, involves only modifying up to three tuples in place
# and possibly appending a sentinel to the end of the list.
def test():
numbers = {
0: [None],
1: [(1,1), None],
2: [(1,0), (1,1), None],
3: [(2,1), None, None],
4: [(2,0), None, (1,1), None],
5: [(1,1), (1,0), (1,1), None],
6: [(1,0), (2,1), None, None],
7: [(3,1), None, None, None],
8: [(3,0), None, None, (1,1), None],
9: [(1,1), (2,0), None, (1,1), None],
10: [(1,0), (1,1), (1,0), (1,1), None],
}
for k, v in numbers.items():
assert k == to_decimal(v)
num = [None] # zero
for i in range(1024 * 128 + 5):
assert i == to_decimal(num)
incr_bin(num)
print i
print num
print to_binary(num)
print 'DONE!'
def incr_bin(num):
if num[0] is None:
# 0 -> 1
num[0] = (1, 1)
num.append(None)
return
num_bits, bit = num[0]
if bit == 0:
num_zeros = num_bits
if num_zeros == 1:
# 011 -> 111
num_ones, _ = num[1]
num[0] = (num_ones+1, 1)
else:
# 0001 -> 1001
num[0] = (1,1)
num[1] = (num_zeros-1, 0)
else:
num_ones = num_bits
if num[num_ones] is None:
# 1111 -> 00001
num[0] = (num_ones, 0)
num[num_ones] = (1,1)
# We only need to allocate storage
# when we reach a power of 2!
num.append(None)
else:
num_zeros, _ = num[num_ones]
if num_zeros == 1:
# 11101111 -> 0001111
num_higher_ones, _ = num[num_ones+1]
num[0] = (num_ones, 0)
num[num_ones] = (num_higher_ones+1, 1)
else:
# 1001xxx -> 0101xxx
num[0] = (num_ones, 0)
num[num_ones] = (1, 1)
num[num_ones+1] = (num_zeros-1, 0)
# helper function for testing
def to_decimal(num, i=0):
if num[i] is None:
return 0
result = 0
num_bits, bit = num[i]
if bit == 1:
result = 2 ** num_bits - 1
return result + (2 ** num_bits) * to_decimal(num, i+num_bits)
# another helper function
def to_binary(num):
s = ''
i = 0
while num[i]:
num_bits, bit = num[i]
s = str(bit) * num_bits + s
i += num_bits
return s
test()
This Python solution assumes two challenges. First, we have to do it in O(logN) time. Second, we aren't told in advance how many positions the array was rotated.
def test():
# Assume we have a list that is sorted, except that is
# rotated by an unknown offset. First, we build a helper
# function to determine the position of the smallest
# element in the list (aka the rotation offset).
assert 0 == pos_smallest_element([0, 0, 0])
assert 0 == pos_smallest_element([0, 1, 2])
assert 1 == pos_smallest_element([2, 0, 1])
assert 2 == pos_smallest_element([1, 2, 0])
assert 3 == pos_smallest_element([4, 5, 6, 0, 1, 2, 3])
# Once we know the smallest element, it's pretty trivial
# to implement search in terms of a generic binary search.
test_lst = [10, 2, 4, 6, 8]
assert 0 == search(10, test_lst)
assert 1 == search(2, test_lst)
assert 2 == search(4, test_lst)
assert 3 == search(6, test_lst)
assert 4 == search(8, test_lst)
assert None == search(0, test_lst)
assert None == search(7, test_lst)
assert None == search(11, test_lst)
def bsearch(val, lowest, highest, get):
# A generic binary search uses a get() function, so
# that we're not coupled to the storage scheme. We
# just need a guarantee that get(a) <= get(b) when a <= b.
def f(low, high):
if low > high:
return None
mid = (low+high) / 2
mid_val = get(mid)
if val == mid_val:
return mid
if val < mid_val:
return f(low, mid-1)
else:
return f(mid+1, high)
return f(lowest, highest)
def search(val, lst):
offset = pos_smallest_element(lst)
def actual_pos(i):
return (i+offset) % len(lst)
def get(i):
return lst[actual_pos(i)]
answer = bsearch(val, 0, len(lst)-1, get)
if answer is None:
return None
return actual_pos(answer)
def pos_smallest_element(lst):
def pos(low, high):
if low == high:
return low
if low + 1 == high:
if lst[low] <= lst[high]:
return low
else:
return high
mid = (low + high) / 2
if lst[low] <= lst[mid] <= lst[high]:
return low
if lst[low] <= lst[mid]:
return pos(mid, high)
else:
return pos(low, mid)
return pos(0, len(lst) - 1)
test()
Try finding 2 in 10, 2, 4, 6, 8. I think it breaks your algorithm, but there's an easy way to prove me wrong. :)
- showell30@yahoo.com February 22, 2013If you aren't told rotateAmount in advance, how do you determine it in non-linear time?
- showell30@yahoo.com February 22, 2013# You can simulate linked lists in Python with tuples.
# You want the least significant bit to be the head of
# the list.
def test():
one = (1, None)
two = (0, (1, None))
three = (1, (1, None))
six = (0, (1, (1, None)))
seven = (1, (1, (1, None)))
thirteen = (1, (0, (1, (1, None))))
assert three == sum(one, two)
assert thirteen == sum(six, seven)
assert 15 == to_decimal(sum(two, sum(six, seven)))
def sum(lst1, lst2, carry=0):
if lst1 is None and lst2 is None:
if carry:
return (1, None)
else:
return None
if lst1 is None:
b1, lst1 = 0, None
else:
b1, lst1 = lst1
if lst2 is None:
b2, lst2 = 0, None
else:
b2, lst2 = lst2
b = b1 + b2 + carry
if b >= 2:
return (b-2, sum(lst1, lst2, 1))
else:
return (b, sum(lst1, lst2, 0))
def to_decimal(lst):
if lst is None:
return 0
b, rest = lst
return b + 2 * to_decimal(rest)
test()
It appears that you are working with a reversed list, not a rotated list. It's not clear that you read the problem correctly.
- showell30@yahoo.com February 22, 2013Solving this is O(N) time is trivial, so the challenge is to solve it in sub-linear time. If the interviewer tells you where the smallest element in the list is, then you can do a binary search with the known rotation offset. If you don't know where the smallest element is, then the challenge is to find the smallest element in the list.
To find the smallest element in a rotated sorted list, break the lists in half. If low <= mid and mid <= low, then the smallest element is the first element, and you're done. If low > mid, then the smallest element must be in the first half of the list, so recurse on the first half of the list. If mid > high, then the smallest element must be in the second half of the list, so recurse on that.
The key insight here is that when you break a rotated sorted list in half, only one half of the list will have the smallest element, and the other half of the list will be in sequence.
P.S. After the first pass you will already have an estimate of the median that is within 16 million of the exact median. On the second pass you will have an estimate that is within 64k of the precise median. After the third pass you will be at most 256 off. After the 4th pass you'll know the precise integer median. If you're dealing with 64-bit numbers, you will do eight passes.
- showell30@yahoo.com February 21, 2013Assume that the integers are 32-bit integers and that you can read the data stream four different times (either by asking for it again or persisting it to disk). You can solve this in four passes. In the first pass, shift each number by 24 bits and increment one of 256 buckets. After the first pass, iterate through the buckets to find out which contains the median value and what its rank will be. On the second pass, mask out the highest eight bits to filter values then use the next eight bits to to populate 256 buckets again. Repeat for the third pass and fourth pass. After the fourth pass you will have your result.
- showell30@yahoo.com February 21, 2013Treating the x dimension separately from the y dimension is a nice touch here. For any interview problem in 2D or 3D, a good think to ask yourself is whether you can treat the movements in the different dimensions as being independent. If you can treat the dimensions as being independent, then you basically can reduce the problem to a 1D problem.
- showell30@yahoo.com February 21, 2013Here is a Python solution.
def compute_spiral_area(movements):
i = 0
max_x = 0
min_x = 0
max_y = 0
min_y = 0
x = 0
y = 0
while True:
if i >= len(movements): break
x += movements[i]
if x > max_x: max_x = x
i += 1
if i >= len(movements): break
y += movements[i]
if y > max_y: max_y = y
i += 1
if i >= len(movements): break
x -= movements[i]
if x < min_x: min_x = x
i += 1
if i >= len(movements): break
y -= movements[i]
if y < min_y: min_y = y
i += 1
return (max_x - min_x) * (max_y - min_y)
input = [4,3,5,4,10]
print compute_spiral_area(input)
See my solution "Here's the in-place solution in Python..." to see how you can solve this working backward after first compressing the 1s in an initial pass.
- showell30@yahoo.com February 21, 2013Read the problem again. The challenge here is rewriting the array in constant space. Streaming this to standard output is trivial.
- showell30@yahoo.com February 21, 2013Here's the in-place solution in Python, with tests. It's done in constant space. There's no recursion and the array never grows larger than the final result.
def test():
test_strings = [
('a1b1c1', 'abc'),
('a1b3c5', 'abbbccccc'),
('a5b3c1', 'aaaaabbbc'),
('a5b5c5', 'aaaaabbbbbccccc'),
]
for s, expected_result in test_strings:
# strings are immutable in Python, so make this a list
lst = list(s)
expand(lst)
assert expected_result == ''.join(lst)
def expand(a):
# i = input index
# j = output index
# In our first pass, we compress the 1s
# and compute how much space we'll have.
i = 0
j = 0
new_len = 0
while i < len(a):
c = a[i]
cnt = int(a[i+1])
new_len += cnt
i += 2
a[j] = c
j += 1
if cnt > 1:
a[j] = str(cnt)
j += 1
# Set the array to correct length.
while len(a) > new_len:
a.pop()
# Note that this is the only place we append
# to the list. Everything is in-place, and we
# have constant storage for the locals: i, j, cnt, new_len.
while len(a) < new_len:
a.append(None)
# In the second pass, we work backward through
# the string and write out the final result. By
# working backward, we know we'll have room.
i = j - 1 # last element in compress string
j = new_len - 1 # where we write
while i >= 0:
c = a[i]
i -= 1
try:
cnt = int(c)
c = a[i]
i -= 1
except:
cnt = 1
while cnt > 0:
a[j] = c
j -= 1
cnt -= 1
test()
You just need to compress the 1s in a first pass. After that, the reverse methodology works fine. In the first pass, you compress the 1s and compute the length of the expanded string. In the second pass, you start from the back. Read a character. If it's a letter, just write it to the end of the array. If it's a number, read back one more to get the letter, and then write the appropriate numbers to the back of the string.
- showell30@yahoo.com February 21, 2013If the integers can range from 0 to 2<<32 (approximately 4 billion), then you still have scaling issues. Assume the first half of your dataset ranges from 1 billion to 1.5 billion. You need to track every number, since the second half of the dataset could all be less than 1 billion or all be more than 1.5 billion.
It's a good question, though. If you constrain the size of the numbers more, than the "bucket" approach becomes practical.
See my answer that begins with "Let's assume memory and disk are both constrained...". It elaborates on the dual-heap approach in the context of large datasets.
- showell30@yahoo.com February 21, 2013Let's assume memory and disk are both constrained, but you can assume that your dataset is ordered in a fairly random way. Use the first element as the provisional median. Maintain two heaps, one for numbers lower than the median and one for numbers higher than the median. For each new number, decide which heap to update and keep track of the size of both heaps. Whenever you get two elements in a row that go to the same heap, you pop off the top element from the large heap, compare it to the most recent number, and decide the new provisional median. Put the other number in the smaller heap.
Since memory and disk are constrained, you will want to limit the size of the heaps. This means that numbers that are "extreme" will get tossed. This is risky, because the "extreme" number might actually be a non-extreme number that's only "extreme" relative to early numbers in the dataset. For real world problems, this is often an acceptable risk, as you're effectively treating early values as outliers. You can detect the situation by keeping track of the max number discarded for being too small and keeping track of the min number discarded for being too big.
If you want to compute the *exact* median, then you have to store the first half billion integers *somewhere*, since every one of the first half billion integers could theoretically still be the median without any knowledge of the second half of the sample.
Yup, if parens aren't allowed, then you're better off just generating all the possible answers and evaluating them. To generate the sequence of numbers, use a permutation algorithm. To generate the symbols, use an odometer algorithm. To evaluate the expression, first scan for * and /, then scan for + and -.
- showell30@yahoo.com February 21, 2013To clarify, I am defining N as the perimeter of the rectangle. The algorithm does not depend on it being a square. At worst it does m+m comparisons (N/2), which makes it O(N).
- showell30@yahoo.com February 20, 2013The only problem with this solution is that you don't really take advantage of the parallelism of a distributed system. You want to come up with some divide and conquer scheme that allows multiple nodes to be computing partial sums simultaneously. See the Python binary tree solution, for example.
- showell30@yahoo.com February 20, 2013Let the nodes form a binary tree. Every node has 0 or 1 parents and between 0 and 2 children. Nodes with children wait for them to report their values and pass a partial sum up to their parents, then wait for the parents to report the eventual sums. For parallel nodes this runs in O(logN) time. The synchronous simulation below runs in linear time.
import random
class Node:
def __init__(self, n, num_nodes, val, send, report_final_result):
self.send = send
self.report_final_result = report_final_result
self.children = []
self.sum = val
if 2*n <= num_nodes:
self.children.append(2*n)
if 2*n+1 <= num_nodes:
self.children.append(2*n+1)
self.parent = n / 2
self.values_to_wait_for = len(self.children)
self.child_values = []
def start(self):
if self.values_to_wait_for == 0:
if self.parent:
self.send(self.parent, self.sum)
else:
self.report_final_result(self.sum)
def receive(self, val):
if self.values_to_wait_for > 0:
self._handle_child_val(val)
else:
self._handle_parent_val(val)
def _handle_child_val(self, val):
self.sum += val
self.values_to_wait_for -= 1
if self.values_to_wait_for == 0:
if self.parent:
self.send(self.parent, self.sum)
else:
# we're the root
self.report_final_result(self.sum)
for child in self.children:
self.send(child, self.sum)
def _handle_parent_val(self, val):
self.report_final_result(val)
for child in self.children:
self.send(child, val)
def simulate(num_nodes):
nodes = []
pending_nodes = set()
sum = 0
def make_dispatch(sender):
def f(n, val):
# connect send to recipient's receive() method
print 'sender %d sent %d to %d' % (sender, val, n)
nodes[n-1].receive(val)
return f
def make_report_final_result(sender):
def f(val):
print sender, val
assert val == sum
pending_nodes.remove(sender)
return f
for i in range(num_nodes):
node_val = random.randint(0, 100)
sum += node_val
node = Node(i+1, num_nodes, node_val,
make_dispatch(i+1),
make_report_final_result(i+1)
)
nodes.append(node)
pending_nodes.add(i+1)
for i in range(num_nodes):
nodes[i].start()
assert 0 == len(pending_nodes)
simulate(2000)
Write all the elements to disk in a first O(N) pass, and randomly capture about 99 of them into memory. Sort the 99 elements to create 100 ranges for the billion elements to be bucketed into. Each range will be a file on disk. For each of the billion integers, do a binary search on your 99 randomly sampled integers to determine which file they go to next. Keep track of how many integers are in each file. At the end of the second pass you will have a hundred files, each with about ten million elements each. You'll know how many elements are in each file, and you'll be able to easily compute which file contains the median, and where the median ranks within that file. At that point, use an in-memory solution to find the nth element in the file that you know contains the median. Sorting ten million integers is pretty fast, so you can probably just sort all of them without having to go to a fancy heap-based approach. The total cost here is roughly 2 billion disk writes and roughly 11 billion integer comparisons.
This will get the job done easily for most data sets, but you can refine it by having provisions for one range taking an unexpectedly large number of values. Also, after a certain amount of time, you'll know that it will be impossible for certain ranges to ever contain the median, so you can short circuit writing values to their files and simply update their counts.
# In Python a simple way to represent a binary
# tree is as a tuple: (left_tree, val, right_tree)
def create_tree(lst):
def _maketree(low, high):
# Making a balanced tree is simply a matter
# of taking the middle value as the root
# and recursively building the subtrees
# on the left and right.
if low > high:
return None
if low == high:
return (None, lst[low], None)
mid = (low + high) / 2
val = lst[mid]
left = _maketree(low, mid - 1)
right = _maketree(mid + 1, high)
return (left, val, right)
return _maketree(0, len(lst) - 1)
def traverse(tree):
if tree is None:
return
left, val, right = tree
traverse(left)
print val
traverse(right)
def tree_height(tree):
if tree is None:
return 0
left, val, right = tree
lh = tree_height(left)
rh = tree_height(right)
return max([lh, rh]) + 1
lst = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
tree = create_tree(lst)
print tree
assert 4 == tree_height(tree)
traverse(tree)
Here is an efficient algorithm that uses divide and conquer. You divide up runners into the first division and second division and let them recursively order themselves. Then you combine the divisions. The winner of the bottom division tries to beat the last place person in the top division, and if successful, works his way up the ranks. As soon as the top division's loser beats the bottom division's winner, you can simply chain the two divisions together.
import random
class Runner:
pass
def race(race_results, c1, c2):
if (c1, c2) in race_results:
return race_results[(c1, c2)]
if random.random() < 0.5:
# if c1 > c2:
print '%d beats %d' % (c1, c2)
race_results[(c1, c2)] = c1
return c1
else:
print '%d beats %d' % (c2, c1)
race_results[(c1, c2)] = c2
return c2
def race_runners(race_results, runners, low, high):
if low == high:
return low, low
if low + 1 == high:
winner = race(race_results, low, high)
if winner == low:
runners[low].inferior = high
runners[high].superior = low
return low, high
else:
runners[low].superior = high
runners[high].inferior = low
return high, low
mid = (low + high) / 2
winner1, loser1 = race_runners(race_results, runners, low, mid)
winner2, loser2 = race_runners(race_results, runners, mid + 1, high)
print 'merge:'
print_chain(runners, winner1)
print_chain(runners, winner2)
while winner2 is not None:
on_deck = runners[winner2].inferior
# Try to promote the winner of the second division
# up through the ranks of the first division.
defeated = None
victor = None
loser = loser1
while loser is not None:
winner = race(race_results, winner2, loser)
if winner == winner2:
defeated = loser
loser = runners[loser].superior
else:
victor = loser
break
if defeated is None:
# winner2 beat nobody, so
# connect the ladders and we're done
print 'simple connect'
runners[loser1].inferior = winner2
runners[winner2].superior = loser1
break
if victor is None:
# we have a new alpha dog
runners[defeated].superior = winner2
runners[winner2].inferior = defeated
print 'new alpha dog'
winner1 = winner2
else:
# insert our new place
print 'insertion'
runners[victor].inferior = winner2
runners[winner2].superior = victor
runners[winner2].inferior = defeated
runners[defeated].superior = winner2
if on_deck is None:
# top division loser couldn't hold off anybody
loser2 = loser1
break
runners[on_deck].superior = None
winner2 = on_deck
print_chain(runners, winner1)
return winner1, loser2
def print_chain(runners, head):
chain = []
while head is not None:
chain.append(head)
next = runners[head].inferior
if next and runners[next].superior != head:
print chain
raise Exception('bad reverse linking')
head = next
print 'chain', chain
def ladder(num_runners):
runners = []
for i in range(num_runners):
runner = Runner()
runner.superior = None
runner.inferior = None
runners.append(runner)
race_results = {}
winner, loser = race_runners(race_results, runners, 0, num_runners-1)
print_chain(runners, winner)
def test():
for i in range(1, 25):
print '-----'
ladder(i)
test()
This seems overly complicated, when this problem is easily solved in O(M) by traversing adjacent cells and leaving breadcrumbs. See the Python solution, for example.
- showell30@yahoo.com February 19, 2013
It's reasonable to assume that the interviewer is talking about a balanced binary tree, even though the question doesn't explicitly state it. Even for a horribly imbalanced tree, this approach is better than a naive inorder traversal.
- showell30@yahoo.com February 23, 2013