## Facebook Interview Question

**Tags:** SDE1s · **Country:** United States

@all: time complexity won't change. It's O(n+m) because you need to look at each element at least once. That is, for 2n data it takes double the time; you won't achieve anything better unless you are a magician ;-)

What can be improved is the constant factor; Fernando actually pointed that out. But if you are running on a single machine (threads, not processes), your bottleneck is disk I/O or memory access.

So, what do we learn? Distribute, shard your big data sets, and use map/reduce. But the Big-O is not changing; it's still O(n). By the way, it's also Ω(n) and hence Θ(n), since it's quite a tight bound.

According to the master theorem, it reduces to Θ(n)?!

There are some small O(n log n) issues which occur in the real world.

Edit: when I say it becomes n from n log n, n is the size of the fully merged array.

Gee whiz, I sure hope you're an avid reader! There's 2 sections...

assumptions:

- the strategy is that one thread can merge a pair of arrays, and thus multiple threads can do some simultaneous merging (and we preserve the sorted quality, obviously)
- for convenience in thinking, assume the k arrays are of equal length
- assume we have more threads and CPUs than we will ever need

=========================

improve the time complexity: just theory section

A single thread does mergesort in O(n log n).

we derived that from the master theorem which DOES account for the halving of array size in the recursion formula.

I would like to account for the changing array size; let's use the master theorem.

(If you ignore the changing array size you get O(log n) for magically uniform merges, hahaha.)

n comparisons, split the thing in half, etc., for merge sort.

Instead of T(n) = 2T(n/2) + n with a single thread

since we do it in parallel, the two recursive halves run simultaneously, so only one of them counts toward the running time

I imagine the formula becomes T(n) = T(n/2) + n

b is still 2

a is now 1, down from 2

Now n^(log_2 1) = n^0 = 1, which is smaller than f(n) = n.

The running time is dominated by the work near the root, so we get case 3 of the master theorem.

The regularity condition a·f(n/b) ≤ c·f(n) easily holds when a = 1 and b = 2: f(n/2) = n/2 = (1/2)·f(n), so c = 1/2 works.

log_2 1 = 0, which is smaller than c = 1, the exponent in f(n) = n^1.

Case 3 says the time complexity = Θ(f(n)) = Θ(n).

Whoa, that's definitely an improvement!
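Putting the whole derivation in one place (my transcription; it assumes the perfectly uniform merges and unlimited threads from the assumptions above):

```latex
% Sequential merge sort: T(n) = 2T(n/2) + n = Theta(n log n).
% In the parallel version both halves run at once, so only one T(n/2)
% stays on the critical path, plus the O(n) merge at the root.
\begin{align*}
  T(n) &= T(n/2) + n, \qquad a = 1,\ b = 2,\ f(n) = n \\
  n^{\log_b a} &= n^{\log_2 1} = n^0 = 1 \\
  f(n) = n &= \Omega\bigl(n^{0+\varepsilon}\bigr)\ \text{for } \varepsilon = 1
    \quad \text{(case 3 applies)} \\
  a\,f(n/b) &= \tfrac{n}{2} = \tfrac{1}{2} f(n)
    \quad \text{(regularity, } c = \tfrac{1}{2} < 1\text{)} \\
  \Rightarrow\ T(n) &= \Theta(f(n)) = \Theta(n)
\end{align*}
```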

=========================

"if so, how?": an implementation

Implementation-wise I'd have a queue of arrays that need to get worked on; new arrays fresh from a merge get reinserted. A nice way of dealing with unbounded-buffer, producer-consumer situations is to employ a semaphore.
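Something like this minimal Java sketch (the names `ParallelMerge`, `mergeTwo`, `mergeAll` are mine, and it leans on the equal-length, k-arrays assumptions above): a counting semaphore tracks how many arrays sit in the buffer, and a plain mutex guards the queue itself.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelMerge {
    private final Deque<int[]> buffer = new ArrayDeque<>();    // unbounded buffer of sorted arrays
    private final Semaphore items;                             // counts arrays sitting in the buffer
    private final AtomicInteger claimed = new AtomicInteger(); // merges claimed so far
    private final int k;                                       // number of input arrays

    ParallelMerge(List<int[]> arrays) {
        buffer.addAll(arrays);
        items = new Semaphore(arrays.size());
        k = arrays.size();
    }

    // Each worker claims merges off an atomic counter; with k inputs there are
    // exactly k - 1 merges, so the counter doubles as the "are we done" test.
    private void worker() {
        while (claimed.getAndIncrement() < k - 1) {
            try {
                items.acquire(2);                  // block until two arrays are available
                int[] a, b;
                synchronized (buffer) {            // ONE thread at a time on the queue itself
                    a = buffer.pop();
                    b = buffer.pop();
                }
                int[] merged = mergeTwo(a, b);     // the actual O(|a| + |b|) work, done unlocked
                synchronized (buffer) { buffer.push(merged); }
                items.release();                   // announce the freshly merged array
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    // Standard linear two-way merge of two sorted arrays.
    private static int[] mergeTwo(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) out[n++] = a[i] <= b[j] ? a[i++] : b[j++];
        while (i < a.length) out[n++] = a[i++];
        while (j < b.length) out[n++] = b[j++];
        return out;
    }

    public int[] mergeAll(int nThreads) throws InterruptedException {
        Thread[] ts = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) { ts[t] = new Thread(this::worker); ts[t].start(); }
        for (Thread t : ts) t.join();              // every worker exiting means all k - 1 merges are done
        return buffer.pop();                       // the single fully merged array
    }

    public static void main(String[] args) throws InterruptedException {
        List<int[]> input = Arrays.asList(
                new int[]{1, 4, 7}, new int[]{2, 5, 8}, new int[]{0, 9}, new int[]{3, 6});
        System.out.println(Arrays.toString(new ParallelMerge(input).mergeAll(4)));
        // -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

Note the atomic `claimed` counter is also the termination test discussed below: once all k - 1 merges are claimed, workers simply stop.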

this introduces some rare small O(n log n) situations...

In the super rare case where threads finish their merges simultaneously, they will have to insert/pop the queue ONE thread at a time, since the semaphore's counter must not have a race condition. There are still O(n log n) points at which threads can get blocked on the lock.

Also, dammit, we need to be able to tell when we're finished.

So then we'd need a counter for how many merges we've done in order to deduce that we've finished, which would also be incremented O(n log n) times, plus O(n) to actually calculate the number of merges done when complete.

Again, these are really small things, but they could dominate eventually haha. I'd test that by getting employed and using company infrastructure if it ever practically got that big ;)

Assuming a multi-core environment in which n is the number of cores/threads, using several threads wouldn't improve the big-O notation, as in a perfect setup it would only provide a speedup of n, which is a fixed number. Also, this problem doesn't offer a perfect setup: when k < n you can't use all the available threads/cores, and eventually k will be less than n as you are merging arrays.

- Fernando May 18, 2017

In practice, on a typical architecture, you can achieve important improvements, maybe by orders of magnitude, as you can usually manage access to main memory more efficiently using several threads/cores.

Edited: The previous solution assumes a real scenario in which the number of threads/cores is less than the number of arrays. The maximum theoretical speedup when sorting in parallel with n cores/threads leaves a running time of O(n log n) / n => O(log n). As previously stated, even in this ideal scenario you can't do that with merge sort, as the number of effective threads halves at each level. For merge sort, in the ideal case, yes, you can obtain a linear cost: O(n log n) / log n = O(n).
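Spelling out those two bounds (my notation; p is the thread count):

```latex
% Perfect work-splitting with p threads gives T_p >= T_1 / p, so with
% p = n the floor is O(n log n)/n = O(log n). Merge sort can't hit it:
% the effective thread count halves at every level, and the final merge
% alone is a Theta(n) critical path, so
\begin{align*}
  T_p \ \ge\ \max\!\left(\frac{n \log n}{p},\ n\right)
  \quad\Longrightarrow\quad
  T_p = \Theta(n)\ \text{at best, a speedup of } \frac{n \log n}{n} = \log n .
\end{align*}
```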

@Chris.k It depends on the architecture. But let's assume that reading from memory takes x cycles; this number has to be much higher than the cost of performing a comparison, which we'll call y. You can reserve one core to read from memory, reading (x/y) * (n - 1) blocks each time; that way, for the rest of the (n - 1) cores, access to memory is basically free, as by the time they finish their y comparisons they have a new chunk to compare. Perhaps I am missing something here. What do you think?
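To put made-up numbers on that (purely illustrative, not from the thread): say x = 100 cycles per memory read, y = 1 cycle per comparison, and n = 9 cores, one of them the dedicated reader:

```latex
% Lookahead the reader core maintains so the n - 1 compare cores never stall:
\[
  \frac{x}{y}(n - 1) \;=\; \frac{100}{1} \times 8 \;=\; 800\ \text{blocks per round.}
\]
```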