Amazon Interview Question
Software Engineer / Developers — Country: United States
Great answer, @sonesh.
One thing to clarify with the interviewer is whether we take it as given that the addition itself is expensive enough to justify the overhead of passing values between threads. Even for arbitrarily big integers, it's probably gonna cost more to hand a number off to a thread than to simply add it to a single accumulated sum managed by one core.
Also, since summing N numbers is O(N), a binary-tree breakdown is way overkill. For a really long list of numbers, you're eventually gonna do N-1 additions no matter what, and the only real engineering concern is keeping the cores busy doing the real work (arithmetic) instead of busy doing synchronization (pure overhead). So, maybe send 10,000 integers at a time to each worker, let it spin a while, and have each worker add its partial result to the overall sum after each batch, using a mutex to manage contention.
You're right, @showell.
In general there are two organizations of memory: shared memory and distributed memory.
With shared memory we don't need to pass anything, since all cores share the same address space, so the solution above works as-is.
But in a distributed system, where each core has its own address space, we use message passing (e.g. MPI): each core actually has to send its data to another core, so there are two kinds of cost:
1) computation cost, and
2) communication cost.
The computation cost stays the same, but the communication cost varies with how the distributed system is implemented. For example, the cores might all be connected by a common bus, or over a network; the software implementation of the message-passing layer also affects the actual cost.
What you're describing in your last paragraph is essentially vector processing: we read not just a single data item but a vector of items at a time. Similarly, in a message-passing interface we optimize in terms of packet size.
Why there needs to be a relationship between thread count and core count:
Nowadays each core typically has 2 hardware threads, so we can run 2 software threads per core. Creating more gives no performance benefit, because each core can execute only two threads truly in parallel (one mapped to each hardware thread). Extra threads just add overhead: thread creation, synchronization, cache-coherency traffic, and CPU context switching, since the same work is now divided among more threads. So creating more than 2 threads per core actually reduces performance (in wall-clock time).
Now, to increase performance, we add cores. Here is an algorithm to sum n numbers using n/2 cores.
Let A[1..n] be the given array. The cores sum it in binary-tree form. Example:

            23
        12      11
      6    6    11
     1 5  2 4  5 6
This requires O(log(n)) time and O(1) space per core, though overall it still requires O(n) work and O(n) space.
- sonesh March 14, 2013
But here (in a parallel program) we consider the complexity per core.
Now, what if the number of cores is limited, so we can only create, say, P threads?
Then we split the array into P sets of n/P numbers each and sum each set on its own thread, which takes O(n/P) time; then we apply the tree procedure above to the P partial sums, which takes O(log(P)).
So the total complexity becomes O(n/P) + O(log(P)).