Google Interview Question
Software Engineer / Developer
Country: United States
Interview Type: In-Person
I can think of three basic mechanisms for accelerating content distribution to N nodes:
1. Compression: data is compressed before transmission and decompressed before deployment. We'll need the compression factor to calculate its impact.
2. Pipelining: intermediate nodes start transmitting to the next machine before receiving the full content. This is mainly effective when download speed is greater than or equal to upload speed, so after some initial buffering period each machine transmits at the same rate it receives.
3. Sprinkling: transmit to more than one machine simultaneously. This presents an interesting trade-off: each individual transmission becomes slower, but the number of transmitting machines grows exponentially.
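The exponential growth behind sprinkling can be sketched in a few lines. This is a hypothetical illustration (the function name and the assumption that every machine already holding the file sends to `spr` new machines per round are mine, not from the original answer):

```python
def sprinkling_rounds(n_machines, spr):
    """Rounds needed until all n_machines hold the file, starting from one
    source, assuming each holder sends to spr new machines per round."""
    have = 1
    rounds = 0
    while have < n_machines:
        have += have * spr  # every current holder feeds spr new machines
        rounds += 1
    return rounds

# With spr = 1 the holder count doubles each round: 1 -> 2 -> 4 -> 8 -> 16,
# so reaching 16 machines takes 4 rounds.
```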
Let's put all this in one model:
N: number of machines; FSZ: file size in Mb; CMP: compression factor; CMPT: time to compress/decompress 1 Mb; SPD: min(upload, download) bandwidth in Mbps; BUF: buffering period before pipelining starts; SPR: sprinkling factor.
SingleTransmission = (FSZ * SPR) / (SPD * CMP)
DistributionTime(N) = SingleTransmission + BUF * (LOG(N-1, SPR)-2) + 2 * CMPT
For example, if the sprinkling factor is 2, then we only need log2(N-1) copying phases. Only the intermediate phases need a buffering period, so we can ignore the first and last ones. As for compression time, we count it once for the initial file compression and once for decompressing the file at the last phase's nodes; decompressing at the intermediate nodes does not increase the total copying time.
Can we use the BitTorrent protocol for transferring the large files across N machines? This is one way to overcome the network bandwidth constraints.
I believe overcoming network bandwidth constraints is not about large files, but about transferring data efficiently through high-capacity links. Depending on the network protocol, you may consider different approaches to data segmentation.
I wonder, has anyone ever thought of using a Minimum Spanning Tree (MST) for this problem? It is a centralized rather than distributed approach: each node (machine) needs full information about the network links and bandwidths to compute a tree that spans the entire network.
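One way to realize this suggestion is to pick the spanning tree that favors the widest links, i.e. run Kruskal's algorithm on links sorted by descending bandwidth, then distribute along the resulting tree. A hedged sketch (the function name and the link-tuple format are my own assumptions):

```python
def max_bandwidth_tree(n_nodes, links):
    """links: iterable of (bandwidth, node_a, node_b) tuples.
    Returns spanning-tree edges chosen widest-link-first (Kruskal
    with a union-find over the node ids 0..n_nodes-1)."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for bw, a, b in sorted(links, reverse=True):  # widest links first
        ra, rb = find(a), find(b)
        if ra != rb:             # link joins two components: keep it
            parent[ra] = rb
            tree.append((a, b, bw))
    return tree
```

As the comment notes, this requires one coordinator with a full view of every link, which is exactly what makes the approach centralized.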
Why do we need both upload and download bandwidth? Could we have a centralized server, upload the data from the 0th machine to the server, and then stream it to the remaining N machines?
In that case, if there are no bandwidth constraints and downloads can proceed while uploading is ongoing, the fastest approach is a sprinkling factor of N, meaning all N machines download at the same time; then, taking buffering into consideration, we can simply do the math. If there are bandwidth constraints, we need to reconsider the sprinkling factor, because more simultaneous downloads mean more overhead.
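The trade-off in that last point can be checked numerically. A hypothetical sketch, assuming each machine's uplink of SPD Mbps is split evenly across its `spr` simultaneous peers and that rounds proceed in lockstep (both assumptions are mine, simplifying the model above):

```python
import math

def total_time(n, fsz, spd, spr):
    """Rough total distribution time: each round every holder sends the
    fsz-Mb file to spr peers over a shared spd-Mbps uplink, so holders
    multiply by (spr + 1) per round and each round takes fsz*spr/spd."""
    rounds = math.ceil(math.log(n, spr + 1))
    return rounds * (fsz * spr) / spd

# Sweeping spr shows the trade-off: a very large sprinkling factor is not
# automatically fastest once upload bandwidth is shared between peers.
times = {spr: total_time(1000, 1000, 100, spr) for spr in (1, 2, 4, 9, 999)}
```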
Dude Guy! Are you serious?
So you posted 30 questions that Google asked you? How is that even possible?
Have some authenticity, man!
- Anonymous, April 21, 2014