design question:
I will request you to first read only upto Solution-2 only and try to think about your solution and then read other lines. If we read whole post(without thinking about your solution) then your brain will get polluted with my idea and you may not be able to think naturally.
----------------------------------------------------------------------------
One machine daily taking backup of its datatstore on some server machine, MINIMIZE network data transferred.
Solution -1 : Compress storage and then send over network
Solution - 2 : Take diff and send only diff(and not whole data daily)
Interviewer asked for better ideas as above two are well known ideas but I was stuck.
---------------------------------------------------------------------------------
So he gave me hint(no 1) and depending upon hint I devise following solution
Solution - 3: Divide storage into sequence of 1K bits so total number of possible patters are 2^1000. Now Using hashTable just store existing 1000bits-length pattern and all their sequence numbers. And, then transfer this hashTable to server. So for example datastore consists of "file1, file2, file3=file1+file2" then we can reduce network data by 50%
So interviewer asked why 1000bits and why not 100bits or 10000bits. Also, as we increase length of bit-pattern, probability of getting duplicate patterns reduces and if we decrease length of pattern we may not be able to effectively reduce network data transferred
Solution-4: Don't divided storage bits into equal length pattern but in variable length pattern so client tells server "Pattern-X(bit string) of length L exists at offsets 5, 10003, 363738, 42311, 43433.
Interviewer said how can check for variable length patterns on client machine as this will be computaionally inefficient. I suggested Trie, Huffman coding, file-by-file comparison but he straightway discarded those ideas :)
He gave final important hint that
somehow client should give indication to server that next bytes are going to be like this without tranferring those bytes and
then server should tell client that "OK don't send next 1000bits as I already have them from previously send/stored data at different section(like different file) in storage"
and asked me how to do it and how will be the handshake/protocol between server-client
Since, throughout interview I was stuck to my hashTable solution and not able to think beyond hashTable, as per HR I got negative feedback in this interview :(
Overall interviewer was helping lot and was really helpful. He gave lot of hints and time(never said "Ok thats enough let us move on"...Finally I myself gave up :) )
Simple Answer is to divide the whole data into smaller chunks (chunk size is upto you depending on the size of the data 1K,10K etc).
- Ganesh December 15, 2015Each chunk should have a CRC(or call it the hask key of Data). Client first sends the chunk number and its CRC to server, if the client CRC is different compared to server CRC then the server should signal to client to send that chunk of data for updation else not.
Hope that helps..