You have two servers. Both of

VMWare Inc Interview Question for Senior Software Development Engineers

1

of 1 vote

11
Answers
You have two servers. Both of these servers have an identical file with billion characters except for one single character. These servers are connected over a very slow connection.

How do you find the different character?

My ans: split those files in to batches of characters of 10,000 (say), now calculate checksum and compare the checksums for the chunks of 10,000 character lines. So now you are just comparing 'ints' and not the files per say. (remember the connection is very slow)

His question: How do you optimize this?
- xankar February 15, 2017 in United States | Report Duplicate | Flag | PURGE
VMWare Inc Senior Software Development Engineer Algorithm

Email me when people comment.

An error occurred in subscribing you.

Country: United States
Interview Type: Phone Interview

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 2 vote

Introduce some simple hashing of integer sequences (for example, polynomial hash or even simple xor) and locate mismatching byte with binary search.
On first step, calculate hash of first and second halves of sequence, send them to other server, check which one doesnt match with other server's one. Repeat for mismatching part of array.
Total O(ln(n)) network interactions, obviously.
Hashes can be calculated in O(n) time and O(1) space, or in O(1) time and O(n) space using partial hashes.

- Flux February 15, 2017 | Flag Reply

Comment hidden because of low score. Click to expand.

of 1 vote

Cann't you just count the characters(if we don't have special symbols) and just send count 52 (lower case and upper case) character's count?
O(n+n) =O(2n)=O(n).

Please let me know if it doesn't work.

- Anonymous April 01, 2017 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Use Binary search and instead of sending data, only send sh1 or md5 of the data.
Slow network is the key so packet drop, retransmission, etc has to be avoided at any cost.

1) Send MD5 of first half of data. If it matches, send true else false.
2) if(ture) {MD5 of half of second half} else { MD5 of half of first half}
3) It should go on till the time it comes down to some bytes under or equal to MTU. Once length of data for which we are doing md5 reaches into MTU limit, then just send the data directly and compare the data in for loop.

- neumscs March 17, 2017 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Use MD5 of data instead of sending data.
Use binary search approach in sending MD5.
When the size of data for which MD5 is being calculated reached almost MTU, just send data directly and comapare the data in for loop

- Anonymous March 17, 2017 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

- neumscs March 17, 2017 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Use MD5 of data instead of sending data.
Use binary search approach in sending MD5.
When the size of data for which MD5 is being calculated reached almost MTU, just send data directly and comapare the data in for loop

- neumscs March 17, 2017 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

XOR all the characters on one side and send it other. And then XOR it again with all the characters in a file. It will give a character which is not identical

- mittalrishabh March 17, 2017 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

An improvement on Flux's solution:

In each step only calculate the hash of half the sequence, not both halves. If the half isn't equal, then the is in that half, otherwise it is in the other one.

Another improvement would be to keep the values of the first half calculations on the half been calculated, and if the change is in this half, then you already have the calculation of one of the lesser halves.

I.e.:
We want to calculate on the first half of 1 billion, 500 mil.
so we can already stop the hashes for the first halves in the 250 mil, 125 mil, 64.5 mil etc, since we are already calculating the whole hash.

If the difference isn't in this half, no harm done, since we didn't add to the calculations.

- yechiel.bardov April 17, 2017 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

I think mittalrishabh has the correct solution.
We need to minimize the communication between these serves as much as possible.
In the binary search case the solution is log(billion) transmissions in worst case.
In XOR case it is only 1 transmission in worst and best case.

- Grindelwald July 24, 2017 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

File1: aXa
File2: aYa
XOR gives: X^Y
Now how will you find what character actually 'Y' is?

- Sanjit November 07, 2020 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

mittalrishabh has a solution but not complete...
The number of characters are same on both sides, (and it is not missing on one side), so if P is the char on one side, other side of will have a P'.
And after XOR, you will get (P XOR P').
The complete solution will be -
(1) P XOR P' will have some bit set, which means P and P' differ on that bit.
(2) Divide every element in 2 sets... one with that bit as 0 & another with that bit as 1. (Note that both sets will span across both machines, but we will not merge them)
(3) P will be in set 1, P' will be in set2. XOR again within set-1 that will yield P & set-2 will yield P'.

- Rajesh August 20, 2017 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

@ Rajesh
Your solution is interesting. I am wringing some code to verify it.

However, what is wrong with the count each char's occurrence number and simply compare? Is the performance low or what?

(Update) We can find the two different elements but we could not tell which is from which file using your algorithm. Right?

- i@shibin.info November 04, 2017 | Flag Reply

CareerCup

VMWare Inc Interview Question for Senior Software Development Engineers

Books

Videos

Resume Review

Mock Interviews