Vdopia Interview Question
Country: United States
Interview Type: Phone Interview
@mk13: I didn't understand the second part of your answer
"select a number,add that in remaining 1GB, and while selecting this number, bring a number from corresponding chunk to replace it, finally write sorted 1GB to secondary storage."
Can you explain it more clearly?
@kiran kumar: let say we divide file in three chunk A,B,C and after sorting individully, content of these parts is like: A{12,18,20,25,33,35,37},B{8,13,14,40,41,45,47},C{2,15,50,70,72,75,78}.
Now suppose in memory, we have {12,18,20,25} {8,13,14,40} {2,15,50,70} respectively from A,B,C. now we'll select 2 from C's part as its minimum, so put this in remaining 1GB and replace 2 in memory by an element of chunk C. i.e. insert 72 in C's part in memory.. next replace 8 by 41 and so on.. we are maintaining a min heap.. i think its clear now
Suppose we can use the whole 4G memory to save our data, and the integer is 32bit, actually we can use bitmap to count the numbers of each integer in the 10G or larger files.
yes, we can use bitmap technique also.. we should elaborate both method to interviewer.
I did not understand the bitmap technique. Can you please elaborate. What happens when a number exists more than one time? Should the numbers be written back and the bit map re-initialized?
in bitmap method, we'll take a byte array where each bit of byte will represent unique number.. e.g. byte_[n], so this byte_[n] will represent the range {8*(n-1)+1to 8*(n-1)+8}, and so on. To allocate a bit to number x, simply divide x by 8 and get quotient (q) and reminder(r) and make r'th bit of byte_[q] ON.
this is how bitmap works.
now your point is correct as what happen when number repeats, since here in this file sorting, we need to retain all numbers, bitmap won't work. if it is given that many no repeats n times, then we can take n different bitmaps.. but by that many bit may be waste.. we can't re-initialize bitmap as it'll lost some other no... to get over this no, we can take any other small array if only few number repeats and for repeating no, we can update that array..
if you have any idea, plz put that here..
For reference to the bitmap sort, see Jon Bentley's book "Programming Pearls". It is the very first topic in the first chapter:)
It is not clear whether the resulted file is required to store the duplicate values or not. Bitmap works well if we are only required to store the distinct values. If not, the worst case scenario is when the whole file contains only one value. Given that the file size is 10GB, the occurence count still can't possibly get above 2^32. To handle this situation, we'd partition the search over smaller ranges, and use an array of 32-bit integers to store the occurence count for each number in range.
Thanks. I added that information back into the question.
Out of curiosity / so that I can improve the UI, why didn't you list the company when you added the questions? (There's a drop down to provide the company name.) Just trying to figure out what's happening when people don't list this.
I don't think many people know of the company and hence just added the question, which according to me is more important than knowing when and where was it asked, unless it is from a very well known company.
I see... okay. So was it too much hassle to add the company then? I mean, I agree that the question is more important than when / where, but obviously it could only be a good thing to add the company name...
will try to take care of that and add the name instead of skipping in the future....it isn't too much of a hassle but shear laziness from my side...:)
Divide the file in 3 parts(3GB+3GB+4GB,(each part must be less than 4GB )) and sort each chunk using any O(nlogn) sorting algo(as it is less than 4GB we can apply any in-place sorting algo). once each chunk of file sorted, bring 1GB of each chunk in memory, so it'll occupy 3GB(1+1+1), now sort this 3GB data(by 3-way merge sort),select a number,add that in remaining 1GB, and while selecting this number, bring a number from corresponding chunk to replace it, finally write sorted 1GB to secondary storage.
- mk13 May 25, 2012