Amazon Interview Question
Software Engineer / Developer
Using plain quicksort would require costly compares for long strings that differ only near one end (which is likely the case among 2 GB of strings).
Assuming you have a lot of RAM, a better choice would be to use a 3-way radix quicksort.
- avoids re-comparing initial parts of the string
- adapts to data: uses just enough characters to resolve order
- is sub-linear when strings are long
- uses about 2N ln N character compares on average for random strings
But, if RAM is limited, use external mergesort, as algooz pointed out.
I think the key here is limited RAM: we have to load pages of the file from disk, so the bottleneck is file I/O. Hence an O(N^2) algorithm isn't necessarily bad if it results in efficient file access. My take is bubble sort, since it lends itself well to sequential reads and writes of the file being manipulated.
Assuming that the file is too large to fit in memory (or to be mapped into memory):
If strings are ASCII and the starting character is uniformly distributed, one could try a multi-pass approach: read all strings that start with 'a' into memory, sort them, and write them to the output file; do another pass and read all strings that start with 'b', and so on. This however does not work if all strings start with the same character (i.e. the starting character is not uniformly distributed).
Another approach is to do a quicksort on disk. However, a fileseek(size / 2) may land in the middle of a string (in other words, the input file has records of variable size). We could do a first pass through the file and write the offsets where lines start into an auxiliary file (keeping the offsets in memory may not work: assuming an average 8-character string, keeping track of offsets would require half the size of the input file on a 32-bit machine, and the same size as the input on a 64-bit system, not counting the CRLFs; so the memory constraint still holds).
The auxiliary file has fixed-size records (each entry is a 4- or 8-byte offset, depending on the platform). We can now quicksort this file on disk. The comparison function would look something like this (in C++-like pseudo-code):
int compare(long lhs, long rhs) {
    // each comparison seeks into the main file twice
    string lhsString = readStringFromMainFileAt(lhs);
    string rhsString = readStringFromMainFileAt(rhs);
    return lhsString.compare(rhsString);
}
After sorting the auxiliary offsets file, we do another pass:
foreach i in auxiliary file {
    string s = readStringFromMainFileAt(i);
    writeToOutput(s);
}
The runtime is 2N + O(N log N) = O(N log N): the 2N comes from the pre- and post-passes, and the N log N from the quicksort.
Assuming a 200 MB/s disk transfer rate, I think that a 2GB file can be sorted in a minute or less using this algorithm.
The critical part of this question is that we don't have enough memory, so I think we can use some temporary files.
We could try a multi-pass approach to sort this big file.
Step 1: divide the big file into several smaller temporary files, each small enough to fit in memory.
Step 2: sort each smaller file, e.g. with quicksort.
Step 3: as in merge sort, merge all the smaller files back into the big one.
Time: O(n lg n). Memory: bounded by the size chosen for the smaller files in step 1, not by the size of the input.
If RAM is larger than 2 GB, the whole file fits in memory and we can use any in-place O(n lg n) sorting algorithm, preferably quicksort.
- algooz June 28, 2008
If the whole file can't fit in RAM, then use external merge sort:
en.wikipedia.org/wiki/External_sort