Amazon Interview Question
Software Engineer / Developer
Using plain quicksort would require costly compares for long strings that differ only near one end (which is likely the case among 2 GB of strings).
Assuming you have a lot of RAM, a better choice would be to use a 3-way radix quicksort.
- avoids re-comparing initial parts of the string
- adapts to data: uses just enough characters to resolve order
- is sub-linear when strings are long
- uses about 2N ln N character compares on average for random strings
But, if RAM is limited, use external mergesort, as algooz pointed out.
I think the key here is limited RAM: we have to load pages of the file from disk, so the bottleneck is file I/O. Hence an O(N^2) algorithm isn't necessarily bad if it results in efficient file access. My take is bubble sort, since it lends itself well to sequential reads and writes of the file being manipulated.
Assuming that the file is too large to fit in memory (or to be mapped into memory):
If strings are ASCII and the starting character is uniformly distributed, one could try a multi-pass approach: read all strings that start with 'a' into memory, sort them, and write them to the output file; do another pass and read all strings that start with 'b', and so on. This however does not work if all strings start with the same character (i.e. the starting character is not uniformly distributed).
Another approach is to do a quicksort on disk. However, a fileseek(size / 2) may land in the middle of a string (in other words, the input file has records of variable size). We could do a first pass through the file and write the offsets where lines start into an auxiliary file (keeping the offsets in memory may not work: assuming an average 8-character string, keeping track of offsets would require half the size of the input file on a 32-bit machine, and the same size as the input on a 64-bit system, not counting the CRLFs; so the memory constraint still holds).
The auxiliary file has fixed-size records (each entry is a 4- or 8-byte offset, depending on the platform). We can now quicksort this file on disk. The comparison function would look something like this (in C++-like pseudo-code):
int compare(long lhs, long rhs) {
    // each comparison seeks into the main file twice
    string lhsString = readStringFromMainFileAt(lhs);
    string rhsString = readStringFromMainFileAt(rhs);
    return lhsString.compare(rhsString);
}
After sorting the auxiliary offsets file, we do another pass:
foreach i in auxiliary file {
    string s = readStringFromMainFileAt(i);
    writeToOutput(s);
}
The runtime is 2N + O(N log N) = O(N log N): the 2N comes from the pre- and post-passes, and the N log N from the quicksort.
Assuming a 200 MB/s disk transfer rate, I think that a 2GB file can be sorted in a minute or less using this algorithm.
The critical part of this question is that we don't have enough memory, so I think we can use some temporary files.
We could try a multi-pass approach to sort this big file.
Step 1: divide the big file into several smaller temporary files, each small enough to fit in memory.
Step 2: sort each smaller file, e.g. with quicksort.
Step 3: as in merge sort, merge all the smaller files back into the big one.
Time: O(n lg n). Memory: bounded by the size chosen for the smaller files in step 1, not by the size of the input.
If RAM is larger than 2 GB, the whole file fits in memory and we can use any in-place O(n lg n) sorting algorithm, preferably quicksort.
- algooz June 28, 2008
If the whole file can't fit in RAM, then use external merge sort:
en.wikipedia.org/wiki/External_sort