Amazon Interview Question
Software Engineer / DevelopersSplit the file such that each split has a suitable number of phone numbers that can be held in memory. Now, sort each split using any efficient sorting algorithm ( radix sort would be linear) and write it back to the same location in the file.
Once all the splits are sorted, we can have a array of file pointers to each of the file locations where splits start. While scanning through the sorted list, one can write the numbers to another file.
+1
create a bit vector of size 1 million, iterate through the phone numbers and set the corresponding bit (O(n))
then output the nonzero elements of the bit vector in order (O(n))
External sort + internal Sort
Step 1: partitioning
Divide the original file to k small file, k = original file size / available memory size .
Step 2: Internal sort
Input this small file to memory one by one, sort them and write back. In this case, maybe radix sort is better ( linear time, but linked list is used , overhead is large ), some other tradition sort method is OK.( maybe quicksort is best, it is called “QuickSort” for a reason☺ )
Step 3: Merge
You could chose M-way merge you want, 2 < M <= k ,
Pass = logM k
The key point in this case is to reduce I/O times as small as possible . In every single pass, all files are passed to memory and write back .
Total I/O = 2* Pass * (Blocks number of files )
Which means we need to reduce Pass number. There are 2 ways to achieve this:
1. enlarge M
Note: if M become so large, we’d better employ “Loser tree”, which reduce compare time to logM, otherwise we need to compare M-1 times to find a winner
2. reduce k
We could resort to “Replace-Selection Method” . The rule of thumb is that we could half k .( If we use “Replace-Selection Method”, every initial small file size will be different , to optimize I/O times, a Huffman tree style merge is recommended )
This is called external mergesort.
- Maxim Stepanenko March 10, 2011