## Amazon Interview Question

**Position:** SDE-1

**Country:** India

**Interview Type:** In-Person

Good job!

Your whole idea is the same as pulkit.mehra.001's.

You chose an efficient heap data structure, with a constant "amortized"-time decrease-key operation (like a Fibonacci heap). Thus your algorithm is O(k) time, in theory!

However, I think the most time-consuming part is reading the numbers in from the files, since file I/O takes much more time than operations in main memory. That is, reading k numbers takes O(c·k) time, where c is a coefficient for file-processing overhead, which is usually very large compared to the cost of accessing main memory. You may propose O(k) or O(k·log 100) algorithms for this problem, but I think the file-reading time dominates the running time. In other words, improving the algorithm's running time doesn't help much with the overall problem.

Although you "abstract" the files out as stacks, you still need to read them, don't you? Your solution (and pulkit.mehra.001's) requires at most (k + 99) file reads, which I think is already the most efficient possible.

In my opinion, this question is more about practical issues. Look at the way the interviewer formulated it: he has a big data file, split into 100 files with sorted data, and he asks for an "efficient" method, not an "optimal" one. This makes me think of a practical problem rather than a purely theoretical one.

Another point: for this solution we need a heap of size just 100, which is fairly small. So which heap implementation is more appropriate, considering the overhead cost of maintaining its structure?

I would say a binary heap is pretty simple yet efficient. It is usually implemented with an array, and its simplicity makes the overhead cost negligible. A Fibonacci heap (or any other theoretically efficient heap), with its complicated implementation, may incur a costly overhead, especially at a small size like 100.
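To illustrate how little machinery an array-backed binary heap actually needs, here is a minimal sketch (class and method names are mine, purely illustrative):

```java
// A minimal fixed-capacity, array-backed binary min-heap (a sketch).
class BinaryMinHeap {
    private final int[] a;
    private int size = 0;

    BinaryMinHeap(int capacity) { a = new int[capacity]; }

    void insert(int x) {
        int i = size++;
        a[i] = x;
        while (i > 0 && a[(i - 1) / 2] > a[i]) { // sift up
            int p = (i - 1) / 2;
            int t = a[p]; a[p] = a[i]; a[i] = t;
            i = p;
        }
    }

    int extractMin() {
        int min = a[0];
        a[0] = a[--size];
        int i = 0;
        while (true) {                           // sift down
            int l = 2 * i + 1, r = 2 * i + 2, s = i;
            if (l < size && a[l] < a[s]) s = l;
            if (r < size && a[r] < a[s]) s = r;
            if (s == i) break;
            int t = a[s]; a[s] = a[i]; a[i] = t;
            i = s;
        }
        return min;
    }
}
```

The whole structure is one array and two swap loops, which is why its constant factors stay small at a size of 100.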

Anyway, your idea is great!

I have other questions:

1. Can "abstracting" a file out as a stack make it faster to read?

2. I've heard of "memory-mapped file". Can it be efficiently used for this problem?

Cheers,

The tricky part about these problems is that the interviewer is looking for something, and that sometimes gets lost on CareerCup. He may have been working toward a billion numbers in a million files, which leans toward a distributed application.

Memory-mapping the file will be the fastest way to access it. You also benefit from the multiple levels of caching in the OS: if you read a single number, the OS actually reads at least a block of data (maybe more), because if you read one part of the data you will most likely want the rest of it.
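As a sketch of the memory-mapped idea in Java NIO (assuming, purely for illustration, a file of 4-byte little-endian binary ints):

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: map a file of binary little-endian ints and read the first one.
// No explicit read() calls -- the OS pages data in (and caches it) on demand.
class MappedReader {
    static int firstInt(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.order(ByteOrder.LITTLE_ENDIAN);
            return buf.asIntBuffer().get(0);
        }
    }
}
```

Once mapped, sequential access over the buffer rides on the OS page cache, which is exactly the block-level behavior described above.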

If K is small enough, you could just read K numbers from all 100 files into memory, as a front-end way to mitigate the burden.

Heck:

100 million numbers = 400 million bytes = 400 MB of memory. Given the bounds of the application, you could read them all in today without a whole lot of pain. Once loaded, your disk latency is eliminated.

Which goes back to what I said earlier: tricks like that are easily eliminated if you go up to a billion numbers, or a trillion, or whatever. Interviews like this keep upping the ante to see what the candidate is capable of.

This is a *standard algorithm* for merging multiple sorted lists: put the minimums in a heap, then repeatedly extract-min, fetch the next element from the appropriate list, and insert it into the heap.

Yes, as you guys point out, reading one int at a time from each file is grossly inefficient, as there might be a lot of paging. Best is to read the first k numbers of each file and put them into a min-heap of size 100k.

Also, if performance is not important but speed of writing code is (say, for a quick test), then we can do it in one line in Unix.

Assume current dir has the files.

cat * > out && sort -n out | head -n k

Obviously it's inefficient algorithm-wise, but it's good to point out in an interview, as it can be quite handy for quick jobs.

You shouldn't create a min-heap: if you construct a min-heap, you have to compare each new element against all the elements in the heap.

You should construct a max-heap from the first k elements. This places the max element at the root. For each incoming element:

1) Check whether the element is less than the root.

2) If it's less than the root

3) Replace root with the incoming element

4) Heapify

This produces a heap of the k smallest elements after traversing all the elements.

Complexity: O(n log k)
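The four steps above, sketched with Java's PriorityQueue used as a max-heap (names are illustrative; a flat array stands in for the files):

```java
import java.util.Collections;
import java.util.PriorityQueue;

// Keep a max-heap of the k smallest elements seen so far: O(n log k).
class KSmallest {
    static int[] kSmallest(int[] nums, int k) {
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());
        for (int x : nums) {
            if (maxHeap.size() < k) {
                maxHeap.add(x);              // fill with the first k elements
            } else if (x < maxHeap.peek()) { // 1) compare with the root
                maxHeap.poll();              // 2-3) replace the root...
                maxHeap.add(x);              // 4) ...and restore heap order
            }
        }
        int[] out = new int[maxHeap.size()];
        for (int i = out.length - 1; i >= 0; i--) out[i] = maxHeap.poll();
        return out;                          // sorted ascending
    }
}
```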

Yeah, this is a good way, but then we would need to process all the elements in all the files. Since the files are sorted, we can also keep a flag that is set to true once the root is smaller than the next element from every file, implying that the smallest k elements have already been found!

Use the concept of merging in merge sort.

Pick the first 2 files and merge them to extract the top k values. Complexity: O(k).

Repeat this for all files. Total complexity: O(100k).

Space complexity O(k)
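A minimal sketch of one such merge step, keeping only the k smallest of two sorted arrays (in-memory arrays stand in for the files):

```java
// Merge two sorted arrays but emit only the first k values: O(k) per merge.
class TopKMerge {
    static int[] mergeTopK(int[] a, int[] b, int k) {
        int[] out = new int[Math.min(k, a.length + b.length)];
        int i = 0, j = 0;
        for (int n = 0; n < out.length; n++) {
            // take from a when b is exhausted or a's head is smaller
            if (j >= b.length || (i < a.length && a[i] <= b[j])) {
                out[n] = a[i++];
            } else {
                out[n] = b[j++];
            }
        }
        return out;
    }
}
```

Folding this over the 100 files, each of the 99 merges costs O(k), giving the O(100k) total claimed above.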

I doubt the complexity is O(100k).

From the first file, you will create an array of size k.

And every time, you have to traverse all k elements of your array to compare them with the first k elements of the next file.

So I guess the complexity is O(100·k²).

Please correct me if I am wrong.

As per the problem, each file is sorted, so we can pick the smallest k elements from the first 2 files and keep doing the same using the merge-sort technique. We can add a check like the following.

Assume I need the 100 smallest numbers, so I create smallest[100]. Once we get smallest[100] from the first 2 files, we can compare:

smallest[99] < 3rd_file[0] -> if this is the case, we can skip the 3rd file, and so on...

Worst case would be O(100·k²).

For each file that we merge, we do k comparisons, and there are 100 files to merge, so the complexity is O(100k). I'm not sure how we would get k².

Every time, you have to do k comparisons against each of the k elements in your current array. So it's definitely going to be O(k²).

I think the complexity will be linear in k.

The files are sorted, and the approach here is merging 2 sorted files.

Suppose we have 2 sorted files with 100 elements each, and we need the 10 smallest elements. Then we need only 10 comparisons.

Compare the 1st element of file 1 with the 1st element of file 2, take the smaller, and move to the next element in that file, and so on.

So for k elements the number of comparisons is O(k).

Doing this for all 100 files, finding the smallest k elements by merging, the overall time complexity is O(100k).

How about this solution: since he hasn't mentioned anything about how much memory we can use, we can take the first k elements from each file and put them into one file.

Then we can use Quickselect algorithm to find the k smallest elements.

We can argue that this can be achieved in expected O(n) time...
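A sketch of Quickselect over an in-memory array (a stand-in for the combined file; after it runs, the k smallest values sit in the first k slots, in no particular order):

```java
import java.util.Random;

// Quickselect: partition around random pivots until the k-th smallest value
// lands at index k-1; expected O(n) time, worst case O(n^2).
class QuickSelect {
    private static final Random RAND = new Random();

    static void selectK(int[] a, int k) {
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == k - 1) return;
            if (p < k - 1) lo = p + 1; else hi = p - 1;
        }
    }

    // Lomuto partition with a random pivot; returns the pivot's final index.
    private static int partition(int[] a, int lo, int hi) {
        swap(a, lo + RAND.nextInt(hi - lo + 1), hi);
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) swap(a, i++, j);
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```

Here n is the 100k elements copied into the combined file, so this is O(100k) expected, but it requires holding all 100k candidates at once.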

I think pulkit.mehra.001's solution, which is O(k log 100), is quite efficient already.

It reads from the files only k times to get the k smallest numbers. Reading from a file is much slower than accessing RAM. Thus I think reading these k numbers, which takes O(c·k) with a large coefficient c, is the bottleneck; the O(k log 100) algorithm for finding the answer is not.

Can you find the k smallest numbers in these 100 files with fewer than k file reads?

Can a file be randomly accessed?

IDK!

```
import java.util.List;
import java.util.PriorityQueue;

public static class FindSmallest {

    // One entry per file: the current value, its position in the file,
    // and the file it came from.
    private static class Data implements Comparable<Data> {
        private final int value;
        private final int index;
        private final List<Integer> file;

        public Data(int value, int index, List<Integer> file) {
            this.value = value;
            this.index = index;
            this.file = file;
        }

        @Override
        public int compareTo(Data o) {
            return Integer.compare(this.value, o.value);
        }

        @Override
        public String toString() {
            return String.valueOf(value);
        }
    }

    public static int[] findKSmallest(List<List<Integer>> files, int k) {
        PriorityQueue<Data> minQueue = new PriorityQueue<Data>();
        for (List<Integer> file : files) {
            if (!file.isEmpty()) {
                minQueue.add(new Data(file.get(0), 0, file));
            }
        }
        int[] out = new int[k];
        for (int i = 0; i < k && !minQueue.isEmpty(); i++) {
            Data min = minQueue.poll();   // remove the overall minimum
            out[i] = min.value;
            int next = min.index + 1;
            if (next < min.file.size()) { // push this file's next value
                minQueue.add(new Data(min.file.get(next), next, min.file));
            }
        }
        return out;
    }
}
```

Pure data-structure approach: a Fibonacci heap.

Find-min is very cheap, and each subtree could be loaded separately as a file.

Otherwise, I don't know if a special data structure is needed.

You have 100 sorted files, so you can open each one in turn; as a speedup, you could read the first N/100 lines of each, if that's small enough, to minimize disk access. Then go through each file and do a merge.

Many approaches with different tradeoffs. What I think they are looking for at the 100-million level is the tradeoffs of each tactic: lots of wrong answers and no standout right one.

Let all 'n' files be sorted in increasing order.

Say we have to find the 'k' smallest numbers across the 'n' sorted files.

```
MaxHeap h = createMaxHeap(k)

// read the first k elements from file[0] and build a MAX-heap of them
FileHandler fh = readFile(file[0])
for (i = 0; i < k; i++)
    h.insert(line[i].data of fh)

// for every other file, a smaller element evicts the current maximum
for (j = 1; j < n; j++) {
    FileHandler fh = readFile(file[j])
    for (i = 0; i < k; i++) {
        if (line[i].data of fh < getMax(h)) {
            deleteMax(h)
            h.insert(line[i].data of fh)
        } else {
            break   // the file is sorted, so the rest of it can be skipped
        }
    }
}

return the heap as the result
```

Hey Guys!! What do you think of this approach?

First we take the starting number from each of the 100 files and sort them, which takes 100 log 100 and can be treated as a constant. Along with each number, we store the file descriptor of the file it was extracted from.

Now treat the array as a queue. Starting from the minimum element, go to its file and extract (store or print) all the numbers smaller than the next minimum, then remove the minimum from the queue and update the minimum and next minimum, keeping a count of the elements printed so far. Continue until the queue is empty or count = k. If the queue is empty and count < k, rebuild the queue from the numbers currently pointed to by the file pointers and repeat the process.

Pros: since the OS does block caching while reading a file, printing contiguous runs of elements from one file is a major advantage.

In the worst case, when no file ever yields a run longer than one element, the time complexity would be k·100·log 100, which is also the case for the implementations above.

Please correct me if I am wrong.
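A rough sketch of this run-extraction idea, with in-memory lists standing in for the files (class and variable names are mine):

```java
import java.util.List;
import java.util.PriorityQueue;

// After taking the current minimum, keep consuming from the same "file"
// while its next value is still no larger than every other file's head.
class RunMerge {
    static int[] kSmallest(List<List<Integer>> files, int k) {
        // heap entries: {head value, file index, position within file}
        PriorityQueue<int[]> heads = new PriorityQueue<>((x, y) -> x[0] - y[0]);
        for (int f = 0; f < files.size(); f++) {
            if (!files.get(f).isEmpty()) heads.add(new int[]{files.get(f).get(0), f, 0});
        }
        int[] out = new int[k];
        int count = 0;
        while (count < k && !heads.isEmpty()) {
            int[] min = heads.poll();
            List<Integer> file = files.get(min[1]);
            int pos = min[2];
            // emit a whole run from this file while it stays below the next head
            while (count < k && pos < file.size()
                    && (heads.isEmpty() || file.get(pos) <= heads.peek()[0])) {
                out[count++] = file.get(pos++);
            }
            if (pos < file.size()) heads.add(new int[]{file.get(pos), min[1], pos});
        }
        return out;
    }
}
```

When the data is clustered per file, the inner loop emits long contiguous runs with no heap operations, which is where the block-caching advantage shows up.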

He is right. The complete flow is the same max-heap pseudocode given above.

I just had a brilliant idea! I'm pretty sure it's O(k) in complexity.

Abstract each file out and treat it like a stack. Call it a FileStack.

Create a min-heap with good decrease-key performance (O(1)) and find-min performance (O(1)).

Put each FileStack on the heap. That's O(100).

O(100) + O(1) + O(1) + O(k) = O(k)

Thoughts?

- rotinom, April 08, 2014

... and it's effectively the same as the OP's, except that heapify() wouldn't need to be called every time.