Amazon Interview Question for Software Engineer / Developers


Country: United States
Interview Type: Phone Interview





Assume that 1) the size of file A > the size of file B; 2) each file is too big to be loaded into memory, and memory holds only N integers at a time.

do {
    read N integers from A into an in-memory hash table; if no more integers are available, break;
    read each integer from B and check whether it is in the hash table; if it is, write it to the output file C;
} while (true);

Running time: O(B * A / N), since all of B is rescanned once for each of the A/N chunks of A.
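A minimal Java sketch of this chunked approach, assuming whitespace-separated integers (the file names and chunk size are illustrative, and duplicate handling is ignored):

    import java.io.*;
    import java.util.*;

    public class ChunkedIntersect {
        // Writes integers common to files a and b into file c, keeping at
        // most n integers from a in memory at a time.
        static void intersect(File a, File b, File c, int n) throws IOException {
            try (Scanner scanA = new Scanner(a);
                 PrintWriter out = new PrintWriter(new FileWriter(c))) {
                while (scanA.hasNextInt()) {
                    // Load the next chunk of up to n integers from A.
                    Set<Integer> chunk = new HashSet<>();
                    while (chunk.size() < n && scanA.hasNextInt()) {
                        chunk.add(scanA.nextInt());
                    }
                    // Rescan all of B against this chunk.
                    try (Scanner scanB = new Scanner(b)) {
                        while (scanB.hasNextInt()) {
                            int x = scanB.nextInt();
                            if (chunk.contains(x)) {
                                out.println(x);     // common to both files
                            }
                        }
                    }
                }
            }
        }
    }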

- sqw June 07, 2012 | Flag Reply

Wouldn't hash tables consume a lot of memory?
Can we do this using a radix sort?

- chaos June 08, 2012 | Flag

I see that here I do not have to write into both files, but the condition that A > B is a little restrictive. If B > A, I believe we can write a function to check for that and swap the roles of the files.

Could you please help me with the hash functions? I am new to them.

- chaos June 08, 2012 | Flag

Reasonable approach.

- Anonymous June 09, 2012 | Flag

This is a simple implementation of the same. I got a clarification that the numbers are integers and that the files are large but on one disk only.

I am using a similar approach to the one suggested above. If a file is too big to fit into memory in one go, break it into parts or read n integers at a time.

import java.io.*;
import java.util.*;

public class FileRead {

    public static void main(String[] args) {
        File file1 = new File("C:\\Documents and Settings\\chaos\\Desktop\\input1.txt");
        File file2 = new File("C:\\Documents and Settings\\chaos\\Desktop\\input2.txt");
        try {
            Set<Integer> seen = new HashSet<Integer>();

            Scanner scan1 = new Scanner(file1);
            while (scan1.hasNextInt()) {
                seen.add(scan1.nextInt());      // record every integer in file 1
            }
            scan1.close();

            // Open the output file once, outside the loop; opening it per
            // match would overwrite the previous match each time.
            BufferedWriter out = new BufferedWriter(
                new FileWriter("C:\\Documents and Settings\\chaos\\Desktop\\output.txt"));

            Scanner scan2 = new Scanner(file2);
            while (scan2.hasNextInt()) {
                int n = scan2.nextInt();
                if (seen.contains(n)) {         // common to both files
                    out.write(Integer.toString(n)); // write the digits, not a char code
                    out.newLine();
                    System.out.println(n);
                }
            }
            scan2.close();
            out.close();                        // flush the buffered output
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

- chaos June 11, 2012 | Flag

What does "large integer" mean?

- sk June 07, 2012 | Flag Reply

It can be a 64-bit or larger integer. But for simplicity, I wanted to assume 32-bit for now.

- chaos June 08, 2012 | Flag

Do you realize that 6 bits is actually incredibly small (0-63 range)? 6 bytes, on the other hand, would be quite big, if that's what you meant.

- Anonymous June 09, 2012 | Flag

corrected to 64

- chaos June 10, 2012 | Flag

Assume that the "large integers" are unsigned Int32s.
Allocate a bit array of 2^32 bits, i.e., a byte array of 2^29 bytes (512 MB).
Read through the first file:
read the next int N, and set bit N % 8 of element N / 8 of the byte array.
Close the 1st file.
Open the output file for writing.
Read through the second file:
read the next int N, and check bit N % 8 of element N / 8 of the byte array; if it is set to 1, write N to the output file.
Close the 2nd file.
Close the output file.

Complexity is O(n1 + n2), where n1 and n2 are the sizes of the input files.
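For intuition, here is a small Java sketch of the same bit-per-value idea using java.util.BitSet, which grows on demand (a table covering every unsigned 32-bit value would be 2^32 bits = 512 MB). It assumes non-negative integers, and clears each bit after writing so duplicates in the second file print only once:

    import java.io.*;
    import java.util.*;

    public class BitmapIntersect {
        static void intersect(File a, File b, File out) throws IOException {
            BitSet seen = new BitSet();
            try (Scanner in = new Scanner(a)) {
                while (in.hasNextInt()) {
                    seen.set(in.nextInt());      // mark every value in file A
                }
            }
            try (Scanner in = new Scanner(b);
                 PrintWriter w = new PrintWriter(new FileWriter(out))) {
                while (in.hasNextInt()) {
                    int v = in.nextInt();
                    if (seen.get(v)) {           // present in both files
                        w.println(v);
                        seen.clear(v);           // print each common value once
                    }
                }
            }
        }
    }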

- sa June 08, 2012 | Flag Reply

Can you please elaborate a little more?

- chaos June 11, 2012 | Flag

Is it really about two large integers or two large numbers? If it is integers, we do not need strings to hold them; if it is arbitrary numbers, strings are easier.

Steps:
Read each line from file A and store the number as a String in a HashSet (I would prefer a caching mechanism, or multiple HashSets of sqrt(n) size each). This takes care of duplicates within the same file.

After the first file is completely read and stored in the HashSets, read the second file and use the Boyer-Moore algorithm to search for the strings from the second file against the HashSets.

If a hit is found in any of the sets, write the String to the third file.

Analysis: reading and storing into the HashSets takes O(N) time, and in the worst case the comparisons also take O(N) time, where you have to compare against each element.

We also need to analyze the Boyer-Moore algorithm's complexity when the strings to compare are almost the same length. In that case, when a mismatch occurs, the average cost is O(L/10), where L is the length of the string and 10 is the base of the numbers.

- dharmendra June 08, 2012 | Flag Reply

1) Get file 1's size in bits and create a bitmap of that size, initialized to zero.
2) Read all the integers from file 1:
if bitmap[file1[i]] is not set, then set bitmap[file1[i]] = 1;
else the number is a duplicate within the same file, so take no action.
3) Now read the integers from file 2.
For each number, check the bitmap built from file 1:
if the bit is already set, print the number as a common element;
otherwise, don't print it.
Time: O(number of entries in file1 + file2)
Space: O(size(file1))

- VillageMonkey June 08, 2012 | Flag Reply

Sorry, a bitmap of size 2^32 is needed; creating a bitmap of size file1 won't work.

- VillageMonkey June 08, 2012 | Flag

This indeed is the fastest and most efficient solution. No need for a trie/set/hash table/<insert a high-level data structure here>.

- Saurabh Rawat June 12, 2012 | Flag

What if the integers are 64 bits? That's why folks worried about "high level data structures".

- eugene.yarovoi June 14, 2012 | Flag

Can't we use a trie?

- Hector June 08, 2012 | Flag Reply

By all means mention random data structure names with no indication of how you intend to use them.

- Anonymous June 09, 2012 | Flag

So let's store the long integers in our trie by keeping one digit of the number at each node. After storing them all, we search the trie for each number from the second file.
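As a sketch of that idea (my reading of it, not the poster's own code), here is a minimal decimal-digit trie in Java: each node has ten children, one per digit, plus a flag marking the end of a stored number:

    public class DigitTrie {
        private final DigitTrie[] child = new DigitTrie[10];
        private boolean end;                 // true if a stored number ends here

        void insert(String number) {
            DigitTrie node = this;
            for (char c : number.toCharArray()) {
                int d = c - '0';
                if (node.child[d] == null) node.child[d] = new DigitTrie();
                node = node.child[d];
            }
            node.end = true;
        }

        boolean contains(String number) {
            DigitTrie node = this;
            for (char c : number.toCharArray()) {
                node = node.child[c - '0'];
                if (node == null) return false;
            }
            return node.end;                 // prefixes alone don't count
        }
    }

Since each digit is one node, arbitrarily long numbers work; insert and lookup both cost O(number of digits).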

- Hector June 21, 2012 | Flag

The problem is similar to a two-set intersection; the only difference is the big numbers, which we can treat as strings, so any set-intersection solution should work:
1) Use external mergesort, with radix sort for the in-memory chunks; then traverse both sorted files and determine the intersection.
2) Store all the strings from the 1st file in a trie; then traverse the 2nd file and look each string up in the trie.
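The merge step of option 1 can be sketched as follows, assuming both inputs are already sorted (e.g., produced by an external mergesort) and streamed via iterators; it emits each common value once:

    import java.util.*;

    public class SortedIntersect {
        // Walk two sorted streams in lockstep and keep the common values.
        static List<Long> intersect(Iterator<Long> a, Iterator<Long> b) {
            List<Long> out = new ArrayList<>();
            Long x = a.hasNext() ? a.next() : null;
            Long y = b.hasNext() ? b.next() : null;
            while (x != null && y != null) {
                int cmp = x.compareTo(y);
                if (cmp < 0)      x = a.hasNext() ? a.next() : null;
                else if (cmp > 0) y = b.hasNext() ? b.next() : null;
                else {
                    if (out.isEmpty() || !out.get(out.size() - 1).equals(x))
                        out.add(x);          // record once per common value
                    x = a.hasNext() ? a.next() : null;
                    y = b.hasNext() ? b.next() : null;
                }
            }
            return out;
        }
    }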

- m@}{ June 08, 2012 | Flag Reply

Use bool[Int32.MaxValue] as the hash table, to save memory and make it faster. Here is my solution.

bool[] intindex = new bool[Int32.MaxValue];   // note: one byte per element in .NET
string file1name = "c:\\test1.txt";
string file2name = "c:\\test2.txt";

StreamReader file1 = new StreamReader(file1name);
string line;
while ((line = file1.ReadLine()) != null)
{
    try
    {
        intindex[Int32.Parse(line)] = true;
    }
    catch (Exception)
    {
        // skip non-numeric lines and negative values
    }
}
file1.Close();

StreamReader file2 = new StreamReader(file2name);
StreamWriter newfile = new StreamWriter("c:\\newfile.txt");

while ((line = file2.ReadLine()) != null)
{
    try
    {
        if (intindex[Int32.Parse(line)])
        {
            newfile.WriteLine(line);
        }
    }
    catch (Exception)
    {
    }
}
file2.Close();
newfile.Close();   // flush the buffered output

- Capsen June 22, 2012 | Flag Reply

What do these integers represent? How much memory do we have? What about duplicates? Is either file smaller? Much smaller? Can we tell which file is smaller?

In the simplest case, we can use a simple counting sort.
In the tougher case, we might have to resort to external mergesort.

E.g., if we don't have to worry about duplicates and the integers are 32 bits, we should be able to use a bitset that will fit in half a GB of RAM.

- ViX July 03, 2012 | Flag Reply

Correction 4GB

- ViX July 03, 2012 | Flag

Hmm, we may be able to use MapReduce to deal with large files. For example, during the map phase we emit <int, file-id>; since we only have two files, file-id will be file1 or file2. During the reduce phase, <key, value> pairs with the same key are sent to the same reducer, so a reducer can check whether an int appears in both files and, if so, write it to the output file.
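As a toy illustration of that idea (an in-memory simulation, not a real Hadoop job; the names here are made up for the example), the map step tags each integer with a file id, grouping collects the ids per key, and the "reducer" keeps keys seen under both ids:

    import java.util.*;

    public class MiniMapReduce {
        static Set<Integer> commonInts(int[] file1, int[] file2) {
            // Map + shuffle: group file ids by integer key.
            Map<Integer, Set<Integer>> grouped = new HashMap<>();
            for (int v : file1) grouped.computeIfAbsent(v, k -> new HashSet<>()).add(1);
            for (int v : file2) grouped.computeIfAbsent(v, k -> new HashSet<>()).add(2);
            // Reduce: keep keys that appeared under both file ids.
            Set<Integer> common = new TreeSet<>();
            for (Map.Entry<Integer, Set<Integer>> e : grouped.entrySet())
                if (e.getValue().size() == 2) common.add(e.getKey());
            return common;
        }
    }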

- Anonymous October 14, 2012 | Flag Reply

Assuming the CPU has multiple processors:
three threads, 1 hashtable (or trie), 1 queue --

thread #1:
read from file #1 (treat the long ints as strings);
insert each long int into the hashtable, with the number as the key and a bool as the value;
if the number is already in the table, check the bool to see if it's been queued for writing;
if it hasn't, add the number to the queue and set the bool to true.

thread #2:
same as thread #1, except it reads from file #2.

thread #3:
dequeue a number from the queue and write it to the output file;
when thread #1 and thread #2 are finished and the queue is empty, we're done.

Access to the hashtable and the queue needs to be synchronized across threads.

Assuming the files are on disk, most of the time is going to be spent reading and writing files, so breaking that work up across multiple threads is ideal (assuming the files are large).

The memory usage might be too high with a hashtable; this could be reduced by using a trie/radix tree.

- MessyHack June 07, 2012 | Flag Reply

What happens when an integer is present twice in one file and not present at all in the other? I think your algorithm will count that as a common entry.

- Anonymous June 08, 2012 | Flag

Yep, good catch.

It's easily fixed by additionally storing which thread (1 or 2) the number was encountered in: if it hasn't been queued yet and has now been seen by the other thread, queue it for writing.
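A compact Java sketch of the scheme with that fix, under some simplifying assumptions: arrays stand in for the two files, the map value records which files a number has been seen in, and the writer thread is collapsed into a final drain of the queue:

    import java.util.*;
    import java.util.concurrent.*;

    public class ThreadedIntersect {
        static final int QUEUED = 4;           // value bits: 1 = file 1, 2 = file 2

        static List<Integer> intersect(int[] f1, int[] f2) throws InterruptedException {
            ConcurrentHashMap<Integer, Integer> seen = new ConcurrentHashMap<>();
            BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
            Thread t1 = new Thread(() -> read(f1, 1, seen, queue));
            Thread t2 = new Thread(() -> read(f2, 2, seen, queue));
            t1.start(); t2.start();
            t1.join(); t2.join();
            List<Integer> out = new ArrayList<>();
            queue.drainTo(out);                // "writer thread" collapsed to a drain
            return out;
        }

        static void read(int[] file, int bit,
                         ConcurrentHashMap<Integer, Integer> seen,
                         BlockingQueue<Integer> queue) {
            for (int v : file) {
                seen.compute(v, (k, old) -> {  // compute runs atomically per key
                    int s = (old == null ? 0 : old) | bit;
                    if (s == 3) {              // first time seen in BOTH files
                        queue.add(v);          // so enqueue exactly once
                        s |= QUEUED;
                    }
                    return s;
                });
            }
        }
    }

Because the per-key update is atomic, a duplicate within one file only re-sets that file's bit, and a number common to both files is queued exactly once.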

- MessyHack June 08, 2012 | Flag

Looks like a good approach.
After some more reading, it looks like radix sort would be more optimal.

Can you help me with the functions here? I am new to hashmaps: I can add a key-value pair, but I have never tried it within a thread or done complex manipulations. I really need to see some code to understand how this could be put to work.

- chaos June 08, 2012 | Flag

Radix sort doesn't involve hashmaps.

- Anonymous June 09, 2012 | Flag

And there are actually many reasons this is a terrible solution. A hashtable is only reasonably efficient while everything fits into memory, and the threaded approach you suggested requires a huge amount of fine-grained synchronization that will have lock-contention issues.

Your best bet would probably be to mergesort both files independently first (in parallel if desired), and then run a modified merge algorithm to find the common elements. This has relatively good disk I/O performance.

The approach suggested by sqw can also be a good one, depending on just how large the files are. Many real database implementations use similar techniques.

- Anonymous June 09, 2012 | Flag

