Amazon Interview Question
Software Engineer / Developer

If each username is even 20 bytes, the total size of each file will be at least 20 GB. So a B+ tree can be constructed for one of the files (the smaller one), and then we look for entries of the second file in this tree. There is probably another method; if I find one I will post it. Gayle, can you comment on my solution?
So you could sort the smaller of the two files. Since memory would be a problem with such a huge file, one can do an n-way merge sort of the smaller file. Once that is done, do a binary search for each element of the other file in the sorted file. So if the number of elements in the smaller file is n and in the larger file is m, the running time would be O(m log n). Any suggestions???
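Assuming the sorted smaller file fits in memory (or is reachable by offset), the binary-search step might look like this sketch; Python lists stand in for the two files, and the function name is illustrative:

```python
import bisect

def common_names(small_sorted, large):
    """Binary-search each name from the large collection in the sorted
    smaller list: O(m log n) comparisons in total."""
    found = []
    for name in large:
        i = bisect.bisect_left(small_sorted, name)
        if i < len(small_sorted) and small_sorted[i] == name:
            found.append(name)
    return found

matches = common_names(sorted(["alice", "bob", "carol"]),
                       ["bob", "dave", "alice"])
```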
For the first file:
read each line,
hash it into one of 26 buckets (A to Z) depending on the first letter of the name
(to form something like a 26-by-xxx table),
then sort each of the 26 rows in the hash table and delete duplicates.
For the second file:
sort and delete duplicates;
for each line/name, find a match in the hash table;
if a match is found, output it to another file named foundmatch.
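A minimal in-memory sketch of the bucketing scheme above; the lists stand in for the files, and the function names are illustrative:

```python
from collections import defaultdict

def build_buckets(names):
    """Bucket each name by its first letter (26 buckets for a-z), then
    sort each bucket and drop duplicates."""
    buckets = defaultdict(set)
    for name in names:
        buckets[name[0]].add(name)        # set membership removes dups
    return {k: sorted(v) for k, v in buckets.items()}

def find_matches(buckets, other_names):
    """For each deduplicated name from the second file, check its bucket."""
    matches = []
    for name in sorted(set(other_names)):  # sort + dedup the second file
        if name in buckets.get(name[0], []):
            matches.append(name)
    return matches

table = build_buckets(["anna", "bob", "bert"])
matches = find_matches(table, ["bob", "carl", "bob"])
```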
I think the B+ tree would be a good suggestion if memory is not a problem. This is because normal sorting is n log n with log base 2, while with a B+ tree I could increase the fanout and make the height smaller depending on the sample set that I have. Say the larger file is indexed and each value of the smaller file is searched: it would take m log_f n < n log_2 n, where f is the fanout. Gayle, what do you think??? Any suggestions when your input is large?
Use a hash set to store all the usernames present in one file.
Then try to insert the usernames from the other file into this hash set one by one. If the username is inserted, it is unique and the insert returns true; if not, it returns false, and so a duplicate exists.
Search time is O(1) on average for a hash set (I think so... I'm not sure).
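A sketch of the hash-set idea, assuming everything fits in memory (which the 20 GB estimate earlier in the thread puts in doubt); Python's set plays the role of the HashSet:

```python
def duplicates_via_set(file1_names, file2_names):
    """Store every username from the first file in a set (amortized O(1)
    lookups), then test each username from the second file against it."""
    seen = set(file1_names)
    return [name for name in file2_names if name in seen]

dups = duplicates_via_set(["alice", "bob"], ["bob", "carol"])
```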
We can build a hash table for the larger file, with one bucket per letter of the alphabet (26 in all). Now from the smaller file we take each name one by one, do a lookup, and XOR it with the entry in the hash table. We know that if the same value is present in the hash table, then x ^ x = 0. So if we get a zero, we can output that name. Any comments?
This was asked in my RIVERBED interview. I told him the same: if memory is not a problem, we could hash all the words, which would give O(1) lookups; if memory is limited we could still use a hash table, but the complexity would increase. It's basically a tradeoff between what you have and what you want to achieve, I felt. But again, the interviewer was not exactly amused!
We can build a hash table and set all the first names from the first file.
XOR the hash values of the first names from the second file: do a lookup on the hash table and find all the names that have a 0; these will be the repeated names. But to ensure the uniqueness of the names, we can now pick up all the names which have a zero in the hash table from file one and do a binary search in file two for each of them. If it is found, then that is a common name.
If memory is an issue, then another approach would be to hash all the usernames from both files into smaller temporary files. Then the program can deal with the smaller files individually to determine if there is a duplicate. To avoid ambiguity (say, if there are duplicate usernames within a single log file), store the file id to know where a username originally came from. I think the biggest bottleneck for this algorithm is disk access, so it is a good idea to read in huge chunks (say 100 MB).
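A toy sketch of that hash-partitioning idea; the bucket count, file names, and tab-separated file-id tag are all illustrative choices, not anything fixed by the post:

```python
import os
import tempfile

def partition(path, file_id, n_parts, out_dir):
    """Hash each username into one of n_parts bucket files on disk,
    tagging every line with the id of the log file it came from
    (so a duplicate inside a single file is not misreported)."""
    outs = [open(os.path.join(out_dir, f"part{i}.txt"), "a")
            for i in range(n_parts)]
    with open(path) as f:
        for line in f:
            name = line.strip()
            outs[hash(name) % n_parts].write(f"{file_id}\t{name}\n")
    for o in outs:
        o.close()

def duplicates_in_partition(part_path):
    """Each bucket file is small enough to process in memory on its own."""
    by_file = {"1": set(), "2": set()}
    with open(part_path) as f:
        for line in f:
            file_id, name = line.rstrip("\n").split("\t", 1)
            by_file[file_id].add(name)
    return by_file["1"] & by_file["2"]

# tiny demo with throwaway files
tmp = tempfile.mkdtemp()
f1, f2 = os.path.join(tmp, "f1"), os.path.join(tmp, "f2")
with open(f1, "w") as f:
    f.write("alice\nbob\ncarol\n")
with open(f2, "w") as f:
    f.write("bob\ndave\nalice\n")
partition(f1, "1", 4, tmp)
partition(f2, "2", 4, tmp)
dups = set()
for i in range(4):
    dups |= duplicates_in_partition(os.path.join(tmp, f"part{i}.txt"))
```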
1. Dictionary principle
For the smaller file, create a trie.
Then, for each username from the bigger file, check the trie, which takes O(L) per lookup, where L is the length of the username.
If usernames contain only the 26 alphabet characters and are no longer than 10 characters, the tree can have up to 26^10 nodes in the worst case.
But for longer usernames, and characters beyond the alphabet, it might not be the best solution.
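A minimal trie sketch in Python; note that in practice the node count grows only with the characters actually inserted, not the 26^10 worst case:

```python
class TrieNode:
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}      # child nodes keyed by character
        self.terminal = False   # True if a username ends at this node

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True

def contains(root, word):
    """O(L) lookup, where L is the length of the word."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.terminal

root = TrieNode()
for name in ("ann", "anna", "bob"):
    insert(root, name)
hit = contains(root, "anna")
miss = contains(root, "an")   # prefix of a name, but not a name itself
```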
or
2. Use hash
It looks like both files have almost the same number of usernames, so any algorithm that depends on finding the smaller file won't be reliable.
Here's another thought: go through both files simultaneously and add each entry from both files into a hash. If an add fails, the element already exists.
Pseudo Code:
while((!EOF(file1)) && (!EOF(file2)))
{
    read userName1 from file1
    if (add(userName1 into Hash) fails)
    {
        add userName1 into listOfCommonUsers
    }
    read userName2 from file2
    if (add(userName2 into Hash) fails)
    {
        add userName2 into listOfCommonUsers
    }
}
while(!EOF(file1))
{
    read userName1 from file1
    if (add(userName1 into Hash) fails)
    {
        add userName1 into listOfCommonUsers
    }
}
while(!EOF(file2))
{
    read userName2 from file2
    if (add(userName2 into Hash) fails)
    {
        add userName2 into listOfCommonUsers
    }
}
Looks good, but it won't handle the scenario where a single file has duplicate names in it. I think in that case we need to split the hashes: add file1 entries into hash2 and file2 entries into hash1. For a file1 entry, look into hash1; for a file2 entry, look into hash2.
Agreed with Swapnil; however, it becomes less efficient, since we need to create two separate hash tables and then compare them.
Hence O(n1) for hash1 and O(n2) for hash2.
Assume K1 and K2 are the numbers of elements in the hash tables hash1 and hash2. The final step of comparing these hash tables is O(K1) and O(K2).
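The two-table variant can be sketched as follows; Python sets stand in for the hash tables, and the intersection is the final comparison step:

```python
def common_usernames(file1_names, file2_names):
    """One set per file: a name repeated inside a single file is stored
    once, so only genuine cross-file duplicates survive."""
    set1 = set(file1_names)   # O(n1) inserts
    set2 = set(file2_names)   # O(n2) inserts
    return set1 & set2        # compare the two hash tables

common = common_usernames(["alice", "alice", "bob"], ["bob", "carol"])
```

Note that "alice", repeated within the first file only, is correctly excluded from the result.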
How about uniq -u to get all unique names in file1 and file2, concatenate the results, and then uniq -d? (Note that uniq only works on sorted input, so each file would need to be sorted first.)
Do an m-way merge sort of both files independently.
Repeat the following as long as both files are not completely read:
1. Read one buffer from file1 and one buffer from file2.
2. While the first name in buffer1 is > the last name in buffer2,
read the next buffer from file2 into buffer2.
While the first name in buffer2 is > the last name in buffer1,
read the next buffer from file1 into buffer1.
3. If EOF is not reached, we now have two buffers with overlapping names. Since the buffers are sorted, find the common names by just going through the names once (similar to the merge step of merge sort).
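Assuming both files have already been sorted, the buffer-overlap scan above reduces to a streaming merge; buffered file IO supplies the block reads, so this sketch can work line by line:

```python
import os
import tempfile

def common_from_sorted_files(path1, path2):
    """Stream two already-sorted files line by line (buffered file IO
    fetches them from disk in blocks) and emit names present in both."""
    out = []
    with open(path1) as f1, open(path2) as f2:
        a, b = f1.readline(), f2.readline()
        while a and b:
            na, nb = a.strip(), b.strip()
            if na == nb:
                out.append(na)
                a, b = f1.readline(), f2.readline()
            elif na < nb:
                a = f1.readline()   # file1 is behind, advance it
            else:
                b = f2.readline()   # file2 is behind, advance it
    return out

# tiny demo with two pre-sorted throwaway files
tmp = tempfile.mkdtemp()
p1, p2 = os.path.join(tmp, "s1"), os.path.join(tmp, "s2")
with open(p1, "w") as f:
    f.write("alice\nbob\ncarol\n")
with open(p2, "w") as f:
    f.write("bob\ncarol\ndave\n")
common = common_from_sorted_files(p1, p2)
```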
Depends on the restrictions on usernames.
My idea: if a username is at most 8 characters, say, you can view it as an 8-'digit' number in base 27 (you need one extra 'null' character for usernames that aren't a full 8 characters long).
2 billion ASCII usernames at 8 chars each is ~16 GB, so doing it all in memory is a problem.
So, you can use radix sort and do 8 passes over the data. Make 27 temp files (one for 'null', one for 'a', one for 'b', ... one for 'z') and push each username into the right file according to its character in position 8; then do the same thing over all those files for position 7, then position 6, until you've sorted by the first character. Then you'll have all the usernames from both files in sorted order. At that point you can just scan through each file as a stream and look for repeated usernames in sequence.
If L is the max length of your usernames, S1 is the # of usernames in first file, S2 is # of usernames in second file, this is O( L(S1 + S2) ), and uses constant space in memory. Since L is a constant, this is linear time. Lots of disk I/O though.
You could probably improve the runtime by doing more of the radix-sorting in memory and flushing to disk in blocks, but that's what buffered IO is for :D
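An in-memory sketch of the LSD radix sort described above; lists stand in for the 27 temp files, and width 8 with a leading 'null' bucket follows the post's assumptions:

```python
def radix_sort_usernames(names, width=8):
    """LSD radix sort: one stable pass per character position, last
    position first. 27 buckets = the 'null' sentinel (for names shorter
    than width, sorting before 'a') plus 'a'..'z'."""
    for pos in range(width - 1, -1, -1):
        buckets = [[] for _ in range(27)]
        for name in names:
            ch = name[pos] if pos < len(name) else None
            idx = 0 if ch is None else ord(ch) - ord("a") + 1
            buckets[idx].append(name)
        names = [n for bucket in buckets for n in bucket]  # stable concat
    return names

ordered = radix_sort_usernames(["bob", "anna", "al", "ann"])
```

Each pass is stable, which is what makes sorting from the last position to the first produce a fully sorted result.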
First of all, since each log file has a billion usernames, we should assume there is not enough memory to hold all the usernames. So we should use an external sort to sort both files first; then we can merge the two sorted files.
If memory is not a problem, then just use a Unix command like 'cat f1 f2 | sort | uniq -d > f3'.
I think we can achieve this with the merge sort technique (external merge sort):
split each file into chunks;
use merge sort/insertion sort (as you are doing it only chunk by chunk, either is fine performance-wise);
after sorting each chunk, write it back to disk; repeat this process for all chunks in both files;
when both files are done, treat them as two separate sorted arrays and do the merging;
while merging, if you see duplicates, keep them in a list (some data structure in memory) and throw away all the other strings;
at the end you have only the duplicates that you wanted to find.
This can be done in O(n log n), where n = x + y (the number of words in file 1 plus the number of words in file 2).
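A toy version of the chunk-sort-then-merge pipeline, with a deliberately tiny chunk size (it would be millions of lines in practice); heapq.merge performs the k-way merge of the sorted runs:

```python
import heapq
import os
import tempfile

def sort_chunks(path, chunk_lines, tmp_dir):
    """Sort a big file in chunks that fit in memory, writing each sorted
    chunk back to disk as a run."""
    runs = []
    with open(path) as f:
        while True:
            chunk = [line.strip() for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort()
            run = os.path.join(tmp_dir, f"run{len(runs)}.txt")
            with open(run, "w") as out:
                out.write("\n".join(chunk) + "\n")
            runs.append(run)
    return runs

def merge_runs(runs):
    """k-way merge of the sorted runs; heapq.merge streams lazily rather
    than loading everything into memory."""
    files = [open(r) for r in runs]
    merged = list(heapq.merge(*((line.strip() for line in f) for f in files)))
    for f in files:
        f.close()
    return merged

# demo: sort a 4-line file in 2-line chunks, then merge the runs
tmp = tempfile.mkdtemp()
p = os.path.join(tmp, "names.txt")
with open(p, "w") as f:
    f.write("dave\nbob\nalice\ncarol\n")
ordered = merge_runs(sort_chunks(p, 2, tmp))
```

Once both files have been through this pipeline, the duplicate scan is the merge-with-two-pointers step described elsewhere in the thread.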
1) First of all, sort both files.
2) Read a username from file1, say username1, and read a username from file2, say username2.
3) If username1 = username2, then
print it as output;
else if username1 > username2, then
read the next username from file2 into username2 and start step 3 again;
else if username1 < username2, then
read the next username from file1 into username1 and start step 3 again.
If either file reaches end of file, stop the program.
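The steps above translate directly into a two-pointer scan over sorted sequences; lists stand in for the sorted files in this sketch:

```python
def intersect_sorted(a, b):
    """Walk both sorted lists with one pointer each, advancing the side
    holding the smaller name; equal names are output. O(n + m)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] > b[j]:
            j += 1   # username1 > username2: advance file2
        else:
            i += 1   # username1 < username2: advance file1
    return out

common = intersect_sorted(["alice", "bob", "dave"],
                          ["bob", "carol", "dave"])
```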
What's the problem with sorting? That's the correct solution. You use an external sort if the files don't fit in memory.
Sort the 2 files: O(n log n).
Take one entry from the first file and do a binary search in the second file to find if it exists: O(log n) per entry.
Go to the next entry in the first file.
...repeat the process.
even reading the file?
- what? February 22, 2006