Amazon Interview Question
Software Engineer / Developer

If each username is even 20 bytes, the total size of each file will be at least 20 GB. So a B+ tree can be constructed for one of the files (the smaller one), and then we look for entries of the second file in this tree. There is probably another method; if I find one I will post it. Gayle, can you comment on my solution?
So you could sort the smaller of the two files. Since memory would be a problem with such a huge file, one can do an n-way merge sort of the smaller file. Once that is done, do a binary search for each element of the other file in the sorted file. So if the number of elements in the smaller file is n and in the larger file is m, the running time would be O(m log n). Any suggestions???
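Assuming the sorted smaller file fits in memory (or is reachable by offset), the binary-search step might look like this sketch; Python lists stand in for the two files, and the function name is illustrative:

```python
import bisect

def common_names(small_sorted, large):
    """Binary-search each name from the large collection in the sorted
    smaller list: O(m log n) comparisons in total."""
    found = []
    for name in large:
        i = bisect.bisect_left(small_sorted, name)
        if i < len(small_sorted) and small_sorted[i] == name:
            found.append(name)
    return found

matches = common_names(sorted(["alice", "bob", "carol"]),
                       ["bob", "dave", "alice"])
```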
For the first file:
read each line,
hash it into one of 26 buckets (A to Z) depending on the first letter of the name
(to form something like a 26-by-xxx table),
then sort each of the 26 rows in the hash table and delete duplicates.
For the second file:
sort and delete duplicates;
for each line/name, find a match in the hash table;
if a match is found, output it to another file named foundmatch.
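A minimal in-memory sketch of the bucketing scheme above; the lists stand in for the files, and the function names are illustrative:

```python
from collections import defaultdict

def build_buckets(names):
    """Bucket each name by its first letter (26 buckets for a-z), then
    sort each bucket and drop duplicates."""
    buckets = defaultdict(set)
    for name in names:
        buckets[name[0]].add(name)        # set membership removes dups
    return {k: sorted(v) for k, v in buckets.items()}

def find_matches(buckets, other_names):
    """For each deduplicated name from the second file, check its bucket."""
    matches = []
    for name in sorted(set(other_names)):  # sort + dedup the second file
        if name in buckets.get(name[0], []):
            matches.append(name)
    return matches

table = build_buckets(["anna", "bob", "bert"])
matches = find_matches(table, ["bob", "carl", "bob"])
```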
I think the B+ tree would be a good suggestion if memory is not a problem. This is because normal sorting is n log n with log base 2, while with a B+ tree I could increase the fanout and make the height smaller depending on the sample set that I have. Say the larger file is indexed and each value of the smaller file is searched: it would take m log_f n < n log_2 n, where f is the fanout. Gayle, what do you think??? Any suggestions when your input is large?
Use a hash set to store all the usernames present in one file.
Then try to insert the usernames from the other file into this hash set one by one. If the username is inserted, it is unique and the insert returns true; if not, it returns false, and so a duplicate exists.
Search time is O(1) on average for a hash set (I think so... I'm not sure).
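A sketch of the hash-set idea, assuming everything fits in memory (which the 20 GB estimate earlier in the thread puts in doubt); Python's set plays the role of the HashSet:

```python
def duplicates_via_set(file1_names, file2_names):
    """Store every username from the first file in a set (amortized O(1)
    lookups), then test each username from the second file against it."""
    seen = set(file1_names)
    return [name for name in file2_names if name in seen]

dups = duplicates_via_set(["alice", "bob"], ["bob", "carol"])
```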
We can build a hash table for the larger file, with one bucket per letter of the alphabet (26 in all). Now from the smaller file we take each name one by one, do a lookup, and XOR it with the entry in the hash table. We know that if the same value is present in the hash table, then x ^ x = 0. So if we get a zero, we can output that name. Any comments?
This was asked in my RIVERBED interview. I told him the same: if memory is not a problem, we could hash all the words, which would give O(1) lookups; if memory is limited we could still use a hash table, but the complexity would increase. It's basically a tradeoff between what you have and what you want to achieve, I felt. But again, the interviewer was not exactly amused!
We can build a hash table and set all the first names from the first file.
XOR the hash values of the first names from the second file: do a lookup on the hash table and find all the names that have a 0; these will be the repeated names. But to ensure the uniqueness of the names, we can now pick up all the names which have a zero in the hash table from file one and do a binary search in file two for each of them. If it is found, then that is a common name.
If memory is an issue, then another approach would be to hash all the usernames from both files into smaller temporary files. Then the program can deal with the smaller files individually to determine if there is a duplicate. To avoid ambiguity (say, if there are duplicate usernames within a single log file), store the file id to know where a username originally came from. I think the biggest bottleneck for this algorithm is disk access, so it is a good idea to read in huge chunks (say 100 MB).
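A toy sketch of that hash-partitioning idea; the bucket count, file names, and tab-separated file-id tag are all illustrative choices, not anything fixed by the post:

```python
import os
import tempfile

def partition(path, file_id, n_parts, out_dir):
    """Hash each username into one of n_parts bucket files on disk,
    tagging every line with the id of the log file it came from
    (so a duplicate inside a single file is not misreported)."""
    outs = [open(os.path.join(out_dir, f"part{i}.txt"), "a")
            for i in range(n_parts)]
    with open(path) as f:
        for line in f:
            name = line.strip()
            outs[hash(name) % n_parts].write(f"{file_id}\t{name}\n")
    for o in outs:
        o.close()

def duplicates_in_partition(part_path):
    """Each bucket file is small enough to process in memory on its own."""
    by_file = {"1": set(), "2": set()}
    with open(part_path) as f:
        for line in f:
            file_id, name = line.rstrip("\n").split("\t", 1)
            by_file[file_id].add(name)
    return by_file["1"] & by_file["2"]

# tiny demo with throwaway files
tmp = tempfile.mkdtemp()
f1, f2 = os.path.join(tmp, "f1"), os.path.join(tmp, "f2")
with open(f1, "w") as f:
    f.write("alice\nbob\ncarol\n")
with open(f2, "w") as f:
    f.write("bob\ndave\nalice\n")
partition(f1, "1", 4, tmp)
partition(f2, "2", 4, tmp)
dups = set()
for i in range(4):
    dups |= duplicates_in_partition(os.path.join(tmp, f"part{i}.txt"))
```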
1. Dictionary principle
For the smaller file, create a trie.
Then, for each username from the bigger file, check the trie, which takes O(L) per lookup, where L is the length of the username.
If usernames contain only the 26 alphabet characters and are no longer than 10 characters, the tree can have up to 26^10 nodes in the worst case.
But for longer usernames, and characters beyond the alphabet, it might not be the best solution.
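A minimal trie sketch in Python; note that in practice the node count grows only with the characters actually inserted, not the 26^10 worst case:

```python
class TrieNode:
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}      # child nodes keyed by character
        self.terminal = False   # True if a username ends at this node

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True

def contains(root, word):
    """O(L) lookup, where L is the length of the word."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.terminal

root = TrieNode()
for name in ("ann", "anna", "bob"):
    insert(root, name)
hit = contains(root, "anna")
miss = contains(root, "an")   # prefix of a name, but not a name itself
```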
or
2. Use hash
It looks like both files have almost the same number of usernames, so any algorithm that depends on finding the smaller file won't be reliable.
Here's another thought: go through both files simultaneously and add each entry from both files into a hash. If an add fails, the element already exists.
Pseudo Code:
while((!EOF(file1)) && (!EOF(file2)))
{
    read userName1 from file1
    if (add(userName1 into Hash) fails)
    {
        add userName1 into listOfCommonUsers
    }
    read userName2 from file2
    if (add(userName2 into Hash) fails)
    {
        add userName2 into listOfCommonUsers
    }
}
while(!EOF(file1))
{
    read userName1 from file1
    if (add(userName1 into Hash) fails)
    {
        add userName1 into listOfCommonUsers
    }
}
while(!EOF(file2))
{
    read userName2 from file2
    if (add(userName2 into Hash) fails)
    {
        add userName2 into listOfCommonUsers
    }
}
Looks good, but it won't handle the scenario where a single file has duplicate names in it. I think in that case we need to split the hashes: add file1 entries into hash2 and file2 entries into hash1. For a file1 entry, look into hash1; for a file2 entry, look into hash2.
Agreed with Swapnil; however, it becomes less efficient, since we need to create two separate hash tables and then compare them.
Hence O(n1) for hash1 and O(n2) for hash2.
Assume K1 and K2 are the numbers of elements in the hash tables hash1 and hash2. The final step of comparing these hash tables is O(K1) and O(K2).
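The two-table variant can be sketched as follows; Python sets stand in for the hash tables, and the intersection is the final comparison step:

```python
def common_usernames(file1_names, file2_names):
    """One set per file: a name repeated inside a single file is stored
    once, so only genuine cross-file duplicates survive."""
    set1 = set(file1_names)   # O(n1) inserts
    set2 = set(file2_names)   # O(n2) inserts
    return set1 & set2        # compare the two hash tables

common = common_usernames(["alice", "alice", "bob"], ["bob", "carol"])
```

Note that "alice", repeated within the first file only, is correctly excluded from the result.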
How about uniq -u to get all unique names in file1 and file2, concatenate the results, and then uniq -d? (Note that uniq only works on sorted input, so each file would need to be sorted first.)
Do an m-way merge sort of both files independently.
Repeat the following as long as both files are not completely read:
1. Read one buffer from file1 and one buffer from file2.
2. While the first name in buffer1 is > the last name in buffer2,
read the next buffer from file2 into buffer2.
While the first name in buffer2 is > the last name in buffer1,
read the next buffer from file1 into buffer1.
3. If EOF is not reached, we now have two buffers with overlapping names. Since the buffers are sorted, find the common names by just going through the names once (similar to the merge step of merge sort).
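Assuming both files have already been sorted, the buffer-overlap scan above reduces to a streaming merge; buffered file IO supplies the block reads, so this sketch can work line by line:

```python
import os
import tempfile

def common_from_sorted_files(path1, path2):
    """Stream two already-sorted files line by line (buffered file IO
    fetches them from disk in blocks) and emit names present in both."""
    out = []
    with open(path1) as f1, open(path2) as f2:
        a, b = f1.readline(), f2.readline()
        while a and b:
            na, nb = a.strip(), b.strip()
            if na == nb:
                out.append(na)
                a, b = f1.readline(), f2.readline()
            elif na < nb:
                a = f1.readline()   # file1 is behind, advance it
            else:
                b = f2.readline()   # file2 is behind, advance it
    return out

# tiny demo with two pre-sorted throwaway files
tmp = tempfile.mkdtemp()
p1, p2 = os.path.join(tmp, "s1"), os.path.join(tmp, "s2")
with open(p1, "w") as f:
    f.write("alice\nbob\ncarol\n")
with open(p2, "w") as f:
    f.write("bob\ncarol\ndave\n")
common = common_from_sorted_files(p1, p2)
```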
Depends on the restrictions on usernames.
My idea: if a username is at most 8 characters, say, you can view it as an 8-'digit' number in base 27 (you need one extra 'null' character for usernames that aren't a full 8 characters long).
2 billion ASCII usernames at 8 chars each is ~16 GB, so doing it all in memory is a problem.
So, you can use radix sort and do 8 passes over the data. Make 27 temp files (one for 'null', one for 'a', one for 'b', ... one for 'z') and push each username into the right file according to its character in position 8; then do the same thing over all those files for position 7, then position 6, until you've sorted by the first character. Then you'll have all the usernames from both files in sorted order. At that point you can just scan through each file as a stream and look for repeated usernames in sequence.
If L is the max length of your usernames, S1 is the # of usernames in first file, S2 is # of usernames in second file, this is O( L(S1 + S2) ), and uses constant space in memory. Since L is a constant, this is linear time. Lots of disk I/O though.
You could probably improve the runtime by doing more of the radix-sorting in memory and flushing to disk in blocks, but that's what buffered IO is for :D
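An in-memory sketch of the LSD radix sort described above; lists stand in for the 27 temp files, and width 8 with a leading 'null' bucket follows the post's assumptions:

```python
def radix_sort_usernames(names, width=8):
    """LSD radix sort: one stable pass per character position, last
    position first. 27 buckets = the 'null' sentinel (for names shorter
    than width, sorting before 'a') plus 'a'..'z'."""
    for pos in range(width - 1, -1, -1):
        buckets = [[] for _ in range(27)]
        for name in names:
            ch = name[pos] if pos < len(name) else None
            idx = 0 if ch is None else ord(ch) - ord("a") + 1
            buckets[idx].append(name)
        names = [n for bucket in buckets for n in bucket]  # stable concat
    return names

ordered = radix_sort_usernames(["bob", "anna", "al", "ann"])
```

Each pass is stable, which is what makes sorting from the last position to the first produce a fully sorted result.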
First of all, since each log file has a billion usernames, we should assume there is not enough memory to hold all the usernames. So we should use an external sort to sort both files first; then we can merge the two sorted files.
If memory is not a problem, then just use a Unix command like 'cat f1 f2 | sort | uniq -d > f3'.
I think we can achieve this with the merge sort technique (external merge sort):
split each file into chunks;
use merge sort/insertion sort (as you are doing it only chunk by chunk, either is fine performance-wise);
after sorting each chunk, write it back to disk; repeat this process for all chunks in both files;
when both files are done, treat them as two separate sorted arrays and do the merging;
while merging, if you see duplicates, keep them in a list (some data structure in memory) and throw away all the other strings;
at the end you have only the duplicates that you wanted to find.
This can be done in O(n log n), where n = x + y (the number of words in file 1 plus the number of words in file 2).
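A toy version of the chunk-sort-then-merge pipeline, with a deliberately tiny chunk size (it would be millions of lines in practice); heapq.merge performs the k-way merge of the sorted runs:

```python
import heapq
import os
import tempfile

def sort_chunks(path, chunk_lines, tmp_dir):
    """Sort a big file in chunks that fit in memory, writing each sorted
    chunk back to disk as a run."""
    runs = []
    with open(path) as f:
        while True:
            chunk = [line.strip() for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort()
            run = os.path.join(tmp_dir, f"run{len(runs)}.txt")
            with open(run, "w") as out:
                out.write("\n".join(chunk) + "\n")
            runs.append(run)
    return runs

def merge_runs(runs):
    """k-way merge of the sorted runs; heapq.merge streams lazily rather
    than loading everything into memory."""
    files = [open(r) for r in runs]
    merged = list(heapq.merge(*((line.strip() for line in f) for f in files)))
    for f in files:
        f.close()
    return merged

# demo: sort a 4-line file in 2-line chunks, then merge the runs
tmp = tempfile.mkdtemp()
p = os.path.join(tmp, "names.txt")
with open(p, "w") as f:
    f.write("dave\nbob\nalice\ncarol\n")
ordered = merge_runs(sort_chunks(p, 2, tmp))
```

Once both files have been through this pipeline, the duplicate scan is the merge-with-two-pointers step described elsewhere in the thread.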
1) First of all, sort both files.
2) Read a username from file1, say username1, and read a username from file2, say username2.
3) If username1 = username2, then
print it as output;
else if username1 > username2, then
read the next username from file2 into username2 and start step 3 again;
else if username1 < username2, then
read the next username from file1 into username1 and start step 3 again.
If either file reaches end of file, stop the program.
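The steps above translate directly into a two-pointer scan over sorted sequences; lists stand in for the sorted files in this sketch:

```python
def intersect_sorted(a, b):
    """Walk both sorted lists with one pointer each, advancing the side
    holding the smaller name; equal names are output. O(n + m)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] > b[j]:
            j += 1   # username1 > username2: advance file2
        else:
            i += 1   # username1 < username2: advance file1
    return out

common = intersect_sorted(["alice", "bob", "dave"],
                          ["bob", "carol", "dave"])
```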
What's the problem with sorting? That's the correct solution. You use an external sort if the files don't fit in memory.
Sort the 2 files: O(n log n).
Take one entry from the first file and do a binary search in the second file to find if it exists: O(log n) per entry.
Go to the next entry in the first file.
...repeat the process.
even reading the file?
- what? February 22, 2006