Adobe Interview Question
Software Engineer / Developers

Given two files, each with a huge number of lines (say 10^9), find the lines common to both files.

In practice, you could use MD5 or some other "good" hash function to compute a hash value for each line, reading it byte by byte. In theory, though, this does not guarantee the absence of collisions. The original task could also be solved by sorting the lines (indexes of the lines, of course). To compare two lines there is no need to load them into memory completely. The number of comparisons is O(N log N), and duplicate lines are detected during the comparisons.
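A minimal sketch of the hashing idea, assuming both files fit through memory one line at a time (the function names `line_hashes` and `common_hashes` are illustrative, not from the original answer). Lines are hashed with MD5 as suggested above; a match in digests almost certainly means a common line, though as noted, collisions are theoretically possible:

```python
import hashlib

def line_hashes(path):
    """Read a file line by line and return the set of MD5 digests of its lines."""
    hashes = set()
    with open(path, "rb") as f:
        for line in f:
            # Strip the newline so the last line (with or without '\n') compares equally.
            hashes.add(hashlib.md5(line.rstrip(b"\n")).digest())
    return hashes

def common_hashes(path_a, path_b):
    """Digests present in both files; each digest is 16 bytes, so 10^9 lines
    need ~16 GB of digests -- hence the external-sort discussion below."""
    return line_hashes(path_a) & line_hashes(path_b)
```

Note that only the 16-byte digests are kept in memory, never whole lines; for 10^9 lines even the digest set is large, which motivates the external-sort and partitioning answers that follow.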
I do not see any other option than hashing and then an external sort. But if a line is quite big, we probably need multiple levels of hashing per line — e.g., for a 100-character line with only 25 bytes of memory, hash 20 characters at a time (5 hashes) and then hash those 5 hashes again.

Or do we need to use MapReduce?
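One way the two-level hashing described above could work — this is a sketch of my reading of that answer, not a standard scheme: hash the line in fixed-size chunks, then hash the concatenation of the chunk digests, so only one chunk plus the running digests ever sit in memory.

```python
import hashlib

CHUNK = 20  # bytes hashed at a time (the "20 chars, 5 times" from the answer)

def two_level_hash(line: bytes) -> bytes:
    """Hash fixed-size chunks of a long line, then hash the chunk digests.
    Memory use is bounded by one chunk plus the accumulated 16-byte digests."""
    chunk_digests = b""
    for i in range(0, len(line), CHUNK):
        chunk_digests += hashlib.md5(line[i:i + CHUNK]).digest()
    return hashlib.md5(chunk_digests).digest()
```

Two equal lines always produce equal two-level hashes, so this preserves the property needed for duplicate detection, at the cost of a slightly higher (still negligible) collision probability.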
1. Read the first characters of the lines (let's say memory can accommodate m characters and there are n lines, so read the first m characters at a time, and so on).
2. Categorize the lines by their first character: lines starting with 'a' go into set 'a', and so on.
3. Compare the lines within each set on the remaining characters; discard lines that had no matches.
4. We may compare them somewhat like a k-way merge, deciding as we go which lines to compare and which to discard.
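Once the lines (or line indexes) have been sorted — for instance by the external sort mentioned above — the merge-style comparison in the steps above reduces to a single pass over the two files. A sketch, assuming each file's lines are already sorted and newline-terminated (`common_sorted` is an illustrative name):

```python
def common_sorted(path_a, path_b):
    """Single merge pass over two files whose lines are already sorted;
    yields each line that appears in both. Only two lines are in memory
    at any moment, which is the point of the k-way-merge idea above."""
    with open(path_a) as fa, open(path_b) as fb:
        la, lb = fa.readline(), fb.readline()
        while la and lb:
            if la < lb:
                la = fa.readline()   # a's line can't match anything later in b
            elif la > lb:
                lb = fb.readline()
            else:
                yield la.rstrip("\n")
                la, lb = fa.readline(), fb.readline()
```

This is where the O(N log N) bound from the sorting answer comes from: the sort dominates, and the merge pass itself is linear.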
- Anonymous January 11, 2010

BitVector b_vector = sizeof(no. of lines); // all bits initialized to false

1. Hash every string to a number y.
2. if (b_vector[y] == true)
       found identical line
   else
       b_vector[y] = true;