Adobe Interview Question
Software Engineer / Developers

Given two files, each with a huge number of lines (say 10^9), find the lines common to both files.

In practice, you could use MD5 or some other "good" hash function to compute a hash value for each line, reading it byte by byte. In theory, though, this does not guarantee the absence of collisions. The original task could also be solved by sorting the lines (indexes of the lines, of course). To compare two lines there is no need to load them into memory completely. The number of comparisons is O(N log N), and duplicate lines are detected during the comparisons.
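A minimal sketch of the hashing idea, assuming both files fit through memory one line at a time (the function names `line_hashes` and `common_hashes` are illustrative, not from the original answer). Lines are hashed with MD5 as suggested above; a match in digests almost certainly means a common line, though as noted, collisions are theoretically possible:

```python
import hashlib

def line_hashes(path):
    """Read a file line by line and return the set of MD5 digests of its lines."""
    hashes = set()
    with open(path, "rb") as f:
        for line in f:
            # Strip the newline so the last line (with or without '\n') compares equally.
            hashes.add(hashlib.md5(line.rstrip(b"\n")).digest())
    return hashes

def common_hashes(path_a, path_b):
    """Digests present in both files; each digest is 16 bytes, so 10^9 lines
    need ~16 GB of digests -- hence the external-sort discussion below."""
    return line_hashes(path_a) & line_hashes(path_b)
```

Note that only the 16-byte digests are kept in memory, never whole lines; for 10^9 lines even the digest set is large, which motivates the external-sort and partitioning answers that follow.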
I do not see any other option than hashing and then an external sort. But if a line is quite big, we probably need multiple levels of hashing per line — e.g., for a 100-character line with only 25 bytes of memory, hash 20 characters at a time (5 hashes) and then hash those 5 hashes again.

Or do we need to use MapReduce?
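One way the two-level hashing described above could work — this is a sketch of my reading of that answer, not a standard scheme: hash the line in fixed-size chunks, then hash the concatenation of the chunk digests, so only one chunk plus the running digests ever sit in memory.

```python
import hashlib

CHUNK = 20  # bytes hashed at a time (the "20 chars, 5 times" from the answer)

def two_level_hash(line: bytes) -> bytes:
    """Hash fixed-size chunks of a long line, then hash the chunk digests.
    Memory use is bounded by one chunk plus the accumulated 16-byte digests."""
    chunk_digests = b""
    for i in range(0, len(line), CHUNK):
        chunk_digests += hashlib.md5(line[i:i + CHUNK]).digest()
    return hashlib.md5(chunk_digests).digest()
```

Two equal lines always produce equal two-level hashes, so this preserves the property needed for duplicate detection, at the cost of a slightly higher (still negligible) collision probability.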
1. Read the first characters of the lines (let's say memory can accommodate m characters and there are n lines, so read the first m characters at a time, and so on).
2. Categorize the lines by their first character: lines starting with 'a' go into set 'a', and so on.
3. Compare the lines within each set on the remaining characters; discard lines that had no matches.
4. We may compare them somewhat like a k-way merge, deciding as we go which lines to compare and which to discard.
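Once the lines (or line indexes) have been sorted — for instance by the external sort mentioned above — the merge-style comparison in the steps above reduces to a single pass over the two files. A sketch, assuming each file's lines are already sorted and newline-terminated (`common_sorted` is an illustrative name):

```python
def common_sorted(path_a, path_b):
    """Single merge pass over two files whose lines are already sorted;
    yields each line that appears in both. Only two lines are in memory
    at any moment, which is the point of the k-way-merge idea above."""
    with open(path_a) as fa, open(path_b) as fb:
        la, lb = fa.readline(), fb.readline()
        while la and lb:
            if la < lb:
                la = fa.readline()   # a's line can't match anything later in b
            elif la > lb:
                lb = fb.readline()
            else:
                yield la.rstrip("\n")
                la, lb = fa.readline(), fb.readline()
```

This is where the O(N log N) bound from the sorting answer comes from: the sort dominates, and the merge pass itself is linear.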
- Anonymous January 11, 2010

BitVector b_vector = sizeof(no. of lines); // all bits initialized to false

1. Hash every string to a number y.
2. if (b_vector[y] == true)
       found identical line
   else
       b_vector[y] = true;