Assume we have a very large fi

Microsoft Interview Question

2

of 2 votes

15
Answers
Assume we have a very large file containing millions of lines of data. Only 2 lines are identical, rest are unique.
Each line is long enough that they may not fit in memory.
Design an efficient algorithm to determine identical lines?
And, then generalize it for 'n' identical lines.
- Anand Barnwal June 08, 2015 in India | Report Duplicate | Flag | PURGE
Microsoft Algorithm

Email me when people comment.

An error occurred in subscribing you.

Country: India
Interview Type: Phone Interview

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 0 vote

For 2 identical lines, generate hash of line and input into hashtable where key is the hash of the line and the value is the line number. If hashtable contains key, then that line is a duplicate.

For 'n' identical lines, generate hash of line and input into hashtable where key is the hash of the line and value is an array of line numbers like so => {hash: [line number]}. Then check for keys with values of length greater than 1 for identical lines .

- Anonymous June 08, 2015 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

what about collision?

- pc June 08, 2015 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

But it is given that the complete line may not fit in memory.

- Anand Barnwal June 08, 2015 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

how can we create a hash if the entire line doesnt fit in memory?? I think taking the first few words of the line and calculating hash recursively or taking few words and throwing into buckets may solve the problem

- raviteja June 08, 2015 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Use SHA-2 or SHA-3 for hashing, no collisions have been found for that hash algorithm so it will be secure. If you do find a collision then you could potentially be rewarded for breaking the hash algorithm :)

- Anonymous June 09, 2015 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Also you can compress the lines using some compression algorithm (even an easy one like run-length encoding) and hash it and include the line number.

- Anonymous June 09, 2015 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

sha-2 or sha-3 does not have collisions, and compress the line before storing into hashtable (simplest algorithm that comes to mind is run-length encoding, however this is not the optimum compression algorithm).

- Anonymous June 09, 2015 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Instead of compressing, couldn't you just has the longest string that fits into memory, and then hash the hash with the rest. Repeating if needed? No two non-identical strings should collide because at some point their hashes will differ.

(Not an expert on cryptographic hash functions so I don't know how much of an effect rehashing has on collision although I know it does increase the likelyhood).

- SycophantEve June 14, 2015 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

What do you think to use a retrieval of sha1 hashs?
It would be a good approach to compact the data in memory and to perform fast comparisons.

The complexity would be N log L (Where L is the sha1 hash length).

PS: this solution is valid for N repeated lines.

Another interesting point: sha1 or md5 doest not require to read the whole line... so, it does not matter if the line can be stored in memory or not.

The cons about this solution against using a hashset is because a retrieval will compact the data (sha1 hashs).

- Felipe Cerqueira June 08, 2015 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

The standrad solution of this problem is via sorting,
We have merge sort implementation for large array, which will take O((N/M)*LOG_M(N)) I/O Complexity and O(N*LOG(N)) as Internal memory complexity, where N is input size in bytes, and M is internal memory size, once we do, finding identical is easy

Complexity : O(Nlog(N)) Internal memory complexity
O((N/M)Log_M(N)) I/O Complexity.

- sonesh June 09, 2015 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

All the answers involve generating a hash value for each line, however perfect hash functions can only be guaranteed for a known finite set, there may be clashes, so, the solution will take 2 passes thru the file, one pass to generate all the line hashes, which must be regarded as 'candidate's, and another pass to verify that the identical hash values are actually unique lines.

- Anonymous June 09, 2015 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

I would approach using the following steps:
1. Given the whole line doesn't fit memory, you can compact the line, by taking it in chunks and generating an MD5 hash on them
2. Apply merge sort on hashed lines
3. Go through the lines and check if

line[i]==line[i-1]

- Marcello Ghali June 10, 2015 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

This is a problem involving external merge sort,using the sort-merge technique which will
divide the data into smaller chunks that can be easily fit into memory
using Hash may bring in collision.

- Anonymous June 15, 2015 | Flag Reply

Comment hidden because of low score. Click to expand.

-1

of 1 vote

- Alistair June 08, 2015 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

this is a good answer to start with.
However you also need to think for collision. Hash can not be guaranteed to be unique for each unique key

CareerCup

Microsoft Interview Question

Books

Videos

Resume Review

Mock Interviews