Microsoft Interview Question

Country: India
Interview Type: Phone Interview


For 2 identical lines, hash each line and insert it into a hash table whose key is the line's hash and whose value is the line number. If the table already contains the key, that line is a duplicate.

For 'n' identical lines, use the same hash table but make the value a list of line numbers, like so => {hash: [line numbers]}. Then any key whose list has length greater than 1 marks a set of identical lines.
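The hash-table idea can be sketched in Python (assuming for the moment that each line fits in memory; later comments relax this — SHA-256 is my choice, the comment doesn't name a hash):

```python
import hashlib

def find_duplicate_lines(path):
    """Map each line's hash to the line numbers where it occurs."""
    seen = {}  # hash digest -> list of line numbers
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            digest = hashlib.sha256(line).hexdigest()
            seen.setdefault(digest, []).append(lineno)
    # Keys with more than one line number mark identical lines.
    return {d: nums for d, nums in seen.items() if len(nums) > 1}
```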

But it is given that the complete line may not fit in memory.


How can we create a hash if the entire line doesn't fit in memory? I think hashing the line in chunks (take the first few words, hash them, then fold in the next few) or throwing the first few words into buckets may solve the problem.


Use SHA-2 or SHA-3 for hashing; no collisions have ever been found for those algorithms, so collisions are not a practical concern. If you do find a collision, you could potentially be rewarded for breaking the hash algorithm :)
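Python's hashlib accepts input incrementally via update(), so a line can be hashed in fixed-size chunks without ever holding the whole line in memory. A sketch (the one-line-at-a-time framing and the chunk size are my own assumptions):

```python
import hashlib

def hash_line_in_chunks(stream, chunk_size=4096):
    """Hash one newline-terminated line from `stream` without
    ever holding the whole line in memory."""
    h = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # EOF: hash whatever was read
            break
        newline = chunk.find(b"\n")
        if newline != -1:
            h.update(chunk[:newline])
            # Seek back so the stream is positioned just past the newline.
            stream.seek(newline + 1 - len(chunk), 1)
            break
        h.update(chunk)
    return h.hexdigest()
```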


You can also compress the lines with some compression algorithm (even a simple one like run-length encoding), hash the result, and record the line number.


No collisions are known for SHA-2 or SHA-3, and you can compress the line before storing it in the hash table (the simplest algorithm that comes to mind is run-length encoding, though it is far from the optimal compression algorithm).
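For illustration, a naive byte-level run-length encoder — the thread doesn't specify an encoding, so this (count, value) pair format is an assumption:

```python
def rle_compress(data: bytes) -> bytes:
    """Naive run-length encoding: emit (count, value) byte pairs,
    with runs capped at 255 so the count fits in one byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out.append(run)
        out.append(data[i])
        i += run
    return bytes(out)
```

Note that RLE only helps on lines with long repeated runs; on typical text it can even expand the data.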


Instead of compressing, couldn't you just hash the longest prefix that fits into memory, and then hash that hash together with the rest, repeating if needed? No two non-identical strings should collide, because at some point their hashes will differ.

(I'm not an expert on cryptographic hash functions, so I don't know how much rehashing affects the collision rate, although I know it does increase the likelihood.)
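The hash-the-hash-with-the-rest idea might look like this in Python (SHA-256 and the chunk interface are my assumptions; it is a crude version of the chaining that real hash constructions perform internally, and since chunk boundaries affect the digest, every line must be split with the same chunk size):

```python
import hashlib

def chained_hash(chunks):
    """Hash the first chunk, then repeatedly hash (previous digest
    + next chunk), so only one chunk plus a digest is ever in memory."""
    digest = b""
    for chunk in chunks:
        digest = hashlib.sha256(digest + chunk).digest()
    return digest.hex()
```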


What do you think of using a trie (retrieval tree) of SHA-1 hashes? It would be a good approach to compact the data in memory and to perform fast comparisons.

The complexity would be O(N log L), where L is the SHA-1 hash length.

PS: this solution is valid for N repeated lines.

Another interesting point: SHA-1 and MD5 do not require reading the whole line at once, so it does not matter whether the line fits in memory.

The advantage of this solution over a plain hash set is that the trie compacts the data: SHA-1 hashes that share a prefix share storage.
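Reading "retrieval" as a trie keyed on digest characters, a minimal sketch — the TrieNode class and the hex-character branching are my own choices, not from the comment:

```python
import hashlib

class TrieNode:
    __slots__ = ("children", "line_numbers")
    def __init__(self):
        self.children = {}       # hex char -> TrieNode
        self.line_numbers = []   # non-empty only at digest ends

def insert_digest(root, digest_hex, lineno):
    """Walk the trie one hex character at a time; shared prefixes
    of different digests are stored only once."""
    node = root
    for ch in digest_hex:
        node = node.children.setdefault(ch, TrieNode())
    node.line_numbers.append(lineno)
    return node.line_numbers  # more than one entry => repeated line
```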


The standard solution to this problem is sorting. An external merge sort handles arrays larger than memory, taking O((N/M) log_M(N)) I/O complexity and O(N log N) internal-memory complexity, where N is the input size in bytes and M is the internal memory size. Once the file is sorted, identical lines are adjacent and easy to find.

Complexity: O(N log N) internal-memory complexity,
O((N/M) log_M(N)) I/O complexity.
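A compact external merge sort sketch in Python — the chunk size, the temp-file handling, and the assumption that every line ends in a newline are mine:

```python
import heapq, itertools, os, tempfile

def external_sort(path, max_lines_in_memory=100_000):
    """Sort a large file's lines with bounded memory: sort chunks
    that fit in memory into temp-file runs, then k-way merge them."""
    runs = []
    with open(path, "rb") as f:
        while True:
            chunk = list(itertools.islice(f, max_lines_in_memory))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.NamedTemporaryFile("w+b", delete=False)
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    out = path + ".sorted"
    with open(out, "wb") as dst:
        # heapq.merge lazily merges the already-sorted runs.
        dst.writelines(heapq.merge(*runs))
    for run in runs:
        run.close()
        os.unlink(run.name)
    return out
```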


All the answers involve generating a hash value for each line, but a perfect hash function can only be guaranteed for a known finite set, so there may be collisions. The solution therefore takes two passes through the file: one pass to generate all the line hashes, which must be regarded as candidates, and another pass to verify that lines with identical hash values really are identical lines.
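The verification pass needs to compare two candidate lines without loading either one fully. One way, assuming the first pass recorded a byte offset for each line (the offsets and chunk size are my assumptions):

```python
def lines_equal(path, offset_a, offset_b, chunk_size=4096):
    """Compare two lines byte-for-byte by streaming chunks from
    their byte offsets, so neither line is fully loaded."""
    with open(path, "rb") as fa, open(path, "rb") as fb:
        fa.seek(offset_a)
        fb.seek(offset_b)
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            end_a = a.find(b"\n")
            end_b = b.find(b"\n")
            if end_a != -1 or end_b != -1 or not a or not b:
                # At least one line has ended: compare final pieces.
                a = a if end_a == -1 else a[:end_a]
                b = b if end_b == -1 else b[:end_b]
                return a == b
            if a != b:
                return False
```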


I would approach it using the following steps:
1. Given that a whole line may not fit in memory, compact each line by reading it in chunks and feeding them into an MD5 hash
2. Apply merge sort to the hashed lines
3. Go through the sorted hashes and check whether

``line[i]==line[i-1]``
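Steps 1-3 above can be sketched as follows (MD5 per the comment; whole lines are read here for brevity, though step 1 would feed the hash in chunks):

```python
import hashlib

def duplicates_by_sorting(path):
    """Sort (digest, line_number) pairs, then a single scan
    finds equal digests in adjacent positions."""
    pairs = []
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            pairs.append((hashlib.md5(line).hexdigest(), lineno))
    pairs.sort()
    dups = []
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:  # line[i] == line[i-1]
            dups.append((pairs[i - 1][1], pairs[i][1]))
    return dups
```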


This is a problem for external merge sort: the sort-merge technique divides the data into smaller chunks that can easily fit into memory. Using a hash may bring in collisions.

However, you also need to account for collisions: a hash cannot be guaranteed to be unique for each unique key.
