Microsoft Interview Question


Country: India
Interview Type: Phone Interview




To detect 2 identical lines, hash each line and insert it into a hashtable where the key is the hash of the line and the value is the line number. If the hashtable already contains that key, the line is a duplicate.

For 'n' identical lines, make the value an array of line numbers instead, like so => {hash: [line numbers]}. Then every key whose array has length greater than 1 marks a group of identical lines.
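
A minimal Python sketch of the 'n identical lines' bookkeeping (the function name, the use of Python's built-in hash(), and the dict layout are illustrative assumptions; collisions are ignored here, as the replies point out):

```python
from collections import defaultdict

def find_identical_lines(path):
    """Map each line's hash to the line numbers where it occurs."""
    seen = defaultdict(list)                 # hash -> [line numbers]
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            seen[hash(line)].append(lineno)
    # Groups with more than one line number are candidate duplicates.
    return {h: nums for h, nums in seen.items() if len(nums) > 1}
```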

- Anonymous, June 08, 2015

What about collisions?

- pc, June 08, 2015

But it is given that the complete line may not fit in memory.

- Anand Barnwal, June 08, 2015

How can we create a hash if the entire line doesn't fit in memory? Taking the first few words of the line and calculating the hash recursively, or hashing a few words at a time into buckets, may solve the problem.
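
That chunked idea is essentially what streaming hash APIs already provide: feed the line to the hash one fixed-size chunk at a time, so only one chunk is ever in memory. A sketch using Python's hashlib (the chunk size, the binary file object, and the SHA-256 choice are my assumptions):

```python
import hashlib

def hash_line_in_chunks(f, chunk_size=64 * 1024):
    """Hash one newline-terminated line from binary file object f,
    holding at most chunk_size bytes of it in memory at a time."""
    h = hashlib.sha256()
    while True:
        chunk = f.readline(chunk_size)   # up to chunk_size bytes,
        if not chunk:                    # stopping early at a newline
            return None                  # end of file
        h.update(chunk)                  # incremental update
        if chunk.endswith(b"\n"):        # reached the end of the line
            return h.hexdigest()
```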

- raviteja, June 08, 2015

Use SHA-2 or SHA-3 for hashing; no collisions have ever been found for those algorithms, so in practice matching digests can be treated as matching lines. If you do find a collision, you could potentially be rewarded for breaking the hash algorithm :)

- Anonymous, June 09, 2015

You can also compress each line with some compression algorithm (even a simple one like run-length encoding), hash the compressed form, and record the line number.

- Anonymous, June 09, 2015

SHA-2 and SHA-3 have no known collisions. Compress the line before storing it in the hashtable (the simplest algorithm that comes to mind is run-length encoding, though it is far from an optimal compressor).
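
For illustration only, a toy run-length encoder (as noted, far from optimal; it only shrinks lines that contain long runs of repeated characters, and can even grow other lines):

```python
from itertools import groupby

def rle(s):
    """Run-length encode a string: 'aaabcc' -> 'a3b1c2'."""
    return "".join(f"{ch}{sum(1 for _ in run)}" for ch, run in groupby(s))
```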

- Anonymous, June 09, 2015

Instead of compressing, couldn't you just hash the longest prefix that fits into memory, and then hash that digest together with the rest, repeating if needed? Two non-identical strings should be very unlikely to collide, because at some point their intermediate hashes will differ.

(Not an expert on cryptographic hash functions, so I don't know how much of an effect rehashing has on collisions, although I know it does increase the likelihood.)
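
A sketch of that chaining scheme, folding each block into the running digest (the block size and SHA-256 choice are arbitrary; whether chaining preserves the hash's collision resistance is exactly the open question here):

```python
import hashlib

def chained_hash(f, block_size=64 * 1024):
    """Hash a stream in blocks: digest_i = H(digest_{i-1} || block_i),
    so only one block plus one digest is in memory at a time."""
    digest = b""
    while True:
        block = f.read(block_size)
        if not block:                    # end of stream
            return digest.hex()
        digest = hashlib.sha256(digest + block).digest()
```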

- SycophantEve, June 14, 2015

What do you think of using a trie (retrieval tree) of SHA-1 hashes?
It would be a good way to compact the data in memory and to perform fast comparisons.

The complexity would be O(N*L), where L is the SHA-1 digest length, since each trie operation walks at most L characters.

PS: this solution is valid for N repeated lines.

Another interesting point: SHA-1 and MD5 can be computed incrementally, so they do not require reading the whole line at once... so it does not matter whether the line can be stored in memory or not.

Compared with a plain hashset, the advantage of this solution is that the trie compacts the data (shared digest prefixes are stored only once).
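
Reading "retrieval" as a trie (retrieval tree) keyed on hex digests, a minimal sketch (the dict-based node layout and helper names are assumptions; each insert walks the L characters of the digest):

```python
import hashlib

def make_trie():
    return {}

def trie_add(trie, digest, lineno):
    """Walk the trie one hex character at a time (O(L) per insert);
    line numbers sharing a digest collect at the leaf."""
    node = trie
    for ch in digest:
        node = node.setdefault(ch, {})
    lines = node.setdefault("lines", [])
    lines.append(lineno)
    return lines            # all line numbers seen for this digest
```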

- Felipe Cerqueira, June 08, 2015

The standard solution to this problem is via sorting. We have external merge sort implementations for data larger than memory, which take O((N/M) * log_M(N)) I/O complexity and O(N log N) internal-memory complexity, where N is the input size in bytes and M is the internal memory size. Once the data is sorted, identical lines are adjacent, so finding them is easy.

Complexity: O(N log N) internal-memory complexity,
O((N/M) * log_M(N)) I/O complexity.
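
A compact sketch of external merge sort in Python: sort fixed-size runs in memory, spill them to temporary files, then k-way merge with heapq.merge (run_size stands in for the memory budget M; function and parameter names are mine):

```python
import heapq
import tempfile

def external_sort(lines, run_size=100_000):
    """Sort an iterable of lines with bounded memory: write sorted
    runs to temp files, then lazily k-way merge the runs."""
    runs, buf = [], []

    def flush():
        f = tempfile.TemporaryFile("w+")
        f.writelines(sorted(buf))        # sort one run in memory
        f.seek(0)
        runs.append(f)
        buf.clear()

    for line in lines:
        buf.append(line)
        if len(buf) >= run_size:
            flush()
    if buf:
        flush()
    return heapq.merge(*runs)            # merge phase, streaming
```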

- sonesh, June 09, 2015

All the answers involve generating a hash value for each line. However, a perfect hash function can only be guaranteed for a known, finite set, so there may be clashes. The solution should therefore take two passes through the file: one pass to generate all the line hashes, which must be regarded as 'candidates', and another pass to verify that lines with identical hash values really are identical lines.
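
A sketch of that two-pass scheme (assumes the comparatively few candidate lines fit in memory for the verification pass; Python's built-in hash() stands in for the line hash):

```python
from collections import Counter, defaultdict

def confirmed_duplicates(path):
    """Pass 1 collects candidate hashes; pass 2 verifies by full text."""
    # Pass 1: hashes seen only once cannot belong to duplicates.
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[hash(line)] += 1
    candidates = {h for h, c in counts.items() if c > 1}
    # Pass 2: group candidate lines by their full text, which resolves
    # any hash collisions between different lines.
    groups = defaultdict(list)
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if hash(line) in candidates:
                groups[line].append(lineno)
    return {text: nums for text, nums in groups.items() if len(nums) > 1}
```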

- Anonymous, June 09, 2015

I would approach it using the following steps:
1. Since the whole line doesn't fit in memory, compact it by taking it in chunks and feeding them into an MD5 hash.
2. Apply merge sort to the hashed lines.
3. Go through the sorted lines and check if

line[i]==line[i-1]
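
The three steps can be sketched as follows (MD5 fed line chunks incrementally, then sort and compare neighbouring digests; the chunk size and function name are arbitrary):

```python
import hashlib

def duplicate_line_pairs(path, chunk_size=64 * 1024):
    """Hash each line in chunks, sort by digest, compare neighbours."""
    hashed, lineno = [], 0
    h, pending = hashlib.md5(), False
    with open(path, "rb") as f:
        while True:
            chunk = f.readline(chunk_size)   # step 1: bounded reads
            if not chunk:
                break
            h.update(chunk)
            pending = True
            if chunk.endswith(b"\n"):        # line finished
                lineno += 1
                hashed.append((h.hexdigest(), lineno))
                h, pending = hashlib.md5(), False
    if pending:                              # last line had no newline
        hashed.append((h.hexdigest(), lineno + 1))
    hashed.sort()                            # step 2 (in-memory stand-in)
    # Step 3: equal neighbouring digests mark candidate duplicates.
    return [(a, b) for (ha, a), (hb, b) in zip(hashed, hashed[1:]) if ha == hb]
```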

- Marcello Ghali, June 10, 2015

This is a problem for external merge sort: the sort-merge technique divides the data into smaller chunks that easily fit into memory. Using a hash may bring in collisions.

- Anonymous, June 15, 2015

To detect 2 identical lines, hash each line and insert it into a hashtable where the key is the hash of the line and the value is the line number. If the hashtable already contains that key, the line is a duplicate.

For 'n' identical lines, make the value an array of line numbers instead, like so => {hash: [line numbers]}. Then every key whose array has length greater than 1 marks a group of identical lines.

- Alistair, June 08, 2015

This is a good answer to start with. However, you also need to think about collisions: a hash cannot be guaranteed to be unique for each unique key.

- pc, June 08, 2015

