Microsoft Interview Question
Country: India
Interview Type: Phone Interview
how can we create a hash if the entire line doesnt fit in memory?? I think taking the first few words of the line and calculating hash recursively or taking few words and throwing into buckets may solve the problem
Use SHA-2 or SHA-3 for hashing, no collisions have been found for that hash algorithm so it will be secure. If you do find a collision then you could potentially be rewarded for breaking the hash algorithm :)
Also you can compress the lines using some compression algorithm (even an easy one like run-length encoding) and hash it and include the line number.
sha-2 or sha-3 does not have collisions, and compress the line before storing into hashtable (simplest algorithm that comes to mind is run-length encoding, however this is not the optimum compression algorithm).
Instead of compressing, couldn't you just has the longest string that fits into memory, and then hash the hash with the rest. Repeating if needed? No two non-identical strings should collide because at some point their hashes will differ.
(Not an expert on cryptographic hash functions so I don't know how much of an effect rehashing has on collision although I know it does increase the likelyhood).
What do you think to use a retrieval of sha1 hashs?
It would be a good approach to compact the data in memory and to perform fast comparisons.
The complexity would be N log L (Where L is the sha1 hash length).
PS: this solution is valid for N repeated lines.
Another interesting point: sha1 or md5 doest not require to read the whole line... so, it does not matter if the line can be stored in memory or not.
The cons about this solution against using a hashset is because a retrieval will compact the data (sha1 hashs).
The standrad solution of this problem is via sorting,
We have merge sort implementation for large array, which will take O((N/M)*LOG_M(N)) I/O Complexity and O(N*LOG(N)) as Internal memory complexity, where N is input size in bytes, and M is internal memory size, once we do, finding identical is easy
Complexity : O(Nlog(N)) Internal memory complexity
O((N/M)Log_M(N)) I/O Complexity.
All the answers involve generating a hash value for each line, however perfect hash functions can only be guaranteed for a known finite set, there may be clashes, so, the solution will take 2 passes thru the file, one pass to generate all the line hashes, which must be regarded as 'candidate's, and another pass to verify that the identical hash values are actually unique lines.
For 2 identical lines, generate hash of line and input into hashtable where key is the hash of the line and the value is the line number. If hashtable contains key, then that line is a duplicate.
For 'n' identical lines, generate hash of line and input into hashtable where key is the hash of the line and value is an array of line numbers like so => {hash: [line number]}. Then check for keys with values of length greater than 1 for identical lines .
For 2 identical lines, generate hash of line and input into hashtable where key is the hash of the line and the value is the line number. If hashtable contains key, then that line is a duplicate.
For 'n' identical lines, generate hash of line and input into hashtable where key is the hash of the line and value is an array of line numbers like so => {hash: [line number]}. Then check for keys with values of length greater than 1 for identical lines .
For 2 identical lines, generate hash of line and input into hashtable where key is the hash of the line and the value is the line number. If hashtable contains key, then that line is a duplicate.
- Anonymous June 08, 2015For 'n' identical lines, generate hash of line and input into hashtable where key is the hash of the line and value is an array of line numbers like so => {hash: [line number]}. Then check for keys with values of length greater than 1 for identical lines .