Yahoo Interview Question
Software Engineer / Developers
The file name, including its path, comes after the FQDN (fully qualified domain name).
So the first step would be to extract that using a regular expression. Then add the file path (which includes the file's directory hierarchy on the server) to a data structure of your choice. Something with low lookup and insertion time would do great. A hashtable would be my preferred data structure, although a BST would also work.
If a file path already exists in the data structure, flag it as a duplicate instead of adding it again.
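A minimal sketch of this idea, assuming a Python set as the hashtable and `urlparse` for the path extraction (the example URL is a placeholder):

```python
from urllib.parse import urlparse

seen_paths = set()  # hash-based set: O(1) average insert and lookup

def is_duplicate_path(url: str) -> bool:
    """Return True if this URL's file path was already seen."""
    path = urlparse(url).path  # everything after the FQDN, e.g. "/docs/index.html"
    if path in seen_paths:
        return True
    seen_paths.add(path)
    return False

print(is_duplicate_path("http://example.com/docs/index.html"))  # False
print(is_duplicate_path("http://example.com/docs/index.html"))  # True
```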
It is also important how you define "duplicate" documents. Do duplicate docs mean those which are exactly the same? I feel there should be a threshold for two docs to be considered duplicates, i.e. the number of words they share as well as those words' positions of occurrence in the document.
I think a bag-of-words approach, along with storing the position of occurrence of each word, should work well.
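A minimal sketch of that bag-of-words idea. The overlap measure and the 0.9 threshold are assumptions for illustration, not part of the original answer:

```python
from collections import defaultdict

def bag_of_words(text: str) -> dict:
    """Map each word to the list of positions where it occurs."""
    bag = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        bag[word].append(pos)
    return bag

def similarity(a: str, b: str) -> float:
    """Fraction of (word, position) pairs shared by both documents."""
    pairs_a = {(w, p) for w, ps in bag_of_words(a).items() for p in ps}
    pairs_b = {(w, p) for w, ps in bag_of_words(b).items() for p in ps}
    if not pairs_a and not pairs_b:
        return 1.0
    return len(pairs_a & pairs_b) / max(len(pairs_a), len(pairs_b))

def are_duplicates(a: str, b: str, threshold: float = 0.9) -> bool:
    return similarity(a, b) >= threshold
```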
Hashing or calculating a CRC is a good idea. CRC comes in a few flavors too: full CRC (longer) vs. quick CRC (shorter, based on a few keywords). So the matching sequence can be something like this (a code sketch follows the list):
1. URL match (handle IP address vs. hostname, default :80 port, redundant www., query strings)
2. Match quick CRC (retrieved from in-memory hashtable)
3. If equal, match full CRC (retrieved from in-memory hashtable)
4. If equal, match the document contents (from disk)
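A minimal sketch of this cascade. The normalization rules, `zlib.crc32` over a 1 KB prefix standing in for the "quick" CRC, and an in-memory dict standing in for the on-disk page store are all assumptions for illustration:

```python
import zlib
from urllib.parse import urlparse

seen_urls = set()
by_quick_crc = {}  # quick CRC -> list of (full CRC, page bytes)

def normalize_url(url: str) -> str:
    """Step 1: lowercase host, drop default :80 port, redundant www., query."""
    p = urlparse(url)
    host = p.hostname or ""  # hostname is lowercased and stripped of the port
    if host.startswith("www."):
        host = host[len("www."):]
    return f"{p.scheme}://{host}{p.path}"

def is_duplicate(url: str, page: bytes) -> bool:
    if normalize_url(url) in seen_urls:       # 1. URL match
        return True
    seen_urls.add(normalize_url(url))
    quick = zlib.crc32(page[:1024])           # 2. quick CRC over a prefix
    full = zlib.crc32(page)                   # 3. full CRC over the whole page
    for cand_full, cand_page in by_quick_crc.get(quick, []):
        if cand_full == full and cand_page == page:  # 4. content match
            return True
    by_quick_crc.setdefault(quick, []).append((full, page))
    return False
```

In a real crawler the step-4 comparison would read the candidate page back from disk; it is kept in memory here only to keep the example self-contained.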
People suggesting a B-tree, trie, or suffix tree don't seem to understand the problem (or they have misread it as checking for duplicate URLs, not duplicate pages).
Computing a hash (CRC, whatever) is the right way to go. Once you have created a hash for each page, you can store it in a hashtable, B-tree, whatever.
The point is not to compare the pages each time, since they are HUGE.
If you understand the motivation behind hashing, this should come across as a _basic_ hashing problem.
NOTE: THE PROBLEM IS TO CHECK FOR DUPLICATE PAGES, NOT DUPLICATE URLS.
I suggest a cryptographic hash function, SHA-1 or MD5. They are expensive but very good hash functions.
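A minimal sketch of that approach, hashing each page with SHA-1 from Python's `hashlib` and keeping only the fixed-size digests in a hash set (MD5 would work the same way via `hashlib.md5`):

```python
import hashlib

seen_digests = set()

def is_duplicate_page(page: bytes) -> bool:
    """Compare fixed-size digests instead of the (huge) pages themselves."""
    digest = hashlib.sha1(page).digest()  # 20 bytes regardless of page size
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```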
- vodangkhoa June 05, 2007