Yahoo Interview Question
Software Engineer / Developers
The file name, including its path, comes after the FQDN (fully qualified domain name).
So the first step would be to extract that using a regular expression. Then add the file path (which includes the file's directory hierarchy on the server) to a data structure of your choice. Something with low lookup and insertion time would do great. A hashtable would be my preferred data structure, although a BST would also work.
If a file path already exists in the data structure, flag it as a duplicate instead of adding it again.
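A minimal sketch of this idea, assuming a Python set as the hashtable and `urlparse` for the path extraction (the example URL is a placeholder):

```python
from urllib.parse import urlparse

seen_paths = set()  # hash-based set: O(1) average insert and lookup

def is_duplicate_path(url: str) -> bool:
    """Return True if this URL's file path was already seen."""
    path = urlparse(url).path  # everything after the FQDN, e.g. "/docs/index.html"
    if path in seen_paths:
        return True
    seen_paths.add(path)
    return False

print(is_duplicate_path("http://example.com/docs/index.html"))  # False
print(is_duplicate_path("http://example.com/docs/index.html"))  # True
```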
It is also important how you define "duplicate" documents. Do duplicate docs mean those which are exactly the same? I feel there should be a threshold for two docs to be considered duplicates, i.e. the number of words they share as well as those words' positions of occurrence in the document.
I think a bag-of-words approach, along with storing the position of occurrence of each word, should work well.
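A minimal sketch of that bag-of-words idea. The overlap measure and the 0.9 threshold are assumptions for illustration, not part of the original answer:

```python
from collections import defaultdict

def bag_of_words(text: str) -> dict:
    """Map each word to the list of positions where it occurs."""
    bag = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        bag[word].append(pos)
    return bag

def similarity(a: str, b: str) -> float:
    """Fraction of (word, position) pairs shared by both documents."""
    pairs_a = {(w, p) for w, ps in bag_of_words(a).items() for p in ps}
    pairs_b = {(w, p) for w, ps in bag_of_words(b).items() for p in ps}
    if not pairs_a and not pairs_b:
        return 1.0
    return len(pairs_a & pairs_b) / max(len(pairs_a), len(pairs_b))

def are_duplicates(a: str, b: str, threshold: float = 0.9) -> bool:
    return similarity(a, b) >= threshold
```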
Hashing or calculating a CRC is a good idea. CRC comes in a few flavors too: full CRC (longer) vs. quick CRC (shorter, based on a few keywords). So the matching sequence can be something like this (a code sketch follows the list):
1. URL match (handle IP address vs. hostname, default :80 port, redundant www., query strings)
2. Match quick CRC (retrieved from in-memory hashtable)
3. If equal, match full CRC (retrieved from in-memory hashtable)
4. If equal, match the document contents (from disk)
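A minimal sketch of this cascade. The normalization rules, `zlib.crc32` over a 1 KB prefix standing in for the "quick" CRC, and an in-memory dict standing in for the on-disk page store are all assumptions for illustration:

```python
import zlib
from urllib.parse import urlparse

seen_urls = set()
by_quick_crc = {}  # quick CRC -> list of (full CRC, page bytes)

def normalize_url(url: str) -> str:
    """Step 1: lowercase host, drop default :80 port, redundant www., query."""
    p = urlparse(url)
    host = p.hostname or ""  # hostname is lowercased and stripped of the port
    if host.startswith("www."):
        host = host[len("www."):]
    return f"{p.scheme}://{host}{p.path}"

def is_duplicate(url: str, page: bytes) -> bool:
    if normalize_url(url) in seen_urls:       # 1. URL match
        return True
    seen_urls.add(normalize_url(url))
    quick = zlib.crc32(page[:1024])           # 2. quick CRC over a prefix
    full = zlib.crc32(page)                   # 3. full CRC over the whole page
    for cand_full, cand_page in by_quick_crc.get(quick, []):
        if cand_full == full and cand_page == page:  # 4. content match
            return True
    by_quick_crc.setdefault(quick, []).append((full, page))
    return False
```

In a real crawler the step-4 comparison would read the candidate page back from disk; it is kept in memory here only to keep the example self-contained.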
People suggesting a B-tree, trie, or suffix tree don't seem to understand the problem (or they have misread it as checking for duplicate URLs, not duplicate pages).
Computing a hash (CRC, whatever) is the right way to go. Once you have created a hash for each page, you can store it in a hashtable, B-tree, whatever.
The point is not to compare the pages each time, since they are HUGE.
If you understand the motivation behind hashing, this should come across as a _basic_ hashing problem.
NOTE: THE PROBLEM IS TO CHECK FOR DUPLICATE PAGES, NOT DUPLICATE URLS.
I suggest a cryptographic hash function, SHA-1 or MD5. They are expensive but very good hash functions.
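A minimal sketch of that approach, hashing each page with SHA-1 from Python's `hashlib` and keeping only the fixed-size digests in a hash set (MD5 would work the same way via `hashlib.md5`):

```python
import hashlib

seen_digests = set()

def is_duplicate_page(page: bytes) -> bool:
    """Compare fixed-size digests instead of the (huge) pages themselves."""
    digest = hashlib.sha1(page).digest()  # 20 bytes regardless of page size
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```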
- vodangkhoa June 05, 2007