Yahoo Interview Question for Software Engineer / Developers







Suggest a cryptographic hash function, such as SHA-1 or MD5. They are expensive but very good hash functions.
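
A minimal sketch of that idea in Python, assuming SHA-1 and an in-memory set of digests (the function and variable names here are mine):

{{{
import hashlib

seen_digests = set()

def content_fingerprint(document_text):
    # Identical documents always produce identical digests.
    return hashlib.sha1(document_text.encode("utf-8")).hexdigest()

def is_duplicate(document_text):
    # True if a byte-for-byte identical document was seen before.
    digest = content_fingerprint(document_text)
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
}}}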

- vodangkhoa June 05, 2007 | Flag Reply

I said hashing the document contents, so the interviewer asked whether I knew which hash function I should use. I had no clue.

- liandage June 06, 2007 | Flag Reply

The file name, including its path, comes after the FQDN (fully qualified domain name).

So the first part would be to extract that using a regular expression. Then add the file path (which includes the file's hierarchy on the server) to a data structure of your choice. Something with low lookup and insertion time would do great. A hash table would be my preferred data structure, although a BST would work well too.

If a file path already exists in your data structure, do not add it again; flag it as a duplicate instead.
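
A rough sketch of that idea in Python, with a hash-based set standing in for the hash table (the regular expression and the names here are mine, not a definitive implementation):

{{{
import re

# Capture everything that follows the scheme and the fully qualified domain name.
PATH_AFTER_FQDN = re.compile(r"^[a-z]+://[^/]+(/.*)?$", re.IGNORECASE)

seen_paths = set()

def add_url(url):
    match = PATH_AFTER_FQDN.match(url)
    if not match:
        return False                      # ignore anything that is not a URL
    path = match.group(1) or "/"          # treat a bare domain as "/"
    if path in seen_paths:
        return False                      # duplicate: flag it, do not add again
    seen_paths.add(path)
    return True
}}}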

- Anon June 06, 2007 | Flag Reply

It is also important how you define "duplicate" documents. Does "duplicate" mean documents that are exactly the same? I feel there should be a threshold for two documents to be considered duplicates, i.e. based on the number of words they have in common as well as the positions at which those words occur in the documents.

I think a bag-of-words approach, combined with storing the positions at which words occur, should work well.
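
A rough sketch of such a threshold comparison, using Jaccard-style overlap between the word bags (the 0.9 threshold and the helper names are mine; positions could be folded in by counting (word, index) pairs instead):

{{{
from collections import Counter

def bag_of_words(text):
    # Word -> number of occurrences.
    return Counter(text.lower().split())

def similarity(doc_a, doc_b):
    # Overlap between the two bags, as a fraction between 0 and 1.
    bag_a, bag_b = bag_of_words(doc_a), bag_of_words(doc_b)
    shared = sum((bag_a & bag_b).values())
    total = sum((bag_a | bag_b).values())
    return shared / total if total else 1.0

def are_duplicates(doc_a, doc_b, threshold=0.9):
    return similarity(doc_a, doc_b) >= threshold
}}}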

- AlgoFreak June 07, 2007 | Flag Reply

Yes, it depends on how you define 'duplicate' documents. Note that 'similar' does not equal 'duplicate'.

- nunbit romance June 09, 2007 | Flag Reply

I would like to use a balanced tree such as a B+ tree. The search time is O(t log_t n), and the height of the tree is very small (say the branching factor is 1000; then the depth is only 2 for a million documents). It also needs fewer I/O operations, which matters since most of the data has to be stored on disk.
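
A quick sanity check on the height claim (the document counts below are just examples):

{{{
def bplus_tree_height(n, branching_factor=1000):
    # Smallest number of levels whose capacity covers n keys.
    height, capacity = 1, branching_factor
    while capacity < n:
        capacity *= branching_factor
        height += 1
    return height

print(bplus_tree_height(10**6))   # 2 levels cover a million documents
print(bplus_tree_height(10**9))   # 3 levels cover a billion documents
}}}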

Another choice is a hash table.

- tom jerry April 22, 2008 | Flag Reply

If duplicate detection can be approximate, one can use Bloom filters.
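
A minimal Bloom filter sketch, assuming double hashing over MD5/SHA-1 digests is acceptable (the bit-array size and hash count below are arbitrary):

{{{
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive the k bit positions from two digests (double hashing).
        h1 = int(hashlib.md5(item.encode("utf-8")).hexdigest(), 16)
        h2 = int(hashlib.sha1(item.encode("utf-8")).hexdigest(), 16)
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely new; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
}}}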

- Anonymous May 18, 2008 | Flag Reply

How about a hashed bag-of-words approach? Create a bag of words (further optimization might be possible by removing common words) and hash that bag of words.
Now compare the hashes.
(It doesn't seem quite right, though. :) )
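
Something along these lines, perhaps (the stop-word list is illustrative, and hashing the sorted vocabulary is just one way to make the fingerprint order-independent):

{{{
import hashlib

COMMON_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def bag_hash(text):
    # Normalize, drop common words, and hash the sorted remaining vocabulary
    # so that word order does not affect the fingerprint.
    words = sorted(set(text.lower().split()) - COMMON_WORDS)
    return hashlib.md5(" ".join(words).encode("utf-8")).hexdigest()

# Two documents are flagged as duplicates when bag_hash(a) == bag_hash(b).
}}}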

- whattay!! September 06, 2008 | Flag Reply

Hashing or calculating a CRC is a good idea. CRCs come in a few flavors too: a full CRC (longer) vs. a quick CRC (shorter, based on a few keywords). So the matching sequence can be something like this (sketched below):
1. Match the URL (handling IP address vs. hostname, :80, a redundant www., '?', etc.)
2. Match the quick CRC (retrieved from an in-memory hash table)
3. If equal, match the full CRC (retrieved from an in-memory hash table)
4. If equal, match the document contents (from disk)
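
A compressed sketch of steps 2-4, using the CRC of a short prefix in place of the keyword-based quick CRC, and a hypothetical read_from_disk callback for the final comparison:

{{{
import zlib

index = {}   # quick CRC -> list of (full CRC, path of the copy cached on disk)

def is_duplicate(document_bytes, disk_path, read_from_disk):
    # Step 2: quick CRC over a short prefix.
    quick = zlib.crc32(document_bytes[:4096])
    # Step 3: full CRC, compared only against entries whose quick CRC matched.
    full = zlib.crc32(document_bytes)
    for stored_full, stored_path in index.get(quick, []):
        if stored_full == full:
            # Step 4: byte-for-byte comparison against the copy cached on disk.
            if read_from_disk(stored_path) == document_bytes:
                return True
    index.setdefault(quick, []).append((full, disk_path))
    return False
}}}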

- Hashing January 11, 2009 | Flag Reply

Use either suffix trees or a simple Linux command:
sort urlfile.txt | uniq

- piyush1989kapoor February 09, 2009 | Flag Reply

Why not a trie?

- Time Pass ! March 12, 2009 | Flag Reply

People suggesting a B-tree, trie, or suffix tree don't seem to realize the problem (or they seem to have misread it as checking for duplicate URLs, not duplicate pages).

Computing a hash (CRC, whatever) for each page is the right way to go. Once you have a hash for each page, you can use a hash table, a B-tree, whatever.

The point is not to compare the pages themselves each time, since they are HUGE.

If you understand the motivation behind hashing, this should come across as a _basic_ hashing problem.

NOTE: THE PROBLEM IS TO CHECK FOR DUPLICATE PAGES, NOT DUPLICATE URLS.
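
To make the hashing approach concrete (SHA-256 is arbitrary here; any strong digest works, and the names are mine):

{{{
import hashlib

page_index = {}   # content digest -> URL of the first page seen with that content

def record_page(url, page_bytes):
    # Compare fixed-size digests instead of the huge pages themselves.
    digest = hashlib.sha256(page_bytes).hexdigest()
    if digest in page_index:
        return page_index[digest]   # the URL we already crawled with this content
    page_index[digest] = url
    return None                     # not a duplicate
}}}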

- T March 13, 2009 | Flag Reply

I think identifying duplicate URLs will be enough to find duplicate documents. We can use a hash table to identify duplicate URLs.

- ayub June 13, 2007 | Flag Reply

Are you serious?

- Anonymous April 17, 2008 | Flag

LOL LOL

- wtf July 21, 2011 | Flag

