Design a system for showing qu

Facebook Interview Question for Software Engineer / Developers

0

of 0 votes

3
Answers
Design a system for showing quotes on the web?

For example, when the user is looking at page A, part of which is reproduced in page B, the system could highlight part of page A present the user with a link to page B.
What constitutes a quote?
How do you find quotes?
How do you make it scale to the web?
How do you handle updates?
How would you arrange the servers?
What data structures would you use?
- Steve September 24, 2012 in United States | Report Duplicate | Flag | PURGE
Facebook Software Engineer / Developer Application / UI Design

Email me when people comment.

An error occurred in subscribing you.

Country: United States
Interview Type: In-Person

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 0 vote

I would chunk each page in blocks of text (I can leverage the DOM of the page... the way you optimize this step deeply affect the final result). Then I would apply an hash function on each chunk and represent the page A as the set of that hash numbers A = [h_0, ... h_n].

Now, I can build a inverted index of the hashes so that h_i => {A,B} meaning that the h_i is present both in A and in B. At this point whenever I render the page A I will generate a link to the page B associated to the chunk of text having h_i as hash number.

I will keep updated the index by keep crowling the Web and updating internal representation of the pages and the index like a SE does.

- Claudio October 05, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

What constitutes a quote? - A fragment of text in a page constitutes a quote. For now we will only consider a text dump rather than gathering any semantic meaning to it.

How do you find quotes?
If more than one page has common occurrences of fragments of text, then it constitutes a quote. As for which page is going to be quoted can be determined by some page rank or the timestamp of the page that was created.

How do you make it scale to the web?
We crawl the web across to gather documents. These documents would then have to be compared amongst each other. The comparison can be limited to similar domains of the pages.

How do you handle updates?

The way the information would be laid out is, typically there would/could be a many(children) to one (parent) relationship of pages. Lets call this a group.

When a new page comes in, there are three possibilities:
1. It is a page which will be added to an already existing grouping.
2. It is a page for which a grouping doesn't exist yet. And this will be the new parent referring to itself.
3. It is a page that will replace the parent of a group.

Even within a domain there would be many such groupings. To assign some priority, we can have the number of children quoting the parent treated as a rank. So the node with the highest rank gets checked against first.

#2
If there is no match against any of the highest rank nodes, then that means the children will not match as well. So we create a new parent with rank 1 and the page points to itself.

#1It matches one of the high rank document and gets added to the other nodes.

#3
Along side checking against the highest rank i.e. checkIfQuoteFrom(parent,new_page) , we should also do a checkIfQuoteFrom(new_page,parent). If it returns true for the second call, then either a) the pages are the same. b) new_page is the parent of the current parent ( determined by timestamp amongst other metrics).

If it is a) then we can discard it. If it is b) then we replace the parent with new_page and have the parent join the other set of nodes.

How would you arrange the servers?
There would be a set of servers that would be responsible for crawling the web.
Another set of servers who would be doing a comparison of pages to get the quotes.
As soon as the the crawlers would return back with the set of results, it would be written to some cache and not the database. After massaging the data received is when they will be stored in the db with their state ( discussed above).

Even to fetch the results, we would cache the high frequency requests to ensure minimum db reads.

What data structures would you use?

A decreasing order of nodes ( based on rank ) for the pages to scan. Each of these nodes would be referred to by children. An object representing a domain pointing to these nodes.

A tree like ds that represents the reverse domain and each node there in would store the location to access the aforementioned list.

- capt_caveman March 17, 2014 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

What constitutes a quote?
A fragment text longer than some threshold, say 20 words

How do you find quotes?
Cluster all the web pages. For a web page, compare with all other pages in the cluster. O(mn).
Store quotes in db. Key is url, value is a list of position and url where it is reproduced.

How do you make it scale to the web?
There are 1t web pages over the internet, store all of them in clusters.

How do you handle updates?
Periodically crawl all the web pages and update the quotes. Given a url, check if the quoted positions are updated. If it is updated, then recompute the quotes.

How would you arrange the servers?
Two database, one to store all the clustered web pages, the other stores the quotes.

- ly October 08, 2016 | Flag Reply

CareerCup

Facebook Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews