Facebook Interview Question
Software Engineer / Developer
Country: United States
Interview Type: In-Person
What constitutes a quote?
A fragment of text on a page constitutes a quote. For now we will only consider the raw text dump rather than attaching any semantic meaning to it.
How do you find quotes?
If more than one page contains common occurrences of a text fragment, that fragment constitutes a quote. Which page is considered the quoted (original) one can be determined by some page rank or by the timestamp at which the page was created.
How do you make it scale to the web?
We crawl the web to gather documents. These documents then have to be compared against each other. The comparison can be limited to pages in similar domains.
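One way to compare two documents for common fragments is word-level n-gram "shingling". This is a hedged sketch, not the answer's stated method; the names `SHINGLE_SIZE`, `shingles`, and `common_fragments` are illustrative:

```python
# Hypothetical sketch: detect shared text fragments between two pages by
# comparing word-level n-gram "shingles". All names here are illustrative.

SHINGLE_SIZE = 8  # minimum run of words treated as a candidate quote

def shingles(text, size=SHINGLE_SIZE):
    """Every contiguous run of `size` words in the text."""
    words = text.split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def common_fragments(page_a, page_b):
    """Shingles that appear in both pages -- candidate quotes."""
    return shingles(page_a) & shingles(page_b)
```

Limiting the comparison to pages in similar domains then just means only calling `common_fragments` on pairs drawn from the same domain bucket.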
How do you handle updates?
Typically the information would be laid out as a many-to-one relationship: many children (quoting pages) to one parent (the quoted page). Let's call this a group.
When a new page comes in, there are three possibilities:
1. It is a page which will be added to an already existing grouping.
2. It is a page for which no grouping exists yet. It becomes a new parent referring to itself.
3. It is a page that will replace the parent of a group.
Even within a domain there would be many such groupings. To assign priority, we can treat the number of children quoting a parent as its rank, so the node with the highest rank gets checked first.
#2
If there is no match against any of the highest-rank nodes, then the children will not match either. So we create a new parent with rank 1, and the page points to itself.
#1
It matches one of the high-rank documents and gets added as a child alongside the other nodes.
#3
Alongside checking against the highest-rank node, i.e. checkIfQuoteFrom(parent, new_page), we should also do checkIfQuoteFrom(new_page, parent). If the second call returns true, then either a) the pages are the same, or b) new_page is the parent of the current parent (determined by timestamp among other metrics).
If it is a), we can discard it. If it is b), we replace the parent with new_page and have the old parent join the set of child nodes.
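The three cases above can be sketched as follows. This is an illustrative skeleton under stated assumptions: `Group`, the substring-based `check_if_quote_from` stand-in for checkIfQuoteFrom, and the timestamp tie-break are all assumptions; only the case analysis comes from the answer.

```python
# Illustrative sketch of the three update cases. The helper names and the
# substring-containment stand-in for checkIfQuoteFrom are assumptions.

class Group:
    def __init__(self, parent):
        self.parent = parent    # the quoted page (points to itself initially)
        self.children = []      # pages that quote the parent
    @property
    def rank(self):
        return 1 + len(self.children)

def check_if_quote_from(src, dst):
    """Stand-in for checkIfQuoteFrom(src, dst): does dst quote text from src?"""
    return src["text"] in dst["text"]

def handle_new_page(new_page, groups):
    # scan groups in decreasing rank order, highest-ranked parent first
    for group in sorted(groups, key=lambda g: g.rank, reverse=True):
        parent = group.parent
        if check_if_quote_from(parent, new_page):
            # also test the reverse direction (case 3)
            if check_if_quote_from(new_page, parent):
                if new_page["text"] == parent["text"]:
                    return groups                  # 3a) same page: discard
                if new_page["ts"] < parent["ts"]:  # 3b) older page is the true parent
                    group.children.append(parent)
                    group.parent = new_page
                    return groups
            group.children.append(new_page)        # case 1: join existing group
            return groups
    groups.append(Group(new_page))                 # case 2: new self-referring parent
    return groups
```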
How would you arrange the servers?
There would be a set of servers that would be responsible for crawling the web.
Another set of servers would compare pages to extract the quotes.
As soon as the crawlers return with a set of results, it would be written to a cache rather than the database. Only after massaging the received data would it be stored in the db with its state (discussed above).
Even when fetching results, we would cache high-frequency requests to minimize db reads.
What data structures would you use?
A list of nodes in decreasing order of rank for the pages to scan. Each of these nodes would be referred to by its children. An object representing a domain points to these nodes.
A trie-like data structure representing the reversed domain; each node in it stores the location of the aforementioned list.
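A minimal sketch of those lookup structures, assuming a plain dict keyed by the reversed domain in place of an actual trie; the names `reverse_domain`, `index`, and `add_parent` are illustrative:

```python
# Minimal sketch of the lookup structures, using a dict keyed by the
# reversed domain instead of a real trie. All names are illustrative.

def reverse_domain(host):
    """'news.example.com' -> 'com.example.news', so related domains sort together."""
    return ".".join(reversed(host.split(".")))

# reversed-domain -> list of parent nodes, kept in decreasing rank order
index = {}

def add_parent(host, parent_node):
    bucket = index.setdefault(reverse_domain(host), [])
    bucket.append(parent_node)
    bucket.sort(key=lambda n: n["rank"], reverse=True)
```

A real trie over the reversed-domain labels would additionally let prefix queries find all subdomains of a site, which a flat dict cannot do.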
What constitutes a quote?
A fragment of text longer than some threshold, say 20 words.
How do you find quotes?
Cluster all the web pages. For each web page, compare it with all the other pages in its cluster: O(mn).
Store quotes in a db. The key is the url; the value is a list of (position, url) pairs indicating where the text is reproduced.
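One possible shape for that key-value record, sketched in Python; the field names and urls are assumptions, not from the answer:

```python
# Hypothetical shape of the quote store: key is the source url, value is a
# list of (position, quoting url) records. Field names and urls are made up.

quotes_db = {
    "http://example.com/original": [
        {"position": 128, "quoted_by": "http://blog.example.org/post"},
        {"position": 342, "quoted_by": "http://news.example.net/item"},
    ],
}

def quoting_pages(source_url):
    """Return the urls that reproduce text from source_url."""
    return [entry["quoted_by"] for entry in quotes_db.get(source_url, [])]
```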
How do you make it scale to the web?
There are on the order of 1 trillion web pages on the internet; store all of them in clusters.
How do you handle updates?
Periodically crawl all the web pages and update the quotes. Given a url, check whether the quoted positions have changed; if they have, recompute the quotes.
How would you arrange the servers?
Two databases: one stores all the clustered web pages, the other stores the quotes.
I would chunk each page into blocks of text (you can leverage the DOM of the page; how you optimize this step deeply affects the final result). Then I would apply a hash function to each chunk and represent page A as the set of those hash values, A = [h_0, ..., h_n].
Now I can build an inverted index of the hashes, so that h_i => {A, B} means h_i is present in both A and B. At that point, whenever I render page A, I generate a link to the page B associated with the chunk of text whose hash is h_i.
I would keep the index up to date by continually crawling the web and updating the internal representation of the pages and the index, the way a search engine does.
- Claudio, October 05, 2012
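This chunk-hash inverted index can be sketched as follows, under assumptions: chunking on blank lines stands in for DOM-based chunking, and Python's built-in `hash` stands in for a real hash function; the function names are illustrative.

```python
# Sketch of the chunk-hash inverted index. Blank-line chunking stands in
# for DOM-based chunking; built-in hash() stands in for a real hash.
from collections import defaultdict

def chunk_hashes(page_text):
    """Represent a page as the set of hashes of its text blocks."""
    blocks = [b.strip() for b in page_text.split("\n\n") if b.strip()]
    return {hash(b) for b in blocks}

def build_inverted_index(pages):
    """pages: {url: text}. Returns hash -> set of urls containing that block."""
    index = defaultdict(set)
    for url, text in pages.items():
        for h in chunk_hashes(text):
            index[h].add(url)
    return index

def pages_sharing_chunks(index):
    """Hashes present in more than one page mark candidate quote links."""
    return {h: urls for h, urls in index.items() if len(urls) > 1}
```

Handling updates then amounts to recomputing `chunk_hashes` for a recrawled page and patching only the index entries whose hash sets changed.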