Facebook Interview Question
Software Engineer / Developer
Country: United States
Interview Type: In-Person
My approach:
1) Data Model
I'll have 2 hash maps:
- One keyed on the source page address.
- One keyed on the destination page address.
- In both maps, the value locates the quote record: a position in a vector that contains pointers to the struct below.
struct quote
{
    string src;          // Source page link
    string dest;         // Destination page link
    string src_element;  // We can have div, p, etc.
    int src_id;          // Id of the element on the source page, i.e. the one being quoted
    string dest_element; // We can have div, p, etc.
    int dest_id;         // Id of the element on the destination page, i.e. where the quote appears
};
unordered_map<string, quote*> srcMap;
unordered_map<string, quote*> destMap;
2) Operations supported will be:
- void insertQuote(string src, string dest ...) // properties to be added to our struct
- string fetchQuotedString(string dest, string element, int id);
- string fetchDestPage(string src, string element, int id);
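A minimal compilable sketch of the store described above, simplified to key-only lookups and one quote per page; the `QuoteStore` name and the `text` field (for the quoted string itself) are my additions, and the maps hold vector positions as noted in the data model:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative record; adds a `text` field for the quoted string itself.
struct Quote {
    std::string src, dest;    // source and destination page links
    std::string src_element;  // e.g. "div", "p"
    int src_id;               // id of the quoted element on the source page
    std::string dest_element;
    int dest_id;              // id of the quoting element on the destination page
    std::string text;         // the quoted string
};

class QuoteStore {
public:
    void insertQuote(const Quote& q) {
        quotes_.push_back(q);
        std::size_t pos = quotes_.size() - 1;  // O(1) append
        srcMap_[q.src] = pos;
        destMap_[q.dest] = pos;
    }
    // Text quoted on a destination page, or "" when unknown.
    std::string fetchQuotedString(const std::string& dest) const {
        auto it = destMap_.find(dest);
        return it == destMap_.end() ? "" : quotes_[it->second].text;
    }
    // Destination page quoting a given source page, or "" when unknown.
    std::string fetchDestPage(const std::string& src) const {
        auto it = srcMap_.find(src);
        return it == srcMap_.end() ? "" : quotes_[it->second].dest;
    }
private:
    std::vector<Quote> quotes_;  // records, referenced by position
    std::unordered_map<std::string, std::size_t> srcMap_;
    std::unordered_map<std::string, std::size_t> destMap_;
};
```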
Hash can be calculated on the complete web address, which might look like: "http://www.blah.com&type=id"
3) Execution
- Client sends an insertQuote request to the server.
- Insertion can be an asynchronous event: the call returns a success code immediately and the data structures are updated lazily.
4) Complexity
- Since we're using hash maps and simply appending a new node to the end of an STL list, the time complexity of each operation will be O(1) on average.
5) Write to Disks
We can keep a threshold for time or for size of data in memory after which contents can be dumped to the disk. (These will basically be serial writes)
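A sketch of that threshold idea, assuming a size trigger (a time trigger would work the same way); `FlushBuffer` and its methods are illustrative names:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Buffer records in memory; flush (a serial write) once a size threshold is hit.
class FlushBuffer {
public:
    explicit FlushBuffer(std::size_t maxBytes) : maxBytes_(maxBytes) {}

    // Returns true when this append triggered a flush to disk.
    bool append(const std::string& record) {
        buf_.push_back(record);
        bytes_ += record.size();
        if (bytes_ >= maxBytes_) {
            flush();
            return true;
        }
        return false;
    }

    std::size_t pending() const { return buf_.size(); }

private:
    void flush() {
        // A real implementation would serially append buf_ to a log file here.
        buf_.clear();
        bytes_ = 0;
    }

    std::vector<std::string> buf_;
    std::size_t bytes_ = 0;
    std::size_t maxBytes_;
};
```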
6) Concurrency
- Read operation
- For displaying page A, whose text is quoted we'll have to invoke fetchDestPage(...) and for displaying page B, we'll invoke fetchQuotedString(...)
- For read, we can use reader-writer locking mechanism on the maps. If no writer is updating the corresponding entry, reader process can fetch node value based on the type of page being displayed (hashing source or destination web address).
- Write operation
- As mentioned earlier, insertion can be asynchronous in nature. Once a client invokes insertQuote(...), preforked server process can record the invocation and return a success code to the caller.
- Each process updates the list of nodes and receives an iterator. To insert the entry into the maps, the process has to wait to acquire locks on the hashed positions; once acquired, it can insert a new entry into each map.
7) Scalability
We can distribute our hash maps across different nodes with a DHT-based setup (caching the most frequently used data). Therefore, when a server receives a request to fetch or insert an entry, it'll:
- Return 2xx to client
- Hash the "web-address" and accordingly find the destination host
- Route the message to the destination
- The recipient node will have its own segment of the hash tables and list.
We basically import all the features of a DHT to share resources among a number of hosts and provide better performance to the user.
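Routing in such a setup is usually done with consistent hashing; a sketch (using std::hash for illustration, where a real DHT would use a stable hash such as SHA-1):

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Each node owns a point on a hash ring; a key is served by the first
// node clockwise from the key's hash position.
class HashRing {
public:
    void addNode(const std::string& host) {
        ring_[std::hash<std::string>{}(host)] = host;
    }
    std::string nodeFor(const std::string& url) const {
        std::size_t h = std::hash<std::string>{}(url);
        auto it = ring_.lower_bound(h);             // first node at or after h
        if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
        return it->second;
    }
private:
    std::map<std::size_t, std::string> ring_;  // ordered by ring position
};
```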
8) Replication for availability
Data can be replicated on many DHT nodes, with DHT-based bootstrapping and division of labor. For every write operation, the node that receives the request can be considered the coordinator, which takes care of a successful commit to each replica using the two-phase commit protocol. If the transaction aborts, a failure code is returned to the client; the client is responsible for retrying.
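The coordinator's part of that flow can be sketched as below; the `Replica` interface is an assumption for illustration, not something fixed by the answer:

```cpp
#include <functional>
#include <vector>

// Phase 1 asks every replica to vote; phase 2 commits only if all voted yes,
// otherwise aborts everywhere and reports failure so the client can retry.
struct Replica {
    std::function<bool()> prepare;  // phase 1: vote to commit?
    std::function<void()> commit;   // phase 2, on unanimous yes
    std::function<void()> abort;    // phase 2, on any no
};

bool twoPhaseCommit(std::vector<Replica>& replicas) {
    for (auto& r : replicas) {
        if (!r.prepare()) {
            for (auto& x : replicas) x.abort();
            return false;  // aborted: caller returns a failure code
        }
    }
    for (auto& r : replicas) r.commit();
    return true;
}
```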
9) Fault Tolerance
- Again, we're importing fault-tolerance features from the DHT. In case of a faulty destination host, its replicas should be able to service the request.
- If none of the replicas is able to respond, we'll treat it as a cache miss and fetch the data from disk.
Breaking it down into aspects:
1. Definition (what the quote is for) --
a. Paragraph-based. Not just a word ('the' is everywhere; we cannot call that a quote). Check the paragraph length.
b. Main text only. Not the header, footer, menu, signature, etc. No HTML tags.
c. Has tolerance (similarity > 95%, for instance). This is important for dynamic pages, as some parts of them change very often.
2. Detection --
a. Remove the header/footer/menu and strip the HTML tags.
b. Calculate the edit distance of the two whole pages. If it is big, run edit distance over all paragraph pairs (O(n^2) pairs, each costing O(edit distance)) to find the quote. Store the matches if they exist.
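A sketch of the per-paragraph check, using the textbook dynamic-programming edit distance and turning it into the similarity ratio used by the tolerance rule above (the 95% threshold):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Classic O(|a|*|b|) edit distance.
int editDistance(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1,
                                    std::vector<int>(b.size() + 1, 0));
    for (std::size_t i = 0; i <= a.size(); ++i) d[i][0] = i;
    for (std::size_t j = 0; j <= b.size(); ++j) d[0][j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i)
        for (std::size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({d[i - 1][j] + 1,  // deletion
                                d[i][j - 1] + 1,  // insertion
                                d[i - 1][j - 1] + (a[i - 1] != b[j - 1])});
    return d[a.size()][b.size()];
}

// Similarity in [0, 1]; a paragraph pair counts as a quote when this
// clears the tolerance threshold (e.g. 0.95).
double similarity(const std::string& a, const std::string& b) {
    std::size_t longest = std::max(a.size(), b.size());
    return longest == 0 ? 1.0 : 1.0 - double(editDistance(a, b)) / longest;
}
```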
3. Representation, storage --
CREATE TABLE Quote (id BIGINT PRIMARY KEY DEFAULT NEXTSEQ, url1 VARCHAR(MAX) NOT NULL, url2 VARCHAR(MAX) NOT NULL, content TEXT NOT NULL);
CREATE INDEX idx_Quote_Url1 ON Quote (url1) INCLUDE(id, url2, content);
CREATE INDEX idx_Quote_Url2 ON Quote (url2) INCLUDE(id, url1, content);
4. Present --
a. $url is given. $page = get_content_from_url($url); $altered_page = $page;
b. $quotes = sql("SELECT * FROM Quote WHERE url1 = '$url' OR url2 = '$url'");
c. foreach ($quotes as $quote) {
    $verify = verify_content_exists($page, $quote['content']);
    if (!$verify) { sql("DELETE FROM Quote WHERE id = " . $quote['id']); continue; }
    $other_url = $quote['url2'];
    if ($other_url == $url) $other_url = $quote['url1'];
    $other_page = get_content_from_url($other_url);
    $verify = verify_content_exists($other_page, $quote['content']);
    if (!$verify) { sql("DELETE FROM Quote WHERE id = " . $quote['id']); continue; }
    $altered_page = str_replace($quote['content'],
        "\"" . $quote['content'] . "\" (<a href=\"" . $other_url . "\">exists here too</a>)",
        $altered_page);
}
5. Update --
a. Delete on the fly as shown in 4.
b. Run detect regularly to add new quotes.
How do you find quotes?
There are 1T web pages on the internet, so we cannot afford anything more than linear work. Just check whether the paragraph is part of any other web page using KMP string matching.
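For reference, the KMP substring test runs in O(|text| + |pattern|); a sketch (`kmpContains` is an illustrative name):

```cpp
#include <string>
#include <vector>

// Knuth-Morris-Pratt: linear-time check that `pat` occurs inside `text`.
bool kmpContains(const std::string& text, const std::string& pat) {
    if (pat.empty()) return true;
    // Failure function: length of the longest proper prefix of pat[0..i]
    // that is also a suffix of it.
    std::vector<std::size_t> fail(pat.size(), 0);
    for (std::size_t i = 1, k = 0; i < pat.size(); ++i) {
        while (k > 0 && pat[i] != pat[k]) k = fail[k - 1];
        if (pat[i] == pat[k]) ++k;
        fail[i] = k;
    }
    // Scan the text, never moving backwards in it.
    for (std::size_t i = 0, k = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != pat[k]) k = fail[k - 1];
        if (text[i] == pat[k]) ++k;
        if (k == pat.size()) return true;
    }
    return false;
}
```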
How do you make it scale to the web?
We cannot afford to check whether a paragraph is quoted in all 1T web pages. We have to cluster the web pages, then check within each cluster. Hopefully the clusters are not very big.
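One simple way to form such clusters (my assumption; the answer doesn't pin down a method) is a min-hash-style bucket key over a page's words, so near-duplicate pages tend to share a bucket:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>

// Cluster key = minimum hash over the page's words; pages with similar
// word sets are likely to agree on the minimizing word and share a bucket.
std::size_t clusterKey(const std::string& page) {
    std::istringstream in(page);
    std::string word;
    std::size_t best = SIZE_MAX;  // empty pages all land in one bucket
    while (in >> word)
        best = std::min(best, std::hash<std::string>{}(word));
    return best;
}
```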
How do you handle updates?
Store all the web pages and periodically re-crawl them.
Schema: paragraph_id, page_id1, page_id2, ...
If a page is updated, check whether the changed paragraph is part of any paragraph in its cluster.
How would you arrange the servers?
- A NoSQL database for the web pages.
- A NoSQL database to store the quotes.
How much storage would you need?
Assume 10 KB per page; storing all 1T pages then takes about 10 PB.
I was asked the same question at Amazon, 1 yr back
- nr November 09, 2012