Amazon Interview Question
Software Engineer / Developer
Country: United States
Interview Type: In-Person
Consider the web as a graph of related URLs; all you have to do is traverse this graph, which can be done with a simple BFS or DFS.
For parsing the web pages, any parser will do, but most of them use a finite state machine (FSM) to extract the different components of a page.
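As a minimal sketch of the BFS idea (the `fetch` callback and the toy regex link extractor are assumptions, not part of the original answer; a real crawler would use a proper HTML parser):

```python
# Hedged sketch: BFS over the web's link graph using a queue and a visited set.
from collections import deque
import re

LINK_RE = re.compile(r'href="(http[^"]+)"')

def extract_links(html):
    """Toy link extractor; a real crawler would use an HTML parser."""
    return LINK_RE.findall(html)

def bfs_crawl(seed, fetch, max_pages=100):
    """Breadth-first traversal starting from `seed`.
    `fetch(url)` returns the page body as a string."""
    visited = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in extract_links(fetch(url)):
            if link not in visited:     # dedupe before enqueueing
                visited.add(link)
                queue.append(link)
    return order
```

The visited set is what turns the graph walk into a terminating traversal even when pages link back to each other.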
The web crawler system needs to be distributed to support the workload. There should be a link-discovery component that feeds URLs to a queue in the network (e.g. Amazon SQS), and multiple machines with multiple worker threads to download and parse pages. One challenge is to minimize multiple worker threads writing the same portion of the index at the same time (read/write locking is expensive). Another challenge is to manage and optimize an enormous index.
MapReduce?
pseudocode:

INPUT  : url \t contents
OUTPUT : url \t n
step.mp:
    map (line : String)
        foreach url in extract_links(line.split("\t")[1])   # the contents column
            emit url \t 1 \n
    reduce (vector : tuple(String, Integer)[])
        count = 0
        url = vector[0].0                                   # the key
        foreach tuple in vector
            count++
        return url \t count \n

INPUT  : url \t n
OUTPUT : url \t contents
crawl.mp:
    map (line : String)
        url = line.split("\t")[0]
        contents = URL.GET(url)
        return url \t contents
    reduce (vector : tuple(String, String)[])
        return vector[0].0 \t vector[0].1                   # identity reduce

main:
    outputCrawl = initialFile
    0.upto(n)
        outputStep  = step(outputCrawl)
        outputCrawl = crawl(outputStep)
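A tiny in-memory simulation of the two jobs (the `fetch` callback and the regex link extractor are stand-ins for URL.GET and real page parsing; a production run would use an actual MapReduce framework):

```python
# Hedged sketch: in-process simulation of the step.mp and crawl.mp jobs.
import re
from collections import defaultdict

LINK_RE = re.compile(r'href="(http[^"]+)"')

def step(records):
    """step.mp: (url, contents) -> (url, n) for every outgoing link."""
    counts = defaultdict(int)          # shuffle + reduce in one pass
    for _, contents in records:        # map: emit (link, 1) per match
        for link in LINK_RE.findall(contents):
            counts[link] += 1
    return sorted(counts.items())

def crawl(records, fetch):
    """crawl.mp: (url, n) -> (url, contents) by downloading each url."""
    return [(url, fetch(url)) for url, _ in records]
```

Iterating `crawl(step(...))` mirrors the main loop above: each round discovers the links of the pages fetched in the previous round.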
Seems that they just want to crawl the internet to discover URLs.
I don't buy the idea of modeling the URLs as an explicit graph, since that is not part of the requirement, though I do think it is useful if you want to keep the relationships between URLs.
I do not see the point of doing deduplication in memory on the threads that fetch pages; this can be done in a separate thread, since connection speed is more limiting than CPU speed, and a non-blocking fetching thread is a better use of time than waiting on deduplication.
The best way is to use some data store as a queue: every time you find a new URL, hash or index it, check for duplicates, and stop if it is a duplicate, since some other thread or machine (or this one) has already crawled it.
Here are the bottlenecks and solutions for them:
Memory: distribute across different machines.
CPU: distribute across different machines.
Disk: distribute the queues or databases, partitioning the data with an evenly distributed hash algorithm.
Crawler connection speed: pre-compute as much as possible before issuing the request, e.g. pre-read the queue until memory is full and pre-create the required request objects.
Site connection speed: use a thread pool to handle slow connections, so one slow site does not block the rest.
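The thread-pool point can be sketched like this (`fetch` is a placeholder for the real HTTP download; pool size is an arbitrary assumption):

```python
# Hedged sketch: a thread pool so one slow site doesn't stall the crawler.
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, workers=16):
    """Download many URLs concurrently; returns {url: contents}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order while running fetches in parallel
        return dict(zip(urls, pool.map(fetch, urls)))
```

Because crawling is I/O-bound, threads waiting on slow sockets cost little CPU, so a fairly large pool is reasonable.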
Here is my design:
1. "Download worker". Input: URL; output: URL & document.
2. "URL extract worker". Input: document; output: URLs.
3. "URL feed". It accepts URLs and dispatches them to workers. You can choose a message queue or any similar technology.
4. "URL judgement". The "URL extract worker" parses the document and outputs URLs; the URL judgement component looks them up in a DB/memcache/etc. and decides whether to add all or part of the URLs to the "URL feed".
5. URL key/value storage. The value can include the URL expiration time, the downloaded document's location, its size, etc. We can use memcache, a DB, and so on.
6. Background thread. It checks the URL key/value storage periodically; if a URL has expired, it sends the URL back to the "URL feed" to refresh the document and its URLs.
In this design I don't use DFS or any complex algorithm, but it's easy to support millions of URLs; all communication is via asynchronous messages.
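A single-process sketch of this pipeline, with `queue.Queue` standing in for the real message queue, a dict for the URL key/value store, and `fetch`/`extract` as stand-ins for the download and URL-extract workers (all names here are illustrative assumptions):

```python
# Hedged sketch of the components above, collapsed into one loop.
import queue

def run_crawler(seed, fetch, extract, max_pages=100):
    url_feed = queue.Queue()           # 3. "URL feed"
    url_store = {}                     # 5. URL key/value storage
    url_feed.put(seed)
    while not url_feed.empty() and len(url_store) < max_pages:
        url = url_feed.get()
        if url in url_store:           # 4. "URL judgement": skip known URLs
            continue
        doc = fetch(url)               # 1. "Download worker"
        url_store[url] = doc
        for link in extract(doc):      # 2. "URL extract worker"
            if link not in url_store:
                url_feed.put(link)
    return url_store
```

In the real design each numbered component runs on its own machines and the loop body becomes asynchronous message passing between them.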
public interface IParser
{
    // downloads the url and parses it, returning the extracted URLs
    IEnumerable<string> Parse(string url);
}

public interface IRepository
{
    void SaveToDB(string url);
}

public interface IWebCrawler
{
    void PerformAction(IEnumerable<string> urls);
}

public class WebCrawler : IWebCrawler
{
    private IParser m_parser;
    private IRepository m_repository;
    private HashSet<string> m_visited;   // shared, so access must be synchronized
    private Queue<string> m_urls;

    public WebCrawler(IParser parser, IRepository repository, HashSet<string> visited) { }

    public void PerformAction(IEnumerable<string> urls)
    {
        // place the urls into the queue
        // call PerformActionImpl
    }

    private void PerformActionImpl()
    {
        // go over the queue
        // if the url is already in m_visited, continue to the next one
        // if not, parse the url and place the results into the queue
        // if there is also a requirement to do something with the result (like save it), use the repository
    }
}

public class WebCrawlerManager
{
    public int m_threadAmount;
    public HashSet<string> m_visitedUrl;
    public IEnumerable<string> m_urls;

    public WebCrawlerManager(IEnumerable<string> urls) { }

    public void Run()
    {
        // create as many threads as the constructor was given
        // partition the urls into m_threadAmount IEnumerable<string> chunks
        // run every WebCrawler with its own IEnumerable<string>
    }
}
I think we can use a simple linked list to traverse the URLs, appending to the end whenever we find new nodes. There is no requirement on the order of parsing the URLs, so a graph search may be unnecessary.
- coolraj334 December 03, 2014
As for bottlenecks,
1) Method used for parsing: using a simple regex to find <a> tags in the page may be more efficient than an external HTML-parsing library.
2) Using multi-threading: given the large number of URLs, this can help a lot. With n threads the time taken is roughly T/n (where T is the single-threaded time), assuming bandwidth is not the bottleneck.
3) Outsourcing and distributed computing: if the bandwidth or processing power of our server is low, we may want to farm the URLs out to remote machines with superior resources.
4) Language used for implementation: because of differences in design goals, languages have different execution times for different tasks; a more "network-oriented" language may be more suitable.
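Point 1 can be sketched with a regex like the following (an illustrative pattern, not from the original answer); it is fast on well-formed pages, but a real HTML parser handles single quotes, relative URLs, and malformed markup more robustly:

```python
# Hedged sketch: extracting <a href="..."> targets with a regex.
import re

HREF_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)

def extract_hrefs(html):
    """Return the double-quoted href targets of <a> tags in `html`."""
    return HREF_RE.findall(html)
```

This trades robustness for speed, which matches the answer's point that a regex can beat a full parsing library when the pages are predictable.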