Amazon Interview Question
Software Engineer / Developer
Country: United States
Interview Type: In-Person
Consider the web as a graph of related URLs; all you have to do is traverse this graph, which can be done with a simple BFS or DFS.
For parsing the web pages, any parser will do, but most of them use a finite state machine (FSM) to extract the different components of a page.
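As a minimal sketch of the BFS idea (the `fetch` callback and the toy regex link extractor are assumptions, not part of the original answer; a real crawler would use a proper HTML parser):

```python
# Hedged sketch: BFS over the web's link graph using a queue and a visited set.
from collections import deque
import re

LINK_RE = re.compile(r'href="(http[^"]+)"')

def extract_links(html):
    """Toy link extractor; a real crawler would use an HTML parser."""
    return LINK_RE.findall(html)

def bfs_crawl(seed, fetch, max_pages=100):
    """Breadth-first traversal starting from `seed`.
    `fetch(url)` returns the page body as a string."""
    visited = {seed}
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in extract_links(fetch(url)):
            if link not in visited:     # dedupe before enqueueing
                visited.add(link)
                queue.append(link)
    return order
```

The visited set is what turns the graph walk into a terminating traversal even when pages link back to each other.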
The web crawler system needs to be distributed to support the workload. There should be a link-discovery component that feeds URLs to a queue in the network (e.g. Amazon SQS), and multiple machines with multiple worker threads to download and parse pages. One challenge is to minimize multiple worker threads writing the same portion of the index at the same time (read/write locking is expensive). Another challenge is to manage and optimize an enormous index.
MapReduce?
pseudocode:

INPUT  : url \t contents
OUTPUT : url \t n
step.mp:
    map (line : String)
        foreach url in extract_links(line.split("\t")[1])   # the contents column
            emit url \t 1 \n
    reduce (vector : tuple(String, Integer)[])
        count = 0
        url = vector[0].0                                   # the key
        foreach tuple in vector
            count++
        return url \t count \n

INPUT  : url \t n
OUTPUT : url \t contents
crawl.mp:
    map (line : String)
        url = line.split("\t")[0]
        contents = URL.GET(url)
        return url \t contents
    reduce (vector : tuple(String, String)[])
        return vector[0].0 \t vector[0].1                   # identity reduce

main:
    outputCrawl = initialFile
    0.upto(n)
        outputStep  = step(outputCrawl)
        outputCrawl = crawl(outputStep)
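A tiny in-memory simulation of the two jobs (the `fetch` callback and the regex link extractor are stand-ins for URL.GET and real page parsing; a production run would use an actual MapReduce framework):

```python
# Hedged sketch: in-process simulation of the step.mp and crawl.mp jobs.
import re
from collections import defaultdict

LINK_RE = re.compile(r'href="(http[^"]+)"')

def step(records):
    """step.mp: (url, contents) -> (url, n) for every outgoing link."""
    counts = defaultdict(int)          # shuffle + reduce in one pass
    for _, contents in records:        # map: emit (link, 1) per match
        for link in LINK_RE.findall(contents):
            counts[link] += 1
    return sorted(counts.items())

def crawl(records, fetch):
    """crawl.mp: (url, n) -> (url, contents) by downloading each url."""
    return [(url, fetch(url)) for url, _ in records]
```

Iterating `crawl(step(...))` mirrors the main loop above: each round discovers the links of the pages fetched in the previous round.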
Seems that they just want to crawl the internet to discover URLs.
I don't buy the idea of modeling the URLs as an explicit graph, since that is not part of the requirement, though I do think it is useful if you want to keep the relationships between URLs.
I do not see the point of doing deduplication in memory on the threads that fetch pages; this can be done in a separate thread, since connection speed is more limiting than CPU speed, and a non-blocking fetching thread is a better use of time than waiting on deduplication.
The best way is to use some data store as a queue: every time you find a new URL, hash or index it, check for duplicates, and stop if it is a duplicate, since some other thread or machine (or this one) has already crawled it.
Here are the bottlenecks and solutions for them:
Memory: distribute across different machines.
CPU: distribute across different machines.
Disk: distribute the queues or databases, partitioning the data with an evenly distributed hash algorithm.
Crawler connection speed: pre-compute as much as possible before issuing the request, e.g. pre-read the queue until memory is full and pre-create the required request objects.
Site connection speed: use a thread pool to handle slow connections, so one slow site does not block the rest.
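The thread-pool point can be sketched like this (`fetch` is a placeholder for the real HTTP download; pool size is an arbitrary assumption):

```python
# Hedged sketch: a thread pool so one slow site doesn't stall the crawler.
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, workers=16):
    """Download many URLs concurrently; returns {url: contents}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order while running fetches in parallel
        return dict(zip(urls, pool.map(fetch, urls)))
```

Because crawling is I/O-bound, threads waiting on slow sockets cost little CPU, so a fairly large pool is reasonable.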
Here is my design:
1. "Download worker". Input: URL; output: URL & document.
2. "URL extract worker". Input: document; output: URLs.
3. "URL feed". It accepts URLs and dispatches them to workers. You can choose a message queue or any similar technology.
4. "URL judgement". The "URL extract worker" parses the document and outputs URLs; the URL judgement component looks them up in a DB/memcache/etc. and decides whether to add all or part of the URLs to the "URL feed".
5. URL key/value storage. The value can include the URL expiration time, the downloaded document's location, its size, etc. We can use memcache, a DB, and so on.
6. Background thread. It checks the URL key/value storage periodically; if a URL has expired, it sends the URL back to the "URL feed" to refresh the document and its URLs.
In this design I don't use DFS or any complex algorithm, but it's easy to support millions of URLs; all communication is via asynchronous messages.
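A single-process sketch of this pipeline, with `queue.Queue` standing in for the real message queue, a dict for the URL key/value store, and `fetch`/`extract` as stand-ins for the download and URL-extract workers (all names here are illustrative assumptions):

```python
# Hedged sketch of the components above, collapsed into one loop.
import queue

def run_crawler(seed, fetch, extract, max_pages=100):
    url_feed = queue.Queue()           # 3. "URL feed"
    url_store = {}                     # 5. URL key/value storage
    url_feed.put(seed)
    while not url_feed.empty() and len(url_store) < max_pages:
        url = url_feed.get()
        if url in url_store:           # 4. "URL judgement": skip known URLs
            continue
        doc = fetch(url)               # 1. "Download worker"
        url_store[url] = doc
        for link in extract(doc):      # 2. "URL extract worker"
            if link not in url_store:
                url_feed.put(link)
    return url_store
```

In the real design each numbered component runs on its own machines and the loop body becomes asynchronous message passing between them.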
public interface IParser
{
    // downloads the url and parses it, returning the extracted URLs
    IEnumerable<string> Parse(string url);
}

public interface IRepository
{
    void SaveToDB(string url);
}

public interface IWebCrawler
{
    void PerformAction(IEnumerable<string> urls);
}

public class WebCrawler : IWebCrawler
{
    private IParser m_parser;
    private IRepository m_repository;
    private HashSet<string> m_visited;   // shared, so access must be synchronized
    private Queue<string> m_urls;

    public WebCrawler(IParser parser, IRepository repository, HashSet<string> visited) { }

    public void PerformAction(IEnumerable<string> urls)
    {
        // place the urls into the queue
        // call PerformActionImpl
    }

    private void PerformActionImpl()
    {
        // go over the queue
        // if the url is already in m_visited, continue to the next one
        // if not, parse the url and place the results into the queue
        // if there is also a requirement to do something with the result (like save it), use the repository
    }
}

public class WebCrawlerManager
{
    public int m_threadAmount;
    public HashSet<string> m_visitedUrl;
    public IEnumerable<string> m_urls;

    public WebCrawlerManager(IEnumerable<string> urls) { }

    public void Run()
    {
        // create as many threads as the constructor was given
        // partition the urls into m_threadAmount IEnumerable<string> chunks
        // run every WebCrawler with its own IEnumerable<string>
    }
}
I think we can use a simple linked list to traverse the URLs, appending to the end whenever we find new nodes. There is no requirement on the order of parsing the URLs, so a graph search may be unnecessary.
- coolraj334 December 03, 2014
As for bottlenecks,
1) Method used for parsing: using a simple regex to find <a> tags in the page may be more efficient than an external HTML-parsing library.
2) Using multi-threading: given the large number of URLs, this can help a lot. With n threads the time taken is roughly T/n (where T is the single-threaded time), assuming bandwidth is not the bottleneck.
3) Outsourcing and distributed computing: if the bandwidth or processing power of our server is low, we may want to farm the URLs out to remote machines with superior resources.
4) Language used for implementation: because of differences in design goals, languages have different execution times for different tasks; a more "network-oriented" language may be more suitable.
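Point 1 can be sketched with a regex like the following (an illustrative pattern, not from the original answer); it is fast on well-formed pages, but a real HTML parser handles single quotes, relative URLs, and malformed markup more robustly:

```python
# Hedged sketch: extracting <a href="..."> targets with a regex.
import re

HREF_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)

def extract_hrefs(html):
    """Return the double-quoted href targets of <a> tags in `html`."""
    return HREF_RE.findall(html)
```

This trades robustness for speed, which matches the answer's point that a regex can beat a full parsing library when the pages are predictable.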