Amazon Interview Question for Software Engineer Interns

Country: United States
Interview Type: In-Person


Hi, I'm not sure if this is exactly the answer to the problem you posted, but this would be my approach.
Any inputs or suggestions are most welcome!

import urllib2
from lxml import etree

class webCrawler(object):
	def __init__(self, baseURL):
		"""The init function (can do database connection setup here)"""
		self.baseURL = baseURL
		self.urlStack = set()  # set to avoid duplicates in the list

	def checkURL(self, url):
		"""Check if the url already exists in the database and return True or False"""

	def db_call(self, url, html):
		"""Insert into the db/disk the URL and the html content of the URL"""

	def crawl(self, url):
		try:
			html = urllib2.urlopen(url).read()
		except urllib2.URLError:
			print "Unable to retrieve HTML"
			return
		self.db_call(url, html)
		tree = etree.HTML(html)
		hrefs = tree.xpath("//a/@href")
		for each in hrefs:
			# Note: the domain filter string was lost when the post was
			# submitted; find("") always returns 0, so this currently
			# accepts every link.
			if each.find("") >= 0 and not self.checkURL(each):
				self.urlStack.add(each)
		while len(self.urlStack) > 0:
			self.crawl(self.urlStack.pop())

- Ankit September 11, 2013 | Flag Reply

I haven't tested your code, but the logic looks correct. There is one glitch that might cause problems in the interview.

The problem is with using a parser and its time complexity. You don't need to parse the whole HTML file when all you need are the href attributes. A simple regex should suffice.
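For example, a pattern along these lines pulls the href values without building a parse tree (a sketch: real-world HTML with unquoted values or extra whitespace would need a more forgiving pattern, and this uses Python 3):

```python
import re

# Matches href="..." or href='...' without parsing the full document.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']')

def extract_hrefs(html):
    """Return all href attribute values found in the raw HTML string."""
    return HREF_RE.findall(html)

html = '<a href="/a">A</a> <a href=\'/b\'>B</a>'
print(extract_hrefs(html))  # ['/a', '/b']
```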

- Rage October 26, 2013 | Flag

One more thing to consider: the existing URLs set assumes pages never change. In reality pages may come with cache control metadata (in the form of response headers) that tell you how soon to expire your cached version, whether you should try to invalidate your cache on each access, etc.
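To make that concrete, here is a simplified sketch of deciding when a cached copy goes stale from the Cache-Control header (the function name and signature are hypothetical; real caching logic also honors Expires, ETag, and Vary):

```python
import time

def cache_expiry(headers, fetched_at=None):
    """Return the epoch time at which a cached copy should be considered
    stale, based on a simplified reading of the Cache-Control header."""
    if fetched_at is None:
        fetched_at = time.time()
    cc = headers.get('Cache-Control', '')
    directives = [d.strip() for d in cc.split(',') if d.strip()]
    # no-store / no-cache: revalidate on every access
    if 'no-store' in directives or 'no-cache' in directives:
        return fetched_at
    for d in directives:
        if d.startswith('max-age='):
            return fetched_at + int(d.split('=', 1)[1])
    return fetched_at  # no freshness info: treat as immediately stale

print(cache_expiry({'Cache-Control': 'max-age=3600'}, fetched_at=0))  # 3600
```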

- estebistec June 04, 2015 | Flag

This is mine. It has a depth limit so that it won't crawl all of Amazon's pages.

from collections import deque
import urllib2
import re

class crawler:
    def __init__(self):
        self.visited = set()

    def crawl(self, url, max_depth):
        queue = deque()
        queue.append((url, 0))

        while queue:
            url, depth = queue.popleft()
            if url not in self.visited and depth < max_depth:
                try:
                    source = urllib2.urlopen(url).read()
                except urllib2.URLError:
                    continue
                self.visited.add(url)
                print url
                href_pos = [m.end() for m in re.finditer(r'href=', source)]
                for pos in href_pos:
                    quote = source[pos]  # opening quote character
                    end_pos = source.find(quote, pos + 1)
                    href = source[pos + 1 : end_pos]
                    # Note: the domain filter string was lost when the post
                    # was submitted; find('') always returns 0, so this
                    # currently accepts every link.
                    if href not in self.visited and href.find('') >= 0:
                        queue.append((href, depth + 1))
        return self.visited
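The depth-limit logic above can be checked without hitting the network by walking an in-memory link graph instead (a Python 3 sketch; the `links` dict stands in for the fetched-and-parsed hrefs):

```python
from collections import deque

def crawl_graph(links, start, max_depth):
    """Breadth-first traversal of a dict-of-lists link graph, stopping at
    max_depth just as the crawler above does. Returns the visited set."""
    visited = set()
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth >= max_depth:
            continue
        visited.add(url)
        for href in links.get(url, []):
            queue.append((href, depth + 1))
    return visited

links = {'a': ['b', 'c'], 'b': ['d'], 'd': ['e']}
print(sorted(crawl_graph(links, 'a', 2)))  # ['a', 'b', 'c']
```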

- Bogdan September 30, 2013 | Flag Reply
