Amazon Interview Question for SDE-3s


Country: United States





Given a seed URL, the crawler needed to auto-discover the value of the missing fields for a particular record. So if a web page didn't contain the information that I was looking for, the crawler needed to follow outbound links, until the information was found.
It needed to be some kind of crawler-scraper hybrid, because it had to simultaneously follow outbound links and extract specific information from web pages.
The whole thing needed to be distributed, because there were potentially hundreds of millions of URLs to visit.
The scraped data needed to be stored somewhere, most likely in a database.
The crawler needed to work 24/7, so running it on my laptop wasn't an option.
I didn't want it to cost too much in cloud hosting.
It needed to be coded in Python, my language of choice.
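
The crawler-scraper hybrid described above can be sketched on a single machine as a breadth-first loop that extracts the wanted field from each page and follows outbound links only when the field is missing. This is a minimal sketch, not the poster's implementation: `find_field`, `LinkParser`, the `fetch` callable, and the in-memory `PAGES` dictionary are all hypothetical names standing in for real network access and distribution.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import re


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_field(seed_url, fetch, pattern, max_pages=100):
    """Breadth-first crawl from seed_url until pattern matches a page.

    fetch is a hypothetical callable: url -> HTML string, or None on failure.
    Returns (url, matched_text), or None if nothing matches within max_pages.
    """
    frontier = deque([seed_url])
    seen = {seed_url}
    visited = 0
    while frontier and visited < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        visited += 1
        if html is None:
            continue
        match = re.search(pattern, html)
        if match:
            return url, match.group(0)
        # Field not found on this page: queue its outbound links.
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return None


# Usage with an in-memory stand-in for the network (made-up pages).
PAGES = {
    "http://example.com/": '<a href="/about">About</a> <a href="/contact">Contact</a>',
    "http://example.com/about": "<p>No details here.</p>",
    "http://example.com/contact": "<p>Email: info@example.com</p>",
}
result = find_field("http://example.com/", PAGES.get, r"[\w.]+@[\w.]+")
missing = find_field("http://example.com/", PAGES.get, r"Phone: \d+")
```

In a distributed version the `frontier` and `seen` structures would live in shared storage rather than in process memory.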

- justhelping September 11, 2019

Functionality of Web Crawler
1. Maintain a list of websites to be crawled.
2. Store every page that is crawled.
3. Define a crawl frequency for each type of website - New websites should be crawled frequently.
4. Honor robots.txt to determine what should not be crawled.
5. Detect whether a page has changed and, if so, recrawl it.
6. Parse and persist.
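
For the robots.txt step, Python's standard library includes `urllib.robotparser`. The sketch below parses an inline, made-up robots.txt rather than downloading one; the `MyCrawler` user agent, the `build_parser` helper, and the example URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Build a RobotFileParser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)
allowed = rules.can_fetch("MyCrawler", "http://example.com/index.html")
blocked = rules.can_fetch("MyCrawler", "http://example.com/private/data.html")
```

The crawler would call `can_fetch` before adding a URL to its queue and skip any URL for which it returns False.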

Need a Queue for a BFS (breadth-first) style of traversal.
Data structures
1. Set: key is the hash of the URL, value is the parsed content.
2. Zset (sorted set): key is the hash of the URL, score is the timestamp.

Queue - FIFO. For each URL, check whether its content is already in the Set; if not, store it in the Set and add it to the Zset.
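
The queue, Set, and Zset flow above can be sketched with in-memory stand-ins: a `deque` for the FIFO queue, a dict for the content store, and a dict of timestamps playing the role of the sorted set. This is only an illustration; `CrawlStore`, `url_key`, and `fetch_and_parse` are hypothetical names, and a real deployment would keep these structures in Redis or a similar store.

```python
import hashlib
import time
from collections import deque


def url_key(url):
    """Hash of the URL, used as the key in both structures."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


class CrawlStore:
    def __init__(self):
        self.queue = deque()    # FIFO queue of URLs to process
        self.content = {}       # "Set": url hash -> parsed content
        self.crawled_at = {}    # "Zset": url hash -> crawl timestamp (score)

    def enqueue(self, url):
        self.queue.append(url)

    def process_next(self, fetch_and_parse, now=None):
        """Pop one URL and store it only if it is not stored yet.

        Returns True if the page was fetched and stored, False if skipped.
        """
        url = self.queue.popleft()
        key = url_key(url)
        if key in self.content:
            return False
        self.content[key] = fetch_and_parse(url)
        self.crawled_at[key] = time.time() if now is None else now
        return True

    def stale_keys(self, cutoff):
        """Keys last crawled before cutoff, oldest first (a range query by score)."""
        old = [(ts, key) for key, ts in self.crawled_at.items() if ts < cutoff]
        return [key for ts, key in sorted(old)]


# Usage with a stand-in parser that just upper-cases the URL.
store = CrawlStore()
for u in ("http://example.com/a", "http://example.com/a", "http://example.com/b"):
    store.enqueue(u)
first = store.process_next(str.upper, now=100)
duplicate = store.process_next(str.upper, now=150)
second = store.process_next(str.upper, now=200)
stale = store.stale_keys(cutoff=150)
```

Keeping timestamps as scores lets the scheduler ask for the pages that are due for a recrawl without scanning all stored content.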

Technique
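- Bloom filter for determining that a page is definitely not present in the storage (a positive answer may be a false positive and needs a real lookup). Redis supports this through its Bloom filter data type (the RedisBloom module in older versions).
- For page modification, rely on the modification time, an MD5 hash of the content, etc. These can be persisted as a separate set.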
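
Both techniques can be illustrated in plain Python. The sketch below is a toy Bloom filter plus an MD5 content fingerprint, not the Redis implementation; `BloomFilter`, `might_contain`, `content_fingerprint`, and the sizes chosen are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not stored"; True means "possibly stored".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def content_fingerprint(body):
    """MD5 of the page body, used only for change detection, not security."""
    return hashlib.md5(body.encode("utf-8")).hexdigest()


# Usage: check the filter before a storage lookup, then compare fingerprints.
seen = BloomFilter()
before = seen.might_contain("http://example.com/a")
seen.add("http://example.com/a")
after = seen.might_contain("http://example.com/a")

old_print = content_fingerprint("<p>version 1</p>")
unchanged = content_fingerprint("<p>version 1</p>") == old_print
changed = content_fingerprint("<p>version 2</p>") != old_print
```

When the fingerprint of a freshly fetched page differs from the stored one, the page is treated as modified and is parsed and persisted again.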

A Hash in redi

- anshuman101 February 12, 2020

