Amazon Interview Question
Software Engineer / Developer
Team: AWS
Country: United States
Interview Type: In-Person
I think MemcacheDB (or any other disk-backed key-value store) will satisfy this. So do something like:
<?php
// increment() fails on a key that doesn't exist yet, so create it on first use.
function bump($key) {
    global $memcache;
    if ($memcache->increment($key) === false) {
        $memcache->add($key, 1);
    }
}

// Count hits per URL, per host, and per inbound/outbound link.
function incr($url, $host, $inLinks, $outLinks) {
    bump("url:$url");
    bump("host:$host");
    foreach ($inLinks as $link) {
        bump("inlink:$link");
    }
    foreach ($outLinks as $link) {
        bump("outlink:$link");
    }
}
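Note that MemcacheDB speaks the memcache protocol but persists everything to Berkeley DB on disk, so the counters are bounded by disk space rather than RAM, which is exactly the constraint the question sets up.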
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class InfiniteMachine {

    // Minimal record type for a crawled page; fields assumed from the question.
    public static class Tuple {
        String url;
        String html;
    }

    public void storeTuple(Tuple record) {
        String url = record.url;
        String host;
        try {
            host = new URL(url).getHost();
        } catch (MalformedURLException murle) {
            murle.printStackTrace();
            System.out.println("Couldn't get host from URL " + url + "!");
            return; // skip the bad record instead of killing the whole crawler
        }
        // The top-level domain (".com", ".net", ".org", etc.) is the host's suffix.
        int dot = host.lastIndexOf('.');
        String domain = dot >= 0 ? host.substring(dot) : host;
        // Use JSoup to pull the outgoing links out of the HTML;
        // passing the base URL lets it resolve relative hrefs.
        Document doc = Jsoup.parse(record.html, url);
        Elements links = doc.select("a[href]");
        // Write everything to its respective file, named after the sanitized URL.
        write(url, domain, host, links, url.replaceAll("[:/]", "") + ".txt");
    }

    public void write(String url, String domain, String host, Elements links, String filePath) {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(filePath))) {
            writer.write("URL\t" + url + "\n");
            writer.write("DOMAIN\t" + domain + "\n");
            writer.write("HOST\t" + host + "\n");
            for (Element link : links) {
                // abs:href gives the absolute link target rather than the anchor text
                writer.write("LINK\t" + link.attr("abs:href") + "\n");
            }
        } catch (IOException ioe) {
            ioe.printStackTrace();
            System.out.println("Couldn't write to file " + filePath + "!");
        }
    }
}
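For what it's worth, a minimal way to drive this (the Tuple fields are the assumptions defined above, and the URL and HTML are made-up sample data):

InfiniteMachine machine = new InfiniteMachine();
InfiniteMachine.Tuple record = new InfiniteMachine.Tuple();
record.url = "http://example.com/index.html";
record.html = "<html><body><a href=\"/about\">About</a></body></html>";
machine.storeTuple(record);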
This sounds like a simple web crawler.
- gameboy1024 August 21, 2014

I think the key point of the question is
"We have one machine that has enough disk space but limited memory."
No matter what data structure is used to hold the data in memory, there must be a garbage-collector-like job that periodically checks for nodes/records that are stale (not written to for a long time), so they can be dumped to the file system to free up memory.
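As a rough sketch of that eviction job (the MAX_IN_MEMORY budget and flushToDisk hook are illustrative placeholders, not from the original answer), an access-ordered LinkedHashMap gives you the stale-entry bookkeeping for free:

import java.util.LinkedHashMap;
import java.util.Map;

// Access-ordered map that spills the least-recently-used counter to disk
// once the in-memory budget is exceeded.
public class EvictingCounters extends LinkedHashMap<String, Long> {
    private static final int MAX_IN_MEMORY = 1_000_000; // placeholder budget

    public EvictingCounters() {
        super(16, 0.75f, true); // true = access order, i.e. LRU
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        if (size() > MAX_IN_MEMORY) {
            flushToDisk(eldest.getKey(), eldest.getValue());
            return true; // drop the stale entry from memory
        }
        return false;
    }

    private void flushToDisk(String key, long count) {
        // Append or merge into the on-disk store (see the merge sketch below).
    }
}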
To avoid duplication, you could either check for existence when a new node is created (and load the on-disk copy if there is one), or create a mechanism to merge data when dumping to disk.
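And a sketch of the merge-on-dump option, assuming a hypothetical tab-separated "key<TAB>count" file format:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Merge in-memory counters into an on-disk "key\tcount" file, adding
// counts for keys that already exist instead of duplicating them.
public class CounterMerger {
    public static void merge(Map<String, Long> inMemory, Path file) throws IOException {
        Map<String, Long> onDisk = new HashMap<>();
        if (Files.exists(file)) {
            for (String line : Files.readAllLines(file)) {
                String[] parts = line.split("\t");
                onDisk.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        // Add the in-memory deltas to whatever is already on disk.
        inMemory.forEach((k, v) -> onDisk.merge(k, v, Long::sum));
        Files.write(file, onDisk.entrySet().stream()
                .map(e -> e.getKey() + "\t" + e.getValue())
                .collect(Collectors.toList()));
    }
}

Reading the whole file back in only works while it still fits in memory; at scale you would keep the on-disk file sorted and do a streaming merge instead, which is essentially what an LSM-tree store does.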