Amazon Adobe Interview Question
Software Engineer / Developer
Country: United States
@eugene.yarovoi: Before writing my thoughts, I experimented with how Firefox works in this respect. I found that it even keeps browser history up to six months old.
Here is my thought:
I would use a doubly linked list of dates. Each day the browser is used, I would add a new node to it. Each node of the doubly linked list would in turn hold another linked list storing the URLs. I don't think a timestamp is necessary here. We can add newly visited pages to the head of the list, so the most recently visited pages stay near the front.
I liked your idea of a TreeSet. It will help when a user revisits a previously visited page, since we then need to adjust the nodes of the list.
Now, if the user wants to see the history by date, the list will serve the purpose. If the user wants to search for a particular URL, we can provide prefix searching with a trie.
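A minimal sketch of this list-of-lists idea (using Deques as a stand-in for the hand-rolled doubly linked lists, and a plain string date key as an assumption, since the original post doesn't specify how dates are represented):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class DayHistory {
    // Outer deque: one node per day, most recent day first.
    // Inner deque: URLs visited that day, most recent visit first.
    private final Deque<Deque<String>> days = new ArrayDeque<>();
    private String currentDay = null;

    // Record a visit; 'day' is a date key like "2012-09-03".
    public void visit(String url, String day) {
        if (!day.equals(currentDay)) {
            days.addFirst(new ArrayDeque<>()); // new node when a new day starts
            currentDay = day;
        }
        days.peekFirst().addFirst(url); // recent pages stay near the front
    }

    public Deque<String> today() {
        return days.peekFirst();
    }
}
```

Browsing "today's history" is then just reading the front node; older days are reached by walking the outer list.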
Let me know your thoughts.
I believe Firefox has a setting for how long to keep browser history.
I proposed a timestamp because that's something the user might actually want to look at. "When did I access this page?"
Sure, you can use a linked list of linked lists if you want. I proposed just using one linked list and storing a pointer to something that's a day old, then two days old, etc. It doesn't matter much which of these strategies you use. It's much like how you can implement a 2D array as either an array of arrays or as a 1D array and an array of offsets into it.
A prefix tree is a good idea because it's probably a little easier to use it to provide search results as the user types (before the user has finished typing, that is). A modified TreeSet would also work for this purpose.
Java developers can use LinkedHashMap and override the removeEldestEntry(Map.Entry) method to implement the browsing-history retention policy.
More info on LinkedHashMap: docs.oracle.com/javase/6/docs/api/java/util/LinkedHashMap.html
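A sketch of that idea, assuming a simple size-based retention policy (a time-based policy would instead inspect the eldest entry's timestamp in removeEldestEntry):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// URL -> visit timestamp, evicting the oldest entry past a fixed capacity.
public class BoundedHistory extends LinkedHashMap<String, Long> {
    private final int capacity;

    public BoundedHistory(int capacity) {
        // accessOrder=false: entries stay in insertion order, like a visit log
        super(16, 0.75f, false);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        // Returning true tells LinkedHashMap to drop the eldest entry
        return size() > capacity;
    }
}
```

LinkedHashMap handles the eviction itself on each put, so the caller never manages the retention logic directly.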
How about using an ArrayList or Vector to store the page-access history, so that users can perform range queries (e.g., which pages were accessed during June 1st to June 5th)?
@First Anonymous: yes, LinkedHashMap is a convenient class that does a lot of what you need. It may not do everything that the combination of data structures I propose above does; namely, there's no convenient way to jump to a particular date and search for the pages accessed on that date.
@Second Anonymous: the data structures I proposed above are sufficient to be able to do range queries.
Web browsers commonly allow you to navigate through a "history" of web pages that have previously been visited. The mechanism is somewhat like a stack, in that the most recently visited pages are at the top of the history and revisited when the "back" button is pressed.
However, the history does not necessarily have unbounded capacity. In reality, there may exist a fixed limit on the size of the history. The issue arises as to what should happen when the capacity is exhausted and a new item is pushed onto the stack. One possibility is to throw an exception. But this is not how a Web browser behaves. If it only has room to save 50 pages in its history and yet you visit more pages, it will make room in the history for a new page by throwing away the page that is on the very bottom of the history (i.e., the least recently visited page). The formal Stack interface does not help, as it gives us no way to directly access or remove the object on the bottom of the stack.
I typically think of browser history as being distinct from browser navigation (forward / back buttons). The record that you visited a page should still be there even if you press "back".
There are supposed to be two stacks, one for the Back button and one for the Forward button. When you navigate to a new page, push the current page's details onto the back stack; when you move back, pop the top element of the back stack (which becomes the current page) and push the previous current page onto the forward stack. Don't worry about the size of the array; a dynamic array can be used. This can also be done with a list.
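A sketch of the two-stack navigation described above. One detail the post doesn't mention, added here as a common refinement: visiting a fresh page usually clears the forward stack, since the old forward path is no longer reachable.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class Navigator {
    private final Deque<String> backStack = new ArrayDeque<>();
    private final Deque<String> forwardStack = new ArrayDeque<>();
    private String current;

    public Navigator(String startPage) { current = startPage; }

    public void visit(String url) {
        backStack.push(current);
        current = url;
        forwardStack.clear(); // a fresh visit invalidates the forward history
    }

    public String back() {
        if (backStack.isEmpty()) return current; // nothing to go back to
        forwardStack.push(current);
        current = backStack.pop();
        return current;
    }

    public String forward() {
        if (forwardStack.isEmpty()) return current;
        backStack.push(current);
        current = forwardStack.pop();
        return current;
    }
}
```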
A set should be used, so that a page visited again is overwritten. I would go with a TreeSet sorted according to time. Any problems with this approach are welcome.
You're saying your equality criterion will be the URLs of two PageVisitedRecord objects matching, but yet your tree ordering criterion will be the date. So two PageVisitedRecord objects could be equal (same URLs) and yet still be supposed to be inserted in different places in the tree. This won't work for a variety of reasons.
First, when you update the set with an equal PageVisitedRecord to something you already have in the tree (equal by your equality criterion -- the URLs match), how will the set search for this equality? The tree is sorted by date.
Second, even if this update did somehow work, the set might do something stupid. It might think that since you're not inserting a new element, it doesn't need to move the existing element. That might be bad because your tree is sorted by date and your new element has a new date that might need to be in a different place in the tree. Or the set might think that it can just ignore your update request, since inserting an element into a set that is already there should have no effect, in theory. Then the date might not get updated.
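The first problem can be demonstrated concretely. In this sketch (the PageVisit class and its URL-based equals are assumptions matching the scenario described above), a TreeSet ordered by timestamp never consults equals() at all, so two visits to the same URL both survive:

```java
import java.util.Comparator;
import java.util.TreeSet;

public class TreeSetPitfall {
    static class PageVisit {
        final String url;
        final long timestamp;
        PageVisit(String url, long timestamp) { this.url = url; this.timestamp = timestamp; }

        // Intended equality criterion: same URL means "same page"
        @Override public boolean equals(Object o) {
            return o instanceof PageVisit && ((PageVisit) o).url.equals(url);
        }
        @Override public int hashCode() { return url.hashCode(); }
    }

    // Insert two visits to the SAME url and report how many entries survive.
    static int duplicateCount() {
        // Ordered by timestamp: a TreeSet uses ONLY the comparator, never equals()
        TreeSet<PageVisit> history =
            new TreeSet<>(Comparator.comparingLong((PageVisit v) -> v.timestamp));
        history.add(new PageVisit("example.com/a", 100));
        history.add(new PageVisit("example.com/a", 200));
        return history.size();
    }
}
```

This is exactly the "comparator inconsistent with equals" hazard the TreeSet documentation warns about.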
It can't be a stack, since there you would need to pop each element.
For history, we simply need to update the entry for a URL, or add it if it's not already there.
I would use a heap, with each node containing the timestamp and the URL. Ordered as a max-heap on timestamp, the most recent URL (the one with the largest timestamp) will be at the top of the heap.
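In Java this can be sketched with a PriorityQueue; note that keeping the most recent visit (largest timestamp) on top requires a reversed comparator, i.e. max-heap order. One caveat: a heap only exposes its top element, so listing the full history in order would need repeated polls.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class HeapHistory {
    static class Visit {
        final long timestamp;
        final String url;
        Visit(long timestamp, String url) { this.timestamp = timestamp; this.url = url; }
    }

    // Max-heap on timestamp: the most recent visit sits at the head.
    private final PriorityQueue<Visit> heap =
        new PriorityQueue<>(Comparator.comparingLong((Visit v) -> v.timestamp).reversed());

    public void visit(long timestamp, String url) {
        heap.add(new Visit(timestamp, url));
    }

    public String mostRecent() {
        return heap.peek() == null ? null : heap.peek().url;
    }
}
```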
The actions the user can perform are:
- 1) Add a new URL in history (it would have the greatest timestamp).
- 2) See history
-> The user can browse today's history
-> The user can browse yesterday's history
-> The user can browse the history of a specific month
-> The user can browse the history of a specific year
- 3) The user can delete history (last day, last week, last month, all, a particular entry).
- 4) Search in browsing history (can use page's URL or page's title)
Maybe, to make the search more efficient, we want to find a way to get quicker access to pages that are visited more frequently.
For 2) and 3), I guess a skip-list would do it. The skip list would have 4 levels (days, weeks, months, years).
For 4), we could use a Generalized Suffix Tree.
The nodes from the skip list would reference nodes from the suffix tree and vice versa.
I think the best data structure is a hash map in association with linked lists. The hash map would hold name/value pairs of the form <Date, LinkedList*>, where for a given date the browsing details are added to that date's linked list. The addition is done at the head of the linked list, so that the most recently visited URL comes first.
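A sketch of this date-keyed design (the string date key is an assumption; any hashable date representation would do):

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

public class DateKeyedHistory {
    // date key (e.g. "2012-09-03") -> URLs visited that day, newest first
    private final Map<String, LinkedList<String>> byDate = new HashMap<>();

    public void visit(String date, String url) {
        // addFirst keeps the latest visit at the front of that day's list
        byDate.computeIfAbsent(date, d -> new LinkedList<>()).addFirst(url);
    }

    public LinkedList<String> historyFor(String date) {
        return byDate.getOrDefault(date, new LinkedList<>());
    }
}
```

One trade-off worth noting: a plain HashMap gives O(1) lookup for a single date but no ordering across dates, so a date-range query would need to enumerate the keys.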
Comments to it will be appreciated.
If we visit chrome://history/ we see that the browsing history is grouped by date. The immediate browsing history is of course separate from this. I think every browser exhibits more or less the same behavior.
So the data structure for storing the browsing history needs to be split into 2 categories:
1. Immediate - less than 6 hours
2. Long term - more than 6 hours
For the immediate history, we can use a stack. When entries age out, we can copy them into a BST where each node has a timestamp as its key and a URL as its value.
The long-term storage could also go into a hash map/hash table, where the timestamp is the key and the URL the value. Agreed, some timestamps could be pretty close to each other, but the hash function should be able to distinguish them down to the second.
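A sketch of the two-tier idea. One substitution made here: the long-term store uses a TreeMap (a BST, as the post's first suggestion) rather than a hash table, since hashing timestamps would lose the ability to answer range queries; the 6-hour window and second-granularity timestamps are assumptions from the post.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.SortedMap;
import java.util.TreeMap;

public class TwoTierHistory {
    static class Visit {
        final long timestamp;
        final String url;
        Visit(long timestamp, String url) { this.timestamp = timestamp; this.url = url; }
    }

    private static final long IMMEDIATE_WINDOW = 6 * 60 * 60; // 6 hours, in seconds

    private final Deque<Visit> immediate = new ArrayDeque<>();    // recent visits, stack-like
    private final TreeMap<Long, String> longTerm = new TreeMap<>(); // BST: timestamp -> URL

    public void visit(long now, String url) {
        immediate.push(new Visit(now, url));
        age(now);
    }

    // Move entries older than the window from the stack into the BST.
    private void age(long now) {
        while (!immediate.isEmpty() && now - immediate.peekLast().timestamp > IMMEDIATE_WINDOW) {
            Visit v = immediate.removeLast();
            longTerm.put(v.timestamp, v.url);
        }
    }

    // Range query over long-term history: visits with from <= timestamp < to.
    public SortedMap<Long, String> between(long from, long to) {
        return longTerm.subMap(from, to);
    }
}
```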
Browse through the WebKit open-source browser project. It will have the answer:
www.webkit.org
Mentioning at least a little bit about what strategy they use, instead of just inviting readers to look through a huge open-source project, would be a good idea.
I think it should be some implementation of MRU (most recently used). I don't know the data structure, though. Could it be a heap?
I will go with an array of linked lists.
Let's say my browser stores the history of the last 7 days; then there will be an array of length 7, and each element of the array is a linked list holding all the URLs accessed that day. The array is used like a circular queue: each day, the oldest linked list is deleted to make room for a new linked list for that day.
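A sketch of this rolling 7-day array. The "days since epoch" day numbering is an assumption; the modulo makes the array behave like the circular queue described:

```java
import java.util.ArrayList;
import java.util.List;

public class WeekHistory {
    private static final int DAYS = 7;

    // One slot per day; index = dayNumber % 7 makes the array circular.
    private final List<List<String>> days = new ArrayList<>();
    private long lastDay = -1;

    public WeekHistory() {
        for (int i = 0; i < DAYS; i++) days.add(new ArrayList<>());
    }

    // dayNumber: days since some epoch; on a new day, the week-old slot is recycled.
    public void visit(long dayNumber, String url) {
        int slot = (int) (dayNumber % DAYS);
        if (dayNumber != lastDay) {
            days.get(slot).clear(); // drop the stale list before reusing the slot
            lastDay = dayNumber;
        }
        days.get(slot).add(0, url); // newest first
    }

    public List<String> historyFor(long dayNumber) {
        return days.get((int) (dayNumber % DAYS));
    }
}
```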
What if we create a hash table where the key is a simhash of the domain name and the value is a map from page name to a list of access times (this gives us the flexibility to keep track of multiple access times)? I feel this will improve searching, and we don't keep redundant information as in the suggestions above.
Can someone comment on why this approach is not good? It looks good to me, but I am not confident.
This is a very complex question that needs a detailed analysis of the requirements and typical use cases. Only then can we properly weigh the different possible solutions. Below are just some simple thoughts off the top of my head:
- eugene.yarovoi, September 03, 2012

A simple idea is to keep 2 data structures. First, I'd have a doubly linked list-based queue sorted by date, with a node for each page access. Each node would have the timestamp, the URL, and a pointer to the node in the queue representing the next older access of that same URL. The queue would be capped to both a certain size and a certain time range, and I could then delete entries from the back of the queue as needed.
Second, I would have a TreeSet sorted alphabetically by URL domain name first and then by page URL within that domain name, mapping URLs to pointers to the node in the doubly-linked list that represents the most recent access to that page.
That way, if someone wanted to look at when all their accesses to a specific page occurred, it could be done very quickly. They'd be presented with URLs in alphabetical order and they'd select the right one, a lookup would be done and they'd get the record of the most recent access, and the record of the most recent access has a pointer to the next access, etc. Then if they wanted to see their *entire* history of visited sites, they could view the entire queue and it would be showing them a log of things they've visited in chronological order.
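A sketch of this two-structure design. Two simplifications as assumptions: a TreeMap stands in for the sorted URL index (sorted by full URL rather than domain-then-page), and the size/time capping of the queue is omitted for brevity:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.TreeMap;

public class IndexedHistory {
    static class Node {
        final long timestamp;
        final String url;
        Node olderSameUrl; // pointer to the next older access of the same URL
        Node(long timestamp, String url) { this.timestamp = timestamp; this.url = url; }
    }

    // Chronological log of all accesses, newest first (the queue).
    private final Deque<Node> log = new ArrayDeque<>();
    // URL -> most recent access of that URL, kept in sorted order (the index).
    private final TreeMap<String, Node> index = new TreeMap<>();

    public void visit(long timestamp, String url) {
        Node n = new Node(timestamp, url);
        n.olderSameUrl = index.get(url); // chain to the previous access, if any
        log.addFirst(n);
        index.put(url, n);
    }

    // All access times for one page, newest first, by walking the per-URL chain.
    public List<Long> accessesTo(String url) {
        List<Long> times = new ArrayList<>();
        for (Node n = index.get(url); n != null; n = n.olderSameUrl) times.add(n.timestamp);
        return times;
    }
}
```

The per-URL chain is what makes the "when did I visit this page?" query fast: one sorted lookup, then a walk along the chain, with no scan of the full log.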
The use of additional data structures is possibly desirable. For instance, maybe we'd store a pointer somewhere to a node that's a day old, a node that's 2 days old, etc., so certain timeslices of the history could be presented with the least amount of node traversals. Of course these pointers would have to be updated too, but background work is much better than having time during which the user is actively waiting for something to happen.
That's just if it's an in-memory structure. Another thought is that you need to persist this data structure. So maybe I'd rather use something that's more like a database (but local, without the whole remote server aspect that technologies like SQL Server have). If you used an indexed database table, you'd want at least an index on the date and an index on the URL.