Amazon Interview Question
SDE1s
Team: Advertising
Country: United States
Interview Type: In-Person
It seems like one issue is sharing the same set of ad IDs across keys (queries). If the intersection of the lists across all the queries is a large set, then reducing duplication would be important. If you maintain a list<long> for each query, you could have a large number of duplicate ad IDs across lists. A list<long*> doesn't save space, since each pointer is 64 bits pointing at a 64-bit long. But with a list<short>, each short is a key into a map that contains the associated 64-bit ad ID, so each distinct ID is stored only once.
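The list<short> indirection above can be sketched roughly as follows. This is a minimal illustration (class and method names are my own, not from the original): a pool stores each distinct 64-bit ad ID once, and per-query lists would hold 16-bit indices into the pool instead of full longs.

```java
import java.util.*;

// Sketch of the ad-ID interning idea: each distinct 64-bit ad ID is
// stored once in a pool; per-query lists hold 16-bit indices into
// the pool instead of full 64-bit values.
public class AdIdPool {
    private final List<Long> idsByIndex = new ArrayList<>();
    private final Map<Long, Short> indexById = new HashMap<>();

    // Intern an ad ID, returning its 16-bit pool index.
    public short intern(long adId) {
        Short idx = indexById.get(adId);
        if (idx != null) return idx;
        if (idsByIndex.size() > 0xFFFF)
            throw new IllegalStateException("pool full: more than 65536 distinct ad IDs");
        short newIdx = (short) idsByIndex.size();
        idsByIndex.add(adId);
        indexById.put(adId, newIdx);
        return newIdx;
    }

    // Resolve a 16-bit index back to the 64-bit ad ID.
    public long resolve(short index) {
        return idsByIndex.get(index & 0xFFFF); // mask handles negative shorts
    }
}
```

The obvious caveat: a 16-bit index caps the pool at 65,536 distinct ad IDs, so this only pays off when the shared set is that small; otherwise an int index gives the same deduplication at half the savings.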
Yes, definitely. It could be partial. Depending on the constraints/requirements of your application, a search query may return a huge number of ad IDs. In order not to fill up the in-memory cache quickly, we use pagination. Pagination introduces two extra parameters to the system, limit and offset (naming can differ): limit is the number of elements returned from the query at a time (ad IDs in this case), and offset is where the search should continue from.
Having said that, let's walk through an example.
<HOST>/search?q="Comedy"&offset=0&limit=25 => this will return the first 25 ad IDs from this search query
<HOST>/search?q="Comedy"&offset=25&limit=25 => this will return the next 25 ad IDs for the same search query.
(Think of offset and limit as automatically populated by the UI.)
These two queries will stay in the cache as separate entries. Whenever the same search query is requested, the same cached result will be delivered.
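For those separate entries, the cache key has to combine the query with its pagination parameters; otherwise page 1 and page 2 of "Comedy" would collide. A minimal sketch of such a key (class name is my own, not from the original):

```java
import java.util.Objects;

// Hypothetical cache key combining query, offset, and limit, so that
// different pages of the same search query map to distinct entries.
public final class SearchKey {
    final String query;
    final int offset;
    final int limit;

    SearchKey(String query, int offset, int limit) {
        this.query = query;
        this.offset = offset;
        this.limit = limit;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof SearchKey)) return false;
        SearchKey k = (SearchKey) o;
        return offset == k.offset && limit == k.limit && query.equals(k.query);
    }

    @Override public int hashCode() {
        return Objects.hash(query, offset, limit);
    }
}
```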
The question then becomes: when does the result of a search query get refreshed in the cache if it is accessed all the time; in other words, if LRU hasn't discarded it for a long time?
To cover this scenario, I'd add a TTL (time to live) to the solution I provided earlier. A single TTL can be used globally, or a TTL per search-query entry. Let's consider a global TTL. Say we want our customers to see newly added ads right away. We can keep the creation time per search-query entry in the cache, and once the TTL expires we can easily invalidate the search result and remove it from the cache. [removeEldestEntry is a good place to do this] When the same search query is requested again, the fresh result is populated into the cache and used for the next TTL period.
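The global-TTL idea can be sketched like this (a minimal illustration, names are my own): each cached value carries its creation time, and a lookup that finds an expired entry removes it and returns null, forcing the caller to re-run the search and repopulate the cache.

```java
import java.util.*;

// Sketch of a global TTL on cache entries: each value is stored with
// its creation timestamp; lookups treat entries older than ttlMillis
// as absent and evict them.
public class TtlCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long createdAt;
        Entry(V value, long createdAt) { this.value = value; this.createdAt = createdAt; }
    }

    private final long ttlMillis;
    private final Map<K, Entry<V>> map = new HashMap<>();

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public void put(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis()));
    }

    // Returns null when absent or expired; an expired entry is removed,
    // so the next put() stores a fresh query result.
    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() - e.createdAt > ttlMillis) {
            map.remove(key);
            return null;
        }
        return e.value;
    }
}
```

In the answer's actual design the expiry check would live alongside the LRU bookkeeping (e.g. inside removeEldestEntry, as suggested above) rather than in a standalone class; this sketch just isolates the TTL mechanics.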
Implementation of LRU (least recently used) cache
I kept the cache size small for simplicity. Over time there would be more readers than writers of the cache, so if multiple readers can read the cache concurrently (as far as the locking scheme allows), cache performance improves. We can therefore choose either ConcurrentHashMap (lock striping) or a ReentrantReadWriteLock, as coded below.
However, if we choose ConcurrentHashMap, we lose the LRU capability that comes in handy with LinkedHashMap.
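The referenced code isn't shown, so here is a minimal sketch of what such an LRU cache could look like: a LinkedHashMap in access order with removeEldestEntry for eviction, guarded by a ReentrantReadWriteLock. One subtlety worth noting: with accessOrder=true, get() reorders entries internally, so it must take the write lock; the read lock only helps for genuinely non-mutating calls such as size() or containsKey().

```java
import java.util.*;
import java.util.concurrent.locks.*;

// Minimal LRU cache sketch: LinkedHashMap with accessOrder=true moves
// each accessed entry to the tail, and removeEldestEntry evicts the
// head (least recently used) once capacity is exceeded.
public class LruCache<K, V> {
    private final int capacity;
    private final LinkedHashMap<K, V> map;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public LruCache(int capacity) {
        this.capacity = capacity;
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > LruCache.this.capacity;
            }
        };
    }

    public V get(K key) {
        lock.writeLock().lock(); // get() mutates access order, so it needs the write lock
        try { return map.get(key); } finally { lock.writeLock().unlock(); }
    }

    public void put(K key, V value) {
        lock.writeLock().lock();
        try { map.put(key, value); } finally { lock.writeLock().unlock(); }
    }

    public int size() {
        lock.readLock().lock(); // size() doesn't mutate, so readers can share
        try { return map.size(); } finally { lock.readLock().unlock(); }
    }
}
```

Because nearly every operation ends up behind the write lock, the read/write split buys less here than the comment above suggests; that trade-off is exactly why ConcurrentHashMap looks tempting despite losing the built-in LRU ordering.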
- Hope September 26, 2014