Amazon Interview Question
SDE-2sCountry: India
Interview Type: In-Person
View Count could be a major factor. Question is what is the condition or factor to cause a post / product a trend? is it people view, click, people response? Based of that you can use priority queue at least for normal interview based question but in real time, this could blow up heavily
The following code is for hourly basis. And can be easily extended.
Check the items list on every 100 milliseconds to clean the older data.
To simplify the code names are used instead of product ids. But can be easily extended.
import java.util.SortedSet;
import java.util.TreeSet;
public class FindTrending {
private SortedSet<Item> items = new TreeSet<>();
public static void main(String[] args) throws InterruptedException {
FindTrending findTrending = new FindTrending();
RemoveAction removeAction = findTrending.new RemoveAction(findTrending.items);
Thread subThread = new Thread(removeAction);
subThread.start();
for(int i=0;i<100;i++){
findTrending.addItem("test"+i);
Thread.sleep(500);
synchronized (findTrending.items) {
System.out.println(findTrending.items);
}
}
subThread.interrupt();
}
public void addItem(String itemName){
Item item = new Item(itemName);
synchronized (items) {
if(items.contains(item)){
items.remove(item);
}
items.add(item);
}
}
class Item implements Comparable<Item>{
String name;
long time;
public Item(String name) {
this.name = name;
time = System.currentTimeMillis();
}
public Item(String name,long time){
this.name = name;
this.time = time;
}
@Override
public boolean equals(Object obj) {
if(obj instanceof Item){
((Item) obj).name.equalsIgnoreCase(this.name);
}
return false;
}
@Override
public String toString() {
return name;
}
@Override
public int hashCode() {
return name.hashCode();
}
@Override
public int compareTo(Item o) {
if(time-o.time > 0){
return 1;
}else{
return -1;
}
}
}
class RemoveAction implements Runnable{
private SortedSet<Item> items;
public RemoveAction(SortedSet<Item> items) {
this.items = items;
}
@Override
public void run() {
while(true){
synchronized (items) {
long currTime = System.currentTimeMillis();
long diff = currTime - 60*60*1000;
Item temp = new Item("Dummy",diff);
SortedSet<Item> headSet = items.headSet(temp);
headSet.clear();
}
try {
Thread.sleep(100);
} catch (InterruptedException e) {
}
}
}
}
}
I am a beginner so this might not be the most optimal solution. I was thinking about using a maxHeap to keep track of the products with largest counts.
MapHeap or a Deque would be a good solution if the requirement is to keep track of only the top trending product (1 product) since we need to keep track of Top K products, we will need to keep track of multiple top elements at the same time.
A sliding window is one of the solutions. I have explained the approach in another answer.
What about this solution.
We maintain a rolling hash map(productId,sumOfSoldQuantiesTillNow)
For each time slot, We store the map of (products,sumOfSoldQuantiesTillNow)
which can be retrieved later.
time hour1:
hour1_map={(p1,34),(p2,23),(p3,12),(p4,17),(p6,12)}
time hour2:
hour2_map=hour1_map+{(p1,3),(p2,14),(p3,8),(p4,34),(p5,12)}
time hour3:
Here, we can calculate the count of product sold of time_hour2 simply
by subtraction and take out top K
So, if I understand the problem description, you are given (timestamp, id) data stream and asked to compute the most frequent ids falling into the specified time window?
This is a well known problem. We need to know the size of the data stream that fits in the biggest time window of interest as compared to the available memory. If you need to compute the exact top-K heavy hitters it will require O(N) memory (it's proved that even exact top-1 requires O(N) space worst-case), where N is the biggest number of data points fitting into the time window. If you can satisfy this requirement, then it's one thing and the solution is relatively easy. On the other hand (I suspect this is what the interviewer wanted to hear) if the size of the stream is too huge and you can tolerate errors, then it calls for an approximation solution like lossy counting etc. see https://en.m.wikipedia.org/wiki/Streaming_algorithm#Frequent_elements
See also here https://cstheory.stackexchange.com/questions/19802/top-k-frequent-items-in-data-stream
I am considering the case where the selling rate is too high. I am assuming that the trending products are the products that are sold most.
As the selling rate is high, we need to process quicky, so I will divide it further into more queues for faster processing.
Q1->W1
Q -> Q2->W2
Q3->W3
.
.
.
.
In the above figure, Q is the real time selling data. Q1,Q2,Q3 are the queues we are further dividing into. W1,W2, W3 are the corresponding workers. I explained about the workers in the next para.
There will be one worker for every queue which consumes the data and inserts into a database say IntervalDB_2018_01_01_12:25to12:30. In the database IntervalDB.., I will be storing the product ids. For every 5 minutes we create a new IntervalDB. So each IntervalDB corresponds to one five minute interval as shown below.
IntervalDB_2018_01_01_12:25to12:30
IntervalDB_2018_01_01_12:30to12:35
IntervalDB_2018_01_01_12:35to12:40
.
.
.
Note : For increasing the write speed to IntervalDB.. , we can use DB sharding, which I explained here.
Here for multiple orders with same pId, I am inserting the pId instead of increasing the count of pId. Because in the case where we need to process multiple orders with same pids, processing gets slowed because of concurrency.
Once insertion is done in a 5 minutes intervalDB, we count the number of items sold in that interval for each pId and insert to another db say CountDB.
______________________
| CountDB |
|_____________________|
| pId | 5MinInterval | count |
------------------------------------
1 ...... 5
.
.
.
.
We can delete the old intervalDBs whose counting is done.
For finding the products that are sold most in an hour or day or month, we can query the CountDB.
There are 2 parts to this question
1. Keeping track of Top K products at all times
2. Data comes as a Stream. So we do not have access to all the data at once. So overall sorting is not possible.
Based on these requirements sliding window is a good option where we keep track of top K products each time a new data come in from a stream.
Note: Trending of products could be on any basis like "quantity sold" or "user ratings" etc.
Idea: We use a sliding window to keep track of Top K products, Within the sliding window, we sort the items based on their trends.
Example, assuming for a stream of products with "quantity sold" values coming as [10,5,13,2,21,8,3,40,12,15,19], what are the top 3 products
Sliding window = 3
Stream: 10 comes in
Window: [10]
Stream: 5 comes in
Window[10, 5]
Stream: 13 comes in
Window: [13, 10, 5]
Stream: 2 comes in
Window: [13, 10, 5] --> 2 is ignored as its not in Top K
etc.
For Sliding Window we can use TreeMap that can sort elements within the window internally. And for keeping it a fixed size, we can have our own implementation of TreeMap as follows:
Time Complexity: O(N)
Note: Sorting within TreeMap will be constant as regardless of the stream size, it will only sort K items in the Map.
Space Complexity: O(K) where K is the size of the window.
- Saurabh July 16, 2018