LinkedIn Interview Question
Senior Software Development Engineers
Country: United States
Interview Type: In-Person
You guys are nuts! :-)
The problem explicitly states the scale - Facebook/Twitter - so you should rightfully expect millions of events per hour. You should also understand that the topic frequency chart has a very long tail. Keeping these millions of tags in RAM (as all these heap-based solutions do) would kill you. Storing them on disk is an option (the DB-based solutions), but then computing and retrieving them becomes an offline job with some latency.
I guess the most appropriate solution is one of the streaming counting algorithms - see sticky sampling or lossy counting, for instance.
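For reference, here's a rough sketch of lossy counting in Java (the class and method names are mine, not from any library): each tag carries a count plus a maximum possible undercount, and at every bucket boundary tags that cannot be frequent are pruned, so memory stays bounded at roughly O(1/epsilon) entries instead of millions of tags.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch of the Lossy Counting algorithm (Manku & Motwani).
// Estimates undercount the true frequency by at most epsilon * N.
class LossyCounter {
    private final Map<String, long[]> entries = new HashMap<>(); // tag -> {count, maxError}
    private final int bucketWidth;   // ceil(1 / epsilon)
    private long n = 0;              // total events seen so far
    private int currentBucket = 1;

    LossyCounter(double epsilon) {
        this.bucketWidth = (int) Math.ceil(1.0 / epsilon);
    }

    void offer(String tag) {
        n++;
        long[] e = entries.get(tag);
        if (e != null) {
            e[0]++;
        } else {
            // new entry: its true count could be up to currentBucket - 1 higher
            entries.put(tag, new long[] { 1, currentBucket - 1 });
        }
        if (n % bucketWidth == 0) {           // bucket boundary: prune rare tags
            Iterator<Map.Entry<String, long[]>> it = entries.entrySet().iterator();
            while (it.hasNext()) {
                long[] v = it.next().getValue();
                if (v[0] + v[1] <= currentBucket) it.remove();
            }
            currentBucket++;
        }
    }

    long estimate(String tag) {
        long[] e = entries.get(tag);
        return e == null ? 0 : e[0];
    }
}
```

A hot tag survives pruning while the long tail is continuously discarded, which is exactly what makes this viable at Facebook/Twitter scale.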
This is a system design question, and most answers seem to be solutions for a single system. Here are the high-level design points:
1.) We need to keep the counts for 5 min, 1 hour and 24 hours in a rolling-window fashion.
2.) There are multiple machines; each machine keeps track of the counts of topics. At every delta interval, each machine sends its counts to a central server and resets its counts to zero.
3.) The central server merges the counts from all machines into a final list of topics and counts. At every delta interval, the central server takes a snapshot of the top N topics and counts in an array and resets its counter. N is a tunable parameter: higher N means more memory and more accurate metrics; lower N means less memory but less accurate metrics.
4.) There are three counters that keep track of topic counts for the last 5 mins, 1 hr and 24 hrs. As the central server takes a snapshot, it adds the counts to each of these counters and removes the counts of the array entry that is 5 mins/delta, 1 hr/delta or 24 hrs/delta intervals old from the corresponding counter, thus maintaining rolling counts for topics.
5.) At any point in time, the systems pull the topics and values from these counters to show the top trending topics for each duration.
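The rolling-window bookkeeping described in points 3-5 could look roughly like this (names such as RollingCounter and tick are illustrative, not from the post): the window is split into delta-sized buckets, and on each tick the oldest bucket's counts are subtracted from the running totals before it is reused.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of one rolling-window counter: the window holds a fixed number of
// delta buckets in a ring; tick() expires the oldest bucket and starts a new one.
class RollingCounter {
    private final Map<String, Long>[] buckets;          // one map per delta interval
    private final Map<String, Long> totals = new HashMap<>();
    private int head = 0;                               // bucket currently being filled

    @SuppressWarnings("unchecked")
    RollingCounter(int bucketsPerWindow) {
        buckets = new HashMap[bucketsPerWindow];
        for (int i = 0; i < bucketsPerWindow; i++) buckets[i] = new HashMap<>();
    }

    void add(String topic, long count) {                // merge a machine's delta counts
        buckets[head].merge(topic, count, Long::sum);
        totals.merge(topic, count, Long::sum);
    }

    void tick() {                                       // called once per delta interval
        head = (head + 1) % buckets.length;             // move onto the oldest bucket
        for (Map.Entry<String, Long> e : buckets[head].entrySet()) {
            totals.merge(e.getKey(), -e.getValue(), Long::sum);  // expire its counts
        }
        buckets[head].clear();
    }

    long count(String topic) {
        return totals.getOrDefault(topic, 0L);
    }
}
```

The 5 min, 1 hr and 24 hr counters would be three instances of this with different bucket counts; a delta of 1 minute gives 5, 60 and 1440 buckets respectively.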
--------------------- Alternate Approach ----------------------
This is a cleaner approach that gives statistically sound results, but the counts are not exact. The central server or load balancer can sample queries and increment the topic counter and the last-update time. There are three counters: one for 5 mins, one for 1 hr and one for 24 hrs. Each counter automatically removes a topic if its last-update time is older than the corresponding duration, thus continuously trimming the counts. The sampling rate can differ per counter, which is also a tunable trade-off between accuracy and performance.
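A minimal sketch of one such sampled counter, assuming lazy eviction on read rather than truly "automatic" removal (the sampleRate field, the millisecond clock parameter and the fixed random seed are my own choices for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Sketch of a sampled rolling counter: only a fraction of queries are
// counted, and topics not updated within the window are trimmed on read.
class SampledCounter {
    private static class Entry { long count; long lastUpdateMs; }

    private final Map<String, Entry> topics = new HashMap<>();
    private final double sampleRate;   // e.g. 0.01 = count 1 in 100 queries
    private final long windowMs;       // 5 min, 1 hr or 24 hr in milliseconds
    private final Random rnd = new Random(42);

    SampledCounter(double sampleRate, long windowMs) {
        this.sampleRate = sampleRate;
        this.windowMs = windowMs;
    }

    void observe(String topic, long nowMs) {
        if (rnd.nextDouble() >= sampleRate) return;      // drop unsampled queries
        Entry e = topics.computeIfAbsent(topic, k -> new Entry());
        e.count++;
        e.lastUpdateMs = nowMs;
    }

    long count(String topic, long nowMs) {
        Entry e = topics.get(topic);
        if (e == null) return 0;
        if (nowMs - e.lastUpdateMs > windowMs) {         // stale: trim it
            topics.remove(topic);
            return 0;
        }
        return e.count;       // estimate of true count: count / sampleRate
    }
}
```

In practice the raw count would be scaled back up by 1/sampleRate when reporting, and a background sweep could do the trimming instead of doing it on read.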
Thanks for your answer; I'd like to propose some improvements:
1.) An in-memory DB, such as Redis, with 16 GB of RAM can keep all counters for all topics. However, if the set of topics grows, we can use a Redis cluster with partitioned keys;
1.5.) Introduce a queue of events, and our compute servers will simply process the stream, aggregate counts for each batch, and update those aggregates in the DB. This is a huge game changer, because it reduces the number of updates to the DB. A few aggregators can process events from the queue at 1,000 RPS; if the queue builds up, more aggregators can simply be added.
If we use Kafka, we can set 10-20 partitions on the topic, so consumers can scale out easily to 10-20.
2.) So, "multiple machines" needs to be clarified: is it a handful or thousands?
3.) With streaming introduced, compute is separated out and results are kept in one or two in-memory DBs. The central server aggregates and updates the 3 counts in a moving window: the 5-min count is updated every minute, the 1-hour count every 5 minutes, and the 24-hr count every hour.
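The batch-aggregation idea in point 1.5 can be sketched like this; the "DB" is just a HashMap standing in for Redis (where flush would be a pipeline of INCRBY calls), and the queue-consumer loop is omitted:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the aggregator step: collapse a batch of raw events into one
// increment per distinct topic, so the DB sees far fewer updates.
class BatchAggregator {
    // Count occurrences of each topic within a single batch pulled off the queue.
    static Map<String, Long> aggregate(List<String> batch) {
        Map<String, Long> counts = new HashMap<>();
        for (String topic : batch) counts.merge(topic, 1L, Long::sum);
        return counts;                      // one entry per distinct topic
    }

    // Apply the batch's aggregates to the shared counter store in one pass.
    static void flush(Map<String, Long> counts, Map<String, Long> db) {
        // With Redis this would be a pipeline of INCRBY topic <count> commands.
        counts.forEach((topic, c) -> db.merge(topic, c, Long::sum));
    }
}
```

If a batch of 10,000 events contains only 500 distinct topics, the DB takes 500 increments instead of 10,000 - that is the reduction being described.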
Yes, this is more of a system design question than an algorithmic one. That said, I just wanted to add a few notes to the algorithmic answers already given.
I think both a max heap and a min heap of N tags (built from their counts) would be required.
So whenever a new tag comes:
a. Get the count of the tag.
b. Check the min heap's root in O(1) to see if it's worth adding, i.e. if it's really a top-N tag.
c. Abort if not.
d. If yes, add it to both the max and min heap (at the end, as each is a complete tree) and call heapify.
e. Now the max heap always has the top tag at its root, O(1).
Let me know if this can really be done with the help of 1 heap?
W.r.t. system design, I will leave it to someone else to answer, but I believe the heaps should be designed like a big cache residing in memory (for fast O(1) access - the strong use case). The challenge here is how to design such a distributed in-memory cache, asynchronously persisting the history of trending tags over time to some big-data store for later retrieval in case it's needed (slow access, not a strong use case).
Yes - not sure if everyone here realizes that you need to keep the data structure up to date with cutoffs at 5 min/1 hr/24 hrs. Using a heap or whatever doesn't automatically solve that issue. The calculation of what is "currently" trending (in those set intervals) needs to be done elsewhere, maybe once every minute or so, and then update the heaps or even an in-memory map somewhere for clients to read from.
Design a Cassandra database for this.
Query: SELECT original_posts WHERE shared_time > current_time - (5 mins / 1 hr / 24 hrs) ORDER BY numshares DESC
So design the database around this query.
CREATE TABLE posts_by_time_numshares (
    original_post_id uuid,
    original_post_title text,
    original_post_creator frozen<user>,
    shared_by frozen<user>,
    time_shared_at timestamp,
    numshares counter,
    PRIMARY KEY ((original_post_id), time_shared_at, numshares, shared_by)
) WITH CLUSTERING ORDER BY (time_shared_at DESC, numshares DESC, shared_by ASC);
So when the write happens, it will go to the Cassandra nodes, let's say with RF=3. We can read with CONSISTENCY QUORUM, so we have agreement from a majority of nodes on the numshares counter.
In my opinion, we can maintain a NoSQL DB (for better performance) for the users and the posts. So we will have a list of all the users in the network (let's say it's U) and a list of all the posts (let's say it's P).
Next we will have a p2u table which maps each post to the list of users that have shared it. And we need to update this table every 5 mins/1 hr etc. So, at every point in time the p2u table will look like:
p2u = { 'p1': [u1, u2, u3, u5], 'p2': [u4, u5] }
So, this dict can fairly well solve the problem.
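A toy illustration of the p2u idea above, assuming "trending" simply means the post with the most distinct sharers in the current window (the P2U class name is mine):

```java
import java.util.List;
import java.util.Map;

// Trending from a post -> sharers mapping: rank posts by share-list size.
class P2U {
    static String topPost(Map<String, List<String>> p2u) {
        String best = null;
        int bestShares = -1;
        for (Map.Entry<String, List<String>> e : p2u.entrySet()) {
            if (e.getValue().size() > bestShares) {   // more sharers = more trending
                best = e.getKey();
                bestShares = e.getValue().size();
            }
        }
        return best;
    }
}
```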
A max heap can be used in this scenario.
A basic trending system works off a #tag and its count, e.g. how many times it has been shared in a particular geography. You can go with a heap data structure when designing such a system. Say, for example, you want to find the top N trending keywords in your application/website: you can create a heap of N #tags based on count (it should be a max heap).
Algorithm:
> Create a max heap of size N. Time complexity to build the heap is O(N).
> For every new #tag, compare its count with the existing root's count; if it's greater than the root count, replace the root with this node (#tag, count) and call heapify to maintain the max-heap property. Heapify takes O(log N) time.
> Sort the heap if you're interested in ordering; it takes O(N log N) to sort the heap.
> At any time T, the heap contains only the top N trending keywords.
I think when we get a new tag we can't just compare with the root.
What if the number of times it has appeared is less than the root's count but greater than the 2nd/3rd highest tag's?
I suggest using a min heap: add n tags and build the min heap, then for each remaining tag compare with the root of the heap, replace it if current_tag.count > root.count, and call heapify. The same can be done whenever we get a new tag.
This way we always maintain a heap of the N elements with the maximum counts.
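The min-heap suggestion above can be done with a single java.util.PriorityQueue; this sketch assumes the per-tag counts are already available in a map (the TopN class name is mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Top-N selection with one min-heap of size N keyed on count: a new tag
// only displaces the root if its count exceeds the current minimum,
// which fixes the "compare only with the max-heap root" problem.
class TopN {
    static List<String> topN(Map<String, Integer> counts, int n) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> a.getValue() - b.getValue()); // min-heap
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (heap.size() < n) {
                heap.add(e);
            } else if (e.getValue() > heap.peek().getValue()) {
                heap.poll();                        // evict the current minimum
                heap.add(e);
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(0, heap.poll().getKey()); // highest first
        return result;
    }
}
```

Each of the M tags costs at most O(log N), so the whole pass is O(M log N) with only N entries held in the heap.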
Also, I think we could use a balanced BST, and whenever we want the trending topics, we could just do a reverse inorder traversal.
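For the balanced-BST idea, Java's TreeMap (a red-black tree) keyed on count already gives the reverse inorder traversal via descendingMap(); note a real version would need to handle duplicate counts, e.g. by keying on a (count, tag) pair:

```java
import java.util.List;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Reverse inorder over a balanced BST of counts: the hottest topics first.
class BstTrending {
    static List<String> trending(TreeMap<Integer, String> byCount, int n) {
        return byCount.descendingMap().values().stream()  // counts high -> low
                .limit(n)
                .collect(Collectors.toList());
    }
}
```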
import java.util.HashMap;
import java.util.Map;

// Time-keyed map: put(key, t, value) stores a value at time t, and
// get(key, t) returns the value with the greatest time <= t
// (a floor lookup over a BST of timestamps).
public class ImprovedHashMap {

    Map<Integer, TreeNode> map = new HashMap<Integer, TreeNode>();

    public static void main(String[] args) {
        ImprovedHashMap m = new ImprovedHashMap();
        m.put(4, 1, 3);
        m.put(4, 0, 4);
        m.put(4, 3, 16);
        m.put(4, 5, 17);
        m.put(4, 2, 12);
        int x = m.get(4, 0.6f);
        System.out.println("x is " + x); // the floor of t=0.6 is t=0, so x is 4
    }

    public int get(int key, float t) {
        TreeNode root = map.get(key);
        if (root == null) {
            return -1; // no values stored for this key
        }
        return searchBST(root, t, -1, Float.NEGATIVE_INFINITY);
    }

    // Walk down the tree keeping the best floor candidate seen so far.
    public int searchBST(TreeNode root, float t, int closestMatch, float closestT) {
        if (root == null) {
            return closestMatch;
        }
        if (root.t > t) {
            return searchBST(root.left, t, closestMatch, closestT);
        } else if (root.t < t) {
            if (root.t > closestT) { // closer to t from below than the current best
                closestMatch = root.value;
                closestT = root.t;
            }
            return searchBST(root.right, t, closestMatch, closestT);
        } else {
            return root.value; // exact match
        }
    }

    public void put(int key, float t, int value) {
        TreeNode n = new TreeNode(value, t);
        TreeNode root = map.get(key);
        if (root == null) {
            map.put(key, n);
        } else {
            insert(root, n);
        }
    }

    public static void insert(TreeNode root, TreeNode child) {
        if (root.t > child.t) {
            if (root.left == null) {
                root.left = child;
            } else {
                insert(root.left, child);
            }
        } else if (root.t < child.t) {
            if (root.right == null) {
                root.right = child;
            } else {
                insert(root.right, child);
            }
        } else {
            root.value = child.value; // same timestamp: overwrite the stored value
        }
    }

    private static class TreeNode {
        float t;
        int value;
        TreeNode left;
        TreeNode right;

        public TreeNode(int value, float t) {
            this.value = value;
            this.t = t;
        }
    }
}
Data structure: BoundedPriorityQueues - one for 5 mins, one for 1 hr and one for 24 hrs.
- ss January 10, 2016