keighobadi
Explanation of chirangit's idea of a trie:
Preprocessing: insert all entries from the dictionary into a trie.
Algorithm: 1) Take characters one by one from the beginning of the given text and look up the sequence in the trie. As soon as you hit a leaf, or a character that cannot extend the current sequence, backtrack and take the latest valid word found in the trie.
- Note that here we take the longest possible word, e.g. "hereby" would be parsed as one word rather than "here" + "by".
2) Start a new search from the offset where the last valid word ended.
3) If no valid sequence can be found at some point, backtrack to the last time we had the option to choose a shorter word, and make a different choice.
Memoization: in order to avoid re-computing the same sub-problem during backtracking, memoize each offset where no choice of word leads to a successful split, so that any time we hit that offset again as the beginning of a word, we know it is not going to end in a successful split.
Runtime: O(n^2), since in the straightforward case we consume the given text once, but in the worst case we backtrack up to n times and re-process up to n characters each time.
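The steps above can be sketched as follows. This is an illustrative implementation, not chirangit's original; the class and method names are made up, and the set of "dead-end" offsets is the memoization from the note above.

```java
import java.util.*;

// Illustrative sketch of the trie-based word split described above.
class WordSplitter {
    // Minimal trie node: children keyed by character, plus an end-of-word flag.
    static class TrieNode {
        Map<Character, TrieNode> children = new HashMap<>();
        boolean isWord;
    }

    private final TrieNode root = new TrieNode();
    // Memoized offsets known to have no valid split (step "Memoization").
    private final Set<Integer> deadEnds = new HashSet<>();

    // Preprocessing: insert a dictionary entry into the trie.
    void insert(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.isWord = true;
    }

    // Returns the words of a valid split, or null if none exists.
    List<String> split(String text) {
        List<String> out = new ArrayList<>();
        return splitFrom(text, 0, out) ? out : null;
    }

    private boolean splitFrom(String text, int offset, List<String> out) {
        if (offset == text.length()) return true;
        if (deadEnds.contains(offset)) return false;  // known bad start
        // Walk the trie, collecting every dictionary word that starts at offset.
        List<Integer> wordEnds = new ArrayList<>();
        TrieNode node = root;
        for (int i = offset; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) break;  // sequence cannot be extended
            if (node.isWord) wordEnds.add(i + 1);
        }
        // Try the longest word first; backtrack to shorter choices on failure.
        for (int j = wordEnds.size() - 1; j >= 0; j--) {
            int end = wordEnds.get(j);
            out.add(text.substring(offset, end));
            if (splitFrom(text, end, out)) return true;
            out.remove(out.size() - 1);
        }
        deadEnds.add(offset);  // memoize the failed offset
        return false;
    }
}
```

For example, with the dictionary {"here", "hereby", "by", "there"}, splitting "herebythere" first tries the longest prefix "hereby" and succeeds with ["hereby", "there"].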
Java code implementation with a min-heap (PriorityQueue); O(N log M), where N is the total number of elements across all lists and M is the number of lists.
private static List<Integer> kWayMerge(List<List<Integer>> lists) {
    // Min-heap ordered by each list's current head element.
    // Integer.compare avoids the overflow risk of subtracting the heads.
    Queue<List<Integer>> queue = new PriorityQueue<>(
            (l1, l2) -> Integer.compare(l1.get(0), l2.get(0)));
    for (List<Integer> list : lists) {
        if (!list.isEmpty()) queue.add(list);
    }
    List<Integer> result = new LinkedList<>();
    while (!queue.isEmpty()) {
        List<Integer> minList = queue.remove();
        // remove(0) is O(1) on a LinkedList but O(k) on an ArrayList.
        result.add(minList.remove(0));
        if (!minList.isEmpty()) queue.add(minList);
    }
    return result;
}
Thanks for your answer; I'd like to propose some improvements:
- keighobadi, May 26, 2022
1) An in-memory DB, such as Redis, with 16 GB RAM can keep all counters for all topics. However, if the number of topics grows, we can use a Redis cluster with partitioned keys;
1.5) Introduce a queue of events; our compute servers will simply process the stream, aggregate counts for each batch, and update those aggregates in the DB. This is a huge game changer, because it reduces the number of updates to the DB. A few aggregators can process events from the queue at 1,000 RPS; if the queue builds up, we can simply add more aggregators.
If we use Kafka, we can set 10-20 partitions on the topic, so consumers can scale out easily to 10-20.
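The batching idea in 1.5) can be sketched as below. This is a minimal in-memory model, not a real Kafka consumer; the class name and `flush` interface are assumptions, and in practice `flush` would translate into one INCRBY per topic against Redis instead of one update per event.

```java
import java.util.*;

// Illustrative batch aggregator: counts events per topic in memory,
// then flushes one increment per topic instead of one DB update per event.
class BatchAggregator {
    private final Map<String, Long> pending = new HashMap<>();

    // Called once per consumed event.
    void onEvent(String topic) {
        pending.merge(topic, 1L, Long::sum);
    }

    // Returns the batch of per-topic increments to write to the DB, and resets.
    Map<String, Long> flush() {
        Map<String, Long> batch = new HashMap<>(pending);
        pending.clear();
        return batch;
    }
}
```

With batching every second at 1,000 RPS, the DB sees at most one update per active topic per second rather than 1,000 writes.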
2) So, "multiple machines" needs to be clarified: is it a handful or thousands?
3) With streaming introduced, compute is separated and results are kept in one or two in-memory DBs. A central server aggregates and updates the three counts on a moving window: the 5-minute count every minute, the 1-hour count every 5 minutes, and the 24-hour count every hour.
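One way the central server could maintain those moving-window counts is a ring of per-minute buckets, summed over the newest buckets for each window. This is a sketch under that assumption (for a single topic; names are illustrative):

```java
// Illustrative moving-window counter: keeps per-minute buckets for 24 hours,
// so the 5-minute, 1-hour and 24-hour counts are sums over the newest buckets.
class SlidingCounts {
    private static final int MINUTES = 24 * 60;
    private final long[] buckets = new long[MINUTES];  // index = minute mod MINUTES
    private int currentMinute = 0;

    // Add an aggregated count to the current minute's bucket.
    void add(long count) {
        buckets[currentMinute] += count;
    }

    // Called once per minute by a scheduler; clears the bucket being reused.
    void tick() {
        currentMinute = (currentMinute + 1) % MINUTES;
        buckets[currentMinute] = 0;
    }

    // Sum over the newest n buckets, e.g. n = 5, 60, or 24 * 60.
    long lastMinutes(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += buckets[(currentMinute - i + MINUTES) % MINUTES];
        }
        return sum;
    }
}
```

The scheduler would call `tick()` each minute and republish `lastMinutes(5)`, `lastMinutes(60)`, and `lastMinutes(24 * 60)` at the cadences described above.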