Google Interview Question
Software Engineer / Developer
Country: United States
Interview Type: In-Person
I think they would expect a solution that spreads the work more uniformly among the different servers, which is hard with these constraints.
However, I think you have the merit of getting the right result. +1
I am not sure if this is going to work all the time. Please correct me if I am wrong.
X:Y means URL X has a count of Y.
We want the top 2 URLs.
M/C A=> 1:100 2:96 7:98
M/C B=> 3:99 5:97 7:2
M/C C=> 4:98 6:95 7:2
1st Step
A=>1,2
B=>3,5
C=>4,6
2nd Step
Top two after merging
1,3 Urls are selected
3rd Step
Threshold T = 99 (the count of selected URL 3)
4th Step
Cutoff = T/S = 99/3 = 33
A=>1,2,7
B=>3,5
C=>4,6
5th Step
Merging will again give us 1 and 3, when in fact URL 7 has the highest count.
Mithun, I think you're missing a step.
Step (2) says you merge the sets and *request* counts from all servers. This will give you the score of 102 for Url 7.
Mithun, you made a mistake in your example and I think you misunderstood the suggested approach.
M/C A=> 1:100 2:96 7:98 , so 1 and 7 would be the top 2.
Anyway, even if we change it to
M/C A=> 1:100 2:96 7:94
M/C B=> 3:99 5:97 7:4
M/C C=> 4:98 6:95 7:4
Where 7 is the most visited site as you intended, the merging in 5th step implies that "For all URLs in the merged set calculate correct values by asking
all servers for their scores for each URL", so we would get 102 as the count for url 7.
Oh OK, got it. So in the worst case, if all URLs satisfy score >= T/S, all the URLs will be sent to the central server for merging, right?
The approach is correct; the only issue is that it will not scale, and for a large number of machines the network traffic may be gargantuan. Consider an example with the top-ten counts in the range of one million and the number of servers in the 10k range (not a big number at Amazon or Google scale): you will eventually ask for every URL with count >= T/S, which is 100 in this case. So you end up sending far more data than is actually needed, since you will be sending URLs with counts anywhere between 100 and one million. The central node running this algorithm would also be a bottleneck. As I said, the solution is correct but not scalable.
Thanks, vik, for your thoughts on scalability. I think this shows how open-ended the question actually is. Without more knowledge about the topology of the machines, datacenters, loadbalancers, etc. involved, it is not possible to prove the quality and scalability of the algorithm in real life. A few things I would suggest discussing:
- is it a typical weblog file? Typical weblogs have a steep declining start and a long flat tail, so the number of accesses at the tenth rank is usually high: for such a distribution, this algorithm is not bad.
- how biased are the loadbalancers? If the loadbalancers have a high degree of randomness built in, then the differences between the machines are statistically already similar enough that the first pass is quite good and the second pass is not much work.
- can clusters of machines be pre-merged? If racks or lines of racks can have a logserver that collects the full logs for the rack or line, then the number of servers to contact can be drastically reduced.
- how often are these calculations repeated? Which similar questions need to be answered? How often? How exact must the numbers be? How up-to-date? Each answer would influence the program and tuning we want to set up to produce fast and good numbers.
Consider this example. Let's find the top 1 URL for now; we can extrapolate the question to the top 10 URLs too.
3 Servers
Server 1 : URL A-> 2 URL G->1
Server 2 : URL B->2 URL G -> 1
Server 3 : URL C -> 2 URL G -> 1
Wouldn't the above algo give URL A, B or C as the top visited URL, whereas the actual top URL should be G?
Hi Izaaz, in your example the first pass finds the critical threshold T to be 2. The second pass would then calculate T=2 divided by S=3 and ask all servers for all URLs that have a score >= 2/3. In other words, it would merge the complete set of URLs and thus get URL G and the sum of the accesses to it in the second pass.
Sorry if I'm missing something, but I don't think it's going to yield the correct result by selecting the top N, with or without a threshold. Consider the following case:
Server A
Y1 - 11
Y2 - 11
Y3 - 11
Y4 - 10
Y5 - 10
Y6 - 9
Y7 - 9
Y8 - 9
Y9 - 9
Y10 - 9
…
G1 – 4
Server B
M1 - 12
M2 - 12
M3 - 11
M4 - 11
M5 - 10
M6 - 10
M7 - 10
M8 - 10
M9 - 10
M10 - 10
…
G1 - 9
The threshold is T/S = 10 / 2 = 5.
In the second pass, G1 on server A won't be included in the tally. In fact, G1 with 13 total visits could be the top 1, but it does not even get into the top 10 with this method. Am I missing something?
@aka777, G1 is not part of the top of server A but it will be part of the top of server B with visits >= 10 / 2. So, the algorithm will ask for all G1 occurrences in other servers and it will correctly put this as top1.
it will yield the correct result, with possibly the cost of multiple rounds.
Thanks, Miguel. I got it.
Theoretically, the algorithm will yield correct result. But Google has more than one million servers. I don't know how this is gonna work out. (Threshold is gonna be very low like vik said.)
- Can you follow the thought that every URL with fewer than T/S hits on all servers cannot be among the top ten anymore? The rationale: S * (T/S) equals T, so if a URL has slightly less than T/S on every server, all of its hits together sum to less than T.
- If so: what about a URL that has fewer than T/S hits on a single server? Sure, it can still be among the top ten, but only if the same URL has more than T/S hits on a different server. Right?
- If so: we will get hold of such a URL on the other server, where it has more than T/S hits. Either we already have it from the first pass (1), or we pick it up in the second pass (4).
- The exactness of the hit counts is guaranteed in steps (2) and (5), after we have determined candidate URLs: in the first pass just a minimal seed of URLs, in the second pass all URLs that might qualify.
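The S * (T/S) = T argument above can be checked mechanically: a URL scoring below T/S on every one of the S servers cannot total T or more. A minimal check (the numbers are arbitrary):

```python
# Sanity check of the T/S bound: if a url has fewer than T/S hits on
# every one of the S servers, its grand total is below S * (T/S) = T,
# so it cannot displace any of the current top ten.
S, T = 3, 99               # servers, and the tenth-best score from pass one
cutoff = T / S             # 33.0
per_server = [32, 30, 32]  # all strictly below the cutoff
assert all(c < cutoff for c in per_server)
assert sum(per_server) < T  # 94 < 99: provably not a top-ten url
print(sum(per_server), "<", T)
```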
What am I missing here? Isn't the solution as simple as having the master ask each server for its 10th URL and count, computing the top out of those (call it U), and then in a second round asking all servers for all URLs with count greater than that of U and finding the top 10 out of those? Why has such a complicated solution been upvoted so much? I'm honestly asking: am I missing something?
the overall most viewed URL could potentially be the 11th in all servers. So that approach would ignore it.
I understand the question now. Thanks. Got an interview with Google in a week, I better start paying more attention :)
What is the reason/idea behind the second pass? Why are we selecting T/S and probing all servers again?
The constraints puzzle me a bit, especially the "using MapReduce directly, is not allowed" one. I would try to discuss what that means exactly in an interview.
I'll give another shot at the question:
Denote N as the number of computers in our network.
1) Pick a good string hash function. This function must take urls as input and produce hash values uniformly in the range [0, MAX_HASH_VALUE]
2) Divide the hash range [0, MAX_HASH_VALUE] into N intervals with equal length MAX_HASH_VALUE / N, and assign each interval to 1 computer in the network, e.g.
CPU_1) [0, length-1]
CPU_2) [length, 2*length-1]
CPU_3) [2*length, 3*length-1]
...
CPU_N) [(N-1)*length, MAX_HASH_VALUE]
3) Each computer computes the hash values of its list of urls and sends each url with its visit count to the computer responsible for that url's hash.
4) Each computer receives a list of url->visits entries for the urls in its hash interval. Now it combines the visits per url and produces its top 10.
5) Each computer sends its top 10 to a central computer. This central computer will receive 10*N urls and compute the overall top 10.
Due to our good hash function, we expect that each computer will receive roughly the same amount of urls to process in the 4th step. I think this approach is the best we can do to distribute the work among the cluster.
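A minimal simulation of steps 1-5, with the N computers stood in by plain dicts (the hash function, N, and the sample data are illustrative assumptions, not part of the question):

```python
# Sketch of the hash-partitioning scheme above: step 3 routes each url to
# the owner of its hash interval, step 4 combines counts per owner, and
# step 5 merges the owners' local top 10s at a central node.
import hashlib
from collections import Counter

N = 4  # number of computers in the network

def bucket(url: str) -> int:
    """Step 3: map a url to the computer that owns its hash interval."""
    h = int(hashlib.md5(url.encode()).hexdigest(), 16)
    return h % N  # equal-length intervals == modulo for a uniform hash

# Per-computer logs: url -> local visit count (the given input data)
logs = [
    {"a": 2, "g": 1},
    {"b": 2, "g": 1},
    {"c": 2, "g": 1},
    {},
]

# Steps 3-4: every computer forwards (url, count) to the owner; owners combine
owned = [Counter() for _ in range(N)]
for log in logs:
    for url, count in log.items():
        owned[bucket(url)][url] += count

# Step 5: each owner reports its local top 10; the central node merges them.
# Because each url lives on exactly one owner, the merge is exact.
candidates = Counter()
for owner in owned:
    candidates.update(dict(owner.most_common(10)))
top10 = candidates.most_common(10)
print(top10)  # g is correctly counted as 3
```

Note how this recovers the "URL G" counterexample from earlier in the thread: g's three scattered single visits all land on one owner, so it correctly beats a, b and c.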
About the constraints, they're not very clear.
a) This is sending all urls over the network but not to a single computer. In order to produce an exact result, I think all approaches end up sending all urls over the network in the worst case. Again, we would need to discuss with the interviewer if this is ok (the "especially sending all of them to a central server" part).
b) This is similar to MapReduce. I think that by saying "using MapReduce directly is not allowed", the interviewer meant that we have to give a detailed explanation about how the work is distributed among the network of computers, instead of just saying "the MapReduce framework will distribute and combine the work for us".
I think this is the only reasonable solution. The word "especially" in the phrase "especially to a central computer" seems to imply that sending the maps in a distributed manner might be acceptable, which is the only constraint this solution violates. It is the only solution that produces the correct result and is guaranteed (modulo the goodness of the hash function) to not send all the urls to the same server in the worst case.
I think so as well.
Someone -1'd all my replies to this and a few other threads, for some reason, and I guess this reply was kind of forgotten.
As you said, you are sending all of the urls over the network, which I guess the question rules out. It works, but it doesn't conform to the constraints.
This is the best solution I can come up with. It effectively distributes the computation and traffic over the network, and it does not need a central server.
Your solution assumes that each url will exist in exactly one server in the network. The original question addresses the problem where each url can reside on multiple servers, with different visit counts in each of them.
You misunderstood what I wrote. That's the point of steps 3 and 4: equal urls will be sent to the same machine (step 3) and combined there (step 4).
Step 3 makes one major assumption:
Each server needs to know how to talk to ALL the other servers in the network. If we have 1M CPUs in the network, this information is big and also hard to maintain. Something similar happened with domain name servers, and they evolved: earlier, every node knew about every other node; later they evolved to only know about some nodes and established a hierarchy. IMO, it would be fruitful to think in that direction as well.
I don't think any of the suggested solutions are right. It is possible to imagine a situation where some url is ranked number 11 on every box, and so has very high visits overall, while every url in the top 10 on each individual box is seen nowhere else, so has low visits overall.
That said, I don't have any better ideas. This problem is hard!
The first pass gives you a lower bound. In the second pass, you will include that 11th url in your candidate set, since its overall visits is certainly greater than the 10th url found in the first pass.
Anon is correct. The second pass will not help. Take the following situation:
3 servers:
SERVER A:
URL #1 - #10 : 2 counts
URL #0 : 1 count
SERVER B:
URL #11-#20: 2 counts
URL #0: 1 count
SERVER C:
URL #21-#30: 2 counts
URL #0: 1 count
now the first pass will find URLs #1-#30, but URL #0 will be completely missed. Yet it is the winning URL, with 3 counts.
Hence the whole strategy is flawed.
langeolli: in the two-pass solution I described, you missed step (4):
Ask every server for all its top N that satisfy score >= T/S
In your example, T=2 and S=3, so the threshold is < 1 and consequently all three servers will have to deliver their full logfiles and the result will find URL #0 to have 3 hits.
I understand that with the given constraints a trivial solution is not possible, but I think we have to consider the scenario where one URL among the actual top 10 is visited on most of the machines, but only a few times on each, which keeps it out of the individual top-10 lists of every machine.
Does that sound right?
I agree: this example would fail. (Quoting the example above: 3 servers, where URLs A, B and C have 2 visits each on a single server and URL G has 1 visit on every server; the algo would give A, B or C as the top URL, whereas the actual top URL is G.)
You could repeatedly request further pages of each server's top urls, using paginated sets, until you have ten urls and the smallest value in the top-10 list is no longer less than the highest value returned by any server. For example:
page = 1;
max = 10;
highest = Infinity;
while ((top.length < max || top.smallest < highest) && page < maxpages) {
    highest = -1;
    for (server in farm) {
        temp = getTop(server, max, page);
        merge(top, temp, max);
        if (highest < temp.top) {
            highest = temp.top;
        }
    }
    page++;
}
Merge would keep no more than max values based on unique url and highest visits.
I would recursively group the servers into sets of two and aggregate the URL's .
Suppose there are 6 servers.
a) Group s1s2, s3s4, s5s6.
b) We know the entire map cannot be transmitted. Find out a safe message size for the network, say n. Break the map in s1 into chunks of size n and send them over to s2.
Similarly from s3-s4, s5-s6.
c) This is the tricky part. The question says we cannot do map-reduce directly. Does that mean map-reduce on the entire set? Is it allowed for individual machines? It would be silly to solve this without ever getting a count of the URLs, so if map-reduce is not allowed, write a procedure to sort the urls and track each one's count. This is done on s2, s4 and s6.
d) Now, again group the machines. This time {s2s4}, {s6}.
e) Repeat b, c . We will have s4, s6 left. Transmit from s4-s6 and perform a final count.
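The merge tree in steps a-e can be sketched by reducing the servers two at a time until one aggregated count survives; the "safe message size" chunking of step b is elided here, and the data is made up:

```python
# Pairwise merge tree: neighbours s1+s2, s3+s4, s5+s6 combine in each
# round, then the survivors are regrouped and merged again (steps d-e).
from collections import Counter

servers = [
    Counter({"x": 5, "y": 1}), Counter({"x": 4, "z": 2}),
    Counter({"y": 6}),         Counter({"z": 3, "x": 1}),
    Counter({"y": 2}),         Counter({"w": 7}),
]

def merge_round(nodes):
    """One round: merge neighbouring pairs into single aggregated counts."""
    merged = []
    for i in range(0, len(nodes), 2):
        combined = Counter()
        for node in nodes[i:i + 2]:
            combined.update(node)   # "send the chunks over and count"
        merged.append(combined)
    return merged

while len(servers) > 1:             # regroup and repeat until one remains
    servers = merge_round(servers)

print(servers[0].most_common(3))
```

With 6 servers this takes 3 rounds; in general the tree needs about log2(number of servers) rounds, which is the appeal over a single central merge.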
Let's assume we can store counts for k urls on the central machine. If the corpus is of size n, we get the top n/k elements from each server and send them to our central machine.
So, on the central machine we keep a set of k counters, each initialized to 0. As we get data from the stream, for every url that already has a counter, we increment it. If not:
case 1: if there is room on the central machine to add the url, we add it and set its count;
case 2: if not, we decrement the count of every url we have on the central machine, and delete any url whose count drops to 0.
Now, the moment we hit case 2 above, we record the max_count_so_far and take a snapshot of the top 10 elements.
We process the next set of top n/k elements from the machines and for every max_count_so_far elements we take a snapshot of the top 10 elements.
At some point, say after we have 10 such snapshots, we find the final top 10 elements from the snapshots we have so far
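The k-counter scheme described above is essentially the Misra-Gries frequent-items algorithm. A minimal sketch (the stream below is made up):

```python
# Misra-Gries sketch of the k-counter scheme: keep at most k counters;
# increment a tracked url, start tracking an untracked one if there is
# room, otherwise decrement every counter and evict the zeros. Any url
# with frequency > len(stream)/k is guaranteed to survive, but exact
# counts still need a verification pass afterwards.
def misra_gries(stream, k):
    counters = {}
    for url in stream:
        if url in counters:
            counters[url] += 1           # tracked: just increment
        elif len(counters) < k:
            counters[url] = 1            # case 1: room left, start tracking
        else:
            for u in list(counters):     # case 2: decrement all counters
                counters[u] -= 1
                if counters[u] == 0:
                    del counters[u]      # evict urls that reach zero
    return counters

stream = ["a"] * 6 + ["b"] * 4 + ["c", "d", "e"]
result = misra_gries(stream, k=3)
print(result)  # the heavy hitters "a" and "b" are retained
```

The retained counts are lower bounds, not true frequencies, which is why the snapshot/verification idea above is still needed.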
On each server, we sort the urls based on their frequencies.
Say the total number of log file lines across all servers is N, the number of servers is s, and the capacity of a server is k. Now split the ranked urls into groups of k/s elements each on every server. Label each group; there will be N/(k/s) ids in total.
Now, from this set of ids, we randomly select s ids (i.e. s groups) such that each id doesn't occur more than a threshold number of times on one machine. (To keep it simple, let's say threshold = 1.)
Now, we employ the following algorithm on the central machine:
If there is size in our central machine to add the new group, we add it and set the count of each of the elems in the group. If not, we delete the count of every group/element in the group. We delete all urls that have a count of 0.
Now, one might argue that we might end up spending too much time on low-frequency entries. We could then employ multiple iterations of elimination: in the first pass, only consider frequencies above a certain threshold, say above the median of the frequencies on each server, divide those urls into groups, and consider a random group from each server.
Have a centralized counter. Every time a new hit is recorded, recalculate the centralized count for that url with a properly synced data structure, then compare it with the top-10 stack, popping out the entries smaller than the current count. Keep updating the top-10 stack on every new hit, and you can always query it for the top 10.
Let's say the number of servers is N. I think the solution requires 10 passes among the top ten scorers of each server. In each pass you can only identify one of the top ten. After each pass, the URL selected in the previous pass must be excluded from evaluation in the subsequent passes, and the top ten scores from each server must be updated.
Think of the first pass. When all the top ten lists are considered, there must be at least one URL among these top tens which will land into the "real" top ten list. It is possible to generate scenarios where 9 of the real top ten list do not appear in the "current" top ten lists. But, there has to be at least one URL in the real top ten list which also appears in the 10N URLs that are collected from the servers' current top ten lists. Note that this is true only when N>=10. If N<10, all the real top ten list may not appear in the 10N URLs.
Also note that after collecting the 10N URLs, for some of them, you will have to ask the servers their frequencies so that you can sort the 10N URLs properly. Because, you want to sort the sum of visit frequencies of these URLs at each server.
After 10 such passes the real top ten list will emerge. The messaging complexity of each pass could be as bad as N^2 since you may have to collect the frequency values for each one of the 10N URLs. But, the computational complexity of each step is O(N) since you only need to find the max of 10N URLs.
If we were to design an algorithm that finds not just the top 10 but the top K list, we would have to collect the top N scorers from each server. So, rather than 10N, we would bring together N^2 entries in each pass. If K>N, this design is preferable.
Here's a counter example where the top 3 does not appear in the top 3 of any of the 3 servers:
s1= { a:10, b:9, c:8, d:7, e: 6, f: 5 },
s2= { g:10, h:9, i:8, d:7, e: 6, f: 5 },
s3= { j:10, k:9, l:8, d:7, e: 6, f: 5 }
the top 3 is {d: 21, e: 18, f: 15 }, so that approach also does not guarantee 100% correctness
Good example..
In the first pass, it will find a (or g or j). In the second pass, d will be in the first three of s1 since a is kicked out of s1's top three. So, the second pass will find d. The third pass will, in turn, find e. I can see, however, that f will not be included in the top three list after three passes.
We can update the algorithm as follows:
After a pass, if a new fellow is selected that places into an unexpected spot (i.e., item found in the i th pass places into a spot < i th place in the top items list), then we increase the total number of passes by one.
This should do it..
If not, I think making N^2 passes should do it.
Actually, even top N^2 may not do it.. :) Here is one more update to the algorithm I proposed:
Find the max value among all the lists: This should be easy to do. Findmax for all lists. Let's say that the max value is M.
Then, apply the iterative algorithm (with the extra check I described in my prev post) among the top lists of each server such that the top lists include all the values greater than or equal to M/N.
We partition the work by using a hash on the url. The hash gives the address of an aggregator that is responsible for counting a subset (as defined by the hash) of urls. Then we crawl through the aggregators and fill the top-ten list. As each url is mapped to exactly one aggregator, the crawler only has to maintain a map with ten entries.
Step 1) send table with aggregator addresses: say we use 10k aggregators
Step 2) on individual machine count all urls
Step 3) on each individual machine, iterate through the counted urls; the hash key modulo 10k gives you the address of the responsible aggregator -> send the count of that url to the aggregator.
Step 4) aggregators accumulate the counts
Step 5) initialize an empty top-ten list, iterate through the 10k aggregator machines, and maintain the top-ten list
We have a central server which will talk to each of the computers in the network containing the log files. Let's say there are n such computers.
Let's model each entry in the network of computers containing the log as <url, count, computer_id>, where computer_id is the id of the individual computer.
class Entry {
String URL,
int count,
String computer_id
}
We are going to solve this problem with map-reduce with minimal network data.
Map-step
-------------
Each of the log computers creates a max-heap (keyed on count) of all its URLs.
Reduce Step:
-----------------
The central server can ask a given computer for the top of its heap. Let's call this function
function Entry getTopEntry(computer_id)
The central server can also ask a given computer to get the count for a given URL
function int getCount(computer_id, URL)
The central server has two heaps -
- one max-heap of size n (let's call it current_max_heap) containing one entry (the top) from each log computer. We also store the sum of these entries' counts in a variable and keep it updated.
- one min-heap of size k (let's call it k_min_heap) containing the current top k elements across all servers. At the end of the reduce step, this heap will contain the required k entries.
The idea:
Initialization: We initialize the current_max_heap with the first entry from each of the log computers (by polling their individual max_heaps).
We now take the top entry from current_max_heap and add it to k_min_heap. Before adding, we update the entry's count so that it equals the total count of that URL across all the log computers (obtained by asking every log computer for its count of that URL). Every time an entry is removed from current_max_heap, we fetch the next top entry from the max-heap of the computer that entry came from. We do this k times, so that k_min_heap has k entries.
At any time, if the count of the top (minimum) element of k_min_heap is less than the sum of the counts of the elements currently in current_max_heap, it is still possible that the servers' remaining top URLs are all the same URL and, added up, would beat that entry. So we continue moving elements from current_max_heap to k_min_heap until this condition no longer holds.
Here is the code running on the server:
// declare current_max_heap (max-heap on count);
// declare k_min_heap (min-heap on count);
int current_max_count = 0;
// Initialize current_max_heap with each computer's top entry
for (int i = 0; i < n; i++) {
    Entry entry = getTopEntry(i);
    current_max_heap.add(entry);
    current_max_count += entry.count;
}
do {
    Entry top = current_max_heap.poll();
    Entry new_top = getTopEntry(top.computer_id);
    current_max_heap.add(new_top);
    current_max_count = current_max_count - top.count + new_top.count;
    // replace the local count with the total across all log computers
    // (a production version would also skip URLs already tallied)
    top.count = 0;
    for (int i = 0; i < n; i++) {
        top.count += getCount(i, top.URL);
    }
    if (k_min_heap.size() < k) {
        k_min_heap.add(top);
    } else if (k_min_heap.peek().count < top.count) {
        k_min_heap.poll();
        k_min_heap.add(top);
    }
} while (k_min_heap.size() < k || k_min_heap.peek().count < current_max_count);
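The two-heap scheme above can be mirrored in a runnable Python sketch, with heapq standing in for the heaps and each server's max-heap simulated as a sorted list; the data and helper names are illustrative assumptions:

```python
# Two-heap top-k across servers: a max-heap over the servers' current
# tops drives the scan; a size-k min-heap holds candidate totals. We stop
# once the k-th best total is at least the sum of all remaining tops,
# since no unseen url can exceed that sum.
import heapq

servers = [
    [("u1", 100), ("u2", 96), ("u7", 94)],
    [("u3", 99), ("u5", 97), ("u7", 4)],
    [("u4", 98), ("u6", 95), ("u7", 4)],
]
lookup = [dict(s) for s in servers]

def get_count(i, url):                   # "ask computer i for its count of url"
    return lookup[i].get(url, 0)

k = 2
pos = [0] * len(servers)                 # next unread entry per server
cur_max = []                             # max-heap over servers' tops
for i, s in enumerate(servers):
    url, c = s[0]
    heapq.heappush(cur_max, (-c, url, i))
cur_sum = sum(-c for c, _, _ in cur_max)  # sum of all current tops

top_k = []                               # min-heap of (total, url)
seen = set()                             # avoid tallying a url twice
while cur_max and (len(top_k) < k or top_k[0][0] < cur_sum):
    negc, url, i = heapq.heappop(cur_max)
    cur_sum += negc                      # remove this top from the sum
    pos[i] += 1
    if pos[i] < len(servers[i]):         # refill from server i's "heap"
        nurl, nc = servers[i][pos[i]]
        heapq.heappush(cur_max, (-nc, nurl, i))
        cur_sum += nc
    if url not in seen:
        seen.add(url)
        total = sum(get_count(j, url) for j in range(len(servers)))
        if len(top_k) < k:
            heapq.heappush(top_k, (total, url))
        elif top_k[0][0] < total:
            heapq.heapreplace(top_k, (total, url))

print(sorted(top_k, reverse=True))  # u7's 94+4+4=102 beats everything
```

On the u7 data from earlier in the thread this correctly surfaces u7 (total 102) even though it is last on every individual server.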
Hi. My simple answer is in Swift 3, developed in an Xcode playground. You will notice my urls are simply the domain; you can expand that part. I shortened them because the comments on CareerCup do not allow links at all...
// Imports
import Foundation
// Our Function to input the log files of visited Urls
func determineTopVisitedUrls( vistedUrls : [[String]]) -> [(String, Int)] {
// Set initial Variables
var urlVisits: [(url: String, visits: Int)] = []
var urlFound : Bool = false
var foundIndex : Int = Int()
// Loop thru all files
for file in vistedUrls {
// Loop thru all urls in Log Files
for url in file {
// Search thru existing results before appending to the list...
if urlVisits.count > 0 {
for visitIndex in 0..<urlVisits.count {
if (urlVisits[visitIndex].url == url) {
// Increase List at that URL,
foundIndex = visitIndex
urlVisits[visitIndex].visits += 1
urlFound = true
//print("Working with url : \(url) which was found at \(foundIndex)")
break
}
}
}
// If not found, append to the end...
if urlFound == false {
// Append to the list with a 1
urlVisits.append(( url, 1))
//print("Appending url: \(url) to the array urlVisits(count: \(urlVisits.count))")
// Else, sort the list if necessary
} else {
// print("Found url, increasing visits by +1 to total of \(urlVisits[foundIndex].visits) and foudnIndex: \(foundIndex)")
// loop thru results from found location up only....
while foundIndex > 0 {
// print("foundIndex: \(foundIndex)")
if urlVisits[foundIndex].visits > urlVisits[foundIndex - 1].visits {
// Switch them in the placement
let tempUrlVisit : (url: String, visits: Int) = urlVisits[foundIndex]
urlVisits[foundIndex] = urlVisits[foundIndex - 1]
urlVisits[foundIndex - 1] = tempUrlVisit
} else {
break
}
foundIndex -= 1
}
}
// Reset Values before next search and append
urlFound = false
}
}
// remove everything after topTen
if urlVisits.count > 10 {
// Drop everything after the top ten in one shot (removing by index in a
// forward loop would shift elements and run past the end of the array)
urlVisits.removeSubrange(10..<urlVisits.count)
}
// Return the top Urls, no greater than 10
return urlVisits
}
// Hi, you will need to make the urls longer. The comments can't have links so I couldn't provide the full long url with query strings and such... Please expand the urls at the end and beginning of each logFile
// Create 2 Log Files
var logFile1 : [String] = [ "espn", "youtube", "youtube", "apple", "apple", "mindcraft" ]
var logFile2 : [String] = [ "espn", "apple", "youtube", "youtube", "mindcraft" ]
// Create an Array of the Log Files
var allFiles : [[String]] = [ logFile1, logFile2 ]
// Call our Function for testing
var topTenUrls : [(String, Int)] = determineTopVisitedUrls(vistedUrls: allFiles)
// Print the results of the top Urls
print("The top \(topTenUrls.count) Urls are \(topTenUrls) ")
1) Use distributed caching like memcached
2) Have a central computer which will act as primary server for processing final results.
Add multiple secondary computer as backups incase primary goes down.
3) The central computer cluster maintains a threshold for the count of the most visited urls. We can either put the threshold in memcached or distribute it using an observer pattern, where all registered computers get the updated threshold.
4) Each computer calculates its top 10 urls by visit count that exceed the threshold and updates them in memcached. This can be either scheduled, or triggered when the computers are notified via the observer pattern.
5) The central computer will iterate over all the data cached in memcached and calculates the top ten.
1. The central server asks which computer has the highest top ten and receives it. Highest is defined by the bigger tenth element (the minimum of the top ten).
2. The server asks each computer for its counts of the URLs it just received, and the computers remove those URLs from their lists.
3. The server updates its top ten with the received URLs.
4. Repeat until no computer has a top ten bigger than the server's top ten divided by the number of computers.
The easy solution to this kind of problem is MapReduce. Since MapReduce is not allowed, the alternative is to sort the url -> visits map on each machine independently, in decreasing order of visits. Then each server sends its top 10 (url, visits) pairs to one designated server.
This single server receives the top-10 data from all the others and does a merge to decide on the top 10 sites (an n-way merge of the url-vs-visits lists).
edit: this does not guarantee 100% correctness
The question says we can't use MapReduce directly because the amount of data is too large. However, the overall top 10 sites are *expected* to be in the top 10 of at least one of the computers.
Hence, we can sort the logs of each computer and emit the top 10 of each computer in the mapping phase. Then, the reduce phase aggregates the number of visits for each site. The result will be the top 10 of the aggregate.
This is wrong. Imagine the most frequent terms are
a1={url1:100, url2:99, url3:98, url4:97 },
a2={url5:100, url6:99, url7:98, url4:97 }
Then the top three of the merged list should contain url4 as the top one, though url4 is not in the top three of either original list.
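The a1/a2 counterexample can be checked directly: merging full counts puts url4 first, while merging only each server's local top 3 never sees url4 at all.

```python
# Verify the counterexample: "send each server's local top 3 and merge"
# misses url4, which is actually the global number one.
from collections import Counter

a1 = {"url1": 100, "url2": 99, "url3": 98, "url4": 97}
a2 = {"url5": 100, "url6": 99, "url7": 98, "url4": 97}

# Ground truth: top 3 of the fully merged counts
merged = Counter(a1) + Counter(a2)
true_top3 = [u for u, _ in merged.most_common(3)]
assert true_top3[0] == "url4"       # 97 + 97 = 194 beats every local 100

# The flawed scheme: merge only each server's local top 3
local = Counter()
for server in (a1, a2):
    for u, c in Counter(server).most_common(3):
        local[u] += c
assert "url4" not in local          # url4 never leaves either server
print(true_top3, dict(local))
```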
What if we manage a min-heap of the top 10 sites at each node separately, and a size-10 min-heap in the central unit, so that heap[0] is always the smallest of the current top 10? The central heap can be updated according to:
if (heap[0] < recentlyReceivedData) {
    heap[0] = recentlyReceivedData;
    minHeapify(heap, 0);
}
Then at every pre-specified interval, we can request updates from the nodes, whose responses are the contents of their respective heaps.
We could use a BFS: visit every node and add its integer counts to a master map holding all the URLs, checking whether each URL is already present; if so, just add the integer, or else create a new key and put it into the map. Then traverse the map to find the top ten values, or just sort descending and return the first 10 values.
There are N machines.
Determine the top 10 on every machine.
Each machine transmits its top 10 to every other machine, i.e. (N-1)*10 urls.
This implies each machine also receives (N-1)*10 urls.
Each machine processes these and determines a new top 10.
Each machine now sends its top 10 to a single place (N*10 urls in total).
The top 10 among these N*10 will be the most visited URLs.
I do not know whether there is any way to avoid replicating the data on each machine, but N*(N-1)*10 urls may be too much traffic for the network to handle, since that is the total number of replications required.
Another possible solution I could think of was using a publish-subscribe pattern to publish the counts of URLs periodically.
That's quite similar to the approach I posted before. It does not guarantee 100% correctness: the aggregate top 10 need not be in the top 10 of any individual server.
1. tag the nodes as n1, n2, n3..... nk
2. First n1 sorts its list of URLs to find top10
3. n1 sends this list to n2. n2 adds this list to the data set, and gets a top 10
4. now n2 sends its top10 (calculated in step 3) to n3
5. keep doing it till we reach nk
nk will have the cumulative top 10
Presuming a protocol exists that can ask three questions of each server:
* the score of a single url
* the top 10
* the top N that satisfy score >= a given threshold
(laperla, September 27, 2013)
We program a two pass solution like so:
We denote the number of servers as S.
[First pass]
(1) Ask every server for its own top ten.
(2) Merge the results. For all URLs in the merged set, calculate correct values by asking all servers for their scores for each URL. Calculate a top ten from this sample.
(3) Pick the score of the now-tenth URL as the threshold that we try to beat in the second round. We denote the threshold as T.
[Second pass]
(4) Ask every server for all its top N that satisfy score >= T/S.
(5) Merge these bigger samples again as in step (2).
(6) We now have the correct top ten with correct scores.
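The whole two-pass protocol can be simulated end to end. The sketch below reuses Mithun's corrected u7 example from earlier in the thread, and the three query functions stand in for the protocol's three requests:

```python
# Two-pass top-K: pass one merges the servers' local top Ks and fixes a
# threshold T; pass two collects every url scoring >= T/S somewhere and
# re-tallies exactly.
from collections import Counter

servers = [
    {"u1": 100, "u2": 96, "u7": 94},
    {"u3": 99, "u5": 97, "u7": 4},
    {"u4": 98, "u6": 95, "u7": 4},
]
S, K = len(servers), 2  # top two, as in the worked example above

def score(server, url):                  # protocol request: score of one url
    return server.get(url, 0)

def top_k(server, k):                    # protocol request: local top k
    return [u for u, _ in Counter(server).most_common(k)]

def at_least(server, t):                 # protocol request: score >= t
    return [u for u, c in server.items() if c >= t]

def totals(urls):
    """Steps (2)/(5): ask every server for its score of each url."""
    return Counter({u: sum(score(s, u) for s in servers) for u in urls})

# First pass: steps (1)-(3)
sample = {u for s in servers for u in top_k(s, K)}
first = totals(sample).most_common(K)
T = first[-1][1]                         # score of the K-th url, here 99

# Second pass: steps (4)-(6)
candidates = {u for s in servers for u in at_least(s, T / S)}
result = totals(candidates).most_common(K)
print(result)  # u7 (94+4+4=102) correctly overtakes the first-pass top
```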