Google Interview Question
Software Engineer / Developer
Country: United States
Interview Type: In-Person
I think they would expect a solution that spreads the work more uniformly among the different servers, which is hard with these constraints.
However, I think you have the merit of getting the right result. +1
I am not sure if this is going to work all the time. Please correct me if I am wrong.
X:Y means URL X has a count of Y.
We want the top 2 URLs.
M/C A=> 1:100 2:96 7:98
M/C B=> 3:99 5:97 7:2
M/C C=> 4:98 6:95 7:2
1st Step
A=>1,2
B=>3,5
C=>4,6
2nd Step
Top two after merging
1,3 Urls are selected
3rd Step
Threshold T = 99 (the count of selected URL 3)
4th Step
Cutoff = T/S = 99/3 = 33
A=>1,2,7
B=>3,5
C=>4,6
5th Step
Merging will again give us 1 and 3, when in fact URL 7 has the highest count.
Mithun, I think you're missing a step.
Step (2) says you merge the sets and *request* counts from all servers. This will give you the score of 102 for Url 7.
Mithun, you made a mistake in your example and I think you misunderstood the suggested approach.
M/C A=> 1:100 2:96 7:98 , so 1 and 7 would be the top 2.
Anyway, even if we change it to
M/C A=> 1:100 2:96 7:94
M/C B=> 3:99 5:97 7:4
M/C C=> 4:98 6:95 7:4
Where 7 is the most visited site as you intended, the merging in 5th step implies that "For all URLs in the merged set calculate correct values by asking
all servers for their scores for each URL", so we would get 102 as the count for url 7.
Oh OK, got it. So in the worst case, if all URLs satisfy score >= T/S, all the URLs will be sent to the central server for merging, right?
The approach is correct; the only issue is that it will not scale, and for a large number of machines the network traffic may be gargantuan. Consider an example with the top-ten counts in the range of one million and the number of servers in the 10k range (not a big number at Amazon or Google scale): you will eventually ask for every URL with count >= T/S, which is 100 in this case. So you end up sending far more data than is actually needed, since you will be sending URLs with counts anywhere between 100 and one million. The central node running this algorithm would also be a bottleneck. As I said, the solution is correct but not scalable.
Thanks, vik, for your thoughts on scalability. I think this shows how open-ended the question actually is. Without more knowledge about the topology of the machines, datacenters, loadbalancers, etc. involved, it is not possible to prove the quality and scalability of the algorithm in real life. A few things I would suggest discussing:
- is it a typical weblog file? Typical weblogs have a steep declining start and a long flat tail, so the number of accesses at the tenth rank is usually high: for such a distribution, this algorithm is not bad.
- how biased are the loadbalancers? If the loadbalancers have a high degree of randomness built in, then the differences between the machines are statistically already similar enough that the first pass is quite good and the second pass is not much work.
- can clusters of machines be pre-merged? If racks or lines of racks can have a logserver that collects the full logs for the rack or line, then the number of servers to contact can be drastically reduced.
- how often are these calculations repeated? Which similar questions need to be answered? How often? How exact must the numbers be? How up-to-date? Each answer would influence the program and tuning we want to set up to produce fast and good numbers.
Consider this example. Let's find the top 1 URL for now; we can extrapolate the question to the top 10 URLs too.
3 Servers
Server 1 : URL A-> 2 URL G->1
Server 2 : URL B->2 URL G -> 1
Server 3 : URL C -> 2 URL G -> 1
Wouldn't the above algo give URL A, B or C as the top visited URL, whereas the actual top URL should be G?
Hi Izaaz, in your example the first pass finds the critical threshold T to be 2. The second pass would then calculate T=2 divided by S=3 and ask all servers for all URLs that have a score >= 2/3. In other words, it would merge the complete set of URLs and thus get URL G and the sum of the accesses to it in the second pass.
Sorry if I'm missing something, but I don't think it's going to yield the correct result by selecting the top N, with or without a threshold. Consider the following case:
Server A
Y1 - 11
Y2 - 11
Y3 - 11
Y4 - 10
Y5 - 10
Y6 - 9
Y7 - 9
Y8 - 9
Y9 - 9
Y10 - 9
…
G1 – 4
Server B
M1 - 12
M2 - 12
M3 - 11
M4 - 11
M5 - 10
M6 - 10
M7 - 10
M8 - 10
M9 - 10
M10 - 10
…
G1 - 9
The threshold is T/S = 10 / 2 = 5.
In the second pass, G1 on server A won't be included in the tally. In fact, G1 with 13 total visits could be the top 1, but it does not even get into the top 10 with this method. Am I missing something?
@aka777, G1 is not part of the top of server A but it will be part of the top of server B with visits >= 10 / 2. So, the algorithm will ask for all G1 occurrences in other servers and it will correctly put this as top1.
it will yield the correct result, with possibly the cost of multiple rounds.
Thanks, Miguel. I got it.
Theoretically, the algorithm will yield correct result. But Google has more than one million servers. I don't know how this is gonna work out. (Threshold is gonna be very low like vik said.)
- Can you follow the thought that every URL with fewer than T/S hits on all servers cannot be among the top ten anymore? The rationale: S * (T/S) equals T, so if a URL has slightly less than T/S on every server, all of its hits together sum to less than T.
- If so: what about a URL that has fewer than T/S hits on a single server? Sure, it can still be among the top ten, but only if the same URL has more than T/S hits on a different server. Right?
- If so: we will get hold of such a URL on the other server, where it has more than T/S hits. Either we already have it from the first pass (1), or we pick it up in the second pass (4).
- The exactness of the hit counts is guaranteed in steps (2) and (5), after we have determined candidate URLs: in the first pass just a minimal seed of URLs, in the second pass all URLs that might qualify.
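The S * (T/S) = T argument above can be checked mechanically: a URL scoring below T/S on every one of the S servers cannot total T or more. A minimal check (the numbers are arbitrary):

```python
# Sanity check of the T/S bound: if a url has fewer than T/S hits on
# every one of the S servers, its grand total is below S * (T/S) = T,
# so it cannot displace any of the current top ten.
S, T = 3, 99               # servers, and the tenth-best score from pass one
cutoff = T / S             # 33.0
per_server = [32, 30, 32]  # all strictly below the cutoff
assert all(c < cutoff for c in per_server)
assert sum(per_server) < T  # 94 < 99: provably not a top-ten url
print(sum(per_server), "<", T)
```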
What am I missing here? Isn't the solution as simple as having the master ask each server for its 10th URL and count, computing the top out of those (call it U), and then in a second round asking all servers for all URLs with count greater than that of U and finding the top 10 out of those? Why has such a complicated solution been upvoted so much? I'm honestly asking: am I missing something?
the overall most viewed URL could potentially be the 11th in all servers. So that approach would ignore it.
I understand the question now. Thanks. Got an interview with Google in a week, I better start paying more attention :)
What is the reason/idea behind the second pass? Why are we selecting T/S and probing all servers again?
The constraints puzzle me a bit, especially the "using MapReduce directly, is not allowed" one. I would try to discuss what that means exactly in an interview.
I'll give another shot at the question:
Denote N as the number of computers in our network.
1) Pick a good string hash function. This function must take urls as input and produce hash values uniformly in the range [0, MAX_HASH_VALUE]
2) Divide the hash range [0, MAX_HASH_VALUE] into N intervals with equal length MAX_HASH_VALUE / N, and assign each interval to 1 computer in the network, e.g.
CPU_1) [0, length-1]
CPU_2) [length, 2*length-1]
CPU_3) [2*length, 3*length-1]
...
CPU_N) [(N-1)*length, MAX_HASH_VALUE]
3) Each computer computes the hash values of its list of urls and sends each url with its visit count to the computer responsible for that url's hash.
4) Each computer receives a list of url->visits entries for the urls in its hash interval. Now it combines the visits per url and produces its top 10.
5) Each computer sends its top 10 to a central computer. This central computer will receive 10*N urls and compute the overall top 10.
Due to our good hash function, we expect that each computer will receive roughly the same amount of urls to process in the 4th step. I think this approach is the best we can do to distribute the work among the cluster.
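A minimal simulation of steps 1-5, with the N computers stood in by plain dicts (the hash function, N, and the sample data are illustrative assumptions, not part of the question):

```python
# Sketch of the hash-partitioning scheme above: step 3 routes each url to
# the owner of its hash interval, step 4 combines counts per owner, and
# step 5 merges the owners' local top 10s at a central node.
import hashlib
from collections import Counter

N = 4  # number of computers in the network

def bucket(url: str) -> int:
    """Step 3: map a url to the computer that owns its hash interval."""
    h = int(hashlib.md5(url.encode()).hexdigest(), 16)
    return h % N  # equal-length intervals == modulo for a uniform hash

# Per-computer logs: url -> local visit count (the given input data)
logs = [
    {"a": 2, "g": 1},
    {"b": 2, "g": 1},
    {"c": 2, "g": 1},
    {},
]

# Steps 3-4: every computer forwards (url, count) to the owner; owners combine
owned = [Counter() for _ in range(N)]
for log in logs:
    for url, count in log.items():
        owned[bucket(url)][url] += count

# Step 5: each owner reports its local top 10; the central node merges them.
# Because each url lives on exactly one owner, the merge is exact.
candidates = Counter()
for owner in owned:
    candidates.update(dict(owner.most_common(10)))
top10 = candidates.most_common(10)
print(top10)  # g is correctly counted as 3
```

Note how this recovers the "URL G" counterexample from earlier in the thread: g's three scattered single visits all land on one owner, so it correctly beats a, b and c.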
About the constraints, they're not very clear.
a) This is sending all urls over the network but not to a single computer. In order to produce an exact result, I think all approaches end up sending all urls over the network in the worst case. Again, we would need to discuss with the interviewer if this is ok (the "especially sending all of them to a central server" part).
b) This is similar to MapReduce. I think that by saying "using MapReduce directly is not allowed", the interviewer meant that we have to give a detailed explanation about how the work is distributed among the network of computers, instead of just saying "the MapReduce framework will distribute and combine the work for us".
I think this is the only reasonable solution. The word "especially" in the phrase "especially to a central computer" seems to imply that sending the maps in a distributed manner might be acceptable, which is the only constraint this solution violates. It is the only solution that produces the correct result and is guaranteed (modulo the goodness of the hash function) to not send all the urls to the same server in the worst case.
I think so as well.
Someone -1'd all my replies to this and a few other threads, for some reason, and I guess this reply was kind of forgotten.
As you said, you are sending all of the urls over the network, which I guess the question rules out. It works, but it doesn't conform to the constraints.
This is the best solution I can come up with. It effectively distributes the computation and traffic over the network, and it does not need a central server.
Your solution assumes that each url will exist in exactly one server in the network. The original question addresses the problem where each url can reside on multiple servers, with different visit counts in each of them.
You misunderstood what I wrote. That's the point of steps 3 and 4: equal urls will be sent to the same machine (step 3) and combined there (step 4).
Step 3 makes one major assumption:
Each server needs to know how to talk to ALL the other servers in the network. If we have 1M CPUs in the network, this information is big and also hard to maintain. Something similar happened with domain name servers, and they evolved: earlier, every node knew about every other node; later they evolved to only know about some nodes and established a hierarchy. IMO, it would be fruitful to think in that direction as well.
I don't think any of the suggested solutions are right. It is possible to imagine a situation where some url is ranked number 11 on every box, and so has very high visits overall, while every url in the top 10 on each individual box is seen nowhere else, so has low visits overall.
That said, I don't have any better ideas. This problem is hard!
The first pass gives you a lower bound. In the second pass, you will include that 11th url in your candidate set, since its overall visits is certainly greater than the 10th url found in the first pass.
Anon is correct. The second pass will not help. Take the following situation:
3 servers:
SERVER A:
URL #1 - #10 : 2 counts
URL #0 : 1 count
SERVER B:
URL #11-#20: 2 counts
URL #0: 1 count
SERVER C:
URL #21-#30: 2 counts
URL #0: 1 count
now the first pass will find URLs #1-#30, but URL #0 will be completely missed. Yet it is the winning URL, with 3 counts.
Hence the whole strategy is flawed.
langeolli: in the two-pass solution I described, you missed step (4):
Ask every server for all its top N that satisfy score >= T/S
In your example, T=2 and S=3, so the threshold is < 1 and consequently all three servers will have to deliver their full logfiles and the result will find URL #0 to have 3 hits.
I understand that with the given constraints a trivial solution is not possible, but I think we have to consider the scenario where one URL among the actual top 10 is visited on most of the machines, but only a few times on each, which keeps it out of the individual top-10 lists of every machine.
Does that sound right?
I agree: this example would fail. (Quoting the example above: 3 servers, where URLs A, B and C have 2 visits each on a single server and URL G has 1 visit on every server; the algo would give A, B or C as the top URL, whereas the actual top URL is G.)
You could repeatedly request further pages of each server's top urls, using paginated sets, until you have ten urls and the smallest value in the top-10 list is no longer less than the highest value returned by any server. For example:
page = 1;
max = 10;
highest = Infinity;
while ((top.length < max || top.smallest < highest) && page < maxpages) {
    highest = -1;
    for (server in farm) {
        temp = getTop(server, max, page);
        merge(top, temp, max);
        if (highest < temp.top) {
            highest = temp.top;
        }
    }
    page++;
}
Merge would keep no more than max values based on unique url and highest visits.
I would recursively group the servers into sets of two and aggregate the URL's .
Suppose there are 6 servers.
a) Group s1s2, s3s4, s5s6.
b) We know the entire map cannot be transmitted. Find out a safe message size for the network, say n. Break the map in s1 into chunks of size n and send them over to s2.
Similarly from s3-s4, s5-s6.
c) This is the tricky part. The question says we cannot do map-reduce directly. Does that mean map-reduce on the entire set? Is it allowed for individual machines? It would be silly to solve this without ever getting a count of the URLs, so if map-reduce is not allowed, write a procedure to sort the urls and track each one's count. This is done on s2, s4 and s6.
d) Now, again group the machines. This time {s2s4}, {s6}.
e) Repeat b, c . We will have s4, s6 left. Transmit from s4-s6 and perform a final count.
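The merge tree in steps a-e can be sketched by reducing the servers two at a time until one aggregated count survives; the "safe message size" chunking of step b is elided here, and the data is made up:

```python
# Pairwise merge tree: neighbours s1+s2, s3+s4, s5+s6 combine in each
# round, then the survivors are regrouped and merged again (steps d-e).
from collections import Counter

servers = [
    Counter({"x": 5, "y": 1}), Counter({"x": 4, "z": 2}),
    Counter({"y": 6}),         Counter({"z": 3, "x": 1}),
    Counter({"y": 2}),         Counter({"w": 7}),
]

def merge_round(nodes):
    """One round: merge neighbouring pairs into single aggregated counts."""
    merged = []
    for i in range(0, len(nodes), 2):
        combined = Counter()
        for node in nodes[i:i + 2]:
            combined.update(node)   # "send the chunks over and count"
        merged.append(combined)
    return merged

while len(servers) > 1:             # regroup and repeat until one remains
    servers = merge_round(servers)

print(servers[0].most_common(3))
```

With 6 servers this takes 3 rounds; in general the tree needs about log2(number of servers) rounds, which is the appeal over a single central merge.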
Let's assume we can store counts for k urls on the central machine. If the corpus is of size n, we get the top n/k elements from each server and send them to our central machine.
So, on the central machine we keep a set of k counters, each initialized to 0. As we get data from the stream, for every url that already has a counter, we increment it. If not:
case 1: if there is room on the central machine to add the url, we add it and set its count;
case 2: if not, we decrement the count of every url we have on the central machine, and delete any url whose count drops to 0.
Now, the moment we hit case 2 above, we record the max_count_so_far and take a snapshot of the top 10 elements.
We process the next set of top n/k elements from the machines and for every max_count_so_far elements we take a snapshot of the top 10 elements.
At some point, say after we have 10 such snapshots, we find the final top 10 elements from the snapshots we have so far
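The k-counter scheme described above is essentially the Misra-Gries frequent-items algorithm. A minimal sketch (the stream below is made up):

```python
# Misra-Gries sketch of the k-counter scheme: keep at most k counters;
# increment a tracked url, start tracking an untracked one if there is
# room, otherwise decrement every counter and evict the zeros. Any url
# with frequency > len(stream)/k is guaranteed to survive, but exact
# counts still need a verification pass afterwards.
def misra_gries(stream, k):
    counters = {}
    for url in stream:
        if url in counters:
            counters[url] += 1           # tracked: just increment
        elif len(counters) < k:
            counters[url] = 1            # case 1: room left, start tracking
        else:
            for u in list(counters):     # case 2: decrement all counters
                counters[u] -= 1
                if counters[u] == 0:
                    del counters[u]      # evict urls that reach zero
    return counters

stream = ["a"] * 6 + ["b"] * 4 + ["c", "d", "e"]
result = misra_gries(stream, k=3)
print(result)  # the heavy hitters "a" and "b" are retained
```

The retained counts are lower bounds, not true frequencies, which is why the snapshot/verification idea above is still needed.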
On each server, we sort the urls based on their frequencies.
Say the total number of log file lines across all servers is N, the number of servers is s, and the capacity of a server is k. Now split the ranked urls into groups of k/s elements each on every server. Label each group; there will be N/(k/s) ids in total.
Now, from this set of ids, we randomly select s ids (i.e. s groups) such that each id doesn't occur more than a threshold number of times on one machine. (To keep it simple, let's say threshold = 1.)
Now, we employ the following algorithm on the central machine:
If there is size in our central machine to add the new group, we add it and set the count of each of the elems in the group. If not, we delete the count of every group/element in the group. We delete all urls that have a count of 0.
Now, one might argue that we might end up spending too much time on low-frequency entries. We could then employ multiple iterations of elimination: in the first pass, only consider frequencies above a certain threshold, say above the median of the frequencies on each server, divide those urls into groups, and consider a random group from each server.
Have a centralized counter. Every time a new hit is recorded, recalculate the centralized count for that url with a properly synced data structure, then compare it with the top-10 stack, popping out the entries smaller than the current count. Keep updating the top-10 stack on every new hit, and you can always query it for the top 10.
Let's say the number of servers is N. I think the solution requires 10 passes among the top ten scorers of each server. In each pass you can only identify one of the top ten. After each pass, the URL selected in the previous pass must be excluded from evaluation in the subsequent passes, and the top ten scores from each server must be updated.
Think of the first pass. When all the top ten lists are considered, there must be at least one URL among these top tens which will land into the "real" top ten list. It is possible to generate scenarios where 9 of the real top ten list do not appear in the "current" top ten lists. But, there has to be at least one URL in the real top ten list which also appears in the 10N URLs that are collected from the servers' current top ten lists. Note that this is true only when N>=10. If N<10, all the real top ten list may not appear in the 10N URLs.
Also note that after collecting the 10N URLs, for some of them, you will have to ask the servers their frequencies so that you can sort the 10N URLs properly. Because, you want to sort the sum of visit frequencies of these URLs at each server.
After 10 such passes the real top ten list will emerge. The messaging complexity of each pass could be as bad as N^2 since you may have to collect the frequency values for each one of the 10N URLs. But, the computational complexity of each step is O(N) since you only need to find the max of 10N URLs.
If we were to design an algorithm that finds not just the top 10 but the top K list, we would have to collect the top N scorers from each server. So, rather than 10N, we would bring together N^2 entries in each pass. If K>N, this design is preferable.
Here's a counter example where the top 3 does not appear in the top 3 of any of the 3 servers:
s1= { a:10, b:9, c:8, d:7, e: 6, f: 5 },
s2= { g:10, h:9, i:8, d:7, e: 6, f: 5 },
s3= { j:10, k:9, l:8, d:7, e: 6, f: 5 }
the top 3 is {d: 21, e: 18, f: 15 }, so that approach also does not guarantee 100% correctness
Good example..
In the first pass, it will find a (or g or j). In the second pass, d will be in the first three of s1 since a is kicked out of s1's top three. So, the second pass will find d. The third pass will, in turn, find e. I can see, however, that f will not be included in the top three list after three passes.
We can update the algorithm as follows:
After a pass, if a new fellow is selected that places into an unexpected spot (i.e., item found in the i th pass places into a spot < i th place in the top items list), then we increase the total number of passes by one.
This should do it..
If not, I think making N^2 passes should do it.
Actually, even top N^2 may not do it.. :) Here is one more update to the algorithm I proposed:
Find the max value among all the lists: This should be easy to do. Findmax for all lists. Let's say that the max value is M.
Then, apply the iterative algorithm (with the extra check I described in my prev post) among the top lists of each server such that the top lists include all the values greater than or equal to M/N.
We partition the work by using a hash on the url. The hash gives the address of an aggregator that is responsible for counting a subset (as defined by the hash) of urls. Then we crawl through the aggregators and fill the top-ten list. As each url is mapped to exactly one aggregator, the crawler only has to maintain a map with ten entries.
Step 1) send table with aggregator addresses: say we use 10k aggregators
Step 2) on individual machine count all urls
Step 3) on each individual machine, iterate through the counted urls; the hash key modulo 10k gives you the address of the responsible aggregator -> send the count of that url to the aggregator.
Step 4) aggregators accumulate the counts
Step 5) initialize an empty top-ten list, iterate through the 10k aggregator machines, and maintain the top-ten list
We have a central server which will talk to each of the computers in the network containing the log files. Let's say there are n such computers.
Let's model each entry in the network of computers containing the log as <url, count, computer_id>, where computer_id is the id of the individual computer.
class Entry {
String URL,
int count,
String computer_id
}
We are going to solve this problem with map-reduce with minimal network data.
Map-step
-------------
Each of the log computers creates a max-heap (keyed on count) of all its URLs.
Reduce Step:
-----------------
The central server can ask a given computer for the top of its heap. Let's call this function
function Entry getTopEntry(computer_id)
The central server can also ask a given computer to get the count for a given URL
function int getCount(computer_id, URL)
The central server has two heaps -
- one max-heap of size n (let's call it current_max_heap) containing one entry (the top) from each log computer. We also store the sum of these entries' counts in a variable and keep it updated.
- one min-heap of size k (let's call it k_min_heap) containing the current top k elements across all servers. At the end of the reduce step, this heap will contain the required k entries.
The idea:
Initialization: We initialize the current_max_heap with the first entry from each of the log computers (by polling their individual max_heaps).
We now take the top entry from current_max_heap and add it to k_min_heap. Before adding, we update the entry's count so that it equals the total count of that URL across all the log computers (obtained by asking every log computer for its count of that URL). Every time an entry is removed from current_max_heap, we fetch the next top entry from the max-heap of the computer that entry came from. We do this k times, so that k_min_heap has k entries.
At any time, if the count of the top (minimum) element of k_min_heap is less than the sum of the counts of the elements currently in current_max_heap, it is still possible that the servers' remaining top URLs are all the same URL and, added up, would beat that entry. So we continue moving elements from current_max_heap to k_min_heap until this condition no longer holds.
Here is the code running on the server:
// declare current_max_heap (max-heap on count);
// declare k_min_heap (min-heap on count);
int current_max_count = 0;
// Initialize current_max_heap with each computer's top entry
for (int i = 0; i < n; i++) {
    Entry entry = getTopEntry(i);
    current_max_heap.add(entry);
    current_max_count += entry.count;
}
do {
    Entry top = current_max_heap.poll();
    Entry new_top = getTopEntry(top.computer_id);
    current_max_heap.add(new_top);
    current_max_count = current_max_count - top.count + new_top.count;
    // replace the local count with the total across all log computers
    // (a production version would also skip URLs already tallied)
    top.count = 0;
    for (int i = 0; i < n; i++) {
        top.count += getCount(i, top.URL);
    }
    if (k_min_heap.size() < k) {
        k_min_heap.add(top);
    } else if (k_min_heap.peek().count < top.count) {
        k_min_heap.poll();
        k_min_heap.add(top);
    }
} while (k_min_heap.size() < k || k_min_heap.peek().count < current_max_count);
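The two-heap scheme above can be mirrored in a runnable Python sketch, with heapq standing in for the heaps and each server's max-heap simulated as a sorted list; the data and helper names are illustrative assumptions:

```python
# Two-heap top-k across servers: a max-heap over the servers' current
# tops drives the scan; a size-k min-heap holds candidate totals. We stop
# once the k-th best total is at least the sum of all remaining tops,
# since no unseen url can exceed that sum.
import heapq

servers = [
    [("u1", 100), ("u2", 96), ("u7", 94)],
    [("u3", 99), ("u5", 97), ("u7", 4)],
    [("u4", 98), ("u6", 95), ("u7", 4)],
]
lookup = [dict(s) for s in servers]

def get_count(i, url):                   # "ask computer i for its count of url"
    return lookup[i].get(url, 0)

k = 2
pos = [0] * len(servers)                 # next unread entry per server
cur_max = []                             # max-heap over servers' tops
for i, s in enumerate(servers):
    url, c = s[0]
    heapq.heappush(cur_max, (-c, url, i))
cur_sum = sum(-c for c, _, _ in cur_max)  # sum of all current tops

top_k = []                               # min-heap of (total, url)
seen = set()                             # avoid tallying a url twice
while cur_max and (len(top_k) < k or top_k[0][0] < cur_sum):
    negc, url, i = heapq.heappop(cur_max)
    cur_sum += negc                      # remove this top from the sum
    pos[i] += 1
    if pos[i] < len(servers[i]):         # refill from server i's "heap"
        nurl, nc = servers[i][pos[i]]
        heapq.heappush(cur_max, (-nc, nurl, i))
        cur_sum += nc
    if url not in seen:
        seen.add(url)
        total = sum(get_count(j, url) for j in range(len(servers)))
        if len(top_k) < k:
            heapq.heappush(top_k, (total, url))
        elif top_k[0][0] < total:
            heapq.heapreplace(top_k, (total, url))

print(sorted(top_k, reverse=True))  # u7's 94+4+4=102 beats everything
```

On the u7 data from earlier in the thread this correctly surfaces u7 (total 102) even though it is last on every individual server.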
Hi. My simple answer is in Swift 3, developed in an Xcode playground. You will notice my urls are simply the domain; you can expand that part. I shortened them because the comments on CareerCup do not allow links at all...
// Imports
import Foundation
// Our Function to input the log files of visited Urls
func determineTopVisitedUrls( vistedUrls : [[String]]) -> [(String, Int)] {
// Set initial Variables
var urlVisits: [(url: String, visits: Int)] = []
var urlFound : Bool = false
var foundIndex : Int = Int()
// Loop thru all files
for file in vistedUrls {
// Loop thru all urls in Log Files
for url in file {
// Search thru existing results before appending to the list...
if urlVisits.count > 0 {
for visitIndex in 0..<urlVisits.count {
if (urlVisits[visitIndex].url == url) {
// Increase List at that URL,
foundIndex = visitIndex
urlVisits[visitIndex].visits += 1
urlFound = true
//print("Working with url : \(url) which was found at \(foundIndex)")
break
}
}
}
// If not found, append to the end...
if urlFound == false {
// Append to the list with a 1
urlVisits.append(( url, 1))
//print("Appending url: \(url) to the array urlVisits(count: \(urlVisits.count))")
// Else, sort the list if necessary
} else {
// print("Found url, increasing visits by +1 to total of \(urlVisits[foundIndex].visits) and foudnIndex: \(foundIndex)")
// loop thru results from found location up only....
while foundIndex > 0 {
// print("foundIndex: \(foundIndex)")
if urlVisits[foundIndex].visits > urlVisits[foundIndex - 1].visits {
// Switch them in the placement
let tempUrlVisit : (url: String, visits: Int) = urlVisits[foundIndex]
urlVisits[foundIndex] = urlVisits[foundIndex - 1]
urlVisits[foundIndex - 1] = tempUrlVisit
} else {
break
}
foundIndex -= 1
}
}
// Reset Values before next search and append
urlFound = false
}
}
// remove everything after topTen
if urlVisits.count > 10 {
// Drop everything after the top ten in one shot (removing by index in a
// forward loop would shift elements and run past the end of the array)
urlVisits.removeSubrange(10..<urlVisits.count)
}
// Return the top Urls, no greater than 10
return urlVisits
}
// Hi, you will need to make the urls longer. The comments can't have links so I couldn't provide the full long url with query strings and such... Please expand the urls at the end and beginning of each logFile
// Create 2 Log Files
var logFile1 : [String] = [ "espn", "youtube", "youtube", "apple", "apple", "mindcraft" ]
var logFile2 : [String] = [ "espn", "apple", "youtube", "youtube", "mindcraft" ]
// Create an Array of the Log Files
var allFiles : [[String]] = [ logFile1, logFile2 ]
// Call our Function for testing
var topTenUrls : [(String, Int)] = determineTopVisitedUrls(vistedUrls: allFiles)
// Print the results of the top Urls
print("The top \(topTenUrls.count) Urls are \(topTenUrls) ")
1) Use distributed caching like memcached
2) Have a central computer which will act as primary server for processing final results.
Add multiple secondary computer as backups incase primary goes down.
3) The central computer cluster maintains a threshold for the count of the most visited urls. We can either put the threshold in memcached or distribute it using an observer pattern, where all registered computers get the updated threshold.
4) Each computer calculates its top 10 urls by visit count that exceed the threshold and updates them in memcached. This can be either scheduled, or triggered when the computers are notified via the observer pattern.
5) The central computer will iterate over all the data cached in memcached and calculates the top ten.
1. The central server asks which computer has the highest top ten and receives it. Highest is defined by the bigger tenth element (the minimum of the top ten).
2. The server asks each computer for its counts of the URLs it just received, and the computers remove those URLs from their lists.
3. The server updates its top ten with the received URLs.
4. Repeat until no computer has a top ten bigger than the server's top ten divided by the number of computers.
The easy solution to this kind of problem is MapReduce. Since MapReduce is not allowed, the alternative is to sort the url -> visits map on each machine independently, in decreasing order of visits. Then each server sends its top 10 (url, visits) pairs to one designated server.
This single server receives the top-10 data from all the others and does a merge to decide on the top 10 sites (an n-way merge of the url-vs-visits lists).
edit: this does not guarantee 100% correctness
The question says we can't use MapReduce directly because the amount of data is too large. However, the overall top 10 sites are *expected* to be in the top 10 of at least one of the computers.
Hence, we can sort the logs of each computer and emit the top 10 of each computer in the mapping phase. Then, the reduce phase aggregates the number of visits for each site. The result will be the top 10 of the aggregate.
This is wrong. Imagine the most frequent terms are
a1={url1:100, url2:99, url3:98, url4:97 },
a2={url5:100, url6:99, url7:98, url4:97 }
Then the top three of the merged list should contain url4 as the top one, though url4 is not in the top three of either original list.
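The a1/a2 counterexample can be checked directly: merging full counts puts url4 first, while merging only each server's local top 3 never sees url4 at all.

```python
# Verify the counterexample: "send each server's local top 3 and merge"
# misses url4, which is actually the global number one.
from collections import Counter

a1 = {"url1": 100, "url2": 99, "url3": 98, "url4": 97}
a2 = {"url5": 100, "url6": 99, "url7": 98, "url4": 97}

# Ground truth: top 3 of the fully merged counts
merged = Counter(a1) + Counter(a2)
true_top3 = [u for u, _ in merged.most_common(3)]
assert true_top3[0] == "url4"       # 97 + 97 = 194 beats every local 100

# The flawed scheme: merge only each server's local top 3
local = Counter()
for server in (a1, a2):
    for u, c in Counter(server).most_common(3):
        local[u] += c
assert "url4" not in local          # url4 never leaves either server
print(true_top3, dict(local))
```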
What if we manage a min-heap of the top 10 sites at each node separately, and a size-10 min-heap in the central unit, so that heap[0] is always the smallest of the current top 10? The central heap can be updated according to:
if (heap[0] < recentlyReceivedData) {
    heap[0] = recentlyReceivedData;
    minHeapify(heap, 0);
}
Then at every pre-specified interval, we can request updates from the nodes, whose responses are the contents of their respective heaps.
We could use a BFS: visit every node and add its integer counts to a master map holding all the URLs, checking whether each URL is already present; if so, just add the integer, or else create a new key and put it into the map. Then traverse the map to find the top ten values, or just sort descending and return the first 10 values.
There are N machines.
Determine the top 10 on every machine.
Each machine transmits its top 10 to every other machine, i.e. (N-1)*10 urls.
This implies each machine also receives (N-1)*10 urls.
Each machine processes these and determines a new top 10.
Each machine now sends its top 10 to a single place (N*10 urls in total).
The top 10 among these N*10 will be the most visited URLs.
I do not know whether there is any way to avoid replicating the data on each machine, but N*(N-1)*10 urls may be too much traffic for the network to handle, since that is the total number of replications required.
Another possible solution I could think of was using a publish-subscribe pattern to publish the counts of URLs periodically.
That's quite similar to the approach I posted before. It does not guarantee 100% correctness: the aggregate top 10 need not be in the top 10 of any individual server.
1. tag the nodes as n1, n2, n3..... nk
2. First n1 sorts its list of URLs to find top10
3. n1 sends this list to n2. n2 adds this list to the data set, and gets a top 10
4. now n2 sends its top10 (calculated in step 3) to n3
5. keep doing it till we reach nk
nk will have the cumulative top 10
Presuming a protocol exists that can ask three questions of each server:
* the score of a single url
* the top 10
* the top N that satisfy score >= a given threshold
(laperla, September 27, 2013)
We program a two pass solution like so:
We denote the number of servers as S.
[First pass]
(1) Ask every server for its own top ten.
(2) Merge the results. For all URLs in the merged set, calculate correct values by asking all servers for their scores for each URL. Calculate a top ten from this sample.
(3) Pick the score of the now-tenth URL as the threshold that we try to beat in the second round. We denote the threshold as T.
[Second pass]
(4) Ask every server for all its top N that satisfy score >= T/S.
(5) Merge these bigger samples again as in step (2).
(6) We now have the correct top ten with correct scores.
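The whole two-pass protocol can be simulated end to end. The sketch below reuses Mithun's corrected u7 example from earlier in the thread, and the three query functions stand in for the protocol's three requests:

```python
# Two-pass top-K: pass one merges the servers' local top Ks and fixes a
# threshold T; pass two collects every url scoring >= T/S somewhere and
# re-tallies exactly.
from collections import Counter

servers = [
    {"u1": 100, "u2": 96, "u7": 94},
    {"u3": 99, "u5": 97, "u7": 4},
    {"u4": 98, "u6": 95, "u7": 4},
]
S, K = len(servers), 2  # top two, as in the worked example above

def score(server, url):                  # protocol request: score of one url
    return server.get(url, 0)

def top_k(server, k):                    # protocol request: local top k
    return [u for u, _ in Counter(server).most_common(k)]

def at_least(server, t):                 # protocol request: score >= t
    return [u for u, c in server.items() if c >= t]

def totals(urls):
    """Steps (2)/(5): ask every server for its score of each url."""
    return Counter({u: sum(score(s, u) for s in servers) for u in urls})

# First pass: steps (1)-(3)
sample = {u for s in servers for u in top_k(s, K)}
first = totals(sample).most_common(K)
T = first[-1][1]                         # score of the K-th url, here 99

# Second pass: steps (4)-(6)
candidates = {u for s in servers for u in at_least(s, T / S)}
result = totals(candidates).most_common(K)
print(result)  # u7 (94+4+4=102) correctly overtakes the first-pass top
```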