I was asked in an interview: Y

Google Interview Question for Interns

1

of 1 vote

10
Answers
I was asked in an interview: You are given a dump file of IPv4 addresses. You are to find 4 most common occurring subnets. Lets say an IP address if of type a.b.c.d you have to find most common occurring four subnets of the form,

a.*.*.*

a.b.*.*

a.b.c.*

a.b.c.d

Here * matches anything.

My first solution was build an in memory hashtable. Given an IP address a.b.c.d split it as ["a","b","c","d"] and add "a", "a.b", "a.b.c", "a.b.c.d" to the hash table and count it. [There are optimizations possible like considering the entire IP address as a 32 bit unsigned integer and count it with masks and shifts]

Then the question got extended: "assume you can never hold everything in memory, how would you solve it?" Now, the very first solution that I could say was to do an external sort and then count it.

The next solution I gave was to split the IP addresses into buckets. The algorithm was,
```
while there is an IP
    		IP <- an IP address
    		a <- first quadruple
    		push IP to bucket[a]
```
The bucket which has maximum elements would give me the a.*.*.* solution. Now take each bucket and do the same. Even though this might give the correct result, in worst case I might end up having 255^4 buckets.

This is indeed an open ended question with more than one correct answer. What would be the best way to solve this?
- miscanon July 15, 2016 in United States | Report Duplicate | Flag | PURGE
Google Intern Algorithm

Email me when people comment.

An error occurred in subscribing you.

Country: United States
Interview Type: In-Person

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 1 vote

Use a Trie and have a integer counter in every trie node.
Increment the counter whenever there is a new element is added to the trie.

- Partha July 18, 2016 | Flag Reply

Comment hidden because of low score. Click to expand.

of 1 vote

Sweep through the file build a 32 bit IP address from each line and AND with each mask if result is equal to mask then increment the counter for that mask O(n) time. Example searches

def advance_counters(counters, masks, fd):
    for ip in fd:
        ipbin = 0
        i = 0
        for octet in ip.strip().split('.'):
            i += 1
            ipbin |= int(octet.strip())
            if i < 4:
                ipbin <<= 8
        mask_idx = 0
        for mask in masks:
            if (ipbin & mask) == mask:
                counters[mask_idx] += 1
            mask_idx += 1

def count_subnets(filename):
    fd = open(filename, 'r')
    counters = [0, 0, 0, 0]
    masks = [198 << 24, (192 << 24 | 168 << 16), (128 << 24 | 30 << 16 | 40 << 8), (127 << 24 | 1)]
    advance_counters(counters, masks, fd)
    fd.close()
    return counters

- foxtrot July 19, 2016 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

I believe there is a bug in your addIPAddress. What happens if the node you are currently traversing already has an entry for that prefix? You will overwrite it.

The fix:

for(String s:sp)
		{
			int k=Integer.parseInt(s);
			if (v[k-1] != null) {
				v[k-1]=new Node();
			}
			counts[k-1]++;
			v=v[k-1];
		}

- Quinn P July 18, 2016 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

/* Few clarifying question crossed by mind on reading this and I think the
first question that needs to be asked in this case is

a. how big is the file of ip addresses. Depending on the size it might be
just easier to build few hash tables with key types the first 8 bytes, or
first 16 bytes or first 24 bytes or first 32 bytes. but if the file sizes
are too big and with a possiblity that the size of 4th hash table may grow
to 4b entries and with a pay load of 4 bytes for each record might get to
16GB in size.
b. how often do we have to find these counts and if it is
cumulative as in previous results need to be merged with new ones we might
want to store the results of the previous calcuation. another implication
of how often is to be able to do this really fast. so may be we need to
divide and conquer here with work being dispatched to multiple machines
where each one does its part and sends set of top results. Or we could
dedicate each machine to store certain range of IP addresses and as
different machine read the big file split into pieces send the corresponding
information to the corresponding partition to increment their counters and
then they can send their top n results of each category and the final
result can be obtained.
c. it could also be that this computation needs to be done for a sliding
window in a batch mode. for example we need to find these counts for the
ip address files generated for the last 30 days and then from then on add
one file remove the oldest file. if that is the case then if feasible do
the computation for these 30 day worth of computation every time the count
is needed. if recomputing 30days worth of ip addresses is not an option
then we are better off keeping counts for all the IP addresses so that
the delta operation is much cheaper.

From coding perspective, however I will attempt the trie based solution
where i build and store entries only for different IP address and their
counts for the prefixes only if present in the IP address file. attempting
to code this option will help me refresh my data structure concepts as well*/

- Any July 21, 2016 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Trie might be slightly faster and take less memory, but a bit harder to code. But I think the hash table approach is good too. I agree with the external sort, it's a good suggestion. One thing that you missed was using some sort of a distributed system, eg. hadoop or spark to do the counting. This might be applicable if the amount of data is very large or the system is set up in such a way that handling the data with hadoop or spark or easy. I'm not sure how important it is, but generally any company of decent size will handle this kinda stuff on distributed systems since there's simply too much data, but they might not expect the answer from someone with no big data experience. So... who knows.

- Anonymous July 21, 2016 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

import java.util.*;

public class IPTree {

    public class Subnet {
	int count;
	Subnet[] subnets;
	int value;
	Subnet parent;

	public Subnet() {
	    count = 0;
	    subnets = new Subnet[256];
	    parent = null;
	    value = -1;
	}

	public Subnet(int value) {
	    count = 0;
	    subnets = new Subnet[256];
	    parent = null;
	    this.value = value;
	}
    }
    
    Subnet root = new Subnet();
    
    public void insertIP(String ip) {
	String[] parts = ip.split("\\.");
	Subnet curr = root;
	for (int i = 0; i < 4; i++) {
	    int value = Integer.parseInt(parts[i]);
	    if (curr.subnets[value] == null) {
		curr.subnets[value] = new Subnet(value);
		curr.subnets[value].parent = curr;
	    }
	    curr = curr.subnets[value];
	    curr.count++;
	}
    }

    public void getMostCommonSubnets() {
	Queue<Subnet> q = new LinkedList<Subnet>();
	q.add(root);
	q.add(null);
	int maxFreq = -1;
	Subnet common = null;
	while(!q.isEmpty()) {
	    Subnet curr = q.poll();
	    if (curr != null) { 
		for (int i = 0; i < curr.subnets.length; i++) {
		    Subnet s = curr.subnets[i];
		    if (s != null) {
			if (s.count > maxFreq) {
			    maxFreq = s.count;
			    common = s;
			}
			q.add(s);
		    }
		}
	    } else {
		if (!q.isEmpty())
		    q.add(null);
		StringBuilder ip = new StringBuilder();
		while(common != root && common != null) {
		    ip.insert(0, ".").insert(0, common.value);
		    common = common.parent;
		}
		if (maxFreq > 0) {
		    System.out.println("Most common subnet = " + ip
				       + " with freq = " + maxFreq);
		}
		maxFreq = -1;
		common = null;
	    }
	}
    }

    public static void main(String[] args) {
	String[] ips = {"1.0.1.2", "2.3.1.0", "3.4.5.0", "4.5.6.7",
			"1.1.2.3", "2.3.2.0", "3.4.5.1", "4.5.6.7",
			"1.2.3.4", "2.3.3.0", "3.4.5.2", "4.5.6.7",
			"1.3.4.5", "2.3.4.0", "3.4.5.3", "0.1.2.3",
			"1.4.5.6", "2.3.5.0", "5.2.3.4", "6.1.2.3",
			"1.5.6.7", "7.1.2.3", "8.1.2.3", "9.1.2.3"};
	IPTree tree = new IPTree();
	for (String ip : ips) {
	    tree.insertIP(ip);
	}
	tree.getMostCommonSubnets();
    }

}

- Dab August 26, 2016 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Hi Guys, It says we can never hold everything in the memory. So it is a memory problem. Instead of bringing the data to the system, we send the program to the data. It seems like a big data problem. We might have to use map reduce.

- Baradwaj Aryasomayajula September 08, 2016 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

It seems like a big data problem.

- Baradwaj Aryasomayajula September 08, 2016 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

what is a subnet? Is he refering to the class a,b,c,d,(e) networks?
If so, you can determine the class of the network by the ip address.
This would reduce data, since you are not required to count the hosts within the network
(there are roughly ~2.1 Mio class a+b+c+d networks). If he want to identify potential
subnets or even hosts (belongig to a botnet?) inside the a-b-c-d networks he will not get
what he wants by spliting on 8.bit boundaries but we need to guess a subnet everywhere.
for a given IP, ignoring class a,b,c,d, there are *theoretically* 32 subnets (from single machine
to whole IP v4 range)

To simplify I would create an integer, lets call it subnet-id with the subnet-size and
the remaining subnet id:
1st bit set: remaining 32 bit contain the ip
1st bit 0, 2nd bit 1: remaining 31 bits will contain the subnet (actually that does not exist in practice)
1st, 2nd 0, 3rd 1: remaining 30 bits will contain the subnet id...

With 33 bits all potential subnets are covered, each ip will create 32 such subnet-ids.

To actually count, I would create a self balancing tree in memory to count frequencies
by above subnet-id. If memory gets full, write id:frequency into a file, ordered by id
(thus no HT, whereas a HT could be used but before writing I have to sort it in-place
(no more memory) which is not a standard HT use case ... it could be done by a custom
HT that uses open addressing... which is prefered over a tree)

Continue untill all has been processed and we have m sorted files. Merge them by key.
Traverse final file to find 4 highest values.

Can be implemented using map-reduce to scale to many machines.

Data can be appended if more logfiles come without the need to re-process prior .processed
data.

- Chris October 10, 2016 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Sort the dump file using constant memory off-heap merge sort.

Then do a max scan for class A B C D simultaneously.

- Anonymous October 15, 2016 | Flag Reply

Comment hidden because of low score. Click to expand.

-1

of 1 vote

//Read n lines (whee n is the maximum amount of lines that can fit into memory) at a time. Once line 0 has been processed read the n+1th line.For each line, add the
ip address into a trie as shown below.

//Node in trie.
public class Node{
	int[] counts;//Stores frequency of each value. assumes that max value is the   maximum value that can fit into a 32 bit integer.
	Node[] arr;
	
	public Node()
	{
		arr=new Node[Integer.MAX_VALUE];
		counts=new int[arr.length];
	}
}

//Adds a single IP address to the trie whose root is rt.
public void addIpAddress(String s,Node rt)
{
	if(s!=null && s.length()!=0)
	{
		String[] sp=s.split('.');
		Node v=rt;
		for(String s:sp)
		{
			int k=Integer.parseInt(s);
			v[k-1]=new Node();
			counts[k-1]++;
			v=v[k-1];
		}
	}
}

//Extracts the four most common subnets.
public ArrayList<String> getTopFour(Node rt)
{
	ArrayList<String> ls=new ArrayList<String>();
	int[] ips=new int[4];
	int ipIdx=0;
	for(int i=0;i<4;i++)
	{
		
		int maxCount=0;
		int idx=-1;
		for(int j=0;j<rt.arr.length;j++)
		{
			if(counts[j]>maxCount)
			{
				maxCount=counts[j];
				idx=j;
			}
		}
		ips[ipIdx++]=idx+1;
		rt=rt.arr[idx];
		
		
	}
	ls.add(""+ips[0]+".*.*.*");
	ls.add(""+ips[0]+"."+ips[1]+".*.*");
	ls.add(""+ips[0]+"."+ips[1]+"."+ips[2]+".*");
	ls.add(""+ips[0]+"." +ips[1]+"."+ips[2]+"."+ips[3]);
	return ls;
	
}

- divm01986 July 15, 2016 | Flag Reply

CareerCup

Google Interview Question for Interns

Books

Videos

Resume Review

Mock Interviews