Google Interview Question
InternsCountry: United States
Interview Type: In-Person
Sweep through the file build a 32 bit IP address from each line and AND with each mask if result is equal to mask then increment the counter for that mask O(n) time. Example searches
def advance_counters(counters, masks, fd):
for ip in fd:
ipbin = 0
i = 0
for octet in ip.strip().split('.'):
i += 1
ipbin |= int(octet.strip())
if i < 4:
ipbin <<= 8
mask_idx = 0
for mask in masks:
if (ipbin & mask) == mask:
counters[mask_idx] += 1
mask_idx += 1
def count_subnets(filename):
fd = open(filename, 'r')
counters = [0, 0, 0, 0]
masks = [198 << 24, (192 << 24 | 168 << 16), (128 << 24 | 30 << 16 | 40 << 8), (127 << 24 | 1)]
advance_counters(counters, masks, fd)
fd.close()
return counters
/* Few clarifying question crossed by mind on reading this and I think the
first question that needs to be asked in this case is
a. how big is the file of ip addresses. Depending on the size it might be
just easier to build few hash tables with key types the first 8 bytes, or
first 16 bytes or first 24 bytes or first 32 bytes. but if the file sizes
are too big and with a possiblity that the size of 4th hash table may grow
to 4b entries and with a pay load of 4 bytes for each record might get to
16GB in size.
b. how often do we have to find these counts and if it is
cumulative as in previous results need to be merged with new ones we might
want to store the results of the previous calcuation. another implication
of how often is to be able to do this really fast. so may be we need to
divide and conquer here with work being dispatched to multiple machines
where each one does its part and sends set of top results. Or we could
dedicate each machine to store certain range of IP addresses and as
different machine read the big file split into pieces send the corresponding
information to the corresponding partition to increment their counters and
then they can send their top n results of each category and the final
result can be obtained.
c. it could also be that this computation needs to be done for a sliding
window in a batch mode. for example we need to find these counts for the
ip address files generated for the last 30 days and then from then on add
one file remove the oldest file. if that is the case then if feasible do
the computation for these 30 day worth of computation every time the count
is needed. if recomputing 30days worth of ip addresses is not an option
then we are better off keeping counts for all the IP addresses so that
the delta operation is much cheaper.
From coding perspective, however I will attempt the trie based solution
where i build and store entries only for different IP address and their
counts for the prefixes only if present in the IP address file. attempting
to code this option will help me refresh my data structure concepts as well*/
Trie might be slightly faster and take less memory, but a bit harder to code. But I think the hash table approach is good too. I agree with the external sort, it's a good suggestion. One thing that you missed was using some sort of a distributed system, eg. hadoop or spark to do the counting. This might be applicable if the amount of data is very large or the system is set up in such a way that handling the data with hadoop or spark or easy. I'm not sure how important it is, but generally any company of decent size will handle this kinda stuff on distributed systems since there's simply too much data, but they might not expect the answer from someone with no big data experience. So... who knows.
import java.util.*;
public class IPTree {
public class Subnet {
int count;
Subnet[] subnets;
int value;
Subnet parent;
public Subnet() {
count = 0;
subnets = new Subnet[256];
parent = null;
value = -1;
}
public Subnet(int value) {
count = 0;
subnets = new Subnet[256];
parent = null;
this.value = value;
}
}
Subnet root = new Subnet();
public void insertIP(String ip) {
String[] parts = ip.split("\\.");
Subnet curr = root;
for (int i = 0; i < 4; i++) {
int value = Integer.parseInt(parts[i]);
if (curr.subnets[value] == null) {
curr.subnets[value] = new Subnet(value);
curr.subnets[value].parent = curr;
}
curr = curr.subnets[value];
curr.count++;
}
}
public void getMostCommonSubnets() {
Queue<Subnet> q = new LinkedList<Subnet>();
q.add(root);
q.add(null);
int maxFreq = -1;
Subnet common = null;
while(!q.isEmpty()) {
Subnet curr = q.poll();
if (curr != null) {
for (int i = 0; i < curr.subnets.length; i++) {
Subnet s = curr.subnets[i];
if (s != null) {
if (s.count > maxFreq) {
maxFreq = s.count;
common = s;
}
q.add(s);
}
}
} else {
if (!q.isEmpty())
q.add(null);
StringBuilder ip = new StringBuilder();
while(common != root && common != null) {
ip.insert(0, ".").insert(0, common.value);
common = common.parent;
}
if (maxFreq > 0) {
System.out.println("Most common subnet = " + ip
+ " with freq = " + maxFreq);
}
maxFreq = -1;
common = null;
}
}
}
public static void main(String[] args) {
String[] ips = {"1.0.1.2", "2.3.1.0", "3.4.5.0", "4.5.6.7",
"1.1.2.3", "2.3.2.0", "3.4.5.1", "4.5.6.7",
"1.2.3.4", "2.3.3.0", "3.4.5.2", "4.5.6.7",
"1.3.4.5", "2.3.4.0", "3.4.5.3", "0.1.2.3",
"1.4.5.6", "2.3.5.0", "5.2.3.4", "6.1.2.3",
"1.5.6.7", "7.1.2.3", "8.1.2.3", "9.1.2.3"};
IPTree tree = new IPTree();
for (String ip : ips) {
tree.insertIP(ip);
}
tree.getMostCommonSubnets();
}
}
what is a subnet? Is he refering to the class a,b,c,d,(e) networks?
If so, you can determine the class of the network by the ip address.
This would reduce data, since you are not required to count the hosts within the network
(there are roughly ~2.1 Mio class a+b+c+d networks). If he want to identify potential
subnets or even hosts (belongig to a botnet?) inside the a-b-c-d networks he will not get
what he wants by spliting on 8.bit boundaries but we need to guess a subnet everywhere.
for a given IP, ignoring class a,b,c,d, there are *theoretically* 32 subnets (from single machine
to whole IP v4 range)
To simplify I would create an integer, lets call it subnet-id with the subnet-size and
the remaining subnet id:
1st bit set: remaining 32 bit contain the ip
1st bit 0, 2nd bit 1: remaining 31 bits will contain the subnet (actually that does not exist in practice)
1st, 2nd 0, 3rd 1: remaining 30 bits will contain the subnet id...
With 33 bits all potential subnets are covered, each ip will create 32 such subnet-ids.
To actually count, I would create a self balancing tree in memory to count frequencies
by above subnet-id. If memory gets full, write id:frequency into a file, ordered by id
(thus no HT, whereas a HT could be used but before writing I have to sort it in-place
(no more memory) which is not a standard HT use case ... it could be done by a custom
HT that uses open addressing... which is prefered over a tree)
Continue untill all has been processed and we have m sorted files. Merge them by key.
Traverse final file to find 4 highest values.
Can be implemented using map-reduce to scale to many machines.
Data can be appended if more logfiles come without the need to re-process prior .processed
data.
//Read n lines (whee n is the maximum amount of lines that can fit into memory) at a time. Once line 0 has been processed read the n+1th line.For each line, add the
ip address into a trie as shown below.
//Node in trie.
public class Node{
int[] counts;//Stores frequency of each value. assumes that max value is the maximum value that can fit into a 32 bit integer.
Node[] arr;
public Node()
{
arr=new Node[Integer.MAX_VALUE];
counts=new int[arr.length];
}
}
//Adds a single IP address to the trie whose root is rt.
public void addIpAddress(String s,Node rt)
{
if(s!=null && s.length()!=0)
{
String[] sp=s.split('.');
Node v=rt;
for(String s:sp)
{
int k=Integer.parseInt(s);
v[k-1]=new Node();
counts[k-1]++;
v=v[k-1];
}
}
}
//Extracts the four most common subnets.
public ArrayList<String> getTopFour(Node rt)
{
ArrayList<String> ls=new ArrayList<String>();
int[] ips=new int[4];
int ipIdx=0;
for(int i=0;i<4;i++)
{
int maxCount=0;
int idx=-1;
for(int j=0;j<rt.arr.length;j++)
{
if(counts[j]>maxCount)
{
maxCount=counts[j];
idx=j;
}
}
ips[ipIdx++]=idx+1;
rt=rt.arr[idx];
}
ls.add(""+ips[0]+".*.*.*");
ls.add(""+ips[0]+"."+ips[1]+".*.*");
ls.add(""+ips[0]+"."+ips[1]+"."+ips[2]+".*");
ls.add(""+ips[0]+"." +ips[1]+"."+ips[2]+"."+ips[3]);
return ls;
}
Use a Trie and have a integer counter in every trie node.
- Partha July 18, 2016Increment the counter whenever there is a new element is added to the trie.