Google Interview Question
Software Engineer / Developer

A 30 GB hash table in memory is very bad; the cost of swapping blocks in and out of memory will dominate. So I think you need to suggest an idea like a multi-level structure, similar to a page table.
->If the hash table is this big, we are definitely talking about disk usage, so the overall performance of key lookups will suffer.
->If more than 50% of the hash table is empty, something is wrong with the hash function.
(Ignore this; it was stated as part of the problem.)
->We need to choose a proper file format to store it, so that lookup performance is preserved.
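To illustrate the file-format point, here is one hedged sketch: a fixed-size-record layout, so a key's slot can be reached with a single seek instead of a scan. The class name, record widths, and the omission of collision handling are all assumptions for brevity, not a production design.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Sketch of a disk-backed table with fixed-size records: every slot is
// the same width, so locating a key is one seek at a computed offset
// rather than a scan of the file. Collision handling (probing/chaining)
// is omitted for brevity; the layout is hypothetical.
public class DiskHash {
    static final int KEY = 16, VAL = 112, REC = KEY + VAL; // 128 B/record
    static final long SLOTS = 1L << 16;                    // demo size only

    final RandomAccessFile f;

    DiskHash(String path) throws IOException {
        f = new RandomAccessFile(path, "rw");
        f.setLength(SLOTS * REC); // pre-size so every offset is valid
    }

    // Byte offset of the slot for a key: one seek, no scan.
    static long offset(byte[] key) {
        long h = 0;
        for (byte b : key) h = 31 * h + (b & 0xff);
        return Math.floorMod(h, SLOTS) * REC;
    }

    void put(byte[] key, byte[] val) throws IOException {
        f.seek(offset(key));
        f.write(Arrays.copyOf(key, KEY)); // pad/truncate to fixed width
        f.write(Arrays.copyOf(val, VAL));
    }

    byte[] get(byte[] key) throws IOException {
        f.seek(offset(key) + KEY);
        byte[] val = new byte[VAL];
        f.readFully(val);
        return val;
    }
}
```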
I think one problem might be hash table reallocation. If we want to increase the number of slots because we have hit some load threshold, we have to allocate a new table, copy all the data into it, and recompute every hash, which is time-consuming.
We don't have this problem with a tree, for example.
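A minimal open-addressing sketch makes that cost visible: once the load factor crosses a threshold, `grow()` allocates a second array and rehashes every existing entry. The threshold, initial size, and probing scheme here are illustrative choices, not a specific library's implementation.

```java
// Sketch of why growing a hash table is expensive: past the load-factor
// threshold, every existing entry must be rehashed into a new, larger
// array -- an O(n) pause a balanced tree never takes. Open addressing
// with linear probing, chosen for brevity.
public class GrowableTable {
    Object[] keys = new Object[8];
    Object[] vals = new Object[8];
    int size = 0;

    public void put(Object k, Object v) {
        if (size + 1 > keys.length * 3 / 4) grow(); // load factor 0.75
        if (insert(keys, vals, k, v)) size++;
    }

    private void grow() {
        Object[] nk = new Object[keys.length * 2]; // second large allocation
        Object[] nv = new Object[vals.length * 2];
        for (int i = 0; i < keys.length; i++)      // rehash EVERY entry
            if (keys[i] != null) insert(nk, nv, keys[i], vals[i]);
        keys = nk;
        vals = nv;
    }

    // Places (or updates) an entry; returns true if the key is new.
    private static boolean insert(Object[] ks, Object[] vs, Object k, Object v) {
        int i = Math.floorMod(k.hashCode(), ks.length);
        while (ks[i] != null && !ks[i].equals(k)) i = (i + 1) % ks.length;
        boolean isNew = ks[i] == null;
        ks[i] = k;
        vs[i] = v;
        return isNew;
    }

    public Object get(Object k) {
        int i = Math.floorMod(k.hashCode(), keys.length);
        while (keys[i] != null) {
            if (keys[i].equals(k)) return vals[i];
            i = (i + 1) % keys.length;
        }
        return null;
    }
}
```

At 30 GB, each `grow()` means tens of gigabytes allocated and every entry rehashed before the table can accept the next insert.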
The answer partly depends on whether they mean a classic hash table implementation (like Hashtable / HashMap in Java) or something more sophisticated. Either way, 30 GB of memory is still quite big for a single machine/VM by today's standards.
- Anonymous May 22, 2012

So think about what's going on underneath:
It has to read and write at arbitrary positions in one massive array.
It has to grow if it fills up beyond some measure; see 'load factor' in Java implementation.
In a garbage-collected language/implementation, all the objects stored in the hash table need to be inspected by the garbage collector.
Which leads to the following problems:
It is unclear that even today's operating systems deal well with allocating chunks of memory in the tens of GBs
For simplicity, say that half of the 30 GB is used by the table structure itself (not the key and value objects), so there is a 15 GB array inside. Every time the table grows, you need to allocate at least another 15 GB.
Even if a tens-of-GB array could be allocated, the OS would page some of this memory out. Since we are assuming a good hash function, accesses are spread across the whole array, which defeats page caching: there will be a lot of page faults.
Let's say we don't use all the data evenly. Some keys are used frequently, and others are not. To illustrate, say that each key-value entry is tiny -- 128 bytes -- and, for simplicity, that we store everything in the hash table as values. So 30 GB / 128 B = ~250M entries. Now say there are 25K commonly used keys (25K / 250M = 0.01%). With a good hash function, these are evenly scattered across the massive array. Even with small pages -- say 4 KB -- the 25K (entries) * 128 bytes (entry size) = ~3.2 MB of commonly used data costs us 25K (entries) * 4 KB (page size) = ~100 MB of memory that needs to be kept paged in... a whopping ~3% efficiency!
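Spelling out that arithmetic (the constants come from the paragraph above; note the efficiency is just entry size over page size, 128 / 4096 ≈ 3.1%):

```java
// The hot-set arithmetic from the text: ~250M tiny entries scattered by
// a good hash, so a 25K-entry hot set touches 25K distinct 4 KB pages --
// a few MB of useful data pinning ~100 MB of resident memory.
public class PageEfficiency {
    public static void main(String[] args) {
        long tableBytes = 30L * 1024 * 1024 * 1024; // 30 GB table
        long entryBytes = 128;                      // tiny key-value entry
        long pageBytes  = 4 * 1024;                 // 4 KB OS pages
        long hotKeys    = 25_000;                   // commonly used keys

        long entries   = tableBytes / entryBytes;        // ~250M entries
        long hotData   = hotKeys * entryBytes;           // ~3.2 MB useful
        long residency = hotKeys * pageBytes;            // ~100 MB paged in
        double efficiency = 100.0 * hotData / residency; // ~3.1%

        System.out.printf("entries=%d hot=%d KB resident=%d MB eff=%.1f%%%n",
                entries, hotData / 1024, residency / (1024 * 1024), efficiency);
    }
}
```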
In the Java world, practitioners don't recommend heap sizes larger than 4-8 GB. Sure, there's stuff like Azul, but that simply proves the point -- a typical garbage collector doesn't scale to these sizes very well.
I agree with other posters that Google is looking for a distributed solution. But I think at heart, a simple hash table stops scaling beyond a point. Given the above:
You would have to distribute if all the entries are accessed relatively evenly.
If some keys are accessed the majority of the time, using two maps (one for the most frequently used keys) can buy you a lot.
In the Java world, using specialized maps that store data off the heap can buy you performance too; see Peter Lawrey's work for example.
Even simply striping the underlying array in a hashtable (like Java's ConcurrentHashMap does) can buy you major improvements when you have to grow the hashtable.
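A sketch of the two-maps idea above: a small, bounded LRU map for hot keys sits in front of the big backing map, so the frequently used 0.01% stays in a compact, cache-friendly structure. The `HOT_CAPACITY` value, the `LinkedHashMap`-based LRU, and the promote-on-miss policy are illustrative assumptions, not a prescribed design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Two-tier map: a bounded, access-ordered LinkedHashMap acts as an LRU
// cache for hot keys in front of a large cold map (which in practice
// could be off-heap or distributed).
public class TwoTierMap<K, V> {
    static final int HOT_CAPACITY = 25_000; // illustrative hot-set bound

    final Map<K, V> hot = new LinkedHashMap<K, V>(1024, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > HOT_CAPACITY; // evict LRU entry when full
        }
    };
    final Map<K, V> cold;

    TwoTierMap(Map<K, V> backing) {
        this.cold = backing;
    }

    V get(K key) {
        V v = hot.get(key);             // hit: access order promotes it
        if (v != null) return v;
        v = cold.get(key);
        if (v != null) hot.put(key, v); // miss: promote into hot tier
        return v;
    }

    void put(K key, V value) {
        cold.put(key, value);
        hot.put(key, value);
    }
}
```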