Google Interview Question
Software Engineer / Developer

A 30 GB hash table in memory is very bad; the cost of swapping blocks in and out of memory will dominate. So I think you need to suggest an idea like a multi-level structure, similar to a page table.
->If the hash table is this big, we are definitely talking about disk usage, so the overall performance of key lookups will suffer.
->If more than 50% of the hash table is empty, something is wrong with the hash function.
(Ignore this; it was stated as part of the problem.)
->We need to choose a proper file format to store it, so that lookup performance is preserved.
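To illustrate the file-format point, here is one hedged sketch: a fixed-size-record layout, so a key's slot can be reached with a single seek instead of a scan. The class name, record widths, and the omission of collision handling are all assumptions for brevity, not a production design.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Sketch of a disk-backed table with fixed-size records: every slot is
// the same width, so locating a key is one seek at a computed offset
// rather than a scan of the file. Collision handling (probing/chaining)
// is omitted for brevity; the layout is hypothetical.
public class DiskHash {
    static final int KEY = 16, VAL = 112, REC = KEY + VAL; // 128 B/record
    static final long SLOTS = 1L << 16;                    // demo size only

    final RandomAccessFile f;

    DiskHash(String path) throws IOException {
        f = new RandomAccessFile(path, "rw");
        f.setLength(SLOTS * REC); // pre-size so every offset is valid
    }

    // Byte offset of the slot for a key: one seek, no scan.
    static long offset(byte[] key) {
        long h = 0;
        for (byte b : key) h = 31 * h + (b & 0xff);
        return Math.floorMod(h, SLOTS) * REC;
    }

    void put(byte[] key, byte[] val) throws IOException {
        f.seek(offset(key));
        f.write(Arrays.copyOf(key, KEY)); // pad/truncate to fixed width
        f.write(Arrays.copyOf(val, VAL));
    }

    byte[] get(byte[] key) throws IOException {
        f.seek(offset(key) + KEY);
        byte[] val = new byte[VAL];
        f.readFully(val);
        return val;
    }
}
```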
I think one problem might be hash table reallocation. If we want to increase the number of slots because we have hit some load threshold, we have to allocate a new table, copy all the data into it, and recompute every hash, which is time-consuming.
We don't have this problem with a tree, for example.
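A minimal open-addressing sketch makes that cost visible: once the load factor crosses a threshold, `grow()` allocates a second array and rehashes every existing entry. The threshold, initial size, and probing scheme here are illustrative choices, not a specific library's implementation.

```java
// Sketch of why growing a hash table is expensive: past the load-factor
// threshold, every existing entry must be rehashed into a new, larger
// array -- an O(n) pause a balanced tree never takes. Open addressing
// with linear probing, chosen for brevity.
public class GrowableTable {
    Object[] keys = new Object[8];
    Object[] vals = new Object[8];
    int size = 0;

    public void put(Object k, Object v) {
        if (size + 1 > keys.length * 3 / 4) grow(); // load factor 0.75
        if (insert(keys, vals, k, v)) size++;
    }

    private void grow() {
        Object[] nk = new Object[keys.length * 2]; // second large allocation
        Object[] nv = new Object[vals.length * 2];
        for (int i = 0; i < keys.length; i++)      // rehash EVERY entry
            if (keys[i] != null) insert(nk, nv, keys[i], vals[i]);
        keys = nk;
        vals = nv;
    }

    // Places (or updates) an entry; returns true if the key is new.
    private static boolean insert(Object[] ks, Object[] vs, Object k, Object v) {
        int i = Math.floorMod(k.hashCode(), ks.length);
        while (ks[i] != null && !ks[i].equals(k)) i = (i + 1) % ks.length;
        boolean isNew = ks[i] == null;
        ks[i] = k;
        vs[i] = v;
        return isNew;
    }

    public Object get(Object k) {
        int i = Math.floorMod(k.hashCode(), keys.length);
        while (keys[i] != null) {
            if (keys[i].equals(k)) return vals[i];
            i = (i + 1) % keys.length;
        }
        return null;
    }
}
```

At 30 GB, each `grow()` means tens of gigabytes allocated and every entry rehashed before the table can accept the next insert.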
The answer partly depends on whether they mean a classic hash table implementation (like Hashtable / HashMap in Java) or something more sophisticated. Either way, 30 GB of memory is still quite big for a single machine/VM by today's standards.
- Anonymous May 22, 2012

So think about what's going on underneath:
It has to read and write at arbitrary positions in one massive array.
It has to grow if it fills up beyond some measure; see 'load factor' in Java implementation.
In a garbage-collected language/implementation, all the objects stored in the hash table need to be inspected by the garbage collector.
Which leads to the following problems:
It is unclear that even today's operating systems deal well with allocating chunks of memory in the tens of GBs
For simplicity, say that half of the 30 GB is used by the table structure itself (not the key and value objects), so there is a 15 GB array inside. Every time the table grows, you need to allocate at least another 15 GB.
Even if a tens-of-GB array could be allocated, the OS would page some of this memory out. Since we are assuming a good hash function, accesses are spread across the whole array, which defeats page caching: there will be a lot of page faults.
Let's say we don't use all the data evenly. Some keys are used frequently, and others are not. To illustrate, say that each key-value entry is tiny -- 128 bytes -- and, for simplicity, that we store everything in the hash table as values. So 30 GB / 128 B = ~250M entries. Now say there are 25K commonly used keys (25K / 250M = 0.01%). With a good hash function, these are evenly scattered across the massive array. Even with small pages -- say 4 KB -- the 25K (entries) * 128 bytes (entry size) = ~3.2 MB of commonly used data costs us 25K (entries) * 4 KB (page size) = ~100 MB of memory that needs to be kept paged in... a whopping ~3% efficiency!
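Spelling out that arithmetic (the constants come from the paragraph above; note the efficiency is just entry size over page size, 128 / 4096 ≈ 3.1%):

```java
// The hot-set arithmetic from the text: ~250M tiny entries scattered by
// a good hash, so a 25K-entry hot set touches 25K distinct 4 KB pages --
// a few MB of useful data pinning ~100 MB of resident memory.
public class PageEfficiency {
    public static void main(String[] args) {
        long tableBytes = 30L * 1024 * 1024 * 1024; // 30 GB table
        long entryBytes = 128;                      // tiny key-value entry
        long pageBytes  = 4 * 1024;                 // 4 KB OS pages
        long hotKeys    = 25_000;                   // commonly used keys

        long entries   = tableBytes / entryBytes;        // ~250M entries
        long hotData   = hotKeys * entryBytes;           // ~3.2 MB useful
        long residency = hotKeys * pageBytes;            // ~100 MB paged in
        double efficiency = 100.0 * hotData / residency; // ~3.1%

        System.out.printf("entries=%d hot=%d KB resident=%d MB eff=%.1f%%%n",
                entries, hotData / 1024, residency / (1024 * 1024), efficiency);
    }
}
```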
In the Java world, practitioners don't recommend heap sizes larger than 4-8 GB. Sure, there's stuff like Azul, but that simply proves the point -- a typical garbage collector doesn't scale to these sizes very well.
I agree with other posters that Google is looking for a distributed solution. But I think at heart, a simple hash table stops scaling beyond a point. Given the above:
You would have to distribute if all the entries are accessed relatively evenly.
If some keys are accessed the majority of the time, using two maps (one for the most frequently used keys) can buy you a lot.
In the Java world, using specialized maps that store data off the heap can buy you performance too; see Peter Lawrey's work for example.
Even simply striping the underlying array in a hashtable (like Java's ConcurrentHashMap does) can buy you major improvements when you have to grow the hashtable.
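A sketch of the two-maps idea above: a small, bounded LRU map for hot keys sits in front of the big backing map, so the frequently used 0.01% stays in a compact, cache-friendly structure. The `HOT_CAPACITY` value, the `LinkedHashMap`-based LRU, and the promote-on-miss policy are illustrative assumptions, not a prescribed design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Two-tier map: a bounded, access-ordered LinkedHashMap acts as an LRU
// cache for hot keys in front of a large cold map (which in practice
// could be off-heap or distributed).
public class TwoTierMap<K, V> {
    static final int HOT_CAPACITY = 25_000; // illustrative hot-set bound

    final Map<K, V> hot = new LinkedHashMap<K, V>(1024, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > HOT_CAPACITY; // evict LRU entry when full
        }
    };
    final Map<K, V> cold;

    TwoTierMap(Map<K, V> backing) {
        this.cold = backing;
    }

    V get(K key) {
        V v = hot.get(key);             // hit: access order promotes it
        if (v != null) return v;
        v = cold.get(key);
        if (v != null) hot.put(key, v); // miss: promote into hot tier
        return v;
    }

    void put(K key, V value) {
        cold.put(key, value);
        hot.put(key, value);
    }
}
```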