Amazon Interview Question
Software Engineer / Developers
Please note the phrase "beyond normal memory limits"! Given that, how can you build an in-memory hash table of this data?
A hash compresses the storage, so that shouldn't be a problem, but you should still ask the interviewer to clarify.
@eric: Well, I would guess that "beyond memory limits" applies only to the entire data set, and not to the range of tokens. For example, the entire corpus of English literature would take petabytes to store, but the number of English words (the tokens) is no more than 100K.
Even if the range of tokens really is too big to fit in memory, a hashmap can still be used, except that once the map balloons past a certain size, we have to merge the counts into a set of files on disk.
In the context of "hash compresses the storage", you still need to know at least the range (upper bound) of the data values to allocate the hash, or else collisions would break the idea. Basically you need a direct map, not a perfect hash, and hence need to know the range.
If the range is (or can be) known, then your trick works.
Otherwise, my idea is to process the data in smaller chunks (say, chunks of size 2n) and keep track of the first n most frequently occurring items. (???)
Nope. Hashmaps will still work. Remember that the thing being hashed to does not have to be the count itself. It can be a linked list (or some other data structure, depending on what kind of collision-resolution scheme one uses) containing the exact (key, count) pairs. So the hash sends you to a linked list, which you can traverse to find the EXACT key that was hashed, and hence update the correct count. Collisions will of course reduce the efficiency of the hash, but they will (or should) never compromise its correctness. This is an important property of most hashmaps that you have to understand. If correctness were an issue, hashmaps would not be used as commonly as they are.
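A minimal sketch of the chaining idea described above (the class name and bucket count are illustrative): colliding keys share a bucket, but exact key comparison keeps every count correct.

```python
# Chained hash table for counting: each bucket holds (key, count) pairs,
# so collisions cost a short scan but never corrupt a count.
class ChainedCounter:
    def __init__(self, n_buckets=1024):
        self.buckets = [[] for _ in range(n_buckets)]

    def add(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, count) in enumerate(bucket):
            if k == key:                  # exact key match despite any collision
                bucket[i] = (k, count + 1)
                return
        bucket.append((key, 1))           # first occurrence of this key

    def count(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, count in bucket:
            if k == key:
                return count
        return 0

c = ChainedCounter(n_buckets=4)           # tiny table on purpose, to force collisions
for w in ["a", "b", "a", "c", "a", "b"]:
    c.add(w)
```

Even with only four buckets, `c.count("a")` returns exactly 3: collisions slow the lookup, but never merge two keys.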
This is a variant of the distributed processing problem.
Since the data cannot be stored in memory (RAM), split the data into multiple files, each smaller than the size of RAM.
For each file, find the top n most frequently occurring items and store them in another file (call it the ResultFile). Each entry of the ResultFile consists of the item, its occurrence count, and the name of the original file.
We then sort the ResultFile by occurrence count and print the top n most frequent items.
If the hard disk cannot store such a huge amount of data, split the data across multiple computers. Each computer returns its "n most frequent items" using the above algorithm, and the master computer aggregates the results and prints the final n most frequent items.
The difficulty is how to get the frequency of the items. Simply splitting the data into N files and counting within each file will not work, because you still have to load all the counts into memory to merge them. One solution is to hash the items into N files, so that each file can be loaded into memory on its own to get complete counts. Go through each file and use a heap to keep track of the top-k items.
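A sketch of that hash-partitioning approach (in-memory lists stand in for the on-disk partition files, and the partition count is illustrative): because identical items always hash to the same partition, each partition's counts are complete on their own and never need merging.

```python
import heapq
from collections import Counter

N_PARTITIONS = 8

def partition(items):
    # Stand-ins for on-disk files: every copy of an item lands in one partition.
    parts = [[] for _ in range(N_PARTITIONS)]
    for item in items:
        parts[hash(item) % N_PARTITIONS].append(item)
    return parts

def top_k(items, k):
    heap = []  # min-heap holding the global top-k (count, item) pairs
    for part in partition(items):
        counts = Counter(part)             # one partition's counts fit in memory
        for item, c in counts.items():
            heapq.heappush(heap, (c, item))
            if len(heap) > k:
                heapq.heappop(heap)        # evict the smallest count
    return sorted(heap, reverse=True)

data = ["x"] * 5 + ["y"] * 3 + ["z"]
```

Here `top_k(data, 2)` yields `[(5, 'x'), (3, 'y')]`; the heap stays at size k regardless of how many distinct items the data contains.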
Can we use a trie to represent the data? That would be a compressed representation. The leaf node corresponding to a particular item would also contain the count of the whole phrase (group of tokens) occurring in the file. We could also maintain an auxiliary map/hash/tree containing pointers to nodes in the trie; then we would be able to keep track of the most frequent ones.
Use a hash to count the words, then use a binary search tree (or sorted array) to find the K most frequent words.
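The two-step approach above can be sketched in a few lines (using a heap-based selection in place of the BST/sorted-array step, since the selection structure is interchangeable):

```python
import heapq
from collections import Counter

def k_most_frequent(words, k):
    counts = Counter(words)  # step 1: hash-based counting
    # step 2: select the K entries with the largest counts
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

words = "the quick the lazy the quick dog".split()
```

For example, `k_most_frequent(words, 2)` gives `[('the', 3), ('quick', 2)]`. Selection costs O(m log k) for m distinct words, versus O(m log m) for fully sorting the counts.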
- Anonymous October 12, 2009