Amazon Interview Question
Software Engineer / Developers
Please note the phrase "beyond normal memory limits"! Given that, how can you build an in-memory hash table of this data?
A hash compresses the storage, so that shouldn't be a problem, but you should still ask the interviewer to clarify.
@eric: Well, I would guess that "beyond memory limits" applies only to the entire data set, and not to the range of tokens. For example, the entire corpus of English literature would take petabytes to store, but the number of English words (the tokens) is no more than 100K.
Even if the range of tokens really is too big to fit in memory, a hashmap can still be used, except that once the map balloons past a certain size, we have to merge the counts into a set of files on disk.
In the context of "hash compresses the storage", you still need to know at least the range (upper bound) of the data values to allocate the hash, or else collisions would break the idea. Basically you need a direct map, not a perfect hash, and hence need to know the range.
If the range is (or can be) known, then your trick works.
Otherwise, my idea is to process the data in smaller chunks (say, chunks of size 2n) and keep track of the first n most frequently occurring items. (???)
Nope. Hashmaps will still work. Remember that the thing being hashed to does not have to be the count itself. It can be a linked list (or some other data structure, depending on what kind of collision-resolution scheme one uses) containing the exact (key, count) pairs. So the hash sends you to a linked list, which you can traverse to find the EXACT key that was hashed, and hence update the correct count. Collisions will of course reduce the efficiency of the hash, but they will (or should) never compromise its correctness. This is an important property of most hashmaps that you have to understand. If correctness were an issue, hashmaps would not be used as commonly as they are.
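A minimal sketch of the chaining idea described above (the class name and bucket count are illustrative): colliding keys share a bucket, but exact key comparison keeps every count correct.

```python
# Chained hash table for counting: each bucket holds (key, count) pairs,
# so collisions cost a short scan but never corrupt a count.
class ChainedCounter:
    def __init__(self, n_buckets=1024):
        self.buckets = [[] for _ in range(n_buckets)]

    def add(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, count) in enumerate(bucket):
            if k == key:                  # exact key match despite any collision
                bucket[i] = (k, count + 1)
                return
        bucket.append((key, 1))           # first occurrence of this key

    def count(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, count in bucket:
            if k == key:
                return count
        return 0

c = ChainedCounter(n_buckets=4)           # tiny table on purpose, to force collisions
for w in ["a", "b", "a", "c", "a", "b"]:
    c.add(w)
```

Even with only four buckets, `c.count("a")` returns exactly 3: collisions slow the lookup, but never merge two keys.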
This is a variant of the distributed processing problem.
Since the data cannot be stored in memory (RAM), split the data into multiple files, each smaller than the size of RAM.
For each file, find the top n most frequently occurring items and store them in another file (call it the ResultFile). Each entry of the ResultFile consists of the item, its occurrence count, and the name of the original file.
We then sort the ResultFile by occurrence count and print the top n most frequent items.
If the hard disk cannot store such a huge amount of data, split the data across multiple computers. Each computer returns its "n most frequent items" using the above algorithm, and the master computer aggregates the results and prints the final n most frequent items.
The difficulty is how to get the frequency of the items. Simply splitting the data into N files and counting within each file will not work, because you still have to load all the counts into memory to merge them. One solution is to hash the items into N files, so that each file can be loaded into memory on its own to get complete counts. Go through each file and use a heap to keep track of the top-k items.
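A sketch of that hash-partitioning approach (in-memory lists stand in for the on-disk partition files, and the partition count is illustrative): because identical items always hash to the same partition, each partition's counts are complete on their own and never need merging.

```python
import heapq
from collections import Counter

N_PARTITIONS = 8

def partition(items):
    # Stand-ins for on-disk files: every copy of an item lands in one partition.
    parts = [[] for _ in range(N_PARTITIONS)]
    for item in items:
        parts[hash(item) % N_PARTITIONS].append(item)
    return parts

def top_k(items, k):
    heap = []  # min-heap holding the global top-k (count, item) pairs
    for part in partition(items):
        counts = Counter(part)             # one partition's counts fit in memory
        for item, c in counts.items():
            heapq.heappush(heap, (c, item))
            if len(heap) > k:
                heapq.heappop(heap)        # evict the smallest count
    return sorted(heap, reverse=True)

data = ["x"] * 5 + ["y"] * 3 + ["z"]
```

Here `top_k(data, 2)` yields `[(5, 'x'), (3, 'y')]`; the heap stays at size k regardless of how many distinct items the data contains.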
Can we use a trie to represent the data? That would be a compressed representation. The leaf node corresponding to a particular item would also contain the count of the whole phrase (group of tokens) occurring in the file. We could also maintain an auxiliary map/hash/tree containing pointers to nodes in the trie; then we would be able to keep track of the most frequent ones.
Use a hash to count the words, then use a binary search tree (or sorted array) to find the K most frequent words.
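The two-step approach above can be sketched in a few lines (using a heap-based selection in place of the BST/sorted-array step, since the selection structure is interchangeable):

```python
import heapq
from collections import Counter

def k_most_frequent(words, k):
    counts = Counter(words)  # step 1: hash-based counting
    # step 2: select the K entries with the largest counts
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

words = "the quick the lazy the quick dog".split()
```

For example, `k_most_frequent(words, 2)` gives `[('the', 3), ('quick', 2)]`. Selection costs O(m log k) for m distinct words, versus O(m log m) for fully sorting the counts.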
- Anonymous October 12, 2009