Interview Question
Developer Program Engineers
Country: United States
Interview Type: Phone Interview
You cannot hold a linked list with a terabyte of nodes in memory. Even if the data were only gigabytes, you would need a highly parallel supercomputer with a huge amount of memory. We need a disk-based approach to achieve this.
While reading the file I would build a hash table mapping (num => freq).
To get them sorted I'd build a BST or a heap of (freq, num) nodes, ordered by frequency.
According to the question many of the numbers repeat, so both the hash table and the heap should stay rather small.
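A minimal in-memory sketch of this approach (Python; assumes the distinct numbers fit in memory, and the function name is my own):

```python
from collections import Counter
import heapq

def top_frequencies(numbers, k=2):
    """Build a hash table of number -> frequency, then use a heap
    to pull out the k most frequent (freq, num) pairs."""
    freq = Counter(numbers)  # hash table: num => freq
    return heapq.nlargest(k, ((c, n) for n, c in freq.items()))

# Streaming from a file would look like:
#   with open("numbers.txt") as f:
#       top_frequencies(int(line) for line in f)
```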
Sorry, you have O(1) per hash insert plus O(n*logn) for the heap.
After thinking about this further, wouldn't a hash table be bad in this scenario, since it would waste a lot of space?
Assume you have access to a hard disk of unlimited size (we'll get rid of this assumption later). Read the numbers from file and set up a hash map of numbers to their frequency. Keep building the hash map until you run out of main memory. Now you would need to flush this "index" onto disk. One strategy is to modulo the numbers by 1024 (say) and store the record in an appropriately named file.
index_0 : all records of type <number, frequency> where number % 1024 = 0
index_1 : all records of type <number, frequency> where number % 1024 = 1
index_2 : all records of type <number, frequency> where number % 1024 = 2
index_3 : all records of type <number, frequency> where number % 1024 = 3
...
index_1023 : all records of type <number, frequency> where number % 1024 = 1023
Keep building the hash map from the original data and keep flushing it to disk. Next, sort every index file by number. If an index file is too large to load into memory for sorting, you need an external sorting algorithm. Then make another pass over each sorted file to combine duplicate records, summing their frequencies.
You now have multiple sorted arrays and you need to produce a single sorted array - this is the standard n-way merge algorithm! Once you have such a consolidated index file, you can answer any frequency query you want.
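The n-way merge can be sketched with a heap (Python; `heapq.merge` performs the k-way merge of sorted runs, and adjacent duplicates are combined by summing frequencies, as described above):

```python
import heapq

def nway_merge(sorted_runs):
    """Merge several sorted lists of (number, frequency) records
    into one sorted list, summing frequencies of duplicates."""
    merged = []
    for num, count in heapq.merge(*sorted_runs):
        if merged and merged[-1][0] == num:
            # Same number seen in another run: consolidate.
            merged[-1] = (num, merged[-1][1] + count)
        else:
            merged.append((num, count))
    return merged
```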
How do you speed up this algorithm? Let's assume you have access to a cluster of 100 hosts. Split the 1 TB file into 100 chunks and let every host run this algorithm locally. We have now reduced the problem to merging 100 sorted arrays, albeit on different computers. Pair up hosts to reduce the 100 arrays to 50, then 25, 13, 7, 4, 2, and finally the master host receives the single sorted array.
What if the host does not have enough disk space to accommodate the index? Quite simple really! Find the capacity on disk that can be made available. If 1 GB of space is available for the index, then chunk the data set into 1024 parts and run it on a cluster of 1024 hosts (worst case being that the frequency of every number is 1, which means the index will be roughly equal to the original data set). Instead of merging the indices in one file (because one host does not have that much space), rearrange the files in such a way:
host 0 has index_0 files from all the computers.
host 1 has index_1 files from all the computers.
host 2 has index_2 files from all the computers.
host 3 has index_3 files from all the computers.
...
host 1023 has index_1023 files from all the computers.
Merge the index files on each computer (standard merge algorithm). Query every host for the topmost frequency and then find the winner of winners.
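The final query might look like this (a sketch; hosts are simulated here as in-memory lists of (number, frequency) records, and both function names are my own):

```python
def top_of_host(index_records):
    """One host's answer: its most frequent (frequency, number) pair."""
    num, freq = max(index_records, key=lambda r: r[1])
    return (freq, num)

def winner_of_winners(hosts):
    """Query every host for its topmost frequency, then pick the
    winner of winners across the cluster."""
    return max(top_of_host(h) for h in hosts)
```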
This algorithm scales horizontally - throw in more hosts and you'll get results faster.
This is a nifty solution for efficiently obtaining the frequency of a number given the number.
I don't believe it quite answers the OP's question though.
What we want is to determine the average frequency and the second-highest frequency.
Devin,
No. The question is to reduce the data set into something that can be easily queried. The challenge is to develop an algorithm for commodity hardware (limited memory, limited disk size) but numerous such machines. To find the average frequency you can iterate over the entire index and compute it.
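Computing the average frequency from the consolidated index is then a single pass (a sketch, using the (number, frequency) record format assumed above):

```python
def average_frequency(index_records):
    """One pass over (number, frequency) records: the average
    frequency is total occurrences / distinct numbers."""
    total = count = 0
    for _num, freq in index_records:
        total += freq
        count += 1
    return total / count if count else 0.0
```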
Here is my attempt at this problem. I think it's important to ask if you need to optimize for speed or space. In my implementation, I optimized for space. You could use a hash table to optimize for speed and sacrifice more space.
To conserve space I used a doubly-linked list. A linked list keeps the space overhead to the minimum required and makes adding elements easy, while the double links make it cheap to keep the list sorted on the fly.
The list is kept sorted by the frequency of each number: the more frequent a number, the closer its node sits to the front.
The average is computed the hard way at the end...
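Adrian's code is not shown in this copy; a sketch of the structure he describes might look like this (Python, with class and method names of my own choosing). An incremented node bubbles toward the front while it is more frequent than its predecessor, which the double links make cheap:

```python
class Node:
    __slots__ = ("num", "freq", "prev", "next")
    def __init__(self, num):
        self.num, self.freq = num, 1
        self.prev = self.next = None

class FreqList:
    """Doubly-linked list of (num, freq) nodes kept sorted by
    descending frequency."""
    def __init__(self):
        self.head = self.tail = None

    def add(self, num):
        node = self.head
        while node and node.num != num:
            node = node.next
        if node is None:              # new number: freq 1, goes at the tail
            node = Node(num)
            if self.tail:
                self.tail.next, node.prev = node, self.tail
                self.tail = node
            else:
                self.head = self.tail = node
            return
        node.freq += 1
        # Bubble toward the front past less frequent predecessors.
        while node.prev and node.prev.freq < node.freq:
            p = node.prev
            p.next = node.next        # unlink node...
            if node.next:
                node.next.prev = p
            else:
                self.tail = p
            node.prev = p.prev        # ...and reinsert it before p
            if p.prev:
                p.prev.next = node
            else:
                self.head = node
            node.next, p.prev = p, node

    def to_pairs(self):
        out, node = [], self.head
        while node:
            out.append((node.num, node.freq))
            node = node.next
        return out
```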
Sample output...
- Adrian November 18, 2012