Interview Question
Country: United States
The usual intent of such a question is to check your design approach and your understanding of the problem.
List of questions that you might want to ask.
1: understanding uniqueness {userID?}
2: understand structure of file {hierarchical/flat}
3: understand whether the folder is a typical disk folder or resides on a storage platform (this will help decide whether concurrent reads are possible, or whether disk hot spots will block us)
4: understand the context in which this problem is being solved {do we need results in seconds or hours}
Since you are expected to build a case, start with how you would solve this problem if the file size and number of files were small.
Then increase one factor - say the number of files - and ask how your approach changes.
Then increase the file size - how do you solve it now?
Consider the physical limitations of disk I/O. If you need to copy files to distribute the work, account for network copy times (and copying means you have to use some kind of hash to verify that network corruption has not taken place). Consider whether your approach will be multithreaded. Consider how you would create a scheme so that multiple machines accessing the data do not clash or do redundant work. You also need to think about which operations will be asynchronous and how you will collate the results.
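For the "machines should not clash or do redundant work" point, one common scheme is deterministic assignment: every machine hashes each file path and only processes the files that map to its own index, so no coordination is needed. A rough sketch (the file paths and worker count below are made up for illustration):

```python
import hashlib

def worker_for(path: str, num_workers: int) -> int:
    """Deterministically assign a file to a worker by hashing its path.

    Every machine computes the same answer independently, so no two
    workers process the same file and no file is skipped.
    """
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

# Worker k only processes the files that hash to its own index.
files = ["logs/2015-01.log", "logs/2015-02.log", "logs/2015-03.log"]
my_id, num_workers = 0, 4
my_files = [f for f in files if worker_for(f, num_workers) == my_id]
```

Because the assignment is a pure function of the path, workers can be restarted or added (with a re-shard) without a central queue.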
Once you have figured out the big-picture items, feel free to use whatever algorithm you want (or existing patterns like map/reduce), but it should not be a snap answer.
This is a design question - do not jump into code prematurely
This problem would be trivial if the size of the input is small (so that it fits in a single hashtable in memory).
The main point is that the input is large and doesn't fit in memory.
So you have to devise a way to partition the work so you could process each part in turn and finally combine the partial results into a final result.
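One well-known technique of this kind is hash partitioning: route each value to a partition file by its hash, so all duplicates of a value land in the same partition, then dedupe each partition independently with an in-memory set and sum the counts. A minimal sketch (the partition count and temp-file layout are arbitrary choices):

```python
import os
import tempfile

def count_unique(values, num_parts=16):
    """Count distinct values when the full set won't fit in memory."""
    tmp = tempfile.mkdtemp()
    parts = [open(os.path.join(tmp, f"part{i}"), "w") for i in range(num_parts)]
    for value in values:
        # The same value always hashes to the same partition, so
        # duplicates never straddle two partitions.
        parts[hash(value) % num_parts].write(value + "\n")
    for f in parts:
        f.close()
    total = 0
    for i in range(num_parts):
        # Each partition is small enough to dedupe in memory.
        with open(os.path.join(tmp, f"part{i}")) as f:
            total += len(set(line.rstrip("\n") for line in f))
    return total
```

The same idea scales out: partitions can live on different machines, and each machine reports its partial count.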
To know how to do this, you need to be familiar with various well known techniques that exist to solve similar problems.
Clearly, from your "answer", you simply don't know these techniques, so you just don't know how to solve the problem.
It's like asking someone who is not familiar with sorting algorithms how to perform sorting. He could only talk around the problem without actually providing a sorting algorithm, because he doesn't know one.
Anything you write in this interview would not beat the UNIX commands posted, except when the result set is too big for their very efficient libraries to handle in memory. In that case, you have to persist some of the data to disk.
A 2 GB file is perfectly fine to parse the names out of and fit into memory, letting the OS page out when needed. But when a file is, say, 10 terabytes, that's no longer an option.
The most robust solution can deal with, say, 1 million files of 10 terabytes each, using map/reduce: pre-split the files into smaller chunks, say 1-gigabyte chunks, then run map phases to parse each chunk and emit the names, and a reduce phase to sort the results and remove duplicates.
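A single-process imitation of those phases may make the flow concrete (the record format `date,userId,name` and the sample data are assumptions; a real job would run the phases in parallel on a framework such as Hadoop):

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: parse each record of a chunk and emit (userId, 1) pairs."""
    for line in chunk:
        date, user_id, name = line.split(",")
        yield user_id, 1

def shuffle(mapped_outputs):
    """Group all emitted pairs by key, as the framework's shuffle would."""
    groups = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: duplicates collapse into their key; count distinct keys."""
    return len(groups)

chunks = [
    ["2015-02-10,u1,Alice", "2015-02-10,u2,Bob"],
    ["2015-02-11,u1,Alice", "2015-02-11,u3,Carol"],
]
unique_users = reduce_phase(shuffle(map_phase(c) for c in chunks))
# unique_users == 3  (u1, u2, u3)
```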
HyperLogLog?
Maybe the Redis implementation? It allows you to count unique users in memory.
The Redis implementation uses only 12 KB per key to count with a standard error of 0.81%, and there is no limit to the number of items you can count, unless you approach 2^64 items (which seems quite unlikely).
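To show the idea behind those numbers, here is a minimal HyperLogLog sketch. This is not Redis's actual implementation; the register count (m=1024, roughly 3.2% standard error) and the SHA-1 hash are arbitrary choices for illustration. Redis's PFADD/PFCOUNT use m=16384 registers, which is where the 12 KB and 0.81% figures come from:

```python
import hashlib
from math import log

class HyperLogLog:
    """Minimal HyperLogLog: estimate cardinality in O(m) memory."""

    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b                     # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        j = h >> (64 - self.b)              # first b bits pick a register
        w = h & ((1 << (64 - self.b)) - 1)  # remaining bits
        # rank = 1-based position of the leftmost 1-bit in w
        rank = (64 - self.b) - w.bit_length() + 1
        self.registers[j] = max(self.registers[j], rank)

    def count(self) -> float:
        e = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:
            e = self.m * log(self.m / zeros)  # small-range correction
        return e
```

Adding the same item twice never changes the registers, which is why duplicates cost nothing.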
First, I think it would be best to clarify with the interviewer how we should compare users to establish uniqueness. Since the fields mentioned are date, user, and name, he would most likely tell you that user is something like a userId that is unique.
Now, since the files are large in number and each has considerable size, a hashtable is not an option, and sorting definitely is not!
A bitmap, or an array of bits, would definitely be a good suggestion. This means we should ask the interviewer about the range of the userId.
If a probabilistic estimate of the count would suffice, there are several cardinality estimation algorithms (admittedly quite complicated) that trade space for accuracy and would be worth suggesting. It's certainly an open-ended question.
A bitmap seems to be the only possibility, though, if the count must be exact.
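A minimal sketch of the bitmap approach, assuming userIds are integers in a known range (the range and the sample ids below are made up). One bit per possible id means a full 32-bit id space fits in 512 MB, no matter how many log records there are:

```python
class Bitmap:
    """Exact distinct count for integer userIds in a known range [0, n)."""

    def __init__(self, n: int):
        self.bits = bytearray((n + 7) // 8)  # one bit per possible id
        self.count = 0

    def add(self, user_id: int):
        byte, mask = user_id >> 3, 1 << (user_id & 7)
        if not self.bits[byte] & mask:       # count each id only once
            self.bits[byte] |= mask
            self.count += 1

bm = Bitmap(1_000_000)
for uid in [42, 7, 42, 99999, 7]:
    bm.add(uid)
# bm.count == 3
```

Unlike the probabilistic estimators, the answer here is exact, at the cost of memory proportional to the id range rather than to the number of distinct users.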
The size of the data set would be quite large... This can be solved by writing a simple Map/Reduce job. (Assuming this is not some data structure problem.)
- shreshtha February 10, 2015