Interview Question


Country: United States





The data set would be quite large. This can be solved by writing a simple MapReduce job (assuming this is not meant as a data-structure problem).

- shreshtha February 10, 2015

If we have a Hadoop cluster, this can be solved effectively with a four-line Pig script. If not, I would use a hash set.
I will keep refining the code here.
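
A minimal single-machine sketch of the hash-set approach, assuming each log line is delimited text with the user ID in a known column (the file pattern, delimiter, and column index below are placeholder assumptions):

import glob

def count_unique_users(pattern="logs/*.txt", delimiter="\t", user_col=1):
    """Stream every file once and collect user IDs in a set."""
    seen = set()  # works only while the number of *distinct* users fits in memory
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                fields = line.rstrip("\n").split(delimiter)
                if len(fields) > user_col:
                    seen.add(fields[user_col])
    return len(seen)

if __name__ == "__main__":
    print(count_unique_users())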

- adlakha.praveen February 11, 2015

The usual intent of such a question is to check your design approach and your understanding of the problem.
Questions you might want to ask:
1: understand uniqueness {is it the userID?}
2: understand the structure of the files {hierarchical/flat}
3: understand whether the folder is a typical disk folder or resides on a storage platform (this decides whether concurrent reads are possible and whether any disk hot spots will block us)
4: understand the context in which this problem is being solved {do we need results in seconds or in hours?}

Since you are expected to build a case - first start with how you would solve the problem if the file size and the number of files were small,
then increase one factor - say the number of files - how does your approach change?
Then the file size - how do you solve it now?
Consider the physical limitations of disk I/O. If you need to copy files in order to distribute the work, consider network copy times (and copying, of course, means you have to use some kind of hash to verify that network corruption has not taken place). Consider whether your approach is going to be multithreaded. Consider how you would create a scheme so that multiple machines accessing the data do not clash or do redundant work. You also need to think about which operations will be asynchronous and how you will collate the results.

Once you have figured out the big-picture items - feel free to use whatever algorithm you want (or existing patterns like map/reduce; but it should not be a snap answer).

This is a design question - do not jump into code prematurely

- him February 11, 2015

This problem would be trivial if the input were small (so that it fits in a single in-memory hash table).
The main point is that the input is large and doesn't fit in memory.
So you have to devise a way to partition the work, so that you can process each part in turn and finally combine the partial results into a final result (a hash-partitioning sketch follows below).
To know how to do this, you need to be familiar with the various well-known techniques that exist for solving similar problems.
Clearly, from your "answer", you simply don't know these techniques, so you just don't know how to solve the problem.

It's like asking someone who is not familiar with sorting algorithms how to perform sorting: he can only talk around the problem without actually providing a sorting algorithm, because he doesn't know one.
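
A minimal sketch of one such technique, hash-partitioning, assuming the user IDs have already been extracted one per line (file names and the bucket count are placeholder assumptions). A given ID always hashes to the same bucket, so each bucket can be deduplicated independently and the per-bucket counts simply add up:

import glob
import os

NUM_BUCKETS = 256  # chosen so the distinct IDs of any one bucket fit in memory

def partition(input_pattern="user_ids/*.txt", bucket_dir="buckets"):
    """Phase 1: route every user ID to a bucket file based on its hash."""
    os.makedirs(bucket_dir, exist_ok=True)
    buckets = [open(os.path.join(bucket_dir, f"{i}.txt"), "w") for i in range(NUM_BUCKETS)]
    try:
        for path in glob.glob(input_pattern):
            with open(path) as f:
                for line in f:
                    uid = line.strip()
                    if uid:
                        buckets[hash(uid) % NUM_BUCKETS].write(uid + "\n")
    finally:
        for b in buckets:
            b.close()

def count_unique(bucket_dir="buckets"):
    """Phase 2: buckets hold disjoint sets of IDs, so distinct counts add up."""
    total = 0
    for i in range(NUM_BUCKETS):
        with open(os.path.join(bucket_dir, f"{i}.txt")) as f:
            total += len({line.strip() for line in f})
    return total

if __name__ == "__main__":
    partition()
    print(count_unique())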

- gen-y-s February 12, 2015

It is also useful to validate your assumptions, the ones that you are making about the problem or about the intent/knowledge of the interviewer :)

- zaphod February 13, 2015

Anything you write in this interview would not beat the UNIX commands posted, except when the result set is too big for those very efficient tools to handle in memory, in which case you have to persist some of the data to disk.

A 2 GB file is perfectly fine: parse out the names and fit them into memory, letting the OS page to disk when needed. But when a file is, say, 10 terabytes, that's no longer an option.

The most robust solution can deal with, say, 1 million files of 10 terabytes each - with a map-reduce: pre-split the files into smaller chunks, say 1 gigabyte each, then run map tasks that parse each chunk and emit the names, and a reduce phase that sorts the results and removes duplicates.
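
A toy, single-machine illustration of that map/reduce shape, assuming each chunk's distinct names fit in memory on their own (the chunk directory, delimiter, and column index are placeholder assumptions; a real cluster would use Hadoop or Spark rather than multiprocessing):

from multiprocessing import Pool
import glob

DELIM, USER_COL = "\t", 1  # placeholder field layout

def map_chunk(path):
    """Map phase: parse one chunk and emit its distinct user names."""
    names = set()
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split(DELIM)
            if len(fields) > USER_COL:
                names.add(fields[USER_COL])
    return names

def reduce_counts(partial_sets):
    """Reduce phase: union the per-chunk sets and count the result."""
    all_names = set()
    for s in partial_sets:
        all_names |= s
    return len(all_names)

if __name__ == "__main__":
    chunks = glob.glob("chunks/*.txt")
    with Pool() as pool:
        print(reduce_counts(pool.imap_unordered(map_chunk, chunks)))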

- C.C. February 12, 2015

"Map-reduce" my eye! Use external merge sort -- no need for much memory. Two options - either put output to a new merged file and then make one pass over it, or do multiway merge of all created chunks, which will require a priority queue.

- Coding Owl February 21, 2015

HyperLogLog?
Maybe the Redis implementation? It lets you count unique users in memory.

The Redis implementation uses only 12 KB per key to count with a standard error of 0.81%, and there is no limit to the number of items you can count, unless you approach 2^64 items (which seems quite unlikely).
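
A minimal sketch using the Redis HyperLogLog commands via redis-py, assuming the user IDs have already been extracted one per line and a Redis server is running locally (the file pattern, key name, and batch size are placeholder assumptions):

import glob
import redis

r = redis.Redis(host="localhost", port=6379)
KEY = "unique_users"  # placeholder HyperLogLog key

r.delete(KEY)
for path in glob.glob("user_ids/*.txt"):
    with open(path) as f:
        # PFADD accepts many elements at once; batch them to limit round trips.
        batch = []
        for line in f:
            uid = line.strip()
            if uid:
                batch.append(uid)
            if len(batch) >= 10_000:
                r.pfadd(KEY, *batch)
                batch = []
        if batch:
            r.pfadd(KEY, *batch)

print("approximate unique users:", r.pfcount(KEY))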

- Gleb March 27, 2015

# Lists the distinct users across all files; field 2 is assumed to hold the user ID.
# Replace 'delimiter' with the actual single-character field separator.
for i in *.filext
do
    cut -f2 -d'delimiter' "$i" | sort -u
done | sort -u

- Anonymous February 11, 2015

First, I think it would be best to clarify with the interviewer how we should compare users to establish uniqueness. Since the fields you mentioned are date, user, and name, he would most likely tell you that 'user' is something like a userId that is unique.

Now, since the files are large in number and each has considerable size, a hashtable is not an option, and sorting definitely is not!

A bitmap, i.e. an array of bits, would definitely be a good suggestion. This means we should ask the interviewer about the range of the userId.

If a probabilistic estimate of the count would suffice, there are several cardinality-estimation algorithms (extremely complicated) out there which give a space/accuracy trade-off; those would be worth suggesting. It's certainly an open-ended question.

Bitmaps seem to be the only possibility, though, if the count must be exact.
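
A minimal sketch of the bitmap idea, assuming userIds are non-negative integers with a known upper bound (the bound, file pattern, delimiter, and column index are placeholder assumptions):

import glob

MAX_USER_ID = 1_000_000_000   # assumed upper bound; 1e9 bits is about 125 MB
DELIM, USER_COL = "\t", 1     # placeholder field layout

def count_unique_users(pattern="logs/*.txt"):
    bits = bytearray((MAX_USER_ID + 7) // 8)  # one bit per possible userId
    count = 0
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                fields = line.rstrip("\n").split(DELIM)
                if len(fields) <= USER_COL:
                    continue
                uid = int(fields[USER_COL])
                byte, mask = uid >> 3, 1 << (uid & 7)
                if not bits[byte] & mask:   # first time we see this userId
                    bits[byte] |= mask
                    count += 1
    return count

if __name__ == "__main__":
    print(count_unique_users())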

- teli.vaibhav February 11, 2015

