Interview Question
Country: United States
Interview Type: Written Test
Bucket-sort the file by last name (A to Z) every day, then compute a hash for every bucket. Ignore buckets whose hashes match. For every bucket where the hashes don't match, compare the records using binary search.
A matching hash code does NOT necessarily mean the buckets are exactly the same.
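A minimal sketch of the per-bucket hashing step, assuming each bucket has already been sorted into a list of record strings. The record layout and the choice of SHA-256 are illustrative assumptions, not part of the original answer:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

public class BucketHash {
    // Hash the concatenated, sorted records of one bucket.
    static byte[] hashBucket(List<String> sortedRecords) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (String record : sortedRecords) {
            md.update(record.getBytes(StandardCharsets.UTF_8));
            md.update((byte) '\n'); // record separator
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical "last name starts with A" buckets for two days.
        List<String> yesterdayA = Arrays.asList("Adams,John,ja@x.com", "Ames,Sue,sa@x.com");
        List<String> todayA     = Arrays.asList("Adams,John,ja@x.com", "Ames,Sue,sa@x.com");
        // Equal hashes: skip the bucket; unequal: compare its records in detail.
        System.out.println(Arrays.equals(hashBucket(yesterdayA), hashBucket(todayA)));
    }
}
```

Only buckets whose hashes differ need the record-by-record comparison.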
A simple example:
String a = "Aa";
String b = "BB";
System.out.println(a.hashCode());
System.out.println(b.hashCode());
Both calls print the same hash code: 2112.
There are only 2^32 possible hash codes, so hash-code collisions are possible.
@Prithvi: he may intend to use a secure hash of some sort. Collisions would still be possible, but statistically unlikely.
Two people can have the same last name but different first names.
I would treat all three columns combined as unique (or ask the interviewer which columns make a row unique) and use those columns to build the key for a dictionary, where the value is the row id or the row itself.
The dictionary's key is the concatenation of the unique columns.
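A minimal sketch of that composite key, assuming the three columns are first name, last name, and email (column names, separator, and sample data are all hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class CompositeKey {
    // Concatenate the columns assumed to make a row unique.
    // '|' is assumed not to occur in the data itself.
    static String keyOf(String first, String last, String email) {
        return first + "|" + last + "|" + email;
    }

    public static void main(String[] args) {
        Map<String, Integer> oldRows = new HashMap<>(); // key -> row id
        oldRows.put(keyOf("John", "Smith", "js@example.com"), 1);
        oldRows.put(keyOf("Jane", "Doe", "jd@example.com"), 2);

        // A row from today's file that was present yesterday -> unchanged.
        System.out.println(oldRows.containsKey(keyOf("John", "Smith", "js@example.com")));
        // A row not present yesterday -> newly added.
        System.out.println(oldRows.containsKey(keyOf("Ann", "Lee", "al@example.com")));
    }
}
```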
If sorting is not an option, the only way to check for new records is to compare each row in the new file with the rows in the old file.
CollectionOfExcelObject oldSetOfPersons; // holds the objects for all of yesterday's person records
CollectionOfExcelObject newSetOfPersons; // holds the objects for today's person records

for (Person person : newSetOfPersons) {
    if (oldSetOfPersons.has(person)) {
        // existing person; continue the loop
    } else {
        // new person
    }
}

CollectionOfExcelObject can be any collection implementation.
From what I could understand, there is no way to sort the data in the given file. I think the approach below would work, though I am not sure it scales to a very large data set.
1. Either sort the entries in the first file by email id, or store the email ids in a data structure such as a HashMap (if there are no duplicates; the key is the email id and the value is first name + last name) or a trie (where each valid email id entry holds a list of first name + last name).
2. For each entry in the second file, look it up in that data structure. If it is not found, record the entry as newly added. If it is found, remove it from the data structure. At the end of this stage we have all the entries that appear only in the new file.
3. The data structure now contains only the entries that were present in the old file but are absent from the new one. Iterate through it and print them all.
This assumes the email id is unique; if it is not, consider FirstName_LastName_EmailId as the key.
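The three steps above can be sketched as follows, assuming email id is the unique key (the sample data and map layout are illustrative):

```java
import java.util.*;

public class EmailDiff {
    public static void main(String[] args) {
        // Assumed layout: email id -> "First Last".
        Map<String, String> oldByEmail = new HashMap<>();
        oldByEmail.put("a@x.com", "Ann Ames");
        oldByEmail.put("b@x.com", "Bob Berg");

        Map<String, String> newByEmail = new LinkedHashMap<>();
        newByEmail.put("b@x.com", "Bob Berg");
        newByEmail.put("c@x.com", "Cat Cole");

        List<String> added = new ArrayList<>();
        for (Map.Entry<String, String> e : newByEmail.entrySet()) {
            if (oldByEmail.remove(e.getKey()) == null) {
                added.add(e.getKey()); // not in yesterday's file -> newly added
            }
        }
        // Whatever is left in oldByEmail was deleted today.
        System.out.println("added: " + added);
        System.out.println("deleted: " + oldByEmail.keySet());
    }
}
```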
Create a trie from the previous day's file.
For each record in the current day's Excel file, check whether it is present in the previous day's trie.
If it is present, delete it from the trie; if it is not, add it to today's new_added list.
At the end we have the new_added list and the deleted_list (the entries remaining in the trie).
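A minimal trie sketch for this scheme, keyed by email id. It tracks membership only; a real version would also store the names. All identifiers and sample emails are illustrative assumptions:

```java
import java.util.HashMap;
import java.util.Map;

public class Trie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isEntry; // true if a complete email id ends at this node
    }

    private final Node root = new Node();

    void insert(String email) {
        Node cur = root;
        for (char c : email.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.isEntry = true;
    }

    boolean contains(String email) {
        Node cur = root;
        for (char c : email.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return false;
        }
        return cur.isEntry;
    }

    // Un-marking the terminal node is enough for this use case;
    // the entries still marked at the end form the deleted_list.
    void delete(String email) {
        Node cur = root;
        for (char c : email.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return;
        }
        cur.isEntry = false;
    }

    public static void main(String[] args) {
        Trie yesterday = new Trie();
        yesterday.insert("a@x.com");
        yesterday.insert("b@x.com");

        // Today's file contains b@x.com (existing) and c@x.com (new).
        System.out.println(yesterday.contains("b@x.com")); // present -> delete from trie
        yesterday.delete("b@x.com");
        System.out.println(yesterday.contains("c@x.com")); // absent -> goes to new_added
        System.out.println(yesterday.contains("a@x.com")); // remaining -> deleted_list
    }
}
```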
When a file is described as "large", it usually means you can't load the whole thing into memory. If you can fit both files in memory, all you need is a hash map for each day; then compare the two maps carefully.
- Mo Lam May 16, 2013
So what if you can't fit it all into memory? My solution would be a bucket sort (in a different sense).
You would probably use some hard-disk space to accomplish this. Put all the records whose last name starts with A into one file, all those starting with B into another, and so on. Then you have smaller files to deal with, and you only need to compare the A files with each other, the B files with each other, etc.
If the files are still not small enough, break them down further using the second letter. Keep splitting until both files fit into memory with enough room left over for other work.
Now, for each pair of files (yesterday's A and today's A, etc.):
Sort them using a normal O(n log n) method, then compare them line by line. Log the differences to your desired log file (note that the log itself may not fit in memory, so print it to disk too).
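The splitting step can be sketched as below, assuming each line is a record of the hypothetical form "LastName,FirstName,email" (file names and layout are assumptions for illustration):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class ExternalBuckets {
    // Split one big file into per-letter bucket files so each bucket fits in memory.
    // Assumes non-empty lines starting with the last name.
    static void splitByLastName(Path input, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        try (var lines = Files.lines(input)) {
            for (String line : (Iterable<String>) lines::iterator) {
                char first = Character.toUpperCase(line.charAt(0));
                Path bucket = outDir.resolve(first + ".txt");
                Files.writeString(bucket, line + "\n",
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("day", ".csv");
        Files.write(in, List.of("Adams,John,ja@x.com", "Berg,Bob,bb@x.com", "Ames,Sue,sa@x.com"));
        Path out = Files.createTempDirectory("buckets");
        splitByLastName(in, out);
        // Bucket A holds both A-records; compare yesterday's A bucket with today's.
        System.out.println(Files.readAllLines(out.resolve("A.txt")));
    }
}
```

Running the same split on both days' files and then diffing matching buckets keeps peak memory bounded by the largest bucket, not the whole file.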