Bazaarvoice Interview Question
Software Engineer / Developers
I think this is the right answer. Many of the BV questions seem to be incompletely specified, like they are checking whether you will ask questions to further define the end goal, instead of just going to town with the partial requirement spec.
I haven't done a complete analysis of this exact problem, but I have for a very similar one, and the best data structure for string-based sorting/searching was a 'trie' (retrieval tree). For de-duping, which isn't the usual use for tries, the insert function would be the variant that only inserts, resolving partial collisions down to the minimum depth needed for uniqueness. This ends up O(n), with a worst-case multiplier of the length of the longest string/email, and an average-case multiplier of the average number of comparisons per insert (and average depth per retrieval).
A hash table has the least complexity.
Compute the hash of each email address. If two addresses produce the same hash, compare the addresses themselves: if they are equal, keep only one of them; otherwise include both.
Complexity: O(n).
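A minimal sketch of this approach in Python. The built-in set is a hash table, so the "duplicate hash, then compare the actual emails" step happens internally on every membership test:

```python
def dedupe_emails(emails):
    # A Python set is a hash table: membership tests are O(1) on average,
    # and entries with equal hashes are compared for true equality
    # internally, which is exactly the hash-then-compare step above.
    seen = set()
    unique = []
    for email in emails:
        if email not in seen:
            seen.add(email)
            unique.append(email)   # keep first occurrence, in input order
    return unique

print(dedupe_emails(["a@x.com", "b@x.com", "a@x.com"]))
# prints ['a@x.com', 'b@x.com']
```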
I think the complexity is not O(n) but O(n log(n)).
You have to iterate over the list of email addresses (O(n)) and check each address against some data structure. With a tree-based set each lookup is O(log n), so the overall complexity would be O(n log n). (A hash set gives O(1) average-case lookups, which is where the O(n) claim comes from.)
(Reposted this, as the first one went to the wrong comment)
Additionally, the trie requires much less processing. The question involves strings, which can be operation-intensive to compare and hash. The trie only does comparisons as far into each string as necessary to distinguish it, whereas hashing must touch every character of every string. In effect, the trie behaves like a self-optimizing hash, using only as many string elements as needed to minimize processing, and it keeps the collision chain sorted for free.
It should be obvious that the dupe count for a node can be incremented on insert and then decremented on retrieval, so that the first traversal yields the superset of unique emails, and each subsequent traversal generates a batch from the emails with successively higher dupe counts.
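A rough sketch of the insert/retrieve scheme described above, in Python. This is a plain character-by-character trie, not the compressed "minimum depth for uniqueness" variant the comment describes, and the traversal order within a batch is arbitrary:

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.count = 0      # how many times this exact email was inserted

def insert(root, email):
    """Walk/extend the trie one character at a time; bump the dupe count."""
    node = root
    for ch in email:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def next_batch(root):
    """Collect one copy of every address whose count is still positive,
    decrementing as we go. The first call returns the unique superset;
    later calls return only the addresses that still have duplicates left."""
    batch = []
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.count > 0:
            node.count -= 1
            batch.append(prefix)
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
    return batch
```

Calling `next_batch` repeatedly until it returns an empty list yields one duplicate-free batch per traversal, matching the increment-on-insert / decrement-on-retrieval idea.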
Whether this is the answer they were looking for can't really be known, since there isn't enough info about the problem to be solved.
But wouldn't this defeat the purpose of the question, where you are asked to send out the emails using the minimum number of batches? The question has no restriction on the number of emails per batch, so if you remove the duplicate entries you can send all the emails in one batch. So I think the question is meant to be interpreted differently.
It's kind of like bin packing. There will be multiple hash tables, each corresponding to a bin. Scan the email addresses one by one and look each one up in the bins in order: if it already occurs in the first bin, check the second, and so on. If the address is present in every existing bin (say there are only 2), create another bin (hash table) and put it there.
What you are essentially describing, in a more complicated way, is already part of a hash table: the collision chain. After one complete scan of the emails, the table just needs to emit as many batches as the length of the longest collision chain (the highest duplicate count). There are many markup strategies for identifying the hash members along each collision chain.
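The batching idea in both comments above can be sketched without explicit per-bin hash tables: send the k-th occurrence of each address to batch k, so no batch ever contains a duplicate and the batch count equals the highest multiplicity. A minimal Python sketch:

```python
from collections import Counter, defaultdict

def make_batches(emails):
    """Assign the k-th occurrence of each address to batch k.
    No batch contains a duplicate, and the number of batches equals
    the highest multiplicity (the longest 'collision chain')."""
    seen = Counter()              # occurrences of each address so far
    batches = defaultdict(list)   # batch index -> list of addresses
    for email in emails:
        batches[seen[email]].append(email)
        seen[email] += 1
    return [batches[i] for i in range(len(batches))]

print(make_batches(["a@x.com", "b@x.com", "a@x.com", "c@x.com", "a@x.com"]))
# prints [['a@x.com', 'b@x.com', 'c@x.com'], ['a@x.com'], ['a@x.com']]
```

This is one pass over the input, O(n) with average-case hash lookups, and the first batch is exactly the de-duplicated set.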
If the list of email addresses is not sorted, then put the addresses into a balanced binary search tree (this removes the duplicates from the list). Now serialize the tree and send it to the server; on the server end it can be de-serialized to recover the original tree with no duplicates.
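A rough sketch of the BST idea in Python. For brevity this uses an unbalanced BST (a self-balancing variant such as a red-black or AVL tree would be needed to guarantee O(log n) inserts), and it serializes the in-order traversal, a sorted, duplicate-free list, rather than the tree shape, since the server can rebuild a balanced tree from sorted input:

```python
import json

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Unbalanced BST insert that silently drops duplicate keys."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    # key == root.key: duplicate, do nothing
    return root

def serialize(root, out):
    """In-order traversal -> sorted, duplicate-free list for the wire."""
    if root is not None:
        serialize(root.left, out)
        out.append(root.key)
        serialize(root.right, out)
    return out

root = None
for e in ["b@x.com", "a@x.com", "b@x.com", "c@x.com"]:
    root = insert(root, e)
print(json.dumps(serialize(root, [])))
# prints ["a@x.com", "b@x.com", "c@x.com"]
```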
- Addy September 30, 2009