Google Interview Question
Software Engineer in Test

The second approach sounds good. But can't we add the unique prime number generated for each letter, rather than multiplying? Addition will never produce the same key value for different words, as we are given that the size of each word is n. It can also handle strings of much larger length than multiplication can. So the complexity will be O(n), since k is some constant and not a function of n. Correct me if I am wrong, guys.
Nope, addition of primes won't work. Try a simple example:
a -> 2
b -> 3
c -> 5
d -> 7
e -> 11
f -> 13
What do "bf" and "ce" compute to under addition?
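The collision is easy to verify; a minimal sketch in Python using the prime assignment above:

```python
# With a -> 2, b -> 3, c -> 5, d -> 7, e -> 11, f -> 13,
# addition assigns the same key to words that are not anagrams.
primes = {'a': 2, 'b': 3, 'c': 5, 'd': 7, 'e': 11, 'f': 13}

def additive_key(word):
    """Sum of the primes assigned to each letter -- NOT collision-free."""
    return sum(primes[ch] for ch in word)

print(additive_key("bf"))  # 3 + 13 = 16
print(additive_key("ce"))  # 5 + 11 = 16 -- same key, but not anagrams
```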
Multiplication works because of Unique Factorization. But I don't like it because it does more work than needed, and for long enough strings you might require big-number operations.

If you think about it, you are really just using primes as placeholders; the exponents of those primes actually give you the count for each letter encountered. So why not simplify things and just keep an array of letter counts? Then you can push each array into a set and count the set at the end.

Or if you are more comfortable dealing with big integers, just think of the array as a big integer, with each count being multiplied by a power of 2 depending on the bit width of a count variable. Heck, you can even do the original prime thing (but in a faster way) by raising each prime to the associated count and multiplying the results together. That way, you'll be doing mostly additions.
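The letter-count idea can be sketched directly; a minimal version, assuming lowercase a-z input:

```python
from collections import defaultdict

def count_signature(word):
    """26-entry letter-count tuple -- anagrams share exactly this signature."""
    counts = [0] * 26
    for ch in word:
        counts[ord(ch) - ord('a')] += 1
    return tuple(counts)  # tuples are hashable, so they can key a dict or set

def group_anagrams(words):
    """Bucket words by their count signature."""
    groups = defaultdict(list)
    for w in words:
        groups[count_signature(w)].append(w)
    return list(groups.values())

print(group_anagrams(["never", "nerve", "venre", "abc"]))
# [['never', 'nerve', 'venre'], ['abc']]
```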
Yaa.. I get you. Taking the prime-number multiplication approach into account: to avoid a product so large that it can't be handled, we could probably add a preprocessing step where the prime number associated with each letter is based on that letter's frequency of occurrence in the English language, rather than assigning them from a-z sequentially. The most frequently occurring letter gets the minimum prime, "2", and so forth. In this way we can contain the product-value explosion to a certain level and also make sure to handle longer strings than in the previous case. The complexity will be maintained at O(n).
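A sketch of that preprocessing step; the frequency ordering below is the classic "etaoin shrdlu" approximation, which is an assumption on my part and not part of the original suggestion:

```python
# Assign smaller primes to more frequent English letters so that
# common words get smaller product keys.
FREQ_ORDER = "etaoinshrdlcumwfgypbvkjxqz"  # assumed frequency ranking
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
          47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
PRIME_FOR = dict(zip(FREQ_ORDER, PRIMES))

def product_key(word):
    """Multiply each letter's prime; anagrams get identical keys."""
    key = 1
    for ch in word:
        key *= PRIME_FOR[ch]
    return key

# "eat" and "tea" share the same small key: 2 * 5 * 3 = 30
print(product_key("eat"), product_key("tea"))
```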
It looks like map-reduce code:

// Map
for each set
    emit <anagram, {set_of_words}>
end

// Reduce
take the global map GMap
for each <anagram, {set_of_words}>
    if exists GMap[anagram]
        merge GMap[anagram] with {set_of_words}
    else
        insert <anagram, {set_of_words}> into GMap
end

GMap will hold the answer.
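Map-reduce aside, the same merge-or-insert step can be sketched as ordinary code; here the sorted word stands in as the anagram signature, which is my choice -- any valid signature works:

```python
from collections import defaultdict

def build_gmap(words):
    """Group words under a shared anagram signature (sorted letters)."""
    gmap = defaultdict(set)
    for w in words:
        signature = "".join(sorted(w))
        gmap[signature].add(w)  # merge-or-insert in one step
    return dict(gmap)

print(build_gmap(["listen", "silent", "enlist", "google"]))
```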
Just because this is a Google interview question doesn't mean you have to pander to the interviewer with a meaningless map-reduce blurb. In fact, the above code will probably get you nowhere, and the interviewer will make you write the proper low-level code, meaning that you will have wasted precious time from your 45 minutes. When you get into Google, then you can talk all you want about map-reduce. But in the meantime, stick to the basics and answer the intent of the question, which is essentially: what signature to use for identifying anagrams.
I think the prime-number multiplication approach is just a myth, as one can't treat multiplication as an O(1) op. It's O(m^1.8), where m = log(n) and n is the largest product that needs to be computed.
To use prime numbers, the upper bound on the number will be 26 * log 26 ~ 32 bits (5 bits in the best case), since there are 26 primes and you need to go up to 26 * log 26.
You can take any operation you want as the basic unit for evaluating complexity, as long as it makes sense for the situation you are trying to shed light on. These days people don't usually distinguish a mult-op from the other arithmetic operations performed by the CPU; there just isn't a whole lot of difference. In the past it made sense to do so (e.g., on Intel 286s), when division and multiplication took up many more cycles. But what still matters is distinguishing between big-number multiplication and big-number addition: the former is way more expensive, so you'd want to count those two separately.
And just a reminder that 64-bit CPUs are common today, so your 32-bit estimate is a bit conservative.
Generation of key:

Given that the size of the alphabet is 26, take an int and use only 26 of its bits.

For example, if the word is "aba", set bits as follows:
'a' -> set the first bit of the key
'b' -> set the second bit of the key
...

It would generate a unique key, and in the case of anagrams the keys would collide.
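A minimal sketch of that bitmask key; note that as written it only records which letters appear, not how many times, so non-anagrams with the same letter set (e.g. "aab" and "abb") also collide -- the key is only unique when all letters in a word are distinct:

```python
def bitmask_key(word):
    """OR one bit per distinct letter into a 26-bit integer key."""
    key = 0
    for ch in word:
        key |= 1 << (ord(ch) - ord('a'))
    return key

print(bin(bitmask_key("aba")))  # 0b11: bits for 'a' and 'b' are set
print(bitmask_key("aab") == bitmask_key("abb"))  # True: a false collision
```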
How about sorting it twice?

Initially sort all the words; in k * n lg n time you have sorted everything. Then sort the resulting list, so the words which are anagrams automatically come together.

For the second part, sorting just requires k pointers which can be moved forward. For each set of pointers that match, we then recursively minimize the set and keep on partitioning.

Any comments, flaws?
First of all, the question is still not clear to me, I am guessing that it's this:
Given a set of k words, each of size n, find sets of anagrams in that set.
The algorithm for this seems straightforward:
(1) Rearrange the characters in each word alphabetically; e.g. "never" --> "eenrv". Running time O(k * n * lg n) which for small n is really just O(k). You should keep the original words so that later you can reproduce them.
(2) Sort the rearranged words -- using radix sort, this can be done in O(n*k) = O(k) for small n.
(3) Find sets of identical (rearranged) words. O(k * n) = O(k).
(4) Output the original words for each identical set O(k * n) = O(k).
Total running time is O(k) (assuming that n <= c for some small constant c).
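The four steps above can be sketched end to end; a minimal version, where Python's built-in sort stands in for the radix-sort step of (2):

```python
def anagram_sets(words):
    """Steps (1)-(4): signature each word, sort, scan for equal runs."""
    # (1) Rearrange characters alphabetically, keeping the original word.
    tagged = [("".join(sorted(w)), w) for w in words]
    # (2) Sort by the rearranged form (sorted() stands in for radix sort).
    tagged.sort()
    # (3) + (4) Scan for runs of identical signatures, output the originals.
    sets, i = [], 0
    while i < len(tagged):
        j = i
        while j < len(tagged) and tagged[j][0] == tagged[i][0]:
            j += 1
        if j - i > 1:  # only report actual anagram groups
            sets.append([orig for _, orig in tagged[i:j]])
        i = j
    return sets

print(anagram_sets(["never", "nerve", "stop", "pots", "tops", "cat"]))
```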
There are two solutions that I can think of:

i) O(n log n) - Sort each word alphabetically and store the words in a hash table, with the key being the sorted word and the value being the frequency.

ii) O(n) - Tag each letter with one of the first 26 primes. For each word, take the product of its letters' corresponding primes. Store the final product and the frequency as the key and value, respectively, in a hash table.

- Anon September 07, 2010
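Solution (ii) can be sketched quickly; assigning the first 26 primes in a-z order is my assumption here:

```python
from collections import Counter

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
          47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def prime_product(word):
    """Product of each letter's prime; equal exactly for anagrams."""
    product = 1
    for ch in word:
        product *= PRIMES[ord(ch) - ord('a')]
    return product

def anagram_frequencies(words):
    """Hash table: product key -> how many words fall in that anagram class."""
    return Counter(prime_product(w) for w in words)

freqs = anagram_frequencies(["tab", "bat", "cat"])
print(freqs[prime_product("bat")])  # 2: "tab" and "bat" are anagrams
```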