Microsoft Interview Question
Software Engineer in Tests". create a hashtable where the hashcode is generated using the characters in the string..I got the hash of each string by doing an Exclusive OR of each charcater in the string although i had the dbt if the Ex-OR of non anagrams can be same"
Imagine that the alphabet only has two letters: 1 and 0,
Char representation: a bit for each character
Two words given:
10
100
Hash value:
1 XOR 0 = 1
1 XOR 0 XOR 0 = 1
The two words has the same hash value, but are not anagrams.
Num. input words: N
Maximum word length: M
OP's soultion:
1. Sort each word: O(M*logM*N).
2. Compare with each anagram group: O(M*N^2)
Total complexity, assuming that logM < N: O(M*N^2)
ibnipun10's solution:
1. Count occurance of each character: O(M*N)
2. Compare with each anagram group: O(N^2)
Total complexity, assuming that M < N: O(N^2)
However, if we assume that word lengths will be "reasonable" (say, always less than C characters)? If so, the complexity of the first solution becomes O(N^2). So I don't think it is clear which solution is the better. Thoughts, anyone?
here u can use counting sort which run in O(n + k) where k = number of key which are 256 i.e. no. of ASCII values
string CountingSort(string str){
int a[256];
int size = 255;
for(int i = 0; i < 256; i++){
a[i] = 0;
}
//count number of characters in string
for(string::iterator it = str.begin() ; it != str.end(); it++)
a[*it] += 1;
int len = str.length() - 1;
//rebuild string from back to keep sort stable
while(size--){
if(a[size]){
while(a[size]--){
str[len--] = char(size);
}
}
}
return str;
}
Problem apears in the book Programming Pearls. The solution there is pretty straight-forward:
1. "sign" the words, i.e. for each word the signature is itself sorted
2. order by signature
3. "squash" the structure to produce the anagrams.
The author suggests a pipeline like solution, so you could potentially paralelize this.
it sound good. we can maintain a hashtable, with the key as the word "sign" and value as the the anagrams.
Then for each string S
Find the sign of this string S_Sign
Insert it into the hashtable with (S_Sign, S)
Then we only need to iterator each hashtable pair
For each key
Print all the entries with same key
check my solution:
give to each letter in alphabet a unique prime number('a'->2,'b'->3,'c'->5,...)
then give to each word from input its code=product of prime numbers of this word's letters. then if 2 words are anagrams, their codes are equal, if not anagrams, the codes must be different too.
then sort by codes and output. O(N*logN).
am i right?
if we allow to use hash, we could reach time complexity of O(NMlgM), where N is the number of words and M is the number of characters in the longest word. Following is the abstract step.
First, define the signature of a word to be the word with it's characters being sorted, e.g. the signature of Microsoft is cfiMorst. Anagrams should have the same signature
1. Create a hash which the key is the "signature" of a word, the value is a list of words having the same signature. (We have to becareful of the case here, we can convert the word to lower/upper cases before generating the hash key)
2. Go through each word in the file, generate the signature and add to the hash table, if the key already exist, add to the list
3. Go through the hash table, print the words from the list of each hash key ( which should be grouped by anagrams)
The time complexity is bounded by O(NMlgM), memory complexity would be O(N)
Rather than sorting each string , we can compute the hashcode as , each character in string occupies 8 bits (or in some case 16 bits) , so we consider the string as a number in base 2^8 , and compute the value of string , say for e.g. for vinay ,the value is calculated as :- ('v'-'a')*2^(8*4) + ('i'-'a')*2^(8*4) + ('n'-'a')*2^(8*4) + ('a'-'a')*2^(8*1) + ('y'-'a')*2^(8*0) ;
and let this sum is represented by k
then k mod (2^8 - 1) gives unique code for all anagrams
Rather than sorting each string to calculate its signature , we can compute the hashcode as , each character in string occupies 8 bits (or in some case 16 bits) , so we consider the string as a number in base 2^8 , and compute the value of string , say for e.g. for vinay ,the value is calculated as :- ('v'-'a')*2^(8*4) + ('i'-'a')*2^(8*4) + ('n'-'a')*2^(8*4) + ('a'-'a')*2^(8*1) + ('y'-'a')*2^(8*0) ;
and let this sum is represented by k
then k mod (2^8 - 1) gives unique code for all anagrams
Rather than sorting each string to calculate its signature , we can compute the hashcode as , each character in string occupies 8 bits (or in some case 16 bits) , so we consider the string as a number in base 2^8 , and compute the value of string , say for e.g. for vinay ,the value is calculated as :- ('v'-'a')*2^(8*4) + ('i'-'a')*2^(8*4) + ('n'-'a')*2^(8*4) + ('a'-'a')*2^(8*1) + ('y'-'a')*2^(8*0) ;
and let this sum is represented by k
then k mod (2^8 - 1) gives unique code for all anagrams
read more about using trees to capture strings..this is a variety of the suffix tree..
you put each string in a tree, but before hand we'll sort each string, meaning, if you wanna insert bda, what you'll insert will be abd, o/c we'll cramp everything in.
this way, when you go down the tree, if you find an anagram it'll mean you're down the tree and found a \n or a null of some sort.
should be O(n*m) as each step means you have to reorder the current part of the string and go down the tree, thing is going down the tree..if we use count-sort then we don't even have to re-sort each time, just keep an array of size 26, each time incrementing the right cell.
going down the tree again and again makes it O(n*m), n being length of file, m being the max length of the largest anagram
Anagram checking can be done in O(n). From the question I guess the anagrams in the same input file are supposed to be printed.
This be done by bucketing all strings per their length. i.e. for length 5, the linked list could contain vinay, naviy, inavy, there, using , ...
We'll compute the hash of each strings such that the hash function will be of O(n) where n is the length of the string. The has function can be as simple as sorting the characters. Yes we can sort the strings in O(n) by using a bitmap counter.
For ascii characterset we can take an array of 128 int and initialize to zero and increment the array[i]++ where i is the character encountered. once the pass is done, we can reconstruct the sorted string by passing the array again "there" becomes "eehrt". this can be compared to the other string's sorted values.
In Worst case however if the text file contains all words of the same length, then we'd need an extra space again of the same size of the file to store the sorted strings and then do a string compare using KMP algo O(n) again.
My bit: position insensitive on the fly hash code computation. This can be done by associating first 26 prime numbers(>2) to each char. Keep on multiplying the associated prime numbers from start position till delimitor. Using hashmap lookup for this hashcode value. If null is returned create a list(each element holding start and end index of same hashcode) and add the element(start position and end position) corresponding to current hashcode. If not null return value add the element to the returned list.
This will take time-o(n) and space-0(k)
Next step, iterate over the hashmap and print the elements of the list(value of key,value pair) together.
This will take time-o(k)
Overall: time-o(n) and space o(k).
Please correct me if i am wrong.
My 2 cents
- Divide the problem into sub problems by using multiple hashing
a) By Length
b) By XOR or any hash function which is irrespective of the order of it.
- From each of minimal set, Prepare a list to display strings group by anagrams
For instance , S1, S2, S3, S4 ,S5 , S6 is a linear list containing three strings which would be appeared as
S1->S2->S6
|
S3
|
S4 ->S5
If S7 get hashed into this list, Compare the string and add it to the above list accordingly.
- Checking whether two words are anagrams or not.
a) Prepare a hash table of 26 character
b) Fill the hash table for both string with one +1 and other as -1
c) Sum of hash table as non-zero implies - Not an anagram else anagram
I think maybe part of the point is to come up with a good hashing scheme. If 256 characters or more are being used, then the hashing scheme should allow for "sparse" computation so there isn't an automatic constant like 256. Assigning a large (prime?) weight to each character and then letting a word's hash key be the sparse linear combination of characters seems reasonable.
This question reduce to check for 2 strings whether they are anagrams of not
- DashDash September 04, 2010For Anagrams we can count the occurrence of each letter in the first string and check this with the other