Amazon Interview Question
Country: United States
Where there any restriction on Memory complexity. It not this looks like a good candidate of Hashing.
public class Anagram{
HashMap<Character> ana;
public Anagram(char[] word){
for(char c: word){
//populate HashMap with char freq.
}
}
public int hashCode(){
// implement a hashcode for Anagram object.
}
}
1. Create a empty HashSet<Anagram> anagrams.
2. Iterate over each word in the file. ( you can read line by line and split on space for simplicity.
3. Create a Anagram object of word.
4. Check if alread in word in anagrams HashSet. if not dnt add else add.
This is just for clarity of explanation. You can just add it to anagrams Hastset.
HashSet will keep only unique entries based on our hashcode() for Anagram obj.
5. At the end of file the size of HashSet anagrams if the number of anagrams in file.
why cant you use simple Set<string> for anagrams. It will ensure no duplication and no time will waste in calculating hashcode.If there are too many words to store anagrams in memory, we can make different sets of anagrams for different length. So if a new word will have only 4 character, it will load only that set.
If all thr charcters from the file are alphabets,
Step 1:- we can assign a prime number to each character like A->1 B->3 and create a Map or that.
Step 2:- Get each word from the file and look each character from the above map and multiply the prime numbers.
eg:- ABC ->1*3*5=15
BAC -> 3*1*5 =15..Now we tell both words are anagram
Like mentioned, sorting each word is the correct way.
But after doing so, the best way to store the data is a "trie" tree, which is usually the most effective DS for kinds of dictionaries.
In this case finding a word (or checking it doesn't exist) would be O(m), where m is the word length, while in the above example of sorted word file that action would take O(logN), where N is the number of words.
Java 8 one liner to group a list of strings according to anagrams
private static Collection<List<String>> getAnagrams(
final List<String> words) {
return words.stream()
.collect(Collectors.groupingBy(s -> {
final char[] sortedArray = s.toLowerCase().toCharArray();
Arrays.sort(sortedArray);
return new String(sortedArray).intern();
})).values();
}
(1) Read file line by line, replace(",", " ").replace("?"," ").replace(" "," "), then split(" ") to get each word, then sorted this word.
(2) Check if this word is in HashSet<word>(); if exists, add this word to Map<word, count>if not exists, add to HashSet<word>
(3) Machine memory normally could contain English words in HashSet, if still restrict, alternatively we can if word in HashSet, remove it , and check if exist in Map,
(4) output Map
If they explicitly say it's a large file, then you probably can't keep all the data in in-memory data structures like hash tables or binary search trees.
- 010010.bin July 17, 2015In any case, grouping together all the anagrams can be achieved by sorting each string while keeping a copy of the original, and then sort all the words based on their sorted version.
So, start by making a first pass over the file. The result of this step is to create a second file of key value pairs, where each pair K, V consists of a sorted word followed by the original word.
For example, if the file had the words "My dog is god", the first pass would create a file with these contents:
My My
dgo dog
is
dgo god
Then, sort the auxiliary file that was just created. Since we're dealing with very large datasets, you might have to use an external sort algorithm. Merge sort is a good candidate here -- probably the interviewer would test your knowledge of how merge sort works externally to sort a big amount of data.
After sorting the file, you have all anagrams close to each other:
dgo dog
dgo god
is is
My My