Google Interview Question
Software DevelopersCountry: United States
Interview Type: Phone Interview
the problem is finding at least one anagram of a string as a sub-string in an other string.
an anagram can be any permutations of a string S. Obviously we could search any
permutations in the next string N. This is the brute force approach and maybe codeable
in 15 minutes (probably with errors) if we accept creating repeating permutations.
Then it's O(|S|!*|L|*|S|) if |S|>=|L|
(|X| denotes the length of the string X, a larger string is never an anagram of a shorter)
Improved and easier to code is by sorting characters of |S| string and extracting
all sub-strings of the size of |S| from |L|, sorting the character of this sub-strings
and compare the sub-string against the sorted S. This is better to code and faster,
but still not optimal.
Time complexity: O(|S|*lg(|S|)+(|L|-|S|)*|S|*lg(|S|)+|S|) = O((|L|-|S|)*|S|*lg(|S|))
Best is:
1) Anagram is the multi-set S of a string P.
note: A set or multi-set ignores the order.
2) Sub-string B is a sequence or interval of a string L with length |P|
for all such sub-string B in L, if multi-set(B) = multi-set(B) then at least one sub-string
of L is a anagram of P
How to code it:
1) build a frequency table F for P (how many times does each character occure)
2) initialize a sub-string B = empty string
3) for each character c in L
append to sub-string, decrement occurrence of c in F
if sub-string is now larger than P, remove first character from sub-string and increment
occurrence of c in F
check if all frequencies are 0 in F, if so, it's a sub-string
Time complexity for a single pair of strings would be O(m+(n-m)) = O(n) if n >= m, that's
better. In the given problem we do this at most |A|-1 times where A is the given array of
strings.
#include <string>
#include <vector>
#include <unordered_map>
#include <algorithm>
#include <iostream>
using namespace std;
// solution with sorting O((|b|-|a|)*|a|*lg(|a|))
bool isAnagramSubstring_sort(const string& a, const string& b)
{
if (a.size() > b.size()) return false;
vector<char> av(a.begin(), a.end());
sort(av.begin(), av.end());
auto bb = b.begin();
auto be = next(b.begin(), a.size());
while(true) {
vector<char> sbv(bb, be);
sort(sbv.begin(), sbv.end());
if (equal(av.begin(), av.end(), sbv.begin())) return true;
if (be == b.end()) break;
++bb;
++be;
}
return false;
}
// O(|b|) solution
bool isAnagramSubstring(const string& a, const string& b)
{
if (a.size() > b.size()) return false;
unordered_map<char, int> am;
for_each(a.begin(), a.end(), [&am](char v) { am[v]++; });
int mismatch_ct = a.size();
auto be = b.begin();
auto bb = b.begin();
while (be != b.end()) {
int c = am[*be]; // *be: element entering the window
am[*be] = c - 1;
mismatch_ct += abs(c - 1) - abs(c);
if (be - bb >= a.size()) { // *bb: is an element leaving?
c = am[*bb];
am[*bb] = c + 1;
mismatch_ct += abs(c + 1) - abs(c);
++bb;
}
if (mismatch_ct == 0) return true;
++be;
}
return false;
}
// driver
int countNoNextAnagramOfPrev(const vector<string>& arr)
{
int count = 0;
for (size_t i = 1; i < arr.size(); ++i) {
if (isAnagramSubstring(arr[i - 1], arr[i])) count++; // or isAnagramSubstring_sort
}
return count;
}
int main()
{
cout << countNoNextAnagramOfPrev({ "ab", "ab", "abc", "bca" }) << endl;
cout << countNoNextAnagramOfPrev({ "ab", "aqb" }) << endl;
}
We can solve this by something similar to Rabin-Karp algorithm. Since size of the alphabet is constant, worst case time complexity is O(N), where N is total number of characters in the input strings.
#include <vector>
#include <iostream>
#include <string>
using namespace std;
class Hash {
public:
Hash()
{
char_counts_.resize(256, 0);
}
void AddChar(char c)
{
++char_counts_[c];
}
void RemoveChar(char c)
{
--char_counts_[c];
}
bool operator==(Hash const &other)
{
for (int i = 0; i < char_counts_.size(); ++i) {
if (other.char_counts_[i] != char_counts_[i]) {
return false;
}
}
return true;
}
private:
vector<int> char_counts_;
};
bool ContainsAnagram(string const s, string const sub_s)
{
if (!sub_s.empty() &&
sub_s.size() <= s.size())
{
Hash sub_s_hash;
for (char c : sub_s) {
sub_s_hash.AddChar(c);
}
Hash s_window_hash;
for (int i = 0; i < s.size(); ++i) {
s_window_hash.AddChar(s[i]);
if (i >= sub_s.size()) {
s_window_hash.RemoveChar(s[i - sub_s.size()]);
}
if (s_window_hash == sub_s_hash) {
return true;
}
}
}
return false;
}
int AnagramCasesCount(vector<string> const &strings)
{
int count = 0;
for (int i = 0; i + 1 < strings.size(); ++i) {
if (ContainsAnagram(strings[i + 1], strings[i])) {
++count;
}
}
return count;
}
int main()
{
cout << AnagramCasesCount({"ab", "ab", "abc", "bca"}) << "\n";
cout << AnagramCasesCount({"ab","aqb"}) << "\n";
}
@SomeOneOnTheInternet:
"ab" (at i=0) has two anagrams "ab" and "ba" of which, "ab" is a substring of "ab"(at i=1). Thus, `counter` is `1`. The string "ab"(at i=1) which is an anagram of itself is the substring of string "abc" (at i=2). Thus `counter` is now 2. One of the anagrams of "abc"(at i=2) is "bca" which is a substring of "bca" at (i=3). This makes the counter `3`, which is now our answer!
@ lkjhgfdsa. You are welcome. To decide if any anagram of string sub_s is present in string s, we create a hash of sub_s, and then, we compare the hash with the hash of the sliding window (of size sub_s.size()) in string s. If the hashes are equal, then we've found a window in s which has the same characters (and the same amount of those characters) as in sub_s. The hash is basically an array, containing character counts (frequencies).
@Alex, Hi, sorry to pinch you, but I consider this page as a learning platform ... anyways your solution is not O(n), because within your "n loop" you compare the "hash" which is a loop over the number of distinct character, so it becomes O(n*f) where f is the number of distinct characters.
Rabin-Karp maintains a hash (some integer, comparable in O(1)) which it does using modulo arithmetic ... you could do the same within your hash class so it would become close to O(n) expected average case, but Rabin-Karp has the problem, that if the hash matches it will have to deep compare, so one needs to distinguish between worst and average case behavior.
For pattern matchin R-K is bad, when the patterns occurs often in the search string (e.g. "a" in "aaaaaaaaaaa"). which is not the case here. So Rabin-Karp is a cool idea, but it will not work exactly, because it takes the order into account. So you will need to come up with a "flying" hash that will ignore order (e.g. by applying a prime to each character ...). It's a cool idea!
Since finding all anagrams is costly as it is equivalent to finding all permutations, why don't we instead go through all the substrings in arr[i+1] of length arr[i]?
let isAnagram(s1, s2) return True if the two strings are anagrams of each other, by creating a dictionary of counts for each of their letters and checking equality.
let counter = 0
let x = len(arr[i]) (our current string)
there are O(n) substrings in arr[i+1]
We can go through each substring, and call our isAnagram(s1, s2) function. If true, increment our counter.
Time complexity:
O(n^2m). We are going through O(n) substrings for each string, which is O(nm) where m=len(arr). Then for each substring, we are checking if it is an anagram, which is O(n).
This is a costly solution, but probably a quick one given the time constraint?
@Alex I took your code and improved it a bit. I added a "m_totalChars" counter to Hash class. There is no need to "operator==" - you simply check if "m_totalChars" is equal to 0
So the logic for "ContainsAnagram" is now like this:
* Construct a Hash class for the substring
* Loop over the second string, for each char call sub_s_hash.RemoveChar (which reduces the total counter by 1)
* We return sub_s_hash.isEmpty() if its true this means its an anagram
#include <vector>
#include <iostream>
#include <string>
#include <algorithm>
class Hash
{
private:
std::vector<int> m_charCounts;
size_t m_totalChars;
private:
void AddChar(char c)
{
++m_charCounts[c];
++m_totalChars;
}
public:
Hash(const std::string& str)
: m_totalChars(0)
{
m_charCounts.resize(256, 0);
std::for_each(str.begin(), str.end(), [&](char ch) { AddChar(ch); });
}
void RemoveChar(char c)
{
if(m_totalChars == 0) return;
if(m_charCounts[c]) {
--m_charCounts[c];
--m_totalChars;
}
}
bool isEmpty() const { return m_totalChars == 0; }
};
bool ContainsAnagram(const std::string& s, const std::string& sub_s)
{
if(sub_s.empty()) return false;
if(sub_s.size() > s.size()) return false;
Hash h(sub_s);
for(size_t i = 0; i < s.size(); ++i) {
h.RemoveChar(s[i]);
if(h.isEmpty()) break;
}
// If we "consumed" all the characters from the string itself, than "sub_s is" contained in "s"
return h.isEmpty();
}
int AnagramCasesCount(const std::vector<std::string>& strings)
{
int count = 0;
for(size_t i = 0; i + 1 < strings.size(); ++i) {
if(ContainsAnagram(strings[i + 1], strings[i])) {
++count;
}
}
return count;
}
int main()
{
std::cout << AnagramCasesCount({ "ab", "ab", "abc", "bca" }) << std::endl;
std::cout << AnagramCasesCount({ "ab", "aqb" }) << std::endl;
return 0;
}
@PenChief. Cool! The solution looks like ChrisK's algorithm applied to my code structure :)
Though, only half of ChrisK's algorithm is applied. There should also be a pieces of code for cases when we add a char from the long string, and this char makes the difference count greater. A test case that will work wrong with your solution: "ab", "acb".
ChrisK's algorithm with improved readability (in my opinion):
bool ContainsAnagram(string const s, string const sub_s)
{
if (!sub_s.empty() &&
sub_s.size() <= s.size())
{
vector<int> char_counts;
char_counts.resize(256, 0);
int diffs_count = 0;
for (char c : sub_s) {
++char_counts[c];
++diffs_count;
}
for (int i = 0; i < s.size(); ++i) {
if (char_counts[s[i]] > 0) {
--char_counts[s[i]];
diffs_count--;
} else {
--char_counts[s[i]];
diffs_count++;
}
if (i >= sub_s.size()) {
if (char_counts[s[i - sub_s.size()]] < 0) {
++char_counts[s[i - sub_s.size()]];
diffs_count--;
} else {
++char_counts[s[i - sub_s.size()]];
diffs_count++;
}
}
if (diffs_count == 0) {
return true;
}
}
}
return false;
}
Based on hashing -
public static void main(String args[]) {
String[] arr = {"ab", "ab", "abc", "bca"};
int n = count(arr);
System.out.println(n);
}
public static int count(String[] arr) {
int n = arr.length;
Map<String, Integer> map = new HashMap<>();
int count = 0;
String p = arr[0];
for (int i = 1; i < n; i++) {
int k = p.length();
int l = arr[i].length();
for (int j = 0; j < l; j += k) {
if (arr[i].length() >= (j + k) && getHash(p, map) == getHash(arr[i].substring(j, j + k), map)) {
count += 1;
break;
}
}
p = arr[i];
}
return count;
}
public static int getHash(String str, Map<String, Integer> map) {
if(map.containsKey(str))
return map.get(str);
char[] c = str.toCharArray();
int h = 0;
double cons = 2.35;
for (int i = 0; i < c.length; i++) {
h += ((int) c[i] * cons);
}
map.put(str, h);
return h;
}
A simple solution without hashing, but has complexity of O(n*m) :
import sys
def anagram( src, dst ):
try:
dst = list(dst)
for char in src:
del dst[dst.index( char )]
except:
return False
return True
if __name__ == "__main__":
input = sys.argv[1:]
#input = [ 'ab', 'ab', 'abc', 'bca' ]
count = 0
for i in range( len( input ) - 1 ):
if len( input[i] ) > len( input[i+1] ):
continue
for j in range( len( input[i+1] ) - len(input[i]) + 1 ):
if anagram( input[i], input[i+1][j:j + len(input) - 1] ):
count += 1
print count
And with a simple sum hash:
import sys
def hash( wholeString ):
sum = 0;
for char in wholeString:
sum += ord( char )
return sum
def rollhash( prev, next , current ):
return current - ord(prev) + ord(next)
def anagram( src, dst ):
try:
dst = list(dst)
for char in src:
del dst[dst.index( char )]
except:
return False
return True
if __name__ == "__main__":
input = sys.argv[1:]
count = 0
for i in range( len( input ) - 1 ):
if len( input[i] ) > len( input[i+1] ):
continue
patternHash = hash( input[i] )
subhash = 0
for j in range( len( input[i+1] ) - len(input[i]) + 1 ):
substr = input[i+1][j:j + len(input[i])]
if j == 0:
subhash = hash( substr )
else:
subhash = rollhash( input[i+1][j-1], input[i+1][j + len(input[i]) - 1], subhash )
if patternHash == subhash and \
anagram( input[i], input[i+1][j:j + len(input[i])] ):
count += 1
print count
Well, I just noticed that my code only handles permutations of the i'th string as substrings in the ( i+1) string. permutations are only a subset of anagrams.
I used the karp-rabin like approach, by using a 'sum of characters' hash, because a string and its permutations have the same hash value with this hash. I roll this hash when the i+1 string is larger then the i'th string.
Finally, because a sum of chars hash can have lots of multiple hits, if the hashes are equal, I verify that it is a permutation, by taking every character in the original string, and removing it in the candidate substring. If I found all the characters and removed them in the candidate, it is a permutation.
public static int findAnagrams(String[] s)
{
int[] c = new int[26];
String prev;
String next;
int res=0;
for(int i=0; i<s.length-1; i++)
{
Arrays.fill(c, 0);
prev = s[i];
next = s[i+1];
int j=0;
if(prev.length() > next.length())
continue;
while(j < next.length())
{
if(j<prev.length())
c[prev.charAt(j) - 'a']++;
if(j<prev.length())
{
c[next.charAt(j) - 'a']--;
}else
{
c[next.charAt(j-prev.length()) - 'a']++;
c[next.charAt(j) - 'a']--;
}
if(j >= prev.length()-1)
if(isAnagram(c))
res++;
j++;
}
}
return res;
}
private static boolean isAnagram(int[] c)
{
int n=c.length;
for(int i=0; i<n; i++)
if(c[i] != 0)
return false;
return true;
}
public static void main(String[] args) {
String[] s = {"ab", "ab", "abc", "bca"};
System.out.println(findAnagrams(s));
}
Using a frequency table and finding all substrings of element i+1 with length of element i.
Implemented in C++98.
#include <vector>
#include <string>
#include <map>
#include <iostream>
using namespace std;
vector<string> getAllSubstringsOfLength(string s, int l){
vector<string> subs = vector<string>();
for(int i = 0; i < s.size() - l + 1; i++){
string sub = s.substr(i, l);
subs.push_back(sub);
}
return subs;
}
bool isAnagram(vector<string> subs, string w, int l){
for(vector<string>::iterator it = subs.begin(); it != subs.end(); it++){
bool check = false;
map<char, int> lookup = map<char, int>();
string sub = (*it);
for(int i = 0; i < l; i++){
if(lookup.find(w[i]) != lookup.end())
lookup[w[i]]++;
else
lookup[w[i]] = 1;
}
for(int i = 0; i < l; i++){
if(lookup.find(sub[i]) != lookup.end()){
lookup[sub[i]]--;
if(lookup[sub[i]] < 0){
check = true;
break;
}
}
else{
check = true;
break;
}
}
if(!check){
for(map<char, int>::iterator it1 = lookup.begin(); it1 != lookup.end(); it1++){
char key = (*it1).first;
int val = (*it1).second;
if(val != 0){
check = true;
break;
}
}
if(!check)
return true;
}
}
return false;
}
int main(){
int count = 0;
vector<string> list = vector<string>();
list.push_back("ab");
list.push_back("aqb");
//list.push_back("abc");
//list.push_back("bca");
for(int i = 0; i < list.size() - 1; i++){
string word2 = list[i+1];
string word1 = list[i];
vector<string> subs = getAllSubstringsOfLength(word2, word1.size());
count += (isAnagram(subs, word1, word1.size())) ? 1 : 0;
}
cout << count << endl;
}
@ChrisK. That's correct. It's definitely not Rabin-Karp algorithm, it's something similar, in sense that it's using a hash and a sliding window.
- Alex October 01, 2017Regarding time complexity, we are both right, because the complexity is really O(n*f), but since f is a constant, it's actually O(n).