Interview Question
Country: India
Interview Type: Written Test
Hash these strings char-by-char, e.g.:
hash[f]=5
hash[d]=1
...
hash[o]=6
....
Now sweep through the hash table and pick the values that are :
1. most repeated
2. form a sub-string (i.e., they are sequential)
It is also important to keep a counter that indicates the number of strings that have a selected substring in common. This is to address cases such as "fruit" in above example. One approach is to select the substring if it is common to more than 50% of the entire strings.
The problem is not stated very clearly. I will try to reword it:
You're given an array of strings, for example {fog, fool, food, fruit, for}. You pick a string S, and for each string x in the array, if S is a prefix of x, you remove S from the beginning of x. In other words, if x could be written as S.concat(y), where y is some (possibly empty) string, you replace x with y. The goal is to find the choice of S that minimizes the sum of all string lengths in the array after this replacement + length(S).
This can be solved in optimal O(M) time, where M is the total number of characters present in the input, by using a prefix tree. Build a prefix tree that is augmented at each node with a counter of how many strings in the prefix tree begin with the prefix defined by the path to that node. For example, if the prefix tree contains the words "cow" and "cool", it would look like
root (2) -> c(2) -> o(2) -> o(1) -> l (1)
|
-> w(1)
I hope the above formats correctly when people look at it. The top c and o nodes have a count of 2 because there are 2 words that begin with "c" and "co" respectively. The next "o" node has a count of 1 because only 1 word begins with "coo".
Now, you want to choose S to be the string defined by some path from the root to a node in this prefix tree. If you don't, your choice cannot be optimal, since there are characters at the end of your chosen string that could be removed to improve the cost. The question now becomes which path we should choose.
We can do a traversal of the prefix tree. Each node corresponds to a possible choice of S. At each node, we have the count of how many strings in the array have that choice as a prefix. All we need to do is multiply the count by the depth of the node (length of the corresponding choice for S), which we maintain during the traversal, to determine how many total characters will be removed by that choice of S. We then subtract the depth of the node once to compensate for the length(S) term in the cost function. The result is the cost reduction that would be achieved by this choice of S. Minimizing the cost is equivalent to maximizing the cost reduction. We keep a reference to the node seen so far that has the largest cost reduction as we traverse. After the traversal, we do a depth-first search to find the path from the root to that node, and reconstruct the optimal choice of S from the path.
A prefix tree can be constructed in O(M) time, traversed in O(M) time, and DFS searched in O(M) time, where M is the total number of characters in the input. Adding an additional count to each node and maintaining the depth and the maximum so far during the traversal does not change the asymptotic bounds, so the total time is O(M). The total space will also be O(M), assuming storing a count takes O(1) space per node.
I think its a variation of longest common sub sequence problem.
- Sai August 11, 2013Along with finding the longest common sub sequence we have to keep track the frequency which is the count of no of string this particular longest common sub sequence is present or simply the most frequent remaining length after removing longest common string would win....!