Interview Question
Software Engineer / DevelopersCountry: United States
Interview Type: Phone Interview
DS.query has lower bound lglgT (like if DS is van emde boas at each character bucket) or lgT if you use balanced BST at each character bucket correct?
Please advise subbu
or balanced BST at each character bucket can be sorted array (which do binary search on) to improve cache hits
lasthope, this was your idea right?
but I like subbu do it without specifying the actual DS/algorithm exactly so it give clean look from above
nice one subbu
It is O(1), by just using a table, say Lookup[p,c].
This table, of size O(|T|) can be filled in O(|T|) time.
Basically a matrix of 256*|T| integers.
For example if string is ab
Then the matrix will look like (hope formatting goes through)
a | b | c | .... | z
------------------------
-1| 0 | 1 | -1 | .... | -1
0 | -1 | 1 | -1 | .... | -1
1 | -1 | -1 | -1 | ..... | -1
You can start by filling the row corresponding to |T| and work right to left.
-=, yes, nice clean implementation. Although I'm not sure how table access for query can be O(1).
Yeah I think the successor query (which is "DS.query" by subbu) has lower bound lglgT. Not sure if we can beat van embde boas (which match the very lower bound).
But the way lasthope thinking about it (which in bigO is lgT) is better for cache hits because he does binary search over static array instead of pointer based van emde boas (lglgT).
But I like subbu presenting it is as abstract DS.query, but maybe we rename it as DS.successor :)
Subbu how to do actual successor query in O(1) on your table of mostly -1 (sparse) ?
@lasthope.
Constructing the table is O(|T|) time and query is O(1).
For example just for a single character, say 'h', you can fill its column in O(|T|) time by scanning through the string right to left.
Do this for each character (256) and you have the table.
@subbu, I understand that the table construction is O(T). But, if you want to find the lowest non -1 entry of 'h' (which is greater than p) in the table, you have a problem. You may have to pass through many values less than equal to p or simply -1 before you meet your criteria. Hope I'm making sense.
In the table above column 'c' has all -1. Mistake there.
@lasthope/ -=. I hope you got why it is O(1).
@lasthope.
Here is the code for a single character, which should explain what I am trying to say better.
int *GetColumn (char * T, char c) {
int L = strlen(T);
int *column = new column[L];
int position = -1;
for (j = L-1 ; j>= 0; j--) {
column[j] = position;
if (T[j] == c) {
// found c, update the new next for the c that are to the left of current position.
position = j;
}
}
}
Yeah it would seem lasthope's binary search is the divide and conquery improvement of the table idea.
We can discuss "per character buckets" (we put some DS at each character bucket - this part is same for all solutions we are discussing).
At a character bucket if we hang a sparse vector of length T with mostly -1, then worst case we might have to travel paths linear in T to find query. It seems O(T).
lasthope suggest converting the very same -1 sparse vector into an array of indices (sorted) to binary search on O(lgT).
or we can hang a van emde boas off each character bucket to do successor queries in O(lglgT).
Since Tmax is only 10M , lgTmax is 23, lglgTmax is 4
So the internal constants being smaller for binary search probably wins here.
I read somewhere the successor problem (for reasonable requirements for preprocessing times and space) has a lower bound O(lglgT) optimal query.
@subbu, You are absolutely right. A table constructed this way will immediately point to where the next position is. Perfect.
subbu , lasthope
i am still blocked , help!
subbu can you post all code again in a new one because it seem you got the optimal solution so we can upvote that one and understand better?
@-=. I believe you are thinking of the case where the n integers come from a universe of size N, with N >> n, with Theta(N) space usage being prohibitively expensive.
That is not the case here. N = |T| here.
btw, the Omega(log log N) bound was just a conjecture that has been disproved.
@-=, you have to understand how subbu is updating his table. table['c'][pos] is giving you the value
min {i | i > p, T[i] = 'c' }
, which is exactly what you need to query.
Oh!
So extra preprocessing to O(max_char*T) to reduce to O(1) query.
Makes sense.
In Java this would be 64k*10M preprocessing worst case.
I understand all your amazing solutions. Unblocked thx lasthop subbu
Here is some quick C++, fully working (but not production quality), commentless code. (For explanation, there are a bunch of comments preceding).
#include <string.h>
#include <iostream>
#include <vector>
using std::cout;
class Successor {
public:
int Query(char c, int position) {
return _table[c-1][position+1];
}
Successor(const char *T) {
for (int c = 1; c <= 256; c++) {
_table[c-1] = GetRow(T,c);
}
}
~Successor() {
for (int i = 0; i < 256; i++) {
delete [] _table[i];
}
}
private:
int *GetRow(const char *T, char c) {
int L = strlen(T);
int *column = new int[L+1];
int position = -1;
for (int j = L-1; j >= 0; j--) {
column[j+1] = position;
if (T[j] == c) {
position = j;
}
}
column[0] = position;
return column;
}
int *_table[256];
};
class SubSequenceChecker {
public:
bool Check(const char *S) {
int position = -1;
while (*S != '\0') {
position = _DS.Query(*S, position);
if (position == -1) {
return false;
}
S++;
}
return true;
}
SubSequenceChecker(const char *T) : _DS(T) {
}
private:
Successor _DS;
};
int main() {
SubSequenceChecker s("aabcddef");
cout << s.Check("a") << "\n";
cout << s.Check("abcf") << "\n";
cout << s.Check("abcdef") << "\n";
cout << s.Check("fed") << "\n";
cout << s.Check("aaa") << "\n";
cout << s.Check("aafd") << "\n";
}
1. Create a sorted array of the positions where char X is present in T. Do it for all possible chars X (assuming 256).
2. For each char Y in S, do a binary search over the sorted array created for char Y in the first step. Note the position where it is found. Next time we look for Y again, we have look beyond this position.
3. If we do not find a viable position for Y in the sorted array, return false.
Runs in O( |S| log |T| ).
On a second thought, using two indexes, one pointing to the current char in S another one pointing to the last matched char plus one in T, it can be done in O( |S| + |T| ) time.
So we can do this:
An array or hash of all characters where each bucket is a van Embde Boa tree (keys are the indices where the character for that bucket shows up).
Then as required we can do successor to find next index.
This should make it to O(S lglg T) which is nearly O(S) even with T as 10M
Oh no, van emde boas is for making tree access cache oblivious. It is not required here.
In step 1 we "pre-process" T so that array[X] holds the positions (sorted if you go from left to right) where char X appears in T. O( T )
Then for each char S_i = 'd' in S (processing chars one by one from left to right), do a binary search in array['d']. Lets say we find it in location l of array['d']. O( log T )
The trick is to remember the location l and next time when we look for 'd' again we have to search array['d'][ l+1: END]. If found we update l. There should be an last found position similar to 'l' for each char processed from S so far.
If the process doesn't fail we return true at the end. O( S log T )
The position that is the lower bound is common to all characters. So Step 2 in the description lasthope's algorithm can lead to wrong answers. No?
oops, sorry. there should be only one last matched position variable l. we have to search starting from l+1. please see the implementation by subhu below.
@-=: No, it is not off-by-one. Step 2 talks about maintaining positions for each character, which actually gives wrong answers.
There should have been a delete/edit option :-P. One last match position is all you need. btw, I think the O(1) lookup is very clever, subbu :)
c'mon subbu , lasthope idea is good for large max-char also (the correct idea was hidden with some bugs that's all)
@-=. I am pretty sure lasthope had the correct solution in mind and it does indeed work for larger character sets. In fact, this is a good discussion to have with the interviewer! Even for smaller character set, it would be a good discussion. Classic tradeoff between time and space (256*T space + O(|S|) runtime vs |T| space + O(|S|log|T|) rutime).
But as written it is wrong, and it could actually be a common mistake made my many of the interview candidates, don't you think? No harm in calling it out.
Yes yes.
I made mistake because I had similar solution (using vanEmde) so when I read lasthope's I just skimmed it assumed he was doing similar (successor at each bucket with lgT).
All these discussions are correct, I think can overwhelm the interviewer no matter how interviewer complains. that is key to beating the interviewer, don't let him/her complain anything (always come back and say .. ok in that case if you don't care about preprocessing we can... ok in that case you complain about that so do it this way...)
Good job subbu I really like your original code without any DS.query implementation that way we can drop what we need for that part based on how interviewer complain
do following check for every Si....
input - T,
pattern - Si
public static boolean match(String input, String pattern){
char[] in = input.toCharArray();
char[] pat = pattern.toCharArray();
int i=0, j=0;
int count=0;
while(i<in.length && j<pat.length){
if(in[i]==pat[j]){
i++;j++;
count++;
if(count==pat.length)
return true;
}else{
i++;
}
}
return false;
}
Sorry the above code with HashMap is incomplete. Here is full code
We can use memo dp to improve the algorithm.
// use memorization to improve the algorithm from brute force to dp
HashMap<String, Boolean> memo = new HashMap<String, Boolean>();
Boolean result;
for(int i=i_min; i<=i_max; i++) {
if( memo.containsKey( s[i] ) {
result=memo.get( s[i] );
}
else {
result=T.subSequence(s[i]);
memo.put(s[i], result);
}
if( result) System.out.println(s[i] + "is a subsequence");
}
I will post visualization later
Great approach to have a character location map. Implemented the code with an IEqualityComparer parameter, just in case the interviewer wants to do this case insensitively.
using System;
using System.Collections;
using System.Collections.Generic;
namespace CareerCup
{
internal class Subsequence
{
private const int IndexNotFound = -1;
#region Public Methods
/// <summary>
/// Finds if the each of the <paramref name="patterns"/> string is a subsequence of the <paramref name="source"/> string.
/// </summary>
/// <param name="source">Source string.</param>
/// <param name="patterns">Pattern strings.</param>
/// <param name="comparer">Equality comparer.</param>
/// <remarks>
/// The <paramref name="source"/> string can be upto 10 million characters long.
/// </remarks>
internal static void IsSubsequence(string source, string[] patterns, IEqualityComparer<char> comparer = null)
{
// Let's pre-process the source string.
// Dictionary - This will store the count of each character in the source string.
// This will help as the first line of defense against verifying if the pattern can exist in the source string.
//
// Dictionary - This will store the location of each character in the source string.
// The concept is that if we will be able to do binary search to find the characters' existence and the order.
Dictionary<char, int> countMap = new Dictionary<char, int>(comparer);
Dictionary<char, List<int>> locationMap = new Dictionary<char, List<int>>(comparer);
for (int index = 0; index < source.Length; ++index)
{
char currentChar = source[index];
if (countMap.ContainsKey(currentChar))
{
countMap[currentChar] += 1;
locationMap[currentChar].Add(index);
}
else
{
countMap[currentChar] = 1;
List<int> locations = new List<int>();
locations.Add(index);
locationMap[currentChar] = locations;
}
}
// Let's process all the patterns.
foreach (string pattern in patterns)
{
bool result = IsPatternASubsequence(pattern, source.Length, new Dictionary<char, int>(countMap, comparer), locationMap);
Console.WriteLine("Pattern: {0} IsSubsequence: {1}", pattern, result);
}
}
#endregion // Public Methods
#region Private Methods
/// <summary>
/// Checks if the <paramref name="pattern"/> string is a subsequence of the source string.
/// </summary>
/// <param name="pattern">Pattern string.</param>
/// <param name="sourceLength">Source string length.</param>
/// <param name="countMap">Character count map of the source string.</param>
/// <param name="locationMap">Character location map of the source string.</param>
/// <returns>Flag indicating whether the <paramref name="pattern"/> string is a subsequence of the source string.</returns>
private static bool IsPatternASubsequence(
string pattern,
int sourceLength,
Dictionary<char, int> countMap,
Dictionary<char, List<int>> locationMap)
{
if (string.IsNullOrWhiteSpace(pattern)
|| pattern.Length > sourceLength
|| !DoesSourceContainPattern(pattern, countMap))
{
return false;
}
// If we have reached here, then the pattern characters are present in the source string.
// Now, we need to make sure that all the characters are in the right order.
int indexInSource = IndexNotFound;
for (int index = 0; index < pattern.Length; ++index)
{
char currentChar = pattern[index];
// Check if the number of characters left in the pattern can be found in the source.
// Will this check improve on our efficiency?
if (!GetNextAvailableIndex(ref indexInSource, locationMap[currentChar])
|| ((sourceLength - indexInSource) < (pattern.Length - index - 1)))
{
return false;
}
}
return true;
}
/// <summary>
/// Gets the next available index of the desired character by looking up the character location.
/// The search is performed using binary search.
/// </summary>
/// <param name="currentIndexInSource">Current index in source string that holds the previous charcter.</param>
/// <param name="locationMaps">Character location map.</param>
/// <returns>Flag indicating whether the desired character exists in the source string after the current index.</returns>
private static bool GetNextAvailableIndex(
ref int currentIndexInSource,
List<int> locationMaps)
{
int start = 0;
int end = locationMaps.Count - 1;
int nextSoFar = IndexNotFound;
while (end >= start)
{
int middle = start + ((end - start) >> 1);
if (locationMaps[middle] == currentIndexInSource)
{
if (middle < locationMaps.Count)
{
nextSoFar = locationMaps[middle + 1];
}
break;
}
else if (locationMaps[middle] > currentIndexInSource)
{
nextSoFar = locationMaps[middle];
end = middle - 1;
}
else
{
start = middle + 1;
}
}
if (nextSoFar == IndexNotFound)
{
return false;
}
currentIndexInSource = nextSoFar;
return true;
}
/// <summary>
/// Verifies that all the characters in the <paramref name="pattern"/> exists in the source string.
/// </summary>
/// <param name="pattern">Pattern string.</param>
/// <param name="countMap">Character count map.</param>
/// <returns>Flag indicating whether the pattern characters exist in the source string.</returns>
private static bool DoesSourceContainPattern(string pattern, Dictionary<char, int> countMap)
{
for (int index = 0; index < pattern.Length; ++index)
{
char currentChar = pattern[index];
int count;
if (!countMap.TryGetValue(currentChar, out count)
|| count == 0)
{
return false;
}
countMap[currentChar] -= 1;
}
return true;
}
#endregion // Private Methods
}
public class CharComparer : IEqualityComparer<char>
{
public bool Equals(char x, char y)
{
return char.ToLowerInvariant(x).Equals(char.ToLowerInvariant(y));
}
public int GetHashCode(char obj)
{
return char.ToLowerInvariant(obj).GetHashCode();
}
}
}
Represent each string as a vector with counts of characters and use that to determine "sub-sequencesness".
Eg: "abbc" will be the vector [1,2,1,0,...,0] with the first entry corresponding to number of a's, second to number of b's etc.
Given abc, which will be [1,1,1,0,...] you check if each corresponding entry of target is <= the entry at the source vector.
i am assuming strings are of ASCII character set. since ASCII has 256 charecters, declare an integer array of with 256. count each character presence. do the same by decrementing charecter occurance. if any position got -1, then its not the subset of the first string.
subsequence still has order involved
you are doing "subset" checking
you can use this as a first level check though
A greedy method should be OK, I think. The C++ code:
bool myCmp(string S, string T) {
int lenS = S.length();
int lenT = T.length();
for (int i = 0, j = 0; i < lenS, j < lenT; j++) {
if (S[i] == T[j]) i++;
}
return i == lenS ? true : false;
}
The space complexity is O(1); The time complexity is O(n), where n = length of T.
Please correct me, if it's wrong.
Algorithms on Strings, Trees and Sequences by Gusfield has Anonymous/subbu/= solutions all variations.
all same people win the upvote
-Guest DS
{fake ajeet, fake Varun, Anonymous that failed with first solution using hashing, =, fake subbu } all same person 100%
{lasthope} different person
Preprocess T so that we can answer queries of the following form:
- Given a character c and a position p, what is the smallest position j > p, such that c appears at position j in T (we allow p to be -1, assuming positions start at 0).
This can easily be done in O(n) preprocessing time (assuming 255 possible characters, linear time for each character) so that the queries can be answered in O(1) time.
Now for a given S we can use the above data structure (called DS below) as follows
Thus, for each string S, we only require O(|S|) time.
- Subbu. November 18, 2013