Given two aligned sequences `a

Google Interview Question for Software Engineer / Developers

4

of 14 votes

10
Answers
Given two aligned sequences `a` and `b`. Write a function "findCommon", that finds the longest substring of the longer sequence that align to the smaller sequence in such a way that the alignment length (matching length) can be maximized. Sequences initially were of different lengths but the smaller one is padded with hyphen ('-') after alignment to make it equal to the longer sequence. The length of longer sequence is known in advance (m, which is same for the smaller padded sequence). The output is always the subsequence of the longer string.

The total number of such operations to be done is in billions.

findCommon(a, b, m)

Example 1:

a = ------bixg--
b = xxxxxxbi-gzz
m = 12

output: big

Example 2:
a = xxxxxxbxigxx
b = ------b-ig--
m = 12

output = bxig

Example 3:
a = bigxxxxxxxxx
b = bi-x--------
m = 12

output = bigx
- mary.kindall November 19, 2013 in United States | Report Duplicate | Flag | PURGE
Google Software Engineer / Developer Algorithm

Email me when people comment.

An error occurred in subscribing you.

Country: United States
Interview Type: In-Person

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 0 vote

i am not clear about the output,

a = ------bixg--
b = xxxxxxbi-gzz
m = 12

ok so b has 'b' 'i' and 'g' characters common in order with a.

But why examples 2 and 3 have output bxig (x doesn't appear in b) and bigx (g doesnt appear in b)

- Anonymous November 19, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

alright the output is a sequence from longer string that match with subsequence in shorter string

- Anonymous November 19, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

public void findCommon(char[] a, char[] b, int m) {
int startCount = -1;
int endCount =-1;
char[] c = null;
for (int i = 0; i < m; i++) {

if (a[i] == b[i]) {
if (startCount < 0)
startCount = i;

} else if (startCount >= 0) {

if ((a[i] == '-') && ((int)(b[i]) >=97 && (int)(b[i])<=122) ) {

if(!(i+1 < m && a[i+1] == b[i+1]))
{
endCount= i;
break;
}
} else if ((b[i] == '-') && ((int)(a[i]) >=97 && (int)(a[i])<=122) ) {

if(!(i+1 < m && a[i+1] == b[i+1]))
{
endCount= i;
break;
}
} else
startCount = -1;
}
}
System.out.println("start count = "+startCount +" end count = "+endCount);
if(startCount != -1)
{
if(startCount != 0)
c=a[0]== '-'?b:a;
else
c=a[endCount+1]== '-'?b:a;
if(endCount ==-1)
endCount=m-1;
int i= startCount;
while(i <=endCount-1 )
{
if(c[i] != '-')
System.out.print(" "+c[i]);
i++;
}
}
}

- hotcoder24 November 19, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Whats the logic?

- confused_banda November 19, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

Find starting and end index of Longest common subsequence of lcs(a,b) in a.
result=a.substring(start,end);

- Raven November 21, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

yes

string longest_common_subsequence(string a, string b){ // string a is longer than string b
	int m = a.length();
	int n = b.length();
	int** mat = new int*[m];
	for(int i = 0; i < m; i++){
		mat[i] = new int[n];
		for(int j = 0; j < n; j++) mat[i][j] = 0;
	}
	int** start_mat = new int*[m];
	for(int i = 0; i < m; i++){
		start_mat[i] = new int[n];
		for(int j = 0; j < n; j++) start_mat[i][j] = 0;
	}

	for(int i = 0; i < m; i++){
		if(a[i] == b[0]){
			mat[i][0] = 1;
			start_mat[i][0] = i;
		}
	}
	for(int j = 0; j < n; j++){
		if(a[0] == b[j]){
			mat[0][j] = 1;
			start_mat[0][j] = 0;
		}
	}
	for(int i = 1; i < m; i++){
		for(int j = 1; j < n; j++){
			if(a[i] == b[j]){
				mat[i][j] = mat[i-1][j-1] + 1;
				start_mat[i][j] = start_mat[i-1][j-1];
			}
			else{
				if(mat[i-1][j] > mat[i][j-1]){
					mat[i][j] = mat[i-1][j];
					start_mat[i][j] = start_mat[i-1][j];
				}else{
					mat[i][j] = mat[i][j-1];
					start_mat[i][j] = start_mat[i][j-1];
				}
			}
		}
	}
	int longest = mat[m-1][n-1];
	int start = -1, end = -1;
	int maxv = -1;
	for(int i = m-1; i >= 0; i--){
		for(int j = n-1; j >= 0; j--){
			if(mat[i][j] == longest && a[i] == b[j]){
				if(i - start_mat[i][j] > maxv){
					maxv = i-start_mat[i][j];
					start = start_mat[i][j]; end = i;
				}
			}
		}
	}
	int i = 0;
	for(i = 0; i < start; i++){
		if(a[i] == a[start]) break;
	}
	start = i;
	string res = a.substr(start, end-start+1);
	for(int i = 0; i < m; i++) delete[] mat[i];
	delete[] mat;
	for(int i = 0; i < m; i++) delete[] start_mat[i];
	delete[] start_mat;
	return res;
}

- Anonymous November 22, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

yeh, this would work. the last loop for finding "start" and "end" can be avoided by keeping a track in the main DP loop of the maximal substring seen so far (a complete alignment is detected when the smaller string is fully scanned, j == n).

- Anonymous January 25, 2014 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

not clear about the input:
1. shorter string can be padded at will or it has been padded fixedly in input?
2. why in example 1, longer string contains a "-" ? it's padded or a trival character?

- henryadam November 23, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

m = 12 is the length of the two strings.

- Ravio November 25, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

To cope with the 'billions' part you could make a Levenshtein Automaton of the pattern string, with as wide of edit distance as you're willing, convert that to an NFA, run the possible matches in parallel (not OS parallel), add logic (somehow) to keep count of the longest. It's possible that the branches in the NFA would make it exponential, but I'm sure that with good coding you could make it max out at something less (there should be only so many states). On its best day, this would be O(m = length of input string). Given that we're starting with O(n x m) we've got room to play with. The NFA 'threads' would keep track of their history - the longest you'd replay, with the right characters, '-' and so on.

- JeffD March 03, 2014 | Flag Reply

CareerCup

Google Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews