## Microsoft Interview Question

Software Engineer / DevelopersWell I think we can build array of suffixes and then sort them.

e.g. mayank

mayank| ayank |yank |ank |nk| k

sort them

ank| ayank| k| mayank| nk |yank|

now two adjacent which has longest prefix is our answer

in this example "a" from (ank and ayank).

This is easy and standard solution.

There are n suffixes.

Trivial sorting algorithm will work, but quite slow.

There're O(n) suffix array building algorithm and O(n) algorithm to calculate longest common prefix of two suffixes. But they're too complicated.

@ftfish you are right about complexity it is not O(n) but trivial sorting approach is very easy to code . See My code down the page. But I use standard utility for sorting.Hope Interviewer will not mind in using that.

Using trivial sorting algorithm yields an O(n^2 logn)algorithm which is too slow.

BTW, my algorithm can also be implemented in O(n^2 logn) if we don't use hash table. see the commented part of my code below.

How did you come with N^2logN ?

I think building suffixes O(NW) say W is average length of suffix.

Sorting Suffix also NWlogN if we are not doing anything fancier.

finding common prefix is NW

so NWlog(n).

Your hashing technique is better than this of course. But this seem okay to me. Correct me if I am wrong

the lengths of the suffixes are 1,2,...,n, so the sum is 1+2+..+n=(n+1)*n/2. Then your "average" = O(n).

Using any nlogn-comparision sorting algorithm yields an O(nlogn * n) time complexity, because each comparision (of two suffixes) takes O(n) time.

Note the commented part of my code is just a binary search + trivial bruteforce, which is easy to code.

@Mayank, can you post some good source of the implementation of the suffix array in C. A brief description of the approach will also be helpful.

Binary search for the length of the answer, then check validity with polynomial hash (so that hash[s[i..j]] can be easily calculated from hash[s[i-1..j-1]], which is also called Robin Karp hash).

O(n*logn)

Better (but more complicated) solutions exist. Try googling suffix tree/array. O(n) can be achieved.

Hi,

What do you mean by "Binary search for the length of the answer" ? We do not know the length of the answer so how can we feed this value into the input? I believe Robin Carp can be used when we have a Pattern which we want to match in the Text. But here we have no such input pattern.

We don't know exactly what the length is, but we do know it is between 0 and len(s).

The key insights which allow us to do binary search are:

1. if there's a substring with length k which appears more than once, then there's also a substring with length k-1 which also appears more than once;

2. if there's NO substring with length k which appears more than once, then there's also NO substring with length k+1 which also appears more than once;

for a given length of answer k, just insert all the hash values of all the substrings with length k into a hash table(totally O(len(s)) using polynomial hash) and check if there're two substrings with the same hash value.

so the time complexity is O(len*log(len))

This is the easiest way I know to solve this problem.

ftfish: Nice idea. So for this implementation we'll have to keep a a data structure:

hash[i][j] for 1 <= i <= n, min_hash_val <= j <= max_hash_val. Each entry of hash[i][j] stores the substring of length = i, and hash value = j.

And finally, we will have to do a binary search to find out largest 'i' for which there is 2 elements with same 'j'. However, we need to also store the substring. Is this idea correct? Thanks.

@Anonymous

I can't exactly understand your first part. But all we have to do is to keep a hash table in which all the hash values of all the substrings of length k are stored and clear that hash table before every phase.

You don't have to store the substrings themselves because it's too costly. Just use multiple "bases" in polynomial hash so that we're confident enough to assert two substrings are the same if they give the same hash value for all the bases. generally 2 or 3 bases are enough.

@ftfish: your "time complexity is O(len*log(len))" seems not correct, since when you do k=len/2, you will calc Hash for len/2 substrings with length len/2, the time cost for this step is O((len/2)^2), which is O(len^2).

@ftfish

I don't think explaining Rabin-Karp at the time of interview is a nice idea . I tried to do it once and ended up with confusing the interviewer .

Besides no matter how many bases you use ,a counterexample can always be found . So ,I think its better to explain other algorithms which ensure the correctness .

@Frodo Baggins

It's just a basic math formula. But of course, if even you don't understand the idea clearly, it may be confusing.

Your second part is not perfectly true. I assume you know that the Millerâ€“Rabin primality test, which is commonly used, is a probabilistic algorithm. There're counterexamples, but you shouldn't be able to find them directly/easily. And with careful choice of parameters, the probability that your computer hardware is done is larger than the probability that the algorithm gives wrong answer.

But OK, if you prefer suffix tree/array which generally take hundreds lines of code, it's your choice and google may help.

It seems like everyone is jumping to answers.. but no one asked for the right questions.. so it's either they heard this or make assumptions which is not good.

Regarding the string inputted into the function.. is it a very long sentence with spaces and therefore the substring is the word (tokenized with spaces)? or is it is a long string with random sequence with no spaces. If it is the latter how is the world are you going to find the substring. It is assumed that the substring is a word that you are NEVER given...thus you don't know what the substring which means you dont' know what you are looking for. Thus it seems like the function is just

findLongestSubstring( std::string const& veryLongWholeStr, char delim = ' ' );

Now how do you know what substring is.

Exactly my point!

I've said earlier too and will repeat it now.. people, please word your questions completely and clearly.. don't assume that readers KNOW the question already.

Sometimes I wonder if they just post random questions they heard 'somewhere'... scholars of triva!... save you if you join your "dream" companies and continue to write like this! very soon you will be back on careercup.com looking for a new job :P

P.S. please learn English if you have to!

This is longest repeated substring problem. If you've confusion about definition of "substring", then it's BIG problem! Why the hell you consider "word"/"token" as substring? Where you SAW that defn?

@gradStudent: I second u, it should be more elaborate specially when we've people in the room who may have confusion with defn of "substring".

No, YOU, anonymous, missed my point. It's obvious a LONGEST repeated substring problem. The problem is how is it inputted it? And you miss my delimiter point too.

The delimiter is used as a space so if you have an entire book or a long sentence inputted as a string, you will have a space in it. If you have a text file with commas in it (it's possible.. have you thought about that!?) is that treated as a delimiter or part of a string. You can ALSO have something like this :

"Thedeasdfaimiteiasdfs uasdfasdfa uasdfasdfaasds apace so if you have an entire uasdfasdfaasdsbook or a long senteasdfas uasdfasdfa"

So can you tell me is "uasdfasdfaasds" the longest string within the word or just the word? you have 4 areas where there is this "uasdfasdfaasds" and 2 of them are within a word and 2 are not. Or you can have a bunch of long SPACES and that can can be a longest substring WITHIN THE INPUTTED STRING.

If specs are not clear you have software problems. You NEVER ASSume anything..

speaking from commercial experience.

Actually correction...

"Thedeasdfaimiteiasdfs uasdfasdfa uasdfasdfaasds apace so if you have an entire uasdfasdfaasdsbook or a long senteasdfas uasdfasdfa uasdfasdfaasdsbook uasdfasdfaasdsbook"

"uasdfasdfa" can be a longest repeated substring

"uasdfasdfaasds" within "uasdfasdfaasdsBOOK" can be longest repeat substring

but if they say word is the substring, then "uasdfasdfaasdsbook" is the long substring..not "uasdfasdfa".. because sometimes interviewers MEANT word and if it's a word, the delimiter spaces comes into play.. AND if you have 3 100 spaces within the string, the 100 continous space is the longest repeated substring.

I agree with Vinod. Suffix Tree will offer the best possible solution. Suffix tree on input string can be build in linear time and space. Search for deepest internal node can be done in linear time too.

But i was wondering if there exists any Dynamic Programming Solution (though less efficient but easier to code compared to suffix tree) for this problem ?

Any Suggestions

Ok, then can you implement it in code now?..especially during 40 minutes in the interviews. People might know theory..but they might not know how to code.

read my algorithm above. It can be implemented in 20 minutes.

As I have said, it is the easiest (yet efficient) way I can do.

Right..but algorithm is good and is getting there..but it is still theory..

it sounds simple, but when actual coding, many things will be amissed.

Once the interviewer SEES the code, you already answer that you know the

algorithm AND you really know how to code.

What would you two like to say? You've found an easier but still efficient way to solve the problem?

Then I'm looking forward to your solution.

If not, then practice your coding skills.

I've implemented my algorithm many times, so I can do it in 20 minutes.

Actually one point of commenting here is to learn from everyone else, give suggestion and correct other people or as you say improve the algorithm. Well, there is a lot of algorithm suggestions, which we can learn from. But there are no coding sample which is the ultimate thing interviewers are looking for and also the way you program the thing out. No one is saying you don't know how to code. But even if you did implemented many times, there are others who can better then you and able to spot either better coding (not algorithm) style. There is no code review in college. There are code reviews in commercial world. Also little things in C++ like using ++i instead of i++ makes a huge difference in performance speed also. I have seen candidates like you who knows the algorithm, but asking them to code in syntax, a few either really struggled or a few can do it..BUT there are so many little things in terms of coding efficiency (not algorithm efficiency) and style that he lacked. So like I said.. coding is the ultimate. Rare to find someone who knows both algorithm and coding efficiency.

Actually one point of commenting here is to learn from everyone else, give suggestion and correct other people or as you say improve the algorithm. Well, there is a lot of algorithm suggestions, which we can learn from. But there are no coding sample which is the ultimate thing interviewers are looking for and also the way you program the thing out. No one is saying you don't know how to code. But even if you did implemented many times, there are others who can better then you and able to spot either better coding (not algorithm) style. There is no code review in college. There are code reviews in commercial world. Also little things in C++ like using ++i instead of i++ makes a huge difference in performance speed also. I have seen candidates like you who knows the algorithm, but asking them to code in syntax, a few either really struggled or a few can do it..BUT there are so many little things in terms of coding efficiency (not algorithm efficiency) and style that he lacked. So like I said.. coding is the ultimate. Rare to find someone who knows both algorithm and coding efficiency.

I still can't get what you exactly want to say.

If you just think I can't implement my algorithm efficiently, that's alright, I won't waste my time trying to prove you're wrong. But I assure you that I can implement most (if not all) of the algorithms I gave on this site efficient enough. But there's always someone better than I, it's true.

If you just want to see the code, just express that clearly so that some one may get you and may have time pasting it here.

If you have a better algorithm, or a normal algorithm with EXTREMELY fast implementation, which you consider as "ultimate", I'm glad to see your post and thanks in advance.

BTW, replying as Anonymous makes me (and many others I think) confused.

I second ftfish. Anonymous's comment is provocative in the sense that he has doubt in those who post smart logic, but don't post solution. Com'on dude, you've NOT hired people here to code the problems for you. If you are smart (i assume as you've perhaps "real world" experience), code yourself. If you are not, ask suggestion about coding efficiency (like, which data structure or STL would be best for a problem etc).

There are 1000+ algorithm problems. No candidate needs to write code for all those. Because coding is like maths, you need to grasp and practice as much as possible; but need not to solve all of those as the set of problems is infinite. It's one's responsibility to learn how to code efficiently. He can post his code here for comments. But, it should not be the practice that one (say ftfish for this problem) will post an algorithm, and also an efficient implementation for others. It is like spoon-feeding.

No offense please.

No Fighting Please :-)... Everyone out here are brilliant coders. You guys are doing great job keep it up.

@buried.shopno

If you say that at an interview.. I will say bye bye to you and good luck with your next job search.

Since if at an interview, he say he can code it up many time, it should be extremely easy for him to do so. It's not spoon feeding, because these questions should be treated like interview questions and therefore code it up. It's again not spoon feeding also because you might code it up differently and see the result might be the same as any one who posted the coding, and then we can all learn from either other's code to see the difference. That's also learning. Also it's not spoon feeding because if you code it incorrectly, you will know since ftfish here seems to know the answer correctly and well. (no offense to ftfish, just making a statement..from an objective point of view)

If it is that easy to him like he say.. it should be like writing a string reverse function... see how many code are out there for that on forums, because it's easy.

<pre lang="java" line="1" title="CodeMonkey80169" class="run-this">//I copied this from some where but not that tough to understand

import java.util.Arrays;

class LRS {

public static String lcp(String s, String t) {

int n = Math.min(s.length(), t.length());

for (int i = 0; i < n; i++) {

if (s.charAt(i) != t.charAt(i))

return s.substring(0, i);

}

return s.substring(0, n);

}

public static String lrs(String s) {

int N = s.length();

String[] suffixes = new String[N];

for (int i = 0; i < N; i++) {

suffixes[i] = s.substring(i, N);

}

Arrays.sort(suffixes);

String lrs = "";

for (int i = 0; i < N - 1; i++) {

String x = lcp(suffixes[i], suffixes[i+1]);

if (x.length() > lrs.length())

lrs = x;

}

return lrs;

}

public static void main(String[] args) {

String s = "MaYankMaYanksasas";

s = s.replaceAll("\\s+", " ");

System.out.println("'" + lrs(s) + "'");

}

}

</pre><pre title="CodeMonkey80169" input="yes">

</pre>

<pre lang="c#" line="1" title="CodeMonkey24360" class="run-this">using System;

public class Test

{

public static void Main()

{

Console.WriteLine(longestSubstr("this is avaya")); //should print 'is '

Console.WriteLine(longestSubstr("aaabbbaaab") ); //should print 'aaab'

Console.WriteLine(longestSubstr("aaaa")); // should print 'aaa'

Console.WriteLine(longestSubstr("noduplicate")); //should print ''

}

static string longestSubstr(string input)

{

int minIndex = 0, maxIndex = 0, globalminIndex = 0, globalmaxIndex = 0;

for (int i = 1; i < input.Length; i++)

{

minIndex = maxIndex = i;

for (int j = i; j < input.Length; j++)

{

char tmp1 = input[j - i];

char tmp2 = input[j];

if (input[j - i] == input[j])

{

maxIndex = j;

}

else

{

if (maxIndex - minIndex > globalmaxIndex - globalminIndex)

{

globalmaxIndex = maxIndex;

globalminIndex = minIndex;

}

minIndex = maxIndex = j;

}

}

if (maxIndex - minIndex > globalmaxIndex - globalminIndex)

{

globalmaxIndex = maxIndex;

globalminIndex = minIndex-1;

}

}

return input.Substring(globalminIndex + 1, globalmaxIndex - globalminIndex);

}

}

</pre><pre title="CodeMonkey24360" input="yes">

</pre>

Hi,

why do you need to assign to tmp1 when it's not used? Assignment is a bit expensive.

char tmp1 = input[j - i];

char tmp2 = input[j];

For the ones who really want to learn, and the ones who want to see the code but expect that I can't write it.

btw, there seems to be something wrong with the commenting system of this site. And my indentations are easily messed up even with the { { { brackets.

```
//============================================================================
// Name : LongestRepeatedSubstring.cpp
// Author : ftfish
// Version : 1.0
// Description :
// finds the longest substring of a given string which appears
// more than once.
//============================================================================
#include <cstdio>
#include <ctime>
#include <cstring>
#include <cstdlib>
const int maxl = 100000, maxc = 2;
const int maxh = 200000, basen = 3, base[basen] = { 113, 149, 173 };
struct node {
unsigned key[basen];
node *next;
};
void gen_rand_str(char *s, int len) {
srand(time(0));
for (int i = 0; i < len; ++i)
s[i] = 'a' + rand() % maxc;
s[len] = 0;
}
void insert(unsigned v[], node *h[], node *p) {
memcpy(p->key, v, sizeof(p->key));
p->next = h[v[0] % maxh];
h[v[0] % maxh] = p;
}
bool find(unsigned v[], node *h[]) {
for (node *p = h[v[0] % maxh]; p; p = p->next)
if (memcmp(p->key, v, sizeof(p->key)) == 0)
return 1;
return 0;
}
bool valid(char *s, int n, int len, int &pos) {
static node *h[maxh], pool[maxl];
memset(h, 0, sizeof(h));
int pp = 0;
unsigned v[basen], pow[basen];
for (int j = 0; j < basen; ++j)
pow[j] = 1, v[j] = 0;
for (int i = 0; i < len; ++i)
for (int j = 0; j < basen; ++j)
v[j] = v[j] * base[j] + s[i], pow[j] *= base[j];
insert(v, h, pool + (pp++));
for (int i = len; i < n; ++i) {
for (int j = 0; j < basen; ++j)
v[j] = v[j] * base[j] - pow[j] * s[i - len] + s[i];
if (find(v, h)) {
pos = i - len + 1;
return 1;
}
insert(v, h, pool + (pp++));
}
return 0;
}
//
//bool valid2(char *s, int n, int len, int &pos) {
// for (int i = 0; i < n - len; ++i)
// for (int j = i + 1; j < n - len; ++j)
// if (memcmp(s + i, s + j, len) == 0) {
// pos = i;
// return 1;
// }
// return 0;
//}
int main() {
int n = maxl;
static char s[maxl + 1];
gen_rand_str(s, n);
// puts(s);
int start_t = clock();
int lr = 1, rr = n - 1, mid, tpos, anslen = 0, anspos;
while (lr <= rr) {
mid = (lr + rr) >> 1;
if (valid(s, n, mid, tpos))
anslen = mid, anspos = tpos, lr = mid + 1;
else
rr = mid - 1;
}
int end_t = clock();
if (!anslen)
puts("No repeated substring !");
else {
printf("maxlen = %d\n", anslen);
for (int i = 0; i < anslen; ++i)
putchar(s[anspos + i]);
putchar('\n');
}
printf("Time used: %.2lf\n", double(end_t - start_t) / CLOCKS_PER_SEC);
//
// lr = 1, rr = n - 1, anslen = 0;
// while (lr <= rr) {
// mid = (lr + rr) >> 1;
// if (valid2(s, n, mid, tpos))
// anslen = mid, anspos = tpos, lr = mid + 1;
// else
// rr = mid - 1;
// }
// printf("maxlen by bruteforce: %d\n", anslen);
return 0;
}
```

/*This program uses suffix array to find out the longest duplicated string*/

//Uses suffix array to solve it

#include<stdio.h>

#include<stdlib.h>

#include<string.h>

#define MAXLEN 50000

int comlen(char *p, char *q) {

int i=0;

while(*p && (*p++ == *q++))

i++;

return i;

}

int pstrcmp(char **p, char **q) {

return(strcmp(*p, *q));

}

int main() {

char c[MAXLEN];

char *a[MAXLEN];

int n= 0, ch;

while( (ch = getchar()) != EOF) {

a[n] = &c[n];

c[n++] = ch;

}

c[n] = '\0';

qsort(a, n, sizeof(char *), pstrcmp);

int maxlen = -1;

int maxi = -1;

int thislen = 0, i=0;

for(i=0; i<n-1; i++) {

if( (thislen = comlen(a[i], a[i+1])) > maxlen){

maxlen = thislen;

maxi = i;

}

}

printf("%.*s\n", maxlen, a[maxi]);

return 0;

}

<pre lang="c" line="1" title="CodeMonkey72394" class="run-this">/*This program uses suffix array to find out the longest duplicated string*/

#include<stdio.h>

#include<stdlib.h>

#include<string.h>

#define MAXLEN 50000

int comlen(char *p, char *q) {

int i=0;

while(*p && (*p++ == *q++))

i++;

return i;

}

int pstrcmp(char **p, char **q) {

return(strcmp(*p, *q));

}

int main() {

char c[MAXLEN];

char *a[MAXLEN];

int n= 0, ch;

while( (ch = getchar()) != EOF) {

a[n] = &c[n];

c[n++] = ch;

}

c[n] = '\0';

qsort(a, n, sizeof(char *), pstrcmp);

int maxlen = -1;

int maxi = -1;

int thislen = 0, i=0;

for(i=0; i<n-1; i++) {

if( (thislen = comlen(a[i], a[i+1])) > maxlen){

maxlen = thislen;

maxi = i;

}

}

printf("%.*s\n", maxlen, a[maxi]);

return 0;

}

</pre><pre title="CodeMonkey72394" input="yes">

</pre>

->The idea is to generate the prefix function(KMP) for all the suffixes in the string

->For Each Suffix find the length of the max prefix using the value in prefix function

->compare all the max pref length for all the suffixes , calculate the max and find the substring from that

->Below is the working code for that

->Time complexity o(n2) and space complexity o(n) . Here prefix array is being generate every time we can optimize to use a single array

```
public class LongestSubString {
public static void main(String[] args) {
String s = "dabcdkjslkfabcdg";
int[][] array = new int[s.length()][2];
for (int i = 0; i < s.length(); i++) {
String nextSuffix = s.substring(i);
int[] prefix = new int[nextSuffix.length()];
array[i][0] = i;
array[i][1] = computePrefixFunction(nextSuffix, prefix);
}
int max = 0;
for (int i = 1; i < array.length; i++) {
if (array[i][1] > array[max][1]) {
max = i;
}
}
System.out.println("Max with Shift is " + max);
System.out.println("Longest possible Sub String " + s.substring(max, max + array[max][1]));
}
public static int computePrefixFunction(String str, int[] prefix) {
int q = 0;
prefix[0] = 0;
for (int i = 1; i < str.length(); i++) {
while (str.charAt(i) != str.charAt(q) && q > 0) {
q = prefix[q];
}
if (str.charAt(i) == str.charAt(q)) {
q = q + 1;
} else {
q = 0;
}
prefix[i] = q;
}
int max = 0;
for (int i = 0; i < prefix.length; i++) {
// System.out.print(prefix[i]);
if (max < prefix[i]) {
max = prefix[i];
}
}
return max;
}
}
```

Make a trie here

- DashDash July 21, 2010Now traverse every path and calculate its length

Time complexity : O(n^2)