Linkedin Interview Question
Software Engineer / DevelopersTeam: Data Scientist
Country: United States
Interview Type: Phone Interview
I think this solution is correct, pseudo code here:
min_len = infinity;
most_recent_pos[n] = -1;
foreach word in document
int sequenceStart = min (the most recent position of all words in pattern except the current word)
int curr_len = pos of curr word - sequenceStart
if curr_len < min_len min_len = curr_len
most_recent_pos[current word] = position of current word
string GetMinimumString ( string original, string word1, string word2, string word3 )
{
const char SpaceChar = ' ';
if(original == null)
return null;
List<string> words = original.Split(SpaceChar);
Hashtable<string, int> wordList = new Hashtable<string, int>();
wordList.Add(word1, -1);
wordList.Add(word2, -1);
wordList.Add(word3, -1);
int minDistance = Integer.MaxValue;
int minIndex = 0, maxIndex = 0;
for(int i=0; i< words.Length; i++)
{
string w = words[i];
if(w == word1 || w == word2 || w ==word3)
{
wordList[w] = i;
if ( (w == word1 & (wordList[word2] == -1 || wordList[word3] == -1))
|| (w == word2 & (wordList[word1] == -1 || wordList[word3] == -1))
|| (w == word3 & (wordList[word1] == -1 || wordList[word2] == -1)) )
continue;
int distance = GetMax(wordList) - GetMin(wordList);
if ( distance < minDistance )
{
minDistance = distance;
minIndex = GetMin(wordList);
maxIndex = i;
}
if(minDistance == 2)
break;
}
}
if(maxIndex == 0)
return null;
string s = "";
for(int i = minIndex; i<= maxIndex; i++)
s += words[i] + SpaceChar;
return s;
}
int GetMax(Hashtable<string, int> tab)
{
int max = 0;
foreach(Key k in tab.Keys())
if( tab[k] > max )
max = tab[k];
return max;
}
int GetMin(Hashtable<string, int> tab)
{
int min = 0;
foreach(Key k in tab.Keys())
if( tab[k] < min )
min = tab[k];
return min;
}
We need a Trie and a HashTable, and do one time scan using sliding window (head + tail pointers) technique
1) Advance tail pointer thru the document and match any word with the help of Trie. Add a count of 1 to HashTable keyed by word.
2) When we matched all n words (HashTable.Size == n), we know document from head to tail contains all words, but not necessarily the shortest.
3) Now advance head pointer and decrease the value in HashTable when we encounter a word. Stop when the value in HashTable is 0. Now we have a shortest match between head and tail at current (tail) location.
4) Advance tail pointer again and stop when any word is matched. Try to advance tail pointer the same way as far as we can. Update the global tracker if new match is shorter.
5) Repeat until tail hit the end of document.
There're optimizations we can do when we try to advance the head (because tail already visited these words).
We can solve this problem in only one iteration, O(n) Complexity
let there be a variable int result= -1
1) Create a hash table with keys (W1,W2...Wm) and initialize their value by -1
2) Start checking the document word by word if the word exist in hash map then check its value, if its value is -1 then put the index value , else check the hash value with index if index is greater than hash value then put index value in hash. then check if all the hash box have +ve values(means all words are present) then look for max and min of the hasp values if its diff is less than result or result == -1 then store it in result.
looking for max and min in the hash values will add overhead O(n * pattern.length)
i'm thinking how to keep track of the two index
looking for max and min in the hash values will add overhead O(n * pattern.length)
i'm thinking how to keep track of the two index
I think that the hash-table based solutions (suppose that they are not very trick) could work only if W1, W2, .. Wn are all different. If they are not something else should be considered or the hash items should contain a list of positions and should keep as many of these positions as many time Wk is repeated.
I made this attempt to solving this problem using regex, and testing on big.txt (from here norvig.com big.txt).
This runs just under 3sec for 1mil words. My main concern: Is this fast enough?
run with:
java test.ShortestString big.txt
package test;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileNotFoundException;
import java.lang.NullPointerException;
import java.lang.Exception;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.Scanner;
import java.util.List;
import java.util.ArrayList;
class ShortestString {
public static void main (String args[]) throws FileNotFoundException {
// You can also find out about how to make
String w1 = "find";
String w2 = "how";
String w3 = "make";
String minM = null; // min matched string
int minL = Integer.MAX_VALUE; // min matched string length
ArrayList<String> patterns = new ArrayList<String>();
patterns.add("(" + w1 + ".*?\\s.*?" + w2 + ".*?\\s.*?" + w3 + ")");
patterns.add("(" + w1 + ".*?\\s.*?" + w3 + ".*?\\s.*?" + w2 + ")");
patterns.add("(" + w2 + ".*?\\s.*?" + w1 + ".*?\\s.*?" + w3 + ")");
patterns.add("(" + w2 + ".*?\\s.*?" + w3 + ".*?\\s.*?" + w1 + ")");
patterns.add("(" + w3 + ".*?\\s.*?" + w1 + ".*?\\s.*?" + w2 + ")");
patterns.add("(" + w3 + ".*?\\s.*?" + w2 + ".*?\\s.*?" + w1 + ")");
// compile all patterns
ArrayList<Pattern> patternsC = new ArrayList<Pattern>();
for (String pattern : patterns) {
patternsC.add(Pattern.compile(pattern));
}
BufferedReader r = new BufferedReader(new FileReader(args[0]));
try {
String line = null;
while ((line = r.readLine()) != null) {
for (Pattern pattern : patternsC) {
Matcher m = pattern.matcher(line);
while(m.find()) {
String matched = m.group();
int l = matched.length();
if (matched != null) {
if (minL > l) {
minL = l;
minM = matched;
}
// System.out.println(l+ "\n" + matched);
}
}
// System.out.println("Pattern done: " + pattern);
// System.out.println("---------------------------------------------------");
}
}
} catch(NullPointerException e) {
e.printStackTrace();
} catch(Exception e) {
e.printStackTrace();
}
System.out.println("Min string: " + minM);
System.out.println("Min len: " + minL);
}
}
Can we do this?
minLength = infinity;
minLine = ""
For each line L_i:
{
indx1=substring(L_i,w1)
indx2=substring(L_i,w2)
indx3=substring(L_i,w3)
// indx1,indx2 and indx3 are the indexes where w1, w2 and w3 appear in the line L.
Length = max(indx1,indx2,indx3) - min(indx1,indx2,indx3)
if ( Length < minLength)
minLength = Length;
minLine = i;
}
Create a standard trie from the given text. Since we have well defined words we can get away with a trie instead of suffix tree. Find the words if they occur in the trie. If yes then we just have to sort the indexes to get our answer. For eg, W1 occurs at 1, 10 ,30. And W2 occurs at 5, 15, W3 at 35 40. Sort the indexes and pick 1, 5, 35.
O (n* n)
define pattern = "w1, w2, w3";
string smallestStr = null;
for word in Document
{
if word is w1
{
string str = FindPatternString(Document, currentPosiiton);
if(smallestStr is null or str.Length < smallestStr.Length)
{
smallestStr = str;
}
}
}
string FindPatternString(doc, pos)
{
string left = FindString(doc, 0, pos);
string right = FindString(doc. pos, doc.length -1);
if(left == null && right == null)
return null;
else if (left == null)
return right;
else if (right == null)
return left;
else
return left.length > right.length ? right : left;
}
string FindString(doc, from, to)
{
HashtableFromPattern<Word, Count> table = {<w1, 1>, <w2, 1>, <w3, 1>};
bool isFromSmaller = (to > from);
while(from != to) {
if(table.Count ==0){break;}
if(table.contains(doc[from])
{
table[from]--;
if(table[from] ==0)
table.remove(doc[from]);
}
if(isFromSmaller )
from++;
else
from--;
}
if(table.count ==0)
return doc.substring(from, to);
else
return null;
}
public void printShortestDis(String w1, String w2, String w3, String[] doc) {
int pos1 = -1;
int pos2 = -1;
int pos3 = -1;
int minLen = Integer.MAX_VALUE;
int minStart = -1;
int minEnd = -1;
for (int i = 0; i < doc.length; i++) {
if (doc[i].equals(w1)) {
pos1 = i;
if (pos2 != -1 && pos3 != -1) {
int tmpLen = pos1 - Math.min(pos2, pos3);
if (tmpLen < minLen) {
minLen = tmpLen;
minStart = Math.min(pos2, pos3);
minEnd = pos1;
}
}
}
else if (doc[i].equals(w2)) {
pos2 = i;
if (pos1 != -1 && pos3 != -1) {
int tmpLen = pos2 - Math.min(pos1, pos3);
if (tmpLen < minLen) {
minLen = tmpLen;
minStart = Math.min(pos1, pos3);
minEnd = pos2;
}
}
}
else if (doc[i].equals(w3)) {
pos3 = i;
if (pos1 != -1 && pos2 != -1) {
int tmpLen = pos3 - Math.min(pos1, pos2);
if (tmpLen < minLen) {
minLen = tmpLen;
minStart = Math.min(pos1, pos2);
minEnd = pos2;
}
}
}
}
if (minStart != -1 && minEnd != -1)
for (int i = minStart; i <= minEnd; i++)
System.out.print(doc[i] + " ");
}
Worst case is O(n^2).
Basic idea:
When we encounter a word in the sequence. We search to find others until we have a valid string. We repeat and keep track of the minimum string as we find more substrings containing the words.
public class ShortestSubStringExpression {
Hashtable<String, Boolean> words = new Hashtable<String, Boolean>();
public static void main(String [] args) {
ShortestSubStringExpression shortestSubStringExpression = new ShortestSubStringExpression();
shortestSubStringExpression.solution();
}
public void solution() {
words.put("Hello", true);
words.put("World", true);
words.put("Ciao", true);
ArrayList<String> document = new ArrayList<String>();
document.add("Hello");
document.add("was");
document.add("Hello");
document.add("maybe");
document.add("World");
document.add("Ciao");
document.add("this");
document.add("was");
document.add("Hello");
printShortestStringWithPattern(document);
}
public void printShortestStringWithPattern(ArrayList<String> document) {
int minStringLength = -1 ; // Use -1 as "infinite"
String minString = null ;
for(int pos = 0 ; pos < document.size() ; pos++) {
String currWord = document.get(pos);
if(words.get(currWord) == null) {
continue ;
} else {
minString = getMinimumStringSoFar(document, pos, minString);
if(minString != null)
minStringLength = minString.length();
}
}
System.out.println(String.format("Min String: %s. Length: %d" , minString, minStringLength));
}
private String getMinimumStringSoFar(ArrayList<String> document, int pos, String currentMinString) {
int wordsEncountered = 0 ;
StringBuffer newBuffer = new StringBuffer(2048);
HashMap<String,Boolean> foundSoFar = new HashMap<String, Boolean>();
for(int i = pos ; i < document.size() ; i++) {
newBuffer.append(document.get(i));
String currWord = document.get(i);
if(words.containsKey(currWord) && !foundSoFar.containsKey(currWord)) {
wordsEncountered ++;
foundSoFar.put(currWord, true); }
if(wordsEncountered == words.size()) {
break ;
}
}
if(wordsEncountered == words.size()){
String newString = newBuffer.toString();
if(currentMinString == null || currentMinString.length() > newString.length()) return newString;
return currentMinString;
} else {
return currentMinString;
}
}
import sys
def getMinString( str, search_L ):
min_dis = sys.maxint # assign max int
L = [] # used for result output
w_pos = {}
for i in range(len(str)):
if str[i] in search_L:
w_pos[str[i]] = i
if len(w_pos)==3:
v_list = w_pos.values()
if v_list[2]-v_list[0] < min_dis:
min_dis = v_list[2]-v_list[0]
L = v_list
w_pos = {}
return "".join( [str[i] for i in range(L[0],L[2]+1)] )
print getMinString( 'Hellowrold', ['l','r','d'] )
It can be done in O(N)
Keep indices track int last_W1, last_W2, last_W3 - where as we go through the doc from left to right, the last seen index of W1 is set to last_W1, last seen index of W2 is set to last_W2 so on.
Anytime W3 is seen, update last_W3 (similarly for last_W2, last_W1 cases). Then check
last_W3 minus min(last_W2,last_W1) and if it is lesser than global_min, update global_min and save the three indices last_W3, last_W2, last_W1.
Once you reach the end of the doc, the last_W1, last_W2 and last_W3 should give the needed substring from the doc. We can go to those locations and print the substring.
{{
static int ShortestStringWithWordSequences(string[] text, string[] wordSequences)
{
if (text == null || text.Count() == 0 || wordSequences == null || wordSequences.Count() == 0)
return 0;
var minLen = int.MaxValue;
var dict = new Dictionary<string, int>();
var index = 0;
for(int i=0; i<text.Length; i++)
{
if (wordSequences.Contains(text[i]))
{
if (!dict.ContainsKey(text[i]))
dict.Add(text[i], index);
else
dict[text[i]] = index;
if (wordSequences.Count() == dict.Count())
{
var startIndex = dict.Values.Min();
var len = index + text[i].Length - startIndex + 1;
if(minLen > len)
{
Console.WriteLine(len);
}
minLen = minLen > len ? len : minLen;
}
}
index += text[i].Length;
}
return minLen == int.MaxValue ? 0 : minLen;
}
}}
This can be done by modifying the maximum sum subsequence.
- kill January 08, 2012Loop through once keeping track of
Most recent position of w1 w2 and w3, minlengthsofar and min.