Amazon Interview Question
Software Engineer / DevelopersCountry: United States
Interview Type: In-Person
Traverse the log, put every record into a hashtable, userId is the key, the list of url is the value. For the example Jim updated, the hashtable looks like:
186: A, C, C, B, A
187: B, A, C, C
188: B, A, C
189: A, D, B, A, C
Compare every pair of list, it's the longest common subsequence problem. Use dynamic programming to solve it.
Time complexity: build hashtable: n + compare every pair of list: k^2 * (n/k) when k is the number of userId, n/k is the average length of the list. The total running time is nk.
What do you mean "compare every pair of list"?Won't this require an array [M][N] where M is the number of users and N is the URLS?By the way I initially thought suffix tree but interviewer complained for too much space
It does require an array[M][N]. However we don't need to create this array, since we already have it as the values in hashtable. Just traverse hashtable, and compare the values of every two keys.
But if I find the longest common sequence in a pair, that does not mean that it is the same with another pair.I mean comparing in pairs seems erroneous.Am I wrong?
I thought the question is to find the longest common subsequence between any two users, since the answer in example here is sequence A, C, C which only users 186 and 187 visited.
In the question, it is to find subsequence "from all users". And answer to the example is A, C. it's even better. The only change to the my solution is at step 2, compare the value of the first key in hashtable with the value of the second key. get the LCS and compare with the value of the next key, repeat this getting LCS, and comparing it with next key's value, until done with the hashtable traversal.
i.e.
186: A, C, C, B, A
187: B, A, C, C
188: B, A, C
189: A, D, B, A, C
LCS(ACCBA, BACC) = ACC
LCS( ACC, BAC) = AC
LCS( AC, ADBAC) = AC
resut: AC.
The total time complexity is n.
So to find the LCS for N strings you do it in pairs?When you say "get the LCS and compare with the value of the next key," compare the value of the next key with what?The previous LCS?Or the previous key?
@Hua, I agree your idea, it abvious a good choice to compare first two strings, and get their lcs, then use lcs to compare with next string. using dynamic programming, it really takes O(n) time complexity, but the space complexity is little more:O(n^2) .
@Hua:"However we don't need to create this array, since we already have it as the values in hashtable". But to find the LCS of 2 strings the DP algorithm needs extra space.I haven't read a version of the algorithm not needing extra space.So I am not sure why you say we don't need the extra array
@Hua:Also how is complexity O(N)?The algorithm for LCS is by definition O(mn) where m,n are the sizes of the substrings involved.
I agree. The time complexity is O(mn). The space complexity can be reduced by only keeping the values in last row we've computed, because we only need the last row to compute the values in current row. It's O(min(m, n))
@Hua, It does not look like it will work on all set of data. consider the following in given example.
186: A, C, C, B, A
187: B, A, C, C
188: F,G
189: A, D, B, A, C
In this case
LCS(ACCBA, BACC) = ACC
LCS( ACC, FG) = Nothing
LCS(nothing, ADBAC)
Kindly expain
@Hua:It is O(mn) for 2 strings.For k strings (url entries) it is O(kmn). Also I don't know the algorithm of LCS without extra space.If you could provide a reference or code what you are saying it would be really great!
To Sumi,
In the case you listed, there is no common subsequence visited by all the users. We can just return empty set once nothing is returned by LCS()
To Jim,
Yes, For k stirngs, it is O(kmn). Please take a look at wiki page.
en.wikipedia.org/wiki/Longest_common_subsequence_problem#Reduce_the_required_space
@Hua:The space reduction link you posted says that it is applicable only when we need to find the length of the LCS.I don't think it applies when we want to find the actual LCS.
Also do you think that suffix trees were such a bad idea?
If you take a look at Completed LCS Table in worked example section on wiki, you will see, to compute the values in current row, what's really needed is the last row.
I think suffix trees will also do the work, but it uses space alot.
There is no common URL visited by all the users. Should return empty set, shouldn't it?
Will it work for this input?
186 : A,B,C,X,Y
187 : A,B,C,X,Y
188 : M,N,X,Y
189 : D,E,X,Y
As per your algo:
LCS of (ABCXY, ABCXY) = ABC
LCS OF (ABC, MNXY) = Nothing
whereas you can see that the LCS of all sequences is XY
LCS of (ABCXY, ABCXY) IS NOT ABC.
LCS of (ABCXY, ABCXY) = ABCXY
LCS oF (ABCXY, MNXY) = XY
LCS of (XY, DEXY ) = XY
My algo is good for your case.
@Jim the suffix tree is helpful when there are large number of queries as it takes significant time in creation of suffix tree along with the space.... however in this case it will not be a good choice as the tree creation will take long and there are no repeated queries
@Hua in russell's example
186: x,a,b
187: a,b,x
188: x,d,e
189: d,e,x
The answer should be x but your algo will return NULL (as LCS between a,b and d,e will be NULL)
I think we can use the same algo but use it for more iteration to complete the solution .. so if we get NULL at end of first iteration we can only say that the LCS we started with doesnt have any common URL in all users....but we cant say for sure if there is no such solution...what we can do is ... take next LCS between first two strings(186 and 187 ignore the previous one)....and then do same thing for it...and we will have to do it till the LCS between the first two becomes NULL(by ignoring all previous LCS between them) ... only then we can say that there exists no solution between all users...lets assume we have p such LCS between first two strings then complexity increases to O(kmnp)
@vik:It seems this is a valid corner case (@Hua can verify for sure since he understands this very good).I think that if we sort the sequences prior to the algorithm then, Hua's algorithm would work since we would have after sort:
186: a,b,x
187: a,b,x
188: d,e,x
189: d,e,x
And we would get x as result. Runtime would not increase i.e. O(kmn + klogm)~ O(kmn).
Am I right?
@Jim: The goal is to find the longest consecutive URLs visited by users. If we sort the sequence first, the sequence property is broken and we couldn't get the expected answer.
@Vik: You solution looks good to me. I use you idea and maybe we can do something better. Here is the revised idea.
i.e.
186: x a b c y z
187: a b x y z d
LCS(xabcyz, abxyzd) = ab
LCS(x**cyz, **xyzd) = yz
LCS(x**c**, x****d) = x
LCS(***c**, *****d) = empty(iteration stops here)
So the idea is to iterate multiple times on the first two strings, till there no common sub-sequence found. After each iteration, we mark out the found sub-squence which we are not interested in the next iterations. So we mark them out. The reason of mark them up rather then remove them is we could find the wrong subsequence. i.e.
LCS( abxy, xaby ) = ab
LCS( xy, xy ) = xy <-- wrong.
LCS( **xy, x**y ) = x <-- correct.
LCS( ***y, ***y ) = y <-- correct.
Back to the idea, now we have all the common sub-sequences from the first two strings. The final answer(if there is one) must be in one of those common sub-sequences(at least be part of it). We can just do like this.
For the third string, let's name it t. find the LCS of t with each of common sub-sequences from first two strings, put this result into a set. After it's all done, we have the set which is the common sub-sequence for first three strings and the answer should be in the set or must be at least some part of a set element. Apply this logic to the rest of strings, and we will have a final set, and we choose the longest set element as our answer. If at any step, the set is empty, we can return empty set immediately because there is not such common sub-sequence visited by all the users. Does it make sense? Any suggestion is appreciated.
@Hua: Although LCS for 2 sequences can be solved using dynamic programming in O(m*n) time, the general case: LCS for k sequences, is a known NP-hard problem. That is to say, there is no known algorithm to solve this problem that runs in polynomial time. No matter how you arrange the order of doing LCS on two sequences, there will be some corner case where your algorithm will fail.
@Jim: The above said, I doubt whether your interviewer wants you to find the longest sub-array (consecutive URLs) or the longest sub-sequence, since there is no known efficient algorithm to solve the latter. You'll just have to enumerate all possible URL combinations and pick the longest one. It's another story if the interviewer wants you to find longest sub-array. For example, you can build suffix trees. I think it is called suffix array when you build multiple sequences into one tree.
@freshboy:I told him to use suffix tree but he said to much space find another way.I did not tell anything about DP version.But did something similar to what you side, i.e. use a map of user-id to sequences and then iterate over all sequences to find the longest one.At least that was the idea. I was rejected though
Divide the log lines by user (so you have a list of URLs for each user).
Find all common subsequence of user-1 and user-2 and sort them by decreasing order.
For each common subseuqnce found above, search for it in the list of each user. If it's not found then eliminate it. The first subseuquence that's present in the lists of all the users is the longest common subseuqnece.
This would take O(N)*<number of common subseuqneces between user1 and user2>
where N is the log length.
Approach :
Store as HashMap<String,HashSet<String>> where key is the url and set is the number of unique user who have visited this site.Maintain another HashSet<String> to get the count of unique user.All of the above work is O(n).
Then traverse through the HashMap and check if the size of the HashSet is equal to the count of the unique user id...if yes then this URL goes to the max subset(assuming the question asks for URL visited by all user)....This also takes O(n) time.
So total time complexity is O(n)....
Please comment in case I have understood something wrong.....
I am not sure I understand what you say.First of all the question is about URL sequences.How do you get it if in your HashMap you use as key a single URL? Also I am not sure what you store in the 2 HashSet<String>.In the first you say "the number of unique user who have visited this site".Is it a number then? And what is the use of the other HashSet?
my solution for this
package com.learning.puzzle;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
public class URLLogging {
public static void main(String...strings ){
/*
Build the problem in objects
*/
ArrayList<Browsing> arrys = new ArrayList<Browsing>();
arrys.add(new Browsing(186,"A"));
arrys.add(new Browsing(187,"B"));
arrys.add(new Browsing(186,"B"));
arrys.add(new Browsing(188,"A"));
arrys.add(new Browsing(189,"C"));
arrys.add(new Browsing(188,"B"));
// String the browsing URL
Map<Integer,String> map = new HashMap<Integer,String>();
for(Browsing browse : arrys){
if(!map.containsKey(browse.userid)){
map.put(browse.userid, browse.url);
}
else{
String value = map.get(browse.userid) +"-->"+browse.url;
map.put(browse.userid, value);
}
}
// Counting the Max Browsing URL
Collection<String> values = map.values();
Map<String,Integer> valueMap = new HashMap();
for(String url : values){
if(!valueMap.containsKey(url)){
valueMap.put(url, 1);
}
else{
int value = valueMap.get(url) +1;
valueMap.put(url, value);
}
}
ValueComparator bvc = new ValueComparator(valueMap);
TreeMap<String,Integer> sorted_map = new TreeMap<String,Integer>(bvc);
sorted_map.putAll(valueMap);
System.out.println("results: "+sorted_map);
}
}
class ValueComparator implements Comparator<String> {
Map<String, Integer> base;
public ValueComparator(Map<String, Integer> base) {
this.base = base;
}
// Note: this comparator imposes orderings that are inconsistent with equals.
public int compare(String a, String b) {
if (base.get(a) >= base.get(b)) {
return -1;
} else {
return 1;
} // returning 0 would merge keys
}
}
// user n url holding object
class Browsing {
int userid;
String url;
public Browsing(int userid, String url) {
super();
this.userid = userid;
this.url = url;
}
}
I have written the code in Java. But this would only find sequences specific to the URL given. For example, it would just store "abc" from the below file content in HashMap and would only use that to compare and increase the count. It would ignore whatever is present after abc.
/*
Input file would be something like this:
Jon abc.com
Jim def.com
Tim abc.com
Ann abc.com
Jim pqr.com
Tim abc.com
*/
public static void main(String[] args) throws IOException
{
HashMap<String, Integer> hm = new HashMap<String, Integer>();
BufferedReader br = new BufferedReader(new FileReader("File.txt"));
String str;
boolean flag = false;
String[] arrstr;
while((str=br.readLine()) != null)
{
flag = false;
arrstr = str.split(" ");
arrstr = arrstr[1].split("\\.");
if(hm.containsKey(arrstr[0]))
hm.put(arrstr[0], hm.get(arrstr[0])+1);
else
hm.put(arrstr[0], 1);
}
System.out.println(hm.values());
System.out.println(hm.entrySet());
}
Isn't finding longest common subsequence for more than 2 sequences NP hard?
- Vineet January 24, 2013