You have a web server's log th

Amazon Interview Question for Software Engineer / Developers

0

of 0 votes

37
Answers
You have a web server's log that records for each user the URL that he accessed.
Example format:
```
Time-stamp  User-id  URL
```
How to find the maximum common (sub)sequence of visited URLs from all users? Write code in java
Log entries are ordered based on timestamp.
Example log (omitting timestamp for clarity):
John URL1
John URL2
Jim URL2
Mary URL1
John URL4
etc
Update:
Example:
```
User-id    URL
186           A
187           B
186           C
188           B
186           C
187           A
186           B
188           A
188           C
189           A
189           D
187           C
189           B
186           A
187           C
189           A
189           C
```
So the max common URL sequence is A,C,C (URL A followed by URL C, followed by URL C)
- Jim January 20, 2013 in United States | Report Duplicate | Flag | PURGE
Amazon Software Engineer / Developer Java

Email me when people comment.

An error occurred in subscribing you.

Country: United States
Interview Type: In-Person

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 1 vote

Isn't finding longest common subsequence for more than 2 sequences NP hard?

- Vineet January 24, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

I think it is tractable if you have a specific number of Strings

- Jim January 24, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

what do you mean by common sequence? can you give the expected result of your example?

- zyfo2 January 21, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Updated post with an example.I hope it is clear now

- Jim January 21, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 4 vote

Traverse the log, put every record into a hashtable, userId is the key, the list of url is the value. For the example Jim updated, the hashtable looks like:
186: A, C, C, B, A
187: B, A, C, C
188: B, A, C
189: A, D, B, A, C

Compare every pair of list, it's the longest common subsequence problem. Use dynamic programming to solve it.
Time complexity: build hashtable: n + compare every pair of list: k^2 * (n/k) when k is the number of userId, n/k is the average length of the list. The total running time is nk.

- Hua January 21, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

What do you mean "compare every pair of list"?Won't this require an array [M][N] where M is the number of users and N is the URLS?By the way I initially thought suffix tree but interviewer complained for too much space

- Jim January 21, 2013 | Flag

Comment hidden because of low score. Click to expand.

-1

of 1 vote

It does require an array[M][N]. However we don't need to create this array, since we already have it as the values in hashtable. Just traverse hashtable, and compare the values of every two keys.

- Hua January 21, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

But if I find the longest common sequence in a pair, that does not mean that it is the same with another pair.I mean comparing in pairs seems erroneous.Am I wrong?

- Jim January 21, 2013 | Flag

Comment hidden because of low score. Click to expand.

-1

of 1 vote

I thought the question is to find the longest common subsequence between any two users, since the answer in example here is sequence A, C, C which only users 186 and 187 visited.

In the question, it is to find subsequence "from all users". And answer to the example is A, C. it's even better. The only change to the my solution is at step 2, compare the value of the first key in hashtable with the value of the second key. get the LCS and compare with the value of the next key, repeat this getting LCS, and comparing it with next key's value, until done with the hashtable traversal.
i.e.
186: A, C, C, B, A
187: B, A, C, C
188: B, A, C
189: A, D, B, A, C

LCS(ACCBA, BACC) = ACC
LCS( ACC, BAC) = AC
LCS( AC, ADBAC) = AC
resut: AC.

The total time complexity is n.

- Hua January 21, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

So to find the LCS for N strings you do it in pairs?When you say "get the LCS and compare with the value of the next key," compare the value of the next key with what?The previous LCS?Or the previous key?

- Jim January 21, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Previous LCS.

- Hua January 21, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@Hua, I agree your idea, it abvious a good choice to compare first two strings, and get their lcs, then use lcs to compare with next string. using dynamic programming, it really takes O(n) time complexity, but the space complexity is little more:O(n^2) .

- yingsun1228 January 22, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@Hua:"However we don't need to create this array, since we already have it as the values in hashtable". But to find the LCS of 2 strings the DP algorithm needs extra space.I haven't read a version of the algorithm not needing extra space.So I am not sure why you say we don't need the extra array

- Jim January 22, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@Hua:Also how is complexity O(N)?The algorithm for LCS is by definition O(mn) where m,n are the sizes of the substrings involved.

- Jim January 22, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

I agree. The time complexity is O(mn). The space complexity can be reduced by only keeping the values in last row we've computed, because we only need the last row to compute the values in current row. It's O(min(m, n))

- Hua January 22, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@Hua, It does not look like it will work on all set of data. consider the following in given example.
186: A, C, C, B, A
187: B, A, C, C
188: F,G
189: A, D, B, A, C
In this case
LCS(ACCBA, BACC) = ACC
LCS( ACC, FG) = Nothing
LCS(nothing, ADBAC)

Kindly expain

- sumi January 22, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@Hua:It is O(mn) for 2 strings.For k strings (url entries) it is O(kmn). Also I don't know the algorithm of LCS without extra space.If you could provide a reference or code what you are saying it would be really great!

- Jim January 22, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

To Sumi,
In the case you listed, there is no common subsequence visited by all the users. We can just return empty set once nothing is returned by LCS()

To Jim,
Yes, For k stirngs, it is O(kmn). Please take a look at wiki page.
en.wikipedia.org/wiki/Longest_common_subsequence_problem#Reduce_the_required_space

- Hua January 22, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@Hua:The space reduction link you posted says that it is applicable only when we need to find the length of the LCS.I don't think it applies when we want to find the actual LCS.
Also do you think that suffix trees were such a bad idea?

- Jim January 22, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

If you take a look at Completed LCS Table in worked example section on wiki, you will see, to compute the values in current row, what's really needed is the last row.
I think suffix trees will also do the work, but it uses space alot.

- Hua January 22, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@Hua, here is a counter-example:
186: x,a,b
187: a,b,x
188: x,d,e
189: d,e,x

- russell January 24, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

There is no common URL visited by all the users. Should return empty set, shouldn't it?

- Hua January 24, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Will it work for this input?
186 : A,B,C,X,Y
187 : A,B,C,X,Y
188 : M,N,X,Y
189 : D,E,X,Y

As per your algo:
LCS of (ABCXY, ABCXY) = ABC
LCS OF (ABC, MNXY) = Nothing

whereas you can see that the LCS of all sequences is XY

- Vineet January 24, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Good case. My algo fails on your case. I'll think of it. Thanks for your thinking.

- Hua January 24, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

LCS of (ABCXY, ABCXY) IS NOT ABC.

LCS of (ABCXY, ABCXY) = ABCXY
LCS oF (ABCXY, MNXY) = XY
LCS of (XY, DEXY ) = XY

My algo is good for your case.

- Hua January 24, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@Jim the suffix tree is helpful when there are large number of queries as it takes significant time in creation of suffix tree along with the space.... however in this case it will not be a good choice as the tree creation will take long and there are no repeated queries

@Hua in russell's example
186: x,a,b
187: a,b,x
188: x,d,e
189: d,e,x
The answer should be x but your algo will return NULL (as LCS between a,b and d,e will be NULL)

I think we can use the same algo but use it for more iteration to complete the solution .. so if we get NULL at end of first iteration we can only say that the LCS we started with doesnt have any common URL in all users....but we cant say for sure if there is no such solution...what we can do is ... take next LCS between first two strings(186 and 187 ignore the previous one)....and then do same thing for it...and we will have to do it till the LCS between the first two becomes NULL(by ignoring all previous LCS between them) ... only then we can say that there exists no solution between all users...lets assume we have p such LCS between first two strings then complexity increases to O(kmnp)

- vik January 24, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@vik:It seems this is a valid corner case (@Hua can verify for sure since he understands this very good).I think that if we sort the sequences prior to the algorithm then, Hua's algorithm would work since we would have after sort:

186: a,b,x
187: a,b,x
188: d,e,x
189: d,e,x

And we would get x as result. Runtime would not increase i.e. O(kmn + klogm)~ O(kmn).
Am I right?

- Jim January 24, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@Jim: The goal is to find the longest consecutive URLs visited by users. If we sort the sequence first, the sequence property is broken and we couldn't get the expected answer.

@Vik: You solution looks good to me. I use you idea and maybe we can do something better. Here is the revised idea.
i.e.
186: x a b c y z
187: a b x y z d
LCS(xabcyz, abxyzd) = ab
LCS(x**cyz, **xyzd) = yz
LCS(x**c**, x****d) = x
LCS(***c**, *****d) = empty(iteration stops here)

So the idea is to iterate multiple times on the first two strings, till there no common sub-sequence found. After each iteration, we mark out the found sub-squence which we are not interested in the next iterations. So we mark them out. The reason of mark them up rather then remove them is we could find the wrong subsequence. i.e.
LCS( abxy, xaby ) = ab
LCS( xy, xy ) = xy <-- wrong.
LCS( **xy, x**y ) = x <-- correct.
LCS( ***y, ***y ) = y <-- correct.

Back to the idea, now we have all the common sub-sequences from the first two strings. The final answer(if there is one) must be in one of those common sub-sequences(at least be part of it). We can just do like this.
For the third string, let's name it t. find the LCS of t with each of common sub-sequences from first two strings, put this result into a set. After it's all done, we have the set which is the common sub-sequence for first three strings and the answer should be in the set or must be at least some part of a set element. Apply this logic to the rest of strings, and we will have a final set, and we choose the longest set element as our answer. If at any step, the set is empty, we can return empty set immediately because there is not such common sub-sequence visited by all the users. Does it make sense? Any suggestion is appreciated.

- Hua January 25, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@Hua: Although LCS for 2 sequences can be solved using dynamic programming in O(m*n) time, the general case: LCS for k sequences, is a known NP-hard problem. That is to say, there is no known algorithm to solve this problem that runs in polynomial time. No matter how you arrange the order of doing LCS on two sequences, there will be some corner case where your algorithm will fail.

@Jim: The above said, I doubt whether your interviewer wants you to find the longest sub-array (consecutive URLs) or the longest sub-sequence, since there is no known efficient algorithm to solve the latter. You'll just have to enumerate all possible URL combinations and pick the longest one. It's another story if the interviewer wants you to find longest sub-array. For example, you can build suffix trees. I think it is called suffix array when you build multiple sequences into one tree.

- freshboy January 26, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@freshboy:I told him to use suffix tree but he said to much space find another way.I did not tell anything about DP version.But did something similar to what you side, i.e. use a map of user-id to sequences and then iterate over all sequences to find the longest one.At least that was the idea. I was rejected though

- Jim January 27, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

Divide the log lines by user (so you have a list of URLs for each user).
Find all common subsequence of user-1 and user-2 and sort them by decreasing order.
For each common subseuqnce found above, search for it in the list of each user. If it's not found then eliminate it. The first subseuquence that's present in the lists of all the users is the longest common subseuqnece.
This would take O(N)*<number of common subseuqneces between user1 and user2>
where N is the log length.

- gen-y-s January 21, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

I don't follow you here.You say to find all the common subsequences of User-1 and User-2.Ok. 1)Why sort them? 2)Why would the common sequences of specifically User-1 and User-2 are good indicators to search in the rest?

- Jim January 21, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

Approach :
Store as HashMap<String,HashSet<String>> where key is the url and set is the number of unique user who have visited this site.Maintain another HashSet<String> to get the count of unique user.All of the above work is O(n).

Then traverse through the HashMap and check if the size of the HashSet is equal to the count of the unique user id...if yes then this URL goes to the max subset(assuming the question asks for URL visited by all user)....This also takes O(n) time.
So total time complexity is O(n)....

Please comment in case I have understood something wrong.....

- Anuj Agrawal February 03, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

I am not sure I understand what you say.First of all the question is about URL sequences.How do you get it if in your HashMap you use as key a single URL? Also I am not sure what you store in the 2 HashSet<String>.In the first you say "the number of unique user who have visited this site".Is it a number then? And what is the use of the other HashSet?

- Jim February 04, 2013 | Flag

Comment hidden because of low score. Click to expand.

-1

of 1 vote

my solution for this

package com.learning.puzzle;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class URLLogging {
	
	public static void main(String...strings ){
		/*
		  Build the problem in objects
		*/
		ArrayList<Browsing> arrys = new ArrayList<Browsing>();
		arrys.add(new Browsing(186,"A"));
		arrys.add(new Browsing(187,"B"));
		arrys.add(new Browsing(186,"B"));
		arrys.add(new Browsing(188,"A"));
		arrys.add(new Browsing(189,"C"));
		arrys.add(new Browsing(188,"B"));
		
		// String the browsing URL
		Map<Integer,String> map = new HashMap<Integer,String>();
		for(Browsing browse : arrys){
			if(!map.containsKey(browse.userid)){
			map.put(browse.userid, browse.url);
			}
			else{
				String value = map.get(browse.userid) +"-->"+browse.url;
				map.put(browse.userid, value);
			}
		}
		
		// Counting the Max Browsing URL
		
		Collection<String> values = map.values();
		Map<String,Integer> valueMap = new HashMap();
		for(String url : values){
			if(!valueMap.containsKey(url)){
				valueMap.put(url, 1);
				}
				else{
					int value = valueMap.get(url) +1;
					valueMap.put(url, value);
				}	
		}
		
		ValueComparator bvc =  new ValueComparator(valueMap);		
        TreeMap<String,Integer> sorted_map = new TreeMap<String,Integer>(bvc);
        sorted_map.putAll(valueMap);
        System.out.println("results: "+sorted_map);
		
	}
}

class ValueComparator implements Comparator<String> {
    Map<String, Integer> base;
    public ValueComparator(Map<String, Integer> base) {
        this.base = base;
    }
    // Note: this comparator imposes orderings that are inconsistent with equals.    
    public int compare(String a, String b) {
        if (base.get(a) >= base.get(b)) {
            return -1;
        } else {
            return 1;
        } // returning 0 would merge keys
    }
}
    
// user n url holding object
class Browsing {
	int userid;
	String url;
	public Browsing(int userid, String url) {
		super();
		this.userid = userid;
		this.url = url;
	}	
}

- http://javadecodedquestions.blogspot.sg/2012/03/java-interview-with-investment-bank-4.html February 16, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

-2

of 2 vote

I have written the code in Java. But this would only find sequences specific to the URL given. For example, it would just store "abc" from the below file content in HashMap and would only use that to compare and increase the count. It would ignore whatever is present after abc.

/*
Input file would be something like this:
Jon abc.com
Jim def.com
Tim abc.com
Ann abc.com
Jim pqr.com
Tim abc.com
*/
public static void main(String[] args) throws IOException
{
                 HashMap<String, Integer> hm = new HashMap<String, Integer>();
		
		BufferedReader br = new BufferedReader(new FileReader("File.txt"));
		String str;
		boolean flag = false;
		String[] arrstr;
		while((str=br.readLine()) != null)
		{
			flag = false;
			arrstr = str.split(" ");
			arrstr = arrstr[1].split("\\.");
			
			
			if(hm.containsKey(arrstr[0]))
				hm.put(arrstr[0], hm.get(arrstr[0])+1);
			else			
				hm.put(arrstr[0], 1);
			
		}
		
		System.out.println(hm.values());
		System.out.println(hm.entrySet());
}

- n9lbuzz January 20, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

But that is not a solution.The interviewer asked to find the common URLs accessed by all users in the log. Specifically the max common among all users

- Jim January 20, 2013 | Flag

CareerCup

Amazon Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews