Google Interview Question
Software Engineer / Developer
Country: India
Interview Type: In-Person
I thought tries have insertion time O(n log n) for n elements.
Are you sure it's O(n)? It's a tree structure...
The trie insertion time is O(k), where k is the length of the string being inserted, so the total time is O(nk). Since the length of the URLs doesn't depend on n, though, and can be expected to be relatively constant, this is O(n).
Storage is also going to be O(n) -- and a hash table is not as space efficient, since it has to store the entire string in case of collision, whereas in a trie all partial matches are collapsed together, no?
An incorrect solution will not be O(n), as you would be deleting nodes from a linked list -- traversing to the node to delete takes extra time.
To answer the question of the user above:
there is no extra time needed to "parse till the node to be deleted".
The following code is my implementation of the same idea as the original poster's:
the linked list is combined with the trie, so
once you have finished walking a URL through the trie,
you already have the node in the linked list to be deleted.
/* The fields 'parent' and 'child' are for the trie.
 * The fields 'prev' and 'next' are for the doubly linked list.
 * I presume that the character set of a URL is {'a'-'z', '.', '/', ':'},
 * hence the 29 children per node. In a real interview, discuss this
 * assumption with your interviewer.
 */
class Node
{
    public Node parent = null;
    public Node[] child = new Node[29]; /* 26 letters + '.', '/', ':' */
    public Node prev = null;
    public Node next = null;
    public char data = ' ';
    public int count = 0;
}
class G105
{
    public static String firstUnique(String[] array)
    {
        Node root = new Node();
        Node first = null;
        Node last = null;
        for (String s : array)
        {
            Node current = root;
            /* Walk (and extend) the trie */
            for (char c : s.toCharArray())
            {
                int index;
                if (':' == c)      { index = 28; }
                else if ('/' == c) { index = 27; }
                else if ('.' == c) { index = 26; }
                else               { index = c - 'a'; }
                if (null == current.child[index])
                {
                    Node next = new Node();
                    next.parent = current;
                    next.data = c;
                    current.child[index] = next;
                }
                current = current.child[index];
            }
            /* Update the linked list of URLs seen exactly once */
            current.count++;
            if (1 == current.count)
            {
                /* First sighting: append to the list */
                if (null == first) { first = last = current; }
                else
                {
                    current.prev = last;
                    last.next = current;
                    last = current;
                }
            }
            else if (2 == current.count)
            {
                /* Second sighting: unlink from the list in O(1) */
                if (current == first)
                {
                    first = current.next;
                    if (null != first) { first.prev = null; }
                }
                else
                {
                    Node prev = current.prev;
                    prev.next = current.next;
                    if (null != current.next)
                    {
                        current.next.prev = prev;
                    }
                }
                /* Keep 'last' valid when the removed node was the tail */
                if (current == last) { last = current.prev; }
            }
        }
        if (null == first) { return null; } /* no unique URL at all */
        /* Reconstruct the URL by walking from the leaf up to the root */
        StringBuffer sb = new StringBuffer();
        Node current = first;
        while (current != root)
        {
            sb.append(current.data);
            current = current.parent;
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args)
    {
        String[] urls = {
            "abc.google.com",
            "abc.facebook.com",
            "abc.amazon.com",
            "abc.yahoo.com",
            "abc.facebook.com",
            "abc.yahoo.com",
            "abc.facebook.com",
            "abc.google.com"
        };
        System.out.println(firstUnique(urls));
    }
}
Even though the delete operation on a singly or doubly linked list is O(1) once we have access to the current and previous pointers, finding the element to delete is still O(n). Therefore I don't think the overall delete operation is O(1) when finding that element for removal is necessary.
I believe you used a hash (or trie, or something similar) to store the counts of each URL.
To make it one traversal, you only need a doubly linked list that links the so-far-unique ones (either link the items in the hash table or do it separately) as you traverse through.
Remove a URL from the list if its count goes over 1,
and of course connect its predecessor and successor (which is implied when I say "remove").
So after one traversal, the first element of your linked list is the desired one.
Extra space for two pointers per URL is required for this solution.
Suppose you are in the middle of traversing the list and have a half-built linked list. Now you get another URL -- what do you do? You traverse the linked list until you find the match. But how are you going to do this "traverse through" for each element in overall O(n) complexity?
The doubly linked list is in addition to the hash table.
Whether to remove or to insert a node in this list depends on the count of the URL in the hash table.
If we have a hash table, then why do we need a linked list? Can't we just traverse the array and check the count for each one?
What does unique mean? appearing exactly once?
Hash the URL and use the hash to check for dupes; that should be faster than plain old string matching.
I think this might work in linear time.
Assign a prime number to each character (letters, digits and special characters), e.g. 'a' -> 2, 'b' -> 3, '1' -> 17, '#' -> 29. Compute a hash code for the URL based on each character's prime number and its position in the URL.
After computing the hash code, insert it into a HashSet (in Java). Do this for every URL. In case the hash code of a URL matches the hash code of some previously inserted URL, that is the first duplicate.
Sort the URL list using quicksort in O(n log n). Scan through the sorted array to find the unique URLs in O(n) time. Total time: O(n log n).
PS: Here sorting and scanning involve string comparison. Say each string has length k; we can treat k as a constant factor.
Seems like quite a few here have missed a part of the question. The question is to find the first unique URL -- not any unique URL, and not the first duplicate.
A solution would be
* Go through the URLs, storing in a hash table, for key=url, a boolean value that represents whether the URL has been seen once or more than once. That is the first pass.
* Go through the URLs again, and for each URL check in the hash table how many times it has occurred. The first one that has occurred only once is the URL you want.
Now, there are some space constraints that I haven't considered. Millions of URLs. Let's say 10 million URLs. Each URL has let's say 50 characters on average. That's 500 million characters. That takes up a lot of space even though it is doable. However if I am using a hash table, I am not storing the entire URL, but just the hash code and a boolean. That would be let's say 4 bytes for hash code and a bit for boolean. ~40 MB for hash table. That's not bad at all.
One way to optimize the second pass would be .. if I could create another array that contains the URLs in the order in which they appear in the original list, however without any duplicates. I can construct this during the first pass. So during the second pass, I could go through this array, that way reducing the number of URLs I am processing. However this needs additional memory.
Now if I could make this more challenging, what if the number of URLs is not millions but billions. In that case the hash table cannot be stored in memory. It would be in the order of 40GB. How would we solve the problem then?
One solution I can think of is to use distributed computing. Have multiple server machines store the hash table pieces in their local memory. Since network communication could be faster than disk reads, this could work well. Otherwise the hash table could be stored on disk. This would be slow because you don't know where the hash code of next URL will be in the hash table memory.
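The two-pass idea above can be sketched as follows. This version stores the full URL strings as map keys rather than 4-byte hash codes, trading the ~40 MB estimate for exactness; the class and method names are mine.

```java
import java.util.HashMap;
import java.util.Map;

public class FirstUniqueTwoPass {
    // Pass 1: count occurrences. Pass 2: return the first URL whose count is 1.
    static String firstUnique(String[] urls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : urls) {
            counts.merge(url, 1, Integer::sum);
        }
        for (String url : urls) {
            if (counts.get(url) == 1) {
                return url;
            }
        }
        return null; // no unique URL exists
    }

    public static void main(String[] args) {
        String[] urls = {"a.com", "b.com", "a.com", "c.com"};
        System.out.println(firstUnique(urls)); // b.com
    }
}
```

Because the second pass walks the original array in order, "first" here really means first by position in the input, which a plain hash-table scan would not guarantee.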
How will you check for duplicates if you don't store the entire url? After all, if the hash is just 4 bytes, there is a non-zero chance of a collision.
Also, you have perhaps missed an important step: Canonicalize the URLS. There could be multiple versions of the same url, which might hash differently but are actually the same.
Also, your comment about hash tables and hard disk access is incorrect if you use chained linear hashing. In fact, the original intent behind inventing the hash table was fast disk access: an array -> quick access, followed by a linked-list access, which is fast if the size of the linked list (the number of collisions for that hash) is small.
And you could just store the least index of the url, (whether it was first, second, nth etc) in the hash table itself and just walk the hash table ignoring dupes. For this just a bit array would be sufficient as a supporting structure to figure out the first unique.
Also, if the URLs are valid, what about using IP addresses? With a (Patricia) trie of IP addresses plus a lookup structure (say a hash or tree) per IP address, you could perhaps cut down on space and make the lookups faster.
One way would be to maintain a Bloom filter for the URL's that you have seen before. They are great for negative membership queries.
The question is to find the 1st unique URL. If you take the hash of all the URLs and then traverse each hash to check the frequency, you may be able to get all the unique URLs, but certainly not the one that comes first.
One way to come up with the 1st unique URL is to go through the list in the order in which the URLs are presented to us, and then look at the set of unique URLs (found by Rajesh's method) to determine which is the 1st unique.
I like the bloom filter idea indeed. Now, since the list is very long but finite (ie. not a stream), we can iterate it backwards, keeping track of the last URL that we don't find in the bloom filter.
def first_unique_url(alist):
    latest_unique = None
    bloom = BloomFilter()  # assuming some class here
    for url in reversed(alist):
        if not bloom.contains(url):
            bloom.add(url)
            latest_unique = url
    return latest_unique
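The pseudocode above assumes some BloomFilter class. A minimal Java sketch of one is below; the bit-array size, the number of hash functions, and the double-hashing scheme are illustrative choices of mine, not a real library's API.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch. All names and parameters are illustrative.
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive k bit indices from two base hashes (double hashing).
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // a cheap second hash
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String s) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(s, i));
        }
    }

    // May return false positives, never false negatives.
    public boolean contains(String s) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(s, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Note that contains() can return false positives, so an exact "first unique" answer would still need a verification step against the real data.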
How about using a LinkedHashMap?
HashMap: md5(url) -> reference to a node in a doubly linked list

while (lines) {
    if (md5(url) exists in the map)
        delete its node from the doubly linked list
    else {
        add the url to the end of the linked-list queue
        set the map value to that node (the queue end)
    }
}
return the first node of the queue.
I think this can be done in O(n)
A trie would be better. But instead of keeping a count of URLs, we should store the index of the string in the list. Whenever we encounter the URL again, set the index to -1. The trie's terminal nodes then hold either null (never set), -1 (set more than once), or a positive index of the URL in the list (set exactly once). Among all the non-null, non-negative indexes, the minimum index gives you the first unique URL.
uniq filename.txt > f2.txt
Using unix commands? Not sure about the time complexity, though. (Note that uniq only collapses adjacent duplicates, so the input would have to be sorted first -- and it's uniq -u, not plain uniq, that prints the lines occurring exactly once.)
As fentoyal says, put them into a hash table. Add to each actual bucket entry an iterator into a doubly-linked list of those items that have been seen once. When you walk down the original list of urls, if you don't find it in the hash table, add it, and add its entry (hash table iterator, or raw url, either way) to the back of the linked list. If you do find the entry in the hash table, look at the iterator stored there. If it's null, you've seen it 2+ times already, do nothing. If the iterator there is non-null, use it to pull the item out of the linked list (O(1) time), and then set the iterator in the hash table to null.
When you've gone through the list of urls, pick the front-most (which is the first added, that was not subsequently removed) url from the linked list.
I don't think this will give O(n) time.
Look at the case where an element was previously considered unique and, as we iterate through the sequence, we find that it has a duplicate. We have to remove it from the hash, which is fine as it can be done in O(1), but removing the item from the linked list takes O(n).
Doesn't this push the overall running time to O(n^2)?
No, no - you never remove the url from the hash table (otherwise if you see it again, you'd think it was the first time). There are 3 states - 1) it's a new url (add to hash table and to back of list, add that list iterator to the url's hashtable entry), 2) you've seen the url and the hash table still has a valid iterator to the list (you've seen it once - so we remove the url from the list, and remove the iterator from the url's entry in the hashtable) 3) you've seen the url, and its hash table entry no longer has the iterator -> you've seen the url 2+ times. It's not in the list, so it's in no danger of being thought unique - do nothing).
... to answer your objection more specifically, you're talking about case 2). We don't have to iterate through the doubly-linked list to find our candidate, because removing an element from a linked list is O(1), if you know exactly where it is. (I don't know how it is in Java, I'm sure it's the same, but in C++ you can have an iterator (acts like a pointer) to a list element. Being doubly-linked, removal is constant-time, and you don't mess up any other iterators.) And, again, you never take the url's entry out of the hash table.
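A Java sketch of this hash-table-plus-list scheme: Java has no removal-safe list iterators like C++, so the map stores the list node directly. All class and field names here are mine.

```java
import java.util.HashMap;
import java.util.Map;

// One pass: a hash map from URL to its node in a doubly linked list of
// URLs seen exactly once. A null map value means "seen 2+ times".
public class FirstUniqueOnePass {
    static class DNode {
        final String url;
        DNode prev, next;
        DNode(String url) { this.url = url; }
    }

    static String firstUnique(String[] urls) {
        Map<String, DNode> seenOnce = new HashMap<>();
        DNode head = new DNode(null), tail = new DNode(null); // sentinels
        head.next = tail;
        tail.prev = head;

        for (String url : urls) {
            if (!seenOnce.containsKey(url)) {
                DNode node = new DNode(url); // first sighting: append before tail
                node.prev = tail.prev;
                node.next = tail;
                tail.prev.next = node;
                tail.prev = node;
                seenOnce.put(url, node);
            } else {
                DNode node = seenOnce.get(url);
                if (node != null) { // second sighting: unlink in O(1)
                    node.prev.next = node.next;
                    node.next.prev = node.prev;
                    seenOnce.put(url, null); // keep key so 3rd+ sightings do nothing
                }
            }
        }
        return head.next == tail ? null : head.next.url;
    }
}
```

The key point matches the explanation above: the URL's entry never leaves the map, so the three states (new, seen once, seen 2+ times) are distinguishable and every list operation is constant time.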
I think the space complexity here is way too high:
1. Keeping the LONG list of URLs as a DLL.
2. Maintaining a HashMap.
A slightly better solution is to use 2 hash maps:
1. unique_url
2. all_url

Traverse through the URLs:
    if the URL is present in all_url
        remove the URL from unique_url
    else
        add the URL to unique_url
        add the URL to all_url
end

Items left in unique_url will be unique.
> Items left in unique_url will be unique
... but how will you know which was first? And, my doubly-linked list contains only unique items, it should take less space than another hash table.
In unique_url you can store the URL as key and the order of occurrence (1, 2, 3, 4, ...) as value, or use a Map which keeps track of insertion order -- but I guess that uses a DLL internally.
I did not see that you only store unique items in DLL.
Both solutions have the same complexity; I guess the choice would come down to what is easier to read and maintain.
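For reference, the two-table idea with the ordering concern from the discussion above might look like this in Java; the names are illustrative, and a LinkedHashSet plays the role of unique_url so the *first* unique can be recovered.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class TwoSetFirstUnique {
    static String firstUnique(String[] urls) {
        Set<String> all = new HashSet<>();
        Set<String> unique = new LinkedHashSet<>(); // preserves insertion order
        for (String url : urls) {
            if (all.contains(url)) {
                unique.remove(url); // second or later sighting: no longer unique
            } else {
                all.add(url);
                unique.add(url);
            }
        }
        // first remaining element, in original insertion order
        return unique.isEmpty() ? null : unique.iterator().next();
    }
}
```

Keeping the `all` set alongside `unique` matters: a URL seen three or more times stays in `all`, so it is never mistakenly re-added to `unique`.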
I guess you can first create a hash map keyed by URL whose value contains:
i) the number of occurrences of the URL, and
ii) the corresponding linked-list node containing the URL.
So whenever a new URL arrives, check whether it is contained in the hash map:
i) If it is contained and the number of occurrences is 1, increase the count by 1 and delete the corresponding linked-list node.
ii) If it is not contained in the hash map, create a new node, connect it to the tail of the linked list, and update the hash map accordingly.
iii) If it is contained in the hash map and the number of occurrences is more than 1, the corresponding node has already been deleted, and we do not need to do anything.
The head of the linked list gives the answer.
This should work, but you need to use a doubly linked list, and you need to remember a pointer to the node of the doubly linked list in the hash map as well. Once a URL is found to be a duplicate, increment the counter, delete the node, and also set the pointer in the hash map to null. The rest works as you described.
<pre lang="" line="1" title="CodeMonkey48234" class="run-this">Dictionary<string, int> urlCount = new Dictionary<string, int>();
Queue<string> uniqueUrls = new Queue<string>();
foreach(string url in File.ReadLines("urls.txt")) // hypothetical input file
{
    if(!urlCount.ContainsKey(url))
    {
        urlCount.Add(url, 1);
        uniqueUrls.Enqueue(url);
    }
    else
    {
        // count > 1
        urlCount[url]++;
    }
}
// A URL may still be queued even though it was later duplicated, so skip those here.
while(uniqueUrls.Count > 0 && urlCount[uniqueUrls.Peek()] > 1)
{
    uniqueUrls.Dequeue();
}
return uniqueUrls.Count > 0 ? uniqueUrls.Dequeue() : null;</pre><pre title="CodeMonkey48234" input="yes">
</pre>
<pre lang="" line="1" title="CodeMonkey11335" class="run-this">package career.cup.google;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.NoSuchElementException;
import java.util.Random;

/**
 * Problem: Given a very long list of URLs, find the first URL which is
 * unique (occurred exactly once). I gave an O(n) extra space and O(2n)
 * time solution, but he was expecting O(n) time, one traversal.
 */
public class G94FirstUniqueURL {

    /**
     * @param url the list of URLs
     * @return the 1st non-duplicate URL; otherwise throws NoSuchElementException.
     *
     * Traverse the list. A HashMap records the traversal status:
     * key = URL, value = the set of list indexes at which that URL occurs.
     * A LinkedHashSet keeps the indexes assumed (so far) not to be duplicates;
     * indexes that turn out to be duplicates are removed from it.
     * This data structure is used because, at the end of the algorithm, its
     * first element is the index of the first unique URL.
     * If the set is empty, throw NoSuchElementException.
     */
    public static String findFirstUniqueURL(ArrayList<String> url) {
        HashMap<String, HashSet<Integer>> hm = new HashMap<String, HashSet<Integer>>();
        LinkedHashSet<Integer> fIndex = new LinkedHashSet<Integer>();
        Iterator<String> it = url.iterator();
        int listIndex = -1;
        while (it.hasNext()) {
            String u = it.next();
            listIndex++;
            if (!hm.containsKey(u)) {
                HashSet<Integer> idx = new HashSet<Integer>();
                idx.add(listIndex);
                hm.put(u, idx);
                fIndex.add(listIndex);
            } else {
                hm.get(u).add(listIndex);
                deleteDuplicateIndexes(hm.get(u), fIndex);
                hm.put(u, new HashSet<Integer>());
            }
        }
        if (fIndex.isEmpty())
            throw new NoSuchElementException("No Unique URL");
        return url.get(fIndex.iterator().next());
    }

    /**
     * @param idx duplicate indexes
     * @param in  indexes still assumed unique
     */
    private static void deleteDuplicateIndexes(HashSet<Integer> idx,
            LinkedHashSet<Integer> in) {
        for (Integer i : idx) {
            in.remove(i);
        }
    }

    public static void main(String[] args) {
        Random r = new Random();
        ArrayList<String> url = new ArrayList<String>();
        for (int i = 0; i < 50; i++) {
            Integer v = r.nextInt(37);
            url.add(v.toString());
            System.out.printf("%d ", v);
        }
        System.out.println();
        System.out.println(findFirstUniqueURL(url));
    }
}
</pre><pre title="CodeMonkey11335" input="yes">
</pre>
You can solve this in O(n) using a combination of a trie and a linked list. The leaf node of the trie maintains a flag to record duplicate URLs and a pointer to a node in a linked list. If you encounter a new URL, add a node to the head of the linked list and set the pointer in the trie. Whenever you encounter a URL that is already in the trie: if the flag is not set, then set the flag, delete the node from the linked list, and set the pointer to null. If the flag is already set, then ignore it and read the next URL.
After processing all the URLs, the linked list contains only the unique URLs, and the node at the tail is the first unique URL in the list. For n URLs, inserting the URLs into the trie is O(n), and the linked-list operations are all constant time. Each node could just keep the index of the URL in the list so that we don't have to store the URLs in the linked list as well.
Wouldn't the algorithm below work? (Rewritten with String and a null sentinel so it compiles; the original compared a Url object to Integer.MIN_VALUE.)
public static String findUnique(String[] arr) {
    HashMap<String, Integer> map = new HashMap<String, Integer>();
    String minKey = null;
    for (int i = 0; i < arr.length; ++i) {
        String url = arr[i];
        if (map.containsKey(url)) {
            int count = map.get(url);
            count++;
            map.put(url, count);
            if (url.equals(minKey)) {
                minKey = null;
            }
        } else {
            if (minKey == null) {
                minKey = url;
            }
            map.put(url, 1);
        }
    }
    return minKey;
}
import java.util.ArrayList;
import java.util.HashMap;

public class UniqueUrl {
    HashMap<String, Boolean> urlKeyMap;
    ArrayList<String> urlList;

    public static void main(String[] args) {
        UniqueUrl uniqueUrl = new UniqueUrl();
        String[] stringArr = {"aab", "abc", "abc", "abd", "abd"};
        String nonRepeatedUrl = uniqueUrl.getUniqueUrl(stringArr);
        if (nonRepeatedUrl == null) {
            System.out.println("No such string");
        } else {
            System.out.println(nonRepeatedUrl);
        }
    }

    private String getUniqueUrl(String[] arr) {
        urlKeyMap = new HashMap<String, Boolean>();
        urlList = new ArrayList<String>();
        for (String str : arr) {
            if (urlKeyMap.containsKey(str)) {
                if (urlList.contains(str)) {
                    urlList.remove(str);
                }
            } else {
                urlKeyMap.put(str, true);
                urlList.add(str);
            }
        }
        if (urlList.size() == 0) {
            return null;
        }
        return urlList.get(0);
    }
}
public static String findFirstUnique(String[] list) {
    Set<String> seen = new HashSet<String>();
    // LinkedHashSet preserves insertion order, so the iterator's first element
    // really is the *first* unique URL (a plain HashSet would not guarantee that).
    Set<String> notSeen = new LinkedHashSet<String>();
    for (String s : list) {
        if (!seen.contains(s)) {
            seen.add(s);
            notSeen.add(s);
        } else {
            notSeen.remove(s);
        }
    }
    return notSeen.isEmpty() ? null : notSeen.iterator().next();
}
I think you are on to something. This can all be done with Linux/Unix commands.
If the master list of URLs is in a text file (master.txt) then

sort master.txt | uniq -c > occurrences.txt

will give you a file of the occurrence counts for each URL.
Obviously, the URL(s) with an associated count of one are what you want. (Note, though, that sorting loses the original order, so one more pass over master.txt is needed to tell which of those came first.)
If for some reason the URLs are scattered across different files, you can use grep
to pipe them into one file with something like

grep -r "http:*" *.* > master.txt

I don't know what the run time of unix sort is, but why would anyone write
a C++ program when Unix/Linux can tell you with no effort?
OK, so create two hash tables: one will store duplicate URLs, the other unique URLs.
1. Read a URL and compute its hash.
2. If the hash is already in the dup table, ignore it and go to step 1.
3. Otherwise, insert the hash into the unique table. If it already exists there, remove that key from the unique table and insert it into the dup table (ignoring insertion failures in the dup table). Go to step 1.
4. When done, the unique table contains exactly the URLs that are unique.
I think all of you missed that the guy wanted a real string algo... pretty hard core:
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

class node {
public:
    map<char, node*> children;
    char c;
    int index; /* index of the first URL that *terminates* at this node, or -1 */
    node() : c('*'), index(-1) {}
};

int find_first_unique(vector<string>& urls)
{
    if (urls.empty())
        return -1;
    node* root = new node;
    vector<bool> repeat(urls.size(), false);
    for (int i = 0; i < (int)urls.size(); i++) {
        string& url = urls[i];
        node* n = root;
        for (int j = 0; j < (int)url.length(); j++) {
            node* child;
            if ((child = n->children[url[j]]) == NULL) {
                child = new node;
                child->c = url[j];
                n->children[child->c] = child;
            }
            n = child;
        }
        /* Only a URL that ends at this node counts as a duplicate; a mere
           prefix of a longer URL must not be mistaken for one. */
        if (n->index < 0) {
            n->index = i;
        } else {
            repeat[i] = true;
            repeat[n->index] = true;
        }
    }
    for (int i = 0; i < (int)urls.size(); i++) {
        if (!repeat[i])
            return i;
    }
    return -2;
}

int main(int argc, char** a)
{
    vector<string> v;
    v.push_back("AAAA");
    v.push_back("AAAA");
    v.push_back("AAAAC");
    v.push_back("BBBB");
    cout << find_first_unique(v) << endl;
    return 0;
}
This does the job in a cleaner, shorter, faster way with the least memory possible:
static String findUniqueUrl(List<String> urls) {
    /** LinkedHashSet preserves insertion order, so you can eventually tell the
        first unique url. It provides constant-time performance for the basic
        operations (add, contains and remove), assuming the hash function
        disperses elements properly among the buckets. **/
    Set<String> unique = new LinkedHashSet<>();
    /** 'seen' is needed so that a url occurring three or more times is not
        re-added to 'unique' on its third sighting. **/
    Set<String> seen = new HashSet<>();
    for (String url : urls) {
        if (seen.contains(url)) {
            /** duplicate: drop it from the unique set (a no-op after the 2nd time) **/
            unique.remove(url);
        } else {
            /** add url the first time it is found **/
            seen.add(url);
            unique.add(url);
        }
    }
    /** the first remaining unique url, in original order **/
    Iterator<String> it = unique.iterator();
    return it.hasNext() ? it.next() : null;
}
Can try:
1) adding all values to a hash table: O(n)
2) performing a linear search over the hash table: O(n)
Use a min-heap with the URL as the heap data and the URL frequency as the heap comparator.
The top of the heap gives you a unique URL if its frequency is exactly 1.
O(n log n) time complexity, O(n) space complexity.
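A sketch of this heap idea (all names are mine): since frequencies are only final after seeing the whole list, count first, then order the distinct URLs by (frequency, first index). The heap top is then the first unique URL whenever its frequency is 1.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HeapFirstUnique {
    static String firstUnique(String[] urls) {
        Map<String, Integer> freq = new HashMap<>();
        Map<String, Integer> firstIndex = new HashMap<>();
        for (int i = 0; i < urls.length; i++) {
            freq.merge(urls[i], 1, Integer::sum);
            firstIndex.putIfAbsent(urls[i], i);
        }
        // Min-heap ordered by frequency, ties broken by first position.
        PriorityQueue<String> heap = new PriorityQueue<>((a, b) -> {
            int byFreq = Integer.compare(freq.get(a), freq.get(b));
            return byFreq != 0 ? byFreq : Integer.compare(firstIndex.get(a), firstIndex.get(b));
        });
        heap.addAll(freq.keySet());
        String top = heap.peek();
        return (top != null && freq.get(top) == 1) ? top : null;
    }
}
```

The heap is not strictly necessary (a linear scan over the counts does the same job, as other answers note), but this is a faithful O(n log n) rendering of the suggestion.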
I think the following code will work:
package psan;
import java.io.*;
import java.util.*;

public class UniqueUrlSearch
{
    public static void main(String... s) throws Exception
    {
        BufferedReader b = new BufferedReader(new InputStreamReader(System.in));
        LinkedList<String> ans = new LinkedList<String>();
        System.out.println("Enter the list");
        String list[] = b.readLine().split(" ");
        int len = list.length, itr = -1;
        int old = 1, current = 1, v = 0;
        String tmp = "", yy = "";
        HashMap<String, String> hs = new HashMap<String, String>();
        for (int i = 0; i < len; i++)
        {
            if (!hs.containsKey(list[i]))
            {
                try {
                    ++itr;
                    yy = yy + itr;
                    tmp = tmp + yy + " " + "1";
                    System.out.println("in if tmp=" + tmp);
                    hs.put(list[i], tmp);
                    ans.add(list[i]);
                } catch (Exception e) { e.printStackTrace(); }
            }
            else
            {
                try {
                    System.out.println("in else:");
                    String ss = hs.get(list[i]);
                    String x[] = ss.split(" ");
                    //for (int j = 0; j < x.length; j++) System.out.print(" " + x[j]);
                    int key = new Integer(x[1]);
                    int index = new Integer(x[0]);
                    key += old;
                    tmp = "";
                    tmp = tmp + index + " " + key;
                    hs.put(list[i], tmp);
                    //System.out.println("index=" + index + " " + "key=" + key);
                    if ((index - v) >= 0)
                        ans.remove(index - v);
                    //System.out.println("size of linked list=" + ans.size());
                    ++v;
                } catch (Exception e1) { e1.printStackTrace(); }
            }
            yy = "";
            tmp = "";
        }
        System.out.println("ans=" + ans);
    }
}
I was asked this question by an interviewer. I felt that no algorithm could be faster than O(n^2), but then I saw the methods using a hash table. Though I get the idea of using a hash table, the question simply shifts from finding the first unique URL to finding a good hash function. There can still be collisions, and those can push the lookup in the unique list above O(1).