Amazon Interview Question
Software Development ManagersCountry: United States
Interview Type: Phone Interview
There are many ways to solve this problem. Following ways can be used for this.
1) If change in position of numbers is allowed, then Sort the list of numbers and remove the duplicates. Sorting will require at least n*log(n) time if quick sort or merge sort is used.
2) Iterate through list n^2 times removing the duplicate elements by choosing a seed and then comparing that seed against each number. But this is not effient method.
3) Using hashing, we can map the numbers to some less number of bucket and then compare the numbers in each bucket and arrive at a list of unique number.
4) Take bit map of number. Because we know that it is 32 bit integer, we require 2^32 bits to represent the bit map image and initialize the image to all 0. On first occurrance of number, set corresponding bit of bit map to 1. If number appears again then remove it from list. This will be O(1), but requires constant memory space i.e. 2^32 bits = 2^29 bytes = 512 Megabytes.
IMO option 4 is the best and the correct solution to this problem. allocating 512 MB of memory for such a large set of input should not be a problem. However if the interviewer wants you to reduce the amount of memory required then also the same logic will apply but with a little change and slightly more but still constant processing time.
What we will do is divide the processing into N batches depending upon the memory restriction imposed by the interviewer. Consider N to be 2 for this example. So we will divide it into 2 passes. In the first pass of 5 billion numbers we will only consider numbers from 0 to 2^32/N = 2^31 and accordingly allocate 2^31/2^3 = 2^28 = 256 MB of memory. In the second pass we will consider elements in the integer range of 2^31 to 2^32 and again allocate 256 MB and remove duplicates according to technique mentioned in point 4.
Hence by applying the same algorithm with different ranges of number I have reduced my memory footprint by Half. We can further reduce it by incorporating more passes into it.
On the approach 4, the bit-field space is one area that may do well with another improvement. Instead of keeping one bit for each number which can be set on the first occurance, we could reduce the memory requirement by using hash and a bloom-filter. Typically a bloomfilter of size 8 bit per hash input can return with 2% false positive rate in O(1) time. For input size of 2^32 we can choose a good hash function of size 8*32 bits to give a relatively collision-resistant hash and then use a bloomfilter of 8*(8*32) bits to find out if the hash of the input pattern occurred before or the pattern is new.
If it is new, we add the hash to bloomfilter and if it occured before we remove the input pattern.
.
There were some mistakes in the calculation above. The hash space could be 8 or 10 bit size rendering 2^32 inputs to 2^8 or 2^10 unique hash output. Now allowing 8 bits or 10 bits [as recommended in Wiki for 1% false positive rate] the bloomfilter size would be 2^10*10 bits i.e. 1032*10 bits or 10Kb or 1.2 KB which is lot smaller than keeping a bit for every 32 bit number.
Allocate two byte arrays of (5 billion / 8) size - one for positive integers, second for negative integers.
Go through array. If the next item >= 0 then check whether bit number [item % 8] of the First Byte Array[item /8] is set. If it it set then we found the duplicate element, print it and remove from array. Set the bit if it is unset. continue.
Similar for item<0
Why cant we use 2 bit arrays each of size 2^31 without any of this % business? The space requirement for that will be 0.25 + 0.25 = 0.5 GB with no extra calculation required (which will save atleast 5 billion %8 operations)
It looks like you are considering all the elements are in the range of 5 billion, which is not mentioned in the problem description.
This can be solved easily with two Hastables, the first Hashtable is where I will keep all numbers that appear at least once in the input, the second Hashtable is where I will keep all the numbers that appear more than once in the input (ie. a duplicate). Simply iterate over the input in one pass, testing the first Hashtable and then the second Hashtable to see if it already contains the number, and insert to the Hashtable and continue if it is not there already.
But this solution has a worst-case space complexity of more than 2 GB (2 GB for the numbers between -2^31 to +2^31 plus additional overhead for maintaining the hashtable). The first solution posted in this article and the solutions posted in the article above come close to solving this in less space complexity. I am not sure if there are other better solutions.
Also, correct me if I am wrong, but I guess in the original solution, he/she meant to say "Allocate two byte arrays of (2 GB / 8) size". We only need to access the index pointed by arrayOfFiveGB[i] element to check if there is a duplicate.
One solution that seems reasonably efficient:
pass1: allocate an array of 2^32 elements each is two bit initialized to 00,
we iterate the 5-billion array and mark the 2^32 array as followed:
00: not seen
01: seen
10: duplicate
Then,we can output the duplicate number every time we see a switch from "seen" to "duplicate".
pass2: this is pass is only used for eliminating duplicate
create two pointers each pointing to the 0th element of the array
iterate the first pointer through the 5-billion array and use the number to index the 2^32 array to see whether we have a seen element or duplicate element,if we have a seen element,we copy it to what the second pointer points,and forward the second pointer by 1.
after the first pointer is at the tail,we know we are done and what the second pointer points is the tail of the new array.
public static void removeDups2(int[] origArr,int[] newArr )
{
//32 bit signed integer means max +ve value are 2^31-1 and 2^31-1 -ve integers.
//We will use the index to refer the actual integer
//2^31 is approx 2.2 Gig
BitSet positive=new BitSet((int)Math.pow(2, 31));
BitSet negative=new BitSet((int)Math.pow(2, 31)-1);
boolean isDuplicate=false;
for(int i=0,j=0;i<origArr.length;++i)
{
isDuplicate=false;
if(origArr[i]>=0)
{
//if the bit is set means its a duplicate else set the bit
if(!positive.get(origArr[i]))
positive.set(origArr[i]);
else
isDuplicate=true;
}
else if(origArr[i]<0)
{
//if the bit is set means its a duplicate else set the bit
if(!negative.get(origArr[i]*-1))//multiply by -1 to make it a postive index
negative.set(origArr[i]*-1);
else
isDuplicate=true;
}
if(isDuplicate)
{
out.println(origArr[i]+" is duplicate.. remove it");
origArr[i]=0;// this is necessary only if we are not using the 2nd array for output
}
else
newArr[j++]=origArr[i];
}
}
Create an array DUP of 2^32 size. initialized with -1 (or something)
Pass through the given array, and use the value of the array as index into the DUP array and if its -1 set it to 1. and if its alreay 1, then its already seen so print it....One pass solution is obtained
Improvemnt....Since it is a signed 32 bit numbers in the 5 b array...we can use 2^16 array size would be enough to keep track...0 if not seen any of the +/- number... last bit set....+ve number already seen second last bit set if -ve number seen.
Now if the current number is +/- and is already seen then its duplicate.....now the size of array is reduced by a lot
Well ...since only 2 bits are being used in every element of 2^16 array....remaining bits can be used further reducing the memory by half. Hence for can be further reduced by using all bits of each element arrray...so memory is requirement would become very lesss and negligible...and working with bits are faster..improves.!
[C#]
Use a HashSet() to keep track of 32bit ints that have been seen.
Also need to overwrite duplicate entries, with non-duplicates...
static void RemoveAndPrintDuplicates( BigArray<int>[] input)
{
HashSet<int> set = new HashSet<int>();
Int64 destIndex = 0;
for (Int64 srcIndex = 0; srcIndex < input.Length; srcIndex++)
{
if (set.Contains(input[srcIndex]))
{
Console.WriteLine("Duplicate: {0}", input[srcIndex]);
continue;
}
set.Add(input[srcIndex]);
// If srcIndex and destIndex are different,
// copy the src # to the dest.
if (srcIndex != destIndex)
{
input[destIndex] = input[srcIndex];
}
destIndex++;
}
}
we can do this in two steps:
1. Sort the array using ur favorite sorting algorithm.
2. Now we need to find duplicates and remove them
pseudo code:
removeDuplicates( A[] , size)
{
A=sort(A,size);
last_element= infinite;
index=0;
for each element in A:
if A[i] = last_element // =>duplicate
// do nothing
else
A[index] = A[i];
index++;
last_element=A[i];
return A[];
}
the size of array A is now equal to index.
How are you guys getting the space requirements to under 2 GB or 512 MB in some cases?
The array size itself is 20 GB, isnt it? Please correct me if im wrong:
32 bits = 4 bytes.
5 billion 32 bit numbers = 5 billion * 4 bytes = 20 billion bytes
20 billion bytes = 2*10^10 bytes = 2*10^7 KB = 2*10^4 MB = 2*10^1 GB = 20 GB(roughly)
We can think of this as equivalent to: There is a 20 GB file full of unsigned integers, how to produce a file that has no duplicates?
1. Break this large file into 40 500 MB files
2. Sort the integer in each file, modify whatever sorting you are doing to eliminate duplicates and print them.
3. Merge these 400 files, (like merging 400 arrays), while merging, eliminate and print duplicates.
I guess the solution above works. Also, the article below talks about various solutions related to this topic.
- Anonymous May 31, 2012aperiodic.net/phil/archives/Geekery/find-duplicate-elements.html