How would you eliminate and pr

Amazon Interview Question for Software Development Managers

0

of 0 votes

31
Answers
How would you eliminate and print duplicate elements of an array of 5 billion 32-bit signed integers
- InfiniteLoop May 31, 2012 in United States | Report Duplicate | Flag | PURGE
Amazon Software Development Manager Algorithm

Email me when people comment.

An error occurred in subscribing you.

Country: United States
Interview Type: Phone Interview

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 1 vote

I guess the solution above works. Also, the article below talks about various solutions related to this topic.

aperiodic.net/phil/archives/Geekery/find-duplicate-elements.html

- Anonymous May 31, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 1 vote

There are many ways to solve this problem. Following ways can be used for this.

1) If change in position of numbers is allowed, then Sort the list of numbers and remove the duplicates. Sorting will require at least n*log(n) time if quick sort or merge sort is used.
2) Iterate through list n^2 times removing the duplicate elements by choosing a seed and then comparing that seed against each number. But this is not effient method.
3) Using hashing, we can map the numbers to some less number of bucket and then compare the numbers in each bucket and arrive at a list of unique number.
4) Take bit map of number. Because we know that it is 32 bit integer, we require 2^32 bits to represent the bit map image and initialize the image to all 0. On first occurrance of number, set corresponding bit of bit map to 1. If number appears again then remove it from list. This will be O(1), but requires constant memory space i.e. 2^32 bits = 2^29 bytes = 512 Megabytes.

- Software Architect June 02, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

IMO option 4 is the best and the correct solution to this problem. allocating 512 MB of memory for such a large set of input should not be a problem. However if the interviewer wants you to reduce the amount of memory required then also the same logic will apply but with a little change and slightly more but still constant processing time.

What we will do is divide the processing into N batches depending upon the memory restriction imposed by the interviewer. Consider N to be 2 for this example. So we will divide it into 2 passes. In the first pass of 5 billion numbers we will only consider numbers from 0 to 2^32/N = 2^31 and accordingly allocate 2^31/2^3 = 2^28 = 256 MB of memory. In the second pass we will consider elements in the integer range of 2^31 to 2^32 and again allocate 256 MB and remove duplicates according to technique mentioned in point 4.

Hence by applying the same algorithm with different ranges of number I have reduced my memory footprint by Half. We can further reduce it by incorporating more passes into it.

- tarutrehan June 04, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

yes! That can be done if there is any restriction on memory usage.

- Software Architect June 05, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

On the approach 4, the bit-field space is one area that may do well with another improvement. Instead of keeping one bit for each number which can be set on the first occurance, we could reduce the memory requirement by using hash and a bloom-filter. Typically a bloomfilter of size 8 bit per hash input can return with 2% false positive rate in O(1) time. For input size of 2^32 we can choose a good hash function of size 8*32 bits to give a relatively collision-resistant hash and then use a bloomfilter of 8*(8*32) bits to find out if the hash of the input pattern occurred before or the pattern is new.
If it is new, we add the hash to bloomfilter and if it occured before we remove the input pattern.
.

- passerby June 18, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

There were some mistakes in the calculation above. The hash space could be 8 or 10 bit size rendering 2^32 inputs to 2^8 or 2^10 unique hash output. Now allowing 8 bits or 10 bits [as recommended in Wiki for 1% false positive rate] the bloomfilter size would be 2^10*10 bits i.e. 1032*10 bits or 10Kb or 1.2 KB which is lot smaller than keeping a bit for every 32 bit number.

- passerby June 19, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

nice explanation !!!!!!

- Arun Kumar Gupta September 09, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

Allocate two byte arrays of (5 billion / 8) size - one for positive integers, second for negative integers.
Go through array. If the next item >= 0 then check whether bit number [item % 8] of the First Byte Array[item /8] is set. If it it set then we found the duplicate element, print it and remove from array. Set the bit if it is unset. continue.
Similar for item<0

- Anonymous May 31, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

dude, why do you need 2 byte arrays ? you can treat integers as unsigned if you want

- Anonymous June 01, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Why cant we use 2 bit arrays each of size 2^31 without any of this % business? The space requirement for that will be 0.25 + 0.25 = 0.5 GB with no extra calculation required (which will save atleast 5 billion %8 operations)

- Renjith June 03, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

It looks like you are considering all the elements are in the range of 5 billion, which is not mentioned in the problem description.

- hash June 06, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

32-bit limits everything to 2^32 or less.

- Anonymous June 14, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 2 vote

This can be solved easily with two Hastables, the first Hashtable is where I will keep all numbers that appear at least once in the input, the second Hashtable is where I will keep all the numbers that appear more than once in the input (ie. a duplicate). Simply iterate over the input in one pass, testing the first Hashtable and then the second Hashtable to see if it already contains the number, and insert to the Hashtable and continue if it is not there already.

- eliminate the technobabble if possible May 31, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

But this solution has a worst-case space complexity of more than 2 GB (2 GB for the numbers between -2^31 to +2^31 plus additional overhead for maintaining the hashtable). The first solution posted in this article and the solutions posted in the article above come close to solving this in less space complexity. I am not sure if there are other better solutions.

Also, correct me if I am wrong, but I guess in the original solution, he/she meant to say "Allocate two byte arrays of (2 GB / 8) size". We only need to access the index pointed by arrayOfFiveGB[i] element to check if there is a duplicate.

- Interviewee May 31, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

May I ask if a B-tree would work (not BST) ? I understand that it is frequently used to store large data sets.

- InfiniteLoop May 31, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

One solution that seems reasonably efficient:

pass1: allocate an array of 2^32 elements each is two bit initialized to 00,
we iterate the 5-billion array and mark the 2^32 array as followed:
00: not seen
01: seen
10: duplicate
Then,we can output the duplicate number every time we see a switch from "seen" to "duplicate".

pass2: this is pass is only used for eliminating duplicate
create two pointers each pointing to the 0th element of the array
iterate the first pointer through the 5-billion array and use the number to index the 2^32 array to see whether we have a seen element or duplicate element,if we have a seen element,we copy it to what the second pointer points,and forward the second pointer by 1.
after the first pointer is at the tail,we know we are done and what the second pointer points is the tail of the new array.

- Anonymous May 31, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Case I

you don't need 2 bits. 1 bit is enough.
0 - unseen
1 - seen
Whenever a number enters, if the bit is 1 then you can print it as duplicate.

- Hari August 11, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

Follow the link .....(C++ )
bestalgo4programmers. blogspot.in/2012/04/recursively-reverse-n-nodes-of-link.html
(remove spaces in above link)

- Tushar Gupta May 31, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Sorry .. it's the link for previous question ...

- Tushar Gupta May 31, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

public static void removeDups2(int[] origArr,int[] newArr )
	{
		//32 bit signed integer means max +ve value are 2^31-1 and 2^31-1 -ve integers.
		//We will use the index to refer the actual integer
		//2^31 is approx 2.2 Gig
		BitSet positive=new BitSet((int)Math.pow(2, 31));
		BitSet negative=new BitSet((int)Math.pow(2, 31)-1);
		boolean isDuplicate=false;
		for(int i=0,j=0;i<origArr.length;++i)
		{
			isDuplicate=false;
			if(origArr[i]>=0)
			{
				//if the bit is set means its a duplicate else set the bit
				if(!positive.get(origArr[i]))
					positive.set(origArr[i]);
				else 
					isDuplicate=true;
				
			}
			else if(origArr[i]<0)
			{
				//if the bit is set means its a duplicate else set the bit
				if(!negative.get(origArr[i]*-1))//multiply by -1 to make it a postive index
				negative.set(origArr[i]*-1);
				else
					isDuplicate=true;
			}
			if(isDuplicate)
			{
				out.println(origArr[i]+" is duplicate.. remove it");
				origArr[i]=0;//  this is necessary only if we are not using the 2nd array for output
			}
			else
				newArr[j++]=origArr[i];

		}
	}

- Anonymous May 31, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

Create an array DUP of 2^32 size. initialized with -1 (or something)
Pass through the given array, and use the value of the array as index into the DUP array and if its -1 set it to 1. and if its alreay 1, then its already seen so print it....One pass solution is obtained

- Lokesh June 01, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Improvemnt....Since it is a signed 32 bit numbers in the 5 b array...we can use 2^16 array size would be enough to keep track...0 if not seen any of the +/- number... last bit set....+ve number already seen second last bit set if -ve number seen.

Now if the current number is +/- and is already seen then its duplicate.....now the size of array is reduced by a lot

- Lokesh June 01, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Well ...since only 2 bits are being used in every element of 2^16 array....remaining bits can be used further reducing the memory by half. Hence for can be further reduced by using all bits of each element arrray...so memory is requirement would become very lesss and negligible...and working with bits are faster..improves.!

- Lokesh June 01, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

I wonder "eliminate duplicate" means " eliminate all the other cases of duplicates but retain one" or "eliminate all the cases of a duplicated number",

if it is the former one,one pass is feasible.

- liao24hoho June 01, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 2 vote

[C#]
Use a HashSet() to keep track of 32bit ints that have been seen.

Also need to overwrite duplicate entries, with non-duplicates...

static void RemoveAndPrintDuplicates( BigArray<int>[] input)
{
    HashSet<int> set = new HashSet<int>();
    Int64 destIndex = 0;

    for (Int64 srcIndex = 0; srcIndex < input.Length; srcIndex++)
    {
        if (set.Contains(input[srcIndex]))
        {
            Console.WriteLine("Duplicate: {0}", input[srcIndex]);
            continue;
        }
        set.Add(input[srcIndex]);

        // If srcIndex and destIndex are different,
        // copy the src # to the dest.
        if (srcIndex != destIndex)
        {
            input[destIndex] = input[srcIndex];
        }
        destIndex++;
    }
}

- MessyHack June 01, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 2 vote

we can do this in two steps:

1. Sort the array using ur favorite sorting algorithm.
2. Now we need to find duplicates and remove them

pseudo code:

removeDuplicates( A[] , size)
{
   A=sort(A,size);
   last_element= infinite;
   index=0;
   for each element in A:
     if  A[i] = last_element             // =>duplicate
          // do nothing
     else 
           A[index] = A[i];
            index++;
            last_element=A[i];

       return A[];    
}

the size of array A is now equal to index.

- sarthak June 01, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

How are you guys getting the space requirements to under 2 GB or 512 MB in some cases?

The array size itself is 20 GB, isnt it? Please correct me if im wrong:
32 bits = 4 bytes.
5 billion 32 bit numbers = 5 billion * 4 bytes = 20 billion bytes

20 billion bytes = 2*10^10 bytes = 2*10^7 KB = 2*10^4 MB = 2*10^1 GB = 20 GB(roughly)

- confused!! June 03, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Because, bit map will be used. Which will be of fixed length. In this case each bit will represent presence or absence of the number. The memory usage of 512 MB is extra memory that is required for the processing apart from the array of numbers.

- Software Architect June 05, 2012 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

linux./unix solution

$uniq -u <filename>

- varun July 17, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

linux./unix solution

$uniq -u <filename>

- varun July 17, 2012 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

We can think of this as equivalent to: There is a 20 GB file full of unsigned integers, how to produce a file that has no duplicates?
1. Break this large file into 40 500 MB files
2. Sort the integer in each file, modify whatever sorting you are doing to eliminate duplicates and print them.
3. Merge these 400 files, (like merging 400 arrays), while merging, eliminate and print duplicates.

- xdoer October 03, 2012 | Flag Reply

CareerCup

Amazon Interview Question for Software Development Managers

Books

Videos

Resume Review

Mock Interviews