## Facebook Interview Question

Software Engineer / Developer

**Country:** United States

**Interview Type:** In-Person

That should work regardless of the size of the integers. 1 billion ≈ 2^30 numbers; each integer represents the existence of 32 = 2^5 numbers, so it needs at most 2^30 / 2^5 = 2^25 integers, i.e. 2^25 * 4 bytes = 128 MB (a 64-bit system needs half as many 8-byte integers, so the total is the same 128 MB). Or am I talking about a different method?

@nanny: I doubt that you and DarkKnight are talking about the same thing. Do elaborate on your idea though.

Use a bitmap: each 32-bit integer can represent the existence of 32 numbers, so we need 2^25 integers to cover 1 to 1 billion, which may need up to 128 MB of memory. On 64-bit, each integer costs more space but can represent more numbers (64), so it doesn't matter which system is in use.

@nanny: I see what you're saying. You're saying that, by the pigeonhole principle, if there are a billion numbers, at least one number between 1 and 1 billion + 1, inclusive, is unused. We will then make a bitmap with 1 billion + 1 bits and mark off occurrences of only the numbers from 1 to 1 billion + 1. We will be able to do this with 128 MB. If the inputs were 64-bit integers, this would take no more memory, because we still only care about integers in the 1 to 1 billion + 1 range.

That's a decent solution. It's a different solution from what DarkKnight was referring to, though.
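
Here's a minimal Java sketch of that pigeonhole idea (reading the file with a Scanner and the exact candidate range 1 to 1,000,000,001 are my assumptions):

```
import java.io.File;
import java.io.FileNotFoundException;
import java.util.BitSet;
import java.util.Scanner;

class PigeonholeBitmap {
    // 1 billion inputs cannot cover all 1,000,000,001 candidates,
    // so a clear bit must exist.
    static final int N = 1_000_000_001;

    static int findUnused(String filename) throws FileNotFoundException {
        BitSet seen = new BitSet(N + 1);        // about 125 MB of bits
        Scanner sc = new Scanner(new File(filename));
        while (sc.hasNextInt()) {
            int v = sc.nextInt();
            if (v >= 1 && v <= N) seen.set(v);  // ignore anything out of range
        }
        sc.close();
        return seen.nextClearBit(1);            // first candidate never seen
    }
}
```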

@eugene.yarovoi

The question is asking for any integer not in the file. We can restrict the range to 1 to 1 billion; any number out of that range, we just ignore.

For example, if we are given 64 numbers and we want to find an integer not among them, we only need to check the existence of 1 to 64. We only need 2 integers to track the existence of 1 to 64: num_1 stores the existence of 1 to 32 and num_2 stores the existence of 33 to 64. (One integer actually works as 32 boolean values.)

That's my understanding of DarkKnight's algorithm. Maybe you have a different interpretation?
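
To make the "one integer as 32 booleans" trick concrete, here's a tiny illustration (the sample values are my own):

```
public class BitFlags {
    public static void main(String[] args) {
        // num1 tracks the existence of 1..32, num2 tracks 33..64,
        // like num_1/num_2 above
        int num1 = 0, num2 = 0;
        for (int v : new int[] { 3, 64, 17, 33 }) {
            if (v >= 1 && v <= 32)        num1 |= 1 << (v - 1);
            else if (v >= 33 && v <= 64)  num2 |= 1 << (v - 33);
        }
        System.out.println((num1 & (1 << 16)) != 0);  // true: 17 was seen
        System.out.println((num2 & (1 << 7)) != 0);   // false: 40 was not seen
    }
}
```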

@eugene.yarovoi

And that's independent of 32-bit vs. 64-bit, since if the number of bits in one integer doubles, it takes double the memory but can represent double the amount of numbers as well. So it doesn't matter which system is in use.

@Nanny: As you were writing your response, I realized what you were talking about and edited my response above before I saw your reply. See my changes.

Your algo makes sense. DarkKnight was almost certainly proposing a different solution, as evidenced by:

"There are 2^32 possible values ( about 4 Billion )

Allocate a single bit for each value ( This would required 512 MB of memory )"

He is proposing allocating a bit for every possible value of a number and doesn't incorporate the pigeonhole principle.

@eugene.yarovoi

Yes, I just read your response. I don't quite understand "he is proposing allocating a bit for every possible value of a number." He needs some way to keep track of which numbers occurred so that no sorting is needed, and a bitmap is the only way I can think of. Could you please explain more, using the example: given 64 numbers, find one that isn't there?

If the numbers are 32-bit numbers, since there are 2^32 possible 32-bit numbers, he would allocate 2^32 bits, regardless of the input size.

@eugene.yarovoi

How does he use 1 bit to store the existence of each of the 2^32 values, if not with a bitmap?

Could you show me an example? I think DarkKnight and I were talking about the same method, "set the ith bit to be true." You have to use some way to keep track of which numbers occurred, as I said, or maybe you have some other fancy way?

As far as I've read, binary search and quicksort both don't exploit locality of reference; hence, if used on files that can't fit in main memory, they cause excessive paging. Please correct me if I'm wrong.

@Algos: No one said anything about quicksort, first of all. External sort is usually done via a mergesort phase followed by an N-way merge algorithm (which is either done by more mergesorts or by something that looks like a modified heapsort).

Second of all, there's no reason why quicksort couldn't exploit locality of reference, as it does operate on largely contiguous data. It arranges the array from the front and the back at the same time, but that's only 2 locations and writes are contiguous in both those places, so with a proper buffering strategy, things would be fine.

Third, you're right that binary search is not always a great algorithm to use on files, but we only need to use it once, so the cost of it relative to the cost of sorting is minimal.

Read each number in the file. Keep track of the largest number you encounter. When you are at the end, add 1 to it.

That would probably be my first thought too. To give some analysis: the person who posted this answer probably (and apologies if I'm mistaken) meant that we should choose an integer uniformly at random from the space of all possible integers we're considering (let's say this question is about 32-bit ints). Then, because there are > 4*10^9 possibilities (in the case of 32-bit ints), if we scan our array of size 10^9 for the number we just chose, the chance we'll find it is < 1/4. If we don't find it, the algo terminates and returns the number. If we find it, we just run the whole algorithm again. So we have that E[running time] = O(n) + P(we need to run again)*E[running time]. This implies E[running time] = O(n) / (1-P). Since we know P < 1/4, we know (1-P) > 3/4. This means E[running time] < 4/3 * O(n), which means E[running time] is still O(n).
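
A minimal Java sketch of that randomized strategy (an in-memory array stands in for the file, which is my simplification):

```
import java.util.Random;

public class RandomAbsent {
    // Expected number of passes is 1 / (1 - P) < 4/3 when P < 1/4.
    static int randomAbsent(int[] nums) {
        Random rng = new Random();
        while (true) {
            int candidate = rng.nextInt();    // uniform over all 2^32 ints
            boolean found = false;
            for (int n : nums) {
                if (n == candidate) { found = true; break; }
            }
            if (!found) return candidate;     // not in the input, so it's an answer
        }
    }
}
```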

I think starting at MIN_INT and methodically working up would work better than a randomly chosen number. Since the list of numbers is random, whatever number you choose has exactly the same odds of being in the list as a number chosen randomly, and it also avoids the risk of randomly choosing the same target number again. It still seems like too much of a brute-force method, though.

The only thing my tired brain can come up with right now is doing a quicksort so that finding it becomes trivial. But again, it feels like there's a simpler solution I haven't put my finger on yet.

It is not clear from the question whether the integers are between 1 and 1B only. We can solve it by divide and conquer. Divide all the integers based on their first bit, i.e., make two buckets: the first bucket with all integers whose first bit is 0, and the second bucket with integers whose first bit is 1. Then count the integers in both buckets, pick the bucket that has fewer integers, and again divide it into 2 buckets on the second bit, and so on. Do this until you end with an empty bucket, keeping track of the bit chosen for each bucket. From those bits you can form the missing integer.
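
A sketch of that bucket-splitting idea in Java, under my own assumptions (non-negative ints held in memory; a file-based version would re-scan and count per bit instead of materializing buckets):

```
import java.util.ArrayList;
import java.util.List;

public class BucketSplit {
    // Keep the smaller bucket each round: a bucket with fewer than 2^k numbers
    // cannot cover all 2^k settings of its k remaining low bits, so a gap survives.
    static int findMissing(List<Integer> nums) {
        int result = 0;
        List<Integer> bucket = nums;
        for (int bit = 30; bit >= 0 && !bucket.isEmpty(); bit--) { // bit 31 is the sign bit
            List<Integer> zeros = new ArrayList<>();
            List<Integer> ones = new ArrayList<>();
            for (int n : bucket) {
                if (((n >> bit) & 1) == 0) zeros.add(n); else ones.add(n);
            }
            if (zeros.size() <= ones.size()) {
                bucket = zeros;          // this bit of the answer stays 0
            } else {
                result |= 1 << bit;      // this bit of the answer is 1
                bucket = ones;
            }
        }
        return result;
    }
}
```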

It's like radix sort, and it will work, but there is a memory issue: storing a billion 4-byte ints will take about 4 GB. If memory isn't an issue, a better solution would be counting sort (O(n)). To solve this in O(n) without additional memory, some sort of dynamic search pattern should be used.

I think the quicksort-style approach would definitely work. Suppose we are given a billion 64-bit integers: we partition them around 2^63, and then we choose the part that has fewer (or equally many) numbers. Suppose the left part has fewer numbers; then by comparing against 2^62 we partition it (in place) again, until we reach a point where, after partitioning, one part doesn't contain any number at all. Then you can choose any number in that range as an answer. That works for 32-bit integers, too.
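
Here's what those in-place rounds could look like, as a sketch under my assumptions (non-negative values in a long[]); each round is essentially one quicksort partition step with a fixed pivot bit:

```
public class InPlaceSplit {
    static long findMissing(long[] a) {
        int lo = 0, hi = a.length;
        long result = 0;
        for (int bit = 62; bit >= 0 && lo < hi; bit--) {
            int mid = lo;                    // partition a[lo..hi) by this bit
            for (int i = lo; i < hi; i++) {
                if (((a[i] >>> bit) & 1L) == 0) {
                    long t = a[i]; a[i] = a[mid]; a[mid] = t; mid++;
                }
            }
            if (mid - lo <= hi - mid) {
                hi = mid;                    // keep the 0-side; answer bit stays 0
            } else {
                result |= 1L << bit;         // keep the 1-side; answer bit is 1
                lo = mid;
            }
        }
        return result;
    }
}
```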

Scan the file first and store the values in a BitMap. Using 1 bit per integer, we'd need 125MB.

Next scan the bitmap for a 0 bit.

Two things we need to know to answer this question:

1. Are the numbers in the file unique?

2. What is the range of the numbers in the file?

If the numbers in the file are unique (and consecutive except for one gap), then we are dealing with a billion numbers in the range [n to n+1billion]; the extra value in the range is for the missing number. If the numbers are not unique, then the numbers will range over [n to n+c] where c is unknown, and there will be repetitions in the file (unless the file is 1 billion instances of the same number).

If we assume we are dealing with a file containing 1 billion unique numbers in the range [0 to 1 billion] (so exactly one value in that range is missing), then a way to determine the missing number would be to:

1. calculate the sum of [0 to 1 billion] (sum of an arithmetic series = n(a1+an)/2)

2. then subtract each number contained in the file; what remains is the missing number.

The sum is big (approx. 5x10^17). This can be represented in 59 bits, so a long will be fine.

```
final long billion = 1000000000L;
// sum of 0..1e9 = 1e9 * (1e9 + 1) / 2, about 5e17, fits in a long
long sum = billion * (billion + 1) / 2;
while (file.hasNext()) {
    sum -= file.next();   // subtract everything present; the missing value remains
}
return sum;
```

Here's my Java solution for limited memory. Very minimal error checking.

```
import java.io.File;
import java.io.FileNotFoundException;
import java.util.BitSet;
import java.util.Scanner;

int getIntNotInFile(String filename, int memoryInBits) throws FileNotFoundException {
    if (memoryInBits <= 0)
        throw new IllegalArgumentException("We need some memory");
    // low must be a long so the loop can step past Integer.MAX_VALUE
    long low = Integer.MIN_VALUE;
    while (low <= Integer.MAX_VALUE) {
        BitSet bs = new BitSet(memoryInBits);
        Scanner sc = new Scanner(new File(filename));
        while (sc.hasNextInt()) {
            // only mark numbers inside the current window [low, low + memoryInBits)
            long offset = (long) sc.nextInt() - low;
            if (offset >= 0 && offset < memoryInBits)
                bs.set((int) offset);
        }
        sc.close();
        int clearBit = bs.nextClearBit(0);
        if (clearBit < memoryInBits && low + clearBit <= Integer.MAX_VALUE)
            return (int) (low + clearBit);
        low += memoryInBits;   // advance to the next window, with no gap
    }
    // We've scanned every window and every 32-bit integer was present
    throw new IllegalArgumentException("File does not have a missing integer");
}
```

Use a bloom filter - en.wikipedia.org/wiki/Bloom_filter

A Bloom filter with 1% error and an optimal value of k requires only about 9.6 bits per element. So the algorithm would be as follows:

1. Add all the numbers to the Bloom filter

2. Look up numbers from 1 to 1B in the Bloom filter. Return the first one that is not found.

Since Bloom filter addition and lookup are constant time, this is an O(n) solution.
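
For illustration, a sketch using Guava's BloomFilter (Guava is an assumed dependency, and the array input is my simplification). One subtlety: mightContain has no false negatives, so any candidate reported absent really is absent from the file, but false positives can make us skip candidates:

```
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomAbsent {
    static int bloomAbsent(int[] nums) {
        BloomFilter<Integer> bf =
                BloomFilter.create(Funnels.integerFunnel(), nums.length, 0.01);
        for (int n : nums) bf.put(n);              // ~9.6 bits per element at 1% fpp
        for (int c = 1; c <= 1_000_000_000; c++) {
            if (!bf.mightContain(c)) return c;     // no false negatives: truly absent
        }
        return -1;  // every candidate looked present (false positives can cause this)
    }
}
```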

We don't win anything with the bloom filter. We only want to store "presence"; whether a number was present in those billion numbers or not. We can do this by using a BitSet, for instance, which requires 1 bit per element. Again, we don't win anything with a bloom filter if it requires ~10 bits per element. Not to mention the error rate that comes with it.

```
// assumes non-negative ints; see the note below about negatives
int input[] = { 0, 1, 2, 3, 4, 5, 6, 7, 12, 22, 23, 32, 54, 66, 88, 102, 12312, 1000000, 1012739, Integer.MAX_VALUE };
long sum = 0;
int max = 0;
int res = 0;
int i;
// sum all the ints into a long variable and track the maximum;
// in this example the ints come from input[]
for (i = 0; i < input.length; i++) {
    sum += input[i];
    if (max < input[i]) {
        max = input[i];
    }
}
// simple check: if the maximum isn't MAX_VALUE, then max + 1 is not in the array
if (max != Integer.MAX_VALUE) {
    res = max + 1;
} else {
    // walk down from MAX_VALUE: if i cannot be subtracted from the remaining
    // sum it is the result, otherwise reduce the sum modulo i
    for (i = Integer.MAX_VALUE; i >= 1; i--) {
        if (sum >= i) {
            while (sum >= i) {
                sum -= i;
            }
        } else {
            res = i;
            break;
        }
    }
}
System.out.println("Here is your new int: " + String.valueOf(res));
```

This works only for positive ints; to handle negatives, another sum variable and another for loop must be added.

Unbelievable, the amount of bullshit we can read here. The best solution is, as someone mentioned, Cantor's diagonalization. You read the nth digit of the nth number and add 1 to it modulo 10 (or even modulo 2, actually); if the number doesn't have an nth digit, you just use 0 + 1 = 1. This becomes the nth digit of your new number. The new number differs from every listed number in at least one digit (by construction). The complexity of this algorithm is O(number of lines). No other algorithm I have seen on this page runs as fast as this one.
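
A short sketch of that diagonal construction (assuming one decimal number per line; note the result can have as many digits as there are lines, so it won't fit in 32 bits):

```
import java.util.List;

public class Diagonal {
    // Builds a decimal string that differs from the i-th line at digit i.
    static String diagonalNumber(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            String w = lines.get(i);
            int d = i < w.length() ? w.charAt(i) - '0' : 0;  // missing digit counts as 0
            sb.append((d + 1) % 10);                         // differs at position i
        }
        return sb.toString();
    }
}
```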

Considering an unsigned 32-bit integer:

There are 2^32 possible values (about 4 billion).

Allocate a single bit for each value (this would require 512 MB of memory).

Read the file; for each integer i, set the ith bit to 1.

Once reading the file is complete, find the first bit which is not 1, and print the number corresponding to this bit.

- DarkKnight October 07, 2012
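
A sketch of that full-range bitmap in Java, under my assumptions (ints are read as text with a Scanner and mapped to unsigned indices; the 512 MB table is a long[]):

```
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class FullBitmap {
    public static void main(String[] args) throws FileNotFoundException {
        long[] bits = new long[1 << 26];          // 2^26 longs * 8 B = 512 MB, one bit per value
        Scanner sc = new Scanner(new File(args[0]));
        while (sc.hasNextInt()) {
            long u = sc.nextInt() & 0xFFFFFFFFL;  // treat the int as an unsigned value
            bits[(int) (u >>> 6)] |= 1L << (u & 63);
        }
        sc.close();
        for (long u = 0; u < (1L << 32); u++) {
            if ((bits[(int) (u >>> 6)] & (1L << (u & 63))) == 0) {
                System.out.println(u);            // first value never seen
                return;
            }
        }
        System.out.println("every 32-bit value appears in the file");
    }
}
```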