Microsoft Interview Question
Country: United States
Bloom filter will work only for the false case, not the true case.
One other way to optimize for space could be to keep two levels of hashsets. One regular one, one for contiguous ranges. If in your regular hashset you find that you have hashed numbers in, say, a contiguous block of 10 (like 0-9 inclusive), you can remove them from regular hashshet and then add 1 to your range hashset. Now, as you are checking a new number n, check to see if n/10 exists in the range hashset. If not, check in the regular hashshet.
This can incredibly reduce your space usage.
One more thing, this seems more like an open ended question to see how many questions you can ask about details and come up with different solutions for it rather than one with one correct answer.
The fact that smaller numbers are more likely to appear before large numbers is a hint that the data will be clustered in continues ranges.
Here is a Python code that would keep track of seen numbers by storing contiguous ranges in a binary tree:
class MyRange:
def __init__(self, l: int, r: int) -> None:
self.l = l
self.r = r
class IntegerSetNode:
def __init__(self, i: int):
self.range = MyRange(i, i+1)
self.left = None
self.right = None
class IntegerSet:
def __init__(self):
self.root = None
def seen(self, i: int) -> bool:
if self.root is None:
self.root = IntegerSetNode(i)
return False
else:
node = self.root
while True:
if node.range.l > i:
if node.range.l == i + 1:
node.range.l = i
return False
elif node.left is None:
node.left = IntegerSetNode(i)
return False
else:
node = node.left
elif node.range.r <= i:
if node.range.r == i:
node.range.r = i + 1
return False
elif node.right is None:
node.right = IntegerSetNode(i)
return False
else:
node = node.right
else:
return True
def test():
in_set = IntegerSet()
for i in range(10):
if in_set.seen(i):
print(f"should not have seen {i}")
for i in range(10):
if not in_set.seen(i):
print(f"should have have seen {i}")
# Press the green button in the gutter to run the script.
if __name__ == '__main__':
test()
A question for the interviewer would be - are we allowed to use a data store. In that case, you can have a sorted linked list to store the largest 50 or so integers and store the rest in the data store. Searching in a constant size list should take constant time - of course the constant itself is proportional to the size of the list. The moment you encounter a larger number, you extend the list and then pop out the smallest number from the tail and put it in a data store. If you occasionally encounter a number smaller than the smallest in the list, just search in data store.
#include <iostream>
#include <unordered_set>
#include <vector>
class StreamChecker {
public:
StreamChecker() { }
bool hasSeen(int number) {
// Check if the number is in the set of seen numbers
if (seen.find(number) != seen.end()) {
return true; // Number has already been seen
} else {
seen.insert(number); // Add the number to the set of seen numbers
return false; // This is the first time seeing the number
}
}
private:
std::unordered_set<int> seen; // Set to store seen numbers
};
int main() {
StreamChecker checker;
std::vector<int> stream = {1, 2, 3, 5, 6, 11, 7, 3, 4, 1, 2, 5, 9, 22, 12};
for (int num : stream) {
if (checker.hasSeen(num)) {
std::cout << num << " has been seen before." << std::endl;
} else {
std::cout << num << " is seen for the first time." << std::endl;
}
}
return 0;
}
The question starts with a probability topic and restriction on space both point to a possible solution using Bloom Filter.
- Saurabh March 18, 2021Now Bloom Filter is a probabilistic data structure that uses a relatively small fixed-size array to record hashes for the elements that it has seen before. It may give you a false positive with some probability P but never a false negative. (So it may say that the element exists in a set when the element may not really exist)
Usually, a bloom filter has an "add" and "contains" operation which both are very fast (time complexity O(1)). But there is an implementation/extension of the bloom filter called, the Counting Bloom filter with a "remove" operation at the cost of incurring an additional space overhead for counting.
I would assume, the first part of the question where smaller numbers have more probability of appearing first in the stream to suggest that once you start getting larger numbers, you can remove hashes generated using smaller numbers to further reduce the false positive scenario.