## Microsoft Interview Question

Senior Software Development Engineer

**Country:** United States

**Interview Type:** Phone Interview

Very imprecise formulation. The problem as given is:

1. False positives are not allowed: do not say "found" when the word is not actually there.

2. False negatives are allowed: you may say "not found" when the word is actually there.

These two axioms trivialise the problem: just store whatever portion of the list fits in memory as a set.

Say "yes" when the word exists in the set, otherwise say "no".
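Taken literally, that trivial reading can be sketched like this (a Java sketch rather than the thread's C#; `maxWords` is a made-up stand-in for the real memory budget):

```java
import java.util.HashSet;
import java.util.Set;

// Store whatever prefix of the list fits the budget; answer "yes" only for
// stored members. Words that didn't fit become false negatives; false
// positives are impossible by construction.
class PartialSet {
    private final Set<String> stored = new HashSet<>();

    PartialSet(Iterable<String> words, int maxWords) {
        for (String w : words) {
            if (stored.size() >= maxWords) break; // budget exhausted
            stored.add(w);
        }
    }

    boolean contains(String w) { return stored.contains(w); }
}
```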

Was it a trick question?

How much memory do we have? How many words?

This question wants optimal memory usage. Since we accept false negatives, we should minimize their probability by maximizing the hash range and picking a "good" hash function. Hash tables are not optimal for this problem because of overhead it doesn't need: for a given hash value we only need a single bit, conflicts don't matter, and no bucket lists or open addressing are required; storing all possible values is a huge overhead (not in terms of big-O, but practically). Instead, use a bit array and index it with a hash value. Scale the size of the array to the memory available to minimize the probability of false negatives.

Preparation step:

1) Create an array of n ints (n ints of b bits each), as large as memory allows.

2) Fill the array with 0xffffffff (all bits set; exact value depends on int size).

3) For each word, compute a hash with 0 <= hash < n*b and clear bit hash%b of array[hash/b].

Query:

Compute the hash for the word.

Check whether bit hash%b of array[hash/b] is set. If it is set, the word is for sure not in there.

If it is not set, the word is probably in there (or another word colliding on the same hash value is).
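The preparation and query steps above can be sketched as follows (a Java sketch rather than the thread's C#; the single hash function and the sizes are illustrative only):

```java
import java.lang.Math;

// Sketch of the scheme above: one bit per hash value, sized to the memory
// budget. Build sets a bit per word, then inverts the whole array, so at
// query time a SET bit means "for sure not in there".
class InvertedBitFilter {
    private final long[] bits;
    private final int nBits;

    InvertedBitFilter(Iterable<String> words, int nBits) {
        this.nBits = nBits;
        this.bits = new long[(nBits + 63) / 64];
        for (String w : words) {
            int h = hash(w);
            bits[h >>> 6] |= 1L << (h & 63);   // mark "possibly present"
        }
        // invert: a set bit now means "definitely not present"
        for (int i = 0; i < bits.length; i++) bits[i] = ~bits[i];
    }

    // true  -> the word is definitely NOT in the list
    // false -> probably in the list (or a colliding word is)
    boolean definitelyAbsent(String w) {
        int h = hash(w);
        return (bits[h >>> 6] & (1L << (h & 63))) != 0;
    }

    private int hash(String s) {
        // keep the hash inside [0, nBits); floorMod handles negative hashCodes
        return Math.floorMod(s.hashCode(), nBits);
    }
}
```

Stored words are never reported absent; only queries for unstored words can be wrong, and only in the "probably in there" direction.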

This is what I came up with, but it seems it wasn't good enough:

```
class WordList
{
    const int mask = (1 << 6) - 1;   // low 6 bits select a bit within a ulong
    ulong[] ba;                      // bit array

    public WordList(string[] l) {
        ba = new ulong[1 << 21];
        foreach (string w in l) {
            int h = Hash(w);
            ba[h >> 6] |= 1ul << (h & mask);   // mark "possibly present"
        }
        // invert, so a set bit now means "definitely not present"
        for (int i = 0; i < ba.Length; ++i) ba[i] = ~ba[i];
    }

    public bool Contains(string w) {
        int h = Hash(w);
        return (ba[h >> 6] & 1ul << (h & mask)) == 0;
    }

    int Hash(string s) { ... }   // elided in the original post
}
```

@kanukadze

It looks OK to me, but I don't get your constant 1 << 21 (roughly 16 MB of memory; can't you use more?). Also, the hash function has no upper limit, so it produces an int in 2^32 or 2^64 (I don't know C# that well anymore); even in the best case that is 32 - 6 = 26 bits of index, and 2^26 > 2^21, so sooner or later you will access elements that aren't there. A remainder (%) would help, although it messes up the hash distribution a bit; better is a hash function that takes the range into account and distributes perfectly within it.

Also, what are those ";6" and ";21" next to the shift operators?

The idea was certainly right; I find it elegant to invert the array and work with OR before inverting the whole thing. What interview was it? Was it just this one question or several?
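To isolate the range problem described above: the array holds 2^21 ulongs, i.e. 2^27 addressable bits, while the hash can be any int. A hedged sketch of the remainder fix (in Java rather than the original C#; the string is arbitrary):

```java
// The out-of-range bug: a full-range int hash indexes past an array of
// 1<<21 64-bit words (1<<27 bits). Clamping the hash to the total bit
// count before splitting it into (word, bit) keeps every access in bounds.
public class RangeFix {
    public static void main(String[] args) {
        int tableWords = 1 << 21;                 // ulong[1<<21] from the post
        long totalBits = (long) tableWords << 6;  // 1<<27 addressable bits

        int rawHash = "example".hashCode();       // may be negative or too large
        long h = (rawHash & 0xFFFFFFFFL) % totalBits; // treat as unsigned, clamp

        int word = (int) (h >>> 6);               // index into the array
        int bit  = (int) (h & 63);                // bit within that word
        System.out.println(word >= 0 && word < tableWords && bit < 64); // true
    }
}
```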

@ChrisZ

thanks for walking me through this.

You're right, 1<<21 does look strange. There wasn't any explicit cap on memory beyond fitting into a single box, so I probably just miscalculated. "<<;6" must be a website formatting issue; there is no ';' there.

The number of words was in the billions, and there was a hint about using a hash function with lots of collisions, which I interpreted as a way to limit the range, i.e., the bit-array size.
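For scale: with n words hashed into m bits (one hash per word), the expected fraction of marked bits is about 1 − e^(−n/m). A back-of-envelope estimate, assuming 4 billion words and a 4 GB bit array (both numbers made up for illustration):

```java
// Rough collision estimate for a single-hash bit array: the chance that a
// query for an ABSENT word lands on a marked bit (and so gets the weak
// "probably there" answer instead of a definite "not there").
public class CollisionEstimate {
    public static void main(String[] args) {
        double n = 4e9;                    // words (billions, per the thread)
        double m = 4e9 * 8;                // bits in an assumed 4 GB array
        double p = 1 - Math.exp(-n / m);   // fraction of marked bits
        System.out.printf("marked-bit fraction: %.4f%n", p); // ~0.1175
    }
}
```

So even at a generous 8 bits per word, roughly 12% of absent-word queries lose their definite answer; shrinking memory pushes that fraction up quickly.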

It was one of the two technical questions on a phone screen for a senior dev position with the MS big data team.

@ChrisZ

The problem definition is very blurry from my perspective.

If you're allowed to tune the hash function until you are sure it has no collisions on this set... well, it might work.

But what if you have only one pass over the data?

I agree that tries and compression algorithms are not a good solution here. But we have the strong constraint "no false positives", so we have to store at least some of the words without losing information.

At least, I can construct a set of words with many hash collisions for popular hashing algorithms. If the words are random, the probability of collisions is low; but an artificially designed set of words would be a problem.
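For example, Java's `String.hashCode` (a base-31 polynomial) makes such adversarial sets easy to build: "Aa" and "BB" both hash to 2112, and equal-length concatenations of colliding blocks collide too, giving 2^k colliding strings of length 2k:

```java
// "Aa" and "BB" collide under String.hashCode (31 * 'A' + 'a' ==
// 31 * 'B' + 'B' == 2112). Because the hash is a base-31 polynomial,
// equal-length concatenations of colliding blocks also collide --
// an exponentially large adversarial word set.
public class Collisions {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode());                        // 2112
        System.out.println("BB".hashCode());                        // 2112
        System.out.println("AaAa".hashCode() == "BBBB".hashCode()); // true
        System.out.println("AaBB".hashCode() == "BBAa".hashCode()); // true
    }
}
```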

My conclusion for now: there is not enough information about this problem, in particular about the nature of the words and about what probability of a mistake is acceptable.

PS: by "merging approaches" I mean that if the alphabet is wide but words share many parts, you can switch to another alphabet, like compression algorithms do, and then build a trie over the new alphabet.

But I have never done this in practice; such tricks are usually used in math proofs.

PPS: if you build a trie, you don't always need pointers for all edges. You can make a quasi-trie, with sub-tries indexed inside small blocks, much like the functional blocks inside a CPU.

I hope you get the idea; it's hard to describe with my English :)

@Sergey

I am not sure I understood the quasi-trie approach.

But you are right: if you can't invert the problem statement to say "it's certainly not in there", you may need to pack as many words into memory as possible. I researched a bit; a Minimal Acyclic Finite State Automaton (DAWG) is a modification of the trie that could do it.

Anyway upvoted your post ;-)

This problem is not about hashing. Why? If you use a hash, you can't guarantee "no false positives": if you say "YES" it should be 100% "YES", and if you store words while losing information you can't say "yes, I'm sure".

- Sergey, October 19, 2016

PLEASE stop using hashes EVERYWHERE.

So, we need a data structure that can say "Yes, I'm sure" and "Probably no".

1st approach

Use a trie (aka digital tree, radix tree, etc.). Append as many words as you can. A trie is a good choice because of its good search time complexity, and it is more memory-efficient than plain storage when words overlap each other.
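A minimal sketch of this approach (in Java; the class and method names are mine, and the memory-budget cutoff is left out for brevity):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie: insert as many words as the budget allows; contains() is
// exact for stored words, while dropped words become allowed false negatives.
// Shared prefixes are stored once, which is where the memory saving comes from.
class Trie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isWord; // marks the end of a complete word, not just a prefix
    }

    private final Node root = new Node();

    void insert(String w) {
        Node cur = root;
        for (char c : w.toCharArray())
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        cur.isWord = true;
    }

    boolean contains(String w) {
        Node cur = root;
        for (char c : w.toCharArray()) {
            cur = cur.next.get(c);
            if (cur == null) return false;
        }
        return cur.isWord;
    }
}
```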

2nd approach

Use lossless compression ("archiving") algorithms. Append as many words as you can.

Pros: you will be close to the best memory efficiency.

Cons: veeeeery sloooow.

Both approaches can be combined, depending on the nature of the data.