Facebook Interview Question
Software Engineer / Developer
Country: United States
Interview Type: In-Person
I like this solution. Assuming a good hashing scheme, this can do 4 billion strings in 512 MB of memory (ideal case), which scales to ~64 billion strings in 8 GB. If N becomes too large, a distributed hash table can be used to span the data across multiple servers. Another alternative is to store the data on disk and maintain indexes for faster seeks and updates.
Hi, I didn't get it. Can you explain "this can do 4 billion strings in 512MB memory (ideal case)" in more detail?
If one string maps to one hash value (kept as one bit), we need 4 billion bits to represent 4 billion strings. Packing 8 bits into one byte, 4 billion bits compress to 0.5 billion bytes (i.e. 512 MB). This assumes the hashing scheme is ideal, i.e. each string maps to exactly one hash in the 0 to 2^32-1 range with no collisions.
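A scaled-down sketch of this bit-array idea, using 2^24 buckets (2 MB) instead of the full 2^32 for illustration, and CRC32 as a stand-in hash; with fewer buckets than distinct hashes, collisions make the count an underestimate:

```python
import zlib

def count_unique(strings, bits=2 ** 24):
    """Count strings whose hash bucket has not been seen before."""
    bitmap = bytearray(bits // 8)          # 8 hash buckets packed per byte
    unique = 0
    for s in strings:
        h = zlib.crc32(s.encode()) % bits
        byte, bit = divmod(h, 8)
        if not bitmap[byte] & (1 << bit):  # first time this bucket is hit
            bitmap[byte] |= 1 << bit
            unique += 1
    return unique

print(count_unique(["a", "b", "a", "c", "b"]))  # 3
```

With bits = 2^32 and an ideal hash, the bitmap is exactly the 512 MB described above.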
Use MapReduce if all the strings or hashes cannot fit in main memory. You will need a couple of MR jobs. In the first map step, simply emit (word, 1), and in the reduce step emit (word, size(intermediate_list)). In the second map step, emit (word, 1) only if the value is 1. The second reducer is simply the identity function.
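The two jobs can be simulated in memory to check the logic; `run_job` here is a toy stand-in for a real MapReduce framework, not an actual distributed runtime:

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    groups = defaultdict(list)             # "shuffle": group values by key
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output

words = ["cat", "dog", "cat", "bird"]
# Job 1: map emits (word, 1); reduce emits (word, count).
counts = run_job(words,
                 lambda w: [(w, 1)],
                 lambda w, vals: [(w, len(vals))])
# Job 2: map emits (word, 1) only when the count is 1; reduce is the identity.
uniques = run_job(counts,
                  lambda kv: [(kv[0], 1)] if kv[1] == 1 else [],
                  lambda w, vals: [(w, vals[0])])
print(len(uniques))  # words that appear exactly once
```

Note this counts words appearing exactly once; dropping the job-2 filter and counting the job-1 output instead gives the number of distinct words.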
A Bloom filter might be a good way of checking how many unique elements there are in one pass through the array. As the question implies, it would only be an *estimate*, since a Bloom filter is a probabilistic data structure and can report false positives.
I like the Bloom filter, but this doesn't fully solve the problem. For one, if we iterate over N strings and string S appears more than once, then the first occurrence will be evaluated as "unique" and the second will not. We need a way to store the result set outside of the "uniqueness detector" (the Bloom filter). If an array of size N is too large to fit into memory, then our result set should be stored in a way that doesn't impact the memory constraint (ideally alongside the uniqueness-detecting construct).
We can think of maintaining a trie. If two strings differ in only one character, we are not storing all the characters of both strings separately, as we would end up doing if we used sets.
We can store a separate trie for words starting with each of the 26 characters, and keep only a few of them in memory (a cache), as the memory constraints allow. The rest we dump to disk (on a cache miss, etc.).
And obviously we increment the count when we encounter a string that is not present in the trie.
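A minimal sketch of the trie idea, using nested dicts and a "$" end-of-word marker (both illustrative choices); shared prefixes like "car"/"card"/"cat" are stored only once:

```python
def count_unique(strings):
    root = {}                          # each trie node is a dict of children
    count = 0
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})   # walk/extend the path for s
        if "$" not in node:            # end-of-word marker absent: new string
            node["$"] = True
            count += 1
    return count

print(count_unique(["car", "card", "car", "cat"]))  # 3
```

The per-first-letter split described above would just mean keeping 26 such roots and paging them in and out of memory separately.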
For N too large to fit into memory, maintain buckets based on the sum of the ASCII values of the characters in each string, and compare only the strings within a common bucket. This lets each bucket fit into memory. If a common bucket holds a large number of strings, other keys can be used to form buckets, like the first few characters of the string, so as to minimize the maximum bucket size.
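A sketch of this bucketing idea; the modulus capping the bucket count is an arbitrary illustrative choice, and in the real out-of-core version each bucket would be a separate file processed independently:

```python
from collections import defaultdict

def count_unique(strings):
    buckets = defaultdict(set)
    for s in strings:
        key = sum(ord(c) for c in s) % 1024  # character-code sum as bucket key
        buckets[key].add(s)                  # exact comparison within a bucket
    return sum(len(b) for b in buckets.values())

# "ab" and "ba" share a bucket (same character sum) but are still told apart
print(count_unique(["ab", "ba", "ab", "c"]))  # 3
```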
1) If we can keep the hashes of the entire array in memory, then a simple hashing technique would be fast enough for this problem. But to avoid collisions, we might need a sufficiently large hash (160 to 256 bits), which may again cause memory issues.
2) We can use a distributed hash table (if possible) to solve the memory problem.
3) We can use Bloom filters:
a) We can use ~10 bits per element in the Bloom filter to keep the false-positive rate under 1% (source: Wikipedia).
b) Before inserting an element into the Bloom filter, we can check whether it is "definitely not in the set". We write these strings to a "resultset" file.
c) If it "can be in the set", keep track of it, either by writing it to a file or by keeping it in memory as hashes (depending on available memory).
d) After we have consumed the available memory space or all elements, the "resultset" contains definitely unique elements. We repeat this step until we have consumed all elements in the source file.
e) Then we can discard or keep the "can be in set" elements using simple comparison, or by comparing hashes in groups.
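Steps (b) and (c) can be sketched with a small Bloom filter; the bit count, number of hashes, and SHA-256-based hash family here are illustrative rather than tuned to the ~10-bits-per-element sizing:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.bits = bytearray(num_bits // 8 + 1)
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, item):
        # Derive k positions by salting one hash; a real filter might use k
        # independent hash functions instead.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

bf = BloomFilter(num_bits=1000, num_hashes=7)
definitely_new, maybe_seen = [], []
for s in ["a", "b", "a", "c"]:
    if bf.might_contain(s):
        maybe_seen.append(s)        # step (c): needs an exact second-pass check
    else:
        definitely_new.append(s)    # step (b): guaranteed not seen before
    bf.add(s)
print(definitely_new, maybe_seen)
```

Everything in `definitely_new` is certainly unique so far; `maybe_seen` is the candidate set that step (e) resolves with exact comparisons.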
We need to keep track of strings we have already encountered, so we need a "set". If we can do this in memory, then we use our language's native set implementation if it has one (e.g. Python has sets); if not, we use a hash table where the strings are the keys and the values are some sentinel like True or 1.
The basic algorithm is this:
visited = set()
cnt = 0
for s in strings:
    if s not in visited:
        cnt += 1
        visited.add(s)
If memory is a constraint, use an RDBMS to implement your set of strings. Create a table with a single column called string, with an index on that column. Use SELECT to determine whether a string has been encountered, and INSERT to add a string to the table.
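A sketch of this approach using SQLite; the table and column names follow the description above, and an in-memory database stands in for the disk-backed one you would actually use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # a file path would give disk-backed storage
# PRIMARY KEY gives us the index on the column for free
conn.execute("CREATE TABLE strings (string TEXT PRIMARY KEY)")

count = 0
for s in ["x", "y", "x", "z"]:
    seen = conn.execute(
        "SELECT 1 FROM strings WHERE string = ?", (s,)).fetchone()
    if seen is None:                 # SELECT found nothing: first occurrence
        conn.execute("INSERT INTO strings VALUES (?)", (s,))
        count += 1
print(count)  # 3
```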
If it's only required to estimate the number with limited memory, I think it's possible to use a simplified hash table to do this.
- Ghost October 29, 2012
We can define a boolean array S[] (if memory is limited, we can compress every 8 boolean values into one byte).
On each iteration we calculate the hash H of the current string and check whether S[H] is true, which would mean we inserted the same string before this iteration, so we do not count it; otherwise we set S[H] and count it.
If the hash function has no collisions, the algorithm always gets the exact result.
To minimize collisions, we can use a group of hash functions, like a Bloom filter (each time we check S1[H1], S2[H2], ..., SN[HN]; if all of those bits have been set, we assert that the string is a duplicate).