Facebook Interview Question
Software Engineer / Developer
Country: United States
Interview Type: In-Person
I like this solution. Assuming a good hashing scheme, this can do 4 billion strings in 512 MB of memory (ideal case), which scales to ~64 billion strings in 8 GB. If N becomes too large, a distributed hash table can be used to span the data across multiple servers. Another alternative is to store the data on disk and maintain indexes for faster seeks and updates.
Hi, I didn't get it. Can you explain "this can do 4 billion strings in 512MB memory (ideal case)" in more detail?
If one string maps to one hash value (kept as one bit), we need 4 billion bits to represent 4 billion strings. Packing 8 bits into one byte, 4 billion bits compress to 0.5 billion bytes (i.e. 512 MB). This assumes the hashing scheme is ideal, i.e. each string maps to exactly one hash in the 0 to 2^32-1 range with no collisions.
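A scaled-down sketch of this bit-array idea, using 2^24 buckets (2 MB) instead of the full 2^32 for illustration, and CRC32 as a stand-in hash; with fewer buckets than distinct hashes, collisions make the count an underestimate:

```python
import zlib

def count_unique(strings, bits=2 ** 24):
    """Count strings whose hash bucket has not been seen before."""
    bitmap = bytearray(bits // 8)          # 8 hash buckets packed per byte
    unique = 0
    for s in strings:
        h = zlib.crc32(s.encode()) % bits
        byte, bit = divmod(h, 8)
        if not bitmap[byte] & (1 << bit):  # first time this bucket is hit
            bitmap[byte] |= 1 << bit
            unique += 1
    return unique

print(count_unique(["a", "b", "a", "c", "b"]))  # 3
```

With bits = 2^32 and an ideal hash, the bitmap is exactly the 512 MB described above.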
Use MapReduce if all the strings or hashes cannot fit in main memory. You will need a couple of MR jobs. In the first map step, simply emit (word, 1), and in the reduce step emit (word, size(intermediate_list)). In the second map step, emit (word, 1) only if the value is 1. The second reducer is simply the identity function.
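The two jobs can be simulated in memory to check the logic; `run_job` here is a toy stand-in for a real MapReduce framework, not an actual distributed runtime:

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    groups = defaultdict(list)             # "shuffle": group values by key
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output

words = ["cat", "dog", "cat", "bird"]
# Job 1: map emits (word, 1); reduce emits (word, count).
counts = run_job(words,
                 lambda w: [(w, 1)],
                 lambda w, vals: [(w, len(vals))])
# Job 2: map emits (word, 1) only when the count is 1; reduce is the identity.
uniques = run_job(counts,
                  lambda kv: [(kv[0], 1)] if kv[1] == 1 else [],
                  lambda w, vals: [(w, vals[0])])
print(len(uniques))  # words that appear exactly once
```

Note this counts words appearing exactly once; dropping the job-2 filter and counting the job-1 output instead gives the number of distinct words.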
A Bloom filter might be a good way of checking how many unique elements there are in one pass through the array. As the question implies, it would only be an *estimate*, since a Bloom filter is a probabilistic data structure and can report false positives.
I like the Bloom filter, but this doesn't fully solve the problem. For one, if we iterate over N strings and string S appears more than once, then the first occurrence will be evaluated as "unique" and the second will not. We need a way to store the result set outside of the "uniqueness detector" (the Bloom filter). If an array of size N is too large to fit into memory, then our result set should be stored in a way that doesn't impact the memory constraint (ideally alongside the uniqueness-detecting construct).
We can think of maintaining a trie. If two strings differ in only one character, we are not storing all the characters of both strings separately, as we would end up doing if we used sets.
We can store a separate trie for words starting with each of the 26 characters, and keep only a few of them in memory (a cache), as the memory constraints allow. The rest we dump to disk (on a cache miss, etc.).
And obviously we increment the count when we encounter a string that is not present in the trie.
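A minimal sketch of the trie idea, using nested dicts and a "$" end-of-word marker (both illustrative choices); shared prefixes like "car"/"card"/"cat" are stored only once:

```python
def count_unique(strings):
    root = {}                          # each trie node is a dict of children
    count = 0
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})   # walk/extend the path for s
        if "$" not in node:            # end-of-word marker absent: new string
            node["$"] = True
            count += 1
    return count

print(count_unique(["car", "card", "car", "cat"]))  # 3
```

The per-first-letter split described above would just mean keeping 26 such roots and paging them in and out of memory separately.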
For N too large to fit into memory, maintain buckets based on the sum of the ASCII values of the characters in each string, and compare only the strings within a common bucket. This lets each bucket fit into memory. If a common bucket holds a large number of strings, other keys can be used to form buckets, like the first few characters of the string, so as to minimize the maximum bucket size.
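A sketch of this bucketing idea; the modulus capping the bucket count is an arbitrary illustrative choice, and in the real out-of-core version each bucket would be a separate file processed independently:

```python
from collections import defaultdict

def count_unique(strings):
    buckets = defaultdict(set)
    for s in strings:
        key = sum(ord(c) for c in s) % 1024  # character-code sum as bucket key
        buckets[key].add(s)                  # exact comparison within a bucket
    return sum(len(b) for b in buckets.values())

# "ab" and "ba" share a bucket (same character sum) but are still told apart
print(count_unique(["ab", "ba", "ab", "c"]))  # 3
```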
1) If we can keep the hashes of the entire array in memory, then a simple hashing technique would be fast enough for this problem. But to avoid collisions, we might need a sufficiently large hash (160 to 256 bits), which may again cause memory issues.
2) We can use a distributed hash table (if possible) to solve the memory problem.
3) We can use Bloom filters:
a) We can use ~10 bits per element in the Bloom filter to keep the false-positive rate under 1% (source: Wikipedia).
b) Before inserting an element into the Bloom filter, we can check whether it is "definitely not in the set". We write these strings to a "resultset" file.
c) If it "can be in the set", keep track of it, either by writing it to a file or by keeping it in memory as hashes (depending on available memory).
d) After we have consumed the available memory space or all elements, the "resultset" contains definitely unique elements. We repeat this step until we have consumed all elements in the source file.
e) Then we can discard or keep the "can be in set" elements using simple comparison, or by comparing hashes in groups.
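Steps (b) and (c) can be sketched with a small Bloom filter; the bit count, number of hashes, and SHA-256-based hash family here are illustrative rather than tuned to the ~10-bits-per-element sizing:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.bits = bytearray(num_bits // 8 + 1)
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, item):
        # Derive k positions by salting one hash; a real filter might use k
        # independent hash functions instead.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

bf = BloomFilter(num_bits=1000, num_hashes=7)
definitely_new, maybe_seen = [], []
for s in ["a", "b", "a", "c"]:
    if bf.might_contain(s):
        maybe_seen.append(s)        # step (c): needs an exact second-pass check
    else:
        definitely_new.append(s)    # step (b): guaranteed not seen before
    bf.add(s)
print(definitely_new, maybe_seen)
```

Everything in `definitely_new` is certainly unique so far; `maybe_seen` is the candidate set that step (e) resolves with exact comparisons.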
We need to keep track of strings we have already encountered, so we need a "set". If we can do this in memory, then we use our language's native set implementation if it has one (e.g. Python has sets); if not, we use a hash table where the strings are the keys and the values are some sentinel like True or 1.
The basic algorithm is this:
visited = set()
cnt = 0
for s in strings:
    if s not in visited:
        cnt += 1
        visited.add(s)
If memory is a constraint, use an RDBMS to implement your set of strings. Create a table with a single column called string, with an index on that column. Use SELECT to determine whether a string has been encountered, and INSERT to add a string to the table.
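A sketch of this approach using SQLite; the table and column names follow the description above, and an in-memory database stands in for the disk-backed one you would actually use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # a file path would give disk-backed storage
# PRIMARY KEY gives us the index on the column for free
conn.execute("CREATE TABLE strings (string TEXT PRIMARY KEY)")

count = 0
for s in ["x", "y", "x", "z"]:
    seen = conn.execute(
        "SELECT 1 FROM strings WHERE string = ?", (s,)).fetchone()
    if seen is None:                 # SELECT found nothing: first occurrence
        conn.execute("INSERT INTO strings VALUES (?)", (s,))
        count += 1
print(count)  # 3
```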
If it's only required to estimate the number with limited memory, I think it's possible to use a simplified hash table to do this.
- Ghost October 29, 2012
We can define a boolean array S[] (if memory is limited, we can compress every 8 boolean values into one byte).
On each iteration we calculate the hash H of the current string and check whether S[H] is true, which would mean we inserted the same string before this iteration, so we do not count it; otherwise we set S[H] and count it.
If the hash function has no collisions, the algorithm always gets the exact result.
To minimize collisions, we can use a group of hash functions, like a Bloom filter (each time we check S1[H1], S2[H2], ..., SN[HN]; if all of those bits have been set, we assert that the string is a duplicate).