## Facebook Interview Question for Software Engineer / Developers

Country: United States
Interview Type: In-Person


If it's only required to estimate the number with limited memory, I think it's possible to use a simplified hash table.
We can define a boolean array S[] (if memory is limited, we can pack every 8 boolean values into one byte).
For each string we calculate a hash value H and check whether S[H] is already true, which would mean we have inserted the same string before, so we do not count it. Otherwise we set S[H] and count it.
If the hash function has no collisions, this algorithm always gets the exact result.
To reduce collisions, we can use a group of hash functions, like a Bloom filter (each time we check S1[H1], S2[H2], ..., SN[HN]; only if all of those bits have been set do we treat the string as a duplicate).
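A minimal sketch of the single-bit-array variant described above (the table size and the SHA-1 slot hash are illustrative assumptions, not from the original comment):

```python
import hashlib

NUM_SLOTS = 2 ** 20                          # illustrative table size: 1M slots = 128 KB

def slot_of(s: str) -> int:
    # Stable hash into the slot range (Python's built-in hash() is salted per run).
    digest = hashlib.sha1(s.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SLOTS

def estimate_unique(strings) -> int:
    """Count strings whose hash bit was not yet set; collisions cause undercounting."""
    table = bytearray(NUM_SLOTS // 8)        # 8 boolean flags packed per byte
    count = 0
    for s in strings:
        byte, bit = divmod(slot_of(s), 8)
        if not table[byte] & (1 << bit):     # bit clear -> first time we see this hash
            table[byte] |= 1 << bit
            count += 1
    return count
```

With far fewer strings than slots, collisions are negligible, so `estimate_unique(["a", "b", "a", "c", "b"])` returns the exact answer, 3.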


I like this solution. Assuming a good hashing scheme, this can handle 4 billion strings in 512MB of memory (ideal case), which scales to ~64 billion strings in 8GB. If N becomes too large, a distributed hash table can be used to span the data across multiple servers. Another alternative is to store the data on disk and maintain indexes for faster seeks and updates.


Hi, I didn't get it. Can you explain "this can do 4 billion strings in 512MB memory (ideal case)" in more detail?


If one string maps to one hash value (kept as one bit), we need 4 billion bits to represent 4 billion strings. Packing 8 bits into one byte, 4 billion bits compress to 0.5 billion bytes (i.e. 512 MB). This assumes the hashing scheme is ideal, i.e. one string maps to exactly one hash in the 0 to 2^32-1 range.
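A quick check of the arithmetic, taking "4 billion" as the full 2^32 hash range:

```python
bits = 2 ** 32                       # one flag per possible 32-bit hash value
bytes_needed = bits // 8             # pack 8 flags into each byte
mib = bytes_needed // (1024 ** 2)    # convert to MiB
print(mib)                           # -> 512
```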


Use MapReduce if all the strings or hashes cannot fit in main memory. You will need a couple of MR jobs. In the first job's map step, simply emit (word, 1); in its reduce step, emit (word, size(intermediate_list)). In the second job's map step, emit (word, 1) only if the count is 1. The second reducer is simply the identity function.
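The two jobs can be sketched in-process; the `map_reduce` helper below is a hypothetical stand-in for a real MapReduce framework, just to show the mapper/reducer contracts:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal in-process stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):       # shuffle: group mapper output by key
            groups[key].append(value)
    out = []
    for key, values in groups.items():
        out.extend(reducer(key, values))
    return out

words = ["cat", "dog", "cat", "bird"]

# Job 1: count occurrences of each word.
job1 = map_reduce(words,
                  mapper=lambda w: [(w, 1)],
                  reducer=lambda w, ones: [(w, len(ones))])

# Job 2: keep only words whose count is 1; the reducer is the identity.
job2 = map_reduce(job1,
                  mapper=lambda kv: [(kv[0], 1)] if kv[1] == 1 else [],
                  reducer=lambda w, vals: [w])

print(sorted(job2))   # -> ['bird', 'dog']
```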


A Bloom filter might be a good way of checking how many unique elements there are in one pass through the array. As the question implies, it would only be an *estimate*, since a Bloom filter is a probabilistic data structure and can return false positives.
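A minimal sketch of that one-pass approach; the sizes (2^16 bits, 4 hash functions derived by double hashing from one SHA-256 digest) are illustrative assumptions:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; sizes here are illustrative, not tuned."""
    def __init__(self, num_bits: int = 2 ** 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k positions from one digest via double hashing.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def estimate_unique(strings) -> int:
    bf = BloomFilter()
    count = 0
    for s in strings:
        if not bf.might_contain(s):   # "definitely not seen" -> count as new
            count += 1
            bf.add(s)
    return count
```

False positives make this undercount, which is why the result is an estimate; with these sizes and a handful of inputs, `estimate_unique(["a", "b", "a", "c"])` comes out exactly 3.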


I like the Bloom filter, but this doesn't fully solve the problem. For one, if we iterate over N strings and string S appears more than once, the first occurrence will be evaluated as "unique" and the second will not. We need a way to store the result set outside of the "uniqueness detector" (the Bloom filter). If an array of size N is too large to fit into memory, then our result set should be stored in a way that doesn't impact the memory constraint (ideally alongside the uniqueness-detecting construct).


We can think of maintaining a trie. If two strings differ in only one character, we don't end up storing all the characters of both strings separately, as we would if we used sets.

We can store a separate trie for words starting with each of the 26 characters, and keep only a few of them in memory (a cache), as the memory constraints allow. Otherwise, we dump them to disk (on a cache miss, etc.).


And obviously increment the count when we encounter a string which is not present in the trie.
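The insert-and-count idea above might look like this, using a plain dict-of-dicts trie (names and the "$" end-of-word marker are illustrative):

```python
class Trie:
    """Prefix tree that shares common prefixes between stored strings."""
    def __init__(self):
        self.root = {}

    def insert(self, word: str) -> bool:
        """Insert word; return True if it was not present before."""
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        is_new = "$" not in node          # "$" marks end-of-word
        node["$"] = True
        return is_new

def count_unique(strings) -> int:
    trie = Trie()
    return sum(trie.insert(s) for s in strings)

print(count_unique(["car", "card", "car", "cart"]))  # -> 3
```

Note that "car", "card", and "cart" share the prefix nodes c-a-r, which is exactly the space saving over storing each string separately in a set.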


A Patricia trie, maybe, to further reduce space?


For N too large to fit into memory, maintain buckets based on the sum of the ASCII values of the characters in each string, and compare only the strings within a common bucket. This lets each comparison fit into memory. If a large number of strings land in a common bucket, other values can be used to form the buckets, such as the first few characters of the string, so as to minimize the maximum bucket size.
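A sketch of the bucketing idea, assuming the ASCII-sum key and an exact set comparison within each (small) bucket:

```python
from collections import defaultdict

def count_unique_bucketed(strings) -> int:
    """Group strings by the sum of their character codes, then compare
    only within each bucket. The bucket key is illustrative; any cheap
    fingerprint (e.g. the first few characters) works the same way."""
    buckets = defaultdict(list)
    for s in strings:
        buckets[sum(map(ord, s))].append(s)
    # Each bucket is small, so exact deduplication within it is cheap.
    return sum(len(set(group)) for group in buckets.values())

# "ab" and "ba" collide into the same bucket (same ASCII sum) but are
# still distinguished by the exact comparison inside the bucket.
print(count_unique_bucketed(["ab", "ba", "ab", "cd"]))  # -> 3
```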


1) If we can keep the hashes of the entire array in memory, then a simple hashing technique would be fast enough for this problem. But to avoid collisions we might need a sufficiently large hash (160-256 bits), which may again cause memory issues.
2) We can use a distributed hash table (if possible) to solve the memory problem.
3) We can use Bloom filters:
a) About 10 bits per element keeps the false-positive rate under 1% (source: wiki).
b) While inserting each element into the Bloom filter, we can check whether it is "definitely not in set"; we write those strings to a "resultset" file.
c) If it "can be in set", keep track of it, either by writing it to a file or by keeping its hash in memory (depending on available memory).
d) Once we have consumed the available memory or all elements, the "resultset" contains definitely unique elements. We repeat this until we have consumed all elements in the source file.
e) Then we can discard/keep the "can be in set" elements using simple comparison, or by comparing hashes in groups.


We need to keep track of strings we already encountered, so we need a "set". If we can do this in-memory, then we use our language's native set implementation, if it has one (e.g. Python has sets), or, if not, we use a hash where the strings are the keys and the values are some set value like True or 1.

The basic algorithm is this:

```
visited = set()
cnt = 0
for s in strings:
    if s not in visited:
        visited.add(s)   # remember the string so later duplicates are skipped
        cnt += 1
```

If memory is a constraint, use an RDBMS to implement your set of strings. Create a table with a single column called string, and index on that column. Use SELECT to determine if a string has been encountered, and use INSERT to add a string to the table.
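That RDBMS idea can be sketched with SQLite; an in-memory database is used here only to keep the example self-contained, and passing a file path instead of ":memory:" would actually put the set on disk:

```python
import sqlite3

def count_unique_db(strings) -> int:
    """Disk-backed 'set' via an indexed single-column table."""
    conn = sqlite3.connect(":memory:")
    # PRIMARY KEY gives us the index on the string column.
    conn.execute("CREATE TABLE seen (string TEXT PRIMARY KEY)")
    count = 0
    for s in strings:
        # SELECT to check membership, INSERT to record a new string.
        row = conn.execute("SELECT 1 FROM seen WHERE string = ?", (s,)).fetchone()
        if row is None:
            conn.execute("INSERT INTO seen VALUES (?)", (s,))
            count += 1
    conn.close()
    return count

print(count_unique_db(["x", "y", "x"]))  # -> 2
```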
