Google Interview Question
Developer Program EngineersCountry: United States
Interview Type: Phone Interview
1) Serialize function: you just go through an array and form 1 big string inserting some delimiter string between items of the array. The delimiter string must be exotic in order not to repeat some part of array's items, e.g. "~`*&#]^". The key problem here is to choose the delimiter string.
(Another solution for creating the delimiter string is the following: you scan an array and get the longest possible length of all array's items. Then in loop you create a delimiter string of random characters, which will be longer by 1 then the longest string from array. In this case the total complexity will be O(n), 'cause of 2 loops only). By the way, here goes the hint from the interviewer, if your delimiter string occurs to be quite long as well as with large initial array, you can face string overflow problems, but interviewer knows about that and hints you not to take this into consideration.
That's all.
2) Deserialize function: you take this big string and in loop restore the array, you just take the sequences of chars in between the delimiter strings.O(n).
(The implementation is quite straightforward. If you want, I can provide it.)
That's basically exactly what i said, but the interviewer I think was looking for a better solution
let's see.
In your solution you do "Check(string[i])". - this will cause O(n^2) complexity. And in case of fault (i.e. string contains your delimiter), you will have to start ab ovo: choose another delimiter and start again. And so on. This is not good, not to say bad.
There are no such problems in my solution: my delimiter is by 1 longer than any string of the array => so automatically I do not do any checks, I just go through array, and incrementally create the output. This is O(n) against your O(n^2).
And I use the hint of the interviewer - does not matter how long the delimiter is (as well as the array), I do not worry about the string overflow.
So, I offer O(n) solution. Is it possible to find better solution?
No, maybe i wrote down my explanation wrong. the deilimiter would be a character that was pre determined, a rare exotic character, and hard coded in (ex. !).
1. Loop through array of strings one by one.
2. Checkstring() would take that string, merge sort it, then check each index of the string until it find either the escape, or an unicode character value GREATER than the escape, and delimit it.
3. The the entire string would be returned, and escaped (ex Like: string1!string2)
"strings can contain any unicode character" - so, there are no exotic characters, because the task says ANY character is possible to be in string. It is the same as: ALL the characters are there among the strings of the array (quite possible variant). So, rare exotic hard coded character will not work.
Frankly, I don't understand your Checkstring() function, looks like, after it you will not be able to restore the array in Deserialize function.
Ok, the checkstring was meant to see if the exotic character that I used to delimit was there in the string before we serialized it. But I see what you are saying, if all characters are possible, how can i use one to delimit, something I would have figured out, if i had a unicode chart, and more time than 9 minutes to think.
All characters are possible in Json also though, but json manages to be serialized. So I was using a similar method.
@zr.roman your solution is perfect. However I wouldn't say the implementation is straight forward. Below is what I coded in Java.
* @author shivam.maharshi
*/
public class ArraySerializerDeserializer {
public static String serialize(String[] a) {
StringBuilder output = new StringBuilder();
int maxLenght = 0;
for (String s : a)
if (s.length() > maxLenght)
maxLenght = s.length();
maxLenght++;
output.append(maxLenght).append(":");
String delimiter = generateRandString(maxLenght);
for (String s : a)
output.append(delimiter).append(s.length()).append(":").append(s);
System.out.println(output.toString());
return output.toString();
}
public static String[] deserialize(String s, int size) {
String[] output = new String[size];
StringBuilder sb = new StringBuilder();
StringBuilder num = new StringBuilder();
int i = 0;
while (s.charAt(i) != ':') {
num.append(s.charAt(i));
i++;
}
i++;
int maxWordSize = Integer.valueOf(num.toString());
num = new StringBuilder();
boolean parsingNum = false;
boolean parsingDelimiter = true;
int charCount = 0;
int nextWordLenght = 0;
int wordCount = 0;
while (i < s.length()) {
if (parsingDelimiter) {
while (charCount < maxWordSize) {
i++;
charCount++;
}
parsingDelimiter = false;
parsingNum = true;
charCount = 0;
} else if (parsingNum) {
while (s.charAt(i) != ':') {
num.append(s.charAt(i));
i++;
}
parsingNum = false;
nextWordLenght = Integer.valueOf(num.toString());
num = new StringBuilder(); // Emptying.
i++;
} else {
while (nextWordLenght > 0) {
sb.append(s.charAt(i));
i++;
nextWordLenght--;
}
parsingDelimiter = true;
output[wordCount] = sb.toString();
wordCount++;
sb = new StringBuilder(); // Emptying.
}
}
return output;
}
private static String generateRandString(int size) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < size; i++) {
sb.append((char) (65 + (26 * Math.random())));
}
return sb.toString();
}
public static void main(String[] args) {
String[] a = { "this", "is", "very", "nice", "I", "like" };
String s = serialize(a);
String[] output = deserialize(s, a.length);
for (String out : output)
System.out.print(out + " ");
}
}
Why can't we just add length add the beginning of each string. Then all these we don't have to find the delimiter.
Suppose, first string is of 10 characters (fellowship), second is of 5 characters(sandy), third is of 7(brownie) characters. So, we will serialize the objects like this 10-fellowship5-sandy7-brownie. Here '-' character will serve as delimiter as the we need to distinguish between characters and length numbers.
While deserializing, we know that the length of the first string (characters before '-'), then read that many numbers of characters to retrieve the string. Repeat this process until we read the entire serialized string.
Why not chose a delimiter like dash (-) and escape any other dash in the string by adding another dash to it. this way you know that you've reached a delimiter if you've reached a single dash.
after you finished extracting strings into array, you can replace '--' with '-'.
this seems like an O(n) solution to me. unless I am missing something....
I only had 9 minutes, when he asked me this question. I got as far as whats below, then when I had only 2 minutes left, he asked me to explain the rest of my thinking, which I did(see comments as to what I said).
string serialize(string input[], int size)
{
string Output;
for (int i = 0; i < size, i++)
{
Check(string[i]); //Check string for escape character, use a character high on the unicode chart, sort each string indexes by unicode value using merge sort then check first character of each string, if it is greater than the character you chose, then you know the string does not contain your escape character
//Add escape character to each string to separate values for deserialization
Output +=string[i];
}
return Output;
}
string deserialize(string input[], int size)
{
//didnt get to explain this function
}
I didnt have enough time to optimize or even think of a solution that would be better than O(n2). I know there are better solutions, but nothing came to mind during the interview.
deserialize function has wrong signature, it should be:
string[] deserialize(string input, int size)
It makes sense to me to keep use the length of each string as part of the serialization. E.g. "test" becomes "4:test".
#!/usr/bin/env python
# -*- coding: utf-8 -*-
NUMERIC = xrange(ord('0'), ord('9') + 1)
def serialize(list_of_strings):
serialized_string = ''
for blob in list_of_strings:
n = len(blob)
serialized_string += str(n) + ":" + blob
return serialized_string
def get_next_length(a):
end_index = 0
for i in xrange(0, len(a)):
if ord(a[i]) not in NUMERIC:
end_index = i
break
if end_index == 0:
return None
return int(a[0:end_index])
def int_width(n):
width = 0
while n:
n //= 10
width += 1
return width
def deserialize(a):
next_length = get_next_length(a)
stop_idx = 0
result = []
while next_length:
start_idx = stop_idx + int_width(next_length) + 1 # extra colon character
stop_idx = start_idx + next_length
next_length = get_next_length(a[stop_idx:])
result.append(a[start_idx:stop_idx])
return result
if __name__ == '__main__':
some_unicode_strings = []
some_unicode_strings.append('générales okie dokie')
some_unicode_strings.append('555é okie dokie";')
serialized = serialize(some_unicode_strings)
print serialized
deserialized = deserialize(serialized)
for blob in deserialized:
print blob
zr.roman , I see your point ( unfortunately my russian is not very good to answer). Why should we use delimiter to split string array? Coudn't we use byte length followed by string encoding for the dictionary? Then if some string repeat in the array we could encode string index in dictionary, not the whole string. Just ponderiong over it.
if you forget about delimiters for strings and try to think about something else, you probably will find that it is possible to use length:
arr = ["aaa", "b", "ccccc"]
serialized string: 3,1,5-aaabccccc
to deserialize just find the first "-", take this prefix, split the prefix by "," and read strings based on the length
If you forget about delimiters and try to think about something else, you will probably find that you can use lengths.
arr = ["aaa", "b", "ccccc"]
serialized string: 3,1,5-aaabccccc
to deserialize just find the first "-", take this prefix, split the prefix by "," and based on lengths read the strings.
If you forget about delimiters and try to think about something else, you will probably find that you can use lengths.
arr = ["aaa", "b", "ccccc"]
serialized string: 3,1,5-aaabccccc
to deserialize just find the first "-", take this prefix, split the prefix by "," and based on lengths read the strings.
C# simple implementation that runs in O (N) time
public static string Serialize (IEnumerable<string> strings)
{
StringBuilder sb = new StringBuilder();
foreach (var str in strings)
{
sb.Append(str.Length); // how many characters
sb.Append("|");
sb.Append(str);
}
return sb.ToString();
}
public static IEnumerable<string> Deserialize (string s)
{
if (String.IsNullOrEmpty(s))
yield break;
var readingData = false;
int fwd =0;
for (var i = 0; i < s.Length; i++)
{
if (readingData == false )
{
fwd = ReadNumber(ref i, s);
readingData = true;
continue;
}
else
{
yield return ReadString(ref i, s, fwd);
readingData = false;
}
}
}
private static int ReadNumber (ref int index, string s)
{
StringBuilder sb = new StringBuilder(10); // maximum 10 digits can have an int32 in C#
while (s[index] != '|')
sb.Append(s[index++]);
return int.Parse(sb.ToString());
}
private static string ReadString (ref int index, string s, int howMany)
{
StringBuilder sb = new StringBuilder(howMany);
var cnt = 0;
while (cnt < howMany)
{
sb.Append(s[index++]);
cnt++;
}
index--;
return sb.ToString();
}
Inorder to serialize these kind of strings we need to place the string length along with delimiter. let's take our delimiter as ";<num>;" So, String "Hello" will be serialized as ";5;Hello"
Since, we know our delimiter string format, we can always deserialize it with ease.
I dont agree with that, please explain. If this was the optimal way, why isnt this used widely? Json is more widely used than this format, how is this more efficient space wise than using a exotic character as a separator and delimiting the character?
You are adding 2 bytes to every single string. Please explain this method
#!/usr/bin/env python
# -*- coding: utf-8 -*-
def serialize(arr):
s1 = ''
for i in xrange(len(arr)):
if not isinstance(arr[i],unicode):
text = arr[i].decode('utf-8')
else:
text = arr[i]
s1 += text
arr[i]=len(text)
return s1,arr
def deserialize(str,arr):
start=0
l=0
for i in xrange(len(arr)):
l += arr[i]
text = str[start:l]
if isinstance(text,unicode):
text = str[start:l].encode('utf-8')
arr[i] = text
start=l
return arr
#Test:
a=['serialize','deserialize','YćY','ℙƴ☂ℌøἤ','2016']
print a
first = serialize(a)
print first[0],first[1]
second = deserialize(first[0],first[1])
print second
if you forget about delimiters for strings and try to think about something else, you probably will find that it is possible to use length:
- Anonymous December 09, 2015arr = ["aaa", "b", "ccccc"]
serialized string: 3,1,5-aaabccccc
to deserialize just find the first "-", take this prefix, split the prefix by "," and read strings based on the length