Gold-Tier Interview Question
Software Engineer / DevelopersSuppose you read the following two previous bytes (I'm only listing the first bit of each byte):
... ? 1 0 [current byte].
This could yield several scenarios (I'll parenthesis the whole characters and read back one more byte)
... (1 1) (0) [current byte].
... 1) (1 0) [current byte].
... (0) (1 0) [current byte].
... 0) (1 0) [current byte].
So, Sam and Srik... it appears reading two previous bytes back is not necessarily sufficient.
Is four bytes even sufficient?
What if you had:
... 1 1 1 0 [current byte]
There are at least these two scenarios:
... (1 1) (1 0) [current byte]
... 1)(1 1) (0) [current byte]
I'm only describing the tricky case (...1 1 1 0). When the end byte is 1 it's obvious we're dealing with a 2 byte character.
We need too look back until we hit a 0, that's the end of a character for sure. Then you count the number of 1's in between the two zeros.
0 1 1 ...1 1 1 0 - if the number of ones is even it's single byte, else double byte.
PS: I asked this question in an interview the other day :)
If previous two are two zero the previous character is a single byte character. Otherwise keep looking until you reach a sequence 01. For example if we have two pointers prev, curr and the sequence 11001111 you stop when prev reaches in position 4 and curr reaches in position 5. The reason we stop at 01 is that this seq. gives us the information that the 1 means that the character is double byte character hence we can backtrack from there to the position provided as the input and ascertain each character size on the way.
It needs tracking for the first byte that starts with 0 and is not the immediate previous byte.
.[0]..[1][current]... or ..[0]..[0][current]
so, the byte after the first [0] shown above will tell us that the next byte will be a new char. Then, we can tracking down. Or, it will reach the beginning char.
two to three previous bytes inspection
pattern 00 -- previous common character starts with immediate left
pattern 11 -- previous uncommon character starts with two bytes left
pattern 01 -- previous uncommon character starts with one byte left
patttern 10 -- needs to look the third byte to the left and we derive 110
or 010 which is answered above
No, it can’t be found unless you restart from the beginning.
Since khanji uses variable bit length to encode its characters, it is a case of Prefix coding, which is designed to get the benefit of less space compared to fixed bit length encoding like ASCII. In such encoding the bit patterns of characters are so designed that any character can be identified from its unique prefix bits (certain bits from left if you write bits of a character in little Endian style) but the characters can’t be identified from its prefix bits. This is why, when a set of such characters are encoded in binary form, each characters can be identified by reading from beginning (left to right) but moving in reverse direction will fail to identify the characters.
One such example is the Network IP address class scheme. Class A has prefix 0, class B prefix 10, Class C 110 and Class D with prefix 1110. Looking at any IP address’ prefix you can tell its class but not from its suffix. In this particular example, all IP addresses are fixed length of 32 bits so a suffix of 32 bit (eventually all the bits) can do the task, but in case of variable IP address lengths it would have not been possible to identify their class from their suffix.
Errata of previous post:
..In such encoding the bit patterns of characters are so designed that any character can be identified from its unique prefix bits (certain bits from left if you write bits of a character in little Endian style) but the characters can’t be identified from its suffix bits...
Ok, The whole point here is to reach an index that one is sure to conclude is the start of a character. How?? Here is a solution: (The dots and the bits in brackets represent the MSB of that index)
-Case (0)(0)......(Given Index).... Here definitely, the second zero is the start of a character (ASCII) where as nothing can be said of the first zero
-Case (0)(1)......(Given Index)... Here, the (1) is definitely the start of a character and nothing can be said about 0
-Case (1)(0)....(Given index).... Here, 10 can be a character or 0 can be a character , so no use
-Case (1)(1)....(Given index)..... Here 11 can be a character or second 1 can be start of a character etc., so ambiguous.
From above 4 cases, the patterns that are useful are 00 and 01. Now, the solution is simple:
1) Start
2) cur=given index
3) q=MSB of cur-1 index
4) p=MSB of q-1 index
5) if(p is 0) {index associated with q is the start of a character}
6)else{cur=q}
7) repeat from 3 until if(p is 0) is satisfied
Now you have index associated with q as the start of character. Depending on whether q is a 0 or 1 increment until you reach the given index each time storing the previous index and thats the answer.
So, initially, u are going back until you reach the pattern 00 or 01.
Some observations:
1.if the previous byte begins with 1, it can only be a part of two bytes, so the previous character begins two bytes ahead
2.if the previous byte begins with 0, it can be two situations:
1) it is the beginning of the previous character
2) it is part of the previous character
3. so keep looking forward to the previous previous bytes:
if it begins with 0, then the previous byte is the beginning
if it begins with 1, then keep looking forward.
4. for the previous previous previous previous byte:
if it begins with 1, then the previous previous byte is the beginning
if it begins with 0, then the previous byte is the begining
So we shall look into the previous four bytes.From the beginning,mkeep looking forward until we meet one byte that begins with 1
Hi all,
- sudhanshu22 June 09, 2008I am pretty new to this forum.
My solution to the above problem would be:
We already know that current pointer points to beginning of a character.
Our target is know where does the previous character begin. The previous character can begin at 1 byte left or 2 bytes left. We don't bother about other characters.
two/three/four previous bytes inspection
pattern 00 -- previous common character starts with immediate left
pattern 11 -- previous uncommon character starts with two bytes left
pattern 01 -- This is not an allowed pattern. Note that we will have found a character when we have moved 2 bytes to the left. here 1 means it is not a1 byte character and 0 means it is not a 2 byte character.
patttern 10 -- This is ambiguous and we need to explore furthur.
needs to look the third byte to the left and we derive 110
or 010 .
If it is 010,then it is (0)(1 0), i.e. the previous character is a 2 byte character, note that if we took
(0 1)(0), we would encounter a (0 1) pattern which is not an allowed pattern.
if it is 110, then it can be (..1) (1 0) or it can be
(1 1)(0).So we need to explore furthur one byte.
So we can have 1110 or 0110.
In case of 0110, we can have (..0)(1 1)(0) again because (0 1) is not an allowed pattern.
In case of 1110, we have ambiguity, so explore furthur.say we get 01110, so the pattern is (1 1)(1 0).( see that (1 1)(0) is not allowed because we will again have 0 1 , which is not an allowed pattern.
lesson: we get ambiguity for 11111...0 series, and once we get 01111...0, we stop.
So once we get a zero on left, we stop. Now for finding the answer, we can look at our lesson, we can count the no of 1's , if we get an even no of 1's,(like 0110), we can say that last character begins at the 1st byte from the current position.
if we get an odd no of 1's(like 01110), we cna say that last character begins at 2 bytes from the current character.
Hope the problem is solved.