Variable-length integers encoding is an algorithm to compress fixed-length integers into variable-length integers to save space when you want to store or transmit numbers.
In this blog post, we are looking at a varint implementation for compressing int
, a datatype that in Java has a
fixed length of 32 bits (4 bytes).
The same algorithms can also be applied to the long
(64 bit) data type; however, in this blog post, I focus solely on the int
data type.
Introduction ¶
When you look at the binary representation of int
300, you see that the first two bytes don't contain any information and only the last two bytes
are necessary to represent the number 300.
00000000 00000000 00000001 00101100
Even worse for the number 100, we only need one byte, and the other 3 bytes don't contain any information.
00000000 00000000 00000000 01100100
If you know in advance that a number always fits into 1 or 2 bytes, the following approach might not be the best solution. You could save much
more bandwidth by choosing a smaller data type and use a fixed-length encoding. In Java, you switch to byte
or short
if a
value always fits into one or two bytes.
However, what if you don't know the number range in advance, and it could be a number between -2,147,483,648 to 2,147,483,647 (max range of int
).
With varints, we can store a 300 in 2 bytes, a 100 in one byte, and 1,000,000 in 3 bytes. This can save much space if your application mostly needs to handle small numbers and only sporadically very big numbers.
In the following sections, we see how we can convert int
into a variable-length int
and vice versa with the continuation bits encoding.
In the first part, we focus on positive numbers, and in the second part, we look at a solution for both negative and positive numbers.
Positive only ¶
In this section, we focus on positive int
and how we can convert them to and from variable-length integers.
Encoder: fixed length int -> variable length int ¶
The idea behind this variable-length encoding is to split the number into groups of 7 bits. Then take every group where at least 1 bit is set and always the least significant (rightmost) group of 7 bits.
Add one bit to each 7 bits group as the leftmost bit (most significant bit, MSB) to form a byte. Set this bit to 1 except in the least significant byte (rightmost byte) set it to 0. This extra bit is an indicator for the decoder whether there are more bytes following.
In the following algorithm, we are going to reverse the order of the bytes. You could also preserve the order. It does not matter as long as the encoder and decoder agree on the order. This is also the same encoding that Protocol Buffers uses.
Let's see how this works for the number 300.
300 is represented with the following 4 bytes
00000000 00000000 00000001 00101100
The first step is to split these 32 bits into groups of 7.
0000 0000000 0000000 0000010 0101100
Take the last group, 0101100
, and put it first into the result. Add the second group (from the right), because it contains bits that are set.
Ignore the other 3 groups because none of the bits are set.
We end up with the following 14 bits
0101100 0000010
Starting from the left, add 1 as the most significant bit to the groups and 0 to the last group. The final result looks like this. 300 using 4 bytes in fixed-length encoding can be stored as 2 bytes in variable-length encoding.
10101100 00000010
Next, we are looking at a method that implements this encoding in Java.
public static byte[] encodeUInt32(int inputValue) {
int value = inputValue;
byte[] buffer = new byte[5];
int position = 0;
The method starts with some variable declarations and reserves a buffer of 5 bytes. This is also one of the downsides of this variable-length encoding. Because we are going to split the numbers into groups of 7 bits and add one extra bit, a large number will be encoded in 5 bytes.
Next, the method starts a loop.
while (true) {
// ~0x7F = 0xffffff80
if ((value & 0b11111111111111111111111110000000) == 0) {
buffer[position++] = (byte) value;
break;
}
The first check is a bitwise AND where the code looks if there are no bits set between positions 8 and 32. If that is the case, add the last 7 bits to the result and stop encoding. Because this is the last group, the MSB has to be 0, which happens here implicitly because with the if check, we know that the bit at position 8 is zero (if it were 1, then the check would return false). The (byte) cast takes the least significant byte from the 4 bytes.
Here the if check if we encode int
300.
00000000 00000000 00000001 00101100 (int 300)
& 11111111 11111111 11111111 10000000
= 00000000 00000000 00000001 00000000 (int 256)
The result of the bitwise AND is 256, and the if condition evaluates to false. The encoder now knows that this is not the last group, so he adds the least significant 7 bits to the result and sets the MSB to 1
buffer[position++] = (byte) ((value & 0b1111111) | 0b10000000);
The bitwise AND with 1111111 unsets all bits that are set between positions 8 and 32.
00000000 00000000 00000001 00101100 (int 300)
& 00000000 00000000 00000000 01111111
= 00000000 00000000 00000000 00101100
The bitwise OR with 10000000 then sets the bit at position 8.
00000000 00000000 00000000 00101100
| 00000000 00000000 00000000 10000000
= 00000000 00000000 00000000 10101100
Moreover, the cast to byte
extracts the least significant byte
10101100
This is the first byte of our result.
The algorithm then unsigned right shift the variable 7 bits to the right. This removes the last 7 bits from the input because we already processed them.
value >>>= 7;
00000000 00000000 00000001 00101100 (int 300)
>>> 7 00000000 00000000 00000000 00000010
The method continues with the loop and runs the if check again.
This time value
is 2 (00000000 00000000 00000000 00000010
).
Because none of the bits are set between positions 8 and 32, the bitwise AND returns the number 0, and the condition evaluates to true.
00000000 00000000 00000000 00000010 (int 2)
& 11111111 11111111 11111111 10000000
= 00000000 00000000 00000000 00000000 (int 0)
The algorithm extracts the least significant byte (with the byte
cast) and adds it to the result.
This is the last group of the result, with the MSB set to 0.
10101100 00000010
In the temporary buffer
byte array of length 5, only 2 bytes are relevant, so the method copies these
two bytes into a new byte array of length 2 and returns it to the caller.
byte[] dest = new byte[position];
System.arraycopy(buffer, 0, dest, 0, position);
return dest;
Decoder: variable length int -> fixed length int ¶
The job of the decoder is to loop over the bytes, take the least significant 7 bits of each byte and add them to the result. Check the most significant bit (MSB), if it is set continue the loop.
Notice that the decoder needs to know beforehand that this is a variable length encoded number and that the order of the bytes is reversed. He can't recognize this just from the byte array he receives. Protocol Buffers does this by sending additional type information in the encoded byte stream.
Java implementation of the decoder. The decoder gets a byte array as a parameter and loops over each byte.
public static int decodeUInt32(byte[] input) {
int result = 0;
int shift = 0;
for (int ix = 0; ix < input.length; ix++) {
byte b = input[ix];
result |= (b & 0b1111111) << shift;
shift += 7;
if ((b & 0b10000000) == 0) {
return result;
}
}
return result;
}
Let's see this in action with our encoded number 300: input = [10101100, 00000010]
.
The first byte is 10101100
, the method extracts the 7 least significant bits with a bitwise AND with 1111111.
The MSB was added by the encoder, so the decoder has to remove it.
10101100
& 01111111
= 00101100
The decoder then bit shifts the result to the left. In the first iteration, shift
is 0, so nothing happens.
The result of this AND and shift operation is added to the result
with a bitwise OR.
Because result
is an int
, Java expands the operators to 4 bytes.
00000000 00000000 00000000 00000000 (int 0, initial value of result)
| 00000000 00000000 00000000 00101100
= 00000000 00000000 00000000 00101100
The algorithm now increments shift
by 7 and checks with a bitwise AND if the MSB is set.
10101100
& 10000000
= 10000000 (int 128)
The comparison with 0 evaluates to false, and the decoder continues the for iteration with the next byte from the input
array.
The next byte is 00000010
. Again, the decoder extracts the 7 least significant bits, and in this iteration shifts the result 7 bits to the left.
00000010
& 01111111
= 00000010
00000010
<< 7 00000001 00000000
And applies this result to the result
variable with a bitwise OR.
00000000 00000000 00000000 00101100 (result from 1st iteration)
| 00000000 00000000 00000001 00000000 (AND and shift from 2nd iteration)
= 00000000 00000000 00000001 00101100 (int 300)
Next, the decoder checks if the MSB is set
00000010
& 10000000
= 00000000 (int 0)
The bitwise AND returns 0, so the comparison with 0 evaluates to true, and the decoder returns the result.
Downside ¶
As mentioned above, encoding large numbers like 2,000,000,000 ends up using 5 bytes. So if you know in advance that most int
numbers in your application will be greater than 268,435,455, stick with the fixed-length encoding of 4 bytes. 268,435,455 is the last number that can be represented with 4 bytes with this encoding.
2,000,000,000 ---variable length encode---> 10000000 10101000 11010110 10111001 00000111
Negative numbers have the same problem. A negative int
, like -3, in Java, is internally encoded like this
11111111 11111111 11111111 11111101 = int -3
Encoding and decoding with the algorithm above works but every negative number will be encoded into 5 bytes, and if all your numbers are negative would increase the data size instead of decreasing it.
// int -3 variable-length encoded
11111101 11111111 11111111 11111111 00001111
In the next section, we will see a solution to this problem.
Negative and Positive ¶
A simple solution, if all your numbers are negative, is to convert them to positive numbers and then run them through the encoder.
In Java you can use the java.lang.Math.abs()
method for this purpose
java.lang.Math.abs(-3) == 3
Your application then needs to know that these numbers, although represented as positive numbers, are negative and you need to convert them to negative before you display them to a user or further process them.
But what if the input dataset contains a mix of negative and positive numbers. For this use case, we can use a smart solution
called ZigZag encoding. With this encoding, every negative number will be converted to a positive number.
But this would lead to confusion if an algorithm would simply convert -3 to 3. You could no longer differentiate between a positive 3
and a negative 3.
ZigZag encoding has a smart solution for this and encodes the positive numbers too. Every positive number becomes an even number, and every negative number becomes an odd number. The encoder doubles every positive number, and for every negative number, it takes the absolute value, doubles it, and subtracts 1.
-3 -> 5
-2 -> 3
-1 -> 1
0 -> 0
1 -> 2
2 -> 4
3 -> 6
// positive -> 2 * positive
// negative -> 2 * abs(negative) - 1
The decoder then only has to check if the number is even, divide it by 2, if odd add 1, divide it by 2 and subtract it from 0.
even encoded number / 2 -> positive decoded number
0 - (odd encoded number + 1) / 2 -> negative decoded number
Instead of doing this with arithmetic operations, we can do the ZigZag encoding and decoding with faster bitwise operations.
ZigZag Encoder ¶
The code for the ZigZag encoder looks like this.
(value << 1) ^ (value >> 31)
Let's look at two examples: -3 and 3
value = 3
00000000 00000000 00000000 00000011 (int 3)
<< 1 00000000 00000000 00000000 00000110 (int 6)
>> 31 00000000 00000000 00000000 00000000
The signed left shift operation by 1 is equivalent to a multiplication by 2. The signed right shift operation by 31 always results in 0 if the number is positive. The encoder then bitwise XOR them together.
00000000 00000000 00000000 00000110 (int 6, result of << 1)
^ 00000000 00000000 00000000 00000000 (result of >> 31)
= 00000000 00000000 00000000 00000110 (int 6)
So for the number 3, we end up with the encoded number 6.
value = -3
11111111 11111111 11111111 11111101 (int -3)
<< 1 11111111 11111111 11111111 11111010 (int -6)
>> 31 11111111 11111111 11111111 11111111
The left shift operation by 1 results in a multiplication by 2. The right shift results in a number where all 32 bit are set. And when the encoder XOR them together, it results in a positive number.
11111111 11111111 11111111 11111010 (int -6, result of << 1)
^ 11111111 11111111 11111111 11111111 (result of >> 31)
= 00000000 00000000 00000000 00000101 (int 5)
After the ZigZag encoding, the number is fed into the variable-length encoding algorithm described above.
ZigZag Decoder ¶
The decoder first runs the byte array through the variable length decoder we discussed above.
Then the ZigZag Decoder needs to decode the returned int
value with the following algorithm.
(result >>> 1) ^ -(result & 1)
Again we look at the two numbers -3 and 3. The variable-length decoder returns 5 for -3 and 6 for the number 3.
encoded 5, decoded -3
The unsigned right shift is equivalent to a division by 2 and subtraction by 1, if it's an odd number.
00000000 00000000 00000000 00000101 (int 5)
>>> 1 00000000 00000000 00000000 00000010 (int 2)
In an odd number, the least significant bit is always set and with a right shift by 1 the bit gets removed.
The bitwise AND with 1 is a simple check if a number is odd or even. This operation returns 1 if it's an odd number and 0 if it's even.
00000000 00000000 00000000 00000101
$ 00000000 00000000 00000000 00000001
= 00000000 00000000 00000000 00000001
Then the unary minus operator negates the result (1 -> -1
).
And finally the decoder XOR the two values together
00000000 00000000 00000000 00000010 (int 2)
^ 11111111 11111111 11111111 11111111 (int -1)
= 11111111 11111111 11111111 11111101 (int -3)
And we arrive at the decoded value of -3
encoded 6, decoded 3
The unsigned right bit shift is a division by 2
00000000 00000000 00000000 00000110 (int 6)
>>> 1 00000000 00000000 00000000 00000011 (int 3)
AND with 1 results in 0 because 6 is an even number
00000000 00000000 00000000 00000110 (int 6)
$ 00000000 00000000 00000000 00000001 (int 1)
= 00000000 00000000 00000000 00000000 (int 0)
Lastly, XOR the two values together, and we arrive at the decoded number 3.
00000000 00000000 00000000 00000011 (int 3)
^ 00000000 00000000 00000000 00000000 (int 0)
= 00000000 00000000 00000000 00000011 (int 3)
This concludes this deep dive into bits and bytes. As mentioned at the beginning, variable-length encoding might not always be the best solution for compressing numbers, and other methods can save you much more bandwidth.
Imagine you have two int
a and b, and the application only stores the values 0 to 15 in these two variables.
The variable-length encoding presented here would sure help you by encoding both variables into one byte each. But you always have to run the variables through an encoder and decoder.
It might be better to switch to a smaller data type, for this use case byte
, store them in fixed-length encoding, and your application no longer needs to encode and decode them.
With this example, you could even go one step further. Because the values 0 to 15 can be represented with only 4 bits, an application could merge them into one byte.
byte a = 8;
byte b = 3;
// encode
byte encoded = (byte)(a << 4 | b);
// decode
byte decodedA = (byte)(encoded >> 4 & 0b1111);
byte decodedB = (byte)(encoded & 0b1111);
For more information about variable-length encoding, see the Wikipedia article about this topic:
https://en.wikipedia.org/wiki/Variable-length_quantity
See also the Java implementation of this algorithm in the Protocol Buffers library:
CodedInputStream.java
CodedOutputStream.java
The source code of this blog post is hosted on GitHub:
https://github.com/ralscha/blog2019/blob/master/variable-length-int/VariableLengthInt.java