Home | Send Feedback

Variable length integers

Published: June 18, 2019  •  java

Variable length integers encoding, is an algorithm to compress fixed length integers into variable length integers to save space when you want to store or transmit numbers.

In this blog post, we are looking at a varint implementation for compressing int, a datatype that in Java has a fixed length of 32 bits (4 bytes).

The same algorithms can also be applied to the long (64 bit) data type, however, in this blog post, I focus solely on the int datatype.

Introduction

When you look at the binary representation of int 300, you see that the first two bytes don't contain any information and only the last two bytes are necessary to represent the number 300.

00000000 00000000 00000001 00101100

Even worse for the number 100, we only need one byte and the other 3 bytes don't contain any information.

00000000 00000000 00000000 01100100

If you know in advance that a number always fits into 1 or 2 bytes, the following approach might not be the best solution. You could save much more bandwidth by choosing a smaller datatype and use a fixed length encoding. In Java, you switch to byte or short if a value always fits into one or two bytes.

However, what if you don't know the number range in advance and it could be a number between -2,147,483,648 to 2,147,483,647 (max range of int).

With varints, we can store a 300 in 2 bytes, a 100 in one byte and 1,000,000 in 3 bytes. This can save a lot of space if your application mostly needs to handle small numbers and only sporadically very big numbers.

In the following sections, we see how we can convert int into a variable length int and vice versa with the continuation bits encoding. In the first part, we focus on positive numbers, and in the second part, we look at a solution for both, negative and positive numbers.

Positive only

In this section, we focus on positive int and how we can convert them to and from variable length integers.


Encoder: fixed length int -> variable length int

The idea behind this variable length encoding is to split the number into groups of 7 bits. Then take every group where at least 1 bit is set and always the least significant (right most) group of 7 bits.

Add one bit to each 7 bits group as the leftmost bit (most significant bit, MSB) to form a byte. Set this bit to 1 except in the least significant byte (rightmost byte) set it to 0. This additional bit is an indicator for the decoder whether there are more bytes following.

In the following algorithm, we are going to reverse the order of the bytes. You could also preserve the order it does not matter as long as the encoder and decoder agree on the order. This is also the same encoding that Protocol Buffers uses.

Let's see how this works for the number 300.

300 is represented with the following 4 bytes

00000000 00000000 00000001 00101100

The first step is to split these 32 bits into groups of 7.

0000 0000000 0000000 0000010 0101100

Take the last group, 0101100 and put it first into the result. Add the second group (from the right), because it contains bits that are set. Ignore the other 3 groups because none of the bits are set.

We end up with the following 14 bits

0101100 0000010

Starting from the left add 1 as most significant bit to the groups and 0 to the last group. The final result looks like this. 300 using 4 bytes in fixed length encoding can be stored as 2 bytes in variable length encoding.

10101100 00000010

Next, we are looking at a method that implements this encoding in Java.

  public static byte[] encodeUInt32(int inputValue) {
    int value = inputValue;
    byte[] buffer = new byte[5];
    int position = 0;

VariableLengthInt.java

The method starts with some variable declarations and reserves a buffer of 5 bytes. This is also one of the downsides of this variable length encoding. Because we are going to split the numbers into groups of 7 bits and add one additional bit, large number will be encoded in 5 bytes.

Next, the method starts a loop.

    while (true) {
      // ~0x7F = 0xffffff80
      if ((value & 0b11111111111111111111111110000000) == 0) {
        buffer[position++] = (byte) value;
        break;
      }

VariableLengthInt.java

The first check is a bitwise AND where the code looks if there are no bits set between positions 8 and 32. If that is the case, add the last 7 bits to the result and stop encoding. Because this is the last group, the MSB has to be 0 which happens here implicitly because with the if check we know that the bit at position 8 is zero (if it were 1 then the check would return false). The (byte) cast takes the least significant byte from the 4 bytes.

Here the if check if we encode int 300.

   00000000 00000000 00000001 00101100  (int 300)
 & 11111111 11111111 11111111 10000000
 = 00000000 00000000 00000001 00000000  (int 256)

The result of the bitwise AND is 256 and the if condition evaluates to false. The encoder now knows that this is not the last group, so he adds the least significant 7 bits to the result and sets the MSB to 1

      buffer[position++] = (byte) ((value & 0b1111111) | 0b10000000);

VariableLengthInt.java

The bitwise AND with 1111111 unsets all bits that are set between positions 8 and 32.

   00000000 00000000 00000001 00101100 (int 300)
 & 00000000 00000000 00000000 01111111
 = 00000000 00000000 00000000 00101100

The bitwise OR with 10000000 then sets the bit at position 8.

   00000000 00000000 00000000 00101100
 | 00000000 00000000 00000000 10000000
 = 00000000 00000000 00000000 10101100

Moreover, the cast to byte extracts the least significant byte

10101100

This is the first byte of our result.

The algorithm then unsigned right shift the variable 7 bits to the right. This removes the last 7 bits from the input because we already processed them.

      value >>>= 7;

VariableLengthInt.java

        00000000 00000000 00000001 00101100 (int 300)
 >>> 7  00000000 00000000 00000000 00000010

The method continues with the loop and runs the if check again. This time value is 2 (00000000 00000000 00000000 00000010).

Because none of the bits are set between positions 8 and 32 the bitwise AND returns the number 0 and the condition evaluates to true.

   00000000 00000000 00000000 00000010 (int 2)
 & 11111111 11111111 11111111 10000000
 = 00000000 00000000 00000000 00000000 (int 0)

The algorithm extracts the least significant byte (with the byte cast) and adds it to the result. This is the last group of the result, with the MSB set to 0.

10101100 00000010

In the temporary buffer byte array of length 5 only 2 bytes are relevant, so the method copies these two bytes into a new byte array of length 2 and returns it to the caller.

    byte[] dest = new byte[position];
    System.arraycopy(buffer, 0, dest, 0, position);
    return dest;

VariableLengthInt.java


Decoder: variable length int -> fixed length int

The job of the decoder is to loop over the bytes, take the least significant 7 bits of each byte and add them to the result. Check the most significant bit (MSB), if it is set continue the loop.

Notice that the decoder needs to know beforehand that this is a variable length encoded number and that the order of the bytes is reversed. He can't recognize this just from the byte array he receives. Protocol Buffers does this by sending additional type information in the encoded byte stream.

Java implementation of the decoder. The decoder gets a byte array as parameter and loops over each byte.

  public static int decodeUInt32(byte[] input) {
    int result = 0;
    int shift = 0;
    for (int ix = 0; ix < input.length; ix++) {
      byte b = input[ix];
      result |= (b & 0b1111111) << shift;
      shift += 7;
      if ((b & 0b10000000) == 0) {
        return result;
      }
    }
    return result;
  }

VariableLengthInt.java

Let's see this in action with our encoded number 300: input = [10101100, 00000010].

The first byte is 10101100, the method extracts the 7 least significant bits with a bitwise AND with 1111111. The MSB was added by the encoder, so the decoder has to remove it.

   10101100
 & 01111111
 = 00101100

The decoder then bit shifts the result to the left. In the first iteration, shift is 0, so nothing happens. The result of this AND and shift operation is added to the result with a bitwise OR.

Because result is an int, Java expands the operators to 4 bytes.

   00000000 00000000 00000000 00000000 (int 0, initial value of result)
 | 00000000 00000000 00000000 00101100
 = 00000000 00000000 00000000 00101100

The algorithm now increments shift by 7 and checks with a bitwise AND if the MSB is set.

   10101100
 & 10000000
 = 10000000 (int 128)

The comparison with 0 evaluates to false and the decoder continues the for iteration with the next byte from the input array.

The next byte is 00000010. Again, the decoder extracts the 7 least significant bits, and in this iteration shifts the result 7 bits to the left.

   00000010
 & 01111111
 = 00000010
               00000010
 << 7 00000001 00000000

And applies this result to the result variable with a bitwise OR.

   00000000 00000000 00000000 00101100 (result from 1st iteration)
 | 00000000 00000000 00000001 00000000 (AND and shift from 2nd iteration)
 = 00000000 00000000 00000001 00101100 (int 300)

Next, the decoder checks if the MSB is set

   00000010
 & 10000000
 = 00000000 (int 0)

The bitwise AND returns 0, so the comparison with 0 evaluates to true, and the decoder returns the result.


Downside

As mentioned above, encoding large numbers like 2,000,000,000 ends up using 5 bytes. So if you know in advance that most int numbers in your application will be greater than 268,435,455, stick with the fixed length encoding of 4 bytes. 268,435,455 is the last number that can be represented with 4 bytes with this encoding.

2,000,000,000  ---variable length encode--->  10000000 10101000 11010110 10111001 00000111

Negative numbers have the same problem. A negative int, like -3, in Java, is internally encoded like this

11111111 11111111 11111111 11111101 = int -3

Encoding and decoding with the algorithm above works but every negative number will be encoded into 5 bytes and if all your numbers are negative would increase the data size instead of decreasing it.

// int -3  variable length encoded
11111101 11111111 11111111 11111111 00001111

In the next section, we will see a solution for this problem.

Negative and Positive

A simple solution, if all your numbers are negative, is to convert them to positive numbers and then run them through the encoder. In Java you can use the java.lang.Math.abs() method for this purpose

java.lang.Math.abs(-3) == 3

Your application then needs to know that these numbers, although represented as positive numbers, are negative and you need to convert them to negative before you display them to a user or further process them.


But what if the input dataset contains a mix of negative and positive numbers. For this use case, we can use a smart solution called ZigZag encoding. With this encoding, every negative number will be converted to a positive number.
But this would lead to confusion if an algorithm would simply convert -3 to 3. You could no longer differentiate between a positive 3 and a negative 3.

ZigZag encoding has a smart solution for this and encodes the positive numbers too. Every positive number becomes an even number, and every negative number becomes an odd number. The encoder doubles every positive number, and for every negative number, it takes the absolute value, doubles it and subtracts 1.

-3  -> 5 
-2  -> 3
-1  -> 1
 0  -> 0
 1  -> 2
 2  -> 4
 3  -> 6

 // positive -> 2 * positive
 // negative -> 2 * abs(negative) - 1

The decoder then only has to check if the number is even, divide it by 2, if odd add 1, divide it by 2 and subtract it from 0.

even encoded number / 2 -> positive decoded number
0 - (odd encoded number + 1) / 2 -> negative decoded number

Instead of doing this with arithmetic operations, we can do the ZigZag encoding and decoding with faster bitwise operations.


ZigZag Encoder

The code for the ZigZag encoder looks like this.

(value << 1) ^ (value >> 31)

Let's look at two examples: -3 and 3

value = 3

       00000000 00000000 00000000 00000011 (int 3)
<< 1   00000000 00000000 00000000 00000110 (int 6)
>> 31  00000000 00000000 00000000 00000000

The signed left shift operation by 1 is equivalent to a multiplication by 2. The signed right shift operation by 31 always result in 0 if the number is positive. The encoder then bitwise XOR them together.

   00000000 00000000 00000000 00000110 (int 6, result of << 1)
 ^ 00000000 00000000 00000000 00000000 (result of >> 31)
 = 00000000 00000000 00000000 00000110 (int 6)

So for the number 3 we end up with the encoded number 6.


value = -3

       11111111 11111111 11111111 11111101 (int -3)
<< 1   11111111 11111111 11111111 11111010 (int -6)
>> 31  11111111 11111111 11111111 11111111 

The left shift operation by 1 results in a multiplication by 2. The right shift results in a number where all 32 bit are set. And when the encoder XOR them together, it results in a positive number.

   11111111 11111111 11111111 11111010 (int -6, result of << 1)
 ^ 11111111 11111111 11111111 11111111 (result of >> 31)
 = 00000000 00000000 00000000 00000101 (int 5)

After the ZigZag encoding, the number is fed into the variable length encoding algorithm described above.


ZigZag Decoder

The Decoder first runs the byte array through the variable length decoder we discussed above. Then the ZigZag Decoder needs to decode the returned int value with the following algorithm.

(result >>> 1) ^ -(result & 1)

Again we look at the two numbers -3 and 3. The variable length decoder returns 5 for -3 and 6 for the number 3.

encoded 5, decoded -3

The unsigned right shift is equivalent to a division by 2 and subtraction by 1, if it's an odd number.

       00000000 00000000 00000000 00000101 (int 5)
>>> 1  00000000 00000000 00000000 00000010 (int 2)

In an odd number, the least significant bit is always set and with a right shift by 1 the bit gets removed.

The bitwise AND with 1 is a simple check if a number is odd or even. This operation returns 1 if it's an odd number and 0 if it's even.

   00000000 00000000 00000000 00000101
 $ 00000000 00000000 00000000 00000001
 = 00000000 00000000 00000000 00000001

Then the unary minus operator negates the result (1 -> -1). And finally the decoder XOR the two values together

   00000000 00000000 00000000 00000010 (int 2)
 ^ 11111111 11111111 11111111 11111111 (int -1)
 = 11111111 11111111 11111111 11111101 (int -3)

And we arrive at the decoded value of -3


encoded 6, decoded 3

The unsigned right bit shift is a division by 2

       00000000 00000000 00000000 00000110 (int 6)
>>> 1  00000000 00000000 00000000 00000011 (int 3)

AND with 1 results in 0 because 6 is an even number

   00000000 00000000 00000000 00000110 (int 6)
 $ 00000000 00000000 00000000 00000001 (int 1)
 = 00000000 00000000 00000000 00000000 (int 0)

Lastly XOR the two values together and we arrive at the decoded number 3.

   00000000 00000000 00000000 00000011 (int 3)
 ^ 00000000 00000000 00000000 00000000 (int 0)
 = 00000000 00000000 00000000 00000011 (int 3)

This concludes this deep dive into bits and bytes. As mentioned at the beginning, variable length encoding might not always be the best solution for compressing numbers, and other methods can save you much more bandwidth.

Imagine you have two int a and b and the application only stores the values 0 to 15 in these two variables.

The variable length encoding presented here would sure help you by encoding both variables into one byte each. But you always have to run the variables through an encoder and decoder. It might be better to switch to a smaller data type, for this use case byte, store them in fixed length encoding and your application no longer needs to encode and decode them.

With this example, you could even go one step further. Because the values 0 to 15 can be represented with only 4 bits, an application could merge them into one byte

byte a = 8;
byte b = 3;
      
// encode
byte encoded = (byte)(a << 4 | b);

// decode
byte decodedA = (byte)(encoded >> 4 & 0b1111); 
byte decodedB = (byte)(encoded & 0b1111);

For more information about variable length encoding, see the Wikipedia article about this topic:
https://en.wikipedia.org/wiki/Variable-length_quantity

See also the Java implementation of this algorithm in the Protocol Buffers library:
CodedInputStream.java
CodedOutputStream.java

The source code of this blog post is hosted on GitHub:
https://github.com/ralscha/blog2019/blob/master/variable-length-int/VariableLengthInt.java