Back in the days when I was playing with SSE instructions, I was trying to optimize every workload that I could think of. One of these was to convert thousands of IPv4 strings to 32-bit numbers for further processing. This article shows one way to optimize such a thing, and how the SSE instructions set can be used to get the better of your $1000 Intel CPU :)

All the code used in this article can be downloaded here: https://github.com/quarkslab/ip_conv_sse.

Notes:

- we assume the reader has some knowledge of how SIMD instructions work. Refer to this article (http://sci.tuomastonteri.fi/programming/sse) for a nice introduction.
- characters between simple quotes represent the ASCII values of these characters (like in C). For instance, '0' == 0x30.
- we loosely define SSE instructions by their C intrinsics name, for the sake of clarity.

## The workload

It is as simple as this:

- take an IPv4 string like
`37.187.47.70` - convert it to a little-endian 32-bit unsigned number, that is
`0x25bb2f46`.

We only support this format, and not stuff like `"0x25BB2F46"` (thing that
inet_aton supports for instance).

## Benchmarking

The first thing to do when optimizing such algorithm is to have a reference benchmark.
We could have taken eglibc's `inet_aton`, but as said previously, it does a lot of
checks and efforts to support different formats, and thus it's not a good starting point.

We took `libleeloo` (http://github.com/quarkslab/libleeloo) implementation
and Python 3 `ipaddress` module
(http://code.google.com/p/ipaddress-py/source/browse/ipaddress.py). The
benchmarks consists in the conversion of a one million list of IPv4 addresses
stored in memory. Our goal is to get the best number of converted strings per
second. This measure makes sense as the algorithm here is of linear complexity.

Here are the results running on a Core i7-3520M for 1 million strings:

- Python 3
`ipaddress`: ~ 200,000 conversions/s - eglibc's
`inet_aton`(for the sake of curiosity): ~ 12,400,000 conversions/s `libleeloo`: ~ 20,800,000 conversions/s

The Python 3 module is written in plain Python, thus the big difference. So we compare
ourself to the fastest of these methods, that is `libleeloo`.

## The original algorithm

To convert an address to a 32-bit number, we need to convert each string (between dots) to real numbers and compose these numbers to have the final result.

Converting a string to a number basically involves:

- subtracting the ASCII value of '0' to each figure character
- multiply it by 1, 10 or 100 (according to its position) and add it to the resulting number

In our case, for the first part of the URL, this gives ('3'-'0')*10 + ('7'-'0')*1.

One classical algorithm would be:

- split the strings according to the dots
- check that it gave four non empty parts
- convert each part to an integer (using subtractions and multiplications)
- check that each integer fits in a byte
- compute the 32-bit resulting number with
`shift`and`or`operations

## What are the most computation-intensive parts?

One can assume that the subtraction and multiplication operations used to convert strings to numbers are the most computation-intensive ones. There are also a lot of branchings to look for the dots that can impact performances.

We did some profiling of `libleeloo` code using `gprof`. According to the
results, the `atoi3` function is the most computation-intensive, that is
the conversion of strings into integers.

So we try to use SSE instructions to perform these transformations.

## The long road to optimization

The main idea is that, at most, one IPv4 string can fit into a SSE register. Indeed, 255.255.255.255 is 16 characters wide, thus 128 bit, which is the size of a SSE register!

So, we assume that we are able to have a SSE register filled like this:

SSE register = {'3, '7', '.', '1', '8', '7', '.', '4, '7', '.', '7', '0', 0, 0, 0, 0}

So, let's see what SSE instructions might be of use for us. In order to do this, Intel provides us with a real nice web application you can access here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Note that this used to be a Java application (that you can't seem to be able to download anymore, but you can still get a copy here: http://files.geekou.info/intel_intrinsics.tar.bz2), and even an older PDF that stopped with SSE4.2.

Back to the web application, we select all flavors of SSE. Then, we need to find:

- a subtraction operation: in the "Arithmetic" category, we can find
`_mm_sub_epi8`, which subtract two SSE registers by pair of 8-bit signed integers. Sounds like our candidate for the subtraction of '0'. - a bitwise-or operation: _mm_or_si128 in the "Logical" category seems fine.
- something to find out where the dots are. The "Compare" category seems a
good choice, and
`_mm_cmpeq_epi8`our candidate. It will generate a "mask" vector where bytes will be equal to 0xFF where dots are, and null otherwise. That is,`res = _mm_cmpeq_epi8(a, b)`is semantically the same as:

for i from 0 to 16: res[i] = (a[i] == b[i]) ? OxFF:0x00

- a multiply operation. This is a bit trickier, because a multiplication of two 8-bit integers might end up on a 16-bit integer.

### Multiplications

As the different multiplication instructions available will determine how our algorithm works, let's detail a bit the available choices.

The SSE instructions set provides several ways to do multiplication.
Basically, the different kinds of instructions are, for `res = a * b` (with `res`,
`a` and `b` 128-bit SSE registers):

- multiply 16-bit numbers to 32-bit (
`_mm_mullo_epi16`):

res[0:32[ = a[0:16[*b[0:16[ res[32:64[ = a[16:32[*b[16:32[ res[64:96[ = a[32:48[*b[32:48[ res[96:128[ = a[48:64[*b[48:64[

- same with 32-bit numbers to 64-bit (
`_mm_mullo_epi32`): - same with the upper parts of the
`a`and`b`registers (`_mm_mulhi_epi16`/`_mm_mulhi_epi32`): - some mix of multiplications and additions, like this (
`_mm_maddubs_epi16`):

res[0:16[ = a[0:8[ * b[0:8[ + a[8:16[ * b[8:16[ res[16:32[ = a[16:24[ * b[16:24[ + a[24:32[ * b[24:32[ res[32:48[ = a[32:40[ * b[32:40[ + a[40:48[ * b[40:48[ res[48:64[ = a[48:56[ * b[48:56[ + a[56:64[ * b[56:64[ etc.

Let's imagine we manage to get this vector (out of '37.187.47.70') (X means any 8-bit value):

SSE ip_int = {3, 7, X, 1, 8, 7, X, 4, 7, X, 7, 0, X, X, X, X}

What we would like to do is multiply it by this one:

SSE mul = {10, 1, 0, 100, 10, 1, 0, 10, 1, 0, 10, 1, 0, 0, 0, 0}

and get this:

SSE res = {3*10, 7*1, X*0, 1*100, 8*10, 7*1, X*0, 4*10, 7*1, X*0, 7*10, 0*1, X*0, X*0, X*0, X*0}

and then do some kind of horizontal sum to obtain this:

SSE ip = { 3*10 + 7*1 + X*0, 1*100 + 8*10 + 7*1 + X*0, 4*10 + 7*1 + X*0, 7*10 + 0*1 + X*0 }

But, SSE does not provide a way to multiply SSE registers by 8-bit packed numbers. The minimum is, as shown above, 16-bit.

Fortunately for us, the _mm_maddubs_epi16 is the one that will help us. The issue though is that it will only work on 8 blocks of 2*8-bit numbers each, that is if our vectors are organised like this:

SSE ip_int = {0, 3 , 7, X, 1 , 8 , 7, X, 0, 4 , 7, X, 0, 7 , 0, X} (1) SSE mul = {100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0} (2)

With `res = _mm_maddubs_epi16(ip_int, mul)`, we have:

SSE ip_res = {0*100 + 3*10, 7*1 + X*0, 1*100 + 8*10, 7*1 + X*0, 0*100 + 4*10, 7*1 + X*0, 0*100 + 7*10, 0*1 + X*0}

Each number here is 16-bit wide. Thus, we need one last horizontal 16-bit addition and shifting, that is one instruction that does:

res[0:32[ = (a[00:16[ + a[16:32[) << 24, res[32:64[ = (a[32:48[ + a[48:64[) << 16, res[64:96[ = (a[64:80[ + a[80:96[) << 8, res[96:128[ = (a[96:112[ + a[112:128[),

There is no instruction that does *exactly* that, and `_mm_madd_epi16` is the
closest to what we are looking for:

SSE res = {a[00:16[*b[00:16[ + a[16:32[*b[16:32[, a[32:48[*b[32:48[ + a[48:64[*b[48:64[, a[64:80[*b[64:80[ + a[80:96[*b[80:96[, a[96:112[*b[96:112[ + a[112:128[*b[112:128[}

Thus, by setting the `b` 8*16-bit integers vector as:

b = {2**8, 2**8, 1, 1, 2**8, 2**8, 1, 1} (3)

we end up with, for `res = _mm_madd_epi16(ip_res, b)`:

res[0:32[ = (a[00:16[ + a[16:32[) << 8, res[32:64[ = (a[32:48[ + a[48:64[), res[64:96[ = (a[64:80[ + a[80:96[) << 8, res[96:128[ = (a[96:112[ + a[112:128[),

The final computation will just need one shift of 16 bit and four `or` operations.

We need to keep this in mind as this is the key to the vectorized algorithm, because we still have to manage to get (1) and (2).

### The vectorized algorithm

So, starting with our IP string in a SSE register and the above instructions, it seems natural to try something like this:

- find out where the dots are in the IPv4 string
- subtract with '0'
- manage to have (1) and (2) from this
- use
_mm_maddubs_epi16and_mm_madd_epi16to get four 32-bit numbers representing the four parts of the IP- construct the final number

Let's describe this a little bit step by step.

### First steps

So, first thing first, the subtraction is not that hard to perform in SSE. We just need to create a vector where '0' bytes are at the right position.

So, we search for the dots and ending zeros, using the `compare` and `or` instructions:

__m128i mask_dots = (sse_str == '.' | sse_str == 0) = _mm_or_si128( _mm_cmpeq_epi8(sse_str, _mm_set1_epi8('.')), _mm_cmpeq_epi8(sse_str, _mm_setzero_si128()));

In our case, this gives:

SSE ip = {'3' , '7' , '.' , '1', '8', '7', '.', '4', '7', '.', '7', '0', 0, 0, 0, 0} SSE mask_dots = {0x00, 0x00, 0xFF, 0x00, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF};

Then, we use the `and not` instructions to create the '0' register:

SSE mask_zeros = ~mask_dots & {'0', '0', ..., '0'} = _mm_andnot_si128(mask_dots, _mm_set1_epi8('0')); = {'0', '0', 0x00, '0', '0', '0', 0x00, '0', '0', 0x00, '0', '0', 0x00, 0x00, 0x00, 0x00}

And, finally subtract this to our original SSE "string" register:

SSE ip_int = sse_ip - mask_zeros = _mm_sub_epi8(sse_ip, mask_zeros);

### The shifting dance

Alright, so now we need to create (1) and (2). (2) is fairly easy, as this is a constant value:

SSE mul_ten = _mm_set_epi8(0,1,10,100,0,1,10,100,0,1,10,100,0,1,10,100); (one will notice that _mm_set_XX takes their arguments the "little-endian" way)

(1) requires more work. The goal is to make sure that each number of the original string is
aligned on 4 bytes. In our case, we need to transform `ip_int` from:

SSE ip_int = {3, 7, X, 1, 8 , 7, X, 4 , 7, X, 7, 0, X, X, X, X}

to:

SSE ip_int = {0, 3, 7, X, 1, 8, 7, X, 0, 4, 7, X, 0, 7, 0, X} <--------> <--------> <--------> <--------> 1st block 2nd 3rd 4th

We need to play with the SSE shift instructions to add the zeros needed to
build the four blocks. The instruction `_mm_shuffle_epi8` permutes the
different bytes of a register according to another one. A shift of one byte to
the right will be expressed like this:

right_shift(reg) = _mm_shuffle_epi8(reg, _mm_set_epi8(14,13,12,11,10,9,8,7,6,5,4,3,2,1,0,0x80));

(0x80 indicates that the byte at this index will be null)

For instance:

right_shift(ip_int) = {0, 3, 7, X, 1, 8 , 7, X, 4 , 7, X, 7, 0, X, X, X} <--------> 1st block

As we can see, this gives us our first block. We can "move" this block in a fresh blank register, and go on:

SSE mask_32bit = {0xFFFFFFFF, 0 (32-bit wide), 0 (32-bit wide), 0 (32-bit wide)} SSE ip_final = ip_int & mask_32bit = _mm_and_si128(ip_int, mask_32bit) ip_int = ~mask_32bit & ip_int = _mm_andnot_si128(mask_32bit, ip_int)

Here, we got ourselves with:

ip_final = {0, 3, 7, X, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} ip_int = {0, 0, 0, 0, 1, 8, 7, X, 4, 7, X, 7, 0, X, X, X}

For the following block, we do not need any shifting, so simply "move" it the same way:

SSE mask_32bit = {0 (32-bit wide), 0xFFFFFFFF, 0 (32-bit wide), 0 (32-bit wide)} ip_final = ip_final | (ip_int & mask_32bit) = _mm_or_si128(ip_final, _mm_and_si128(ip_int, mask_32bit)) ip_int = ~mask_32bit & ip_int = _mm_andnot_si128(mask_32bit, ip_int)

resulting in:

ip_final = {0, 3, 7, X, 1, 8, 7, X, 0, 0, 0, 0, 0, 0, 0, 0} ip_int = {0, 0, 0, 0, 0, 0, 0, 0, 4, 7, X, 7, 0, X, X, X}

The third block needs one shift to the right, so let's do it:

SSE mask_32bit = {0 (32-bit wide), 0 (32-bit wide), 0xFFFFFFFF, 0 (32-bit wide)} SSE ip_final = right_shift(ip_int, mask_32bit) ip_int = ~mask_32bit & ip_int = _mm_andnot_si128(mask_32bit, ip_int)

resulting in:

ip_final = {0, 3, 7, X, 1, 8, 7, X, 0, 4, 7, X, 0, 0, 0, 0} ip_int = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, X, X}

And the same for the last block:

ip_final = {0, 3, 7, X, 1, 8, 7, X, 0, 4, 7, X, 0, 7, 0, X}

This is the vector we wanted in (1)!

What's left to compute are the shift "counts" that we used. As you now have understood, it depends on the number of figures of each block. We need to find a way to compute this one thanks to SSE instructions and what we've got so far, that is a SSE "mask" register that indicates the positions of dots and null bytes. In our example:

SSE mask_dots = {0x00, 0x00, 0xFF, 0x00, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF};

What we can do is start from this vector to compute the shifting count:

- first, shift this vector one byte to the left:

SSE mask_shifts_count = left_shift(mask_dots) = {0x00, 0xFF, 0x00, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0x00}

- create a vector of 2 according to this mask:

SSE shifts_count = mask_shifts_count & {2, 2, ..., 2} = _mm_and_si128(mask_shifts_counts, {2, 2, ..., 2}) = {0, 2, 0, 0, 0, 2, 0, 0, 2, 0, 0, 2, 2, 2, 2, 0}

- Now, do the same and make sure we do not erase a shift count already set:

mask_shifts_count = ~mask_shifts_count & left_shift(mask_shifts_count) = {0xFF, 0x00, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00} shifts_count = _mm_or_si128(shifts_count, _mm_and_si128(mask_shifts_count, {1, 1, .., 1})) = {1, 2, 0, 0, 1, 2, 0, 1, 2, 0, 1, 2, 2, 2, 2, 0} <--------> <--------> <--------> <-------->

So now, we shift our `shifts_count` vector the same way that we will shift the `ip_int` vector, that is:

- take the first 8-bit integer of
`shifts_count`(here 1). It means that the first block needs a one-byte shift to the right. We'll do the same with`shifts_count`and end up with:

shifts_count = right_shift(shifts_count) = {0, 1, 2, 0, 0, 1, 2, 0, 1, 2, 0, 1, 2, 2, 2, 2} <--------> <--------> <--------> <-------->

- then, we take the 4-th 8-bit number of
`shifts_count`, and do the same. In our example, this is the shift count of our second block (0), so`shifts_count`does not change. - go on with the 8-th byte, so a shift count of 1, and get:

shifts_count = right_shift(shifts_count) = {0, 0, 1, 2, 0, 0, 1, 2, 0, 1, 2, 0, 1, 2, 2, 2} <--------> <--------> <--------> <-------->

- and, finally, for the last block, the 12-th byte gives us the last shift count, that is 1!

Using the last "shift count" and the same operations as above, we finally manage to create (1):

ip_final = {0, 3, 7, X, 1, 8, 7, X, 0, 4, 7, X, 0, 7, 0, X}

### The final steps

Our dance allowed us to get (1) and (2), so now let's apply what we've seen previously. That is:

- set a vector with
`{100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0}` - use the
`_mm_maddubs_epi16`instructions with (1) and the above vector - as a remainder, we get:

SSE res = {0*100 + 3*10, 7*1 + X*0, 1*100 + 8*10, 7*1 + X*0, 0*100 + 4*10, 7*1 + X*0, 0*100 + 7*10, 0*1 + X*0}

- use
`_mm_madd_epi16`with (3) and get (each number is 32-bit wide):

SSE res = {(0*100 + 3*10 + 7*1 + X*0) << 8, (1*100 + 8*10 + 7*1 + X*0), (0*100 + 4*10 + 7*1 + X*0) << 8, (0*100 + 7*10 + 0*1 + X*0)}

- extract the four parts of
`res`and compute:

ip = ((res[0]|res[1])<<16) | ((res[2]|res[3]))

And here we are!

## The silence of the lambs

We didn't talk about two issues here:

- we do not check the validity of the strings given. These could be done by
checking that the number of 0xFF bytes in the
`mask_dots`variable is okay , and that two 0xFF bytes aren't consecutive (using an horizontal sum for instance) - some padding is needed for strings of arbitrary length

All of these are left as an exercise to the reader :)

## Performances

That's the main point of all of this, that is being faster that the original implementation.

Here are the benchmarks, still on a Core i7-3520M:

`libleeloo`: ~ 20,800,000 conversions/s- SSE version: ~ 82,600,000 conversions/s

We got a performance gain of about x4! All this work was not for nothing :)

## Conclusion

At first sight, the original algorithm does not seem easy to vectorize with modern so-called "multimedia instructions". This article showed that, with some tricks, a factor of 4 can still be gained.

## Acknowledgment

Thanks to the Quarkslab team for their reviews!

## Comments