[Tech blog] An introduction to SIMD vectorization and NEON

1. What is vectorization?

Vectorization is a form of parallel computing in which a single instruction performs the same operation on multiple pieces of data at once (SIMD: Single Instruction, Multiple Data). This is done by packing a set of elements together into a vector and operating on them all at once. It’s especially useful for multimedia purposes such as gaming, image processing, video encoding/decoding and much more. The process looks something like Figure A.

Fig A: SIMD Sample

So how do we do this in actual code? And how does it compare with a scalar, one-element-at-a-time approach? Let’s take a look. I’m going to write two implementations of the same addition function: one scalar, and one vectorized using ARM’s NEON intrinsics and gcc 4.7.2-2 (on Yellowdog Linux for ARM*).

The scalar function is very simple: it’s just a for loop adding two 16-element arrays.

#include <stdint.h>

void add_scalar(uint8_t * A, uint8_t * B, uint8_t * C){
    for(int i=0; i<16; i++){
        C[i] = A[i] + B[i];
    }
}
The NEON function, however, looks a lot more complicated.

#include <arm_neon.h>

void add_neon(uint8_t * A, uint8_t * B, uint8_t * C){
        //Setup a couple vectors to hold our data
        uint8x16_t vectorA, vectorB, vectorC;

        //Load our data into the vectors' registers
        vectorA = vld1q_u8(A);
        vectorB = vld1q_u8(B);

        //Add A and B together
        vectorC = vaddq_u8(vectorA, vectorB);

        //Store the result back out to C
        vst1q_u8(C, vectorC);
}

Those strange-looking functions are NEON’s intrinsics; they form an intermediate layer between assembly and C. They’re a bit confusing, but ARM’s infocenter goes into some detail about them and GCC has a great reference available here. So what do they do? Well, “uint8x16_t” is a vector type containing sixteen 8-bit unsigned integers in an array, and “vld1q_u8” loads 8-bit unsigned integers into a vector. Finally, “vaddq_u8” adds two vectors of 8-bit unsigned integers together all at once, and returns the result. If you test this out you’ll notice that the NEON function isn’t really any faster. This is because those two load functions take up a lot of time, and we’re doing so little work that the scalar solution catches up. If we could avoid those loads (by structuring our program to use vectors in the first place) we’d see a greater improvement.
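To give a feel for what that restructuring might look like, here’s a minimal sketch (the function name and the length parameter n are my own, not part of the example above): by processing a large buffer 16 bytes per iteration, the loads and stores are amortized over far more work than a single 16-element add.

#include <arm_neon.h>
#include <stdint.h>

//Add two large buffers, 16 bytes at a time.
//Assumes n is a multiple of 16; a real version would
//handle any leftover elements with a scalar loop.
void add_neon_bulk(uint8_t * A, uint8_t * B, uint8_t * C, int n){
        for(int i=0; i<n; i+=16){
                uint8x16_t vectorA = vld1q_u8(A + i);
                uint8x16_t vectorB = vld1q_u8(B + i);
                vst1q_u8(C + i, vaddq_u8(vectorA, vectorB));
        }
}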

Now let’s take a look at another case where NEON can really shine: matrix multiplication. Specifically, 4×4 matrix multiplication, a common case in computer graphics.

// Our test matrices
        uint16_t matrixA[4][4] = { {1,  2,  3,  4},
                                   {5,  6,  7,  8},
                                   {9,  10, 11, 12},
                                   {13, 14, 15, 16} };

        uint16_t matrixB[4][4] = { {16, 15, 14, 13},
                                   {12, 11, 10, 9},
                                   {8,  7,  6,  5},
                                   {4,  3,  2,  1} };

        uint16_t matrixC[4][4];

Multiplying these together with a scalar function is fairly straightforward: we calculate the dot product of each row in matrixA with each column in matrixB. We can do this in a somewhat efficient manner using nested for loops:

        for(i=0; i<4; i++){ //For each row in A
                for(j=0; j<4; j++){ //And each column in B
                        dotproduct = 0; //use a running total to calculate the dp.
                        for(k=0; k<4; k++){ //For each item in that row/column pair
                                dotproduct = dotproduct + A[i][k]*B[k][j];
                        }
                        C[i][j] = dotproduct; //fill in C with our results.
                }
        }

Now using NEON…

        //Load matrixB into four vectors
        uint16x4_t vectorB1, vectorB2, vectorB3, vectorB4;

        vectorB1 = vld1_u16 (B[0]);
        vectorB2 = vld1_u16 (B[1]);
        vectorB3 = vld1_u16 (B[2]);
        vectorB4 = vld1_u16 (B[3]);

        //Temporary vectors to use with calculating the dotproduct
        uint16x4_t vectorT1, vectorT2, vectorT3, vectorT4;

        // For each row in A...
        for (i=0; i<4; i++){
                //Multiply the rows in B by each value in A's row
                vectorT1 = vmul_n_u16(vectorB1, A[i][0]);
                vectorT2 = vmul_n_u16(vectorB2, A[i][1]);
                vectorT3 = vmul_n_u16(vectorB3, A[i][2]);
                vectorT4 = vmul_n_u16(vectorB4, A[i][3]);

                //Add them together
                vectorT1 = vadd_u16(vectorT1, vectorT2);
                vectorT1 = vadd_u16(vectorT1, vectorT3);
                vectorT1 = vadd_u16(vectorT1, vectorT4);

                //Output the dot products, one row of C at a time
                vst1_u16 (C[i], vectorT1);
        }

That looks much more complicated, and in some ways it is. It’s also about three times as fast (including loads) on my test machine. Instead of stepping through matrixA one element at a time, I’m stepping through it one row at a time and calculating four dot products (a whole row of matrixC) at once. If we break down the matrix multiplication and look at the dot product calculations for the first row, you can hopefully see why this works:

C[0][0] = (1 * 16) + (2 * 12) + (3 * 8) + (4 * 4) (which is 80)
C[0][1] = (1 * 15) + (2 * 11) + (3 * 7) + (4 * 3) (70)
C[0][2] = (1 * 14) + (2 * 10) + (3 * 6) + (4 * 2) (60)
C[0][3] = (1 * 13) + (2 * 9) + (3 * 5) + (4 * 1) (50)

We’re multiplying each row of matrixB by a single value from matrixA at a time, something NEON can do easily using “vmul_n_X”. We hold this data in temporary vectors, add those vectors together with “vadd_X” (accumulating the result in vectorT1), then unload our new row into matrixC using “vst1_X”. A very different approach from the scalar solution, but with the same results. My test program is attached to the blog post if you’d like to give it a try yourself.
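If you’d like to sanity-check those numbers without downloading the attachment, here’s a small self-contained sketch (the variable names are my own, not those of the attached program) that computes just the first row of matrixC with the NEON approach and prints it. It should output the 80, 70, 60, 50 calculated above.

#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>

int main(void){
        uint16_t A0[4] = {1, 2, 3, 4}; //first row of matrixA
        uint16_t B[4][4] = { {16, 15, 14, 13},
                             {12, 11, 10, 9},
                             {8,  7,  6,  5},
                             {4,  3,  2,  1} };
        uint16_t C0[4]; //first row of matrixC

        //Load matrixB into four vectors
        uint16x4_t vectorB1 = vld1_u16(B[0]);
        uint16x4_t vectorB2 = vld1_u16(B[1]);
        uint16x4_t vectorB3 = vld1_u16(B[2]);
        uint16x4_t vectorB4 = vld1_u16(B[3]);

        //Multiply each row of B by one value from A's first row, accumulating
        uint16x4_t vectorT1 = vmul_n_u16(vectorB1, A0[0]);
        vectorT1 = vadd_u16(vectorT1, vmul_n_u16(vectorB2, A0[1]));
        vectorT1 = vadd_u16(vectorT1, vmul_n_u16(vectorB3, A0[2]));
        vectorT1 = vadd_u16(vectorT1, vmul_n_u16(vectorB4, A0[3]));

        //Unload the finished row and print it
        vst1_u16(C0, vectorT1);
        printf("%d %d %d %d\n", C0[0], C0[1], C0[2], C0[3]);
        return 0;
}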

2. Autovectorization

Interested in avoiding all this mess and still getting at least some of the benefits? Luckily there’s something called auto-vectorization: an automatic way to convert a scalar program into a vectorized one. Research into auto-vectorization is still ongoing (and probably will be for quite some time), but there are several implementations available already. GCC’s is by far the most popular, and gcc 4.7(+), which includes support for auto-vectorization, is already included in Yellowdog Linux 7.

Enabling auto-vectorization in gcc is quite simple, but there are several tricks and hints you may need to give the compiler to get an optimal result. In gcc 4.7(+), auto-vectorization can be enabled by adding -O3 (the letter O, which implies -ftree-vectorize) or -ftree-vectorize itself to the command line (or CFLAGS). If you’re planning to use NEON you’ll need to enable it with -mfpu=neon, although there are some issues. Gcc’s auto-vectorizer will ignore floats with NEON unless you enable -funsafe-math-optimizations. Unfortunately, using NEON instructions with floats can lead to a loss of precision, as NEON does not fully adhere to IEEE 754.
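As a concrete example, a simple loop like the one below is exactly the shape gcc’s vectorizer likes (the function and file names here are my own). Adding -ftree-vectorizer-verbose=1, a flag gcc 4.7 supports, makes the compiler report which loops it managed to vectorize.

//add.c -- compile with something like:
//      gcc -std=c99 -O3 -mfpu=neon -ftree-vectorizer-verbose=1 -c add.c
#include <stdint.h>

void add_arrays(uint8_t *a, uint8_t *b, uint8_t *c, int n){
        for(int i=0; i<n; i++){
                c[i] = a[i] + b[i];
        }
}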

Auto-vectorization can, under the right circumstances, significantly speed up a program, but it’s imperfect. Luckily there are ways to structure your code, and hints you can give the compiler, that will make sure it behaves properly; one such hint is sketched below. In future posts I will be covering these tricks and tips, as well as going into detail about NEON assembly and how using it directly instead of intrinsics can give your app an even greater speed boost.
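As a small taste of those hints, here’s one sketch (the function name is, again, my own): the C99 restrict qualifier promises the compiler that the arrays never overlap, which removes the runtime aliasing checks that can otherwise bloat or block the vectorized code.

#include <stdint.h>

//With restrict, gcc knows a, b and c never overlap, so it can
//vectorize the loop without emitting runtime overlap checks.
void add_arrays_hinted(uint8_t * restrict a, uint8_t * restrict b,
                       uint8_t * restrict c, int n){
        for(int i=0; i<n; i++){
                c[i] = a[i] + b[i];
        }
}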

* Yellowdog Linux (YDL) is a Linux distribution developed by Fixstars Solutions. It is based on RHEL/CentOS and used in Fixstars’ products. See http://www.ydl.net/ydl7/ for more details. A version of Yellowdog Linux optimized for ARM servers is currently in development.
