## Auto Vectorization and AVX512

Compilers are pretty cool these days. They can generate vector code automatically if you give them the chance. Here is a simple example:

    struct float32x16
    {
        float v[16];
    };

    float32x16 compute(float32x16 a, float32x16 b)
    {
        float32x16 result;
        for (int i = 0; i < 16; ++i)
            result.v[i] = a.v[i] < b.v[i] ? a.v[i] : b.v[i] + a.v[i];
        return result;
    }

    // compiled for avx512
    vmovups zmm2, ZMMWORD PTR [rsi]
    mov     rax, rdi
    vmovups zmm0, ZMMWORD PTR [rdx]
    vcmpltps        k1, zmm2, zmm0
    vmovaps zmm1{k1}, zmm2
    vmovups ZMMWORD PTR [rax], zmm1
    ret


Indeed, the compiler was able to process sixteen floats per instruction. Let's see what happens if we write the code explicitly in vector form:

    __m512 compute(__m512 a, __m512 b)
    {
        // body reconstructed: select(a < b, a, a + b)
        __mmask16 k = _mm512_cmp_ps_mask(a, b, _CMP_LT_OS);
        return _mm512_mask_blend_ps(k, _mm512_add_ps(a, b), a);
    }

    // compiled for avx512
    vcmpps    k1, zmm0, zmm1, 2
    vblendmps zmm0{k1}, zmm2, zmm0
    ret


Much nicer; the computation runs completely in CPU registers and uses the new AVX512 kmask. The difference was not the compiler being better: it was the same compiler with the same options. The calling convention that passes function arguments in registers seems to be the nicer way to go overall.

Of course, the counter-argument is that in such a trivial example the compiler cannot keep things in registers, and when a more complicated function is composed from these simpler ones, inlining will make most of this overhead disappear. The only places where the compiler absolutely must read and write memory are the inputs and the eventual outputs of the transformations we apply to the data. I have indeed observed this to a great degree, but the register calling convention still yields better results overall.

So, if we want to express our intent with explicit vector register types, we must make it as convenient as possible. The x86 (and ARM, among others) intrinsic syntax is convoluted and difficult to read and write. The minimum service to yourself is to overload the most common operations so that the code looks more like this:

    float32x16 compute(float32x16 a, float32x16 b)
    {
        return select(a > b, a + b, a);
    }


To a trained eye this should be much easier to follow. The math library I have been writing already works like this, but the existing arrangement generates a bit pattern of all 0s or all 1s into the result, depending on the outcome of the compare operation. This value, a sort of mask, is then used to blend between two input values.

Every CPU SIMD architecture worked like this, more or less, until AVX512 came along. It dropped this approach in favour of what Intel calls kmask registers, which are in practice up-to-64-bit masks where each bit corresponds to one vector lane. The mask_move and blend operations then use these kmasks to blend between different inputs. The difference from the old arrangement is that the old way could use bitwise AND, NAND, and OR operations to do the blending, while the new way uses a single bit per lane to select between lanes. This makes kmasks easy to move between the ALU and the AVX512 vector engine and enables other kinds of cool tricks. Most AVX512 instructions can be masked by these kmasks, which in theory makes it easier for the compiler to implement branches with predication instead of actual control flow.

Long story short, I am currently in the process of converting the masking system to be more AVX512 friendly. The downside is that this needs a lot more code, since the masks resulting from compare operations are now discrete types; they used to be the same type as the vectors being compared. We need to be able to convert between different types of masks and do bitwise operations between masks:

    mask = (a < b) && (a < c);


The above used to be a simple bitwise AND (&), but now we can implement a logical AND (&&) between the masks. This means more code needs to be written so that everything will be perfect, sigh.

Initially I implemented the AVX512 compare operations so that they always converted the kmask into a bitmask and did a bitwise blend between vectors. The problem with that approach was that the bitwise blend had to be emulated, since AVX512 has no built-in operations for it. Roughly speaking, the number of compiled instructions doubled.

A compromise is not something we want here, but we must make one. Performance comes first, so with AVX512 we do things with kmasks, and with SSE, for example, we use bitmasks. It should now be clearer why the masks need to have a type attached to them: if you compare float32s, you cannot use the resulting mask for selecting uint8s, because the mask would be too wide. So you must "convert" the mask to the appropriate type if you want to select between different types. Not all conversions are possible. For example, a 512-bit-wide vector of uint8s has 64 lanes, while a 128-bit vector of float64s has only 2. See the problem? 2 lanes vs 64 lanes: if you do a 2-lane compare, only the first 2 lanes are valid even after the conversion. Things like this just cannot be avoided; we are trimming the icing on the cake here, to be honest. I just don't like ugly corners like that. :(

## Short Vector Math Libraries suck!

Who actually NEEDS a short vector math library? Isn't AoS very inefficient and generally uncool? Doesn't the GPU actually do all of the heavy calculations these days? Yes, yes, and yes!

The Ultimate Reason for SVML is convenience. If you are going to do setup for your very cool program that renders a very nice picture in less than 16,600 microseconds, you still need to write the code that sets everything up for the GPU. If you are a genius, you just memory map some file and by some black magic the data gets where it needs to be while breaking the laws of known physics. That's great, you are awesome!

Meanwhile the really, really bad programmers (that's us!) have to get the job done. It is not a crime if writing the code is a lot of fun and the code looks really nice as a side effect. This is where SVML steps into the picture. First of all, you can concentrate on the cool parts of the project and let the library handle the boring stuff. Then, as an extra bonus, you get faster code even if it does not mean anything.

Now that we have completely ripped SVML a new one, we can concentrate on the Good Parts. Not all SVMLs are created equal, for starters. Some have awkward syntax, some simply lack a nice feature you are in love with, and some just generate really bad code. I very much like the one I am using (of course; it should go without saying), since it has one really nice feature: vector types can be declared recursively.

    using PackedVector3 = Vector<Vector<float, 8>, 3>;


Hold on, that looks suspicious.. what's going on here? First you have to realize that Vector<float, 8> is a specialization for the 256-bit-wide vector registers found in modern CPUs. It is a float32x8. Armed with this knowledge we can refactor the above into:

    using PackedVector3 = Vector<float32x8, 3>;


PackedVector3 acts just like float32x3, aka "vec3", with a wild difference: it processes 8 floating point values simultaneously. Now you can write code just like you would write scalar code:

    // let's give this type a more user-friendly name
    using vec3 = Vector<float32x8, 3>;

    float32x8 dot(vec3 a, vec3 b)
    {
        // eight dot products at once, one per lane
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }


We just wrote code that computes eight dot products simultaneously, and it wasn't even hard! The best part is that we don't have to do any of the shuffling and bit fiddling we need when computing a dot product horizontally within a SINGLE vector register (which would look something like this):

    float32x4 dot3(float32x4 a, float32x4 b)
    {
        float32x4 s = _mm_mul_ps(a, b);
        // horizontal add of x, y and z, result replicated to all lanes
        s = _mm_add_ps(shuffle<0, 0, 0, 0>(s),
            _mm_add_ps(shuffle<1, 1, 1, 1>(s), shuffle<2, 2, 2, 2>(s)));
        return s;
    }


Not so nice. Notice also that we are more or less forced to replicate the return value into all components so that the result can be used "just like it were a scalar": on Intel SIMD there is no instruction (before AVX512) to multiply a vector by a scalar, add a scalar to a vector, and so on. On ARM NEON such operations are commonplace, and now the same happiness and joy is coming to Intel CPUs, but in a very expensive package, so we'll see where that road leads (AMD Ryzen isn't included in this mega-party of happiness, just be informed).

Enough about dot products. Two things work in our favour when we use this packed vector approach. The first is data locality. Don't make the vector too wide; it is best to restrict the width to the vector register width supported by the target architecture: SSE would be 128 bits, AVX 256 bits and AVX512 512 bits (duh, the name is kind of a spoiler). Some even make the vector as wide as the whole array; this is called SoA, or Structure-of-Arrays for short. It is very bad for the CPU cache. Don't be that guy. Adding more width to the packed vector than the registers provide is counter-productive. Who really cares about the second reason after these convincing arguments?

I have the benchmarks and numbers to prove it, of course; the results are at the end of the source file. The PackedVector ruL3z. Never go full-SoA. Just kidding; do whatever you want if it works better.

## SIMD Scalar Accessor

You find yourself one day writing code for a C++ vector math library. Of course, it is written with templates so that you can parametrize the scalar type and vector dimensions. You will have something like this:

    template <typename ScalarType, int Size>
    struct Vector
    {
        ScalarType data[Size];
    };

    using float32x4 = Vector<float, 4>;


It can't get easier than that, right? Then the requirements start piling up and the code begins to bloat. This is perfectly normal and nothing to be alarmed about. One of the first things that typically happen at this stage is that the template is specialized for specific parameters. 2-, 3- and 4-dimensional vectors are very common with well established names for the vector components so something like this is scribbled to the text editor:

    template <>
    struct Vector<float, 4>
    {
        union
        {
            float data[4];
            struct { float x, y, z, w; };
        };
    };


Looks fairly innocent, right? Wrong! There is aliasing between the data members, so the compiler has no choice but to treat them as discrete values. Every read and write through these overlapping storage locations must stay coherent, which forces the values through memory instead of keeping them in registers.

The ugly alternative is to write methods to access the vector components like so:

    template <>
    struct Vector<float, 4>
    {
        float data[4];

        float getX() const { return data[0]; }
        void setX(float x) { data[0] = x; }
    };


That will certainly do the trick and compiles into very nice code, but it is very inconvenient to write code using these facilities. The best solution I have found so far for this design problem is what I coined the "Scalar Accessor":

    template <typename ScalarType, typename VectorType, int Index>
    struct ScalarAccessor
    {
        VectorType m;

        operator ScalarType () const
        {
            return simd::get_component<Index>(m);
        }

        ScalarAccessor& operator = (ScalarType s)
        {
            m = simd::set_component<Index>(m, s);
            return *this;
        }
    };


It is a type which stores a SIMD vector and has operators to convert between vector and scalar values. The Index template parameter chooses the vector component which is being accessed. The low-level SIMD code encapsulates the implementation details. Now this template type is used to build the accessor into the vector class.

    template <>
    struct Vector<float, 4>
    {
        union
        {
            simd::float32x4 xyzw;
            ScalarAccessor<float, simd::float32x4, 0> x;
            ScalarAccessor<float, simd::float32x4, 1> y;
            ScalarAccessor<float, simd::float32x4, 2> z;
            ScalarAccessor<float, simd::float32x4, 3> w;
        };
    };


We have a union just like at the beginning, but with a subtle difference: it is a union of types with identical storage. Access to x, y, z or w touches the same object as xyzw, with one difference: xyzw yields the vector, while x and its bros yield one component of it. This is VERY IMPORTANT: the code runs completely in CPU registers and does NOT write to a temporary memory location to access the vector.

How well does this work in practice?

    float32x4 test(float32x4 a, float32x4 b)
    {
        return a * b.x;
    }


Let's compile!

    // clang 4.0
    shufps  xmm1, xmm1, 0           # xmm1 = xmm1[0,0,0,0]
    mulps   xmm0, xmm1

    // gcc 7.1
    shufps  xmm1, xmm1, 0
    mulps   xmm0, xmm1


Looks like about what you would expect the compiler to do anyway, so what's the big deal? Look what the compiler does with the union of float values and a vector register:

    shufps  xmm2, xmm2, 0
    movq    QWORD PTR [rsp-24], xmm0
    movq    QWORD PTR [rsp-16], xmm1
    mulps   xmm2, XMMWORD PTR [rsp-24]
    movaps  XMMWORD PTR [rsp-24], xmm2
    mov     rax, QWORD PTR [rsp-16]
    movq    xmm0, QWORD PTR [rsp-24]
    mov     QWORD PTR [rsp-24], rax
    movq    xmm1, QWORD PTR [rsp-24]


This is what this whole thing is all about. The same technique works just fine for a scalar vector class implementation; I just happened to shortcut directly to the SIMD vector class, as that is where I personally get more in return for my coding investment. One way to look at this is that we created a proxy type to provide the overloads we needed and let the C++ type system do the rest.

Please do check out the math library I have been working on at https://github.com/t0rakka/mango

It is a little bit more than that but that is another story.

Jukka Liimatta

Helsinki, Finland.

Programmer.