WARNING! If you are a Linux user and have a Core i9 based computer DO NOT ENABLE THE INTEL MICROCODE DRIVER!!!

I have an older Linux box with a Core i7 3770 processor which has served me loyally for years. I recently upgraded and installed a fresh Linux Mint 18.2, and the first screen that greets you after a new installation is the Driver Manager. It was offering binary drivers from NVIDIA Corporation and "Unknown" (labeled "intel-microcode"). It sounded like a good idea to install the microcode.

The old computer compiled one specific piece of code in 22 seconds using "make -j10". The new computer, with nearly 3x the CPU capacity, took 15-18 seconds, which is hardly an improvement at all. The CPU utilization was jumping between 20 and 30%, which was pathetic. The old machine used to go ALL OUT at 100% until the job was done. The statistics on the old machine looked like this:

real 0m22.231s user 2m45.231s

If we divide user by real we get roughly 7.4, which makes sense since the old hardware has a maximum concurrency of 8.

The new machine was giving a concurrency ratio a little below 4! Long story short, after disabling the intel-microcode goodness the results look like this:

real 0m6.122s user 1m9.624s

Concurrency is about 11.4. The hardware should be able to do 20, but this is definitely an improvement. The CPU utilization still doesn't go all out at 100%. We should get real 0m3s with user of 2m0s. HT is enabled and I can confirm 100% CPU load with a custom workload from my threadpool code. Something funky going on with GNU make?

On the bright side, the machine is now over 3x the speed of the old one, and the number of cores alone shouldn't give more than 2.5x, so it is exceeding expectations while simultaneously showing underwhelming CPU usage. I/O limited? :(
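The concurrency figures quoted above are simply user time divided by real time. A minimal sketch of that arithmetic follows; the helper names `concurrency` and `to_seconds` are made up for illustration. With the two `time` outputs above it gives about 7.43 for the old machine and about 11.37 for the new one.

```cpp
#include <cassert>

// Convert a "2m45.231s" style figure from `time` into seconds.
double to_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

// Effective concurrency: user-mode CPU seconds divided by wall-clock seconds.
double concurrency(double user_seconds, double real_seconds)
{
    assert(real_seconds > 0.0);
    return user_seconds / real_seconds;
}
```

A ratio close to the number of hardware threads means make kept every thread busy; a ratio far below it means the threads were mostly waiting.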


September 27, 2017. 4.5 seconds!


February 14, 2018. 6.4 seconds :(

After upgrading to a Meltdown/Spectre patched kernel the effect is the same as taking away 3 of the 10 cores. Oh well, let's just suck it up and continue into the bright future.

Software Rasterizer

It is one thing to think that the code you are writing is not generating bad code, and a completely different thing for it to actually not generate bad code. You want to be sure, so you have to give the code real data to transform. I am not one of those people with great imaginations, so I decided that I would write a software triangle rasterizer.

Having written a few in my younger days, my options were more limited than if I had not; the world wasn't my oyster. I was forced to design one around the things that I 'knew', so I started my design around the number one bottleneck in modern CPUs: memory access. The other thing I wanted to do was to leverage SIMD instructions, as that is what I set out to do in the first place.

So. I would be computing multiple pixels, or fragments if you will, simultaneously. I chose a 2x2 quad as my primitive for 4-wide SIMD vectors; this makes sense as GPU hardware works like this for various reasons. This scales nicely to wider SIMD vectors: 8-wide can do two side-by-side quads simultaneously and 16-wide can do a whole 4x4 block at once. We want to avoid super-wide spans like 16x1 pixels because most of that width would be wasted for most triangles.

4x4 is also nice for 32 bit buffers: 16 pixels times 4 bytes each is 64 bytes, which happens to be the L1 cache line size on many contemporary CPU architectures. If we align our buffers, and thus the 4x4 blocks, to 64 bytes, our memory access becomes reasonably efficient.
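To make the arithmetic concrete, here is a minimal sketch of block-tiled addressing. The `Block4x4` type, the row-by-row block order and the helper names are assumptions for illustration, not the layout the rasterizer necessarily uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One 4x4 block of 32-bit pixels: 16 * 4 = 64 bytes, aligned to a cache line.
struct alignas(64) Block4x4
{
    std::uint32_t pixel[16]; // row-major inside the block
};

static_assert(sizeof(Block4x4) == 64, "a block should fill one 64-byte cache line");
static_assert(alignof(Block4x4) == 64, "a block should start on a cache line boundary");

// Index of the block containing pixel (x, y); blocks are stored row by row.
inline std::size_t block_index(int x, int y, int blocks_per_row)
{
    assert(x >= 0 && y >= 0 && blocks_per_row > 0);
    return std::size_t(y / 4) * std::size_t(blocks_per_row) + std::size_t(x / 4);
}

// Index of pixel (x, y) inside its own block.
inline std::size_t pixel_in_block(int x, int y)
{
    return std::size_t((y % 4) * 4 + (x % 4));
}

// Byte offset of pixel (x, y) from the start of a 64-byte aligned buffer.
inline std::size_t byte_offset(int x, int y, int blocks_per_row)
{
    return block_index(x, y, blocks_per_row) * sizeof(Block4x4)
         + pixel_in_block(x, y) * sizeof(std::uint32_t);
}
```

With this layout all 16 pixels of a block share one cache line, so a 16-wide vector load or store touches exactly one line.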

Now we are taking good advantage of the vector units in our CPU. The next problem to solve is how to use the multiple CPU cores. The obvious solution is called 'binning': the framebuffer is split into a number of tiles which are processed individually. 128x128 is a good size as it is not too small and still fits into the L2 cache of most CPUs, leaving some cache for textures and other input.

When the vertices are transformed, the resulting coordinates can be used for binning the resulting triangles. Binning can be done in either clip or screen coordinates. Screen coordinate binning should not require explanation, so here are a few words about clip coordinate binning: in clip coordinates the ratios x/w and y/w determine the bin, or bins, the triangle belongs to.

The last step is called 'resolve', where each bin is resolved by a separate CPU thread. This has a nice effect on the CPU cache, as a lot of triangles processed in the same CPU thread end up written to the same area of memory. One CPU core thus keeps accessing the same data in its L2 cache and does not need to share it with other threads, which reduces on-chip processing overhead significantly.
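As an illustration of the screen coordinate variant, here is a minimal sketch that bins a triangle by its bounding box. The `Vec2` and `Bins` types and `bin_triangle` are hypothetical, and the per-tile `std::vector` storage is chosen for brevity, not speed. For clip coordinates one would first derive the screen position from x/w and y/w.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };

constexpr int kTileSize = 128; // tile size suggested in the text

// Hypothetical bin storage: one list of triangle indices per tile.
struct Bins
{
    int tiles_x;
    int tiles_y;
    std::vector<std::vector<int>> triangles;

    Bins(int width, int height)
        : tiles_x((width + kTileSize - 1) / kTileSize)
        , tiles_y((height + kTileSize - 1) / kTileSize)
        , triangles(std::size_t(tiles_x) * std::size_t(tiles_y))
    {
    }
};

// Append the triangle to every tile overlapped by its screen-space bounding box.
// This is conservative: a thin diagonal triangle lands in tiles it never touches.
void bin_triangle(Bins& bins, int triangle_index, Vec2 a, Vec2 b, Vec2 c)
{
    const float min_x = std::min({a.x, b.x, c.x});
    const float max_x = std::max({a.x, b.x, c.x});
    const float min_y = std::min({a.y, b.y, c.y});
    const float max_y = std::max({a.y, b.y, c.y});

    // Clamp the tile range to the framebuffer; an empty range means off-screen.
    const int tx0 = std::max(int(std::floor(min_x / kTileSize)), 0);
    const int ty0 = std::max(int(std::floor(min_y / kTileSize)), 0);
    const int tx1 = std::min(int(std::floor(max_x / kTileSize)), bins.tiles_x - 1);
    const int ty1 = std::min(int(std::floor(max_y / kTileSize)), bins.tiles_y - 1);

    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            bins.triangles[std::size_t(ty) * bins.tiles_x + tx].push_back(triangle_index);
}
```

Each tile's list can then be handed to a single thread, so no two threads ever write to the same pixels.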

Enough theory! Screenshot! [Image: screen.jpg]

As can be observed, the number of features isn't so great at this time. There is depth buffering going on, some perspective correct gradients and stuff like that. There is texture mapping as well (not shown) and it is quite trivial to add more gradients and use them in different creative ways. The inner loops are still hand-written, but if I ever get serious about this they should be compiled from a higher level shading language. I will never get serious about this, though, as there is no place for software rendering these days. I wrote this for fun and to test the math library, okay?

One neat feature I must add is early-z. The 'heart' of the rasterizer can easily classify the 4x4 (or any other size) blocks as fully inside the triangle, trivially rejected as outside the triangle, or crossing a triangle edge. When a block is fully inside, the minZ can be stored in a coarse depth buffer, and then any block that is about to be rasterized can be tested against the coarse depth to reject blocks that are not visible. That will be a fun feature, but I need more complicated test scenes for this, seriously.

Other optimizations: when a block is fully inside the triangle there is no need to compute the colorMask, which is used to mask out writes to the color buffer. The code does not write pixels out one-by-one; we process 16 pixels simultaneously, so we write them all out simultaneously. Remember, the cost is the same for 1 or 16 pixels because they reside in the same L1 cache line. That is the smallest unit the CPU can write into memory across the memory bus anyway.
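The classification and the coarse depth test can be sketched as follows. This is not the rasterizer's actual code: the `Edge` representation, the names, and the depth convention (larger z is nearer, so a covered block's minZ is its farthest point) are all assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Edge function E(x, y) = a*x + b*y + c, non-negative on the triangle's inner side.
struct Edge { float a, b, c; };

enum class Coverage { Outside, Partial, Inside };

inline float evaluate(const Edge& e, float x, float y)
{
    return e.a * x + e.b * y + e.c;
}

// Classify the block [x0, x1] x [y0, y1] against the three edges of a triangle.
// "Partial" is conservative: the block may still miss the triangle entirely.
Coverage classify_block(const Edge (&edges)[3], float x0, float y0, float x1, float y1)
{
    assert(x0 <= x1 && y0 <= y1);
    bool all_inside = true;
    for (const Edge& e : edges)
    {
        // An edge function is linear, so its extremes over the block are at corners.
        const float v0 = evaluate(e, x0, y0);
        const float v1 = evaluate(e, x1, y0);
        const float v2 = evaluate(e, x0, y1);
        const float v3 = evaluate(e, x1, y1);
        if (std::max({v0, v1, v2, v3}) < 0.0f)
            return Coverage::Outside; // the whole block is behind this edge
        if (std::min({v0, v1, v2, v3}) < 0.0f)
            all_inside = false;       // this edge crosses the block
    }
    return all_inside ? Coverage::Inside : Coverage::Partial;
}

// Coarse depth for one block. ASSUMPTION: larger z is nearer and the buffer is
// cleared to 0, so `farthest` is a lower bound for every depth stored in the block.
struct CoarseDepth { float farthest = 0.0f; };

// After rasterizing a block that the triangle fully covers.
inline void update(CoarseDepth& coarse, float covered_min_z)
{
    coarse.farthest = std::max(coarse.farthest, covered_min_z);
}

// True when even the nearest point of the incoming block cannot pass the depth test.
inline bool occluded(const CoarseDepth& coarse, float block_max_z)
{
    return block_max_z <= coarse.farthest;
}
```

With the opposite depth convention (smaller z is nearer), the roles of min and max swap.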
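A scalar sketch of the masked write is below. The real code would use a vector mask and a single masked store; the 16-bit `color_mask` with one bit per pixel and the name `write_block` are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int kBlockPixels = 16; // one 4x4 block of 32-bit pixels

// Write one block. Bit i of color_mask is set when pixel i is covered.
void write_block(std::uint32_t* dest, const std::uint32_t* src, std::uint16_t color_mask)
{
    assert(dest != nullptr && src != nullptr);

    if (color_mask == 0xffff)
    {
        // Fully covered block: no per-pixel masking, copy the whole cache line.
        std::memcpy(dest, src, kBlockPixels * sizeof(std::uint32_t));
        return;
    }

    // Partially covered block: keep the old color where the mask bit is clear.
    for (int i = 0; i < kBlockPixels; ++i)
    {
        if ((color_mask >> i) & 1)
            dest[i] = src[i];
    }
}
```

For a block classified as fully inside, the caller can skip computing the mask and pass 0xffff directly.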

Performance? 500K triangles on a 1920x1080 buffer render at 60 fps easily on an i7 4770 CPU. 3840x2160 can render 200K triangles at 60 fps, too, on the same CPU. I don't have comprehensive charts or anything like that at this time since coding is still on-going, but the results are promising.

The effect of resolution is actually smaller than anticipated. The number of triangles is the more limiting factor, as the transformation and binning code is still not optimized at all (those operations run in multiple threads but that's it). The triangle setup code is still scalar; we could at least set up 4, 8 or 16 triangles simultaneously. Also, at these resolutions and 500K-1M triangles the triangles are so small, only 2-4 pixels in size, while the block we are processing is 4x4. But reducing the block size doesn't give any benefit (tested), so we are going with those dimensions.

I compiled a 64-bit Linux demo for SSE4.1. It can be found here.

Update: on the i9 7900x the performance is nearly doubled when using AVX-512 over AVX2, as long as the fragments are expensive enough. With only depth test and gouraud shading the memory bandwidth is the limiting factor; AVX-512 is only 25% faster. This means AVX-512 leaves more headroom for more expensive shaders. I read the early reports about Skylake-X CPUs thermal throttling when AVX-512 is in heavy use, but I did not encounter this effect with my setup; I have AIO liquid cooling with enough cooling on the case, and all CPU cores run at 100% utilization consistently w/o throttling. Amazing CPU for the price, even if it is a bit steep, but for once you get what you pay for. Intel did not pay me to advertise their products but they could (wink wink)

Auto Vectorization and AVX512

Compilers are pretty cool these days. They can generate vector code automatically if you give them the chance. Here is a simple example:

struct float32x16
{
    float v[16];
};

float32x16 compute(float32x16 a, float32x16 b)
{
    float32x16 result;
    for (int i = 0; i < 16; ++i)
        result.v[i] = a.v[i] < b.v[i] ? a.v[i] : b.v[i] + a.v[i];
    return result;
}

    // compiled for avx512
    vmovups zmm2, ZMMWORD PTR [rsi]
    mov     rax, rdi
    vmovups zmm0, ZMMWORD PTR [rdx]
    vaddps  zmm1, zmm2, zmm0
    vcmpltps        k1, zmm2, zmm0
    vmovaps zmm1{k1}, zmm2
    vmovups ZMMWORD PTR [rax], zmm1

Indeed; the compiler was able to process sixteen floats per instruction. Let's see what happens if we explicitly write the code in vector form:

__m512 compute(__m512 a, __m512 b)
{
    return _mm512_mask_blend_ps(_mm512_cmp_ps_mask(a, b, _CMP_LE_OS), _mm512_add_ps(a, b), a);
}

    // compiled for avx512
    vcmpps    k1, zmm0, zmm1, 2 
    vaddps    zmm2, zmm0, zmm1
    vblendmps zmm0{k1}, zmm2, zmm0

Much nicer; it runs completely in CPU registers and uses the new avx512 kmask. The difference was not in the compiler being better: it was the same compiler, with the same compiler options. The calling convention that passes function arguments in registers seems to be the nicer way to go overall.

Of course, the counter-argument is that only in such a trivial example is the compiler unable to keep things in registers; when you compile a more complicated function using these simpler functions, the inlining will make a lot of this overhead disappear. The only places where the compiler absolutely must read and write memory are the inputs and the eventual outputs of the transformations we are doing to the data. I have indeed observed this to a great degree, but the register calling convention still yields better results overall.

So, if we want to express our intent with explicit vector register types we must make it as convenient as possible. The x86 (and others like ARM) intrinsic syntax is very convoluted and difficult to read and write. The least you can do for yourself is to overload the most common operations so that the code will look more like this:

float32x16 compute(float32x16 a, float32x16 b)
{
    return select(a > b, a + b, a);
}

To a trained eye this should be much easier to follow. The math library I have been writing works like this already, but the existing arrangement generates a bit-pattern of all 0's or all 1's in each lane of the result depending on the outcome of the compare operation. This value, a sort of a mask, is then used to blend between two input values.

Every CPU SIMD architecture worked like this, more or less, before AVX512 came along. It dropped this approach and uses what Intel calls kmask registers, which are in practice masks of up to 64 bits where each bit represents one vector lane. The mask_move or blend operations then use these kmasks to blend between different inputs. The difference from the old arrangement is that the old way could use bitwise AND, ANDNOT, and OR operations to do the blending. The new way uses a single bit per lane as a control to select between the inputs. This means the kmasks are easy to move between the ALU and the AVX512 vector engine, which allows other kinds of cool tricks. Most of the AVX512 instructions can be masked by these kmasks, which in theory makes it easier for the compiler to implement branches with predication instead of actual control flow.

Long story short, I am currently in the process of converting the masking system to be more AVX512 friendly. The downside is that we need a lot more code, since the masks resulting from compare operations are now distinct types. They used to be the same type as the compared vectors. We need to be able to convert between different types of masks and do logical operations between masks:
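The two styles can be compared with a portable scalar emulation over four lanes. This is only a sketch; `float4`, `bitmask4`, `kmask4` and the function names are made up for illustration and are not the math library's API.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using float4 = std::array<float, 4>;

// Classic SSE/NEON style mask: every lane is all ones or all zeros.
using bitmask4 = std::array<std::uint32_t, 4>;

// AVX512 style mask: bit i describes lane i.
using kmask4 = std::uint8_t;

bitmask4 compare_lt_bits(const float4& a, const float4& b)
{
    bitmask4 m{};
    for (int i = 0; i < 4; ++i)
        m[i] = a[i] < b[i] ? 0xffffffffu : 0u;
    return m;
}

kmask4 compare_lt_k(const float4& a, const float4& b)
{
    kmask4 m = 0;
    for (int i = 0; i < 4; ++i)
        if (a[i] < b[i])
            m |= kmask4(1u << i);
    return m;
}

// Bitwise blend on the raw lane bits: (mask AND x) OR ((NOT mask) AND y).
float4 select_bits(const bitmask4& m, const float4& x, const float4& y)
{
    float4 r{};
    for (int i = 0; i < 4; ++i)
    {
        std::uint32_t xb, yb;
        std::memcpy(&xb, &x[i], sizeof xb);
        std::memcpy(&yb, &y[i], sizeof yb);
        const std::uint32_t rb = (m[i] & xb) | (~m[i] & yb);
        std::memcpy(&r[i], &rb, sizeof rb);
    }
    return r;
}

// Per-lane select controlled by a single bit per lane.
float4 select_k(kmask4 m, const float4& x, const float4& y)
{
    assert((m >> 4) == 0); // only the low four bits are meaningful here
    float4 r{};
    for (int i = 0; i < 4; ++i)
        r[i] = ((m >> i) & 1) ? x[i] : y[i];
    return r;
}
```

Both selects produce the same vector; the difference is that the bitmask is as wide as the data it was compared from, while the kmask is a small integer that is independent of the lane width.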

mask = (a < b) && (a < c);

The above used to be a simple bitwise-and operation (&), but now we can implement logical-and (&&) between the masks. This means more code needs to be written so that it will be perfect (sigh).

The conversion between different types of masks will be a great pain, as we have so many different kinds of masks. We cannot just have mask8, mask16, mask32 and mask64 like AVX512 has, because for AVX512 the meaning of the mask contents is different: one bit per lane. For NEON, AVX2, SSE and so on, the actual width of the lane must be encoded in the mask. For float32x4 the mask will be 32 bits wide for each lane, instead of one bit. We could, of course, use the one-bit-per-lane system for our non-AVX512 masks as well, but then we would have to litter the code with mask conversions everywhere and the generated code would be really bad.

Initially I implemented the AVX512 compare operations so that they always converted the kmask into a bitmask and did a bitwise blend between different vectors. The problem with THAT approach was that the bitwise blend had to be emulated, since AVX512 does not have built-in operations for it. Roughly speaking, the number of compiled instructions doubled.

A compromise is not something we want here, but we must make one. OK, performance comes first, so with AVX512 we do things with kmasks and with SSE, for example, we use bitmasks. Now it should be clearer why the masks need to have a type attached to them. If you compare float32's you cannot use the resulting mask for selecting uint8's because the mask would be too wide. So you must "convert" the mask to the appropriate type if you want to select between different types. Not all conversions are possible. For example, a 512-bit vector of uint8's has 64 lanes, while a 128-bit vector of float64's has only 2. See the problem? 2 lanes vs 64 lanes: if you do a 2-lane compare, only the first 2 lanes would be valid even after the conversion. Things like this just cannot be avoided; we are trimming the icing on the cake here, to be honest. I just don't like ugly corners like that. :(
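One way to attach a type to a mask is to tag it with the element width and the lane count. The sketch below models only the kmask (one bit per lane) representation, and every name in it is hypothetical; an SSE or NEON backend would store full-width lanes behind the same interface. Note that an overloaded && does not short-circuit, which is fine for masks.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical typed mask: one bit per lane, tagged with element width and lane count.
template <int ElementBits, int Lanes>
struct mask
{
    static_assert(Lanes >= 1 && Lanes <= 64, "a kmask holds at most 64 lanes");
    static constexpr std::uint64_t all = ~std::uint64_t(0) >> (64 - Lanes);
    std::uint64_t bits;
};

// Logical operators are only defined between masks of the same type.
template <int E, int L>
mask<E, L> operator&&(mask<E, L> a, mask<E, L> b)
{
    return {a.bits & b.bits};
}

template <int E, int L>
mask<E, L> operator||(mask<E, L> a, mask<E, L> b)
{
    return {a.bits | b.bits};
}

template <int E, int L>
mask<E, L> operator!(mask<E, L> a)
{
    return {~a.bits & mask<E, L>::all};
}

// Conversion changes the element width but only compiles when the lane counts
// match, so a 2-lane mask can never silently select between 64 lanes.
template <int ToBits, int FromBits, int Lanes>
mask<ToBits, Lanes> convert(mask<FromBits, Lanes> m)
{
    return {m.bits};
}

// Read one lane of the mask.
template <int E, int L>
bool lane(mask<E, L> m, int index)
{
    assert(index >= 0 && index < L);
    return ((m.bits >> index) & 1) != 0;
}
```

For example, a compare of sixteen float32's yields a `mask<32, 16>`, and `convert<8>(m)` turns it into a `mask<8, 16>` suitable for selecting between sixteen uint8's.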

GitHub – t0rakka

Jukka Liimatta

Helsinki, Finland.