## holysh*t

I got the microcode issue sorted out and tuned the settings in UEFI. Time to run the parallel jpeg decoding test!

```cpp
void test_jpeg(const std::string& folder)
{
    Path path(folder);
    const size_t count = path.size();

    std::atomic<size_t> image_bytes { 0 };

    for (size_t i = 0; i < count; ++i)
    {
        const auto& node = path[i];
        if (!node.isDirectory())
        {
            // capture the filename by value; the loop has moved on
            // long before the work queue runs the task
            std::string filename = node.name;
            q.enqueue([&path, filename, i, count, &image_bytes] {
                printf("filename: %s (%zu / %zu) begin.\n", filename.c_str(), i + 1, count);
                File file(path, filename);
                Bitmap bitmap(file, filename);
                image_bytes += bitmap.width * bitmap.height * 4;
                printf("filename: %s (%zu / %zu) done.\n", filename.c_str(), i + 1, count);
            });
        }
    }

    q.wait();
    printf("image: %zu MB\n", image_bytes / (1024 * 1024));
}
```


10,616 jpeg files (5.7 GB) loaded and decoded in 12.6 seconds! The 5.7 GB of compressed JPEG data decompresses into 88 GB of raw pixels, a compression ratio of roughly 15:1.

```
real    0m12.619s
user    3m29.716s
sys     0m23.980s
```


Holy SH*T; this machine is FAST. Even better news: my work queue code scales perfectly to 20 concurrent threads without slowing down. :)

## intel-microcode

WARNING! If you are a Linux user and have a Core i9-based computer, DO NOT ENABLE THE INTEL MICROCODE DRIVER!!!

I have an older Linux box with a Core i7 3770 processor which has served me loyally for years. I recently upgraded, installed a fresh Linux Mint 18.2, and the first screen that greets you after a new installation is the Driver Manager. It was offering binary drivers from NVIDIA Corporation and "Unknown" (labeled "intel-microcode"). Installing the microcode sounded like a good idea.

The old computer compiled one specific piece of code in 22 seconds using "make -j10". The new computer, with nearly 3x the CPU capacity, took 15-18 seconds, which is hardly an improvement at all. The CPU utilization was jumping between 20 and 30%, which was pathetic. The old machine used to go ALL OUT at 100% until the job was done. The statistics looked like this:

```
real    0m22.231s
user    2m45.231s
```

If you divide user by real, we get roughly 7.5, which makes sense since the hardware has a maximum concurrency of 8.

The new machine was giving a concurrency ratio a little below 4! Long story short, after disabling the intel-microcode goodness the results look like this:

```
real    0m6.122s
user    1m9.624s
```

Concurrency is now 11.3. The hardware should be able to do 20, but this is definitely an improvement. The CPU utilization still doesn't go all out at 100%; we should be getting a real of 0m3s with a user of 2m0s. HT is enabled, and I can confirm 100% CPU load with a custom workload from my threadpool code. Something funky going on with GNU make?
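The concurrency figures are just user time divided by real time; a quick sanity check of both numbers:

```cpp
// Concurrency ratio: total CPU time (user) divided by wall-clock time (real).
// old machine: user 2m45.231s / real 22.231s -> ~7.4
// new machine: user 1m9.624s  / real  6.122s -> ~11.4
double concurrency(double user_seconds, double real_seconds)
{
    return user_seconds / real_seconds;
}
```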

On the bright side, the machine is now over 3x the speed of the old one, and the extra cores alone shouldn't give more than 2.5x, so it is exceeding expectations while simultaneously showing underwhelming CPU usage. I/O limited? :(

[update]

September 27, 2017. 4.5 seconds!

## Software Rasterizer

It is a completely different thing to believe the code you are writing isn't generating bad machine code, and to know it actually isn't. You want to be sure, so you have to give the code real data to transform. I am not one of those people with a great imagination, so I decided I would write a software triangle rasterizer.

Having written a few in my younger days, my options were more limited than if I had not; the world wasn't my oyster. I was forced to design around the things that I 'knew', so I started my design from the number one bottleneck in modern CPUs: memory access. The other thing I wanted to do was leverage SIMD instructions, as that was the task I had set out to do in the first place.

So: I would be computing multiple pixels, or fragments if you will, simultaneously. I chose the 2x2 quad as my primitive for 4-wide SIMD vectors; this makes sense, as GPU hardware works like this for various reasons. It also scales nicely to wider SIMD vectors: 8-wide can do two side-by-side quads simultaneously, and 16-wide can do a whole 4x4 block at once. We want to avoid super-wide spans like 16x1 pixels because most of that width would be wasted on typical triangles.

4x4 is also nice for 32-bit buffers: 16 pixels times 4 bytes each is 64 bytes, which happens to be the L1 cache line size on many contemporary CPU architectures. If we align our buffers, and thus the 4x4 blocks, to 64 bytes, our memory access just got reasonably efficient.

Now we are taking good advantage of the vector units in our CPU. The next problem to solve is how to use the multiple CPU cores. The obvious solution is called 'binning': the framebuffer is split into a number of tiles which are processed individually. 128x128 is a good size, as it is not too small and still fits into the L2 cache of most CPUs, leaving some cache for textures and other input.

When the vertices are transformed, the resulting coordinates can be used for binning the resulting triangles. Binning can be done in either clip or screen coordinates. Screen-coordinate binning should not require explanation, so a few words about clip-coordinate binning: in clip coordinates the ratios x/w and y/w determine the bin, or bins, the triangle belongs to.
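A sketch of screen-coordinate binning (hypothetical illustration, not my actual code): take the triangle's bounding box and mark every 128x128 tile it overlaps.

```cpp
#include <algorithm>
#include <vector>

constexpr int TILE = 128;

struct Vec2 { float x, y; };

// Return the indices of the tiles a triangle's bounding box overlaps
// on a screen of the given size. A real binner might also reject tiles
// the triangle's edges don't actually cross, but the bounding box is
// the usual conservative first step.
std::vector<int> bin_triangle(Vec2 a, Vec2 b, Vec2 c, int width, int height)
{
    const int tilesX = (width + TILE - 1) / TILE;
    const int tilesY = (height + TILE - 1) / TILE;

    int x0 = int(std::min({ a.x, b.x, c.x })) / TILE;
    int y0 = int(std::min({ a.y, b.y, c.y })) / TILE;
    int x1 = int(std::max({ a.x, b.x, c.x })) / TILE;
    int y1 = int(std::max({ a.y, b.y, c.y })) / TILE;

    // clamp to the screen
    x0 = std::max(x0, 0);
    y0 = std::max(y0, 0);
    x1 = std::min(x1, tilesX - 1);
    y1 = std::min(y1, tilesY - 1);

    std::vector<int> bins;
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            bins.push_back(y * tilesX + x);
    return bins;
}
```

With 500K-1M triangles that are only a few pixels each, the vast majority land in exactly one bin, which keeps the per-triangle binning cost tiny.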

The last step is called 'resolve', where each bin is resolved by a discrete CPU thread. This has a nice effect on the CPU cache, as many triangles processed in the same CPU thread end up written to the same area of memory. One CPU core thus accesses the same L2 cache and does not need to share it with other threads, which reduces on-chip processing overhead significantly.

Enough theory! Screenshot!

As can be observed, the number of features isn't so great at this time. There is depth buffering, some perspective-correct gradients, and stuff like that. There is texture mapping as well (not shown), and it is quite trivial to add more gradients and use them in different creative ways. The inner loops are still hand-written, but if I ever get serious about this they should be compiled from a higher-level shading language. I will never get serious about this, though, as there is no place for software rendering these days. I wrote this for fun and to test the math library, okay?

One neat feature I must add is early-z. The 'heart' of the rasterizer can easily classify 4x4 (or any other size) blocks into three cases: fully inside the triangle, trivially rejected as outside the triangle, or crossing a triangle edge. When a block is fully inside, its depth can be stored in a coarse depth buffer, and any block that is about to be rasterized can then be tested against the coarse depth to reject blocks that are not visible. That will be a fun feature, but I need more complicated test scenes for this, seriously.
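A sketch of the coarse depth idea (hypothetical names, not my code; I assume smaller z means closer and store the block's farthest depth, so rejection stays conservative):

```cpp
#include <algorithm>
#include <vector>

// Coarse depth buffer: one value per block, holding the farthest depth
// (max z) written by blocks that were fully inside a triangle. A block
// whose nearest fragment is farther than that value cannot contribute
// any visible pixels and can be rejected before rasterization.
struct CoarseDepth
{
    std::vector<float> depth;

    explicit CoarseDepth(size_t blocks)
        : depth(blocks, 1.0f) // initialized to the far plane
    {
    }

    // Called when a fully covered block is written to the depth buffer.
    void update(size_t block, float blockMaxZ)
    {
        depth[block] = std::min(depth[block], blockMaxZ);
    }

    // Can a block whose nearest fragment is at blockMinZ be skipped?
    bool occluded(size_t block, float blockMinZ) const
    {
        return blockMinZ > depth[block];
    }
};
```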

Other optimizations: when a block is fully inside the triangle, there is no need to compute the colorMask, which is used to mask out writes to the color buffer. The code does not write pixels out one by one; we process 16 pixels simultaneously, so we write them all out simultaneously. Remember: the cost is the same for 1 or 16 pixels because they reside in the same L1 cache line. That is the smallest unit the CPU can write to memory across the memory bus anyway.
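A scalar sketch of the masked block write (real code would use SIMD masked stores; the names here are illustrative):

```cpp
#include <cstdint>

// Write a 4x4 block (16 pixels, one cache line) to the color buffer.
// When the block is fully inside the triangle the mask is all ones and
// every lane is written; on edge-crossing blocks the mask selects only
// the covered pixels.
void write_block(uint32_t* dest, const uint32_t* color, uint16_t colorMask)
{
    for (int i = 0; i < 16; ++i)
    {
        if (colorMask & (1u << i))
            dest[i] = color[i];
    }
}
```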

Performance? 500K triangles on a 1920x1080 buffer render at 60 fps easily on an i7 4770 CPU. 3840x2160 can render 200K triangles at 60 fps on the same CPU. I don't have comprehensive charts or anything like that at this time, since coding is still ongoing, but the results are promising.

The effect of resolution is actually smaller than anticipated. The number of triangles is the more limiting factor, as the transformation and binning code is still not optimized at all (those operations run in multiple threads, but that's it). The triangle setup code is still scalar; we could at least set up 4, 8, or 16 triangles simultaneously. Also, at these resolutions with 500K-1M triangles, the triangles are very small, only 2-4 pixels in size, while the block we are processing is 4x4. But reducing the block size doesn't give any benefits (tested), so we are going with those dimensions.

I compiled a Linux 64-bit demo for SSE4.1. It can be found here.

Update: on the i9 7900X the performance nearly doubles when using AVX-512 over AVX2, provided the fragments are expensive enough. With only a depth test and Gouraud shading, memory bandwidth is the limiting factor and AVX-512 is only 25% faster. This means AVX-512 leaves more headroom for more expensive shaders. I read the early reports about Skylake-X CPUs thermally throttling when AVX-512 is in heavy use, but I did not encounter this effect with my setup; I have AIO liquid cooling with enough airflow in the case, and all CPU cores run at 100% utilization consistently without throttling. Amazing CPU for the price, even if it is a bit steep, but for once you get what you pay for. Intel did not pay me to advertise their products, but they could wink wink

Jukka Liimatta

Helsinki, Finland.

Programmer.