Thinking that the code you are writing does not generate bad code is a completely different thing from it actually not generating bad code. You want to be sure, so you have to give the code real data to transform. I am not one of those people with a great imagination, so I decided to write a software triangle rasterizer.
Having written a few in my younger days, my options were more limited than if I had not; the world wasn't my oyster. I was forced to design one around the things that I 'knew', so I started my design from the number one bottleneck in modern CPUs: memory access. The other thing I wanted to do was to leverage SIMD instructions, as that is what I set out to do in the first place.
So, I would be computing multiple pixels, or fragments if you will, simultaneously. I chose a 2x2 quad as the primitive for 4-wide SIMD vectors; this makes sense, as GPU hardware works like this for various reasons. It also scales nicely to wider SIMD vectors: 8-wide can do two side-by-side quads simultaneously and 16-wide can do a whole 4x4 block at once. We want to avoid super-wide spans like 16x1 pixels because most of the lanes would be wasted for most triangles.
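A minimal sketch of the lane layouts described above. The helper name `laneOffset` and the lane ordering (quad by quad, two quads per row for the wider vectors) are assumptions made for illustration, not the layout of the actual rasterizer.

```cpp
#include <cassert>

struct Offset { int x, y; };

// Hypothetical helper: pixel offset covered by SIMD lane `lane` in a vector of
// `width` lanes. Lanes are grouped into 2x2 quads. With 8 lanes the two quads
// sit side by side (4x2 pixels); with 16 lanes the four quads form a 4x4 block.
inline Offset laneOffset(int width, int lane)
{
    assert(width == 4 || width == 8 || width == 16);
    assert(lane >= 0 && lane < width);
    const int quad = lane / 4;                    // which 2x2 quad
    const int inQuad = lane % 4;                  // position inside the quad
    const int quadsPerRow = (width >= 8) ? 2 : 1; // quads placed side by side
    const int qx = quad % quadsPerRow;
    const int qy = quad / quadsPerRow;
    return { qx * 2 + (inQuad & 1), qy * 2 + (inQuad >> 1) };
}
```

With this ordering a 16x1 span never appears: the widest footprint is 4 pixels across, which keeps more lanes useful on small triangles.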
4x4 is also nice for 32-bit buffers: 16 pixels times 4 bytes each is 64 bytes, which happens to be the L1 cache line size on many contemporary CPU architectures. If we align our buffers, and thus the 4x4 blocks, to 64 bytes, our memory access becomes reasonably efficient.
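The arithmetic above can be sketched as a block-linear buffer layout. The `Block` struct and the two index helpers are hypothetical names, and the sketch assumes a 64-byte cache line and a buffer width that is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kBlockDim = 4; // 4x4 pixels per block

// One 4x4 block of 32-bit pixels. alignas(64) keeps every block on its own
// (assumed 64-byte) cache line when blocks are stored in an aligned array.
struct alignas(64) Block {
    std::uint32_t pixel[kBlockDim * kBlockDim];
};
static_assert(sizeof(Block) == 64, "16 pixels * 4 bytes must be 64 bytes");

// Index of the block that contains pixel (x, y); `width` is pixels per row.
inline std::size_t blockIndex(int x, int y, int width)
{
    assert(width % kBlockDim == 0);
    const int blocksPerRow = width / kBlockDim;
    return std::size_t(y / kBlockDim) * std::size_t(blocksPerRow)
         + std::size_t(x / kBlockDim);
}

// Index of pixel (x, y) inside its block, row-major within the block.
inline int pixelInBlock(int x, int y)
{
    return (y % kBlockDim) * kBlockDim + (x % kBlockDim);
}
```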
Now we are taking good advantage of the vector units in our CPU. The next problem to solve is how to use the multiple CPU cores. The obvious solution is called 'binning': the framebuffer is split into a number of tiles which are processed individually. 128x128 is a good size, as it is not too small and still fits into the L2 cache of most CPUs, leaving some cache for textures and other input.
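A minimal sketch of screen-coordinate binning with 128x128 tiles: the triangle's bounding box is clamped to the framebuffer and converted to an inclusive range of tile indices. `TileRange` and `binTriangle` are hypothetical names; a bounding box is a conservative choice, since it can touch tiles the triangle itself does not cover.

```cpp
#include <algorithm>
#include <cassert>

constexpr int kTileSize = 128;

// Inclusive range of tile indices touched by a triangle's bounding box.
struct TileRange {
    bool empty;
    int x0, y0, x1, y1;
};

// Vertices are given in screen (pixel) coordinates.
inline TileRange binTriangle(float ax, float ay, float bx, float by,
                             float cx, float cy, int width, int height)
{
    assert(width > 0 && height > 0);
    // Bounding box clamped to the framebuffer.
    const float minX = std::max(std::min({ax, bx, cx}), 0.0f);
    const float minY = std::max(std::min({ay, by, cy}), 0.0f);
    const float maxX = std::min(std::max({ax, bx, cx}), float(width - 1));
    const float maxY = std::min(std::max({ay, by, cy}), float(height - 1));
    if (minX > maxX || minY > maxY)
        return { true, 0, 0, 0, 0 }; // entirely off-screen
    // Values are non-negative here, so truncation equals floor.
    return { false,
             int(minX) / kTileSize, int(minY) / kTileSize,
             int(maxX) / kTileSize, int(maxY) / kTileSize };
}
```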
Once the vertices are transformed, the resulting coordinates can be used for binning the triangles. Binning can be done in either clip or screen coordinates. Screen coordinate binning should not require explanation, so here are a few words about clip coordinate binning: in clip coordinates the ratios x/w and y/w determine the bin, or bins, the triangle belongs to.
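A minimal sketch of how x/w and y/w can select a bin for a single vertex. It assumes w > 0 (the vertex is in front of the camera), that x/w and y/w map [-1, 1] onto the full framebuffer, and that off-screen vertices are clamped to the border tiles; `clipToTile` is a hypothetical name, and the y-axis direction is an arbitrary choice here. A triangle would then belong to every bin in the range spanned by its three vertices.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

constexpr int kTileSize = 128;

struct Tile { int x, y; };

// Maps a clip-space vertex (x, y, w) to the tile that contains it.
inline Tile clipToTile(float x, float y, float w, int width, int height)
{
    assert(w > 0.0f); // assumption: vertices behind the camera are handled elsewhere
    // x/w and y/w are in [-1, 1] for visible vertices; scale to pixels.
    const float sx = (x / w * 0.5f + 0.5f) * float(width);
    const float sy = (y / w * 0.5f + 0.5f) * float(height);
    const int tilesX = (width + kTileSize - 1) / kTileSize;
    const int tilesY = (height + kTileSize - 1) / kTileSize;
    // Clamp in float before converting so out-of-range values stay defined.
    const float tx = std::min(std::max(std::floor(sx / kTileSize), 0.0f), float(tilesX - 1));
    const float ty = std::min(std::max(std::floor(sy / kTileSize), 0.0f), float(tilesY - 1));
    return { int(tx), int(ty) };
}
```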
The last step is called 'resolve', where each bin is resolved by its own CPU thread. This has a nice effect on the CPU cache, as many triangles processed in the same thread end up written to the same area of memory. One CPU core thus keeps working in the same L2 cache and does not need to share it with other threads, which reduces on-chip processing overhead significantly.
Enough theory! Screenshot!
As can be observed, the number of features isn't so great at this time. There is depth buffering going on, some perspective-correct gradients and stuff like that. There is texture mapping as well (not shown), and it is quite trivial to add more gradients and use them in different creative ways. The inner loops are still hand-written, but if I ever get serious about this they should be compiled from a higher-level shading language. I will never get serious about this, though, as there is no place for software rendering these days. I wrote this for fun and to test the math library, okay?
One neat feature I must add is early-z. The 'heart' of the rasterizer can easily classify 4x4 blocks (or blocks of any other size) into three cases: fully inside the triangle, trivially rejected as outside the triangle, or crossing a triangle edge. When a block is fully inside, the minZ can be stored in a coarse depth buffer, and then any block that is about to be rasterized can be tested against the coarse depth to reject blocks that are not visible. That will be a fun feature, but I need more complicated test scenes for this, seriously.
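The three-way block classification can be sketched with edge functions evaluated at the four corner pixels of a block. This is an illustrative version, not the rasterizer's actual inner loop: the names `Edge`, `makeEdge` and `classifyBlock` are hypothetical, and the triangle is assumed to be wound so that interior points give non-negative edge values.

```cpp
#include <cassert>

enum class Coverage { Outside, Partial, Inside };

// Edge function E(x, y) = a*x + b*y + c, non-negative on the interior side.
struct Edge { float a, b, c; };

inline Edge makeEdge(float x0, float y0, float x1, float y1)
{
    return { y0 - y1, x1 - x0, x0 * y1 - x1 * y0 };
}

// Classifies the block whose corner pixels are (x, y) and (x+size-1, y+size-1).
// An edge function is linear, so its extremes over the block lie at corners.
inline Coverage classifyBlock(const Edge (&edges)[3], int x, int y, int size)
{
    assert(size > 0);
    const float xs[2] = { float(x), float(x + size - 1) };
    const float ys[2] = { float(y), float(y + size - 1) };
    bool allInside = true;
    for (const Edge& e : edges) {
        int insideCorners = 0;
        for (float cy : ys)
            for (float cx : xs)
                if (e.a * cx + e.b * cy + e.c >= 0.0f)
                    ++insideCorners;
        if (insideCorners == 0)
            return Coverage::Outside; // all corners outside one edge: trivial reject
        if (insideCorners != 4)
            allInside = false;        // this edge crosses the block
    }
    // Partial is conservative: near a vertex it may include empty blocks.
    return allInside ? Coverage::Inside : Coverage::Partial;
}
```

Only blocks classified as fully inside would be candidates for updating the coarse depth buffer, since only those are known to cover every pixel of the block.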
Other optimizations: when a block is fully inside the triangle there is no need to compute the colorMask, which is used to mask out writes to the color buffer. The code does not write pixels out one by one; we process 16 pixels simultaneously, so we write them all out simultaneously. Remember, the cost is the same for 1 or 16 pixels because they reside in the same L1 cache line. That is the smallest unit the CPU can write into memory across the memory bus anyway.
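A minimal sketch of the two store paths, written as plain loops rather than intrinsics so it stays portable; a compiler targeting a 512-bit ISA can turn each loop into a single vector operation. The function names are hypothetical, and the mask is assumed to hold all ones for covered pixels and all zeros for the rest.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kPixels = 16; // one 4x4 block

// Block fully inside the triangle: no mask needed, overwrite all 16 pixels.
inline void storeBlock(std::uint32_t* dst, const std::uint32_t* src)
{
    assert(dst && src);
    for (int i = 0; i < kPixels; ++i)
        dst[i] = src[i];
}

// Block crossing an edge: keep the old color where the mask is zero.
inline void storeBlockMasked(std::uint32_t* dst, const std::uint32_t* src,
                             const std::uint32_t* mask)
{
    assert(dst && src && mask);
    for (int i = 0; i < kPixels; ++i)
        dst[i] = (src[i] & mask[i]) | (dst[i] & ~mask[i]);
}
```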
Performance? 500K triangles into a 1920x1080 buffer render at 60 fps easily on an i7 4770 CPU. At 3840x2160 it can render 200K triangles at 60 fps, too, on the same CPU. I don't have comprehensive charts or anything like that at this time, since coding is still ongoing, but the results are promising.
The effect of resolution is actually smaller than anticipated. The number of triangles is the more limiting factor, as the transformation and binning code is still not optimized at all (those operations run in multiple threads but that's it). The triangle setup code is still scalar; we could at least set up 4, 8 or 16 triangles simultaneously. Also, at these resolutions and 500K-1M triangles the triangles are very small, only 2-4 pixels in size, while the block we are processing is 4x4, so much of each block goes unused. Still, reducing the block size doesn't give any benefit (tested), so we are going with those dimensions.
I compiled a 64-bit Linux demo for SSE4.1. It can be found here.
Update: on an i9 7900X the performance is nearly doubled when using AVX-512 over AVX2, when the fragments are expensive enough. With only depth test and Gouraud shading, memory bandwidth is the limiting factor on performance and AVX-512 is only 25% faster. This means AVX-512 leaves more headroom for more expensive shaders. I read the early reports about Skylake-X CPUs thermal throttling when AVX-512 is in heavy use, but I did not encounter this effect with my setup; I have AIO liquid cooling with enough cooling in the case, and all the CPU cores run at 100% utilization consistently without throttling. Amazing CPU for the price, even if it is a bit steep, but for once you get what you pay for. Intel did not pay me to advertise their products, but they could (wink wink).