EDITEDIT: Replaced the inline gists with inline code. The links to the gists are kept around in case someone wants them.

EDIT: If the inline gists don't show up, use these links instead:

I started implementing a tile-based software rasterizer the other day. Well, more like borrowing some old sample code from the user Nick on the DevMaster.net forums. If you're around to read this, thanks Nick!

So what is a tile-based rasterizer? It's what the GPUs use nowadays, as opposed to the oldschool method called scanline rasterization. With scanline rasterization you walk down the edges of your polygon with the bresenham algorithm, and draw one scanline for every y coordinate.

In a tile-based rasterizer on the other hand, you use the standard line/plane equation to classify a pixel as either inside or outside the polygon. But even better; instead of doing that every pixel, you can test a whole NxN box or tile. Tiles that are outside the polygon can be trivially discarded, tiles that are completely inside can be completely rendered with a fast unrolled loop, and intersecting tiles must be traversed and tested per-pixel inside the tile.

I'll write more about tile-based rasterizers later, but you now know the basic idea. So when implementing a tile-based rasterizer, you might meet a small obstacle if you're used to scanline-based rasterizers. How do you calculate the interpolants (per-vertex data) in a perspective correct manner?

In scanline-based rasterizers it's trivial. Normal scalars aren't linear in screenspace with respect to z, and neither is w, but 1/w is. So given a per-vertex scalar s, you interpolate s/w and 1/w along the edges of your polygon using Bresenham. Then at each pixel, you compute w' = 1/(1/w) , and finally s' = s*w'.

The problem with tile-based rasterizers is that you need something else than Bresenham to compute these interpolants with, for every pixel. We don't have access to the edges anymore. One way to do this is to compute Barycentric coordinates, given that we only render triangle primitives (which you should anyway). With barycentric coordinates for every pixel, you can just use those weights to blend the values from each vertex. Simple!

```
float wReci =b1*v1.w + b2*v2.w + b3*v3.w; //v1/v2/v3's w is already flipped to 1/w
float sReci = b1*v1.s + b2*v2.s + b3*v3.s; //v1/v2/v3's s is already flipped to s/w
float finalW = 1.0f / wReci;
float finalS = sReci * finalW;
/* use s ..
..
*/
```

The only problem with barycentric coordinates, is that they are slow to compute. From the Wikipedia article on the subject, you'll find the formula:

Fortunately, this can be optimized a lot. As you can see, the denominator is the same for b1 and b2. And only a few of the final sums and factors depends on the x and y from the pixel coordinate. This means that most of the sums and factors can be pre-computed for each triangle, instead of being computed per pixel. Let's see what happens if we expand the equations:

```
/* naive equation/algorithm */
/*
float b1 = (y2*x - y2*x3 - y3*x + y3*x3) + (x3*y - x3*y3 - x2*y + x2*y3);
float b2 = (y3*x - y3*x3 - y1*x + y1*x3) + (x1*y - x1*y3 - x3*y + x3*y3);
float b3 = 1 - b1 - b2;
*/
//Since the only thing that changes per pixel is 'x' and 'y', we can rewrite it like so:
//Setup (once per triangle) :
float denomInv = 1 / ((y2 - y3)(x1 - x3) + (x3 - x2)(y1 - y3));
float x = startX;
float y = startY;
float b1Term1 = y2*x;
float b1Term2 = y3*x;
float b1Term3 = x3*y;
float b1Term4 = x2*y;
float b1TermConst1 = (-y2*x3) + y3*x3;
float b1TermConst2 = (-x3*y3) + x2*y3;
float b1TermConst = b1TermConst1 + b1TermConst2;
float b2Term1 = y3*x;
float b2Term2 = y1*x;
float b2Term3 = x1*y;
float b2Term4 = x3*y;
float b2TermConst1 = (-y3*x3) + y1*x3;
float b2TermConst2 = (-x1*y3) + x3*y3;
float b2TermConst = b2TermConst1 + b2TermConst2;
//compute s0/w0, s1/w1, s2/w2 as usual..
//compute 1/w0, 1/w1, 1/w2 as usual..
//Loop (for every pixel) :
for(y; y<yEnd; ++y){
for(; x<xEnd; ++x){
float b1 = (b1TermConst + b1Term1 + b1Term3 - b1Term2 - b1Term4) * denomInv;
float b2 = (b2TermConst + b2Term1 + b2Term3 - b2Term2 - b2Term4) * denomInv;
float b3 = 1 - b1 - b2;
float sOverW = s0w0*b1 + s1w1*b2 + s2w2*b3;
float OneOverW = w0*b1 + w1*b2 + w2*b3;
float wFinal = 1 / OneOverW;
float sFinal = sOverW * wFinal;
b1Term1 += y2
b1Term2 -= y3
b2Term1 += y3
b2Term2 -= y1
}
b1Term3 += x3
b1Term4 -= x2
b2Term3 += x1
b2Term4 -= x3
}
```

There! Much better than the naive algorithm. So, in the triangle setup we got 3 divisions, 16 multiplications, 10 subtractions/additions as the constant cost, and 3 divisions per vertex scalar. Per pixel, we have to compute 1 division, 5 multiplications and 16 sub/additions as the constant cost, and 3 multiplications and 2 additions per vertex scalar. In addition there are 4 sub/additions per column when we change the y coordinate. The division per-pixel is impossible to get rid of, since it is used to compute the reciprocal of the interpolated w value. A possible optimization is to interpolate w instead of 1/w over the tile. It's not perspective correct, but given a 4x4, 8x8 or 16x16 tile size, the distortion is probably not noticeable. Id did a similar hack in their Quake-renderer.

That's it for barycentric coordinates. Finally, there is a possibly more efficient method which uses a 3x3 transformation matrix instead, as mentioned in this paper «Triangle Scan Conversion using 2D Homogeneous Coordinates» by Marc Olano and Trey Greer

a, b and c are coefficients which satisfy this equation:

X and Y are pixel / screenspace coordinates. The matrix above is made from the vertices in clip space *before* the projection, i.e division by w. In OpenGL, the coefficient matrix would be made out of the x, y and w components right after the vertices are transformed by the modelview-projection matrix. s1, s2, and s3 defines a scalar s, one value from each vertex in the triangle respectively. So for each scalar, you gather it from all three vertices and make a 3D vector out of it. Then, this vector is multiplied with the matrix above, and that gives you the coefficients to later compute s/w per pixel.

We still need to interpolate 1/w. To get 1/w, just transform the vector [1 1 1] with the matrix. Also some values (for example z used for depth testing) needs to avoid perspective-correction, i.e need a normal linear interpolation. To do that, scale each scalar in the vector with w.

That's what is required in the triangle setup. We then end up with a rasterizer inner-loop roughly like this:

```
//setup
x = startX;
y = startY;
float sX = s1*x; //s1 is 'a' for s
float sY = s2*y; //s2 is 'b' for s
float wX = w1*X; //w1 is 'a' for w
float wY = w2*Y; //w2 is 'b' for w
//optimized loop:
for(; y < yEnd; ++y){
for(; x < xEnd; ++x){
float sOverW = sX + sY + s3; //s3 is 'c' for s
float finalW = 1.0f / (wX + wY + w3); //w3 is 'c' for w
float finalS = sOverW * finalW;
sX += s1;
wX += w1;
}
sY += s2;
wY += w2;
}
```

Whoah! Only one multiplication per scalar! The rest is a bunch of cheap additions! Now, computing the inverse of the 3x3 matrix isn't exactly cheap, but that's done once for every triangle. I guess being lazy is a good idea. First see if the triangle will be rendered, and then afterwards compute the coefficient matrix. I haven't tested performance yet, but I'm fairly sure using the matrix-scheme favours big over small triangles, due to moving all the work into the triangle setup.

And since I'm so nice, here's some code to compute the 3x3 matrix. Remember to transform x, y and w into the correct coordinate system!

```
Matrix3f ComputeCoeffMatrix(const Vector4f& v1, const Vector4f& v2, const Vector4f& v3)
{
Matrix3f m(Vector3f(v1.x, v2.x, v3.x),
Vector3f(v1.y, v2.y, v3.y),
Vector3f(v1.w, v2.w, v3.w));
const float eps = 0.01f;
float det =
m[0]*(m[4]*m[8] - m[5]*m[7]) +
m[1]*(m[5]*m[6] - m[3]*m[8]) +
m[2]*(m[3]*m[7] - m[4]*m[6]);
ASSERT(std::abs(det) > eps);
// Matrix cofactors
float c00 = +(m[4]*m[8] - m[5]*m[7]); //4,5,7,8
float c10 = -(m[3]*m[8] - m[5]*m[6]); //3,5,6,8
float c20 = +(m[3]*m[7] - m[4]*m[6]); //3,4,6,7
float c01 = -(m[1]*m[8] - m[2]*m[7]); //1,2,7,8
float c11 = +(m[0]*m[8] - m[2]*m[6]); //0,2,6,8
float c21 = -(m[0]*m[7] - m[1]*m[6]); //0,1,6,7
float c02 = +(m[1]*m[5] - m[2]*m[4]); //1,2,4,5
float c12 = -(m[0]*m[5] - m[2]*m[3]); //0,2,3,5
float c22 = +(m[0]*m[4] - m[1]*m[3]); //0,1,3,4
//Make an adjoint matrix (reuse m)
m[0] = c00; m[1] = c01; m[2] = c02;
m[3] = c10; m[4] = c11; m[5] = c12;
m[6] = c20; m[7] = c21; m[8] = c22;
det = 1.0f/det;
m[0] *= det; m[1] *= det; m[2] *= det;
m[3] *= det; m[4] *= det; m[5] *= det;
m[6] *= det; m[7] *= det; m[8] *= det;
return m;
}
```

That's it for now. Next post will probably be about guard band clipping :-)