Half of Linear Algebra Course in 10 Python One-liners

A little disclaimer first. This post is not about linear algebra in Python. If you want exactly that, have a look at python.linalg, it is worth it. This post is about how awesome Python is in its core and how much you can achieve with it writing very little code.

We’ll start slow. Python has list comprehensions. It’s a way you can make one list into another without explicitly specifying loops and indexes. So if you want a function that multiplies a vector by scalar, you can do it as simple as this:

def scale(A, x): return [ai*x for ai in A]

You can do several lists at once with zip. It takes a couple of lists and makes a single list of tuples out of them. Each tuple then contains exactly one element from every list. So adding vectors is rather plain too.

def add(A, B): return [ai+bi for (ai, bi) in zip(A, B)]

Python also has some list handling functions from the box. Like sum. Let’s use it to write a function for a dot product.

def dot(A, B): return sum([ai*bi for (ai, bi) in zip(A, B)])

List comprehension is great, but you still might want to operate on indexes now and then. Python has a range function that generates a sequence of indexes for you. There is also conditional expressions in Python, so with these two features you can compose an identity matrix as a nested list.

def ident(n): return [[1.0 if i==j else 0.0 for j in range(n)] for i in range(n)]

Remember zip? We used it to add two vectors, but it can operate on an arbitrary numbers of arrays. Matrix, as a nested list, is an arbitrary list of matrix rows. If we manage to zip it, we will get a list of columns. That’s basically a matrix transposition. Python lets you turn a list into an argument list for a function with a special syntax. You have to write your list with an asterisk in front of it.

def transpose(A): return [list(aj) for aj in zip(*A)]

Of course you can use your own functions to build new ones. So far all of our functions were referential transparent meaning that they always provide the same output for the same input. Like mathematical functions or fine tuned mechanical machines. This is a great trait that enables composition without adding much complexity into the code. We can reuse dot function from before in a context of matrix by vector multiplication.

def mul(A, X): return [dot(ai, X) for ai in A]

However it can get messy. At some point the code consists more of user defined words than basic syntax. By adding new functions you create your own language, that’s why naming is important.

We already defined 3 words for vector operations, let’s combine them to make a project function. It projects a point from space onto a plane, or hyperplane, or a line defined by its normal vector and a distance to the center of coordinates.

def project(A, Pn, Pd): return add(A, scale(Pn, (Pd-dot(Pn, A)) / dot(Pn, Pn)))

Python also provides a syntax for multiplying lists by repetitive concatenation. It’s when you write [1, 2, 3] * 3 and get [1, 2, 3, 1, 2, 3, 1, 2, 3]. We can use this to define a distance function from a dot product as a dot product of vector and self is basically a square of its length.

def distance(A, B): return pow(dot( *(add(A, scale(B, -1.0)), )*2 ), 0.5)

Now things can get a little tricky. I want to show you how to solve a linear equation system with a one-liner. Well, it is pretty straightforward with Cramers rule, but you would face serious computational issues using it without Chio condensation, and that might require pivoting, and that is no matter of a one line. I have another idea.

Remember, you can define a plane with a normal vector and a scalar? Well, that pretty much states a linear equation. You can picture a system of linear equations as intersecting planes. When two of them are parallel, there is no intersection and no solution. When two of them are identical, you have a whole line as a solution, so no particular point. But when they intersect, there is the one and only point and you want it.

And this point obviously lies on every single plane at once. Now consider this: a point in space can not be closer to a point on plane, than its own projection on that plane. Because point in space, its projection, and a point on a plane make a right triangle, and hypotenuse can be no shorter than a leg. So if we start from any point in space and will repeatedly project it on every plane from a system, we will eventually come close to the solution. That’s if there is one of course.

That’s how all iterative algorithms are made. You make sure every operation brings you closer to the goal, that’s called convergence, and do these operations until you are done.

But we want this to be one-liner, so no loops are appropriate here. We will use recursion. On every step we will check if we are good with the current solution, and if not — rotate our system by one equation and go on.

As for matrix rotation, Python grants a very convenient syntax for list ranges we will use. You can not only get an element from a list by its index in Python, but also specify a range of elements you want to get with a colon. So all elements of A apart from the first will be simply A[1:] and a list of a single first element of an A — [A[0]]. Now you just concatenate them like A[1:] + [A[0]] and you got yourself a rotation.

def solve(A, B, Xi): return Xi if distance(mul(A, Xi), B) < 0.0001 else solve(A[1:] + [A[0]], B[1:] + [B[0]], project(Xi, A[0], B[0]))

Now let’s finish with computing an inverse matrix. Usually you’d have to find an adjoined matrix of cofactors and divide it to the determinant, but that’s just too much typing. We can find an inverse matrix with one-liner as well.

Consider this. When we multiply a matrix by vector, we have a list of dot products. Now if a vector is an orth, like [1, 0, …], by multiplying matrix by it we will get a list of its first column elements. That the first column. We can get any column by selecting a corresponding orth.

We can get a vector, we would of have by multiplying inverse matrix by something, by solving a linear equation system made of the original matrix, which is itself inverse to the inverse, and that something.

So we will simply solve that system for every orth and then transpose the list of resulting solution to have an inverse matrix.

def invert(A): return transpose([solve(A, ort, [0.0]*len(A)) for ort in ident(len(A))])

That’s it. Only 10 small functions, but so much computation.

Of course Python code is not all that terse. You can pretty much write Fortran-style or Java-style in Python, and by you I mean everyone else. But as for a mainstream language Python sure has a lot of fun to offer.

One More C++ Trick. Compile-time Sorting of a Run-time Array

Compile time computations are pretty easy in modern C++. We have constexpr for expressions we want to evaluate statically, and we now have variadic templates so we can make compile-time parameter sequences of arbitrary length like integer_sequence. But constant expressions can not modify their input data and sequences are not real data structures. Yet you might want to do something like this statically:

    auto a = std::array<int, 8> {6,5,4,7,3,5,1,2};
    for(auto ai : a)
        std::cout << ai << " ";

Well, you have your data in compile-time, why can't you make it sorted in compile-time? Of course you can, but you'd have to resort to trickery.

Actually, compiler strives to make as much work as possible statically, you just have to help it a bit. To make computation easy for compiler to do statically you should make its control flow independent of run-time data. Basically this means it should not contain any conditional branching and looping unless the arguments for a condition are template parameters. Consequently this also means that you can not optimize your control-flow based on data properties. You have a single control-flow for all data, so every case is the worst case.

For us this means that bubble sort is as good as quick sort or insertion sort. Well, let's go with it then. Naive implementation of bubble sort looks like this:

template <size_t N>
void bubble_sort(std::array<int, N>& a){
    for(size_t i = 0; i < N-1; i++){
        for(size_t j = 0; j < N-i-1; j++){
             if(a[j] > a[j+1]) std::swap(a[j], a[j+1]);

It has conditional loops and conditional swap. We'd have to get rid of these. First let's rewrite loops with static recursive functions. Now the facade function will look like this:

template <size_t N>
void static_sort(std::array<int, N>& a){
    outer_loop<0, N>(a);

It simply runs the outer_loop statically passing it the starting counter value and exit condition. Note than in order to deduce N as array size, it should be declared in a template part exactly as size_t. The function would still compile and work fine with int, but you'd have to pass array size to it implicitly.

Now the outer loop:

template <size_t I, size_t N>
static inline void outer_loop(std::array<int, N>& a){
    if(I < N - 1){
        inner_loop<0, I, N>(a);
        outer_loop<I + (I<N-1), N>(a);

Here the condition relies only on template variables, so this is fine, but note that recursive call. It kind of duplicates the condition. I already wrote about that trick in a previous post. The compiler always compiles all the branches regardless of their liveliness. If you call outer_loop as outer_loop<I+1, N>(a), the compiler will not stop creating instances of outer_loop when I hits N-1, it will go further and further until it reaches its maximum recursive depth and then it will stop with an error.

With this hack it will not compile N-1 instance of outer loop, it will instead see that it already has N-2 and that is what we call for in our dead branch, so it'd stop right there.

Ok, now let's get into inner loop.

template <size_t J, size_t I, size_t N>
static inline void inner_loop(std::array<int, N>& a){
    if(J < N-I-1){
        int d = std::abs(a[J]-a[J+1]);
        int s = a[J] + a[J+1];
        a[J] = (s-d) / 2;
        a[J+1] = (s+d) / 2;
        inner_loop<J + (J<N-I-1), I, N>(a);

We can not afford conditional swap, so we'd have to resort to arithmetic magic here. You see, sum of two numbers complimented with their difference is always two times the greater of numbers. And the very sum subtracted their difference is two times the lower. Of course when the difference is 0, the numbers are equal and the sum is just twice the number.

But please note that this trick works reliably only on the lesser quarter of the int range due to the possible overflows.

However it does make computation control-flow data-agnostic, so this really does work statically. The whole thing compiles into this:

        .file   "main.cpp"
        .p2align 4,,15
        .def    ___tcf_0;       .scl    3;      .type   32;     .endef
        movl    $__ZStL8__ioinit, %ecx
        jmp     __ZNSt8ios_base4InitD1Ev
        .def    ___main;        .scl    2;      .type   32;     .endef
        .section .rdata,"dr"
        .ascii " \0"
        .section        .text.startup,"x"
        .p2align 4,,15
        .globl  _main
        .def    _main;  .scl    2;      .type   32;     .endef
        leal    4(%esp), %ecx
        .cfi_def_cfa 1, 0
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        .cfi_escape 0x10,0x5,0x2,0x75,0
        movl    %esp, %ebp
        pushl   %ebx
        pushl   %ecx
        .cfi_escape 0xf,0x3,0x75,0x78,0x6
        .cfi_escape 0x10,0x3,0x2,0x75,0x7c
        subl    $48, %esp
        call    ___main
        leal    -40(%ebp), %ebx
        movl    $7, -12(%ebp)
        movl    $6, -16(%ebp)
        movl    $5, -20(%ebp)
        movl    $5, -24(%ebp)
        movl    $4, -28(%ebp)
        movl    $1, -40(%ebp)
        movl    $2, -36(%ebp)
        movl    $3, -32(%ebp)
        .p2align 4,,7
        movl    (%ebx), %eax
        movl    $__ZSt4cout, %ecx
        addl    $4, %ebx
        movl    %eax, (%esp)
        call    __ZNSolsEi
        subl    $4, %esp
        movl    $1, 8(%esp)
        movl    $LC0, 4(%esp)
        movl    %eax, (%esp)
        call    __ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_i
        leal    -8(%ebp), %eax
        cmpl    %ebx, %eax
        jne     L4
        leal    -8(%ebp), %esp
        xorl    %eax, %eax
        popl    %ecx
        .cfi_restore 1
        .cfi_def_cfa 1, 0
        popl    %ebx
        .cfi_restore 3
        popl    %ebp
        .cfi_restore 5
        leal    -4(%ecx), %esp
        .cfi_def_cfa 4, 4
        .p2align 4,,15
        .def    __GLOBAL__sub_I_main;   .scl    3;      .type   32;     .endef
        subl    $28, %esp
        .cfi_def_cfa_offset 32
        movl    $__ZStL8__ioinit, %ecx
        call    __ZNSt8ios_base4InitC1Ev
        movl    $___tcf_0, (%esp)
        call    _atexit
        addl    $28, %esp
        .cfi_def_cfa_offset 4
        .section        .ctors,"w"
        .align 4
        .long   __GLOBAL__sub_I_main
.lcomm __ZStL8__ioinit,1,1
        .ident  "GCC: (GNU) 4.8.1"
        .def    __ZNSt8ios_base4InitD1Ev;       .scl    2;      .type   32;     .endef
        .def    __ZNSolsEi;     .scl    2;      .type   32;     .endef
        .def    __ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_i; .scl    2;      .type   32;     .endef
        .def    __ZNSt8ios_base4InitC1Ev;       .scl    2;      .type   32;     .endef
        .def    _atexit;        .scl    2;      .type   32;     .endef

Well, I posted the whole disassembly just to show that I have nothing in my sleeve. The most interesting point is the difference between the way numbers are printed out with and without static sorting. Here:

***** _sort.s
        leal    -40(%ebp), %ebx
        movl    $7, -12(%ebp)
        movl    $6, -16(%ebp)
        movl    $5, -20(%ebp)
        movl    $5, -24(%ebp)
        movl    $4, -28(%ebp)
        movl    $1, -40(%ebp)
        movl    $2, -36(%ebp)
        movl    $3, -32(%ebp)
        .p2align 4,,7
***** _NO_SORT.S
        leal    -40(%ebp), %ebx
        movl    $6, -40(%ebp)
        movl    $5, -36(%ebp)
        movl    $4, -32(%ebp)
        movl    $7, -28(%ebp)
        movl    $3, -24(%ebp)
        movl    $5, -20(%ebp)
        movl    $1, -16(%ebp)
        movl    $2, -12(%ebp)
        .p2align 4,,7

You see, all that static_sort does it rearranges the order in which variables are put on the stack! It doesn't do any sorting or rearrangement, just this.

Amazing, isn't it. But is it really applicable? Well, not like this of course. But I do use semi-static implementation of Cramers rule (even without Chio's condensation) to calculate splines. I define segments on [0, 1], so my matrix is always known and then its minors can and should be computed statically. Of course, I could of take a piece of paper and precompute them myself. This would of reduce the whole computation to a couple of multiplications and divisions, but compiler does this for me, so why bother.

C++ Metaprogramming Trick That Actually Has an Application

Let's start from the classics. This is a static factorial function that would never compile:

template<int A> long f(){
    if(A <= 1){
        return 1;
        return A * f<A-1>();

On my setup I get a "template instantiation depth exceeds maximum of 900" error although I only instantiate it as f<4>. This happens because run-time and compile-time evaluation differ: the compiler always compiles all the code regardless on run-time branching. So it tries to compile f<4> which requires f<3> to be compiled, which requires f<2>, which in turn requires f<1> which doesn't actually need, but still requires f<0>, and then f<-1>, f<-2>, f<-3>...

This could of been solved with the static if, but it isn't yet available in the language. Well, there are some ways to emulate it, but my trick is much easier. I just do this:

template<int A> long f(){
    if(A <= 1){
        return 1;
        return A * f<A - (A>1)>();

This way the dead branch, the one that doesn't run in run-time, compiles wrongly, but effectively, to simple recursion. f<1> now requires f<1>.

Ok, that's pretty cool, but I promised you an application. Something that you can't do with a mere constexpr or a lookup table. Here, consider this:

	bool solve(const std::array<std::array<double, 2>, 2>& a,
		const std::array<double, 2>& b,
		std::array<double, 2>& x){

		double d = a[0][1] * a[1][0] - a[0][0] * a[1][1];
		if (std::abs(d) < SMALL_ENOUGH) return false;

		x[0] = a[0][1] * b[1] - a[1][1] * b[0] / d;
		x[1] = a[1][0] * b[0] - a[0][0] * b[1] / d;
		return true;

This is a simple linear equation solver for a system of 2. Of course there are tons of ways to solve a linear equation system, each with its own flaws, but when you have to deal with a lot of small systems, you probably want an exact solution such as this.

It gets worse as a number of equations goes up:

	bool solve(const std::array<std::array<double, 3>, 3>& a,
		const std::array<double, 3>& b,
		std::array<double, 3>& x){

		double d =
			  a[0][0] * (a[1][2] * a[2][1] - a[1][1] * a[2][2]) +
			+ a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
			+ a[0][2] * (a[1][1] * a[2][0] - a[1][0] * a[2][1]);
		if (std::abs(d) < SMALL_ENOUGH) return false;

		x[0] =
			 (a[0][1] * (a[2][2] * b[1] - a[1][2] * b[2])
			+ a[0][2] * (a[1][1] * b[2] - a[2][1] * b[1])
			+ (a[1][2] * a[2][1] - a[1][1] * a[2][2]) * b[0]) / d;

		x[1] =
			-(a[0][0] * (a[2][2] * b[1] - a[1][2] * b[2])
			+ a[0][2] * (a[1][0] * b[2] - a[2][0] * b[1])
			+ (a[1][2] * a[2][0] - a[1][0] * a[2][2]) * b[0]) / d;

		x[2] =
			 (a[0][0] * (a[2][1] * b[1] - a[1][1] * b[2])
			+ a[0][1] * (a[1][0] * b[2] - a[2][0] * b[1])
			+ (a[1][1] * a[2][0] - a[1][0] * a[2][1]) * b[0]) / d;
		return true;

Well, you can reduce the system to the system of 2 just as you get the first x, but new matrix allocation would eat up all the performance gain. I've came up with generalization that doesn't need reallocation and works theoretically with any number of equations, but it's all small recursive functions that doesn't get optimized well.

	template <int N>
	bool solve(const std::array<std::array<double, N>, N>& a,
		const std::array<double, N>& b,
		std::array<double, N>& x){

		// you have to explicilty set lambda type to use it recursively
		std::function<double(int, int, int)> fa = [&](int i, int j, int n) -> double {
			if(n == N) return a[i][j];
			return fa(i, j, n+1) * fa(n, n, n+1)
				- fa(i, n, n+1) * fa(n, j, n+1);

		std::function<double(int, int)> fb = [&](int i, int n) -> double{
			if(n == N) return b[i];
			return fa(n, n, n+1) * fb(i, n+1)
				- fa(i, n, n+1) * fb(n, n+1);

		std::function<double(int)> fd = [&](int i) -> double{
			double d = fb(i, i+1);
			for(int j = 0; j < i; j++){
				d -= fa(i, j, i+1) * x[j];
			return d;

		auto fx = [&](int i, double d) -> double{
			return d / fa(i, i, i+1);

		for(int i = 0; i < N; i++){
			double d = fd(i);
			if(std::abs(d) < SMALL_ENOUGH) return false;
			x[i] = fx(i, d);
		return true;

I made these functions as lambdas nested in solve to avoid namespace littering, but they wouldn't inline anyways, so it's ok. However, with the new trick I actually can make them compile inline! I just have to pass indexes as compile time variables, and use runtime variables to pass the context that nested functions would of share.

    template <int I, int J, int K, int N>
    inline static double aij(const std::array<std::array<double, N>, N>& a){
        if(K == N) return a[I][J];
        return aij<I, J, K+(K<N), N>(a) * aij<K, K, K+(K<N), N>(a) - aij<I, K, K+(K<N), N>(a) * aij<K, J, K+(K<N), N>(a);

    template <int I, int K, int N>
    inline static double bi(const std::array<std::array<double, N>, N>& a, const std::array<double, N>& b){
        //std::array<int, N> unused = {0}; // the only way to trace compile time evaluation sadly
        if(K == N) return b[I];
        return aij<K, K, K+(K<N), N>(a) * bi<I, K+(K<N), N>(a, b) - aij<I, K, K+(K<N), N>(a) * bi<K, K+(K<N), N>(a, b);

    template <int J, int I, int N>
    inline static void d_for(double& d, const std::array<std::array<double, N>, N>& a, std::array<double, N>& x){
        if(J < I){
            d -= aij<I, J, I+(J<I), N>(a) * x[J];
            d_for<J+(J<I), I, N>(d, a, x);

    template <int I, int N>
    inline static double di(const std::array<std::array<double, N>, N>& a, const std::array<double, N>& b, std::array<double, N>& x){
        double d = bi<I, I+1, N>(a, b);
        d_for<0, I, N>(d, a, x);
        return d;

    template <int I, int N>
    inline static bool x_for(const std::array<std::array<double, N>, N>& a, const std::array<double, N>& b, std::array<double, N>& x){
        if(I < N){
            double d = di<I, N>(a, b, x);
            if(std::abs(d) < SMALL_ENOUGH) return false;
            x[I] = d / aij<I, I, I+1, N>(a);
            return x_for<I+(I<N), N>(a, b, x);
        return true;

    template <int N>
    bool solve_s(const std::array<std::array<double, N>, N>& a, const std::array<double, N>& b, std::array<double, N>& x){
        return x_for<0, N>(a, b, x);

And it works surprisingly well. I mean there is a surprise. It obviously compiles to zero-overhead code, otherwise I wouldn't be bragging about it. It also evaluates as a huge constant expression, when input data is constant. But what is really awesome about it - it can do both!

For instance when linear equations are homogeneous, our b vector would be all zeros. Compiler makes a solver that exploits this. Or when we calculate polynomials for spline segments, we usually have a matrix defined beforehand and only b changing. Compiler exploits this even better. It just precalculates all the minors in compile time, so the solution it provides is algebraically sound reduction of the general case.

It does my algebra for me! How cool is that!

So, to sum things up, for linear equations solver metaprogramming gives us both zero overhead solution in general case and semi-constant custom solution for special cases. And it all started with the little trick.

GitHub – akalenuk

Oleksandr Kaleniuk


Ukrainian software engineer