I recently added AES-NI acceleration for ECB and CBC modes. Something strange is going on with the performance: I expected throughput in the 400-500 MB/s range but measured 7.2 GB/s, which made me very skeptical that the code actually works.

After exhaustive testing: it does work. The encrypted data is correct and decrypts just fine with other implementations. After some fiddling, it turns out that when I compile for SSE2 through SSE4 with "-maes" enabled, throughput is roughly 600 MB/s, but when I compile for AVX it jumps to 7240 MB/s.

Is this some new Intel optimization in Skylake-X that kicks in? I have no clue what I just did. OpenSSL, for example, gives 440 MB/s with AES-NI, which is in line with my expectations.

After looking at the Intel documentation, it looks like the AES instructions have a throughput of one per clock cycle and a latency of around 10 clock cycles. AES decryption has one trait that makes out-of-order execution infeasible in the traditional sense: each iteration depends strictly on the previous one. My code is 100% unrolled for each block, which means the 10 rounds for a 128-bit key take only 10 clock cycles plus the latency of the last iteration, which is amortised almost completely by loading the blocks and the loop overhead.
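To make "100% unrolled for each block" concrete, here is a minimal sketch of a fully unrolled AES-128 block encryption using the AES-NI intrinsics. It is not the code from my library: the names `AES_TARGET`, `EXPAND`, `expand_step`, `aes128_expand_key` and `aes128_encrypt_block` are made up for illustration, it shows the encryption direction only, and it assumes GCC or Clang on x86 (the `target` attribute stands in for passing "-maes" on the command line). Note how every round takes the previous round's result as its input.

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2: loads, stores, byte shifts, shuffles */
#include <wmmintrin.h>   /* AES-NI: aesenc, aesenclast, aeskeygenassist */

/* Assumption: GCC/Clang. Enables AES-NI for these functions only. */
#define AES_TARGET __attribute__((target("aes,sse2")))

/* One step of the AES-128 key schedule. `assist` is the output of
 * AESKEYGENASSIST for the previous round key. */
AES_TARGET static __m128i expand_step(__m128i key, __m128i assist)
{
    assist = _mm_shuffle_epi32(assist, 0xFF);          /* broadcast word 3 */
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
    return _mm_xor_si128(key, assist);
}

/* The round constant must be a compile-time immediate, hence the macro. */
#define EXPAND(k, rcon) expand_step((k), _mm_aeskeygenassist_si128((k), (rcon)))

/* Derive the 11 round keys of AES-128 from a 16-byte key. */
AES_TARGET static void aes128_expand_key(const uint8_t key[16], __m128i rk[11])
{
    rk[0]  = _mm_loadu_si128((const __m128i *)key);
    rk[1]  = EXPAND(rk[0], 0x01);
    rk[2]  = EXPAND(rk[1], 0x02);
    rk[3]  = EXPAND(rk[2], 0x04);
    rk[4]  = EXPAND(rk[3], 0x08);
    rk[5]  = EXPAND(rk[4], 0x10);
    rk[6]  = EXPAND(rk[5], 0x20);
    rk[7]  = EXPAND(rk[6], 0x40);
    rk[8]  = EXPAND(rk[7], 0x80);
    rk[9]  = EXPAND(rk[8], 0x1B);
    rk[10] = EXPAND(rk[9], 0x36);
}

/* Encrypt one 16-byte block with all rounds written out (no loop).
 * Each instruction consumes the state produced by the one before it. */
AES_TARGET static void aes128_encrypt_block(const __m128i rk[11],
                                            const uint8_t in[16],
                                            uint8_t out[16])
{
    __m128i s = _mm_loadu_si128((const __m128i *)in);
    s = _mm_xor_si128(s, rk[0]);             /* initial key whitening */
    s = _mm_aesenc_si128(s, rk[1]);
    s = _mm_aesenc_si128(s, rk[2]);
    s = _mm_aesenc_si128(s, rk[3]);
    s = _mm_aesenc_si128(s, rk[4]);
    s = _mm_aesenc_si128(s, rk[5]);
    s = _mm_aesenc_si128(s, rk[6]);
    s = _mm_aesenc_si128(s, rk[7]);
    s = _mm_aesenc_si128(s, rk[8]);
    s = _mm_aesenc_si128(s, rk[9]);
    s = _mm_aesenclast_si128(s, rk[10]);     /* final round, no MixColumns */
    _mm_storeu_si128((__m128i *)out, s);
}
```

A quick way to check the sketch is the FIPS-197 Appendix C.1 vector: key 00 01 02 ... 0f and plaintext 00 11 22 ... ff should encrypt to 69 c4 e0 d8 6a 7b 04 30 d8 cd b7 80 70 b4 c5 5a.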

Unrolling the AES loop is critical for squeezing performance out of it. Maybe unrolling more would reduce the effect of the overhead even further? To be continued...