I got the microcode issue sorted out and tuned the UEFI settings. Time to run the parallel JPEG decoding test!

void test_jpeg(const std::string& folder)
{
    // Decode tasks run in parallel; swap in the SerialQueue line to run them one at a time.
    ConcurrentQueue q("jpeg reader testloop");
    //SerialQueue q("jpeg reader testloop");

    Path path(folder);
    const size_t count = path.size();

    std::atomic<size_t> image_bytes { 0 };

    for (size_t i = 0; i < count; ++i)
    {
        const auto& node = path[i];
        if (!node.isDirectory())
        {
            std::string filename = node.name;
            q.enqueue([&path, filename, i, count, &image_bytes] {
                printf("filename: %s (%zu / %zu) begin.\n", filename.c_str(), i + 1, count);
                File file(path, filename);
                Bitmap bitmap(file, filename);
                // Widen before multiplying so a very large image cannot overflow int arithmetic.
                image_bytes += size_t(bitmap.width) * size_t(bitmap.height) * 4;
                printf("filename: %s (%zu / %zu) done.\n", filename.c_str(), i + 1, count);
            });
        }
    }

    // Block until every queued decode task has finished before reading the total.
    q.wait();
    printf("image: %zu MB\n", image_bytes.load() / (1024 * 1024));
}
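
The test only uses two operations on the queue: enqueue() to submit a task and wait() to block until all of them are done. ConcurrentQueue itself is not shown here, so the following is only a minimal sketch of that pattern built on std::thread. The names SketchQueue and parallel_sum, the fixed worker count, and all of the internals are assumptions made for illustration, not the actual implementation.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a concurrent work queue (not the real ConcurrentQueue).
class SketchQueue
{
public:
    explicit SketchQueue(std::size_t threads = std::thread::hardware_concurrency())
    {
        if (threads == 0) threads = 1; // hardware_concurrency() may return 0
        for (std::size_t i = 0; i < threads; ++i)
            m_workers.emplace_back([this] { run(); });
    }

    ~SketchQueue()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stop = true;
        }
        m_wake.notify_all();
        for (auto& t : m_workers) t.join(); // workers drain remaining tasks, then exit
    }

    void enqueue(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_tasks.push(std::move(task));
            ++m_pending;
        }
        m_wake.notify_one();
    }

    // Block until every task enqueued so far has finished running.
    void wait()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_idle.wait(lock, [this] { return m_pending == 0; });
    }

private:
    void run()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_wake.wait(lock, [this] { return m_stop || !m_tasks.empty(); });
                if (m_tasks.empty()) return; // only reached when stopping
                task = std::move(m_tasks.front());
                m_tasks.pop();
            }
            task(); // run outside the lock so other workers can proceed
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                if (--m_pending == 0) m_idle.notify_all();
            }
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_wake; // signals workers: new task or shutdown
    std::condition_variable m_idle; // signals wait(): no tasks pending
    std::queue<std::function<void()>> m_tasks;
    std::vector<std::thread> m_workers;
    std::size_t m_pending = 0;      // tasks queued or currently running
    bool m_stop = false;
};

// Usage sketch: sum 1..n on four workers, using an atomic counter like the test does.
std::size_t parallel_sum(std::size_t n)
{
    std::atomic<std::size_t> total { 0 };
    SketchQueue q(4);
    for (std::size_t i = 1; i <= n; ++i)
        q.enqueue([&total, i] { total += i; });
    q.wait();
    return total;
}
```

As in test_jpeg, the lambda captures the loop index by value and the shared counter by reference, and the counter is only read after wait() returns.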

10,616 JPEG files (5.7 GB) loaded and decoded in 12.6 seconds! The 5.7 GB of compressed data expands to 88 GB of decoded pixels (a compression ratio of roughly 15:1).

real	0m12.619s
user	3m29.716s
sys	0m23.980s

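Holy SH*T, this machine is FAST. Even better news: my work queue code scales to 20 concurrent threads without slowing down. User + sys CPU time comes to about 18.5x the wall-clock time, so nearly all of the threads stayed busy for the whole run. :)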
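
The figures in this post can be re-derived from the file count, the sizes, and the `time` output. Below is a small sketch of that arithmetic. The struct and function names are made up for illustration, and the only inputs are the numbers quoted above.

```cpp
#include <cstdio>

// Figures copied from the post: file count, sizes in GB, `time` output in seconds.
struct RunStats
{
    double files      = 10616;
    double input_gb   = 5.7;
    double decoded_gb = 88.0;
    double real_s     = 12.619;
    double user_s     = 3 * 60 + 29.716; // 3m29.716s
    double sys_s      = 23.980;
};

// Decoded size relative to compressed size.
double compression_ratio(const RunStats& s) { return s.decoded_gb / s.input_gb; }

// Files decoded per second of wall-clock time.
double files_per_second(const RunStats& s) { return s.files / s.real_s; }

// Decoded output produced per second of wall-clock time.
double decoded_gb_per_second(const RunStats& s) { return s.decoded_gb / s.real_s; }

// CPU time divided by wall-clock time: average number of busy hardware threads.
double average_busy_threads(const RunStats& s) { return (s.user_s + s.sys_s) / s.real_s; }

void print_report(const RunStats& s)
{
    std::printf("compression ratio: %.1f:1\n", compression_ratio(s));
    std::printf("files per second:  %.0f\n", files_per_second(s));
    std::printf("decoded GB/s:      %.1f\n", decoded_gb_per_second(s));
    std::printf("busy threads:      %.1f\n", average_busy_threads(s));
}
```

Calling print_report(RunStats{}) gives a ratio of about 15.4:1, about 841 files per second, about 7 GB/s of decoded output, and about 18.5 busy threads on average out of the 20 available.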