I already have PR #1289, which needs more work before it can be considered for inclusion in the project. In that PR, cat uses splice() instead of read() on Linux, which drastically increases throughput. In the most recent kernel (5.7), io_uring finally gains splice support. I want to compare an io_uring + splice() implementation of cat with the plain splice()-only implementation.