What is your question?
I am developing a GEMM operator for the Blackwell GPU in which the A matrix uses MXFP8 and the B matrix uses MXFP4. However, I cannot find a relevant example for either CuTe C++ or CuTe DSL. In addition, CuTe DSL requires the A and B operands passed to the MMA to have the same data type, even though the Blackwell architecture supports mixed operand data types. Could anyone provide a relevant example?