Currently only the sparse-by-sparse products in the smmp module are parallel. Converting the existing sparse-by-dense products to use ndarray::parallel should be straightforward. Here is an implementation of par_csr_mulacc_dense_colmaj that gives a significant speedup on my machine:
    pub fn par_csr_mulacc_dense_colmaj<'a, N, A, B, I, Iptr>(
        lhs: CsMatViewI<A, I, Iptr>,
        rhs: ArrayView<B, Ix2>,
        mut out: ArrayViewMut<'a, N, Ix2>,
    ) where
        A: Send + Sync,
        B: Send + Sync,
        N: 'a + crate::MulAcc<A, B> + Send + Sync,
        I: 'a + SpIndex,
        Iptr: 'a + SpIndex,
    {
        assert_eq!(lhs.cols(), rhs.shape()[0], "Dimension mismatch");
        assert_eq!(lhs.rows(), out.shape()[0], "Dimension mismatch");
        assert_eq!(rhs.shape()[1], out.shape()[1], "Dimension mismatch");
        assert!(lhs.is_csr(), "Storage mismatch");
        let axis1 = Axis(1);
        // Pair each output column with the matching rhs column and
        // process the pairs in parallel (needs ndarray's rayon feature).
        ndarray::Zip::from(out.axis_iter_mut(axis1))
            .and(rhs.axis_iter(axis1))
            .par_for_each(|mut ocol, rcol| {
                // Each task owns one output column, so no two tasks write
                // to the same element.
                for (orow, lrow) in lhs.outer_iterator().enumerate() {
                    let oval = &mut ocol[[orow]];
                    for (rrow, lval) in lrow.iter() {
                        let rval = &rcol[[rrow]];
                        oval.mul_acc(lval, rval);
                    }
                }
            });
    }
The only changes here are the parallel iterator, enabling the rayon feature for ndarray, and adding Send and Sync bounds to the data types inside the matrices. My concern is that adding Send + Sync will force these trait bounds onto many places that do not need them.
Looking at the impl Mul for CsMatBase and CsMatBase, I see that Sync + Send is required whether or not multi_thread is enabled. Is it okay to propagate these trait bounds all the way up to the many trait impls for CsMatBase, and then use conditional compilation only in the lowest-level functions found in the prod module? Conditionally compiling all of the higher-level implementations sounds like it would get nasty very quickly, especially as more parallel features get added.
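To make the question concrete, the layout being asked about could look roughly like the sketch below. It is std-only and every name in it is hypothetical: `Scalar`, `Csr`, `mulacc_col` and `kernel` are not sprs items, slices stand in for the ndarray views, and `std::thread::scope` stands in for the rayon-backed parallel iterator. The point is only that the public function keeps a single signature with unconditional Send + Sync bounds, while the `multi_thread` feature gate is confined to the lowest-level kernel.

```rust
use std::ops::{AddAssign, Mul};

// Hypothetical stand-in for a MulAcc-style bound. Send + Sync are always
// required, whether or not the `multi_thread` feature is enabled.
pub trait Scalar: Copy + Send + Sync + AddAssign + Mul<Output = Self> {}
impl<T: Copy + Send + Sync + AddAssign + Mul<Output = T>> Scalar for T {}

/// Minimal CSR matrix (hypothetical, not the sprs type).
pub struct Csr<T> {
    pub rows: usize,
    pub cols: usize,
    pub indptr: Vec<usize>,
    pub indices: Vec<usize>,
    pub data: Vec<T>,
}

/// Accumulate one output column: ocol += lhs * rcol.
fn mulacc_col<T: Scalar>(lhs: &Csr<T>, rcol: &[T], ocol: &mut [T]) {
    for row in 0..lhs.rows {
        for k in lhs.indptr[row]..lhs.indptr[row + 1] {
            ocol[row] += lhs.data[k] * rcol[lhs.indices[k]];
        }
    }
}

/// Public entry point: one signature, identical bounds in both builds.
/// `rhs` and `out` are column-major dense buffers with `ncols` columns.
pub fn csr_mulacc_dense_colmaj<T: Scalar>(
    lhs: &Csr<T>,
    rhs: &[T],
    out: &mut [T],
    ncols: usize,
) {
    assert_eq!(rhs.len(), lhs.cols * ncols, "Dimension mismatch");
    assert_eq!(out.len(), lhs.rows * ncols, "Dimension mismatch");
    if lhs.rows == 0 || lhs.cols == 0 || ncols == 0 {
        return; // nothing to accumulate
    }
    kernel(lhs, rhs, out);
}

// Parallel kernel: one scoped thread per output column (a stand-in for
// a rayon parallel iterator, used here to keep the sketch std-only).
#[cfg(feature = "multi_thread")]
fn kernel<T: Scalar>(lhs: &Csr<T>, rhs: &[T], out: &mut [T]) {
    std::thread::scope(|s| {
        for (ocol, rcol) in out.chunks_mut(lhs.rows).zip(rhs.chunks(lhs.cols)) {
            s.spawn(move || mulacc_col(lhs, rcol, ocol));
        }
    });
}

// Serial kernel: same loop, no threads.
#[cfg(not(feature = "multi_thread"))]
fn kernel<T: Scalar>(lhs: &Csr<T>, rhs: &[T], out: &mut [T]) {
    for (ocol, rcol) in out.chunks_mut(lhs.rows).zip(rhs.chunks(lhs.cols)) {
        mulacc_col(lhs, rcol, ocol);
    }
}
```

With this shape, callers and higher-level trait impls never mention the feature flag. The cost is the one raised above: element types that are not Send + Sync are rejected even in a single-threaded build.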