While I was working on #78, the test suite caused me a lot of trouble. There were two main problems:
- Running the tests takes a really long time. I ended up writing some scripts to run the tests on my institution's cluster, but even then ≈50 of the tests (individual test functions, not test files) run up against a 4h time limit. The whole suite takes >34h to run, not counting any tests that were killed for exceeding the time limit. There are also a handful of tests that require >32 GB of RAM!
- There are ≈30 tests that are either broken or flaky. This makes it a bit scary to develop new features, because you're not sure which test failures you're actually responsible for. And of course, even if you realize that you're not responsible for a particular failure, you have no way of knowing if you broke something relating to these tests. I do think this relates back to the first problem, because if it were practical to run the whole test suite for each commit, there would surely be fewer broken tests.
Migrating to pytest wouldn't solve the underlying issue, but it has a few features that I think would help a lot. The biggest is the ability to "mark" tests. There are two marks I'd want to use extensively. The first is `xfail`, which indicates that a test is currently broken and expected to fail. The failures are still reported, so you know they're happening, but they don't affect whether the suite as a whole passes. This means you could require new contributions not to break any currently working tests, without having to fix all the broken/flaky tests up front. The second mark is `slow`, which would indicate that a test takes more than a few seconds to complete. With this information, GitHub could be configured to automatically run all the "not slow" tests for each PR, which would help protect against breakages. The slow tests could either be gradually made fast enough to include in the automated suite, or just be run manually when there's some reason to.
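To make that concrete, here's a minimal sketch of how the two marks might look. The `xfail` reason is taken from the report below; the `slow` mark is a custom name I'm suggesting, and the second test is purely hypothetical:

```python
import pytest

# A test from the report that's known to be broken right now: marking it
# xfail means the failure is still reported, but it no longer fails the suite.
@pytest.mark.xfail(reason="Induction from general representations is not supported yet")
def test_induced_norm_pooling():
    ...

# `slow` would be a custom mark, so it needs to be registered once,
# e.g. in pytest.ini:
#
#   [pytest]
#   markers =
#       slow: takes more than a few seconds to run
@pytest.mark.slow
def test_something_expensive():  # hypothetical example
    ...
```

The per-PR check would then invoke `pytest -m "not slow"`, while a full run (slow tests and all) could still be launched manually with plain `pytest`.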
Pytest also supports running tests in parallel across multiple processes, via the `pytest-xdist` plugin. This alone wouldn't be enough to make the tests run in a reasonable amount of time, but it would help. Finally, pytest makes it easier to share code between tests (because it doesn't require every test to belong to a class) and has features like parametrization that can cut down on boilerplate. These last two things wouldn't make the tests run any faster, but they would make them easier to write.
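As a sketch of the parametrization point (the function name and parameter values here are hypothetical, loosely modeled on the upsampling tests in the report below):

```python
import pytest

# One parametrized function replaces a grid of copy-pasted test methods.
# pytest collects each (group, mode) combination as a separate test, so
# pytest-xdist can spread them across processes with `pytest -n auto`.
@pytest.mark.parametrize("mode", ["nearest", "bilinear"])
@pytest.mark.parametrize(
    "group",
    ["cyclic_even", "cyclic_odd", "dihedral_even", "dihedral_odd", "so2", "o2"],
)
def test_r2upsampling_equivariance(group, mode):
    ...  # build the upsampling module for `group` and check equivariance
```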
Is this something you'd be in favor of? I'm not asking you to do anything yourself; I'm asking for permission to do something like this the next time I want to make a non-trivial change to the code. (With no guarantees that I'll ever actually get around to it.)
In case it's of any interest, here's the test report I generated for the most recent commit (`94380ef`):
```
RUNTIME: 1 day, 10:31:15
TOTAL:   864
PASS:    771
FAIL:    39
KILLED:  54

FAILURES:

AssertionError: ((24, 24, 10, 10, 10), 9)
  nn.test_convolution.TestConvolution.test_padding_modes_r3conv

AssertionError: False is not true...
  group.test_combine_subgroups.TestSubgroups.test_o3
  group.test_subgroups.TestSubgroups.test_restrict_so3_o2

AssertionError: Induction from general representations is not supported yet
  nn.test_pooling.TestPooling.test_induced_norm_pooling

AssertionError: The error found during equivariance check with element...
  nn.test_r3upsampling.TestUpsampling.test_cyclic_even_nearest
  nn.test_r2upsampling.TestUpsampling.test_cyclic_even_nearest
  nn.test_r3upsampling.TestUpsampling.test_o3_nearest
  nn.test_r2upsampling.TestUpsampling.test_cyclic_odd_nearest
  nn.test_r3upsampling.TestUpsampling.test_dihedral_even_nearest
  nn.test_r3upsampling.TestUpsampling.test_ico_nearest
  nn.test_r3upsampling.TestUpsampling.test_cone_nearest
  nn.test_r2upsampling.TestUpsampling.test_dihedral_odd_bilinear
  nn.test_r3upsampling.TestUpsampling.test_dihedral_nearest
  nn.test_tensorproduct.TestTensor.test_with_spatial_dimns
  nn.test_r3upsampling.TestUpsampling.test_so2_nearest
  nn.test_r3upsampling.TestUpsampling.test_so3_trilinear
  nn.test_r2upsampling.TestUpsampling.test_cyclic_odd_bilinear
  nn.test_r3upsampling.TestUpsampling.test_dihedral_odd_nearest
  nn.test_r3upsampling.TestUpsampling.test_inv_nearest
  nn.test_r3upsampling.TestUpsampling.test_so3_nearest
  nn.test_r3upsampling.TestUpsampling.test_cube_nearest
  nn.test_r2upsampling.TestUpsampling.test_so2_nearest
  nn.test_r2upsampling.TestUpsampling.test_dihedral_even_nearest
  nn.test_r2upsampling.TestUpsampling.test_dihedral_odd_nearest
  nn.test_r2upsampling.TestUpsampling.test_o2_nearest
  nn.test_convolution.TestConvolution.test_ico_sparse

NotImplementedError
  group.test_subgroups.TestSubgroups.test_restrict_ico_dn
  group.test_subgroups.TestSubgroups.test_restrict_ico_tetra
  group.test_combine_subgroups.TestSubgroups.test_ico
  group.test_subgroups.TestSubgroups.test_restrict_octa_dn
  group.test_subgroups.TestSubgroups.test_restrict_octa_tetra

RuntimeError: Found no NVIDIA driver on your system...
  nn.test_amp.TestMixedPrecision.test_r2conv_uniform
  nn.test_amp.TestMixedPrecision.test_r3pointconv_non_uniform
  nn.test_amp.TestMixedPrecision.test_r2conv_non_uniform
  nn.test_amp.TestMixedPrecision.test_gnorm
  nn.test_amp.TestMixedPrecision.test_r3conv_non_uniform
  nn.test_amp.TestMixedPrecision.test_fourier
  nn.test_amp.TestMixedPrecision.test_fieldnorm

RuntimeError: could not construct a memory descriptor using a format tag
  nn.test_export.TestExport.test_R2ConvTransposed
```
The upsampling equivariance tests are the ones that are flaky. The mixed precision tests probably aren't real failures; I just wasn't running on nodes with GPUs.
To end on a more positive note, I should say that I really appreciate how extensive the test suite is. There are so many academic projects out there with not enough tests, but this is the only one I know of with too many tests! It's a good problem to have, I think.