Benchmarks: Micro benchmark - collect per-snapshot per-GPU flops/temp in gpu burn#735
Conversation
Codecov Report ❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #735      +/-   ##
==========================================
- Coverage   86.47%   85.88%   -0.59%
==========================================
  Files         102      102
  Lines        7541     8204     +663
==========================================
+ Hits         6521     7046     +525
- Misses       1020     1158     +138
per_gpu_flops[i] = []
if i not in per_gpu_temps:
    per_gpu_temps[i] = []
if i < len(gflops) and gflops[i] > 0:
Why do we need to check `i < len(gflops)`? I don't think that's what you intend here.
The goal here is to parse one line of gpu-burn output, which looks like:

50.0% proc'd: 2261 (7150 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) - 0 (0 Gflop/s) errors: 0 - 0 - 0 - 0 - 0 - 0 - 0 - 0 temps: 59 C - 56 C - 55 C - 57 C - 56 C - 37 C - 38 C - 39 C

The check `i < len(gflops)` covers the case where a line is missing the value for some GPU, so that `len(gflops) < num_gpus`. Or do you think that when `num_gpus > len(gflops)` I should skip the line and set an error return code?
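For reference, a minimal sketch of how such a line could be parsed; the helper name and regular expressions below are illustrative assumptions, not the PR's actual parser:

```python
import re


def parse_gpu_burn_line(line):
    """Parse one gpu-burn progress line into per-GPU Gflop/s and temperature lists.

    Hypothetical helper for illustration; the benchmark's real parser may differ.
    """
    # Each GPU contributes a '(7150 Gflop/s)'-style token.
    gflops = [float(v) for v in re.findall(r'\(([\d.]+) Gflop/s\)', line)]
    # Temperatures follow 'temps:' as '59 C - 56 C - ...'.
    temps = []
    parts = line.split('temps:')
    if len(parts) > 1:
        temps = [float(v) for v in re.findall(r'([\d.]+) C', parts[1])]
    return gflops, temps


line = ("50.0% proc'd: 2261 (7150 Gflop/s) - 0 (0 Gflop/s) "
        "errors: 0 - 0 temps: 59 C - 56 C")
gflops, temps = parse_gpu_burn_line(line)
# A malformed line can yield fewer entries than num_gpus, which is why the
# accumulation loop guards with `i < len(gflops)` before indexing.
```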
    per_gpu_flops[i].append(gflops[i])
else:
    self._result.add_result(f'gpu_{snap_idx}_gflops:{i}', 0.0)
if i < len(temps):
It's the same question as the previous one.
@@ -57,4 +57,7 @@ def test_gpu_burn(self, results):
    assert (benchmark.result['time'][0] == time)
    for device in range(8):
Should we add a correctness check here?
Updated the correctness check according to the current static test data file.
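For illustration, one possible shape for that check, assuming the per-GPU summary metric names from the PR description; the concrete expected values would have to come from the static test data file:

```python
# Sketch only: exact expected values depend on the static test data file.
for device in range(8):
    avg_key = f'gpu_avg_gflops:{device}'
    temp_key = f'gpu_max_temp:{device}'
    assert avg_key in benchmark.result
    assert temp_key in benchmark.result
    # Sanity bounds rather than exact values.
    assert benchmark.result[avg_key][0] >= 0
    assert 0 <= benchmark.result[temp_key][0] < 150
```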
Description
gpu burn: collect per-snapshot per-GPU flops/temp and add summary metrics
Major Revision
- Per-GPU average flops: `gpu_avg_gflops:<gpu_index>`
- Per-GPU flops variability metric: `gpu_var_gflops:<gpu_index>` (a simple max-min based metric)
- Per-GPU maximum temperature: `gpu_max_temp:<gpu_index>`
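A minimal sketch of how these summaries could be computed from the per-snapshot samples; the function and dictionary names are illustrative, not the PR's actual implementation, which adds results via `self._result.add_result`:

```python
def summarize_per_gpu(per_gpu_flops, per_gpu_temps):
    """Compute per-GPU summary metrics from per-snapshot samples.

    per_gpu_flops / per_gpu_temps map gpu_index -> list of samples.
    """
    summary = {}
    for i, samples in per_gpu_flops.items():
        if samples:
            summary[f'gpu_avg_gflops:{i}'] = sum(samples) / len(samples)
            # Simple max-min based variability metric, as described above.
            summary[f'gpu_var_gflops:{i}'] = max(samples) - min(samples)
    for i, temps in per_gpu_temps.items():
        if temps:
            summary[f'gpu_max_temp:{i}'] = max(temps)
    return summary


# Example: two snapshots for GPUs 0 and 1.
print(summarize_per_gpu({0: [7150.0, 7100.0], 1: [6900.0, 7000.0]},
                        {0: [59.0, 61.0], 1: [56.0, 58.0]}))
```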