Skip to content

refactor(trace): compute the CPU Tensor trace in the container layer#862

Draft
IvanaGyro wants to merge 3 commits into
claude/stride-viewfrom
claude/trace-cpu
Draft

refactor(trace): compute the CPU Tensor trace in the container layer#862
IvanaGyro wants to merge 3 commits into
claude/stride-viewfrom
claude/trace-cpu

Conversation

@IvanaGyro
Copy link
Copy Markdown
Collaborator

Part of #834.

Stacked on #849 → stride_view → trace-sig-refactor. This PR's diff is only the CPU implementation. Review/merge the prerequisites first; this PR retargets up the chain as each one merges.

Problem

The low-level Tensor trace built an identity UniTensor and called Contract, so the container-level linalg_internal layer depended on the higher-level UniTensor/Contract APIs — an inverted layering. Compute the trace directly over storage instead, and tighten the internal Trace API at the same time.

Change

  • Simplified dispatch API: each Trace_internal_* (CPU) and cuTrace_internal_* (GPU) now takes only (Tn, ax1, ax2) and returns the result Tensor. The per-dtype dispatchers each call a single templated TraceImpl<T> in an unnamed namespace that derives Ndiag / Nelem / accu / remain_rank_id / reduced shape / the 2D-vs-Nd flag inline from Tn.shape() and Tn.strides() — no separate TraceParams struct or dispatcher header. Trace.cpp is now just validation + a dtype-keyed dispatch call.
  • Public Tensor::strides() getter (snake_case, mirrors shape() / NumPy's .strides): for each logical axis, the storage stride derived from _shape and _invmapper. The CPU trace calls it; the upcoming GPU PR will share the same getter.
  • Stride-aware diagonal sum: PairwiseSum over the stride_view adaptor (from the prerequisite PR) with diagonal stride = strides[ax1] + strides[ax2]. The previous contiguous() copy is dropped; floating-point traces get the same O(log N · ε) error bound as linalg::Sum. Ndiag == 0 / Nelem == 0 guards prevent unsigned underflow in (Ndiag - 1) * diag_stride + 1.
  • GPU dispatcher signatures in cuTrace_internal.hpp are kept consistent with the CPU side (the actual GPU implementation lands in the follow-up PR).

Tests

  • tests/Tensor_strides_test.cpp — walks every multi-index across rank-2..5 permutations and dtypes, asserting flatten(idx · strides()) == offset at() actually reads (so if strides() ever drifts from the layout at() indexes through, every case fails).
  • tests/linalg_test/Trace_test.cpp — strided in-place Trace vs linalg::Trace(t.contiguous(), ...) reference across ranks 2–5, complex+integer dtypes, the rank-2 path (_trace_2d / cuTrace_2d_kernel), output rank invariants (rank-N → rank-(N-2)), and swapped axis order.

Benchmark

benchmarks/linalg/Trace_bm.cpp pits the in-place strided Trace against the "collect contiguously and reduce" baselines:

  • 3D: matvec — tr(A)[m] = ⟨vec(I_n), vec(A[:, m, :])⟩ as a {middle, n·n} @ {n·n, 1} BLAS GEMM against vec(I_n).
  • 2D: vecdot — tr(A) = ⟨vec(I_n), vec(A)⟩ as a BLAS dot product on the n·n-element flattenings.
  • 2D: reshape trick — save A[n-1, n-1], view the rest as {n-1, n+1}, permute + contiguous so the first column becomes a contiguous row, reduce it, add the saved corner. The {n-1, n+1} view reuses the source storage in place (Storage::resize is a no-realloc metadata shrink), so the permute → contiguous gather is the only copy; the size is restored (and the dropped corner written back) afterwards. CPU reduces the contiguous row over the raw buffer (no linalg::Sum/Accessor allocations); GPU wraps the row as a {n-1} tensor and uses linalg::Sum since device memory can't be read host-side.

Every variant runs once before timing and is checked against linalg::Trace; a wrong baseline std::aborts instead of producing fast-but-meaningless numbers.

Numbers (openblas-cpu, single run, µs; smaller is better). Strided wins at every tested size, including the largest — trace touches O(n) diagonal elements while the GEMM/dot baselines touch all O(n²), so BLAS throughput never closes the gap:

3D — {n, middle, n} traced over (0, 2):

n / middle Strided Double Matvec Double (GEMM) Strided ComplexDouble Matvec ComplexDouble
64 / 64 4.78 833 4.88 1045
256 / 64 96.7 28278 152 44281
1024 / 16 233 134696 239 227744
2048 / 16 660 520290 780 914131
4096 / 8 763 1065279 761 1840246

2D — {n, n} traced over (0, 1):

n Strided Double Vecdot Double (dot) Reshape Double Strided ComplexDouble Vecdot ComplexDouble Reshape ComplexDouble
64 0.410 2.17 12.0 0.392 3.00 13.9
256 0.512 7.17 176 0.513 16.0 282
1024 1.17 119 5529 1.21 204 7324
4096 36.9 1756 242078 42.8 5703 352744
8192 92.8 10966 1041902 110 29162 1494712

Test plan

  • openblas-cpu: full suite 963/963 passed.
  • mkl-cpu: full suite 963/963 passed.
  • Each benchmark variant verified against linalg::Trace before timing (the benchmark abort()s if any baseline disagrees) across all sizes above.
  • pre-commit (clang-format-14 + EOF + whitespace) clean.

Draft — opened for review.


Generated by Claude Code

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the linalg::Trace implementation to perform direct, stride-aware diagonal sums instead of falling back to expensive tensor contractions. It introduces a Tensor::strides() method to retrieve the storage strides of a tensor, adds comprehensive unit tests for both strides and trace operations, and includes benchmarks to evaluate performance. A review comment suggests optimizing the multi-index and base offset calculation in the CPU trace hot loop by replacing division and modulo operations with an odometer-like incremental index update.

Comment on lines +54 to +66
std::vector<cytnx_uint64> accu(out_shape.size(), 1);
for (int i = static_cast<int>(out_shape.size()) - 1; i > 0; --i)
accu[i - 1] = accu[i] * static_cast<cytnx_uint64>(out_shape[i]);

T *out_data = out.storage().data<T>();
for (cytnx_uint64 i = 0; i < Nelem; ++i) {
cytnx_uint64 tmp = i, base = 0;
for (cytnx_uint64 x = 0; x < out_shape.size(); ++x) {
base += (tmp / accu[x]) * strides[remain_rank_id[x]];
tmp %= accu[x];
}
out_data[i] = PairwiseSum(std::span<const T>(data + base, extent) | stride(diag_stride));
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation uses division (/) and modulo (%) operations inside the hot loop to compute the multi-index and base offset for each element. Since division and modulo are computationally expensive, we can optimize this by using an odometer-like incremental index update. This completely avoids division and modulo operations in the loop, and also eliminates the need to precompute the accu vector.

Suggested change
std::vector<cytnx_uint64> accu(out_shape.size(), 1);
for (int i = static_cast<int>(out_shape.size()) - 1; i > 0; --i)
accu[i - 1] = accu[i] * static_cast<cytnx_uint64>(out_shape[i]);
T *out_data = out.storage().data<T>();
for (cytnx_uint64 i = 0; i < Nelem; ++i) {
cytnx_uint64 tmp = i, base = 0;
for (cytnx_uint64 x = 0; x < out_shape.size(); ++x) {
base += (tmp / accu[x]) * strides[remain_rank_id[x]];
tmp %= accu[x];
}
out_data[i] = PairwiseSum(std::span<const T>(data + base, extent) | stride(diag_stride));
}
std::vector<cytnx_uint64> index(out_shape.size(), 0);
cytnx_uint64 base = 0;
T *out_data = out.storage().data<T>();
for (cytnx_uint64 i = 0; i < Nelem; ++i) {
out_data[i] = PairwiseSum(std::span<const T>(data + base, extent) | stride(diag_stride));
for (int x = static_cast<int>(out_shape.size()) - 1; x >= 0; --x) {
if (++index[x] < static_cast<cytnx_uint64>(out_shape[x])) {
base += strides[remain_rank_id[x]];
break;
}
index[x] = 0;
base -= (out_shape[x] - 1) * strides[remain_rank_id[x]];
}
}

->Args({256, 64}) \
->Args({1024, 16}) \
->Args({2048, 16}) \
->Args({4096, 8}) \
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be a case that middle is large and n is normal. In this case Matmul may win.

Comment on lines +19 to +20
const cytnx_uint64 ax1 = std::min(a1, a2);
const cytnx_uint64 ax2 = std::max(a1, a2);
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines are inherited from the old code. They are not needed now.

Tensor out = Tensor({is_2d ? cytnx_uint64{1} : Nelem}, Tn.dtype(), Tn.device());
out.storage().set_zeros();
if (Ndiag == 0 || Nelem == 0) {
if (!is_2d) out.reshape_(out_shape);
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should throw error here instead.

if (is_2d) {
out.storage().at<T>(0) =
PairwiseSum(std::span<const T>(data, extent) | stride(diag_stride));
return out;
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can compose the output Tensor here by Tensor::from_stroage

Comment on lines +54 to +65
std::vector<cytnx_uint64> accu(out_shape.size(), 1);
for (int i = static_cast<int>(out_shape.size()) - 1; i > 0; --i)
accu[i - 1] = accu[i] * static_cast<cytnx_uint64>(out_shape[i]);

T *out_data = out.storage().data<T>();
for (cytnx_uint64 i = 0; i < Nelem; ++i) {
cytnx_uint64 tmp = i, base = 0;
for (cytnx_uint64 x = 0; x < out_shape.size(); ++x) {
base += (tmp / accu[x]) * strides[remain_rank_id[x]];
tmp %= accu[x];
}
out_data[i] = PairwiseSum(std::span<const T>(data + base, extent) | stride(diag_stride));
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not interfere by the old implementation you should write the most efficient method. Consider this

offset += stride[last_dim];
++index[last_dim];

if (index[last_dim] == shape[last_dim]) {
    index[last_dim] = 0;
    offset -= shape[last_dim] * stride[last_dim];
    offset += stride[previous_dim];
    ++index[previous_dim];
}

In this way, we don't have to calculate the new strides (accu). And it should be must faster. As the 2d case, we can construct the tensor from storage.

Copy link
Copy Markdown
Collaborator Author

@IvanaGyro IvanaGyro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Review written by Claude on behalf of @IvanaGyro.)

Nice rewrite — moving the trace into a single stride-aware TraceImpl<T> over the container layer removes the UniTensor::eye + Contract round-trip and the old 2D path's contiguity assumption (rawdata[i*Ldim + i]). I worked through the offset math and it is sound: for any output multi-index, base + (Ndiag-1)*diag_stride lands on a valid element, so the std::span(data+base, extent) end pointer is at most one-past storage — no OOB. Tensor::strides() is the right primitive and Tensor_strides_test validates it against at() across permutations/dtypes. CPU coverage (permuted ranks 2–5, Double/ComplexDouble/Int32, 2D path, output-rank, swapped axes) is solid and all of it passed on both openblas-cuda and mkl-cuda. Two things inline: one stacked-build concern and one test gap. Also a minor note: the Reshape_2D benchmark mutates the shared storage of a const input via resize() and relies on shrink-not-reallocating to restore it — it's verified once against Trace, but it's a fragile trick to leave in a committed benchmark; a comment-backed assertion that capacity was preserved would harden it.

const std::vector<cytnx_uint64> &,
const std::vector<cytnx_uint64> &,
const std::vector<cytnx_int64> &, cytnx_uint64, cytnx_uint64);
typedef Tensor (*Tracefunc_oii)(const Tensor &, cytnx_uint64, cytnx_uint64);
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Review written by Claude on behalf of @IvanaGyro.)

This PR flips Tracefunc_oii to the new Tensor(const Tensor&, cytnx_uint64, cytnx_uint64) shape and updates cuTrace_internal.hpp to match, but the GPU definition in cuTrace_internal.cu is still the old void cuTrace_internal_*(bool, Tensor&, ...) (that rewrite lands in #850). So a USE_CUDA=ON build of this branch in isolation won't compile: the .cu definitions no longer match their declarations, and cuTrace_ii[...] = cuTrace_internal_* assigns an incompatible pointer type into the Tracefunc_oii table.

The CPU build is fine and the full stack (through #850) builds + passes — I verified that. But if these PRs are CI'd/merged independently rather than squash-as-a-unit, #862 is a broken-GPU intermediate commit. Worth either landing #850's .cu change together with this typedef flip, or calling out the merge-as-a-unit intent so a CUDA bisect doesn't trip here.

EXPECT_TRUE(cytnx::TestTools::AreNearlyEqTensor(out_p, out_c, 1e-12));
}

TEST(LinalgTraceTest, OutputRankIsInputMinusTwo) {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Review written by Claude on behalf of @IvanaGyro.)

TraceImpl has a dedicated early-out for Ndiag == 0 || Nelem == 0, but no test drives it, so the zero-extent path (and the shape of its result) is unverified. A regression there — wrong output shape, or a stride computation that divides/indexes into empty storage — would slip through.

Suggest a case with a zero-size traced axis (e.g. {0, 0} -> expect a 1-element result, value 0) and one with a zero-size remaining axis (e.g. {3, 0, 3} traced over 0,2 -> expect shape {0}), asserting both shape and that no read occurs. Cheap, and it locks the Ndiag==0 || Nelem==0 branch.

@IvanaGyro IvanaGyro force-pushed the claude/trace-sig-refactor branch from dd601ed to 92ae99d Compare May 31, 2026 08:37
@pcchen pcchen added this to the v1.1.0 milestone Jun 1, 2026
@IvanaGyro IvanaGyro force-pushed the claude/trace-sig-refactor branch from 92ae99d to f894e02 Compare June 2, 2026 02:11
IvanaGyro and others added 3 commits June 2, 2026 02:53
The CPU Tensor trace built an identity UniTensor and called Contract, so the
container-level linalg_internal layer depended on the higher-level UniTensor
and Contract APIs -- an inverted layering. Compute the trace directly over
storage instead, and tighten the internal Trace API at the same time.

* Simplify the dispatch API: every Trace_internal_* / cuTrace_internal_*
  dispatcher (and the Tracefunc_oii typedef) now takes only (Tn, ax1, ax2) and
  returns the result Tensor. A small TraceParams helper
  (backend/linalg_internal_cpu/trace_dispatch.hpp) derives Ndiag, Nelem, accu,
  remain_rank_id, the reduced shape, and the is_2d flag once per call from
  shape and axes, so the dispatchers and the _trace_2d / _trace_nd kernels no
  longer plumb those through individually. Trace.cpp loses its derivation
  block and is now just validation + a dtype-keyed dispatch call.
* Add a public Tensor::strides() getter (declared in include/Tensor.hpp,
  defined in src/Tensor.cpp): for each logical axis, the storage offset
  increment per +1 step, derived from the shape and inverse mapper -- the same
  layout Tensor::at already indexes through. This is what the dispatchers use
  to sum the diagonal in place.
* Replace the body of _trace_2d / _trace_nd with a stride-aware reduction
  via PairwiseSum over a new range adaptor stride_view (snake_case, with a
  `r | stride(k)` pipe form; iterator stores the underlying base iterator so it
  stays valid independent of the view's lifetime). The diagonal stride is
  strides[ax1] + strides[ax2], so the previous contiguous() copy is dropped,
  and floating-point traces get the same O(log N * eps) error bound as
  linalg::Sum. Guard Ndiag == 0 / Nelem == 0 (unsigned (Ndiag - 1) would
  otherwise underflow into a huge span extent). Drop the UniTensor and
  Generator includes from the CPU translation unit.
* tests/linalg_test/stride_view_test.cpp exercises the iterator concept model,
  the pipe form, the iterator-outlives-view case, the non-dividing
  size-vs-stride branch, the zero-stride throw, and complex strided summation.
  Tests are reached through src/ added to the test binary's include path, so
  internal headers can be unit-tested directly.
* tests/Tensor_strides_test.cpp checks strides() against Tensor::at across
  ranks 2-5 permutations and across dtypes.
* tests/linalg_test/Trace_test.cpp covers the strided in-place path against
  the contiguous-clone reference across ranks 2-5, complex and integer dtypes,
  the rank-2 / rank-N output shape invariants, and swapped axis order.
* benchmarks/linalg/Trace_bm.cpp pits the in-place strided Trace against the
  "collect contiguously and reduce" alternatives Ivana asked for: the 3D
  contiguous baseline is a BLAS matrix-vector multiplication
  (tr(A[d,m,d]) over d as {middle, n*n} @ {n*n, 1} against vec(I_n)), and the
  2D contiguous baseline is the reshape trick (save A[n-1, n-1], view the rest
  as {n-1, n+1}, permute+contiguous, Sum the resulting first row, add the
  saved entry). On openblas-cpu debug, the strided path beats both baselines
  by orders of magnitude at the tested shapes.
* The discovery timeout for gtest_discover_tests is bumped to 120 s so the
  test binary's ASAN+MKL listing fits on slower runners.

Co-authored-by: Claude <noreply@anthropic.com>
The ND CPU trace decoded each flat output index into an input base offset by
dividing and taking the modulo against precomputed row-major accumulators on
every element. Division and modulo dominate that hot loop, and the accumulator
vector only existed to support them.

Carry the base offset on an odometer instead: precompute the input stride for
each surviving axis once, then for each output element bump the last axis index
(carrying into earlier axes on wrap) and adjust the base offset by the affected
axes' strides. Advancing is amortized O(1) per element with no division, modulo,
or per-element flat-index decode, and the row-major accumulator vector is gone.

Also drop the now-unused backend/lapack_wrapper.hpp and utils/utils.hpp
includes, left over from the previous Contract-based implementation.

Co-authored-by: Claude <noreply@anthropic.com>
TraceImpl<T> allocated the result as a Tensor and then reached through
out.storage() to fill it, mixing the container and storage layers. Fill a flat
result Storage directly and compose the Tensor with Tensor::from_storage once
the diagonal sums are written, so the storage is populated at the storage layer
and only promoted to a Tensor (and reshaped to the reduced rank) at the end.

Behavior is unchanged: the 2D trace yields a single-element result, the ND trace
one element per remaining-rank multi-index, and the Ndiag == 0 / Nelem == 0
guard still returns a zeroed result of the reduced shape.

Co-authored-by: Claude <noreply@anthropic.com>
@IvanaGyro IvanaGyro changed the base branch from claude/trace-sig-refactor to claude/stride-view June 2, 2026 03:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants