
device, tsasm/arm: add ChaCha20-Poly1305 assembly for 32-bit GOARCH=arm #58

Open: bradfitz wants to merge 1 commit into tailscale from bradfitz/arm32_asm


@bradfitz bradfitz commented May 7, 2026

golang.org/x/crypto ships no chacha20poly1305 assembly on 32-bit ARM,
so the WireGuard data-path AEAD falls back to a slow pure-Go path.
The previous AF_ALG approach (commit 30595e7) borrowed the kernel's
NEON crypto via a socket; that helped on NEON hardware but cost a
syscall per chunk and was a net loss on ARMv6.

This adds a real in-process assembly path, regenerated from the
upstream CryptoGAMS Perl scripts (chacha-armv4.pl, poly1305-armv4.pl
from Andy Polyakov's CryptoGAMS distribution, dual-licensed
BSD-3-Clause / OpenSSL -- see the full text in the .s file headers).
The .pl files are NOT vendored: tsasm/arm/regen/regen.sh fetches them
on demand from a Tailscale-hosted mirror at a pinned commit SHA, with
SHA256 verification on each file. Updating upstream is a SHA bump in
regen.sh plus re-running it.
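
As an illustration of the pinned-checksum step, here is a minimal sketch in Go. It is not the PR's regen.sh (which is a shell script); the helper name verifySHA256 and the stand-in input are hypothetical, and a real run would hash the fetched .pl file against the digest recorded next to the pinned commit.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verifySHA256 returns nil only if the SHA-256 digest of data equals wantHex
// (lowercase hex). Hypothetical helper for illustration; the PR performs the
// equivalent check inside regen.sh.
func verifySHA256(data []byte, wantHex string) error {
	sum := sha256.Sum256(data)
	got := hex.EncodeToString(sum[:])
	if got != wantHex {
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}

func main() {
	// Stand-in content; a real run would read the downloaded upstream script.
	data := []byte("abc")
	// Well-known SHA-256 digest of "abc", playing the role of the pinned value.
	const pinned = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"

	if err := verifySHA256(data, pinned); err != nil {
		fmt.Println("refusing to regenerate:", err)
		return
	}
	fmt.Println("checksum ok")
}
```

Failing closed on a mismatch is the point of the design: a changed upstream file stops regeneration until someone deliberately bumps the pinned values.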

A new plan9-xlate.pl post-processor turns the GAS-syntax output into
Plan 9 ARM assembly the Go assembler accepts; a companion
neon_encode.pl encodes every NEON instruction CryptoGAMS uses in
pure Perl, since Go's ARM assembler rejects every NEON mnemonic.
No cross-assembler is needed at regen time: a missing encoder is
a fatal error rather than a fall-through to arm-as.
TestNEONEncoderAgainstGAS in tsasm/arm/regen/ cross-checks every
unique NEON line in the upstream .S against arm-linux-gnueabihf-as
and fails on any mismatch or skipped line (gated identically to
TestRegenReproducible: pass --run-regen-tests or set CI=true).

NEON dispatch happens at runtime via golang.org/x/sys/cpu's
HasNEON, so a single binary covers both NEON and non-NEON ARM.

Performance, 1420-byte payload (typical WireGuard packet), median of
three runs:

Hardware                  arm32-go  AF_ALG  arm32-asm  arm64-native
Pi 1 ARMv6 (no NEON)      6.1       4.7     14.7       n/a
Pi 2 ARMv7+NEON (A7)      10.8      --      30.7       n/a
Pi 3 ARMv7+NEON (A53)     22        --      84.6       112.6
Pi 4 ARMv7+NEON (A72)     50        73      132.3      159.4

All MB/s, AEAD Seal at 1420 bytes. 'arm32-asm' is what this CL
adds; 'arm64-native' is x/crypto/chacha20poly1305 built as
GOARCH=arm64 on the same Pi, included for comparison only.
Pi 1/Pi 2 are 32-bit only (ARMv6 / Cortex-A7).

The asm path beats AF_ALG by ~1.8x on Pi 4 (no kernel transition per
chunk) and is ~2.4x faster than pure-Go on Pi 1, which has no NEON at
all; the gain there comes from CryptoGAMS's optimized scalar inner loop.

Setting TS_WG_ASM=0 in the environment forces the pure-Go
x/crypto implementation, as an escape hatch for hardware
regressions or asm bugs.
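
The selection logic described above (environment override first, then NEON detection) might look roughly like the sketch below. The type and function names are hypothetical and not taken from the PR, and it assumes that only the exact value "0" disables the assembly. To keep the example stdlib-only, the NEON flag is passed in as a parameter; per the description, the real code reads it from golang.org/x/sys/cpu.

```go
package main

import (
	"fmt"
	"os"
)

// impl names one of the three AEAD code paths discussed in this PR.
type impl string

const (
	implPureGo    impl = "pure-go"    // x/crypto fallback
	implASMScalar impl = "asm-scalar" // assembly without NEON (e.g. ARMv6)
	implASMNEON   impl = "asm-neon"   // assembly using NEON
)

// chooseImpl picks a code path. tsWGASM is the value of the TS_WG_ASM
// environment variable; hasNEON is the CPU feature bit.
// Assumption: any value other than "0" (including unset) leaves asm enabled.
func chooseImpl(tsWGASM string, hasNEON bool) impl {
	if tsWGASM == "0" {
		return implPureGo // escape hatch wins over hardware detection
	}
	if hasNEON {
		return implASMNEON
	}
	return implASMScalar
}

func main() {
	// hasNEON is hard-coded here; real code would consult the CPU feature flag.
	hasNEON := false
	fmt.Println(chooseImpl(os.Getenv("TS_WG_ASM"), hasNEON))
}
```

Checking the override before the feature flag means the escape hatch works even when the hardware detection itself is what misbehaves.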

See tsasm/arm/README.md for design notes on the non-trivial parts
of plan9-xlate.pl: Plan 9's frame layout (the SP shift for the
auto-saved LR), R10=g shadowing, the NEON encoding strategy, the
data-label folding trick for sigma/one/rot8, and the Go-side
length trimming that avoids cross-function branches in chacha NEON.

Updates tailscale/tailscale#7053

@bradfitz bradfitz requested a review from raggi May 7, 2026 04:47
@raggi raggi left a comment


I think we should consider just using gas to encode everything, so that we don't need to maintain an encoder and validation stack. We could submit the gas output and have regen and CI check that it hasn't changed and doesn't need to.

We should add test coverage that explicitly asserts on r10 usage in the final encoding, especially if we still have the gas fallback.

Comment thread tsasm/arm/regen/plan9-xlate.pl Outdated
Comment thread tsasm/arm/regen/neon_encode_test.pl Outdated
Comment thread tsasm/arm/regen/neon_encode_test.pl Outdated
Comment thread tsasm/arm/poly1305/poly1305_test.go Outdated
@bradfitz bradfitz force-pushed the bradfitz/arm32_asm branch 3 times, most recently from 48e8964 to 632adb4, May 13, 2026 18:45
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
@bradfitz bradfitz force-pushed the bradfitz/arm32_asm branch from 632adb4 to 8c69ed2, May 21, 2026 17:18