device, tsasm/arm: add ChaCha20-Poly1305 assembly for 32-bit GOARCH=arm#58
Open
bradfitz wants to merge 1 commit into
raggi (Member) reviewed on May 12, 2026 and left a comment:
I think we should consider using gas to encode everything, so that we don't need to maintain an encoder and validation stack. We could commit the gas output and have regen and CI check that it hasn't changed and doesn't need to.
We should also add test coverage that explicitly asserts on r10 usage in the final encoding, especially if we still have the gas fallback.
Force-pushed from 48e8964 to 632adb4.
golang.org/x/crypto ships no chacha20poly1305 assembly on 32-bit ARM,
so the WireGuard data-path AEAD falls back to a slow pure-Go path.
The previous AF_ALG approach (commit 30595e7) borrowed the kernel's
NEON crypto via a socket; that helped on NEON hardware but cost a
syscall per chunk and was a net loss on ARMv6.

This adds a real in-process assembly path, regenerated from the
upstream CryptoGAMS Perl scripts (chacha-armv4.pl, poly1305-armv4.pl
from Andy Polyakov's CryptoGAMS distribution, dual-licensed
BSD-3-Clause / OpenSSL -- see the full text in the .s file headers).

The .pl files are NOT vendored: tsasm/arm/regen/regen.sh fetches them
on demand from a Tailscale-hosted mirror at a pinned commit SHA, with
SHA256 verification on each file. Updating upstream is a SHA bump in
regen.sh plus re-running it.

A new plan9-xlate.pl post-processor turns the GAS-syntax output into
Plan 9 ARM assembly the Go assembler accepts; a companion
neon_encode.pl encodes every NEON instruction CryptoGAMS uses in
pure Perl, since Go's ARM assembler rejects every NEON mnemonic.
No cross-assembler is needed at regen time: a missing encoder is
a fatal error rather than a fall-through to arm-as.

TestNEONEncoderAgainstGAS in tsasm/arm/regen/ cross-checks every
unique NEON line in the upstream .S against arm-linux-gnueabihf-as
and fails on any mismatch or skipped line (gated identically to
TestRegenReproducible: pass --run-regen-tests or set CI=true).

NEON dispatch happens at runtime via golang.org/x/sys/cpu's
HasNEON, so a single binary covers both NEON and non-NEON ARM.

Performance, 1420-byte payload (typical WireGuard packet), median of
three runs:

    Hardware               arm32-go  AF_ALG  arm32-asm  arm64-native
    Pi 1 ARMv6 (no NEON)        6.1     4.7       14.7           n/a
    Pi 2 ARMv7+NEON (A7)       10.8      --       30.7           n/a
    Pi 3 ARMv7+NEON (A53)        22      --       84.6         112.6
    Pi 4 ARMv7+NEON (A72)        50      73      132.3         159.4

All MB/s, AEAD Seal at 1420 bytes. 'arm32-asm' is what this CL adds;
'arm64-native' is x/crypto/chacha20poly1305 built as GOARCH=arm64 on
the same Pi, included for comparison only. Pi 1/Pi 2 are 32-bit only
(ARMv6 / Cortex-A7).

The asm path beats AF_ALG by ~1.8x on Pi 4 (no kernel transition per
chunk) and is ~2.4x faster than pure-Go on Pi 1 where there's no
NEON path at all (CryptoGAMS's optimized scalar inner loop).

Setting TS_WG_ASM=0 in the environment forces the pure-Go
x/crypto implementation, as an escape hatch for hardware
regressions or asm bugs.

See tsasm/arm/README.md for design notes where the non-trivial
parts of plan9-xlate.pl are documented, such as Plan 9's frame
layout (the SP shift for the auto-saved LR), R10=g shadowing,
NEON encoding strategy, the data-label folding trick for
sigma/one/rot8, and the Go-side length trimming that avoids
cross-function branches in chacha NEON.

Updates tailscale/tailscale#7053

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
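For readers skimming the description, the backend selection described above (the TS_WG_ASM=0 escape hatch plus runtime NEON detection) might look roughly like the sketch below. This is illustrative only and not the code in this PR: the impl type, its constants, and selectImpl are hypothetical names, and it assumes that only the exact value "0" disables the assembly path. In a real GOARCH=arm build the NEON flag would presumably come from cpu.ARM.HasNEON in golang.org/x/sys/cpu; it is passed in as a plain bool here so the sketch builds with the standard library alone.

```go
package main

import (
	"fmt"
	"os"
)

// impl names the AEAD backend chosen at startup (hypothetical type).
type impl string

const (
	implPureGo impl = "pure-go"    // x/crypto pure-Go fallback
	implScalar impl = "asm-scalar" // assembly without NEON (e.g. ARMv6)
	implNEON   impl = "asm-neon"   // assembly with NEON
)

// selectImpl picks a backend from the TS_WG_ASM value and NEON availability.
// Assumption: only the exact value "0" turns the assembly path off; an unset
// or any other value leaves it enabled.
func selectImpl(tsWGASM string, hasNEON bool) impl {
	if tsWGASM == "0" {
		return implPureGo
	}
	if hasNEON {
		return implNEON
	}
	return implScalar
}

func main() {
	// Hard-coded so the sketch runs on any platform; a real 32-bit ARM build
	// would read cpu.ARM.HasNEON from golang.org/x/sys/cpu instead.
	hasNEON := false
	fmt.Println(selectImpl(os.Getenv("TS_WG_ASM"), hasNEON))
}
```

Keeping the decision in one pure function makes the escape hatch easy to unit-test without ARM hardware, since both inputs can be supplied directly.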
Force-pushed from 632adb4 to 8c69ed2.