|
| 1 | +# FFI Fast-Call Internals |
| 2 | + |
| 3 | +This document is for contributors who maintain or extend the FFI |
| 4 | +fast-call path (the V8 Fast API Calls implementation in `node:ffi`). |
| 5 | +For end-user behavior, see [doc/api/ffi.md](../api/ffi.md). |
| 6 | + |
| 7 | +## Overview |
| 8 | + |
| 9 | +For each registered FFI function whose signature is fast-call eligible |
| 10 | +(`src/ffi/types.cc:IsFastCallEligible`), Node generates a tiny native |
| 11 | +trampoline that strips the `Local<Object>` receiver V8 fast calls |
| 12 | +require and tail-calls the user's target function. The trampoline |
| 13 | +address is handed to `v8::CFunction`. A JS wrapper |
| 14 | +(`lib/internal/ffi-fastcall.js`) validates args, routes object-typed |
| 15 | +pointer args to a libffi slow path, and checks a per-library "alive" |
| 16 | +sentinel before each call. |
| 17 | + |
| 18 | +The libffi path remains for callbacks (`ffi_prep_closure_loc`), |
| 19 | +ineligible signatures (for example, those containing the FFI `function` |
| 20 | +type or exceeding the per-ABI argument caps), and unsupported platforms. |
| 21 | + |
| 22 | +## Eligibility (`src/ffi/types.cc:IsFastCallEligible`) |
| 23 | + |
| 24 | +A signature is fast-call eligible if and only if all of the following hold: |
| 25 | + |
| 26 | +1. The platform is supported (see Platform support below). |
| 27 | +2. Return type is one of: void, i8/u8/i16/u16, i32/u32, i64/u64, |
| 28 | + f32/f64, pointer. |
| 29 | +3. Every arg type is in that set. |
| 30 | +4. No arg or return is the FFI `function` type. |
| 31 | +5. Per-ABI argument caps: |
| 32 | + - AArch64: ≤ 7 GP, ≤ 8 FP |
| 33 | + - x86_64 SysV: ≤ 6 GP, ≤ 8 FP |
| 34 | + - x86_64 Win64: GP + FP combined ≤ 3 (positional register slots — 4 minus the receiver) |
| 35 | + - AArch32 hardfp: ≤ 3 GP, ≤ 8 FP; i64/u64 args and return type rejected |
| 36 | + |
| 37 | +`IsFastCallEligible(fn, &reason)` returns false with a static reason |
| 38 | +string on miss. |
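
The rules above can be modelled in a few lines. The following is a minimal sketch, not the code in `src/ffi/types.cc`: the `Abi` and `Type` enums, the `FitsFastCallRules` name, and the reason strings are all hypothetical, and rejecting `void` as an argument type is an assumption. It only covers rules 2 through 5; platform detection is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; the real check lives in src/ffi/types.cc.
enum class Abi { kAArch64, kX64SysV, kX64Win, kAArch32HardFp };
enum class Type {
  kVoid, kI8, kU8, kI16, kU16, kI32, kU32, kI64, kU64,
  kF32, kF64, kPointer, kFunction
};

static bool IsFloat(Type t) { return t == Type::kF32 || t == Type::kF64; }
static bool Is64BitInt(Type t) { return t == Type::kI64 || t == Type::kU64; }

// Returns true if the signature passes rules 2-5. Otherwise returns false
// and points *reason at a static string.
bool FitsFastCallRules(Abi abi, Type ret, const std::vector<Type>& args,
                       const char** reason) {
  assert(reason != nullptr);
  *reason = nullptr;
  const bool arm32 = abi == Abi::kAArch32HardFp;
  if (ret == Type::kFunction) {
    *reason = "function type in signature";
    return false;
  }
  if (arm32 && Is64BitInt(ret)) {
    *reason = "64-bit integer on AArch32";
    return false;
  }

  std::size_t gp = 0;  // integer and pointer args
  std::size_t fp = 0;  // f32/f64 args
  for (Type t : args) {
    if (t == Type::kFunction) {
      *reason = "function type in signature";
      return false;
    }
    if (t == Type::kVoid) {  // assumption: void is only valid as a return
      *reason = "void argument";
      return false;
    }
    if (arm32 && Is64BitInt(t)) {
      *reason = "64-bit integer on AArch32";
      return false;
    }
    if (IsFloat(t)) ++fp; else ++gp;
  }

  bool fits = false;
  switch (abi) {
    case Abi::kAArch64:       fits = gp <= 7 && fp <= 8; break;
    case Abi::kX64SysV:       fits = gp <= 6 && fp <= 8; break;
    case Abi::kX64Win:        fits = gp + fp <= 3; break;  // positional slots
    case Abi::kAArch32HardFp: fits = gp <= 3 && fp <= 8; break;
  }
  if (!fits) {
    *reason = "argument count exceeds ABI cap";
    return false;
  }
  return true;
}
```

Note how Win64 is the only case that sums GP and FP counts: its four register slots are positional, so every argument consumes one regardless of class.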
| 39 | + |
| 40 | +## Platform support |
| 41 | + |
| 42 | +| ABI | Emitter file | Status | |
| 43 | +|---|---|---| |
| 44 | +| AArch64 (Linux/macOS/FreeBSD/Windows) | `stub_emitter_aarch64.cc` | Implemented, runtime-verified | |
| 45 | +| x86_64 SysV (Linux/macOS/FreeBSD) | `stub_emitter_x64_sysv.cc` | Implemented, CI-verified | |
| 46 | +| x86_64 Win64 | `stub_emitter_x64_win.cc` | Implemented, CI-verified | |
| 47 | +| AArch32 hardfp (Linux/FreeBSD) | `stub_emitter_arm.cc` | Implemented, CI-verified | |
| 48 | + |
| 49 | +On platforms without an emitter, all registrations fall back to libffi. |
| 50 | + |
| 51 | +Adding a new ABI: implement `EmitForwarder` for the new platform in a |
| 52 | +new `stub_emitter_<abi>.cc`, gate it via `node.gyp` conditions on |
| 53 | +`target_arch` and `OS`, and add the `(os, arch)` pair to |
| 54 | +`fastcall_supported` in `configure.py`. |
| 55 | + |
| 56 | +## Stub generation (`src/ffi/fastcall/stub_emitter_*.cc`) |
| 57 | + |
| 58 | +Each stub does, at most, three things: |
| 59 | + |
| 60 | +1. Shift GP regs down by one slot (drop the receiver). |
| 61 | +2. (Win64 only) shift FP regs down by one slot — Win64's FP/GP register |
| 62 | + slots are positional, so stripping a GP arg also reindexes FP slots. |
| 63 | +3. Tail-call the target via an indirect jump. |
| 64 | + |
| 65 | +With 6 GP args on SysV (the cap), the stub uses a call+ret pattern with a |
| 66 | +stack rewrite, because the receiver plus 6 args need 7 GP slots and the 7th |
| 67 | +lives on the stack. Other ABIs cap below the point where args spill to the stack in v1 to keep emitters simple. |
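
To make the register shift concrete, here is a hedged sketch of what a forwarder for x86_64 SysV could look like when all arguments fit in registers (at most 5 GP args after the receiver). `EmitSysVForwarderSketch` is a hypothetical helper, not the real `EmitForwarder`; the choice of `r11` as the scratch register is an assumption, and the stack-rewrite case is deliberately not covered. FP args stay in `xmm0`..`xmm7` untouched, so only GP registers move.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: builds the bytes of a stub that drops the receiver
// (incoming rdi) and tail-calls `target`. Handles only the all-in-registers
// case, i.e. gp_args <= 5.
std::vector<std::uint8_t> EmitSysVForwarderSketch(std::uint64_t target,
                                                  std::size_t gp_args) {
  assert(gp_args <= 5);
  // Each move reads a register that has not been overwritten yet, so the
  // shifts must be emitted in this order.
  static const std::uint8_t kShift[5][3] = {
      {0x48, 0x89, 0xF7},  // mov rdi, rsi
      {0x48, 0x89, 0xD6},  // mov rsi, rdx
      {0x48, 0x89, 0xCA},  // mov rdx, rcx
      {0x4C, 0x89, 0xC1},  // mov rcx, r8
      {0x4D, 0x89, 0xC8},  // mov r8, r9
  };
  std::vector<std::uint8_t> code;
  for (std::size_t i = 0; i < gp_args; ++i) {
    code.insert(code.end(), kShift[i], kShift[i] + 3);
  }
  // movabs r11, imm64 (r11 is caller-saved and carries no argument).
  code.push_back(0x49);
  code.push_back(0xBB);
  for (int i = 0; i < 8; ++i) {
    code.push_back(static_cast<std::uint8_t>(target >> (8 * i)));
  }
  // jmp r11: a tail call, so the target returns straight to V8's caller.
  code.push_back(0x41);
  code.push_back(0xFF);
  code.push_back(0xE3);
  return code;
}
```

Because the stub ends in a jump rather than a call, it needs no frame and leaves the return address in place.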
| 68 | + |
| 69 | +## JIT memory (`src/ffi/fastcall/jit_memory.cc`) |
| 70 | + |
| 71 | +`JitMemory` is a process-global singleton on top of platform `mmap`/`VirtualAlloc`. |
| 72 | +It allocates 64-byte slot-aligned chunks within page-aligned allocations. |
| 73 | +After writing the stub, the page is transitioned to RX via `mprotect` / |
| 74 | +`VirtualProtect`; once a page goes RX, no further allocation happens |
| 75 | +in it (the bump cursor is locked). |
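
The slot rounding and the locked cursor can be illustrated with bookkeeping alone. This sketch is an assumption about the shape of that logic, not the code in `jit_memory.cc`: `PageCursor`, `RoundUpToSlot`, and `MarkExecutable` are hypothetical names, and no memory is actually mapped or protected here.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kSlotSize = 64;

// Rounds a stub size up to a whole number of 64-byte slots.
constexpr std::size_t RoundUpToSlot(std::size_t n) {
  return (n + kSlotSize - 1) / kSlotSize * kSlotSize;
}

// Hypothetical per-page bump cursor. Offsets are relative to the page base.
class PageCursor {
 public:
  explicit PageCursor(std::size_t page_size) : page_size_(page_size) {
    assert(page_size % kSlotSize == 0);
  }

  // On success stores the slot-aligned offset in *offset and returns true.
  // Fails once the page is locked or when the request does not fit.
  bool Allocate(std::size_t size, std::size_t* offset) {
    const std::size_t need = RoundUpToSlot(size);
    if (locked_ || size == 0 || need > page_size_ - cursor_) return false;
    *offset = cursor_;
    cursor_ += need;
    return true;
  }

  // Stands in for the mprotect/VirtualProtect transition to RX: after
  // this point the page accepts no further allocations.
  void MarkExecutable() { locked_ = true; }

 private:
  std::size_t page_size_;
  std::size_t cursor_ = 0;
  bool locked_ = false;
};
```

Locking the cursor at the RX transition means a page is never writable and executable at the same time.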
| 76 | + |
| 77 | +The original spec called for `v8::PageAllocator`, but neither |
| 78 | +`Isolate::GetArrayBufferAllocator()->GetPageAllocator()` nor |
| 79 | +`Platform::GetPageAllocator()` returns a usable allocator in Node's |
| 80 | +embedded configuration — both default to `nullptr`. The implementation |
| 81 | +uses direct system calls (with `MAP_JIT` on Apple Silicon) instead. |
| 82 | + |
| 83 | +`Free` decrements the live-byte counter but does not return memory. |
| 84 | +Pages stay alive for the process lifetime. |
| 85 | + |
| 86 | +Concurrent emit from multiple isolates is safe via |
| 87 | +`JitMemory::EmitStub(code, size)`, which holds the singleton mutex across |
| 88 | +allocate + memcpy + RX-transition. The lower-level `Allocate` / |
| 89 | +`MakeExecutable` / `Free` methods remain public for the self-test only |
| 90 | +(which writes platform-specific instruction bytes after Allocate but |
| 91 | +before MakeExecutable, and needs that explicit step ordering). |
| 92 | + |
| 93 | +## Self-test |
| 94 | + |
| 95 | +`JitMemory::SelfTest` allocates a tiny stub, writes a `ret`-style |
| 96 | +native sequence, transitions to RX, and calls it. It runs once per |
| 97 | +process, at the first FFI registration, and the result is cached in a |
| 98 | +process-wide atomic via `std::call_once`. On failure, every subsequent registration |
| 99 | +falls back to libffi-only and a process warning is emitted via |
| 100 | +`ProcessEmitWarning`. |
| 101 | + |
| 102 | +This catches: |
| 103 | +- macOS `MAP_JIT` entitlement missing (e.g., signed binary without |
| 104 | + `com.apple.security.cs.allow-jit`). |
| 105 | +- Hardened-runtime restrictions. |
| 106 | +- SELinux execmem denial. |
| 107 | + |
| 108 | +## JS wrapper (`lib/internal/ffi-fastcall.js`) |
| 109 | + |
| 110 | +For each fast-call-eligible inner v8::Function returned from C++, |
| 111 | +`buildWrapper` creates a JS wrapper that: |
| 112 | + |
| 113 | +1. Reads the per-library "alive" `Uint8Array` and throws |
| 114 | + `ERR_FFI_LIBRARY_CLOSED` if `[0] !== 0`. |
| 115 | +2. Validates each arg, mirroring `ToFFIArgument` in |
| 116 | + `src/ffi/types.cc`: same `ERR_INVALID_ARG_VALUE` |
| 117 | + codes, same messages, same range bounds. |
| 118 | +3. Routes pointer args by kind: |
| 119 | + - BigInt or null/undefined: pass through as primitive. |
| 120 | + - String / Buffer / ArrayBuffer / ArrayBufferView: `ReflectApply` |
| 121 | + the `kFastcallInvokeSlow` libffi-backed v8::Function with the |
| 122 | + original args. |
| 123 | +4. Calls the inner v8::Function with positional primitives. V8's fast |
| 124 | + call engages when TurboFan inlines the wrapper. |
| 125 | + |
| 126 | +The wrapper body is **arity-specialized**: arities 0..6 are unrolled into |
| 127 | +distinct closures with named parameters (`function(a0, a1, ...)`), so V8 |
| 128 | +inlines them and the per-arg type info / pointer flag are read from |
| 129 | +closure locals instead of arrays. Arities 7+ use a rest-args fallback. This |
| 130 | +matters: an earlier draft used a single generic `function(...args)` plus |
| 131 | +`ReflectApply`, which dropped FFI throughput by 30–50% vs. the libffi+SB |
| 132 | +path. The arity specialization gets the throughput back to 5–13× the |
| 133 | +libffi+SB baseline (see commit `81d908e48da` for the fix and benchmarks). |
| 134 | + |
| 135 | +The wrapper is patched onto `DynamicLibrary.prototype.getFunction`, |
| 136 | +`getFunctions`, and the `functions` accessor. |
| 137 | + |
| 138 | +## Internal symbols |
| 139 | + |
| 140 | +The JS wrapper looks for these per-isolate Symbols on the inner |
| 141 | +`v8::Function`. They are defined in `src/env_properties.h` and |
| 142 | +attached by `DynamicLibrary::CreateFunction` for fast-call-eligible |
| 143 | +signatures only: |
| 144 | + |
| 145 | +| Symbol | Value | Purpose | |
| 146 | +|---|---|---| |
| 147 | +| `kFastcallAlive` | `Uint8Array(1)` shared with `DynamicLibrary` | close sentinel | |
| 148 | +| `kFastcallInvokeSlow` | `v8::Function` over `InvokeFunction` | object-arg fallback | |
| 149 | +| `kFastcallParams` | `string[]` of parameter type names | wrapper introspection | |
| 150 | +| `kFastcallResult` | result type name string | wrapper introspection | |
| 151 | + |
| 152 | +## Lifecycle |
| 153 | + |
| 154 | +**Registration:** `CreateFunction` in `src/node_ffi.cc` builds a |
| 155 | +`fastcall::CFunctionInfoBundle` (which owns the heap-allocated |
| 156 | +`v8::CFunctionInfo` + `v8::CTypeInfo[]`), allocates and emits the stub via |
| 157 | +`JitMemory::EmitStub`, then constructs the inner `v8::Function` via a |
| 158 | +`FunctionTemplate` with the `CFunction` attached. Per-function fast-call |
| 159 | +state is stored on `FFIFunctionInfo::fast` (a `unique_ptr<FastCallState>`, |
| 160 | +null when fast-call is unavailable for that signature). |
| 161 | + |
| 162 | +**Per-call:** wrapper validates → calls inner. V8 picks fast or slow |
| 163 | +callback. Slow = `InvokeFunction` (libffi); fast = our generated stub → |
| 164 | +target. |
| 165 | + |
| 166 | +**`lib.close()`:** flips the alive sentinel (`alive[0] = 1`). The wrapper |
| 167 | +throws `ERR_FFI_LIBRARY_CLOSED` on subsequent calls. Slow-path |
| 168 | +`InvokeFunction` independently checks `fn->closed` for the same effect on |
| 169 | +ineligible signatures. Stubs are NOT freed at close. |
| 170 | + |
| 171 | +**Weak callback (function GC'd):** `CleanupFunctionInfo` resets |
| 172 | +`info->fast`, whose `~FastCallState` destructor calls `JitMemory::Free` |
| 173 | +on the stub. |
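
The ownership chain in the weak-callback step can be sketched as follows. This is a simplified model under stated assumptions, not the code in `src/node_ffi.cc`: `JitCounters`, `FunctionInfoSketch`, and `CleanupSketch` are hypothetical stand-ins, and the destructor adjusts a counter where the real one would call `JitMemory::Free`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for the JitMemory live-byte accounting.
struct JitCounters {
  std::size_t live_bytes = 0;
};

// Owns one stub's accounting. Destroying it releases the stub, which here
// only means decrementing the live-byte counter.
class FastCallState {
 public:
  FastCallState(JitCounters* jit, std::size_t stub_size)
      : jit_(jit), stub_size_(stub_size) {
    assert(jit != nullptr);
    jit_->live_bytes += stub_size_;
  }
  ~FastCallState() { jit_->live_bytes -= stub_size_; }

  FastCallState(const FastCallState&) = delete;
  FastCallState& operator=(const FastCallState&) = delete;

 private:
  JitCounters* jit_;
  std::size_t stub_size_;
};

// Models FFIFunctionInfo: `fast` is null when fast-call is unavailable.
struct FunctionInfoSketch {
  std::unique_ptr<FastCallState> fast;
};

// Models the weak callback: resetting the pointer runs the destructor.
void CleanupSketch(FunctionInfoSketch* info) { info->fast.reset(); }
```

Tying the release to the destructor keeps the slow-path-only case trivial: a null `fast` pointer makes the reset a no-op.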
| 174 | + |
| 175 | +## Testing |
| 176 | + |
| 177 | +- `test/cctest/test_ffi_fastcall_*.cc`: unit tests for emitters, JIT |
| 178 | + memory, eligibility, CFunctionInfo builder. |
| 179 | +- `test/ffi/test-ffi-*.js`: JS-level integration tests covering |
| 180 | + types, arity, callbacks, permissions, etc. (existing FFI suite — |
| 181 | + reused as the integration baseline). |
| 182 | + |
| 183 | +When debugging unexpected fast-call behavior, log the eligibility miss |
| 184 | +reason returned through the second arg to `IsFastCallEligible`. Configure |
| 185 | +the build with `--without-ffi-fastcall` to A/B test against the |
| 186 | +libffi-only path. |