Environment
| | |
|---|---|
| Julia | 1.12.6 / 1.13-rc1 |
| HTTP.jl | 2.0.0 |
| Reseau | 1.1.3 (165e3ac), also reproduced on 1.1.2 (3b6e3aa) |
| Platform | Scaleway Serverless Jobs (gVisor sandbox) |
| uname -a | Linux e7f94219-061e-42fc-a1fd-94e8979590e0 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 GNU/Linux |
Description
I'm getting a signal 11 (SIGSEGV) crash in Reseau's background epoll I/O poller thread, in _backend_poll_once!.
Claude suggests that the @gcsafe_ccall around epoll_wait in _backend_poll_once! allows Julia's GC to run during the syscall, and that gVisor's userspace epoll implementation exposes a race that doesn't exist on a real Linux kernel. This would explain why I can reproduce the crash consistently on Scaleway Serverless Jobs (a serverless container platform that uses gVisor as its sandbox) but never on standard Docker locally.
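To confirm which sandbox a given run landed in, a small check on the uname string can help. This is only a heuristic sketch: the marker is copied from the uname output in this report, and it assumes gVisor always reports that same fixed kernel version and build date. `looks_like_gvisor` is a hypothetical helper name, not part of HTTP.jl or Reseau.

```julia
# Heuristic only: the marker is copied from the uname output in this report.
# Assumption: gVisor always reports this fixed kernel version and build date.
const GVISOR_UNAME_MARKER = "4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016"

# Hypothetical helper: true when the uname string carries the marker above.
looks_like_gvisor(uname::AbstractString) = occursin(GVISOR_UNAME_MARKER, uname)

# Usage on a Linux host:
#   looks_like_gvisor(readchomp(`uname -a`))
```

Logging this next to the package versions makes it easier to tell gVisor runs apart from plain Docker runs when comparing crash reports.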
Stack trace
Output from the MWE reproduction:
┌ Info: environment
│ julia = "1.12.6"
│ os = :Linux
└ arch = :x86_64
┌ Info: packages
│ HTTP = v"2.0.0"
└ Reseau = v"1.1.3"
┌ Info: kernel
└ uname = "Linux 105e9ccc-a395-48a5-9f20-c2824a20e916 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 GNU/Linux"
[1] signal 11 (1): Segmentation fault
in expression starting at /app/mwe.jl:11
== at ./promotion.jl:637 [inlined]
== at ./promotion.jl:487 [inlined]
_backend_poll_once! at /root/.julia/packages/Reseau/chanm/src/iopoll/epoll.jl:235
_poller_thread_main! at /root/.julia/packages/Reseau/chanm/src/iopoll/runtime.jl:384
_poller_thread_entry at /root/.julia/packages/Reseau/chanm/src/iopoll/runtime.jl:401
jlcapi__poller_thread_entry_18496 at /root/.julia/compiled/v1.12/Reseau/vKBJ8_aNuEz.so (unknown line)
unknown function (ip: 0x7ea27c7e0b7a) at /lib/x86_64-linux-gnu/libc.so.6
unknown function (ip: 0x7ea27c85e7f7) at /lib/x86_64-linux-gnu/libc.so.6
Allocations: 37680835 (Pool: 37680686; Big: 149); GC: 5294
Line 235 is the check if n == -1, which immediately follows the @gcsafe_ccall epoll_wait(...) call. The inlined == frames in promotion.jl come from the type promotion for that comparison.
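For readers wondering why promotion.jl shows up at all, here is a minimal sketch of that comparison in isolation. It assumes the epoll_wait result is bound as a Cint (the C prototype returns int); I have not checked how Reseau declares it. The variable names are illustrative only.

```julia
# Assumption: the poller binds the epoll_wait result as a Cint (Int32).
n = Cint(-1)              # stand-in for a failed epoll_wait call

# Comparing an Int32 with the Int64 literal -1 goes through the generic
# ==(::Number, ::Number) fallback in promotion.jl, which promotes both
# operands to a common type before comparing them.
promoted = promote(n, -1)
is_error = n == -1

println(typeof(promoted))
println(is_error)
```

Both operands are plain machine integers, so the comparison by itself allocates nothing.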
Minimal reproduction
If you have gVisor (runsc) installed locally, you should be able to reproduce it with:
docker build -t http-mwe .
docker run --runtime=runsc --rm http-mwe
Then let it run for a couple of minutes.
mwe.jl:
using HTTP
using Pkg  # needed for Pkg.dependencies() below
@info "environment" julia=string(VERSION) os=Sys.KERNEL arch=Sys.ARCH
deps = Dict(v.name => v.version for (_, v) in Pkg.dependencies())
@info "packages" HTTP=deps["HTTP"] Reseau=deps["Reseau"]
@info "kernel" uname=readchomp(`uname -a`)
server = HTTP.serve!(_ -> HTTP.Response(200, "ok"), "127.0.0.1", 8765)
while true
    HTTP.get("http://127.0.0.1:8765/")
    rand(100_000)
end
Dockerfile:
FROM julia:1.12.6
ENV JULIA_CPU_TARGET="x86-64"
WORKDIR /app
COPY mwe.jl Project.toml ./
RUN julia --project=@. -e 'using Pkg; Pkg.instantiate()'
ENTRYPOINT ["julia", "-t1", "--project=@.", "mwe.jl"]
Project.toml:
[deps]
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"
[compat]
HTTP = "2.0.0"
# Could also be a reproducer for the precompilation issue? I gave up after 40+ minutes
[preferences.HTTP]
precompile_workload = false