Segmentation fault in Reseau epoll poller thread under gVisor (_backend_poll_once!, epoll.jl:235) #1266

@gsoleilhac

Environment

Julia: 1.12.6 / 1.13-rc1
HTTP.jl: 2.0.0
Reseau: 1.1.3 (165e3ac), also reproduced on 1.1.2 (3b6e3aa)
Platform: Scaleway Serverless Jobs (gVisor sandbox)
uname -a: Linux e7f94219-061e-42fc-a1fd-94e8979590e0 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 GNU/Linux

Description

I'm getting a signal 11 (SIGSEGV) crash in Reseau's background epoll I/O poller thread, in _backend_poll_once!.

Claude suggests that @gcsafe_ccall epoll_wait in _backend_poll_once! allows Julia's GC to run during the syscall, and gVisor's userspace epoll implementation exposes a race that doesn't exist on a real Linux kernel. This would explain why I can reproduce it consistently on Scaleway Serverless Jobs (a serverless container platform that uses gVisor as its sandbox) but not at all on standard Docker locally.
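To confirm which environment a given run is in, the kernel string can serve as a rough heuristic. The sketch below assumes the "4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016" version string, which appears in both uname outputs in this report, is what the gVisor sandbox presents; looks_like_gvisor is a hypothetical helper name and the check could change with other gVisor versions.

```shell
#!/bin/sh
# Heuristic only: match the kernel version string seen in this report's
# uname output. looks_like_gvisor is a hypothetical helper, not a gVisor tool.
looks_like_gvisor() {
    case "$1" in
        *"4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example call (not run when the file is merely sourced):
#   looks_like_gvisor "$(uname -a)" && echo "probably gVisor"
```

A match only suggests gVisor; a non-match on the local Docker setup is consistent with the crash not reproducing there.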

Stack trace

Output from the MWE reproduction:

┌ Info: environment
│   julia = "1.12.6"
│   os = :Linux
└   arch = :x86_64
┌ Info: packages
│   HTTP = v"2.0.0"
└   Reseau = v"1.1.3"
┌ Info: kernel
└   uname = "Linux 105e9ccc-a395-48a5-9f20-c2824a20e916 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 GNU/Linux"

[1] signal 11 (1): Segmentation fault
in expression starting at /app/mwe.jl:11
== at ./promotion.jl:637 [inlined]
== at ./promotion.jl:487 [inlined]
_backend_poll_once! at /root/.julia/packages/Reseau/chanm/src/iopoll/epoll.jl:235
_poller_thread_main! at /root/.julia/packages/Reseau/chanm/src/iopoll/runtime.jl:384
_poller_thread_entry at /root/.julia/packages/Reseau/chanm/src/iopoll/runtime.jl:401
jlcapi__poller_thread_entry_18496 at /root/.julia/compiled/v1.12/Reseau/vKBJ8_aNuEz.so (unknown line)
unknown function (ip: 0x7ea27c7e0b7a) at /lib/x86_64-linux-gnu/libc.so.6
unknown function (ip: 0x7ea27c85e7f7) at /lib/x86_64-linux-gnu/libc.so.6
Allocations: 37680835 (Pool: 37680686; Big: 149); GC: 5294

Line 235 is the check if n == -1 that immediately follows @gcsafe_ccall epoll_wait(...). The inlined == frames in promotion.jl come from the type promotion performed for that comparison.

Minimal reproduction

If you have gVisor (runsc) installed locally, you should be able to reproduce it with:

docker build -t http-mwe .
docker run --runtime=runsc --rm http-mwe

and let it run for a couple of minutes.
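For unattended runs, a small wrapper can tell a crash apart from a run that simply hit the time limit. This is a sketch under assumptions: GNU timeout is available, and the crash surfaces as exit status 139 (128 + signal 11), which is the usual convention but has not been verified on this setup. IMAGE and LIMIT are hypothetical variables; the image name defaults to http-mwe from the build command above.

```shell
#!/bin/sh
# Sketch only: IMAGE and LIMIT are hypothetical knobs, not part of the report.

# Map an exit status to the signal that ended the process (0 if none).
# By shell convention, death by signal N is reported as 128 + N.
exit_signal() {
    if [ "$1" -gt 128 ]; then
        echo $(( $1 - 128 ))
    else
        echo 0
    fi
}

# Run the reproducer once under gVisor, stopping after LIMIT seconds.
run_once() {
    timeout "${LIMIT:-600}" docker run --runtime=runsc --rm "${IMAGE:-http-mwe}"
    status=$?
    sig=$(exit_signal "$status")
    if [ "$sig" -eq 11 ]; then
        echo "crashed with SIGSEGV (exit status $status)"
    elif [ "$status" -eq 124 ]; then
        # GNU timeout returns 124 when the time limit was reached.
        echo "no crash within the time limit"
    else
        echo "ended with exit status $status"
    fi
    return "$status"
}
```

After sourcing the file, calling run_once prints one line describing how the container ended and returns its exit status, so it can be put in a loop to count crashes over several runs.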

mwe.jl:

using HTTP
using Pkg  # provides Pkg.dependencies(), used below

@info "environment" julia=string(VERSION) os=Sys.KERNEL arch=Sys.ARCH
deps = Dict(v.name => v.version for (_, v) in Pkg.dependencies())
@info "packages" HTTP=deps["HTTP"] Reseau=deps["Reseau"]
@info "kernel" uname=readchomp(`uname -a`)

server = HTTP.serve!(_ -> HTTP.Response(200, "ok"), "127.0.0.1", 8765)

while true
    HTTP.get("http://127.0.0.1:8765/")
    rand(100_000)
end

Dockerfile:

FROM julia:1.12.6

ENV JULIA_CPU_TARGET="x86-64"

WORKDIR /app
COPY mwe.jl Project.toml ./

RUN julia --project=@. -e 'using Pkg; Pkg.instantiate()'

ENTRYPOINT ["julia", "-t1", "--project=@.", "mwe.jl"]

Project.toml:

[deps]
HTTP = "cd3eb016-35fb-5094-929b-558a96fad6f3"

[compat]
HTTP = "2.0.0"

# Could also be a reproducer for the precompilation issue? I gave up after 40+ minutes
[preferences.HTTP]
precompile_workload = false
