An OpenMP runtime that uses GHC's Runtime System (RTS) as its
thread pool and scheduler infrastructure. Standard OpenMP C code compiled with
gcc -fopenmp runs on GHC Capabilities instead of libgomp's pthreads, enabling
seamless interoperation between Haskell and OpenMP-parallelized C code.
- Performance at or near parity with native libgomp (measured ratios from 0.93x to 1.09x across fork/join, barrier, parallel for, critical section, and DGEMM)
- Haskell FFI interop: Haskell programs call OpenMP C code via
foreign import ccall safe, with both running on the same thread pool
- Concurrent execution: Haskell green threads and OpenMP parallel regions run simultaneously without starving each other
- GC isolation: GHC's stop-the-world GC does not pause OpenMP workers (workers don't hold Capabilities)
- Bidirectional FFI: OpenMP workers call back into Haskell via FunPtr, with automatic Capability acquisition
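
The bidirectional case can be sketched from the C side as follows. This is a minimal illustration rather than code from the repository: `map_with_callback`, `unary_fn`, and `square` are hypothetical names, and `square` only stands in for a Haskell function that would reach C as a `FunPtr` (for example, one produced by a `foreign import ccall "wrapper"` declaration).

```c
#include <stddef.h>

/* Callback type. On the Haskell side this would correspond to a
   FunPtr (CDouble -> IO CDouble). Hypothetical name. */
typedef double (*unary_fn)(double);

/* Apply `f` to every element of `xs` in parallel, writing results to `ys`.
   In an OpenMP build each iteration may run on a different worker, so `f`
   is invoked from worker threads. Without -fopenmp the pragma is ignored
   and the loop runs serially. */
void map_with_callback(unary_fn f, const double *xs, double *ys, size_t n)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        ys[i] = f(xs[i]);
    }
}

/* Plain C stand-in for the Haskell callback, used only to exercise the sketch. */
double square(double x)
{
    return x * x;
}
```

Each iteration writes to a distinct slot of `ys`, so the loop body needs no synchronization of its own; any locking required by the callback is the callback's responsibility.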
Haskell program (or C host)
|
| foreign import ccall safe / direct call
v
C code with #pragma omp parallel for
|
| calls GOMP_parallel(fn, data, N, flags)
v
ghc_omp_runtime_rts.c (our OpenMP runtime)
|
| dispatches to worker pool
v
GHC RTS Capabilities (OS threads)
Cap 0 Cap 1 Cap 2 ... Cap N-1
(master) (worker) (worker) (worker)
OpenMP workers are permanent OS threads pinned to GHC Capabilities via
rts_setInCallCapability(). After initialization, they do not hold
Capabilities — they are plain OS threads spinning on atomic variables,
invisible to GHC's garbage collector.
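
To make the `GOMP_parallel(fn, data, N, flags)` step in the diagram concrete, the sketch below shows roughly what a compiler does with a parallel region: the body is outlined into a function, captured variables are packed into a struct, and the call site hands both to the runtime. Everything here is illustrative. `mock_parallel` is a serial stand-in with the same argument shape, `mock_tid` replaces `omp_get_thread_num()`, and the struct layout is an assumption, not what GCC or this runtime actually generates.

```c
#include <stddef.h>

/* Variables captured by the region, packed into one struct.
   Hypothetical layout for illustration. */
struct region_data {
    int *out;   /* one slot per thread */
};

/* Stand-in for omp_get_thread_num(); set by the mock runtime below. */
static int mock_tid = 0;

/* The body of a `#pragma omp parallel` block, outlined into a function. */
static void outlined_region(void *arg)
{
    struct region_data *d = arg;
    d->out[mock_tid] = mock_tid * 10;
}

/* Serial mock with the same argument shape as GOMP_parallel.
   A real runtime would run the N invocations concurrently on its pool. */
static void mock_parallel(void (*fn)(void *), void *data,
                          unsigned num_threads, unsigned flags)
{
    (void)flags;   /* unused in this sketch */
    for (unsigned t = 0; t < num_threads; t++) {
        mock_tid = (int)t;
        fn(data);
    }
    mock_tid = 0;
}

/* What the call site looks like after outlining. */
void run_region(int *out, unsigned num_threads)
{
    struct region_data d = { out };
    mock_parallel(outlined_region, &d, num_threads, 0);
}
```

Because the compiled program only depends on this entry-point shape, a different runtime can supply the function and decide for itself which threads execute `fn`.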
Full write-up with charts and benchmarks: https://jhhuh.github.io/ghc-openmp/
Built with MkDocs Material and the Haskell Chart library (static SVGs at build time, no client-side JS).
nix build .#docs # Build static site to ./result/
nix run .#docs # Serve locally at http://localhost:8080
nix build # Build all binaries (libghcomp.so, tests, benchmarks, demos)
nix run .#test-all # Run all tests
nix run .#bench # Run microbenchmarks
nix run .#bench-dgemm # Run DGEMM benchmark
nix develop # Enter dev shell with GHC, GCC, and tools
Prerequisites: GHC (with threaded RTS), GCC (with OpenMP support), make.
make all # Build libghcomp.so and basic tests
make build-all # Build everything (tests, benchmarks, demos)
make test-all # Run all tests
make bench # Run microbenchmarks
The Makefile auto-discovers GHC RTS include/library paths via ghc --print-libdir.
Add to your .cabal file:
build-depends: ghc-openmp
ghc-options: -threaded
The C runtime source is compiled directly into your package using your own GHC, so there is no shared library linkage and no ABI conflict.
import GHC.OpenMP
import Foreign.C.Types (CInt(..), CDouble(..))
-- Call your OpenMP C code via safe FFI
foreign import ccall safe "my_parallel_function"
  c_myFunction :: CInt -> IO CDouble
libghcomp.so is a drop-in replacement for libgomp.so. Any C program
compiled with gcc -fopenmp can use it without source changes.
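
As an illustration of the kind of unmodified OpenMP code this applies to, here is a small kernel using a parallel loop with a reduction. The function name `sum_squares` is hypothetical and the kernel is not taken from the repository; it contains nothing specific to either runtime.

```c
/* Sum of i*i for i in [0, n), parallelised with an OpenMP reduction.
   Each thread accumulates a private partial sum; the partial sums are
   combined when the loop ends. */
double sum_squares(int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += (double)i * (double)i;
    }
    return total;
}
```

The same source file can be linked against either libgomp or libghcomp; only the link or preload step differs.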
# With Nix:
nix build
ls result/lib/libghcomp.so
# Without Nix:
make build/libghcomp.so
# Compile your OpenMP program, linking against libghcomp instead of libgomp
gcc -fopenmp my_program.c -Lresult/lib -lghcomp -Wl,-rpath,result/lib -o my_program
./my_program
# Use with an existing binary (replaces libgomp at load time)
LD_PRELOAD=result/lib/libghcomp.so ./my_existing_omp_program
A ghcomp.pc.in template is shipped in data/. After installation:
gcc -fopenmp my_code.c $(pkg-config --cflags --libs ghcomp) -o my_code
cbits/
ghc_omp_runtime_rts.c # The OpenMP runtime (~1300 lines)
ghc_omp_runtime.c # Phase 1 reference stub (pthread-based)
omp_compute.c # Shared compute kernels (sinsum, dgemm, etc.)
omp_prims.cmm # Cmm primitives (zero-overhead RTS access)
omp_batch.cmm # Batched safe calls (manual suspend/resume)
HsStub.hs # Minimal Haskell module for RTS initialization
bench_overhead.c # Microbenchmark suite
bench_dgemm.c # DGEMM benchmark (native vs RTS)
test_*.c # C test programs
demos/
HsMain.hs # Haskell FFI interop demo
HsConcurrent.hs # Concurrent Haskell + OpenMP
HsGCStress.hs # GC interaction test
HsMatMul.hs # Dense matrix multiply
HsCallback.hs # Bidirectional interop (OpenMP -> Haskell)
HsCmmDemo.hs # Calling convention benchmark
HsCmmBatch.hs # Batch overhead benchmark
HsCrossover.hs # Parallelism crossover analysis
HsParCompare.hs # GHC forkIO vs OpenMP comparison
HsTaskDemo.hs # Deferred task execution
HsZeroCopy.hs # Zero-copy FFI with pinned ByteArray
HsLinearDemo.hs # Linear typed arrays demo
Data/Array/Linear.hs # Linear typed array library
inline-cmm/ # inline-cmm quasiquoter demo (separate cabal package)
lib/
GHC/OpenMP.hs # Haskell API (Haddock: jhhuh.github.io/ghc-openmp/haddock/)
| Feature | Status |
|---|---|
| #pragma omp parallel | Full |
| #pragma omp parallel for (static, dynamic, guided) | Full |
| #pragma omp barrier | Full (sense-reversing, lock-free) |
| #pragma omp critical (named and unnamed) | Full |
| #pragma omp single | Full |
| #pragma omp atomic | Fallback mutex |
| #pragma omp task / taskwait | Full (deferred + work-stealing) |
| #pragma omp sections | Full |
| #pragma omp ordered | Mutex-based |
| omp_* user API (threads, locks, timing) | Full |
| Nested parallelism | Serialized (inner regions run single-threaded) |
| Target offloading | Not applicable |

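
For readers unfamiliar with the static schedule listed above, the sketch below shows one common way to split a loop evenly among threads. It is an assumption about how such a partition can be computed, not the runtime's actual code, and `static_chunk` is a hypothetical name.

```c
/* Compute the half-open iteration range [lo, hi) handled by thread `tid`
   out of `nthreads`, for a loop of `n` iterations. The first n % nthreads
   threads each take one extra iteration so the split is as even as possible. */
void static_chunk(long n, int nthreads, int tid, long *lo, long *hi)
{
    long base  = n / nthreads;
    long extra = n % nthreads;
    *lo = tid * base + (tid < extra ? tid : extra);
    *hi = *lo + base + (tid < extra ? 1 : 0);
}
```

With 10 iterations and 4 threads this yields the ranges [0, 3), [3, 6), [6, 8), and [8, 10). Dynamic and guided schedules instead hand out chunks at run time, which costs more coordination but balances uneven iterations better.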
| Metric | Native libgomp | RTS-backed | Ratio |
|---|---|---|---|
| Fork/join | 0.931 us | 0.945 us | 1.02x (parity) |
| Barrier | 0.248 us | 0.270 us | 1.09x |
| Parallel for (1M sin) | 3.777 ms | 3.879 ms | 1.03x (parity) |
| Critical section | 0.352 ms | 0.327 ms | 0.93x (RTS faster) |

| N (DGEMM matrix dimension) | Native (ms) | RTS (ms) | Ratio |
|---|---|---|---|
| 512 | 78.59 | 83.66 | 1.06x |
| 1024 | 670.05 | 654.66 | 0.98x |

At both sizes the RTS-backed runtime is within about 6% of native libgomp, with neither consistently faster, which is consistent with measurement noise.
- RTS Boot: On first GOMP_parallel call, hs_init_ghc() initializes the GHC RTS (or increments its ref count if already running from Haskell).
- Worker Pool: N-1 OS threads are created and pinned to Capabilities 1..N-1. Each does rts_lock(); rts_unlock(); once to register with the RTS, then enters a spin-wait loop.
- Parallel Region: The master stores the work item (function pointer + data) and increments an atomic generation counter. Workers detect the generation change, participate in a sense-reversing start barrier, execute the function, then hit the end barrier.
- Synchronization: Lock-free sense-reversing centralized barriers with spin-wait (~4000 iterations) and condvar fallback for power efficiency.
- Haskell Interop: foreign import ccall safe releases the calling Capability, so other Haskell green threads run while OpenMP executes. Workers don't hold Capabilities, making them invisible to GC.
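
The sense-reversing barrier mentioned in the synchronization step can be sketched with C11 atomics. This is a textbook centralized version written for illustration: the type and function names are hypothetical, and it omits the bounded spin count and condvar fallback that the runtime is described as using.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  remaining;  /* threads still to arrive in this phase */
    atomic_bool sense;      /* flips once per completed phase */
    int         nthreads;   /* total number of participants */
} sr_barrier;

void sr_barrier_init(sr_barrier *b, int nthreads)
{
    atomic_init(&b->remaining, nthreads);
    atomic_init(&b->sense, false);
    b->nthreads = nthreads;
}

/* Each thread keeps its own `local_sense`, initially false. */
void sr_barrier_wait(sr_barrier *b, bool *local_sense)
{
    bool my_sense = !*local_sense;   /* the value that marks this phase done */
    *local_sense = my_sense;

    if (atomic_fetch_sub(&b->remaining, 1) == 1) {
        /* Last arrival: reset the counter first, then release everyone. */
        atomic_store(&b->remaining, b->nthreads);
        atomic_store(&b->sense, my_sense);
    } else {
        /* Spin until the last arrival flips the shared sense. */
        while (atomic_load(&b->sense) != my_sense) {
            /* busy-wait */
        }
    }
}
```

Flipping the sense each phase lets the same barrier object be reused immediately: a thread that races ahead to the next barrier waits for the opposite value, so it cannot be released by the previous phase.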
The runtime's Cmm primitives (Cmm is GHC's low-level intermediate representation) are callable from
Haskell via foreign import prim, the fastest calling convention GHC
offers. Arguments pass directly in STG registers with no FFI boundary at all.
For example, reading the current Capability number (equivalent to
omp_get_thread_num()) compiles to a single memory load:
#include "Cmm.h"
omp_prim_cap_no(W_ dummy) {
return (Capability_no(MyCapability()));
}
foreign import prim "omp_prim_cap_no" primCapNo# :: Int# -> Int#
The inline-cmm library automates this
pattern, letting you embed Cmm code directly in Haskell modules via a
[cmm| ... |] quasiquoter — similar to how inline-c handles C. It
automatically generates the foreign import prim declaration and compiles the
Cmm to an object file via Template Haskell.
| Convention | ns/call | Notes |
|---|---|---|
| foreign import prim (Cmm) | ~0 | GHC can optimize away (LICM, CSE) |
| foreign import ccall unsafe | ~2 | STG register save/restore |
| foreign import ccall safe | ~68 | + Capability release/reacquire |

The ~68 ns safe FFI overhead can be amortized by batching multiple C calls
within a single suspendThread/resumeThread cycle, written manually in Cmm (omp_batch.cmm):

| Batch size | Per-call cost | Speedup vs safe |
|---|---|---|
| 1 | 69 ns | 1.0x |
| 10 | 8.7 ns | 8.2x |
| 100 | 2.7 ns | 27x |
At batch=100, per-call overhead approaches unsafe FFI cost (~2 ns).
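
The shape of these numbers matches a simple amortization model: one fixed cost per suspend/resume cycle, shared by every call in the batch, plus a small cost per call. The sketch below encodes that model. The constants used in the usage note (67 ns fixed, 2 ns marginal) are fitted by hand to the table above and are assumptions, not measurements from the project; `per_call_ns` is a hypothetical helper.

```c
/* Per-call cost in nanoseconds when `batch` calls share one
   suspend/resume cycle.
   fixed_ns:    cost of the cycle itself (assumed, fitted to the table)
   marginal_ns: cost of each C call inside the cycle (assumed) */
double per_call_ns(double fixed_ns, double marginal_ns, int batch)
{
    return fixed_ns / (double)batch + marginal_ns;
}
```

With 67 ns fixed and 2 ns marginal, the model gives 69 ns at batch size 1, 8.7 ns at 10, and about 2.67 ns at 100, in line with the per-call column. As the batch grows, the fixed term vanishes and the cost tends toward the marginal term.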
BSD-3-Clause. See LICENSE.