ghc-openmp: GHC's Runtime System as an OpenMP Runtime

An OpenMP runtime that uses GHC's Runtime System (RTS) as its thread pool and scheduler infrastructure. Standard OpenMP C code compiled with gcc -fopenmp runs on GHC Capabilities instead of libgomp's pthreads, enabling seamless interoperation between Haskell and OpenMP-parallelized C code.

Key Results

Performance parity with native libgomp (ratios between 0.93x and 1.09x) across all benchmarks (fork/join, barrier, parallel for, DGEMM)
Haskell FFI interop: Haskell programs call OpenMP C code via foreign import ccall safe, with both running on the same thread pool
Concurrent execution: Haskell green threads and OpenMP parallel regions run simultaneously without starving each other
GC isolation: GHC's stop-the-world GC does not pause OpenMP workers (workers don't hold Capabilities)
Bidirectional FFI: OpenMP workers call back into Haskell via FunPtr, with automatic Capability acquisition

Architecture

Haskell program (or C host)
        |
        | foreign import ccall safe / direct call
        v
C code with #pragma omp parallel for
        |
        | calls GOMP_parallel(fn, data, N, flags)
        v
ghc_omp_runtime_rts.c  (our OpenMP runtime)
        |
        | dispatches to worker pool
        v
GHC RTS Capabilities (OS threads)
  Cap 0    Cap 1    Cap 2    ...    Cap N-1
  (master) (worker) (worker)        (worker)

OpenMP workers are permanent OS threads pinned to GHC Capabilities via rts_setInCallCapability(). After initialization, they do not hold Capabilities — they are plain OS threads spinning on atomic variables, invisible to GHC's garbage collector.

Documentation

Full write-up with charts and benchmarks: https://jhhuh.github.io/ghc-openmp/

Built with MkDocs Material and the Haskell Chart library (static SVGs at build time, no client-side JS).

nix build .#docs       # Build static site to ./result/
nix run .#docs         # Serve locally at http://localhost:8080

Building

With Nix (recommended)

nix build              # Build all binaries (libghcomp.so, tests, benchmarks, demos)
nix run .#test-all     # Run all tests
nix run .#bench        # Run microbenchmarks
nix run .#bench-dgemm  # Run DGEMM benchmark
nix develop            # Enter dev shell with GHC, GCC, and tools

Without Nix

Prerequisites: GHC (with threaded RTS), GCC (with OpenMP support), make.

make all               # Build libghcomp.so and basic tests
make build-all         # Build everything (tests, benchmarks, demos)
make test-all          # Run all tests
make bench             # Run microbenchmarks

The Makefile auto-discovers GHC RTS include/library paths via ghc --print-libdir.

As a Haskell library (cabal)

Add to your .cabal file:

build-depends: ghc-openmp
ghc-options:   -threaded

The C runtime source is compiled directly into your package using your own GHC — no shared library linkage, no ABI conflicts.

import Foreign.C.Types (CDouble, CInt)
import GHC.OpenMP

-- Call your OpenMP C code via safe FFI
foreign import ccall safe "my_parallel_function"
    c_myFunction :: CInt -> IO CDouble
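
For reference, the C side of that import could look like the sketch below. Only the symbol name my_parallel_function and its int-to-double shape come from the import above; the body (a sum of squares) is a hypothetical placeholder chosen so the result is easy to check. Compiled with gcc -fopenmp the loop is split across the runtime's workers; without the flag the pragma is ignored and the loop runs serially, giving the same result up to floating-point summation order.

```c
/* Hypothetical kernel behind the `my_parallel_function` import.
 * Takes the element count and returns sum_{i=0}^{n-1} i*i as a double. */
double my_parallel_function(int n) {
    double sum = 0.0;
    /* Each thread accumulates a private partial sum; OpenMP combines
     * the partial sums when the loop finishes. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += (double)i * (double)i;
    }
    return sum;
}
```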

Drop-in libgomp Replacement (C projects)

libghcomp.so is a drop-in replacement for libgomp.so. Any C program compiled with gcc -fopenmp can use it without source changes.
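
No source changes are needed because gcc -fopenmp does not emit threading code itself: it outlines each parallel region into a function and calls GOMP_* entry points, which are provided by whichever runtime is linked. The sketch below hand-writes that lowering for a small loop. It is simplified: sketch_parallel is a hypothetical serial stand-in with the same argument shape as GOMP_parallel(fn, data, N, flags), and sketch_tid / sketch_nthreads stand in for omp_get_thread_num() and omp_get_num_threads(). The code gcc actually generates differs in detail.

```c
/* Variables captured by the parallel region, packed the way a compiler
 * would pass them through the single `void *data` argument. */
struct region_data {
    const double *x;
    double *y;
    double a;
    int n;
};

/* Hypothetical stand-ins for omp_get_thread_num() / omp_get_num_threads(). */
static int sketch_tid;
static int sketch_nthreads;

/* Outlined region body: each thread handles one contiguous chunk. */
static void region_fn(void *arg) {
    struct region_data *d = arg;
    int chunk = (d->n + sketch_nthreads - 1) / sketch_nthreads;
    int lo = sketch_tid * chunk;
    int hi = (lo + chunk < d->n) ? lo + chunk : d->n;
    for (int i = lo; i < hi; i++)
        d->y[i] += d->a * d->x[i];
}

/* Same argument shape as GOMP_parallel, but this stand-in simply runs
 * the "threads" one after another on the calling thread. */
static void sketch_parallel(void (*fn)(void *), void *data,
                            unsigned num_threads, unsigned flags) {
    (void)flags;
    sketch_nthreads = (int)num_threads;
    for (sketch_tid = 0; sketch_tid < sketch_nthreads; sketch_tid++)
        fn(data);
}

/* Roughly what `#pragma omp parallel for` over y[i] += a * x[i] becomes. */
void axpy_lowered(int n, double a, const double *x, double *y) {
    struct region_data d = { x, y, a, n };
    sketch_parallel(region_fn, &d, 4, 0);
}
```

Swapping sketch_parallel for a different implementation changes how the chunks are executed but not the caller, which is the property a drop-in runtime relies on.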

Build the shared library

# With Nix:
nix build
ls result/lib/libghcomp.so

# Without Nix:
make build/libghcomp.so

Link against it

# Compile your OpenMP program, linking against libghcomp instead of libgomp
gcc -fopenmp my_program.c -Lresult/lib -lghcomp -Wl,-rpath,result/lib -o my_program
./my_program

LD_PRELOAD (no recompilation)

# Use with an existing binary — replaces libgomp at load time
LD_PRELOAD=result/lib/libghcomp.so ./my_existing_omp_program

pkg-config

A ghcomp.pc.in template is shipped in data/. After installation:

gcc -fopenmp my_code.c $(pkg-config --cflags --libs ghcomp) -o my_code

Project Structure

cbits/
  ghc_omp_runtime_rts.c      # The OpenMP runtime (~1300 lines)
  ghc_omp_runtime.c          # Phase 1 reference stub (pthread-based)
  omp_compute.c              # Shared compute kernels (sinsum, dgemm, etc.)
  omp_prims.cmm              # Cmm primitives (zero-overhead RTS access)
  omp_batch.cmm              # Batched safe calls (manual suspend/resume)
  HsStub.hs                  # Minimal Haskell module for RTS initialization
  bench_overhead.c           # Microbenchmark suite
  bench_dgemm.c              # DGEMM benchmark (native vs RTS)
  test_*.c                   # C test programs

demos/
  HsMain.hs                  # Haskell FFI interop demo
  HsConcurrent.hs            # Concurrent Haskell + OpenMP
  HsGCStress.hs              # GC interaction test
  HsMatMul.hs                # Dense matrix multiply
  HsCallback.hs              # Bidirectional interop (OpenMP -> Haskell)
  HsCmmDemo.hs               # Calling convention benchmark
  HsCmmBatch.hs              # Batch overhead benchmark
  HsCrossover.hs             # Parallelism crossover analysis
  HsParCompare.hs            # GHC forkIO vs OpenMP comparison
  HsTaskDemo.hs              # Deferred task execution
  HsZeroCopy.hs              # Zero-copy FFI with pinned ByteArray
  HsLinearDemo.hs            # Linear typed arrays demo
  Data/Array/Linear.hs       # Linear typed array library
  inline-cmm/                # inline-cmm quasiquoter demo (separate cabal package)

lib/
  GHC/OpenMP.hs              # Haskell API (Haddock: jhhuh.github.io/ghc-openmp/haddock/)

Implemented OpenMP Features

Feature	Status
`#pragma omp parallel`	Full
`#pragma omp parallel for` (static, dynamic, guided)	Full
`#pragma omp barrier`	Full (sense-reversing, lock-free)
`#pragma omp critical` (named and unnamed)	Full
`#pragma omp single`	Full
`#pragma omp atomic`	Fallback mutex
`#pragma omp task` / `taskwait`	Full (deferred + work-stealing)
`#pragma omp sections`	Full
`#pragma omp ordered`	Mutex-based
`omp_*` user API (threads, locks, timing)	Full
Nested parallelism	Serialized (inner regions run single-threaded)
Target offloading	Not applicable

Benchmark Results (4 threads, i7-10750H)

Microbenchmarks

Metric	Native libgomp	RTS-backed	Ratio
Fork/join	0.931 us	0.945 us	1.02x (parity)
Barrier	0.248 us	0.270 us	1.09x
Parallel for (1M sin)	3.777 ms	3.879 ms	1.03x (parity)
Critical section	0.352 ms	0.327 ms	0.93x (RTS faster)

DGEMM (dense matrix multiply)

N	Native (ms)	RTS (ms)	Ratio
512	78.59	83.66	1.06x
1024	670.05	654.66	0.98x

All six ratios above fall between 0.93x and 1.09x, so performance is indistinguishable from native libgomp within measurement noise.

How It Works

RTS Boot: On the first GOMP_parallel call, hs_init_ghc() initializes the GHC RTS (or increments its ref count if it is already running from Haskell).
Worker Pool: N-1 OS threads are created and pinned to Capabilities 1..N-1. Each does rts_lock(); rts_unlock(); once to register with the RTS, then enters a spin-wait loop.
Parallel Region: The master stores the work item (function pointer + data) and increments an atomic generation counter. Workers detect the generation change, participate in a sense-reversing start barrier, execute the function, then hit the end barrier.
Synchronization: Lock-free sense-reversing centralized barriers with spin-wait (~4000 iterations) and condvar fallback for power efficiency.
Haskell Interop: foreign import ccall safe releases the calling Capability, so other Haskell green threads run while OpenMP executes. Workers don't hold Capabilities, making them invisible to GC.
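
The dispatch and synchronization steps above can be sketched in portable C11. This is an illustration under simplifying assumptions, not the code in ghc_omp_runtime_rts.c: it uses plain pthreads with no GHC RTS involvement, waits with sched_yield() instead of a bounded spin with condvar fallback, and has only an end barrier. All names (barrier_t, run_parallel, pool_demo, and so on) are hypothetical.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NTHREADS 4  /* master + 3 workers */

/* Centralized sense-reversing barrier. */
typedef struct {
    atomic_int  remaining;  /* threads that have not yet arrived */
    atomic_bool sense;      /* flips each time the barrier opens */
} barrier_t;

static void barrier_wait(barrier_t *b, bool *local_sense) {
    bool my_sense = !*local_sense;      /* value that ends this phase */
    *local_sense = my_sense;
    if (atomic_fetch_sub(&b->remaining, 1) == 1) {
        atomic_store(&b->remaining, NTHREADS);  /* re-arm for next phase */
        atomic_store(&b->sense, my_sense);      /* release the waiters */
    } else {
        while (atomic_load(&b->sense) != my_sense) sched_yield();
    }
}

/* Work item published by the master, plus the generation counter. */
static void (*work_fn)(void *data, int tid);
static void *work_data;
static atomic_uint generation;
static atomic_bool shutting_down;
static barrier_t end_barrier;

static void *worker_main(void *arg) {
    int tid = (int)(intptr_t)arg;
    unsigned seen = 0;      /* last generation this worker handled */
    bool sense = false;
    for (;;) {
        while (atomic_load(&generation) == seen) sched_yield();
        seen++;
        if (atomic_load(&shutting_down)) return NULL;
        work_fn(work_data, tid);
        barrier_wait(&end_barrier, &sense);
    }
}

/* Master side of one parallel region. */
static void run_parallel(void (*fn)(void *, int), void *data, bool *sense) {
    work_fn = fn;
    work_data = data;
    atomic_fetch_add(&generation, 1);   /* publish the work item */
    fn(data, 0);                        /* master runs as thread 0 */
    barrier_wait(&end_barrier, sense);
}

static void count_hit(void *data, int tid) { ((int *)data)[tid]++; }

/* Runs `regions` parallel regions and returns the total number of
 * per-thread executions, which should equal regions * NTHREADS. */
int pool_demo(int regions) {
    int hits[NTHREADS] = {0};
    pthread_t threads[NTHREADS];
    bool sense = false;

    atomic_store(&end_barrier.remaining, NTHREADS);
    atomic_store(&end_barrier.sense, false);
    atomic_store(&generation, 0);
    atomic_store(&shutting_down, false);

    for (int t = 1; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker_main, (void *)(intptr_t)t);
    for (int r = 0; r < regions; r++)
        run_parallel(count_hit, hits, &sense);

    atomic_store(&shutting_down, true);
    atomic_fetch_add(&generation, 1);   /* wake workers so they can exit */
    for (int t = 1; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    int total = 0;
    for (int t = 0; t < NTHREADS; t++) total += hits[t];
    return total;
}
```

The generation counter is what lets workers stay parked between regions without a lock: the master's atomic increment both wakes them and publishes the work item written just before it.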

Cmm Primitives and inline-cmm

The runtime also provides Cmm (GHC's low-level intermediate representation) primitives that are callable from Haskell via foreign import prim, the fastest calling convention GHC offers. Arguments pass directly in STG registers, with no FFI boundary at all.

For example, reading the current Capability number (equivalent to omp_get_thread_num()) compiles to a single memory load:

#include "Cmm.h"

omp_prim_cap_no(W_ dummy) {
    return (Capability_no(MyCapability()));
}

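-- Needs the GHCForeignImportPrim, MagicHash, and UnliftedFFITypes extensions; Int# is exported by GHC.Exts.
foreign import prim "omp_prim_cap_no" primCapNo# :: Int# -> Int#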

The inline-cmm library automates this pattern, letting you embed Cmm code directly in Haskell modules via a [cmm| ... |] quasiquoter — similar to how inline-c handles C. It automatically generates the foreign import prim declaration and compiles the Cmm to an object file via Template Haskell.

Calling Convention Overhead

Convention	ns/call	Notes
`foreign import prim` (Cmm)	~0	GHC can optimize away (LICM, CSE)
`foreign import ccall unsafe`	~2	STG register save/restore
`foreign import ccall safe`	~68	+ Capability release/reacquire

Batched Safe Calls

The ~68ns safe FFI overhead can be amortized by batching multiple C calls within a single suspendThread/resumeThread cycle, written manually in Cmm:

Batch size	Per-call cost	Speedup vs safe
1	69 ns	1.0x
10	8.7 ns	8.2x
100	2.7 ns	27x

At batch=100, per-call overhead approaches unsafe FFI cost (~2 ns).
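
The table is consistent with a simple amortization model: one fixed suspend/resume transition shared across the batch, plus a constant cost per C call. The sketch below encodes that model. The function name and the constants used with it (about 67 ns for the transition and 2 ns per call, back-fitted from the batch=1 and batch=100 rows) are assumptions, not measurements from the project.

```c
/* Amortized per-call cost when `batch` C calls share one
 * suspend/resume transition. All times are in nanoseconds. */
double per_call_ns(double transition_ns, double call_ns, int batch) {
    return transition_ns / (double)batch + call_ns;
}
```

With those assumed constants, per_call_ns(67.0, 2.0, b) gives 69 ns at b=1, about 8.7 ns at b=10, and about 2.7 ns at b=100, in line with the measured column, and it tends toward the ~2 ns unsafe-call cost as the batch grows.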

License

BSD-3-Clause. See LICENSE.
