High Performance Media Processing Distributed Data Flow: Zero-Copy And Kernel Bypass

Abstract

This article uses common serialization protocols—Protobuf, JSON, and the high-performance Cap'n Proto—as a medium to introduce the design philosophies behind serialization tools, their application in Media Processing Distributed Data Flow (MPDDF), and considerations in technology selection and zero-copy optimisation in real-world applications.

Subsequently, it discusses network transmission protocols such as gRPC and the high-performance RDMA, exploring RDMA's design principles, its application, and kernel bypass optimization within MPDDF.

Finally, the article presents some leading optimization techniques in related fields, including hardware-assisted Protobuf encoder and disaggregated memory CXL, offering potential directions for future exploration.

Introduction

Media Processing Distributed Data Flow (MPDDF)

Distributed Media Processing System, in industry, often refers to a scalable and developer-friendly framework designed for building and orchestrating end-to-end media processing workflows. It enables rapid development, flexible integration, and efficient deployment of video and image processing tasks, including atomic operators (e.g., enhancement, decoding), standalone processing modules (such as Lambda-style functions), and complex distributed workflows. For performance, it often supports both batch and streaming execution.

Media Processing Distributed Data Flow (MPDDF) is the data‐movement substrate at the heart of any large‑scale media processing system. It defines how media payloads—video frames, audio snippets, metadata—traverse a graph of operators (decoders, encoders, filters, analyzers) in a reliable, efficient, and schema‑aware fashion.

The diagram depicts a classic streaming‐operator pipeline in which video frames flow continuously between streaming operators for concurrent processing.

The media‐processing engine sits atop this pipeline, handling task orchestration, container scheduling, intermediate‐artifact storage, and inter‐container messaging. Within its MPDDF layer, communication performance is paramount: every end‑to‑end handoff incurs costs from serialization/deserialization, buffer allocation and NUMA placement, one or more memory copies, system calls and kernel‑mode switches, socket or RDMA I/O, and ultimately NIC DMA and network traversal. Optimizing MPDDF means minimizing these stages so that frames advance through the pipeline with the fewest software and hardware hops possible.

Protobuf

Proto is a commonly used data serialization/deserialization protocol by MPDDF in industry practice.

Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML_, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages._

Why the name "Protocol Buffers"? The name originates from the early days of the format, before we had the protocol buffer compiler to generate classes for us. At the time, there was a class called ProtocolBuffer which actually acted as a buffer for an individual method. Users would add tag/value pairs to this buffer individually by calling methods like AddValue(tag, value). The raw bytes were stored in a buffer which could then be written out once the message had been constructed. Since that time, the "buffers" part of the name has lost its meaning, but it is still the name we use. Today, people usually use the term "protocol message" to refer to a message in an abstract sense, "protocol buffer" to refer to a serialized copy of a message, and "protocol message object" to refer to an in-memory object representing the parsed message.

Protobuf Wire Format for Encoding

The efficiency of Protobuf lies in its encoding format, which uses a T-(L)-V (Tag-Length-Value) structure. Each field has a unique tag that serves as its identifier. The length represents the size of the value data. However, length is not always required — it's omitted for fixed-length values. The value is the actual content of the data.

The tag consists of two parts: field_number and wire_type. The field_number is the number assigned to each field in the message definition, while wire_type indicates the type of the field—whether it's fixed-length or variable-length. There are currently six wire_type values (0 to 5), which means only 3 bits are needed to represent them. The structure of a tag is illustrated below.

A simple message

message SubMessage {  
  int32 c = 1;  
}  
  
message Test1 {  
  int32 a = 1;  
  string b = 2;  
  SubMessage d = 3;  
}

Construct in code

Test1 {  
  a: 150,  
  b: "hi",  
  d: { c: 1000 }  
}

Serialized Output (in hex):

08 96 01 12 02 68 69 1a 03 e8 07

Explain:

Encoding the Field a = 150

Field number = 1

Wire type = 0 (varint for int32)

Tag = Field number << 3 | Wire type

= (1 << 3) | 0 = 0x08 Value

In Protobuf, integers are encoded using Varint, a variable-length encoding format. Varints use one or more bytes. Each byte:

Uses the lower 7 bits for data.
Uses the highest bit (MSB) to indicate if more bytes follow:
1 → more bytes follow.
0 → this is the last byte.

The bytes are:

96 (hex) = 0b10010110 (binary)

01 (hex) = 0b00000001 (binary)

Remove the MSB from each byte:

From 0b10010110, remove the MSB → 0b00010110 = 0x16 = 22
From 0b00000001, remove the MSB → 0b0000001 = 0x01 = 1

Reconstruct the integer (little-endian):

In Varint, the first byte holds the least significant bits.

So we combine the bits as:

value = (1 << 7) | 22 = 128 + 22 = 150

So based on the bytes, it's possible to reverse-engineer the proto structure, including the type and value of each field. Media Processing system console can use this to implement the formatted display of execution's input/output.

Static Deserialize (with generated code)

Static parsing means using pre-generated code to decode protobuf data. In Go, this is done by generating .pb.go files from your .proto files.

When Deserialize:

Uses the hardcoded field numbers, types, and structure in the generated code.
Parses the binary data accordingly and fills the Go struct.

Dynamic Deserialize (Proto Descriptor)

A proto descriptor is a special protobuf message that describes the structure of a .proto file itself. It contains all the information about:

Messages, fields, enums, services
Field types and numbers
Packages and syntax (proto2/proto3)

It allows tools or programs to understand protobuf data without needing the original .proto source file.

To generate proto descriptor:

protoc --descriptor_set_out=desc.pb --include_imports your_file.proto

Programs can parse this descriptor file at runtime, use it to generate dynamic proto message instance, and then use the instance to deserialize protobuf blob through dynamic reflection, without generated code.

Media Processing system console can use proto descriptor to convert proto blob to human-readable format for debugging purpose.

{  
  "name":"John",   
  "age":30,   
  "car":null  
}

Benchmarking Protobuf and JSON: Performance and Size

Performance benchmarks reveal clear differences between the two formats:

Size: Protocol Buffers produce payloads that are approximately three times smaller than equivalent JSON data. (Can be further compressed using tools like gzip, reduce size by around 50%)
Serialization Speed: Proto serializes data over ten times faster than JSON.
Deserialization: For static deserialization, Proto is also over 6 times faster than JSON. For dynamic deserialization, both formats perform similarly.

Cost for Serialization/Deserialization

Although proto buffers' serialization/deserialization is already very fast, it still requires a significant amount of CPU computation and memory overhead.

In Go, the serialization process using bytes, err := proto.Marshal(message) typically involves the following steps:

Message Size Estimation & Buffer Allocation
Traverse each field of the message to compute the encoded size.
Once the total size is known, allocate a []byte buffer (e.g., via malloc) with size at least equal to the computed size.
In high-concurrency scenarios, especially when dealing with large messages, allocating buffers may trigger virtual memory expansion for the Golang processor. This can lead to system calls (e.g., mmap) to request virtual memory from the OS, or lead to GC, causing kernel mode switches.
Allocating large size buffers also implies a higher possibility of triggering GC.
Encoding Process
Each field is encoded sequentially. For example, varints are converted into their encoded byte representations, and string/bytes fields involve memory copying. The encoder keeps moving the writing offset and writes the bytes into the buffer.
For large messages, writing into a new page on the page table is likely to trigger a page fault. This causes a mode switch, and the OS will allocate physical memory.

For deserialization, the process differs slightly, but the overall overhead is largely similar.

Benchmarking Analysis

On a Debian 10 linux 5.4 machine with a c1.2xlarge.v6 | 8 cores | 16G__, with a 1080P video frame message, size of 3.110449 MB after encoding (including a 3.1104MB video frame**[]byte** field and some other meta field), w_ith a concurrency of_ 8 qps (1 qps per go-routine and 8 go-routines in total, after each run, sleep 1 second), here is the round-trip encoding time benchmark.

Under this setting, over 100 runs, the encoding round-trip time of a 1080P video frame message is at around 7.5235ms. This level of latency is considerable for a latency-sensitive dataflow engine system.

Performance profiling indicates that GC is not the primary contributor to the delay. Instead, the main root cause appears to be frequent page faults and kernal functions (mmp), which introduce substantial overhead.

Note

For my encoding round trip test, it includes the following procedures:

Malloc a continuous byte array in user mode memory
Marshal the message into the byte array (process's user mode memory heap runs out, acquire memory from kernel, page fault).
Create a pointer to a new message, with all fields empty.
Unmarshal the bytes to the new message, includes many malloc and memory copies, which is also likely to page fault.

The message contains over 3 MB frame data.

I think this can reflect the online scenario more realistically, and can reflect more performance differences between Capnp and ProtoBuf in real world scenarios.

Go benchmark results from running N iterations serially in a single goroutine (with no sleep)

BenchmarkProtoAVFrame-12                4464            237764 ns/op

Proto Buffers encoding round trip time by go benchmark (using same message) is 237.764µs

Note

To simulate 8 QPS, time.Sleep()was used between runs. Experiments show that this is the main reason for the performance difference between the two metric results. The exact cause remains to be further investigated, but the initial hypothesis is about CPU frequency ramp‑up and CPU cache invalidation.

Cap'n'proto

Cap’n Proto is an insanely fast data interchange format and capability-based RPC system. Think JSON, except binary. Or think Protocol Buffers, except faster. In fact, in benchmarks, Cap’n Proto is INFINITY TIMES faster than Protocol Buffers.

This benchmark is, of course, unfair. It is only measuring the time to encode and decode a message in memory. Cap’n Proto gets a perfect score because there is no encoding/decoding step. The Cap’n Proto encoding is appropriate both as a data interchange format and an in-memory representation, so once your structure is built, you can simply write the bytes straight out to disk!

MPDDF, as a high-throughput, latency-sensitive streaming dataflow transmission engine, responsible for handling data exchange and storage between media processing streaming operators. It can use Cap'n Proto as the serialization protocol.

Cap’n Proto uses a platform-independent, byte-for-byte encoding with fixed widths, offsets, and alignment. Pointers are offset-based (not absolute). This design ensures both portability and performance. Backwards compatibility is preserved by appending new fields to the end of structs or reusing padding space, leaving existing field positions unchanged.

Sample Usage (Golang)

After correctly installing Cap’n Proto, can declare IDLs in a way similar to Protocol Buffers.

using Go = import "/go.capnp";  
@0xabcdefabcdefabcdef;  
$Go.package("capnp_gen");  
$Go.import("test");  
  
struct Inner {  
  innerA @0 :UInt16;  
  innerB @1 :UInt32;  
}  
struct Outer {  
  outerA @0 :UInt16;  
  inner @1 :Inner;  
  data @2 :Text;  
}

Generate code with IDL
Create objects

arena := capnp.SingleSegment(nil)  
msg, seg, err := capnp.NewMessage(arena)  
if err != nil {  
    panic(err)  
}  
outer, err := capnp_gen.NewRootOuter(seg)  
if err != nil {  
    panic(err)  
}  
inner, err := outer.NewInner()  
if err != nil {  
    panic(err)  
}  
inner.SetInnerA(4)  
inner.SetInnerB(5)  
_ = outer.SetData("Bob")  
outer.SetOuterA(6)

Marshal & Unmarshal

b, err := msg.Marshal()  
if err != nil {  
    panic(err)  
}  
newMsg, err := capnp.Unmarshal(b)  
if err != nil {  
    panic(err)  
}  
newOuter, err := capnp_gen.ReadRootOuter(newMsg)  
if err != nil {  
    panic(err)  
}

Core Components & Responsibilities

Arena (Allocator Context)

Manages raw memory buffers of one root message.
Owns a list of Segments and a backing buffer pool. And handles on-demand allocation of new Segments, (using an exponential growth strategy to minimize copies).

Message

Represents a single Cap’n Proto message’s API surface (getters/setters).
Holds a reference to its Arena, does not itself contain data—reads and writes are directed into the Arena’s Segments.

Segment (Contiguous Buffer)

By default, an 8-byte-aligned slice of memory ([]byte), treated as a sequence of 64-bit words.
Stores the fixed-size struct section (fields + pointer words) and the variable-length data section.
Each pointer word encodes a (segmentIndex, wordOffset) pair, so readers know exactly where data lives.

Relationships

1 Arena → N Segments: An Arena can grow to many Segments as needed.
1 Message → 1 Arena: Each Message binds to a single Arena for all its allocations.
1 Segment → 1 Arena: Segments belong exclusively to their Arena and aren’t shared across Messages.

Message Wire Format & Memory representation

Take a simple message struct with only a single segment as an example

struct Inner {  
  innerA @0 :UInt16;  
  innerB @1 :UInt32;  
}  
struct Outer {  
  outerA @0 :UInt16;  
  inner @1 :Inner;  
  data @2 :Text;  
}  
  
Outer {  
  outerA: 6,  
  inner: {  
    innerA: 4,  
    innerB: 5  
  },  
  data: 'Bob'  
}

memory representation

[ 0  0  0  0  1  0  2  0     6  0  0  0  0  0  0  0  
  4  0  0  0  1  0  0  0     5  0  0  0 34  0  0  0  
  4  0  0  0  5  0  0  0     66 111  98  0  0  0  0  0 ]

Explain

Inter-Segment Pointer

When a pointer needs to point to a different segment, offsets no longer work. Instead, encode the pointer as a far pointer:

The pointer would be something that looks like this:

0A 00 00 00 | 01 00 00 00 0A = 01 | 0 | 10 Indicates that it is a far pointer, where the landing pad offset is at 1 in segments[1]

Serialisation & Deserialisation

For the native implementation of Capnp's serialisation, there's no need to encode/decode the message anymore. It allocates a new large memory bytes buffer, and copies all the raw bytes from different segments into the large buffer. Even though there is no encoding/decoding process, it still requires memory copy (likely to introduce page faults), which means it can still slow down the encoding round-trip time.

However, for the nature of Capnp's in-memory encoding representation, raw bytes can be directly deserialised between platforms. Hence, raw bytes of segments can be sent directly between services via network, without extra buffer allocation or memory copying, Capnp's Golang SDK provides encode/decode methods for achieving this (Scatter-Gather pattern).

// Server Side Code  
arena := capnp.MultiSegment(nil)  
msg, seg, _ := capnp.NewMessage(arena)  
outer, _ := capnp_gen.NewRootOuter(seg)  
outer.SetOuterA(6)  
ln, _ := net.Listen("tcp", ":12345")  
conn, _ := ln.Accept()  
encoder := capnp.NewEncoder(conn)  
encoder.Encode(msg)  
  
// Encode method is similar to:  
send([][]byte{  
    arena.Segment(0).Data(),  
    arena.Segment(1).Data(),  
    ...  
})

Benchmarking Analysis

On exactly the same machine, same video frame message, same concurrency, but encode/decode with Capnp, here is the round-trip encoding time benchmark.

Under the same setting, over 100 runs, the encoding round-trip time of a 1080P video frame message is at around 16.229µs, with almost no page faults, in comparison with 7.5235ms with ProtoBuf, its 463x times faster, while the message size is only 0.00002x times larger.

Go benchmark results from running N iterations serially in a single goroutine (with no sleep)

BenchmarkProtoAVFrame-12                4464            237764 ns/op

Capnp encoding round trip time by go benchmark (using same message) is 135.5ns, in comparison with 237.764µs with Protobuf, its 1754x times faster.

Note

To simulate 8 QPS, time.Sleep() was used between runs. Experiments show that this is the main reason for the performance difference between the two metric results. The exact cause remains to be further investigated, but the initial hypothesis is about context switches and CPU cache invalidation.

Advantages & Drawbacks

Advantages

Highly performant serialization: No need to allocate extra buffer on the heap and encode/memory copy.
Inter-process communication via shared memory: Multiple processes running on the same machine can share a Cap’n Proto message via shared memory.
Incremental reads: Can begin parsing before full message arrives, since outer objects appear entirely before inner objects.
Random access: Can read just one field of a message without parsing the whole thing.
mmap: Read a large capnp file by mmap it. The OS won’t even read in the parts that you don’t access.

Drawbacks

Larger message size: The wire format is not compact, which can increase bandwidth usage during network transmission.
Costly updates: Resetting variable-sized message fields (e.g., string and bytes), requires allocating new memory, making it less suitable for frequently mutating messages (unless directly modify raw bytes).


[0 0 0 0 1 0 3 0 6 0 0 0 0 0 0 0 13 0 0 0 34 0 0 0 4 0 0 0 1 0 0 0 
9 0 0 0 58 0 0 0 4 0 0 0 5 0 0 0 66 111 98 0 0 0 0 0 66 111 98 66 111 98 0 0] 

[0 0 0 0 1 0 3 0 6 0 0 0 0 0 0 0 21 0 0 0 130 0 0 0 4 0 0 0 1 0 0 0 
9 0 0 0 58 0 0 0 4 0 0 0 5 0 0 0 66 111 98 0 0 0 0 0 66 111 98 66 111 98 0 0 
66 111 98 66 111 98 66 111 98 66 111 98 66 111 97 0]

No shared substructures: Multiple root messages cannot reference the same inner object due to its wire format (can use orphan, but need to separately manage memory allocation and lifecycle).
Initial memory overhead: Cap’n Proto allocates an 8KB segment by default, which can be wasteful unless the approximate message size is known and tuned.
Safety & debugging: Cap’n Proto does support out-of-bounds check for safety. But pollution to pointer fields can lead to undefined behavior (such as infinite loops), which is very difficult to troubleshoot.

MPDDF, as a streaming dataflow system for media processing scenarios with high data throughput, high currency, and high latency sensitivity. It handles streaming workflows where video frames (type of []byte) are the primary source of data volume.

For video frame message types, Protobuf only applies some compression to the metadata. Based on experiments, the difference in message size between Cap’n Proto and Protobuf after serialization is negligible.

Since video frame content is typically fixed upon generation, and the data flows occur within the same data center with very low network latency, it is well-suited for a Cap'n Proto-based serialization and RPC solution.

Cost of RPC

Even though Cap’n Proto is designed for high performance, its RPC process still incurs fundamental system-level costs. Capnp's native RPC protocal or GRPC use traditional network data path, it needs to copy the buffers using the sockets API from user space to kernal space. In the kernel, the data path includes the TCP, IPv4/6 stack all the way down to the device driver and eventually the network fabric. All these steps require CPU cycles for processing. With high bandwidth networks today (25,40,50, and 100GbE), the amount of CPU time required to process data and put that data on the wire can still become a bottleneck.

Cap'n'proto X RDMA

Overview Remote Direct Memory Access (RDMA) is an extension of the Direct Memory Access (DMA) technology, which is the ability to access host memory directly without CPU intervention. RDMA allows for accessing memory data from one host to another. A key characteristic of RDMA is that it greatly improves throughput and performance while lowering latency because less CPU cycles are needed to process the network packets.

The core idea of RDMA is to offload key data transfer workloads from the CPU to the NIC hardware, such as memory access, transport protocol handling, and retransmission control. This reduces system calls or kernel mode transitions for CPU and increases NIC utilization. In MPDDF, RDMA can be used for transmitting Capnp message data.

RDMA schema - RoCEv2

Each RDMA schema requires NIC cards with different capabilities. Here we use RoCEv2 as an example. Compared to a “plain” Ethernet NIC, a RoCEv2‑capable NIC (often called an RNIC) must support:

Queue Pair (QP) Engines

Each RDMA session requires the creation of a Queue Pair (QP), which includes and Send Queue (SQ) and Receive Queue (RQ). Applications post Work Requests (WRs) to these queues using ibv_post_send() and ibv_post_recv(). The RoCEv2 NIC maintains and executes these queues in hardware asynchronously.

Memory Registration & Address Protection

RDMA requires the user to register memory onto NIC using ibv_reg_mr(), which maps virtual memory to physical addresses. The NIC manages page tables translations (the foundation for zero-copy memory access in RDMA).

RDMA Protocol Stack in Hardware

The NIC has a fully offloaded RDMA verbs engine. Also handles reliable Connected (RC) mode with hardware-based reliability. All this is performed in hardware with no CPU involvement.

UDP/IP Header Push-down

UDP/IP headers are generated by the NIC automatically. The NIC supports header insertion, IP/UDP checksum offload, so applications do not need to use sockets or construct UDP packets manually.

Data Sending Pipeline

Tunning - Completion Queue Polling

Polling the Completion Queue (CQ) can be done in two modes:

Polling mode: continuously call ibv_poll_cq() in a loop to check for completions.
Event-driven mode: register for CQ events with ibv_req_notify_cq() and wait for an interrupt/event notification.

Polling mode wastes CPU cycles because it continuously checks for completions without blocking, while event-driven mode requires system calls to wait for events, introducing latency and some overhead.

A hybrid approach combines both: it polls the CQ actively for a certain number of times, and if no completions are found, it switches to event-driven mode to save CPU resources.

Batch polling: submitting multiple write requests and then poll the CQ to retrieve multiple completion entries in a single call, reducing overhead.

Tunning - Staging Store

By default, Cap’n Proto allocates memory for each segment using calloc, which results in the OS assigning a new physical memory page to the user process.

Before sending data via an RDMA write queue, the user process must register the memory segment with the NIC via a system call, so the NIC can map the segment in its page table.

Cap’n Proto allows customizing the memory allocation strategy by overriding the allocateSegment() method. If there are multiple segments needing to be sent, a large shared memory buffer can be pre-allocated in advance , registered once with the NIC, and the capnp's allocateSegment() method is overridden to allocate segment memory via mmap from this shared memory. This eliminates all system calls and kernel mode switches during data transmission.

The diagram below shows one possible implementation. The Staging Store is responsible for creating the shared memory region and its bitmap header. The bitmap header divides the shared memory into a number of fixed‑size logical blocks, using one bit per block to indicate whether it is free or in use. User processes then allocate memory directly from this shared memory.

Real World MPDDF Implementation

In the real-world implementation inside MPDDF, it can be much more sophisticated than this example. Because it may need to support failure replay and intermediate-result persistence, the Staging Store can be actually implemented as an independent distributed service. It maintains its own reference‑counting mechanism to track, release, and reuse memory blocks. In this way, the Staging Store can also decouple interaction between different streaming operators, so that the failure of one operator does not impact the others.

Hardware Assisted Accelerator for ProtoBuf by Google

The Protocol Buffers hardware accelerator is a specialized module integrated into a RISC-V System-on-Chip (SoC), designed to offload serialization and deserialization tasks from the CPU. This integration leads to significant performance improvements, with evaluations showing an average speedup of 6.2× to 11.2× compared to a baseline RISC-V SoC with BOOM out-of-order cores, and a 3.8× improvement over a Xeon-based server, despite the RISC-V SoC's less robust supporting components .

It interfaces with the CPU via the RoCC (Rocket Custom Coprocessor) interface, allowing the CPU to dispatch custom RISC-Vinstructions directly to the accelerator with minimal latency . It has integrated with a modified version of the open-source protobuf library and maintains compatibility with standard protobuf formats

By offloading the computationally intensive tasks of serialization and deserialization, the hardware accelerator reduces CPU workload, lowers latency, and enhances overall system performance, making it particularly beneficial in data-intensive environments such as large-scale data centers (Karandikar et al., 2021).

Disaggregated Memory - CXL by Intel

Compute Express Link (CXL) as a high-performance, cache-coherent interconnect standard designed to enable memory disaggregation in data centers. By allowing CPUs to access remote memory with load/store semantics while maintaining cache coherence, CXL facilitates flexible and efficient pooling of memory resources separate from compute nodes. This architecture supports scalable, low-latency memory sharing, addressing key challenges in modern disaggregated systems. （Das Sharma et al., 2024）

In typical practical applications of MPDDF, streaming operators communicate via a third service (staging store) through RDMA. This setup is a potential scenario where disaggregated memory solutions can be utilized to enhance performance.

CXL integration with PolarDB by Alibaba

At the 2025 SIGMOD/PODS Conference, Alibaba Cloud presented a paper titled "Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases," introducing PolarCXLMem—a disaggregated memory system based on Compute Express Link (CXL) technology. This system aims to address the limitations of traditional RDMA-based disaggregated memory solutions, which often suffer from:

Read/Write Amplification: Inefficient data access leading to performance bottlenecks.
Limited Bandwidth: Network bandwidth constraints hindering data transfer speeds.
Inefficient Recovery: Slow data recovery processes after system crashes.
Data Sharing Challenges: Complex mechanisms for data sharing between nodes, affecting system scalability.

Performance evaluations conducted on PolarDB, Alibaba Cloud's widely deployed cloud-native database, demonstrated that PolarCXLMem can improve throughput by up to 2.1× in pooling scenarios and 1.55× in sharing scenarios compared to RDMA-based systems (Yang et al., 2025).

These findings suggest that integrating CXL-based disaggregated memory solutions like PolarCXLMem can potentially enhance the performance and scalability of systems relying on inter-operator communication through staging stores, such as MPDDF.

Current Bottlenecks of MPDDF

Current two-stage RDMA path: streaming operator A → (RDMA) → staging store → (RDMA) → streaming operator B, requires:

Memory allocation on Staging;
Two full network round trips and multiple memory copies/DMA operations;

Key advantages of CXL also include, pooled memory resources, elastic memory scaling, physical resource isolation with logical access control, and improved tenant isolation, which is natively friendly to cloud service with multi-tenancy scenarios.

Hardware Requirements for CXL-based Disaggregated Memory

CXL-compatible CPU or Host Bridge
CXL-capable motherboard or PCIe slot
CXL Memory Devices (Memory Blades)
CXL Switch (for multi-host sharing)

Reference

Debendra Das Sharma, Robert Blankenship, and Daniel Berger. 2024. An Introduction to the Compute Express Link (CXL) Interconnect. ACM Comput. Surv. 56, 11, Article 290 (November 2024), 37 pages. https://doi.org/10.1145/3669900

Sagar Karandikar, Chris Leary, Chris Kennelly, Jerry Zhao, Dinesh Parimi, Borivoje Nikolic, Krste Asanovic, and Parthasarathy Ranganathan. 2021. A Hardware Accelerator for Protocol Buffers. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '21). Association for Computing Machinery, New York, NY, USA, 462–478. https://doi.org/10.1145/3466752.3480051

Xinjun Yang, Yingqiang Zhang, Hao Chen, Feifei Li, Gerry Fan, Yang Kong, Bo Wang, Jing Fang, Yuhui Wang, Tao Huang, Wenpu Hu, Jim Kao, and Jianping Jiang. 2025. Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases. In Companion of the 2025 International Conference on Management of Data (SIGMOD/PODS '25). Association for Computing Machinery, New York, NY, USA, 689–702. https://doi.org/10.1145/3722212.3724460

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
images		images
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

High Performance Media Processing Distributed Data Flow: Zero-Copy And Kernel Bypass

Abstract

Introduction

Media Processing Distributed Data Flow (MPDDF)

Protobuf

Protobuf Wire Format for Encoding

Static Deserialize (with generated code)

Dynamic Deserialize (Proto Descriptor)

Benchmarking Protobuf and JSON: Performance and Size

Cost for Serialization/Deserialization

Benchmarking Analysis

Cap'n'proto

Sample Usage (Golang)

Core Components & Responsibilities

Message Wire Format & Memory representation

Inter-Segment Pointer

Serialisation & Deserialisation

Benchmarking Analysis

Advantages & Drawbacks

Cost of RPC

Cap'n'proto X RDMA

RDMA schema - RoCEv2

Data Sending Pipeline

Tunning - Completion Queue Polling

Tunning - Staging Store

Real World MPDDF Implementation

Hardware Assisted Accelerator for ProtoBuf by Google

Disaggregated Memory - CXL by Intel

CXL integration with PolarDB by Alibaba

Current Bottlenecks of MPDDF

Hardware Requirements for CXL-based Disaggregated Memory

Reference

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages