LZ76 and FlashBack compression graphs for immune receptor repertoire analysis
LZGraphs is a Python library that turns T-cell and B-cell receptor CDR3 sequences into probabilistic directed graphs. It ships two graph families on a shared C core:
- LZGraph: built from Lempel-Ziv 76 compression. Supports V/J gene annotation, three encoding variants, and the lzg CLI.
- FlashBackGraph: a Markovian DAG built from FlashBack tokenization (recursive run-peeling from both ends of the sentinel-wrapped sequence). Diversity, entropy, and path counting have closed-form forward-DP solutions; sequence simulation is still sampled.
Both classes share a common surface for scoring, simulation, diversity, graph algebra, posterior personalization, and binary serialization. See When to use which for a comparison.
An LZGraph built from three CDR3s. @ and $ are start/end sentinels; subpattern nodes carry position suffixes.
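To make the node labels in the caption concrete, here is a minimal pure-Python sketch of a dictionary-style LZ parse. It assumes that each token is the shortest prefix of the remaining sequence that has not been emitted before, and that the position suffix is the token's end position counted with the `@` sentinel at position 1 (an inference from the `C_2` and `SL_6` node examples in this README). `lz_tokens` and `aap_nodes` are hypothetical helper names, not part of the LZGraphs API, and the C core may differ in its details.

```python
def lz_tokens(seq):
    """Split seq into tokens, each the shortest prefix not emitted before."""
    seen, tokens, start = set(), [], 0
    while start < len(seq):
        end = start + 1
        # Grow the candidate while it is already in the dictionary.
        while seq[start:end] in seen and end < len(seq):
            end += 1
        token = seq[start:end]  # may repeat an earlier token at the very end
        seen.add(token)
        tokens.append(token)
        start = end
    return tokens


def aap_nodes(seq, offset=1):
    """Label tokens as '<token>_<end position>' (assumed suffix convention)."""
    nodes, pos = [], offset  # offset=1 accounts for the '@' start sentinel
    for token in lz_tokens(seq):
        pos += len(token)
        nodes.append(f"{token}_{pos}")
    return nodes


print(lz_tokens("CASSLEPSGGTDTQYF"))
# ['C', 'A', 'S', 'SL', 'E', 'P', 'SG', 'G', 'T', 'D', 'TQ', 'Y', 'F']
print(aap_nodes("CASSLEPSGGTDTQYF")[:4])
# ['C_2', 'A_3', 'S_4', 'SL_6']
```

Consecutive tokens of one sequence become consecutive nodes on a path, so sequences that share a prefix of tokens share the corresponding path through the graph.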
pip install LZGraphs

Requires Python 3.9 or later. Wheels are published for Linux, macOS, and Windows (CPython 3.9–3.12). Release history: CHANGELOG.md.
For programmatic use, all classes accept a plain list of CDR3 strings:
LZGraph(['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF', ...], variant='aap')

For files (the CLI and FlashBackGraph.from_file), three formats are supported:
| Format | Layout | Example |
|---|---|---|
| Plain | one sequence per line | CASSLEPSGGTDTQYF |
| Seq + count | sequence\tcount (tab-separated) | CASSLEPSGGTDTQYF\t42 |
| AIRR-compatible TSV | tab-separated, with header row | junction_aa, v_call, j_call, ... |
For AIRR TSV: the sequence column is auto-detected from junction_aa / cdr3_amino_acid / cdr3_aa (variant aap), junction / cdr3_rearrangement (variant ndp), or any column named sequence/cdr3/seq. Gene calls come from v_call / j_call and must use IMGT-style notation (e.g. TRBV5-1*01). Gzipped inputs (.tsv.gz) are supported transparently.
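The column auto-detection described above can be approximated outside the library with the standard library alone. The sketch below is an assumption-laden illustration, not the library's loader: `detect_sequence_column` and `read_airr` are hypothetical names, and the priority order between the three column groups is a guess based on the order in which they are listed.

```python
import csv
import gzip

# Candidate columns in assumed priority order, paired with the variant they imply.
# None means the column name alone does not imply a variant.
CANDIDATES = [
    (("junction_aa", "cdr3_amino_acid", "cdr3_aa"), "aap"),
    (("junction", "cdr3_rearrangement"), "ndp"),
    (("sequence", "cdr3", "seq"), None),
]


def detect_sequence_column(header):
    """Return (column_name, variant) for the first matching header column."""
    for names, variant in CANDIDATES:
        for name in names:
            if name in header:
                return name, variant
    raise ValueError("no recognizable sequence column in header")


def read_airr(path):
    """Read sequences and V/J calls from a (possibly gzipped) AIRR-style TSV."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        column, variant = detect_sequence_column(reader.fieldnames or [])
        rows = [row for row in reader if row[column]]  # skip empty sequences
    seqs = [row[column] for row in rows]
    v_genes = [row.get("v_call") for row in rows]  # None if the column is absent
    j_genes = [row.get("j_call") for row in rows]
    return seqs, v_genes, j_genes, variant
```

The returned lists line up index by index, so they can be handed to the list-based constructor shown earlier when in-memory preprocessing is needed before building a graph.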
from LZGraphs import LZGraph
# Build a graph from CDR3 amino acid sequences
graph = LZGraph(
['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF', 'CASSLEPQTFTDTFFF',
'CASSLGQGSTEAFF', 'CASSLGIRRT'],
variant='aap',
)
# Score a sequence
log_p = graph.pgen('CASSLEPSGGTDTQYF')
print(f"log P(gen) = {log_p:.2f}")
# Simulate new sequences
result = graph.simulate(1000, seed=42)
print(f"Generated {len(result)} sequences")
# Diversity
print(f"D(1) = {graph.effective_diversity():.1f}")
print(f"D(2) = {graph.hill_number(2):.1f}")

from LZGraphs import LZGraph
sequences = ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF',
'CASSLEPQTFTDTFFF', 'CASSLGQGSTEAFF']
graph = LZGraph(
sequences,
variant='aap',
v_genes=['TRBV16-1*01', 'TRBV1-1*01', 'TRBV5-1*01', 'TRBV7-2*03'],
j_genes=['TRBJ1-2*01', 'TRBJ1-5*01', 'TRBJ2-7*01', 'TRBJ1-2*01'],
)
# Gene-constrained simulation
result = graph.simulate(100, sample_genes=True, seed=42)
print(result.v_genes[0], result.j_genes[0])

| Variant | Input | Node format | Best for |
|---|---|---|---|
| 'aap' | Amino acid CDR3 | C_2, SL_6 | Most TCR/BCR analysis |
| 'ndp' | Nucleotide CDR3 | TG0_4 | Nucleotide-level analysis |
| 'naive' | Any strings | C, SL | Motif discovery, ML features |
lzg build repertoire.tsv -o rep.lzg
lzg score rep.lzg sequences.txt
lzg diversity rep.lzg
lzg simulate rep.lzg -n 10000 --seed 42
lzg compare healthy.lzg disease.lzg

from LZGraphs import FlashBackGraph
# Build a Markovian DAG from CDR3 sequences
graph = FlashBackGraph(
['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF', 'CASSLEPQTFTDTFFF',
'CASSLGQGSTEAFF', 'CASSLGIRRT'],
)
# Score a sequence (exact forward DP, no MC)
log_p = graph.pgen('CASSLEPSGGTDTQYF')
print(f"log P(gen) = {log_p:.2f}")
# Simulate from the Markovian distribution
result = graph.simulate(1000, seed=42)
# Diversity, entropy, path count: closed-form via forward DP
print(f"D(1) = {graph.effective_diversity():.1f}")
print(f"D(2) = {graph.hill_number(2):.1f}")
print(f"# distinct paths = {graph.path_count:.3e}")
# SCALE: self-calibrated anomaly score for flagging atypical / error sequences
cal = graph.calibrate_scale(seed=42) # calibrate once against the graph
print(f"SCALE = {graph.scale_score('CASSLEPSGGTDTQYF', cal):.2f}") # higher = more anomalous

from LZGraphs import FlashBackGraph
# Write a tiny example file (one CDR3 per line, or seq<TAB>count for abundance)
with open('repertoire.tsv', 'w') as f:
    for s in ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF', 'CASSLGQGSTEAFF']:
        f.write(s + '\n')
graph = FlashBackGraph.from_file('repertoire.tsv')
print(graph.n_nodes, 'nodes')

For incremental or checkpointed builds over very large repertoires, use FlashBackStream: the same accumulator, with add_sequences(), snapshot(), and finalize(). See the class docstring (help(FlashBackStream)) for the streaming protocol.
| | LZGraph | FlashBackGraph |
|---|---|---|
| Tokenization | LZ76 dictionary | FlashBack (run-peeling) |
| Structure | LZ-constrained walks | Markovian DAG |
| Diversity / entropy / path count | Analytical (with MC where needed) | Closed-form forward DP |
| Self-calibrated anomaly scoring (SCALE) | No | Yes |
| V/J gene annotation & gene-conditioned simulation | Yes | No |
| Encoding variants | aap, ndp, naive | Single representation |
| CLI tool (lzg) | Yes | No |
| Streaming / incremental build | No | Yes (FlashBackStream) |
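
The "closed-form forward DP" entry in the table can be illustrated with path counting on a toy DAG. This is a generic sketch of the technique, assuming a plain successor map with `@` and `$` as sentinels; the graph and the `count_paths` helper are hypothetical and do not reflect how FlashBackGraph stores or traverses its nodes in the C core.

```python
from graphlib import TopologicalSorter

# Toy DAG as {node: [successors]}; '@' is the start and '$' the end sentinel.
edges = {
    "@": ["C"],
    "C": ["A"],
    "A": ["S", "SS"],
    "S": ["SL", "$"],
    "SS": ["$"],
    "SL": ["$"],
    "$": [],
}


def count_paths(edges, source="@", sink="$"):
    """Count distinct source-to-sink walks in one pass over a topological order."""
    # TopologicalSorter expects {node: predecessors}, so invert the edge map.
    preds = {node: set() for node in edges}
    for node, succs in edges.items():
        for succ in succs:
            preds[succ].add(node)
    counts = {node: 0 for node in edges}
    counts[source] = 1
    for node in TopologicalSorter(preds).static_order():
        for succ in edges[node]:
            counts[succ] += counts[node]  # each walk reaching node extends to succ
    return counts[sink]


print(count_paths(edges))  # 3
```

Because every node is visited once and every edge is relaxed once, the cost is linear in the size of the graph, which is why no sampling is needed when the structure is a DAG. Replacing the integer counts with probability mass gives the same pass for quantities such as entropy.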
Benchmark figures below are from a single CPU core on a 5,000-sequence amino-acid CDR3 repertoire (mean length 14.7 aa; resulting LZGraph has ~1,700 nodes, ~9,600 edges). See docs/resources/benchmarks.md for the full table and methodology.
| Operation | Throughput |
|---|---|
| Graph construction | ~50,000 sequences/sec (5k seqs in <100 ms) |
| pgen() scoring | ~5,000 sequences/sec (constant across batch sizes) |
| simulate() | ~4,800 sequences/sec |
| Hill numbers via MC (10k walks) | ~2 sec |
| Load / save .lzg | ~100× faster than rebuilding |
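
For readers wondering what "Hill numbers via MC" involves, the sketch below shows one generic Monte Carlo estimator. It assumes you have log-probabilities of sequences sampled from the model itself and uses the identity that, for x drawn from p, the expectation of p(x)^(q-1) equals the sum of p(x)^q. The toy categorical distribution and the `hill_mc` helper are hypothetical stand-ins; the library's own estimator may work differently.

```python
import math
import random


def hill_mc(log_probs, q):
    """Estimate the Hill number of order q from log p(x) of samples x ~ p."""
    n = len(log_probs)
    if q == 1:
        # D(1) = exp(H), with H estimated as the mean of -log p(x).
        return math.exp(-sum(log_probs) / n)
    # For x ~ p, E[p(x)^(q-1)] equals sum_x p(x)^q.
    moment = sum(math.exp((q - 1) * lp) for lp in log_probs) / n
    return moment ** (1.0 / (1.0 - q))


# Toy categorical model standing in for a graph's sequence distribution.
probs = {"seqA": 0.5, "seqB": 0.25, "seqC": 0.25}
rng = random.Random(42)
walks = rng.choices(list(probs), weights=list(probs.values()), k=10_000)
log_probs = [math.log(probs[w]) for w in walks]

print(f"D(1) ~ {hill_mc(log_probs, 1):.2f}")  # exact value: 2**1.5, about 2.83
print(f"D(2) ~ {hill_mc(log_probs, 2):.2f}")  # exact value: 1/0.375, about 2.67
```

The estimate tightens as the number of walks grows, which is the trade-off behind the "10k walks" figure in the table.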
For repertoires of ~100k sequences and above, graph construction stays linear and saved .lzg files round-trip in seconds. FlashBackGraph's from_file and FlashBackStream paths operate in bounded memory; we have built and validated graphs with >70,000 nodes and >11M edges this way.
Every snippet in this section is paste-and-runnable after the Setup block below. graph flags a method that works on either class; lz_graph is an LZGraph instance and fb_graph is a FlashBackGraph instance. Methods marked LZGraph-only or FlashBackGraph-only are not implemented on the other class.
from LZGraphs import LZGraph, FlashBackGraph, jensen_shannon_divergence
seqs = ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF', 'CASSLEPQTFTDTFFF',
'CASSLGQGSTEAFF', 'CASSLGIRRT']
v_genes = ['TRBV5-1*01', 'TRBV5-1*01', 'TRBV5-1*01', 'TRBV7-2*03', 'TRBV7-2*03']
j_genes = ['TRBJ2-7*01', 'TRBJ2-7*01', 'TRBJ2-7*01', 'TRBJ1-2*01', 'TRBJ1-2*01']
lz_graph = LZGraph(seqs, variant='aap', v_genes=v_genes, j_genes=j_genes)
fb_graph = FlashBackGraph(seqs)
graph = lz_graph # `graph` flags methods that work on either class
graph_a = LZGraph(seqs[:3], variant='aap')
graph_b = LZGraph(seqs[2:], variant='aap')
population = LZGraph(seqs * 4, variant='aap')
patient_seqs = ['CASSLGIRRT', 'CASSLGQGSTEAFF']
lz_reference = population
lz_sample = LZGraph(seqs, variant='aap')

# Log-probability of a sequence (works on LZGraph and FlashBackGraph alike)
graph.pgen('CASSLEPSGGTDTQYF') # single → float
graph.pgen(['seq1', 'seq2', 'seq3']) # batch → np.ndarray
# Simulate (both classes)
result = graph.simulate(1000, seed=42)
result = lz_graph.simulate(100, v_gene='TRBV5-1*01', j_gene='TRBJ2-7*01') # LZGraph only

graph.effective_diversity() # exp(Shannon entropy)
graph.hill_number(2) # inverse Simpson
graph.hill_numbers([0, 1, 2, 5]) # multiple orders → np.ndarray
# LZGraph-only
lz_graph.pgen_distribution() # analytical log-pgen distribution (Gaussian mixture)
lz_graph.predicted_richness(100_000) # expected unique seqs at depth
lz_graph.predicted_overlap(10000, 50000) # expected shared sequences
lz_graph.predict_sharing([1000]*5, max_k=5) # sharing spectrum across donors
# FlashBackGraph-only (closed-form)
fb_graph.path_count # exact count of distinct walks
cal = fb_graph.calibrate_scale(seed=0) # self-calibrate the SCALE anomaly score (once)
fb_graph.scale_score('CASSLEPSGGTDTQYF', cal) # SCALE: higher = more anomalous
fb_graph.pgen_moments() # exact moments of log-pgen distribution

combined = graph_a | graph_b # union (LZGraph and FlashBackGraph)
shared = graph_a & graph_b # intersection (both)
unique_a = graph_a - graph_b # difference (both)
personal = population.posterior(patient_seqs, kappa=10.0) # Bayesian update (both)

jsd = jensen_shannon_divergence(graph_a, graph_b) # natural log (nats): 0.0 identical, ln(2) ≈ 0.693 disjoint

graph.feature_stats() # 15-element summary vector (both classes)
# LZGraph-only
lz_reference.feature_aligned(lz_sample) # project sample into a fixed reference space
lz_graph.feature_mass_profile() # position-based mass distribution

# Both classes use the same .lzg binary format, but each file is class-specific.
lz_graph.save('rep_lz.lzg')
loaded_lz = LZGraph.load('rep_lz.lzg')
fb_graph.save('rep_fb.lzg')
loaded_fb = FlashBackGraph.load('rep_fb.lzg')

Full documentation with tutorials, concept guides, and API reference:
https://MuteJester.github.io/LZGraphs/
- Quick Start: build your first graph in 5 minutes
- Tutorials: graph construction, sequence analysis, diversity metrics
- API Reference: complete class and function reference
- CLI Reference: terminal tool documentation
If you use LZGraphs in published research, please cite the methods paper. If you also want to cite a specific software version, add the software entry below.
@article{konstantinovsky2023novel,
title={A novel approach to T-cell receptor beta chain ({TCRB}) repertoire encoding using lossless string compression},
author={Konstantinovsky, Thomas and Yaari, Gur},
journal={Bioinformatics},
volume={39},
number={7},
pages={btad426},
year={2023},
publisher={Oxford University Press},
doi={10.1093/bioinformatics/btad426}
}
@software{lzgraphs_software,
author={Konstantinovsky, Thomas},
title={{LZGraphs}: {LZ76} and {FlashBack} compression graphs for immune repertoire analysis},
url={https://github.com/MuteJester/LZGraphs},
year={2026}
}

Contributions are welcome. Please open an issue or submit a pull request.
LZGraphs builds a CPython extension from a C library at install time, so a working C toolchain is required:
- Linux: gcc or clang (any version supporting C11)
- macOS: Xcode command-line tools (xcode-select --install)
- Windows: Visual Studio Build Tools with the "Desktop development with C++" workload
Then:
git clone https://github.com/MuteJester/LZGraphs.git
cd LZGraphs
pip install -e ".[dev]" # editable install + dev extras (pytest, pytest-cov, ruff, scipy, build)
pytest # run the test suite (~505 tests)

- Fork the repository
- Create a feature branch (git checkout -b feature/my-feature)
- Add tests for new functionality; make sure pytest and pytest tests/regression/ both pass
- Commit your changes (small, focused commits preferred)
- Push and open a Pull Request describing the motivation and any API changes
MIT License. See LICENSE for details.
Thomas Konstantinovsky, thomaskon90@gmail.com