MuteJester/LZGraphs

LZGraphs

LZ76 and FlashBack compression graphs for immune receptor repertoire analysis

LZGraphs is a Python library that turns T-cell and B-cell receptor CDR3 sequences into probabilistic directed graphs. It ships two graph families on a shared C core:

  • LZGraph: built from Lempel-Ziv 76 compression. Supports V/J gene annotation, three encoding variants, and a lzg CLI.
  • FlashBackGraph: a Markovian DAG built from FlashBack tokenization (recursive run-peeling from both ends of the sentinel-wrapped sequence). Diversity, entropy, and path counting have closed-form forward-DP solutions; sequence simulation is still sampled.

Both classes share a common surface for scoring, simulation, diversity, graph algebra, posterior personalization, and binary serialization. See "When to use which" below for a comparison.

An LZGraph built from three CDR3s. @ and $ are start/end sentinels; subpattern nodes carry position suffixes.
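The node naming in the caption can be illustrated with a small pure-Python sketch. It uses a simplified dictionary-based Lempel-Ziv parse (each token is the shortest prefix not already in the dictionary) and labels every token with its 1-based end position in the sentinel-wrapped string. Both choices are assumptions inferred from the node examples C_2 and SL_6 in this README; the helper names `lz_tokens` and `positional_nodes` are hypothetical, and the library's C tokenizer may differ in detail.

```python
def lz_tokens(seq: str) -> list[str]:
    """Split seq into phrases; each phrase is the shortest prefix not yet in the dictionary."""
    seen: set[str] = set()
    tokens: list[str] = []
    phrase = ""
    for ch in seq:
        phrase += ch
        if phrase not in seen:
            seen.add(phrase)
            tokens.append(phrase)
            phrase = ""
    if phrase:
        # Leftover phrase that was already in the dictionary when the input ended.
        tokens.append(phrase)
    return tokens


def positional_nodes(seq: str) -> list[str]:
    """Label each token with its end position in '@' + seq + '$' (assumed convention)."""
    wrapped = "@" + seq + "$"
    nodes: list[str] = []
    end = 0
    for tok in lz_tokens(wrapped):
        end += len(tok)
        nodes.append(f"{tok}_{end}")
    return nodes


tokens = lz_tokens("CASSLEPSGGTDTQYF")
nodes = positional_nodes("CASSLEPSGGTDTQYF")
print(tokens)
print(nodes)
```

Under these assumptions, repeated residues extend into longer phrases (S, then SL, then SG), which is what lets sequences that share a prefix share nodes in the graph.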

Installation

pip install LZGraphs

Requires Python 3.9 or later. Wheels are published for Linux, macOS, and Windows (CPython 3.9–3.12). Release history: CHANGELOG.md.

Input format

For programmatic use, all classes accept a plain list of CDR3 strings:

LZGraph(['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF', ...], variant='aap')

For files (the CLI and FlashBackGraph.from_file), three formats are supported:

| Format | Layout | Example |
| --- | --- | --- |
| Plain | one sequence per line | CASSLEPSGGTDTQYF |
| Seq + count | sequence\tcount (tab-separated) | CASSLEPSGGTDTQYF\t42 |
| AIRR-compatible TSV | tab-separated, with header row | junction_aa, v_call, j_call, ... |

For AIRR TSV: the sequence column is auto-detected from junction_aa / cdr3_amino_acid / cdr3_aa (variant aap), junction / cdr3_rearrangement (variant ndp), or any column named sequence/cdr3/seq. Gene calls come from v_call / j_call and must use IMGT-style notation (e.g. TRBV5-1*01). Gzipped inputs (.tsv.gz) are supported transparently.
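The column auto-detection described above might look roughly like the following sketch. The column names and their variants come from the paragraph above; the priority order (amino-acid names first, then nucleotide names, then generic names), the `(None, None)` result, and the helper names `detect_sequence_column` and `read_header` are assumptions made for illustration, not the library's actual loader.

```python
import csv
import gzip
import os
import tempfile

AA_COLUMNS = ("junction_aa", "cdr3_amino_acid", "cdr3_aa")   # implies variant 'aap'
NT_COLUMNS = ("junction", "cdr3_rearrangement")              # implies variant 'ndp'
GENERIC_COLUMNS = ("sequence", "cdr3", "seq")                # variant not implied


def detect_sequence_column(header):
    """Return (column_name, variant); variant is None when the name does not imply one."""
    for name in AA_COLUMNS:
        if name in header:
            return name, "aap"
    for name in NT_COLUMNS:
        if name in header:
            return name, "ndp"
    for name in GENERIC_COLUMNS:
        if name in header:
            return name, None
    return None, None


def read_header(path):
    """Read the first TSV row, opening .gz files through gzip."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as fh:
        return next(csv.reader(fh, delimiter="\t"))


# Demo: write a tiny gzipped AIRR-style file and inspect its header.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "repertoire.tsv.gz")
    with gzip.open(path, "wt", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["junction_aa", "v_call", "j_call"])
        writer.writerow(["CASSLEPSGGTDTQYF", "TRBV5-1*01", "TRBJ2-7*01"])
    detected = detect_sequence_column(read_header(path))

print(detected)
```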

Quick Start: LZGraph

from LZGraphs import LZGraph

# Build a graph from CDR3 amino acid sequences
graph = LZGraph(
    ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF', 'CASSLEPQTFTDTFFF',
     'CASSLGQGSTEAFF', 'CASSLGIRRT'],
    variant='aap',
)

# Score a sequence
log_p = graph.pgen('CASSLEPSGGTDTQYF')
print(f"log P(gen) = {log_p:.2f}")

# Simulate new sequences
result = graph.simulate(1000, seed=42)
print(f"Generated {len(result)} sequences")

# Diversity
print(f"D(1) = {graph.effective_diversity():.1f}")
print(f"D(2) = {graph.hill_number(2):.1f}")

With gene annotation

from LZGraphs import LZGraph

sequences = ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF',
             'CASSLEPQTFTDTFFF', 'CASSLGQGSTEAFF']
graph = LZGraph(
    sequences,
    variant='aap',
    v_genes=['TRBV16-1*01', 'TRBV1-1*01', 'TRBV5-1*01', 'TRBV7-2*03'],
    j_genes=['TRBJ1-2*01', 'TRBJ1-5*01', 'TRBJ2-7*01', 'TRBJ1-2*01'],
)

# Gene-constrained simulation
result = graph.simulate(100, sample_genes=True, seed=42)
print(result.v_genes[0], result.j_genes[0])
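Before building an annotated graph it can help to sanity-check the gene lists. The sketch below assumes that v_genes and j_genes must align one-to-one with the sequences (as they do in the example above) and uses a deliberately loose pattern for IMGT-style names. The regular expression and the helper names `is_imgt_like` and `check_annotations` are hypothetical; the pattern is not the full IMGT grammar, and it treats the allele suffix as optional.

```python
import re

# Loose, illustrative pattern: locus (TRA/TRB/TRD/TRG/IGH/IGK/IGL), segment type
# (V/D/J), a gene identifier, and an optional allele such as *01.
IMGT_LIKE = re.compile(r"^(TR[ABDG]|IG[HKL])[VDJ][0-9A-Za-z/\-]*(\*\d+)?$")


def is_imgt_like(name: str) -> bool:
    return bool(IMGT_LIKE.match(name))


def check_annotations(sequences, v_genes, j_genes):
    """Raise on length mismatch; return the gene names that fail the loose pattern."""
    if not (len(sequences) == len(v_genes) == len(j_genes)):
        raise ValueError("sequences, v_genes and j_genes must have equal length")
    return [g for g in [*v_genes, *j_genes] if not is_imgt_like(g)]


sequences = ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF',
             'CASSLEPQTFTDTFFF', 'CASSLGQGSTEAFF']
v_genes = ['TRBV16-1*01', 'TRBV1-1*01', 'TRBV5-1*01', 'TRBV7-2*03']
j_genes = ['TRBJ1-2*01', 'TRBJ1-5*01', 'TRBJ2-7*01', 'TRBJ1-2*01']

bad = check_annotations(sequences, v_genes, j_genes)
print(bad)
```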

LZGraph encoding variants

| Variant | Input | Node format | Best for |
| --- | --- | --- | --- |
| 'aap' | Amino acid CDR3 | C_2, SL_6 | Most TCR/BCR analysis |
| 'ndp' | Nucleotide CDR3 | TG0_4 | Nucleotide-level analysis |
| 'naive' | Any strings | C, SL | Motif discovery, ML features |

Command line

lzg build repertoire.tsv -o rep.lzg
lzg score rep.lzg sequences.txt
lzg diversity rep.lzg
lzg simulate rep.lzg -n 10000 --seed 42
lzg compare healthy.lzg disease.lzg

Quick Start: FlashBackGraph

from LZGraphs import FlashBackGraph

# Build a Markovian DAG from CDR3 sequences
graph = FlashBackGraph(
    ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF', 'CASSLEPQTFTDTFFF',
     'CASSLGQGSTEAFF', 'CASSLGIRRT'],
)

# Score a sequence (exact forward DP, no MC)
log_p = graph.pgen('CASSLEPSGGTDTQYF')
print(f"log P(gen) = {log_p:.2f}")

# Simulate from the Markovian distribution
result = graph.simulate(1000, seed=42)

# Diversity, entropy, path count: closed-form via forward DP
print(f"D(1) = {graph.effective_diversity():.1f}")
print(f"D(2) = {graph.hill_number(2):.1f}")
print(f"# distinct paths = {graph.path_count:.3e}")

# SCALE: self-calibrated anomaly score for flagging atypical / error sequences
cal = graph.calibrate_scale(seed=42)               # calibrate once against the graph
print(f"SCALE = {graph.scale_score('CASSLEPSGGTDTQYF', cal):.2f}")  # higher = more anomalous
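To show why a Markovian DAG admits exact answers, here is a toy forward dynamic program over a hand-written graph. It is not the library's implementation: the `edges` dictionary and the `forward_dp` helper are hypothetical. It illustrates the general idea that visiting nodes in topological order gives the number of distinct start-to-end paths and the Shannon entropy of the path distribution in a single pass.

```python
import math
from graphlib import TopologicalSorter

# Toy DAG: node -> {successor: transition probability}. '@' is the start, '$' the end.
edges = {
    "@": {"A": 0.5, "B": 0.5},
    "A": {"C": 0.5, "$": 0.5},
    "B": {"C": 1.0},
    "C": {"$": 1.0},
    "$": {},
}


def forward_dp(edges, start="@", end="$"):
    """Return (number of start->end paths, entropy of the path distribution in nats)."""
    preds = {node: set() for node in edges}
    for src, outs in edges.items():
        for dst in outs:
            preds[dst].add(src)
    order = list(TopologicalSorter(preds).static_order())  # predecessors come first

    n_paths = {node: 0 for node in edges}
    reach = {node: 0.0 for node in edges}   # probability that a walk visits the node
    n_paths[start], reach[start] = 1, 1.0
    entropy = 0.0
    for node in order:
        outs = edges[node]
        # Chain rule: path entropy = sum over nodes of P(visit) * local branching entropy.
        entropy -= reach[node] * sum(p * math.log(p) for p in outs.values() if p > 0)
        for dst, p in outs.items():
            n_paths[dst] += n_paths[node]
            reach[dst] += reach[node] * p
    return n_paths[end], entropy


count, entropy = forward_dp(edges)
print(count, math.exp(entropy))   # path count and exp(entropy), i.e. D(1) of the toy graph
```

The three paths here have probabilities 0.25, 0.25, and 0.5, so the entropy is 1.5·ln 2 and the order-1 effective diversity is 2^1.5 ≈ 2.83, with no sampling involved.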

Build from a file (streaming, constant memory)

from LZGraphs import FlashBackGraph

# Write a tiny example file (one CDR3 per line, or seq<TAB>count for abundance)
with open('repertoire.tsv', 'w') as f:
    for s in ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF', 'CASSLGQGSTEAFF']:
        f.write(s + '\n')

graph = FlashBackGraph.from_file('repertoire.tsv')
print(graph.n_nodes, 'nodes')

For incremental or checkpointed builds over very large repertoires, use FlashBackStream, which exposes the same accumulator through add_sequences(), snapshot(), and finalize(). See the class docstring (help(FlashBackStream)) for the streaming protocol.
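A streaming build needs its input in bounded-size chunks. The generator below is a library-independent sketch of that step for the plain one-sequence-per-line format; `iter_batches` and its `batch_size` parameter are hypothetical names, and handling of the seq<TAB>count format is left out.

```python
import os
import tempfile
from itertools import islice


def iter_batches(path, batch_size=100_000):
    """Yield lists of at most batch_size non-empty lines without loading the whole file."""
    with open(path, "rt") as fh:
        stripped = (line.strip() for line in fh)
        seqs = (s for s in stripped if s)
        while True:
            batch = list(islice(seqs, batch_size))
            if not batch:
                return
            yield batch


# Demo on a tiny temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "repertoire.txt")
    with open(path, "w") as f:
        for s in ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF', 'CASSLEPQTFTDTFFF',
                  'CASSLGQGSTEAFF', 'CASSLGIRRT']:
            f.write(s + '\n')
    batch_sizes = [len(batch) for batch in iter_batches(path, batch_size=2)]

print(batch_sizes)
```

Each batch could then be handed to add_sequences(), with snapshot() called between batches for checkpointing; the exact arguments those methods expect are documented in help(FlashBackStream) and are not assumed here.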

When to use which

|  | LZGraph | FlashBackGraph |
| --- | --- | --- |
| Tokenization | LZ76 dictionary | FlashBack (run-peeling) |
| Structure | LZ-constrained walks | Markovian DAG |
| Diversity / entropy / path count | Analytical (with MC where needed) | Closed-form forward DP |
| Self-calibrated anomaly scoring (SCALE) | No | Yes |
| V/J gene annotation & gene-conditioned simulation | Yes | No |
| Encoding variants | aap, ndp, naive | Single representation |
| CLI tool (lzg) | Yes | No |
| Streaming / incremental build | No | Yes (FlashBackStream) |

Performance

Benchmark figures below are from a single CPU core on a 5,000-sequence amino-acid CDR3 repertoire (mean length 14.7 aa; resulting LZGraph has ~1,700 nodes, ~9,600 edges). See docs/resources/benchmarks.md for the full table and methodology.

| Operation | Throughput |
| --- | --- |
| Graph construction | ~50,000 sequences/sec (5k seqs in <100 ms) |
| pgen() scoring | ~5,000 sequences/sec (constant across batch sizes) |
| simulate() | ~4,800 sequences/sec |
| Hill numbers via MC (10k walks) | ~2 sec |
| Load / save .lzg | ~100× faster than rebuilding |

For repertoires of ~100k sequences and above, graph construction stays linear and saved .lzg files round-trip in seconds. FlashBackGraph's from_file and FlashBackStream paths operate in bounded memory; we have built and validated graphs with >70,000 nodes and >11M edges this way.
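To check figures like these on your own hardware, a small timing helper is enough. The sketch below is generic and does not call LZGraphs: `throughput` is a hypothetical helper that times any callable over a list of items and reports items per second, taking the best of a few repeats. The stand-in workload is a trivial list comprehension; in practice you would pass something like a batch scoring call.

```python
import time


def throughput(fn, items, repeats=3):
    """Return items processed per second, using the fastest of `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(items)
        best = min(best, time.perf_counter() - t0)
    return len(items) / best if best > 0 else float("inf")


# Stand-in workload: replace the lambda with the operation you want to measure.
items = ["CASSLEPSGGTDTQYF"] * 5000
rate = throughput(lambda xs: [len(x) for x in xs], items)
print(f"{rate:,.0f} items/sec")
```

As a consistency check on the table, ~50,000 sequences/sec corresponds to 5,000 / 50,000 = 0.1 s for the benchmark repertoire, in line with the quoted "<100 ms".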

Key Capabilities

Every snippet in this section can be pasted and run after the Setup block below. The name graph marks a method that works on either class; lz_graph is an LZGraph instance and fb_graph is a FlashBackGraph instance. Methods marked LZGraph-only or FlashBackGraph-only are not implemented on the other class.

Setup

from LZGraphs import LZGraph, FlashBackGraph, jensen_shannon_divergence

seqs = ['CASSLEPSGGTDTQYF', 'CASSDTSGGTDTQYF', 'CASSLEPQTFTDTFFF',
        'CASSLGQGSTEAFF', 'CASSLGIRRT']
v_genes = ['TRBV5-1*01', 'TRBV5-1*01', 'TRBV5-1*01', 'TRBV7-2*03', 'TRBV7-2*03']
j_genes = ['TRBJ2-7*01', 'TRBJ2-7*01', 'TRBJ2-7*01', 'TRBJ1-2*01', 'TRBJ1-2*01']

lz_graph = LZGraph(seqs, variant='aap', v_genes=v_genes, j_genes=j_genes)
fb_graph = FlashBackGraph(seqs)
graph    = lz_graph                      # `graph` flags methods that work on either class

graph_a      = LZGraph(seqs[:3], variant='aap')
graph_b      = LZGraph(seqs[2:], variant='aap')
population   = LZGraph(seqs * 4, variant='aap')
patient_seqs = ['CASSLGIRRT', 'CASSLGQGSTEAFF']

lz_reference = population
lz_sample    = LZGraph(seqs, variant='aap')

Scoring & Simulation

# Log-probability of a sequence (works on LZGraph and FlashBackGraph alike)
graph.pgen('CASSLEPSGGTDTQYF')               # single → float
graph.pgen(seqs[:3])                          # batch  → np.ndarray

# Simulate (both classes)
result = graph.simulate(1000, seed=42)
result = lz_graph.simulate(100, v_gene='TRBV5-1*01', j_gene='TRBJ2-7*01')  # LZGraph only

Diversity & Analytics

graph.effective_diversity()          # exp(Shannon entropy)
graph.hill_number(2)                 # inverse Simpson
graph.hill_numbers([0, 1, 2, 5])     # multiple orders → np.ndarray

# LZGraph-only
lz_graph.pgen_distribution()         # analytical log-pgen distribution (Gaussian mixture)
lz_graph.predicted_richness(100_000) # expected unique seqs at depth
lz_graph.predicted_overlap(10000, 50000)        # expected shared sequences
lz_graph.predict_sharing([1000]*5, max_k=5)     # sharing spectrum across donors

# FlashBackGraph-only (closed-form)
fb_graph.path_count                  # exact count of distinct walks
cal = fb_graph.calibrate_scale(seed=0)          # self-calibrate the SCALE anomaly score (once)
fb_graph.scale_score('CASSLEPSGGTDTQYF', cal)   # SCALE: higher = more anomalous
fb_graph.pgen_moments()              # exact moments of log-pgen distribution
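For readers unfamiliar with Hill numbers, the quantities named in the comments above follow the standard definitions: D(q) = (Σ p_i^q)^(1/(1−q)) for q ≠ 1, and D(1) = exp(Shannon entropy) as the limit. The sketch below evaluates them on an explicit probability vector; it is independent of the library, and the function name `hill` is chosen here only for illustration.

```python
import math


def hill(probs, q):
    """Hill number of order q for a discrete distribution given as probabilities."""
    p = [x for x in probs if x > 0]
    if q == 1:
        # Limit q -> 1: exponential of the Shannon entropy (natural log).
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


uniform = [0.25] * 4
skewed = [0.5, 0.25, 0.25]

print([round(hill(uniform, q), 3) for q in (0, 1, 2, 5)])
print([round(hill(skewed, q), 3) for q in (0, 1, 2, 5)])
```

For a uniform distribution every order equals the number of outcomes; for a skewed one the values decrease as q grows, because higher orders weight dominant sequences more heavily.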

Graph Algebra

combined = graph_a | graph_b          # union          (LZGraph and FlashBackGraph)
shared   = graph_a & graph_b          # intersection   (both)
unique_a = graph_a - graph_b          # difference     (both)
personal = population.posterior(patient_seqs, kappa=10.0)  # Bayesian update (both)
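This README does not spell out how kappa enters the Bayesian update. One common reading, assumed here purely for illustration, is a Dirichlet-style prior strength: the population's transition probabilities count as kappa pseudo-observations and are blended with the patient's observed counts. The function `posterior_probs` and the toy numbers are hypothetical and may not match what posterior() does internally.

```python
def posterior_probs(prior, counts, kappa):
    """Blend prior probabilities with observed counts; kappa acts as prior pseudo-count mass."""
    n = sum(counts.values())
    keys = set(prior) | set(counts)
    return {
        k: (kappa * prior.get(k, 0.0) + counts.get(k, 0)) / (kappa + n)
        for k in keys
    }


# Toy outgoing-edge distribution at one node (illustrative numbers only).
prior = {"S": 0.6, "A": 0.4}    # population-level transition probabilities
counts = {"S": 1, "A": 9}       # transitions observed in the individual

post = posterior_probs(prior, counts, kappa=10.0)
print(post)
```

Under this reading, a large kappa keeps the personalized graph close to the population, while a small kappa lets a handful of patient sequences dominate.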

Repertoire Comparison

jsd = jensen_shannon_divergence(graph_a, graph_b)  # natural log (nats): 0.0 identical, ln(2) ≈ 0.693 disjoint
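The bounds quoted in the comment above follow from the definition of Jensen-Shannon divergence with natural logarithms. The sketch below computes it for two explicit discrete distributions rather than for graphs; `jsd_nats` is a hypothetical helper, not the library function.

```python
import math


def jsd_nats(p, q):
    """Jensen-Shannon divergence (natural log) between two dict-based distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # KL(a || m); terms with zero probability contribute nothing.
        return sum(v * math.log(v / m[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


same = jsd_nats({"x": 0.5, "y": 0.5}, {"x": 0.5, "y": 0.5})
disjoint = jsd_nats({"x": 1.0}, {"y": 1.0})
print(same, disjoint, math.log(2))
```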

ML Feature Extraction

graph.feature_stats()                 # 15-element summary vector (both classes)

# LZGraph-only
lz_reference.feature_aligned(lz_sample)   # project sample into a fixed reference space
lz_graph.feature_mass_profile()           # position-based mass distribution

Serialization

# Both classes use the same .lzg binary format, but each file is class-specific.
lz_graph.save('rep_lz.lzg')
loaded_lz = LZGraph.load('rep_lz.lzg')

fb_graph.save('rep_fb.lzg')
loaded_fb = FlashBackGraph.load('rep_fb.lzg')

Documentation

Full documentation with tutorials, concept guides, and API reference:

https://MuteJester.github.io/LZGraphs/

Citation

If you use LZGraphs in published research, please cite the methods paper. If you also want to cite a specific software version, add the software entry below.

@article{konstantinovsky2023novel,
  title={A novel approach to T-cell receptor beta chain ({TCRB}) repertoire encoding using lossless string compression},
  author={Konstantinovsky, Thomas and Yaari, Gur},
  journal={Bioinformatics},
  volume={39},
  number={7},
  pages={btad426},
  year={2023},
  publisher={Oxford University Press},
  doi={10.1093/bioinformatics/btad426}
}

@software{lzgraphs_software,
  author={Konstantinovsky, Thomas},
  title={{LZGraphs}: {LZ76} and {FlashBack} compression graphs for immune repertoire analysis},
  url={https://github.com/MuteJester/LZGraphs},
  year={2026}
}

Contributing

Contributions are welcome. Please open an issue or submit a pull request.

Local development setup

LZGraphs builds a CPython extension from a C library at install time, so a working C toolchain is required:

  • Linux: gcc or clang (any version supporting C11)
  • macOS: Xcode command-line tools (xcode-select --install)
  • Windows: Visual Studio Build Tools with the "Desktop development with C++" workload

Then:

git clone https://github.com/MuteJester/LZGraphs.git
cd LZGraphs
pip install -e ".[dev]"   # editable install + dev extras (pytest, pytest-cov, ruff, scipy, build)
pytest                    # run the test suite (~505 tests)

PR checklist

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Add tests for new functionality; make sure pytest and pytest tests/regression/ both pass
  4. Commit your changes (small, focused commits preferred)
  5. Push and open a Pull Request describing the motivation and any API changes

License

MIT License. See LICENSE for details.

Contact

Thomas Konstantinovsky, thomaskon90@gmail.com
