Deepdraw is an active learning algorithm for genetic circuit design. It uses genomic foundation model (GFM) embeddings to make accurate predictions from very few experimental observations. At each iteration, Deepdraw integrates measurements from previous rounds with sequence-level circuit embeddings and proposes informative candidate designs in practical batches of 12, a 100-fold reduction relative to prior active learning approaches for circuit design.
This README is for users who want to apply Deepdraw to their own design pool. Retrospective benchmarking, Hydra sweeps, and Slurm array jobs are documented separately in job_sub/README.md.
Prerequisites: Git, Python, and uv.
git clone https://github.com/cellethology/deepdraw.git
cd deepdraw
uv sync --python 3.10
uv run deepdraw --help

Plain uv sync installs the user-facing Deepdraw runtime. If you are developing the package and want test, lint, notebook, and plotting tools, include the dev group:
uv sync --python 3.10 --group dev

Optional retrospective analysis and cluster-job tooling can be added only when needed:
uv sync --extra analysis
uv sync --extra cluster

If you are testing the current development branch before it is merged:
git checkout codex/deepdraw-user-workflow

The fastest way to verify the workflow is to run the bundled dummy example. It includes a 60-sequence design pool, a matching embeddings file, and fake first-round measurements.
uv run deepdraw init \
--pool-csv examples/deepdraw_dummy/design_pool.csv \
--embeddings examples/deepdraw_dummy/embeddings.npz \
--sequence-column sequence \
--id-column variant_id

Deepdraw writes the first batch to:
deepdraw_run/round_000_to_measure.csv
Now simulate receiving measurements from the first experimental round. In a real project, this should be a cumulative file, for example measurements.csv, that you keep appending to after each round. The dummy example provides its starter version as measurements.csv.
uv run deepdraw suggest \
--run-dir deepdraw_run \
--measurements examples/deepdraw_dummy/measurements.csv \
--label-column Expression

Deepdraw trains on all rows in the measurement file and writes the next batch to:
deepdraw_run/round_001_to_measure.csv
The dummy example uses the same defaults as a real run: first batch size 12, later batch size 12, seed 0, ProbCover-euclidean initial selection, BoTorch GP prediction, and MES acquisition. See Useful Flags for ways to change output location, batch size, model, acquisition strategy, and preprocessing.
Deepdraw expects you to start with an unlabeled design pool. You do not need any experimental measurements for the first round.
Create a CSV with one row per candidate design. Include a sequence column and, preferably, a stable design ID column.
variant_id,sequence
variant_001,ATGCGTACGTTAGCGA
variant_002,ATGCGTACGATAGCAA
variant_003,ATGCGTACGCTAGCTA

The stable ID column is recommended because it makes measurement files easier to merge across rounds.
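As a sketch, a pool CSV in this shape can be written with the standard library; the column names variant_id and sequence match the example above, and the sequences here are placeholders for your real candidate designs:

```python
import csv

# Hypothetical candidate designs; in practice these come from your design tool.
designs = [
    ("variant_001", "ATGCGTACGTTAGCGA"),
    ("variant_002", "ATGCGTACGATAGCAA"),
    ("variant_003", "ATGCGTACGCTAGCTA"),
]

with open("designs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["variant_id", "sequence"])  # header row
    writer.writerows(designs)                    # one row per candidate design
```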
Generate one embedding vector per design using your chosen genomic foundation model. Deepdraw currently expects embeddings to be provided as an NPZ file; embedding generation itself is outside the deepdraw CLI.
Required NPZ structure:
{
"embeddings": np.ndarray, # shape: (num_designs, embedding_dim)
"ids": np.ndarray, # variant IDs or row indices aligned to the pool CSV
}

sample_ids is also accepted instead of ids. If you pass --id-column variant_id, the NPZ IDs should match that CSV column. If you do not pass an ID column, use row indices 0, 1, 2, ....
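For illustration, a compatible NPZ can be written with numpy.savez; the keys embeddings and ids are the names described above, while the embedding values here are random placeholders standing in for real GFM output:

```python
import numpy as np

num_designs, embedding_dim = 3, 8  # placeholder sizes for illustration
rng = np.random.default_rng(0)

# Random vectors stand in for real GFM embeddings, one row per pool design.
embeddings = rng.normal(size=(num_designs, embedding_dim)).astype(np.float32)

# IDs aligned to the pool CSV; these match --id-column variant_id.
ids = np.array(["variant_001", "variant_002", "variant_003"])

np.savez("embeddings.npz", embeddings=embeddings, ids=ids)
```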
Run deepdraw init to choose the first experimental batch from embeddings only.
uv run deepdraw init \
--pool-csv designs.csv \
--embeddings embeddings.npz \
--sequence-column sequence \
--id-column variant_id \
--output-dir runs/my_deepdraw_run \
--starting-batch-size 12 \
--batch-size 12

Outputs:
runs/my_deepdraw_run/
├── deepdraw_state.json
├── latest_recommendations.csv
├── round_000_to_measure.csv
└── selection_history.csv
Send round_000_to_measure.csv to the wet lab.
Recommendation CSVs keep the original design-pool columns; selection_history.csv
adds only deepdraw_round so you can see which batch each design came from.
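Because selection_history.csv only adds the deepdraw_round column, round membership can be inspected with stdlib csv alone. This sketch builds a miniature selection_history.csv (column names as in this README) and counts designs per round:

```python
import csv
from collections import Counter

# A miniature selection_history.csv as Deepdraw would write it
# (original pool columns plus deepdraw_round).
with open("selection_history.csv", "w", newline="") as f:
    f.write("variant_id,sequence,deepdraw_round\n")
    f.write("variant_001,ATGCGTACGTTAGCGA,0\n")
    f.write("variant_005,ATGCGTACGAAAGCGA,0\n")
    f.write("variant_009,ATGCGTACGCAAGTTA,1\n")

# Count how many designs each batch contributed.
with open("selection_history.csv") as f:
    per_round = Counter(row["deepdraw_round"] for row in csv.DictReader(f))
```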
After the first experiment, create one cumulative measurements CSV, such as measurements.csv. The easiest approach is to copy round_000_to_measure.csv and add a measured label column.
variant_id,sequence,Expression
variant_001,ATGCGTACGTTAGCGA,1.42
variant_005,ATGCGTACGAAAGCGA,3.87
variant_009,ATGCGTACGCAAGTTA,5.11

Keep the stable ID column, such as variant_id, in every measurement update. Deepdraw uses that column to map measurements back to the original design pool.
You can include extra measured designs that were not recommended by Deepdraw, as long as they are present in the original design pool. deepdraw suggest trains on every measured design in measurements.csv, then excludes all measured designs from the next recommendation batch.
Keep updating this same file over time. After you measure round_001_to_measure.csv, append those new rows to measurements.csv; after round_002_to_measure.csv, append those rows too. Each deepdraw suggest call expects the measurements file to contain every previously recommended design with a measured label.
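The append step can be done in any tool that preserves the header and ID column; as a sketch with stdlib csv, where the new rows and their Expression values are made up for illustration:

```python
import csv

# Starter cumulative file: a copy of round_000_to_measure.csv plus the label column.
with open("measurements.csv", "w", newline="") as f:
    f.write("variant_id,sequence,Expression\n")
    f.write("variant_001,ATGCGTACGTTAGCGA,1.42\n")

# After measuring round_001, append the newly labeled rows (values are hypothetical).
new_rows = [
    ("variant_013", "ATGCGTACGTTTGCGA", 2.61),
    ("variant_021", "ATGCGTACGGTAGCGA", 4.05),
]
with open("measurements.csv", "a", newline="") as f:
    csv.writer(f).writerows(new_rows)  # earlier rounds' rows are left untouched
```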
Run deepdraw suggest with the cumulative measurement table:
uv run deepdraw suggest \
--run-dir runs/my_deepdraw_run \
--measurements measurements.csv \
--label-column Expression

This writes:
runs/my_deepdraw_run/round_001_to_measure.csv
Measure that batch, append the new labels to the same measurements.csv, and run deepdraw suggest again:
uv run deepdraw suggest \
--run-dir runs/my_deepdraw_run \
--measurements measurements.csv \
--label-column Expression

The next output will be:
runs/my_deepdraw_run/round_002_to_measure.csv
The loop is:
design pool + embeddings
|
v
deepdraw init
|
v
measure round_000
|
v
deepdraw suggest
|
v
measure round_001
|
v
append round_001 labels to measurements.csv
|
v
deepdraw suggest
|
v
repeat
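The suggest step of this loop can be scripted around the CLI; suggest_command here is a hypothetical helper that just assembles the deepdraw suggest invocation shown earlier, for use with subprocess:

```python
# Assemble one `deepdraw suggest` invocation; flags mirror the commands above.
def suggest_command(run_dir: str, measurements: str, label_column: str) -> list:
    return [
        "uv", "run", "deepdraw", "suggest",
        "--run-dir", run_dir,
        "--measurements", measurements,
        "--label-column", label_column,
    ]

# Each round: measure the latest batch, append labels to measurements.csv, then e.g.:
# subprocess.run(suggest_command("runs/my_deepdraw_run", "measurements.csv", "Expression"), check=True)
```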
For a first real campaign, use the defaults unless you have a reason to compare strategies:
uv run deepdraw init \
--pool-csv designs.csv \
--embeddings embeddings.npz \
--sequence-column sequence \
--id-column variant_id \
--output-dir runs/my_deepdraw_run \
--starting-batch-size 12 \
--batch-size 12

The default workflow uses:
- initial selection: probcover_euclidean
- predictor: gp
- query strategy: mes
- feature transforms: standardize
- target transforms: log_standardize
For very small smoke tests, you can use faster settings: random, ridge_regressor, topk, and no transforms.
The examples above use production defaults. You can override pieces of the workflow when you need a different experimental setup or a faster local smoke test.
Change the number of designs per round:
uv run deepdraw init \
--pool-csv designs.csv \
--embeddings embeddings.npz \
--sequence-column sequence \
--id-column variant_id \
--output-dir runs/my_deepdraw_run \
--starting-batch-size 24 \
--batch-size 12

Choose where outputs are written:
uv run deepdraw init \
--pool-csv designs.csv \
--embeddings embeddings.npz \
--sequence-column sequence \
--id-column variant_id \
--output-dir runs/my_deepdraw_run

If the run directory already exists, deepdraw init stops rather than overwriting deepdraw_state.json. Add --force only when you intentionally want to overwrite an existing run directory.
Use a faster local smoke-test configuration by swapping in lighter model and transform settings:
uv run deepdraw init \
--pool-csv examples/deepdraw_dummy/design_pool.csv \
--embeddings examples/deepdraw_dummy/embeddings.npz \
--sequence-column sequence \
--id-column variant_id \
--output-dir /tmp/deepdraw_dummy_fast_run \
--starting-batch-size 4 \
--batch-size 3 \
--seed 11 \
--initial-selection-strategy random \
--predictor ridge_regressor \
--query-strategy topk \
--feature-transforms none \
--target-transforms none \
--force

Common override flags:
- --starting-batch-size: number of designs in the first batch.
- --batch-size: number of designs in each later batch.
- --seed: random seed for reproducible initial selection and stochastic model components.
- --initial-selection-strategy: first-round strategy, such as probcover_euclidean, core_set, or random.
- --predictor: model used after measurements arrive, such as gp or ridge_regressor. Existing botorch_* names are also accepted.
- --query-strategy: acquisition strategy for later rounds, such as mes, qlog_nei, or topk. Existing botorch_* names are also accepted.
- --feature-transforms: feature preprocessing config, such as standardize or none.
- --target-transforms: label preprocessing config, such as log_standardize or none.
- --log-level: progress output verbosity; default is INFO, use WARNING for quieter runs.
Create a run and select the first batch:
uv run deepdraw init --help

Train on measured labels and select the next batch:
uv run deepdraw suggest --help

Required/common arguments:
- --pool-csv: CSV containing candidate designs.
- --embeddings: NPZ containing GFM embeddings aligned to the design pool.
- --sequence-column: sequence column in the design pool CSV.
- --id-column: optional stable design ID column in the pool CSV.
- --output-dir: run directory where Deepdraw writes state and recommendations. Defaults to deepdraw_run.
- --measurements: cumulative CSV containing measured labels for all previous Deepdraw recommendations.
- --label-column: measured target column, such as Expression or Fold Change.
├── deepdraw/ # User-facing Deepdraw CLI and workflow
├── examples/deepdraw_dummy/ # Tiny runnable example
├── core/ # Active learning models, strategies, and trainers
├── job_sub/ # Retrospective benchmark and Slurm/Hydra tooling
├── test/ # Unit and workflow tests
└── utils/ # Supporting utilities
uv run pytest test/test_deepdraw_workflow.py
uv run pytest

Citation information will be added with the manuscript/release.