StaB-ddG predicts mutational effects on binding interaction energies ($\Delta \Delta G$).
We provide
- Installation instructions,
- An example demonstrating how to make predictions, and
- Training and inference code to reproduce the results of our paper.
We use conda to manage required dependencies. There are few required packages (see environment.yaml); alternatively, instead of creating a new environment, you can run StaB-ddG in an existing environment with PyTorch after running pip install tqdm scipy wandb pandas.
git clone git@github.com:LDeng0205/StaB-ddG.git
cd StaB-ddG
conda env create -f environment.yaml
conda activate stabddg
# (optionally) install stabddg as a package
pip install -e .
StaB-ddG requires as input PDB files and a csv file with mutations of interest, and predicts the change in binding free energy ($\Delta \Delta G$) for each mutation.
For a single ddG prediction, StaB-ddG requires a pdb file, the chains specifying the two binders, and a mutation string.
- `--pdb_path`: path to the wild type structure.
- `--mutation`: a single mutation or multiple mutations separated by commas. For example, `YH103H,QC7R` denotes a double mutant. The first character of a mutation string denotes the wild type amino acid, the second character the chain, followed by the position in the chain, and lastly the amino acid to mutate to. For example, `YH103H` denotes a mutation from Y to H at position 103 of chain H. Note: we assume that the residue id of each chain starts at 1. A function for renumbering residue indices is provided in `stabddg/utils.py`.
- `--chains`: chains specifying what interface the $\Delta \Delta G$ is computed over for a multichain complex, separated by an underscore. For example, `ABC_DE` denotes that the energy is computed between the interface of chains `ABC` and `DE`.
- `--mc_samples`: the number of Monte Carlo samples to average over for variance reduction. Set to 20 by default.
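The mutation string format above can be parsed mechanically. A minimal sketch (the `parse_mutation` helper here is illustrative, not part of the StaB-ddG package):

```python
# Illustrative parser for the mutation string format described above;
# not part of the StaB-ddG package.
def parse_mutation(mut: str):
    """Split e.g. 'YH103H' into (wt_aa, chain, position, mut_aa)."""
    wt_aa, chain = mut[0], mut[1]
    position = int(mut[2:-1])  # 1-indexed position within the chain
    mut_aa = mut[-1]
    return wt_aa, chain, position, mut_aa

# A comma-separated string denotes a multi-mutant:
muts = [parse_mutation(m) for m in "YH103H,QC7R".split(",")]
# muts == [('Y', 'H', 103, 'H'), ('Q', 'C', 7, 'R')]
```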
We provide an example that predicts the effect of the mutation EA63Q,QD30V,KA66A on 1AO7.
python run_stabddg.py --pdb_path examples/one_mutation/1AO7.pdb \
--mutation EA63Q,QD30V,KA66A --chains ABC_DE
This will create a directory 1AO7_output/ with the following files:
- `output.csv`: contains the predicted $\Delta \Delta G$ in column `pred_1`
- `1AO7.pdb`: the wild type structure
- `1AO7_ABC.pdb`: the wild type structure with only chains `ABC`
- `1AO7_DE.pdb`: the wild type structure with only chains `DE`
- `1AO7_ABC_DE_cache.pkl`: cache of intermediate output
- `input.csv`: intermediate input file
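A sketch of reading the prediction back with the standard library. Only the `pred_1` column is documented above; the other columns and the value shown are assumptions for illustration:

```python
import csv
import io

# In practice: rows = list(csv.DictReader(open("1AO7_output/output.csv")))
# Hypothetical contents: only 'pred_1' is documented above; the
# '#Pdb' and 'mutation' columns and the value 0.42 are assumptions.
example = io.StringIO(
    "#Pdb,mutation,pred_1\n"
    '1AO7_ABC_DE,"EA63Q,QD30V,KA66A",0.42\n'
)
rows = list(csv.DictReader(example))
ddg = float(rows[0]["pred_1"])  # predicted ddG for this mutant
```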
We provide the following checkpoints. The final checkpoint, stabddg.pt, should be used for inference.
./model_ckpts/
├── proteinmpnn.pt # soluble ProteinMPNN weights (v_48_020.pt) from https://github.com/dauparas/ProteinMPNN
├── stability_finetuned.pt # model weights after stability fine-tuning
└── stabddg.pt # final model fine-tuned on both stability and binding data.
A list of mutations across different complexes can be provided in the form of a mutation csv file (--csv_path). The mutation csv file should contain two columns, #Pdb and mutation. The #Pdb column should contain the name of the complex PDB file concatenated with an underscore and the chains of the two binders without the .pdb suffix (e.g. 1AO7_ABC_DE corresponding to the interface between chains ABC and DE for 1AO7.pdb). The mutation column contains the mutations of interest with the same format as described above. In addition, a --pdb_dir should be specified that contains the wild type structures of the mutations of interest.
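The csv can be written with the standard library; a minimal sketch, where the mutations themselves are hypothetical examples:

```python
import csv

# Hypothetical mutations; the '#Pdb' values follow the
# <pdb name>_<chains>_<chains> convention described above.
rows = [
    {"#Pdb": "1AO7_ABC_DE", "mutation": "EA63Q,QD30V,KA66A"},
    {"#Pdb": "1AO7_ABC_DE", "mutation": "YH103H"},
]
with open("mutations.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["#Pdb", "mutation"])
    writer.writeheader()
    writer.writerows(rows)
```

Note that multi-mutant strings contain commas, so the `mutation` field is quoted automatically by the csv writer.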
An example is provided in examples/list_of_mutations. Predictions can be generated using the following command:
python run_stabddg.py --pdb_dir examples/list_of_mutations/pdbs \
--csv_path examples/list_of_mutations/mutations.csv
By default, this command will create an output directory in the same directory as the mutant csv file. The predictions are saved in output/output.csv.
The two fine-tuning sections map onto the two fine-tuning steps described in the paper. First, we fine-tune on the Megascale protein folding stability dataset, then fine-tune on SKEMPI. The fine-tuning runs can be optionally tracked on Wandb with the flag --wandb.
First download the data from https://zenodo.org/records/7992926. Specifically, the files needed are Tsuboyama2023_Dataset2_Dataset3_20230416.csv and AlphaFold_model_PDBs.zip. This takes several minutes.
data_dir=<destination_for_files>
cd $data_dir
wget https://zenodo.org/records/7992926/files/AlphaFold_model_PDBs.zip
wget https://zenodo.org/records/7992926/files/Processed_K50_dG_datasets.zip
unzip AlphaFold_model_PDBs.zip
unzip Processed_K50_dG_datasets.zip
Then from the repo directory, launch stability fine-tuning as
python stability_finetune.py \
--stability_data $data_dir/Processed_K50_dG_datasets/Tsuboyama2023_Dataset2_Dataset3_20230416.csv \
--pdb_dir $data_dir/AlphaFold_model_PDBs
The above script requires PyTorch access to a single GPU (it places tensors on torch.device('cuda')).
The model checkpoints will be saved in cache/stability_finetuned by default.
A filtered version of the SKEMPI csv is provided in data/SKEMPI/filtered_skempi.csv. The PDB files can be downloaded from https://life.bsc.es/pid/skempi2/database/index.
cd $data_dir
wget https://life.bsc.es/pid/skempi2/database/download/SKEMPI2_PDBs.tgz
tar -xvzf SKEMPI2_PDBs.tgz # Data now in $data_dir/PDBs/
After the download, launch fine-tuning from the repo directory.
python skempi_finetune.py --train_split_path data/SKEMPI/train_pdb.pkl \
--run_name RUN_NAME \
--checkpoint model_ckpts/stability_finetuned.pt \
--skempi_pdb_dir $data_dir/PDBs \
--skempi_path data/SKEMPI/filtered_skempi.csv
The model checkpoints will be saved in cache/skempi_finetuned by default.
Training and evaluation on the SKEMPI binding energy data involves several files that we provide in ./data/SKEMPI/.
./data/SKEMPI/
├── skempi_v2.csv # unfiltered SKEMPI csv file from https://life.bsc.es/pid/skempi2/database/index
├── quality_filtering.ipynb # filtering script corresponding to the criteria described in our paper
├── skempi_splits.ipynb # splitting script corresponding to the interface-homology cluster splitting described in our paper
├── filtered_skempi.csv # the output of quality_filtering.ipynb (a subset of the rows in `skempi_v2.csv`); an additional column gives the cluster assignment associated with the train/test split.
├── test_clusters.(pkl/txt) # test split
├── test_pdb.pkl # test split pdbs
├── train_clusters.(pkl/txt) # train split
└── train_pdb.pkl # train split pdbs
To reproduce results from the paper, run the following command.
data_dir=<location of SKEMPI2_PDBs from life.bsc.es/pid/skempi2>
python skempi_eval.py --skempi_pdb_dir $data_dir/PDBs
Running the evaluation script will create a csv file containing the predictions, labels, mutations, and the PDB names in cache/eval.csv. A notebook is provided in baselines/read_results_skempi_test.ipynb to parse the results and reproduce numbers reported in the paper. Along with the script to compute metrics, we also provide StaB-ddG, FoldX, and Flex ddG predictions for reproducibility.
./baselines/
├── read_results_skempi_test.ipynb # notebook for computing metrics for prediction csv files.
├── eval_utils.py # script containing code for computing metrics, including t tests and cluster-bootstrapping.
├── foldx.csv # csv file containing FoldX predictions for train + test splits.
├── flexddg.csv # csv file containing Flex ddG predictions for train + test splits.
└── StaB-ddG.csv # csv file containing StaB-ddG predictions for the test split.
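The cluster bootstrap resamples whole complexes (clusters) with replacement, so that correlated mutations from the same complex stay together when estimating a confidence interval. A minimal sketch for Pearson r, illustrative only (the paper's metrics are computed by `baselines/eval_utils.py`):

```python
import random
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def cluster_bootstrap_r(preds_by_cluster, labels_by_cluster, n_boot=1000, seed=0):
    """Resample clusters (not individual mutations) with replacement
    and return an approximate 95% CI for Pearson r."""
    rng = random.Random(seed)
    clusters = list(preds_by_cluster)
    rs = []
    for _ in range(n_boot):
        sample = [rng.choice(clusters) for _ in clusters]
        xs = [p for c in sample for p in preds_by_cluster[c]]
        ys = [l for c in sample for l in labels_by_cluster[c]]
        rs.append(pearson(xs, ys))
    rs.sort()
    return rs[int(0.025 * n_boot)], rs[int(0.975 * n_boot)]
```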
FoldX predictions are obtained by following the protocol outlined in this paper, with five RepairPDB steps followed by BuildModel with numberOfRuns set to 10. We found these parameters to be crucial to the accuracy of FoldX. For Flex ddG, we used the scripts provided in the repository with the recommended parameters (e.g. number_backrub_trials=35,000) and nstruct=10.
- Code built upon ProteinMPNN and Graph-Based Protein Design, specifically `./stabddg/mpnn_utils.py`.
- Folding stability data collected by Tsuboyama et al., with train/test splits by Dieckhaus et al., specifically `rocklin/mega_splits.pkl`.
- Binding energy data curated from SKEMPI v2 by Jankauskaite et al.