Hello! I'm still new to learning on proteins, and I was wondering how to train on the Structure Similarity Task efficiently when using the graph format for PyTorch Geometric.
To load the data, I am using the following lines:
"""## Load the task and the dataset"""
datapath = './data/ec'
task = ps_tasks.StructureSimilarityTask(root=datapath)
dset = task.dataset
"""We convert the protein 3D structures to $\epsilon$-graphs ($\epsilon=8$ here):"""
def transform(data):
    data, protein_dict = data
    data.y = protein_dict['protein']['ID']
    return data
dset2 = dset.to_graph(eps=8.0).pyg(transform=transform)
from torch.utils.data import Subset
from torch_geometric.loader import DataLoader
batch_size = args.batch_size
train_loader = DataLoader(Subset(dset2, task.train_index), batch_size=batch_size, shuffle=True, num_workers=0)
val_loader = DataLoader(Subset(dset2, task.val_index), batch_size=batch_size, shuffle=False, num_workers=0)
test_loader = DataLoader(Subset(dset2, task.test_index), batch_size=batch_size, shuffle=False, num_workers=0)
My understanding is that we need to take two graph (protein) samples, embed them, and predict a regression value for their similarity. The PyG dataloader batches all the dictionaries together, which is why I decided to keep only the protein ID for batching, and so I removed the ['protein']['ID'] part from the target function in structure_similarity.py. As a result, my model looks as follows:
def forward(self, batch):
    it = 0
    for sample in batch:  # embed each of the two batched sets of graphs separately
        x = sample.x
        edge_index = sample.edge_index
        x = self.x_embedding(x)
        x = self.conv1(x, edge_index)
        x = F.leaky_relu(x)
        x = self.bano1(x)
        # x = F.dropout(x, training=self.training, p=0.2)
        x = self.conv2(x, edge_index)
        x = F.leaky_relu(x)
        x = self.bano2(x)
        # x = F.dropout(x, training=self.training, p=0.2)
        x = self.conv3(x, edge_index)
        x = F.relu(x)
        x = self.bano3(x)
        # x = F.dropout(x, training=self.training, p=0.2)
        x = self.conv4(x, edge_index)
        # x = F.relu(x)
        # x = self.bano3(x)
        # x = F.dropout(x, training=self.training, p=0.2)
        if it == 0:
            # graph-level embeddings of the first protein in each pair
            s1 = global_add_pool(x, sample.batch)
        else:
            # graph-level embeddings of the second protein in each pair
            s2 = global_add_pool(x, sample.batch)
        it += 1
    final = self.mlpRep(s1 + s2)
    return final
And this is the evaluation function, where I have to use a for loop to collect the ground-truth similarity values:
@torch.no_grad()
def eval_epoch(model, loader):
    model.eval()
    y_true = []
    y_pred = []
    for step, batch in enumerate(loader):  # iterate over the loader passed as argument
        size = len(batch[0].y)
        batch[0] = batch[0].to(device)
        batch[1] = batch[1].to(device)
        y_hat = model(batch)
        truths = []
        # look up the ground-truth similarity for each pair of protein IDs
        for g in range(size):
            truths.append(task.targetBatch(batch[0].y[g], batch[1].y[g]))
        y_pred.append(y_hat)
        y_true.append(torch.Tensor(truths))
    y_true = torch.hstack(y_true).detach().cpu().numpy()
    y_pred2 = torch.vstack(y_pred).detach().cpu().numpy()
    scores = task.evaluate(y_true, y_pred2)
    return scores
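One direction that might reduce the cost of the per-pair loop above is to memoize the target lookup, so that each pair of IDs is resolved only once across batches and epochs. This is only a minimal sketch under stated assumptions: it assumes the similarity is symmetric in its two arguments, and `slow_target`, `pair_target` and `batch_targets` are hypothetical names introduced for illustration (`slow_target` stands in for `task.targetBatch` and returns a dummy value); none of them are part of ProteinShake.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive lookup actually runs


def slow_target(id_a: str, id_b: str) -> float:
    """Hypothetical stand-in for task.targetBatch (not a ProteinShake function)."""
    global calls
    calls += 1
    # Dummy deterministic value, only here to make the sketch runnable.
    return (sum(map(ord, id_a + id_b)) % 100) / 100.0


@lru_cache(maxsize=None)
def _cached(pair: tuple) -> float:
    # The cache key is the (sorted) pair of IDs.
    return slow_target(*pair)


def pair_target(id_a: str, id_b: str) -> float:
    """Resolve a pair once; (a, b) and (b, a) share one cache entry.

    Assumes the similarity is symmetric in its two arguments.
    """
    return _cached(tuple(sorted((id_a, id_b))))


def batch_targets(ids_a, ids_b):
    """Targets for two aligned lists of protein IDs."""
    return [pair_target(a, b) for a, b in zip(ids_a, ids_b)]


# Both pairs contain the same two IDs, so the slow lookup runs only once.
truths = batch_targets(["1abc", "2xyz"], ["2xyz", "1abc"])
```

In `eval_epoch`, something like `batch_targets(batch[0].y, batch[1].y)` could then replace the inner `for g in range(size)` loop, assuming `batch[0].y` and `batch[1].y` are aligned lists of ID strings.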
Of course, training is taking too long, and I would appreciate any tips on how to use the ProteinShake package more efficiently for this task. Thanks a lot in advance!