Structure Similarity Task training #273

@ahariri13

Hello! I'm still new to learning on proteins, and I was wondering how to train on the Structure Similarity Task (at least in an efficient manner) when using the graph format for PyTorch Geometric.

To load the data, I am using the following lines:

# Load the task and the dataset
datapath = './data/ec'
task = ps_tasks.StructureSimilarityTask(root=datapath)
dset = task.dataset

# Convert the protein 3D structures to epsilon-graphs (epsilon = 8 here)

def transform(data):
    data, protein_dict = data
    data.y = protein_dict['protein']['ID']
    return data

dset2 = dset.to_graph(eps=8.0).pyg(transform=transform)

from torch.utils.data import Subset
from torch_geometric.loader import DataLoader

batch_size = args.batch_size
train_loader = DataLoader(Subset(dset2, task.train_index), batch_size=batch_size, shuffle=True, num_workers=0)

val_loader = DataLoader(Subset(dset2, task.val_index), batch_size=batch_size, shuffle=False, num_workers=0)

test_loader = DataLoader(Subset(dset2, task.test_index), batch_size=batch_size, shuffle=False, num_workers=0)

My understanding is that we need to take two graph (protein) samples, embed them, and predict a regression value for their similarity. The PyG DataLoader would batch all the dictionaries together, which is why I decided to batch only the protein ID, and so I removed the ['protein']['ID'] part from the task's target function in structure_similarity.py. As a result, my model looks as follows:

    def forward(self, batch):

        it=0
        for sample in batch:  # embed each of the two batched graphs separately
          x=sample.x
          edge_index=sample.edge_index

          x = self.x_embedding(x)
          x = self.conv1(x, edge_index)
          x = F.leaky_relu(x)
          x=self.bano1(x)
          #x = F.dropout(x, training=self.training,p=0.2)

          x = self.conv2(x, edge_index)
          x = F.leaky_relu(x)
          x=self.bano2(x)
          #x = F.dropout(x, training=self.training,p=0.2)

          x = self.conv3(x, edge_index)
          x = F.relu(x)
          x = self.bano3(x)
          # #x = F.dropout(x, training=self.training,p=0.2)

          x = self.conv4(x, edge_index)
          # x = F.relu(x)
          # x = self.bano3(x)
          # # #x = F.dropout(x, training=self.training,p=0.2)
          if it==0:
            s1=global_add_pool(x, sample.batch)
          else:
            s2=global_add_pool(x, sample.batch)

          it+=1
        final=self.mlpRep(s1+s2)

        return final 
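
For reference, the pairwise forward pass above could perhaps be written more compactly as a Siamese-style wrapper: one shared encoder applied to both sides of the pair, followed by a regression head. This is only a sketch under assumptions. `PairRegressor` and `ToyEncoder` are hypothetical names, and `ToyEncoder` is a plain-PyTorch placeholder for the conv/batch-norm stack (it sums projected node features per graph, which is what a sum pooling does); the real GNN layers would go in its place.

```python
from types import SimpleNamespace

import torch
from torch import nn


class ToyEncoder(nn.Module):
    """Hypothetical placeholder for the GNN stack: project nodes, then sum per graph."""

    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)

    def forward(self, graph):
        h = self.proj(graph.x)                    # node embeddings
        num_graphs = int(graph.batch.max()) + 1   # graphs in this batch
        out = h.new_zeros(num_graphs, h.size(-1))
        return out.index_add_(0, graph.batch, h)  # sum of node embeddings per graph


class PairRegressor(nn.Module):
    """Hypothetical wrapper: shared encoder for both proteins, then an MLP head."""

    def __init__(self, encoder, hidden_dim):
        super().__init__()
        self.encoder = encoder  # the same weights embed both sides of the pair
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, batch):
        first, second = batch  # two batched graphs, aligned pair by pair
        s1 = self.encoder(first)
        s2 = self.encoder(second)
        # The sum is symmetric, so (a, b) and (b, a) receive the same prediction.
        return self.head(s1 + s2).squeeze(-1)


# Toy inputs: two batches of two graphs each, with 4 features per node.
g1 = SimpleNamespace(x=torch.randn(5, 4), batch=torch.tensor([0, 0, 0, 1, 1]))
g2 = SimpleNamespace(x=torch.randn(6, 4), batch=torch.tensor([0, 0, 1, 1, 1, 1]))

model = PairRegressor(ToyEncoder(4, 8), hidden_dim=8)
pred = model([g1, g2])  # one similarity prediction per pair
```

Unpacking the pair directly removes the need for the `it` counter, and the prediction comes out as a flat vector with one value per pair.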

And here is the evaluation function, where I have to use a for loop to collect the ground-truth labels for the similarity values:

@torch.no_grad()
def eval_epoch(model, loader):
    model.eval()

    y_true = []
    y_pred = []

    for batch in loader:  # use the loader passed in, not the global val_loader
        size = len(batch[0].y)
        batch[0] = batch[0].to(device)
        batch[1] = batch[1].to(device)

        y_hat=model(batch)

        truths=[]
        for g in range(size):
          truths.append(task.targetBatch(batch[0].y[g],batch[1].y[g]))
        y_pred.append(y_hat)
        y_true.append(torch.Tensor(truths))

    y_true = torch.hstack(y_true).detach().cpu().numpy()
    y_pred2 = torch.vstack(y_pred).detach().cpu().numpy()
    scores = task.evaluate(y_true, y_pred2)
    return scores
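
One possible saving in the evaluation loop: because the validation and test loaders use shuffle=False, the ground-truth vector is the same on every epoch, so it could be looked up once before training and reused. The sketch below illustrates the idea with plain Python. `build_pair_targets`, `toy_target`, and the IDs and scores are all hypothetical stand-ins (the real lookup would be the task's target function), and whether this lookup is the actual bottleneck is an assumption that profiling would need to confirm.

```python
def build_pair_targets(id_pairs, target_fn):
    """Look up the target of every (id_a, id_b) pair once, in loader order."""
    return [float(target_fn(a, b)) for a, b in id_pairs]


# Hypothetical stand-in for the task's pairwise target lookup.
toy_scores = {("1abc", "2xyz"): 0.8, ("1abc", "3def"): 0.3}
calls = []


def toy_target(id_a, id_b):
    calls.append((id_a, id_b))  # record each lookup to show how often it runs
    return toy_scores[(id_a, id_b)]


# Pairs in the fixed order an unshuffled loader would yield them.
val_pairs = [("1abc", "2xyz"), ("1abc", "3def")]
val_targets = build_pair_targets(val_pairs, toy_target)  # computed once

for epoch in range(3):
    # Only the predictions change between epochs; the targets are reused.
    y_true = val_targets
```

This only applies to loaders with a fixed order; the shuffled training loader would still need its targets looked up per batch.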

Of course, the training is taking too long, and I would appreciate any tips on how to use the ProteinShake package more efficiently for this task. Thanks a lot in advance!
