PeptidesFunctionalDataset

class dgl.data.PeptidesFunctionalDataset(raw_dir=None, force_reload=None, verbose=None, transform=None, smiles2graph=<function smiles2graph>)[source]

Bases: dgl.data.dgl_dataset.DGLDataset

Peptides functional dataset for the graph classification task.

DGL dataset of Peptides-func from the LRGB benchmark. It contains 15,535 peptides, each represented as a molecular graph built from its SMILES string, with a 10-way multi-task binary classification of functional classes (a peptide may belong to several classes at once).

The 10 classes represent the following functional classes (in order):

[‘antifungal’, ‘cell_cell_communication’, ‘anticancer’, ‘drug_delivery_vehicle’, ‘antimicrobial’, ‘antiviral’, ‘antihypertensive’, ‘antibacterial’, ‘antiparasitic’, ‘toxic’]
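Because the task is multi-task binary classification, each label can be read as a vector of ten 0/1 entries in the class order listed above. The following minimal sketch shows how such a vector could be mapped back to class names. The helper decode_label and the example vector are hypothetical and not part of DGL, and the sketch assumes the label is a flat length-10 sequence (a real label tensor may need flattening first).

```python
# Class names in the order documented for the Peptides-func labels.
CLASS_NAMES = [
    "antifungal", "cell_cell_communication", "anticancer",
    "drug_delivery_vehicle", "antimicrobial", "antiviral",
    "antihypertensive", "antibacterial", "antiparasitic", "toxic",
]


def decode_label(label):
    """Return the names of the classes whose entry in `label` is set.

    `label` is assumed to be a flat sequence of ten 0/1 values
    (hypothetical helper for illustration, not a DGL function).
    """
    values = [float(v) for v in label]
    if len(values) != len(CLASS_NAMES):
        raise ValueError(
            f"expected {len(CLASS_NAMES)} entries, got {len(values)}"
        )
    # Treat any value of at least 0.5 as a positive label.
    return [name for name, v in zip(CLASS_NAMES, values) if v >= 0.5]


# Made-up label vector with positions 4 and 7 set.
example_label = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
decoded = decode_label(example_label)
print(decoded)
```

More than one entry can be set at the same time, which is why the helper returns a list rather than a single name.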

Reference https://arxiv.org/abs/2206.08164.pdf

Statistics:

  • Train examples: 10,873

  • Valid examples: 2,331

  • Test examples: 2,331

  • Average number of nodes: 150.94

  • Average number of edges: 307.30

  • Number of atom types: 9

  • Number of bond types: 3
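As a quick sanity check on the statistics above, the three split sizes can be summed and turned into fractions. This is plain arithmetic on the numbers listed here and uses no DGL calls; the variable names are illustrative only.

```python
# Split sizes and averages copied from the statistics listed above.
splits = {"train": 10873, "valid": 2331, "test": 2331}
avg_nodes = 150.94
avg_edges = 307.30

# The splits should add up to the full dataset size.
total = sum(splits.values())

# Fraction of the dataset in each split.
fractions = {name: count / total for name, count in splits.items()}

# Average number of edges per node, from the two reported averages.
edges_per_node = avg_edges / avg_nodes

print(total)
for name, frac in fractions.items():
    print(f"{name}: {frac:.2%}")
print(f"edges per node: {edges_per_node:.2f}")
```

The total matches the 15,535 peptides stated in the description, and the splits work out to roughly 70% / 15% / 15%.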

Parameters
  • raw_dir (str) – Directory to store all the downloaded raw datasets. Default: “~/.dgl/”.

  • force_reload (bool) – Whether to reload the dataset. Default: False.

  • verbose (bool) – Whether to print out progress information. Default: False.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

  • smiles2graph (callable) – A callable that converts a SMILES string into a graph object. Note: the default smiles2graph requires rdkit to be installed.
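To illustrate the transform parameter, here is a minimal sketch of a callable that takes a graph and returns a transformed version. The function name to_float_features is hypothetical. It assumes the node and edge features live under the 'feat' key as integer tensors, as in the examples on this page, and casts them to floating point.

```python
def to_float_features(graph):
    """Cast the 'feat' node and edge features of `graph` to float.

    Hypothetical transform for illustration; it assumes `graph`
    exposes `ndata` and `edata` mappings whose 'feat' entries have
    a `.float()` method, as torch tensors do.
    """
    graph.ndata["feat"] = graph.ndata["feat"].float()
    graph.edata["feat"] = graph.edata["feat"].float()
    return graph


# Intended usage (requires dgl and downloads the dataset, so it is
# left commented out here):
#
#   from dgl.data import PeptidesFunctionalDataset
#   dataset = PeptidesFunctionalDataset(transform=to_float_features)
#   graph, label = dataset[0]
```

The transform is applied on every access, so it is best kept cheap. Note also that, per the __getitem__ description, indexing with a 1-D tensor is documented as allowed only when transform is None.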

Examples

>>> from dgl.data import PeptidesFunctionalDataset
>>> dataset = PeptidesFunctionalDataset()
>>> len(dataset)
15535
>>> dataset.num_classes
10
>>> graph, label = dataset[0]
>>> graph
Graph(num_nodes=119, num_edges=244,
    ndata_schemes={'feat': Scheme(shape=(9,), dtype=torch.int64)}
    edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.int64)})
>>> # a 1-D tensor can be used as the index when transform is None
>>> # see __getitem__ for details
>>> # get train dataset
>>> split_dict = dataset.get_idx_split()
>>> trainset = dataset[split_dict["train"]]
>>> graph, label = trainset[0]
>>> graph
Graph(num_nodes=338, num_edges=682,
    ndata_schemes={'feat': Scheme(shape=(9,), dtype=torch.int64)}
    edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.int64)})
>>> # get subset of dataset
>>> import torch
>>> idx = torch.tensor([0, 1, 2])
>>> dataset_subset = dataset[idx]
>>> graph, label = dataset_subset[0]
>>> graph
Graph(num_nodes=119, num_edges=244,
    ndata_schemes={'feat': Scheme(shape=(9,), dtype=torch.int64)}
    edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.int64)})
__getitem__(idx)[source]

Get the idx-th sample.

Parameters

idx (int or tensor) – The sample index. A 1-D tensor is allowed as idx only when transform is None.

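Returns

The idx-th graph and its label when idx is an int; a subset of the dataset when idx is a 1-D tensor.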

__len__()[source]

The number of examples in the dataset.