PeptidesStructuralDataset

class dgl.data.PeptidesStructuralDataset(raw_dir=None, force_reload=None, verbose=None, transform=None, smiles2graph=<function smiles2graph>)

Bases: dgl.data.dgl_dataset.DGLDataset

Peptides structure dataset for the graph regression task.

DGL dataset of Peptides-struct from the LRGB benchmark. It contains 15,535 small peptides, each represented as a molecular graph built from its SMILES string, with 11 regression targets derived from the peptide’s 3D structure.

The 11 regression targets were precomputed from the molecules’ 3D structures:

Inertia_mass_[a-c]: The principal components of the inertia of the mass, with some normalizations. (Sorted)
Inertia_valence_[a-c]: The principal components of the inertia of the hydrogen atoms. This is essentially a measure of the 3D distribution of hydrogens. (Sorted)
length_[a-c]: The lengths along the 3 main geometric axes of the 3D object (without considering atom types). (Sorted)
Spherocity: SpherocityIndex descriptor computed by rdkit.Chem.rdMolDescriptors.CalcSpherocityIndex
Plane_best_fit: Plane of best fit (PBF) descriptor computed by rdkit.Chem.rdMolDescriptors.CalcPBF
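
The `[a-c]` shorthand in the list above stands for three targets per family, which together with the two scalar descriptors gives 11 targets. The following is a minimal sketch that enumerates them. The helper name `expand_target_names` is hypothetical, and the mapping from names to label columns assumes the labels follow the listing order, which the description above does not confirm.

```python
def expand_target_names():
    """Enumerate the 11 target names described in the list above."""
    names = []
    # Each of these three families has one value per principal axis (a, b, c).
    for family in ("Inertia_mass", "Inertia_valence", "length"):
        for axis in "abc":
            names.append(f"{family}_{axis}")
    # Two scalar shape descriptors computed with RDKit.
    names += ["Spherocity", "Plane_best_fit"]
    return names


target_names = expand_target_names()

# Assumption (not stated above): column i of a label tensor
# corresponds to target_names[i].
name_to_column = {name: i for i, name in enumerate(target_names)}
```

Under that assumption, `name_to_column` can be used to select a single target from a label tensor by name rather than by a hard-coded index.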

Reference: https://arxiv.org/abs/2206.08164.pdf

Statistics:

Train examples: 10,873
Valid examples: 2,331
Test examples: 2,331
Average number of nodes: 150.94
Average number of edges: 307.30
Number of atom types: 9
Number of bond types: 3
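
As a quick sanity check on the statistics above, the three splits should add up to the full dataset size. The sketch below uses only the numbers listed here; the variable names are illustrative.

```python
# Split sizes as listed in the statistics above.
splits = {"train": 10_873, "valid": 2_331, "test": 2_331}

# The splits together cover the whole dataset (15,535 graphs).
total = sum(splits.values())

# Share of the dataset in each split.
fractions = {name: count / total for name, count in splits.items()}

# Ratio of the average edge count to the average node count.
edges_per_node = 307.30 / 150.94
```

The fractions work out to roughly 70% / 15% / 15%, and the graphs are sparse, with about two edges per node on average.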

Parameters

raw_dir (str) – Directory to store all the downloaded raw datasets. Default: “~/.dgl/”.
force_reload (bool) – Whether to reload the dataset. Default: False.
verbose (bool) – Whether to print out progress information. Default: False.
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
smiles2graph (callable) – A callable function that converts a SMILES string into a graph object. The default smiles2graph requires rdkit to be installed.
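
To illustrate what "transformed before every access" means for the `transform` parameter, here is a minimal sketch using a stand-in class. `ToyDataset` and `add_self_loops` are hypothetical names, and plain dictionaries stand in for DGLGraph objects; this is not the library's implementation.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset that accepts a `transform` callable."""

    def __init__(self, graphs, labels, transform=None):
        self._graphs = graphs
        self._labels = labels
        self._transform = transform

    def __len__(self):
        return len(self._graphs)

    def __getitem__(self, idx):
        graph = self._graphs[idx]
        # The transform runs at access time, so the stored graph is unchanged.
        if self._transform is not None:
            graph = self._transform(graph)
        return graph, self._labels[idx]


calls = []  # records each invocation of the transform


def add_self_loops(graph):
    """Toy transform: return a copy of the graph with one self-loop per node."""
    calls.append(graph["num_nodes"])
    loops = [(n, n) for n in range(graph["num_nodes"])]
    return {"num_nodes": graph["num_nodes"], "edges": graph["edges"] + loops}


raw = [{"num_nodes": 3, "edges": [(0, 1), (1, 2)]}]
toy = ToyDataset(raw, labels=[[0.5]], transform=add_self_loops)

first, _ = toy[0]   # transform is applied here
second, _ = toy[0]  # and applied again on the second access
```

In this sketch the transform is called once per access and the stored graphs are never modified, so an expensive transform is paid for on every lookup. With the real dataset, a DGL transform such as `dgl.AddSelfLoop()` would take the place of the toy function.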

Examples

>>> from dgl.data import PeptidesStructuralDataset

>>> dataset = PeptidesStructuralDataset()
>>> len(dataset)
15535
>>> dataset.num_atom_types
9
>>> graph, label = dataset[0]
>>> graph
Graph(num_nodes=119, num_edges=244,
    ndata_schemes={'feat': Scheme(shape=(9,), dtype=torch.int64)}
    edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.int64)})

>>> # a 1-D tensor can be used as the index when transform is None
>>> # see the __getitem__ function for details
>>> # get train dataset
>>> split_dict = dataset.get_idx_split()
>>> trainset = dataset[split_dict["train"]]
>>> graph, label = trainset[0]
>>> graph
Graph(num_nodes=338, num_edges=682,
    ndata_schemes={'feat': Scheme(shape=(9,), dtype=torch.int64)}
    edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.int64)})

>>> # get subset of dataset
>>> import torch
>>> idx = torch.tensor([0, 1, 2])
>>> dataset_subset = dataset[idx]
>>> graph, label = dataset_subset[0]
>>> graph
Graph(num_nodes=119, num_edges=244,
    ndata_schemes={'feat': Scheme(shape=(9,), dtype=torch.int64)}
    edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.int64)})

__getitem__(idx)

Get the idx-th sample.

Parameters

idx (int or tensor) – The sample index. 1-D tensor as idx is allowed when transform is None.

Returns

(dgl.DGLGraph, Tensor) – Graph with node features stored in the feat field, and its label.
or
dgl.data.utils.Subset – Subset of the dataset at the specified indices.
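
The two return types above follow from dispatching on the index type. The sketch below shows that pattern with stand-in classes: `ToyIndexable` and `ToySubset` are hypothetical, a Python list stands in for a 1-D tensor, and the `transform is None` condition is left out for brevity. It illustrates the idea only and is not DGL's code.

```python
class ToySubset:
    """Hypothetical view over a parent dataset restricted to chosen indices."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = list(indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # Translate the position within the subset into the parent's index.
        return self.dataset[self.indices[i]]


class ToyIndexable:
    """Hypothetical dataset whose __getitem__ accepts an int or a list of ints."""

    def __init__(self, samples):
        self._samples = samples

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, idx):
        if isinstance(idx, int):
            # Single index: return the (graph, label) pair itself.
            return self._samples[idx]
        # Sequence of indices: return a lazy subset instead of copying samples.
        return ToySubset(self, idx)


data = ToyIndexable([("g0", 0.0), ("g1", 1.0), ("g2", 2.0), ("g3", 3.0)])
sample = data[2]      # a single (graph, label) pair
subset = data[[3, 1]]  # a subset holding two samples
```

Because the subset only stores indices, indexing it (for example `subset[0]`) looks the sample up in the parent dataset on demand.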

__len__()

The number of examples in the dataset.