PeptidesStructuralDataset

class dgl.data.PeptidesStructuralDataset(raw_dir=None, force_reload=None, verbose=None, transform=None, smiles2graph=<function smiles2graph>)

Bases: dgl.data.dgl_dataset.DGLDataset

Peptides structure dataset for the graph regression task.

DGL dataset of Peptides-struct from the LRGB benchmark. It contains 15,535 small peptides, each represented as a molecular graph built from its SMILES string, with 11 regression targets derived from the peptide's 3D structure.

The 11 regression targets were precomputed from the molecules' 3D structures:

  • Inertia_mass_[a-c]: The principal component of the inertia of the mass, with some normalizations. (Sorted)

  • Inertia_valence_[a-c]: The principal component of the inertia of the hydrogen atoms. This is basically a measure of the 3D distribution of hydrogens. (Sorted)

  • length_[a-c]: The length around the 3 main geometric axes of the 3D objects (without considering atom types). (Sorted)

  • Spherocity: SpherocityIndex descriptor computed by rdkit.Chem.rdMolDescriptors.CalcSpherocityIndex

  • Plane_best_fit: Plane of best fit (PBF) descriptor computed by rdkit.Chem.rdMolDescriptors.CalcPBF
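The bracket shorthand in the list above stands for three values per descriptor (a, b and c), which is how the five entries add up to 11 targets. A minimal sketch that spells the names out follows. The list `target_names` and its ordering are illustrative assumptions: the order simply mirrors the list above and is not confirmed to match the column order of the label tensor.

```python
# Expand the "[a-c]" shorthand from the target list into explicit names.
# Assumption: the order below mirrors the documentation list; the actual
# column order of the label tensor in the dataset is not confirmed here.
axis_suffixes = ["a", "b", "c"]
per_axis_prefixes = ["Inertia_mass", "Inertia_valence", "length"]
scalar_targets = ["Spherocity", "Plane_best_fit"]

# Three descriptors with three sorted values each, then two scalar descriptors.
target_names = [
    f"{prefix}_{axis}" for prefix in per_axis_prefixes for axis in axis_suffixes
]
target_names += scalar_targets

print(len(target_names))
for name in target_names:
    print(name)
```

Nine per-axis values plus two scalar descriptors give the 11 regression targets mentioned above.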

Reference: https://arxiv.org/abs/2206.08164.pdf

Statistics:

  • Train examples: 10,873

  • Valid examples: 2,331

  • Test examples: 2,331

  • Average number of nodes: 150.94

  • Average number of edges: 307.30

  • Number of atom types: 9

  • Number of bond types: 3
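As a quick sanity check on the statistics above, the three split sizes should add up to the 15,535 peptides in the dataset. The sketch below uses only the numbers quoted in this section and derives the split proportions from them. The variable names are illustrative and not part of the DGL API.

```python
# Split sizes copied from the statistics above.
split_sizes = {"train": 10_873, "valid": 2_331, "test": 2_331}

# The splits should cover the whole dataset exactly once.
total = sum(split_sizes.values())
print(f"total graphs: {total}")

# Fraction of the dataset in each split.
fractions = {name: size / total for name, size in split_sizes.items()}
for name, frac in fractions.items():
    print(f"{name}: {split_sizes[name]} graphs ({frac:.1%})")

# Ratio of the reported average edge count to the average node count.
avg_nodes, avg_edges = 150.94, 307.30
edges_per_node = avg_edges / avg_nodes
print(f"edges per node: {edges_per_node:.2f}")
```

The proportions come out at roughly 70% train, 15% validation and 15% test.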

Parameters
  • raw_dir (str) – Directory to store all the downloaded raw datasets. Default: "~/.dgl/".

  • force_reload (bool) – Whether to reload the dataset. Default: False.

  • verbose (bool) – Whether to print out progress information. Default: False.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

  • smiles2graph (callable) – A callable function that converts a SMILES string into a graph object. Note: the default smiles2graph requires rdkit to be installed.

Examples

>>> from dgl.data import PeptidesStructuralDataset
>>> dataset = PeptidesStructuralDataset()
>>> len(dataset)
15535
>>> dataset.num_atom_types
9
>>> graph, label = dataset[0]
>>> graph
Graph(num_nodes=119, num_edges=244,
    ndata_schemes={'feat': Scheme(shape=(9,), dtype=torch.int64)}
    edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.int64)})
>>> # a 1-D tensor can be used as an index when transform is None
>>> # see details in __getitem__ function
>>> # get train dataset
>>> split_dict = dataset.get_idx_split()
>>> trainset = dataset[split_dict["train"]]
>>> graph, label = trainset[0]
>>> graph
Graph(num_nodes=338, num_edges=682,
    ndata_schemes={'feat': Scheme(shape=(9,), dtype=torch.int64)}
    edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.int64)})
>>> # get subset of dataset
>>> import torch
>>> idx = torch.tensor([0, 1, 2])
>>> dataset_subset = dataset[idx]
>>> graph, label = dataset_subset[0]
>>> graph
Graph(num_nodes=119, num_edges=244,
    ndata_schemes={'feat': Scheme(shape=(9,), dtype=torch.int64)}
    edata_schemes={'feat': Scheme(shape=(3,), dtype=torch.int64)})
__getitem__(idx)

Get the idx-th sample.

Parameters

idx (int or tensor) – The sample index. A 1-D tensor is allowed as idx when transform is None.

Returns

  • (dgl.DGLGraph, Tensor) – Graph with node features stored in the feat field, together with its label, when idx is an int; or

  • dgl.data.utils.Subset – Subset of the dataset at the specified indices, when idx is a 1-D tensor.
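The two return shapes can be illustrated with a small pure-Python stand-in. This is a sketch of the indexing pattern only, not DGL's implementation: `ToyDataset` and `ToySubset` are hypothetical classes, plain lists stand in for tensors, strings stand in for graphs, and raising an error for a multi-index with a transform set is a choice of this sketch rather than documented behavior.

```python
class ToySubset:
    """Hypothetical view over a dataset restricted to the given indices."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, i):
        # Map the subset position back to the position in the parent dataset.
        return self.dataset[self.indices[i]]

    def __len__(self):
        return len(self.indices)


class ToyDataset:
    """Hypothetical stand-in for a graph dataset; NOT DGL's implementation."""

    def __init__(self, samples, transform=None):
        self.samples = list(samples)  # list of (graph, label) pairs
        self.transform = transform

    def __getitem__(self, idx):
        if isinstance(idx, int):
            # Single index: return one (graph, label) pair,
            # applying the transform on access if one is set.
            graph, label = self.samples[idx]
            if self.transform is not None:
                graph = self.transform(graph)
            return graph, label
        # Otherwise treat idx as a 1-D collection of indices.
        # Assumption of this sketch: reject it when a transform is set.
        if self.transform is not None:
            raise ValueError("multi-index access requires transform to be None")
        return ToySubset(self, [int(i) for i in idx])

    def __len__(self):
        return len(self.samples)


data = ToyDataset([("g0", 0.1), ("g1", 0.2), ("g2", 0.3)])
single = data[1]        # one (graph, label) pair
subset = data[[0, 2]]   # a subset view over two samples
print(single)
print(len(subset), subset[1])
```

Returning a lightweight view for a multi-index avoids copying samples, which is why the subset still yields (graph, label) pairs when indexed with an int.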

__len__()

The number of examples in the dataset.