Dataset

Utils

utils.get_download_dir() Get the absolute path to the download directory.
utils.download(url[, path, overwrite, …]) Download a given URL.
utils.check_sha1(filename, sha1_hash) Check whether the sha1 hash of the file content matches the expected hash.
utils.extract_archive(file, target_dir) Extract archive file.
utils.split_dataset(dataset[, frac_list, …]) Split dataset into training, validation and test set.
utils.save_graphs(filename, g_list[, labels]) Save DGLGraphs and graph labels to file.
utils.load_graphs(filename[, idx_list]) Load DGLGraphs from file.
utils.load_labels(filename) Load a label dict from file.
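The fraction-based partitioning performed by utils.split_dataset can be sketched in a few lines of plain Python. This is a simplified analogue for intuition only: the real function additionally supports shuffling and wraps each chunk in a Subset instance.

```python
def split_by_frac(dataset, frac_list=(0.8, 0.1, 0.1)):
    """Partition a sequence into consecutive chunks by fractions.

    A simplified analogue of dgl.data.utils.split_dataset; the real
    function can shuffle first and returns Subset objects.
    """
    assert abs(sum(frac_list) - 1.0) < 1e-9, 'fractions must sum to 1'
    n = len(dataset)
    splits, start = [], 0
    for i, frac in enumerate(frac_list):
        # Give the last chunk all remaining items to avoid rounding loss.
        end = n if i == len(frac_list) - 1 else start + int(round(frac * n))
        splits.append(list(dataset[start:end]))
        start = end
    return splits

train, val, test = split_by_frac(list(range(10)))
```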
class dgl.data.utils.Subset(dataset, indices)[source]

Subset of a dataset at specified indices

Code adapted from PyTorch.

Parameters:
  • dataset – dataset[i] should return the ith datapoint
  • indices (list) – List of datapoint indices to construct the subset
__getitem__(item)[source]

Get the datapoint indexed by item

Returns:datapoint
Return type:tuple
__len__()[source]

Get subset size

Returns:Number of datapoints in the subset
Return type:int
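The interface documented above can be summarized with a short plain-Python sketch of the Subset pattern (not the DGL implementation itself):

```python
class Subset:
    """Index-based view over another dataset (sketch of dgl.data.utils.Subset)."""
    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, item):
        # Map the subset index to the underlying dataset index.
        return self.dataset[self.indices[item]]

    def __len__(self):
        return len(self.indices)

data = ['a', 'b', 'c', 'd']
sub = Subset(data, [3, 1])
```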

Dataset Classes

Stanford sentiment treebank dataset

For more information about the dataset, see Sentiment Analysis.

class dgl.data.SST(mode='train', vocab_file=None)[source]

Stanford Sentiment Treebank dataset.

Each sample is the constituency tree of a sentence. The leaf nodes represent words, each stored as an int value in the x feature field. Non-leaf nodes hold the special value PAD_WORD in the x field. Every node also carries a sentiment annotation with 5 classes (very negative, negative, neutral, positive and very positive), stored as an int value in the y feature field.

Note

This dataset class is compatible with pytorch’s Dataset class.

Note

All samples are loaded and preprocessed in memory first.

Parameters:
  • mode (str, optional) – Can be 'train', 'val', 'test' and specifies which data file to use.
  • vocab_file (str, optional) – Optional vocabulary file.
__getitem__(idx)[source]

Get the tree with index idx.

Parameters:idx (int) – Tree index.
Returns:Tree.
Return type:dgl.DGLGraph
__len__()[source]

Get the number of trees in the dataset.

Returns:Number of trees.
Return type:int

Karate Club dataset

class dgl.data.KarateClub[source]

Zachary’s karate club is a social network of a university karate club, described in the paper “An Information Flow Model for Conflict and Fission in Small Groups” by Wayne W. Zachary. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002.

This dataset has only one graph. The ndata field 'label' indicates whether a node belongs to the “Mr. Hi” club.

Citation Network dataset

class dgl.data.CitationGraphDataset(name)[source]

The citation graph dataset, including citeseer and pubmed. Nodes represent authors and edges represent citation relationships.

Parameters:name (str) – name can be ‘citeseer’ or ‘pubmed’.

Cora Citation Network dataset

class dgl.data.CoraDataset[source]

Cora citation network dataset. Nodes represent authors and edges represent citation relationships.

CoraFull dataset

class dgl.data.CoraFull[source]

Extended Cora dataset from Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking. Nodes represent papers and edges represent citations.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Amazon Co-Purchase dataset

class dgl.data.AmazonCoBuy(name)[source]

Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Parameters:name (str) – Name of the dataset, has to be ‘computers’ or ‘photo’

Coauthor dataset

class dgl.data.Coauthor(name)[source]

Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Here, nodes are authors, who are connected by an edge if they co-authored a paper; node features represent paper keywords for each author’s papers, and class labels indicate the most active fields of study for each author.

Parameters:name (str) – Name of the dataset, has to be ‘cs’ or ‘physics’

BitcoinOTC dataset

class dgl.data.BitcoinOTC[source]

This is the who-trusts-whom network of people who trade using Bitcoin on a platform called Bitcoin OTC. Since Bitcoin users are anonymous, there is a need to maintain a record of users’ reputation to prevent transactions with fraudulent and risky users. Members of Bitcoin OTC rate other members on a scale of -10 (total distrust) to +10 (total trust) in steps of 1.

References:
  • Bitcoin OTC trust weighted signed network
  • EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs

ICEWS18 dataset

class dgl.data.ICEWS18(mode)[source]

Integrated Crisis Early Warning System (ICEWS18) event data consists of coded interactions between socio-political actors (i.e., cooperative or hostile actions between individuals, groups, sectors and nation states). This dataset consists of events from 1/1/2018 to 10/31/2018 (24-hour time granularity).

References:
  • Recurrent Event Network for Reasoning over Temporal Knowledge Graphs
  • ICEWS Coded Event Data

Parameters:mode (str) – Load train/valid/test data. Has to be one of [‘train’, ‘valid’, ‘test’]

QM7b dataset

class dgl.data.QM7b[source]

This dataset consists of 7,211 molecules with 14 regression targets. Nodes represent atoms and edges represent bonds. The edge data field ‘h’ stores the corresponding entry of the Coulomb matrix.

References:
  • QM7b Dataset

GDELT dataset

class dgl.data.GDELT(mode)[source]

The Global Database of Events, Language, and Tone (GDELT) dataset. This contains events that happened all over the world (i.e., every protest held anywhere in Russia on a given day is collapsed to a single entry).

This Dataset consists of events collected from 1/1/2018 to 1/31/2018 (15 minutes time granularity).

References:
  • Recurrent Event Network for Reasoning over Temporal Knowledge Graphs
  • The Global Database of Events, Language, and Tone (GDELT)

Parameters:mode (str) – Load train/valid/test data. Has to be one of [‘train’, ‘valid’, ‘test’]

Mini graph classification dataset

class dgl.data.MiniGCDataset(num_graphs, min_num_v, max_num_v)[source]

A synthetic dataset for graph classification.

The dataset contains 8 different types of graphs:

  • class 0 : cycle graph
  • class 1 : star graph
  • class 2 : wheel graph
  • class 3 : lollipop graph
  • class 4 : hypercube graph
  • class 5 : grid graph
  • class 6 : clique graph
  • class 7 : circular ladder graph

Note

This dataset class is compatible with pytorch’s Dataset class.

Parameters:
  • num_graphs (int) – Number of graphs in this dataset.
  • min_num_v (int) – Minimum number of nodes for graphs
  • max_num_v (int) – Maximum number of nodes for graphs
__getitem__(idx)[source]

Get the i-th sample.

Parameters:idx (int) – The sample index.
Returns:The graph and its label.
Return type:(dgl.DGLGraph, int)
__len__()[source]

Return the number of graphs in the dataset.

num_classes

Number of classes.
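For intuition about the graph types above, the edge list of the simplest class (class 0, a cycle graph) can be generated in a couple of lines. This is a pure-Python sketch; the dataset itself returns DGLGraph objects.

```python
def cycle_edges(num_v):
    # Edges of a cycle graph on num_v nodes: i -> (i + 1) mod num_v.
    return [(i, (i + 1) % num_v) for i in range(num_v)]

edges = cycle_edges(4)
```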

Graph kernel dataset

For more information about the dataset, see Benchmark Data Sets for Graph Kernels.

class dgl.data.TUDataset(name)[source]

TUDataset contains many graph kernel datasets for graph classification. Graphs may have node labels, node attributes, edge labels, and edge attributes, varying across datasets.

Parameters:name – Dataset name, such as ENZYMES, DD, COLLAB or MUTAG; can be any dataset name listed on https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets.

__getitem__(idx)[source]

Get the i-th sample.

Parameters:idx (int) – The sample index.
Returns:DGLGraph with node features stored in the feat field and node labels in node_label if available, together with its label.
Return type:(dgl.DGLGraph, int)

Graph isomorphism network dataset

A compact subset of graph kernel dataset

class dgl.data.GINDataset(name, self_loop, degree_as_nlabel=False)[source]

Datasets for Graph Isomorphism Network (GIN) Adapted from https://github.com/weihua916/powerful-gnns/blob/master/dataset.zip.

The dataset contains the compact format of popular graph kernel datasets, which includes: MUTAG, COLLAB, IMDBBINARY, IMDBMULTI, NCI1, PROTEINS, PTC, REDDITBINARY, REDDITMULTI5K

This dataset class processes all datasets listed above. For more graph kernel datasets, see TUDataset.

Parameters:
  • name (str) – Dataset name, one of (‘MUTAG’, ‘COLLAB’, ‘IMDBBINARY’, ‘IMDBMULTI’, ‘NCI1’, ‘PROTEINS’, ‘PTC’, ‘REDDITBINARY’, ‘REDDITMULTI5K’).
  • self_loop (bool) – Add self-loop edges if True.
  • degree_as_nlabel (bool) – Use node degree as label and feature if True.
__getitem__(idx)[source]

Get the i-th sample.

Parameters:idx (int) – The sample index.
Returns:The graph and its label.
Return type:(dgl.DGLGraph, int)
__len__()[source]

Return the number of graphs in the dataset.

Protein-Protein Interaction dataset

class dgl.data.PPIDataset(mode)[source]

A toy Protein-Protein Interaction network dataset.

Adapted from https://github.com/williamleif/GraphSAGE/tree/master/example_data.

The dataset contains 24 graphs. The average number of nodes per graph is 2372. Each node has 50 features and 121 labels.

We use 20 graphs for training, 2 for validation and 2 for testing.

__getitem__(item)[source]

Get the i-th sample.

Parameters:item (int) – The sample index.
Returns:The graph and its label.
Return type:(dgl.DGLGraph, ndarray)
__len__()[source]

Return number of samples in this dataset.

Molecular Graphs

To work on molecular graphs, make sure you have installed RDKit 2018.09.3.

Data Loading and Processing Utils

We adapt several utilities for processing molecules from DeepChem.

chem.add_hydrogens_to_mol(mol) Add hydrogens to an RDKit molecule instance.
chem.get_mol_3D_coordinates(mol) Get 3D coordinates of the molecule.
chem.load_molecule(molecule_file[, …]) Load a molecule from a file.
chem.multiprocess_load_molecules(files[, …]) Load molecules from files with multiprocessing.

Featurization Utils for Single Molecule

For the use of graph neural networks, we need to featurize nodes (atoms) and edges (bonds).

General utils:

chem.one_hot_encoding(x, allowable_set[, …]) One-hot encoding.
chem.ConcatFeaturizer(func_list) Concatenate the evaluation results of multiple functions as a single feature.
chem.ConcatFeaturizer.__call__(x) Featurize the input data.
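The semantics of one-hot encoding against an allowable set, and of chaining several featurizer functions into one feature vector, can be sketched in plain Python. This is an illustrative analogue of the general utils above, not the DGL implementation; the toy lambda features are hypothetical.

```python
def one_hot_encoding(x, allowable_set):
    # One boolean per allowed value; True exactly where x matches.
    return [x == s for s in allowable_set]

class ConcatFeaturizer:
    """Concatenate the results of several featurizer functions."""
    def __init__(self, func_list):
        self.func_list = func_list

    def __call__(self, x):
        feats = []
        for func in self.func_list:
            feats.extend(func(x))
        return feats

featurizer = ConcatFeaturizer([
    lambda atom: one_hot_encoding(atom, ['C', 'N', 'O']),
    lambda atom: [atom == 'C'],  # a toy extra feature for illustration
])
feat = featurizer('N')
```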

Utils for atom featurization:

chem.atom_type_one_hot(atom[, …]) One hot encoding for the type of an atom.
chem.atomic_number_one_hot(atom[, …]) One hot encoding for the atomic number of an atom.
chem.atomic_number(atom) Get the atomic number for an atom.
chem.atom_degree_one_hot(atom[, …]) One hot encoding for the degree of an atom.
chem.atom_degree(atom) Get the degree of an atom.
chem.atom_total_degree_one_hot(atom[, …]) One hot encoding for the degree of an atom including Hs.
chem.atom_total_degree(atom) The degree of an atom including Hs.
chem.atom_implicit_valence_one_hot(atom[, …]) One hot encoding for the implicit valences of an atom.
chem.atom_implicit_valence(atom) Get the implicit valence of an atom.
chem.atom_hybridization_one_hot(atom[, …]) One hot encoding for the hybridization of an atom.
chem.atom_total_num_H_one_hot(atom[, …]) One hot encoding for the total number of Hs of an atom.
chem.atom_total_num_H(atom) Get the total number of Hs of an atom.
chem.atom_formal_charge_one_hot(atom[, …]) One hot encoding for the formal charge of an atom.
chem.atom_formal_charge(atom) Get formal charge for an atom.
chem.atom_num_radical_electrons_one_hot(atom) One hot encoding for the number of radical electrons of an atom.
chem.atom_num_radical_electrons(atom) Get the number of radical electrons for an atom.
chem.atom_is_aromatic_one_hot(atom[, …]) One hot encoding for whether the atom is aromatic.
chem.atom_is_aromatic(atom) Get whether the atom is aromatic.
chem.atom_chiral_tag_one_hot(atom[, …]) One hot encoding for the chiral tag of an atom.
chem.atom_mass(atom[, coef]) Get the mass of an atom and scale it.
chem.BaseAtomFeaturizer(featurizer_funcs[, …]) An abstract class for atom featurizers.
chem.BaseAtomFeaturizer.feat_size(feat_name) Get the feature size for feat_name.
chem.BaseAtomFeaturizer.__call__(mol) Featurize all atoms in a molecule.
chem.CanonicalAtomFeaturizer([atom_data_field]) A default featurizer for atoms.
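A BaseAtomFeaturizer-style featurizer maps feature names to functions and applies each function to every atom. A minimal sketch of that pattern over plain strings instead of RDKit atoms (the real class operates on rdkit.Chem.rdchem.Mol objects and returns tensors; the probe atom used in feat_size here is a hypothetical simplification):

```python
class AtomFeaturizerSketch:
    """Apply each named featurizer function to every atom in a molecule.

    Sketch of the BaseAtomFeaturizer pattern, with atoms modeled as
    plain element symbols rather than RDKit atom objects.
    """
    def __init__(self, featurizer_funcs):
        self.featurizer_funcs = featurizer_funcs

    def feat_size(self, feat_name):
        # Probe with a dummy atom to learn the feature length.
        return len(self.featurizer_funcs[feat_name]('C'))

    def __call__(self, atoms):
        return {name: [func(a) for a in atoms]
                for name, func in self.featurizer_funcs.items()}

featurizer = AtomFeaturizerSketch({'h': lambda a: [a == 'C', a == 'O']})
feats = featurizer(['C', 'O', 'C'])
```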

Utils for bond featurization:

chem.bond_type_one_hot(bond[, …]) One hot encoding for the type of a bond.
chem.bond_is_conjugated_one_hot(bond[, …]) One hot encoding for whether the bond is conjugated.
chem.bond_is_conjugated(bond) Get whether the bond is conjugated.
chem.bond_is_in_ring_one_hot(bond[, …]) One hot encoding for whether the bond is in a ring of any size.
chem.bond_is_in_ring(bond) Get whether the bond is in a ring of any size.
chem.bond_stereo_one_hot(bond[, …]) One hot encoding for the stereo configuration of a bond.
chem.BaseBondFeaturizer(featurizer_funcs[, …]) An abstract class for bond featurizers.
chem.BaseBondFeaturizer.feat_size(feat_name) Get the feature size for feat_name.
chem.BaseBondFeaturizer.__call__(mol) Featurize all bonds in a molecule.
chem.CanonicalBondFeaturizer([bond_data_field]) A default featurizer for bonds.

Graph Construction for Single Molecule

Several methods for constructing DGLGraphs from SMILES/RDKit molecule objects are listed below:

chem.mol_to_graph(mol, graph_constructor, …) Convert an RDKit molecule object into a DGLGraph and featurize for it.
chem.smiles_to_bigraph(smiles[, …]) Convert a SMILES into a bi-directed DGLGraph and featurize for it.
chem.mol_to_bigraph(mol[, add_self_loop, …]) Convert an RDKit molecule object into a bi-directed DGLGraph and featurize for it.
chem.smiles_to_complete_graph(smiles[, …]) Convert a SMILES into a complete DGLGraph and featurize for it.
chem.mol_to_complete_graph(mol[, …]) Convert an RDKit molecule into a complete DGLGraph and featurize for it.
chem.k_nearest_neighbors(coordinates, …) Find k nearest neighbors for each atom based on the 3D coordinates.

Graph Construction and Featurization for Ligand-Protein Complex

Utilities for constructing DGLHeteroGraphs and featurizing them.

chem.ACNN_graph_construction_and_featurization(…) Graph construction and featurization for Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity.

Dataset Classes

If your dataset is stored in a .csv file, you may find it helpful to use

Currently four datasets are supported:

  • Tox21
  • TencentAlchemyDataset
  • PubChemBioAssayAromaticity
  • PDBBind
class dgl.data.chem.Tox21(smiles_to_graph=<function smiles_to_bigraph>, node_featurizer=None, edge_featurizer=None)[source]

Tox21 dataset.

The Toxicology in the 21st Century (https://tripod.nih.gov/tox21/challenge/) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. The dataset contains qualitative toxicity measurements for 8014 compounds on 12 different targets, including nuclear receptors and stress response pathways. Each target results in a binary label.

A common issue for multi-task prediction is that some datapoints are not labeled for all tasks. This is also the case for Tox21. In data pre-processing, we set non-existing labels to be 0 so that they can be placed in tensors and used for masking in loss computation. See examples below for more details.

All molecules are converted into DGLGraphs. After the first-time construction, the DGLGraphs are saved for reloading so that we do not need to reconstruct them every time.

Parameters:
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning smiles into a DGLGraph. Default to dgl.data.chem.smiles_to_bigraph().
  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.
  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.
__getitem__(item)

Get datapoint with index

Parameters:item (int) – Datapoint index
Returns:
  • str – SMILES for the ith datapoint
  • DGLGraph – DGLGraph for the ith datapoint
  • Tensor of dtype float32 – Labels of the datapoint for all tasks
  • Tensor of dtype float32 – Binary masks indicating the existence of labels for all tasks
__len__()

Length of the dataset

Returns:Length of Dataset
Return type:int
task_pos_weights

Get weights for positive samples on each task

Returns:numpy array gives the weight of positive samples on all tasks
Return type:numpy.ndarray
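The binary masks returned by __getitem__ exist for the loss masking described above: placeholder labels contribute nothing to the loss. A pure-Python sketch of the idea (a real pipeline would compute this with framework tensors):

```python
def masked_mean_loss(losses, masks):
    """Average per-task losses, counting only labeled entries.

    losses and masks are flat lists of equal length; a mask entry is
    1.0 where a label exists and 0.0 where it is a placeholder.
    """
    total = sum(ls * m for ls, m in zip(losses, masks))
    num_labeled = sum(masks)
    return total / num_labeled if num_labeled > 0 else 0.0

# The middle task has no label, so its loss of 2.0 is ignored.
loss = masked_mean_loss([0.5, 2.0, 1.0], [1.0, 0.0, 1.0])
```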
class dgl.data.chem.TencentAlchemyDataset(mode='dev', from_raw=False, mol_to_graph=<function mol_to_complete_graph>, node_featurizer=<function alchemy_nodes>, edge_featurizer=<function alchemy_edges>)[source]

Developed by the Tencent Quantum Lab, the dataset lists 12 quantum mechanical properties of 130,000+ organic molecules, comprising up to 12 heavy atoms (C, N, O, S, F and Cl), sampled from the GDBMedChem database. These properties have been calculated using the open-source computational chemistry program Python-based Simulation of Chemistry Framework (PySCF).

For more details, check the paper.

Parameters:
  • mode (str) – ‘dev’, ‘valid’ or ‘test’, separately for training, validation and test. Default to be ‘dev’. Note that ‘test’ is not available as the Alchemy contest is ongoing.
  • from_raw (bool) – Whether to process the dataset from scratch or use a processed one for faster speed. If you use different ways to featurize atoms or bonds, you should set this to be True. Default to be False.
  • mol_to_graph (callable, str -> DGLGraph) – A function turning an RDKit molecule instance into a DGLGraph. Default to dgl.data.chem.mol_to_complete_graph().
  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. By default, we construct graphs where nodes represent atoms and node features represent atom features. We store the atomic numbers under the name "node_type" and the atom features under the name "n_feat". The atom features include: one hot encoding for atom types, the atomic number, whether the atom is a donor, whether the atom is an acceptor, whether the atom is aromatic, one hot encoding for atom hybridization, and the total number of Hs on the atom.
  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. By default, we construct edges between every pair of atoms, excluding the self loops. We store the distance between the end atoms under the name "distance" and store the edge features under the name "e_feat". The edge features represent one hot encoding of edge types (bond types and non-bond edges).
__getitem__(item)[source]

Get datapoint with index

Parameters:item (int) – Datapoint index
Returns:
  • str – SMILES for the ith datapoint
  • DGLGraph – DGLGraph for the ith datapoint
  • Tensor of dtype float32 – Labels of the datapoint for all tasks
__len__()[source]

Length of the dataset

Returns:Length of Dataset
Return type:int
set_mean_and_std(mean=None, std=None)[source]

Set mean and std or compute from labels for future normalization.

Parameters:
  • mean (int or float) – Default to be None.
  • std (int or float) – Default to be None.
class dgl.data.chem.PubChemBioAssayAromaticity(smiles_to_graph=<function smiles_to_bigraph>, node_featurizer=None, edge_featurizer=None)[source]

Subset of PubChem BioAssay Dataset for aromaticity prediction.

The dataset was constructed in Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism and is accompanied by the task of predicting the number of aromatic atoms in molecules.

The dataset was constructed by sampling 3945 molecules with 0-40 aromatic atoms from the PubChem BioAssay dataset.

Parameters:
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning smiles into a DGLGraph. Default to dgl.data.chem.smiles_to_bigraph().
  • node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.
  • edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.
__getitem__(item)

Get datapoint with index

Parameters:item (int) – Datapoint index
Returns:
  • str – SMILES for the ith datapoint
  • DGLGraph – DGLGraph for the ith datapoint
  • Tensor of dtype float32 – Labels of the datapoint for all tasks
  • Tensor of dtype float32 – Binary masks indicating the existence of labels for all tasks
__len__()

Length of the dataset

Returns:Length of Dataset
Return type:int
class dgl.data.chem.PDBBind(subset, load_binding_pocket=True, add_hydrogens=False, sanitize=False, calc_charges=False, remove_hs=False, use_conformation=True, construct_graph_and_featurize=<function ACNN_graph_construction_and_featurization>, zero_padding=True, num_processes=64)[source]

PDBbind dataset processed by MoleculeNet.

The description below is mainly based on [1]. The PDBBind database consists of experimentally measured binding affinities for bio-molecular complexes [2], [3]. It provides detailed 3D Cartesian coordinates of both ligands and their target proteins derived from experimental (e.g., X-ray crystallography) measurements. The availability of coordinates of the protein-ligand complexes permits structure-based featurization that is aware of the protein-ligand binding geometry. The authors of [1] use the “refined” and “core” subsets of the database [4], more carefully processed for data artifacts, as additional benchmarking targets.

References

  • [1] MoleculeNet: a benchmark for molecular machine learning
  • [2] The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures
  • [3] The PDBbind database: methodologies and updates
  • [4] PDB-wide collection of binding data: current status of the PDBbind database

Parameters:
  • subset (str) – In MoleculeNet, we can use either the “refined” subset or the “core” subset. We can retrieve them by setting subset to be 'refined' or 'core'. The size of the 'core' set is 195 and the size of the 'refined' set is 3706.
  • load_binding_pocket (bool) – Whether to load binding pockets or full proteins. Default to True.
  • add_hydrogens (bool) – Whether to add hydrogens via pdbfixer. Default to False.
  • sanitize (bool) – Whether sanitization is performed in initializing RDKit molecule instances. See https://www.rdkit.org/docs/RDKit_Book.html for details of the sanitization. Default to False.
  • calc_charges (bool) – Whether to add Gasteiger charges via RDKit. Setting this to be True will enforce add_hydrogens and sanitize to be True. Default to False.
  • remove_hs (bool) – Whether to remove hydrogens via RDKit. Note that removing hydrogens can be quite slow for large molecules. Default to False.
  • use_conformation (bool) – Whether we need to extract molecular conformation from proteins and ligands. Default to True.
  • construct_graph_and_featurize (callable) – Construct a DGLHeteroGraph for the use of GNNs. Mapping self.ligand_mols[i], self.protein_mols[i], self.ligand_coordinates[i] and self.protein_coordinates[i] to a DGLHeteroGraph. Default to ACNN_graph_construction_and_featurization().
  • zero_padding (bool) – Whether to perform zero padding. While DGL does not necessarily require zero padding, pooling operations for variable length inputs can introduce stochastic behaviour, which is not desired for sensitive scenarios. Default to True.
  • num_processes (int or None) – Number of worker processes to use. If None, then we will use the number of CPUs in the system. Default to 64.
__getitem__(item)[source]

Get the datapoint associated with the index.

Parameters:item (int) – Index for the datapoint.
Returns:
  • int – Index for the datapoint.
  • rdkit.Chem.rdchem.Mol – RDKit molecule instance for the ligand molecule.
  • rdkit.Chem.rdchem.Mol – RDKit molecule instance for the protein molecule.
  • DGLHeteroGraph – Pre-processed DGLHeteroGraph with features extracted.
  • Float32 tensor – Label for the datapoint.
__len__()[source]

Get the size of the dataset.

Returns:Number of valid ligand-protein pairs in the dataset.
Return type:int

Dataset Splitting

We provide support for some common data splitting methods:

  • consecutive split
  • random split
  • molecular weight split
  • Bemis-Murcko scaffold split
  • single-task-stratified split
class dgl.data.chem.ConsecutiveSplitter[source]

Split datasets with the input order.

The dataset is split without permutation, so the splitting is deterministic.

static k_fold_split(dataset, k=5, log=True)[source]

Split the dataset for k-fold cross validation by taking consecutive chunks.

Parameters:
  • dataset – We assume len(dataset) gives the size for the dataset and dataset[i] gives the ith datapoint.
  • k (int) – Number of folds to use and should be no smaller than 2. Default to be 5.
  • log (bool) – Whether to print a message at the start of preparing each fold.
Returns:

Each element of the list represents a fold and is a 2-tuple (train_set, val_set).

Return type:

list of 2-tuples
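The consecutive k-fold logic described above can be sketched in plain Python (a simplified analogue; the DGL method returns Subset instances rather than lists):

```python
def k_fold_split(dataset, k=5):
    """Consecutive-chunk k-fold split.

    Fold i uses the i-th consecutive chunk as the validation set and
    everything else as the training set; no permutation is applied.
    """
    assert k >= 2, 'k should be no smaller than 2'
    n = len(dataset)
    chunk = n // k
    folds = []
    for i in range(k):
        start = i * chunk
        # The last fold absorbs any remainder from integer division.
        end = n if i == k - 1 else start + chunk
        val = list(dataset[start:end])
        train = list(dataset[:start]) + list(dataset[end:])
        folds.append((train, val))
    return folds

folds = k_fold_split(list(range(10)), k=5)
```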

static train_val_test_split(dataset, frac_train=0.8, frac_val=0.1, frac_test=0.1)[source]

Split the dataset into three consecutive chunks for training, validation and test.

Parameters:
  • dataset – We assume len(dataset) gives the size for the dataset and dataset[i] gives the ith datapoint.
  • frac_train (float) – Fraction of data to use for training. By default, we set this to be 0.8, i.e. 80% of the dataset is used for training.
  • frac_val (float) – Fraction of data to use for validation. By default, we set this to be 0.1, i.e. 10% of the dataset is used for validation.
  • frac_test (float) – Fraction of data to use for test. By default, we set this to be 0.1, i.e. 10% of the dataset is used for test.
Returns:

Subsets for training, validation and test, which are all Subset instances.

Return type:

list of length 3

class dgl.data.chem.RandomSplitter[source]

Randomly reorder datasets and then split them.

The dataset is split with permutation and the splitting is hence random.

static k_fold_split(dataset, k=5, random_state=None, log=True)[source]

Randomly permute the dataset and then split it for k-fold cross validation by taking consecutive chunks.

Parameters:
  • dataset – We assume len(dataset) gives the size for the dataset and dataset[i] gives the ith datapoint.
  • k (int) – Number of folds to use and should be no smaller than 2. Default to be 5.
  • random_state (None, int or array_like, optional) – Random seed used to initialize the pseudo-random number generator. Can be any integer between 0 and 2**32 - 1 inclusive, an array (or other sequence) of such integers, or None (the default). If seed is None, then RandomState will try to read data from /dev/urandom (or the Windows analogue) if available or seed from the clock otherwise.
  • log (bool) – Whether to print a message at the start of preparing each fold. Default to True.
Returns:

Each element of the list represents a fold and is a 2-tuple (train_set, val_set).

Return type:

list of 2-tuples

static train_val_test_split(dataset, frac_train=0.8, frac_val=0.1, frac_test=0.1, random_state=None)[source]

Randomly permute the dataset and then split it into three consecutive chunks for training, validation and test.

Parameters:
  • dataset – We assume len(dataset) gives the size for the dataset and dataset[i] gives the ith datapoint.
  • frac_train (float) – Fraction of data to use for training. By default, we set this to be 0.8, i.e. 80% of the dataset is used for training.
  • frac_val (float) – Fraction of data to use for validation. By default, we set this to be 0.1, i.e. 10% of the dataset is used for validation.
  • frac_test (float) – Fraction of data to use for test. By default, we set this to be 0.1, i.e. 10% of the dataset is used for test.
  • random_state (None, int or array_like, optional) – Random seed used to initialize the pseudo-random number generator. Can be any integer between 0 and 2**32 - 1 inclusive, an array (or other sequence) of such integers, or None (the default). If seed is None, then RandomState will try to read data from /dev/urandom (or the Windows analogue) if available or seed from the clock otherwise.
Returns:

Subsets for training, validation and test.

Return type:

list of length 3

class dgl.data.chem.MolecularWeightSplitter[source]

Sort molecules based on their weights and then split them.

static k_fold_split(dataset, mols=None, sanitize=True, k=5, log_every_n=1000)[source]

Sort molecules based on their weights and then split them for k-fold cross validation by taking consecutive chunks.

Parameters:
  • dataset – We assume len(dataset) gives the size for the dataset, dataset[i] gives the ith datapoint and dataset.smiles[i] gives the SMILES for the ith datapoint.
  • mols (None or list of rdkit.Chem.rdchem.Mol) – None or pre-computed RDKit molecule instances. If not None, we expect a one-on-one correspondence between dataset.smiles and mols, i.e. mols[i] corresponds to dataset.smiles[i]. Default to None.
  • sanitize (bool) – This argument only comes into effect when mols is None and decides whether sanitization is performed in initializing RDKit molecule instances. See https://www.rdkit.org/docs/RDKit_Book.html for details of the sanitization. Default to be True.
  • k (int) – Number of folds to use and should be no smaller than 2. Default to be 5.
  • log_every_n (None or int) – Molecule related computation can take a long time for a large dataset and we want to learn the progress of processing. This can be done by printing a message whenever a batch of log_every_n molecules have been processed. If None, no messages will be printed. Default to 1000.
Returns:

Each element of the list represents a fold and is a 2-tuple (train_set, val_set).

Return type:

list of 2-tuples

static train_val_test_split(dataset, mols=None, sanitize=True, frac_train=0.8, frac_val=0.1, frac_test=0.1, log_every_n=1000)[source]

Sort molecules based on their weights and then split them into three consecutive chunks for training, validation and test.

Parameters:
  • dataset – We assume len(dataset) gives the size for the dataset, dataset[i] gives the ith datapoint and dataset.smiles[i] gives the SMILES for the ith datapoint.
  • mols (None or list of rdkit.Chem.rdchem.Mol) – None or pre-computed RDKit molecule instances. If not None, we expect a one-on-one correspondence between dataset.smiles and mols, i.e. mols[i] corresponds to dataset.smiles[i]. Default to None.
  • sanitize (bool) – This argument only comes into effect when mols is None and decides whether sanitization is performed in initializing RDKit molecule instances. See https://www.rdkit.org/docs/RDKit_Book.html for details of the sanitization. Default to be True.
  • frac_train (float) – Fraction of data to use for training. By default, we set this to be 0.8, i.e. 80% of the dataset is used for training.
  • frac_val (float) – Fraction of data to use for validation. By default, we set this to be 0.1, i.e. 10% of the dataset is used for validation.
  • frac_test (float) – Fraction of data to use for test. By default, we set this to be 0.1, i.e. 10% of the dataset is used for test.
  • log_every_n (None or int) – Molecule related computation can take a long time for a large dataset and we want to learn the progress of processing. This can be done by printing a message whenever a batch of log_every_n molecules have been processed. If None, no messages will be printed. Default to 1000.
Returns:

Subsets for training, validation and test, which are all Subset instances.

Return type:

list of length 3
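With the default fractions, the split reduces to slicing the weight-sorted index list into three consecutive chunks. A minimal sketch with a hypothetical helper (`three_way_split`), assuming the indices have already been sorted by molecular weight — the real method does the sorting itself via RDKit and returns Subset instances:

```python
def three_way_split(sorted_indices, frac_train=0.8, frac_val=0.1):
    """Slice a pre-sorted index list into consecutive train/val/test chunks."""
    n = len(sorted_indices)
    n_train = int(frac_train * n)
    n_val = int(frac_val * n)
    train = sorted_indices[:n_train]
    val = sorted_indices[n_train:n_train + n_val]
    test = sorted_indices[n_train + n_val:]
    return train, val, test

# 20 datapoints, already in sorted order
train, val, test = three_way_split(list(range(20)))
```

Note that with this kind of consecutive split, the test set always contains the heaviest molecules, which makes it a deliberately hard out-of-distribution evaluation.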

class dgl.data.chem.ScaffoldSplitter[source]

Group molecules based on their Bemis-Murcko scaffolds and then split the groups.

Group molecules so that all molecules in a group share the same scaffold (see the reference below). The dataset is then split at the level of these groups.

References

Bemis, G. W.; Murcko, M. A. “The Properties of Known Drugs. 1. Molecular Frameworks.” J. Med. Chem. 39:2887-93 (1996).
static k_fold_split(dataset, mols=None, sanitize=True, include_chirality=False, k=5, log_every_n=1000)[source]

Group molecules based on their scaffolds and sort groups based on their sizes. The groups are then split for k-fold cross validation.

As in usual k-fold splitting, each molecule appears exactly once in a validation set across all folds. In addition, this method ensures that molecules sharing a scaffold are placed collectively in either the training set or the validation set of each fold.

Note that the folds can be highly imbalanced depending on the scaffold distribution in the dataset.

Parameters:
  • dataset – We assume len(dataset) gives the size of the dataset, dataset[i] gives the ith datapoint and dataset.smiles[i] gives the SMILES string for the ith datapoint.
  • mols (None or list of rdkit.Chem.rdchem.Mol) – Pre-computed RDKit molecule instances. If not None, we expect a one-to-one correspondence between dataset.smiles and mols, i.e. mols[i] corresponds to dataset.smiles[i]. Defaults to None.
  • sanitize (bool) – Only takes effect when mols is None; decides whether sanitization is performed when initializing the RDKit molecule instances. See https://www.rdkit.org/docs/RDKit_Book.html for details on sanitization. Defaults to True.
  • include_chirality (bool) – Whether to consider chirality when computing scaffolds. Defaults to False.
  • k (int) – Number of folds; must be at least 2. Defaults to 5.
  • log_every_n (None or int) – Molecule-related computation can take a long time for a large dataset, so it helps to track progress. A message is printed whenever a batch of log_every_n molecules has been processed. If None, no messages are printed. Defaults to 1000.
Returns:

Each element of the list represents a fold and is a 2-tuple (train_set, val_set).

Return type:

list of 2-tuples

static train_val_test_split(dataset, mols=None, sanitize=True, include_chirality=False, frac_train=0.8, frac_val=0.1, frac_test=0.1, log_every_n=1000)[source]

Split the dataset into training, validation and test set based on molecular scaffolds.

This splitting method ensures that molecules sharing a scaffold fall into exactly one of the training, validation or test set. As a result, the fractions of the dataset actually used for training and validation tend to be smaller than frac_train and frac_val, while the fraction used for test tends to be larger than frac_test.

Parameters:
  • dataset – We assume len(dataset) gives the size of the dataset, dataset[i] gives the ith datapoint and dataset.smiles[i] gives the SMILES string for the ith datapoint.
  • mols (None or list of rdkit.Chem.rdchem.Mol) – Pre-computed RDKit molecule instances. If not None, we expect a one-to-one correspondence between dataset.smiles and mols, i.e. mols[i] corresponds to dataset.smiles[i]. Defaults to None.
  • sanitize (bool) – Only takes effect when mols is None; decides whether sanitization is performed when initializing the RDKit molecule instances. See https://www.rdkit.org/docs/RDKit_Book.html for details on sanitization. Defaults to True.
  • include_chirality (bool) – Whether to consider chirality when computing scaffolds. Defaults to False.
  • frac_train (float) – Fraction of data to use for training. Defaults to 0.8, i.e. 80% of the dataset is used for training.
  • frac_val (float) – Fraction of data to use for validation. Defaults to 0.1.
  • frac_test (float) – Fraction of data to use for test. Defaults to 0.1.
  • log_every_n (None or int) – Molecule-related computation can take a long time for a large dataset, so it helps to track progress. A message is printed whenever a batch of log_every_n molecules has been processed. If None, no messages are printed. Defaults to 1000.
Returns:

Subsets for training, validation and test, which are all Subset instances.

Return type:

list of length 3
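The group-then-split idea can be illustrated with precomputed scaffold labels — here plain strings stand in for the Bemis-Murcko scaffolds that the real splitter derives with RDKit. This is a simplified sketch with a hypothetical helper name (`scaffold_split`); the exact group ordering and assignment rule in DGL may differ, but the key property holds: a scaffold group is never split across subsets.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_val=0.1):
    """scaffolds[i] is the scaffold identifier of datapoint i."""
    groups = defaultdict(list)
    for i, s in enumerate(scaffolds):
        groups[s].append(i)
    # visit larger scaffold groups first
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train, val, test = [], [], []
    for group in ordered:
        # assign the whole group to the first subset with room for it
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(val) + len(group) <= frac_val * n:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test

# eight molecules share scaffold "A"; "B" and "C" are singletons
scaffolds = ["A"] * 8 + ["B"] + ["C"]
train, val, test = scaffold_split(scaffolds)
```

Because whole groups are assigned at once, the realized test fraction can exceed frac_test whenever a group does not fit into the training or validation budget, matching the note above.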

class dgl.data.chem.SingleTaskStratifiedSplitter[source]

Splits the dataset by stratification on a single task.

We sort the molecules based on their label values for the task and then repeatedly take consecutive buckets of datapoints, distributing each bucket among the training, validation and test subsets.

static k_fold_split(dataset, labels, task_id, k=5, log=True)[source]

Sort molecules based on their label values for a task and then split them for k-fold cross validation by taking consecutive chunks.

Parameters:
  • dataset – We assume len(dataset) gives the size of the dataset, dataset[i] gives the ith datapoint and dataset.smiles[i] gives the SMILES string for the ith datapoint.
  • labels (tensor of shape (N, T)) – Dataset labels for all tasks, where N is the number of datapoints and T is the number of tasks.
  • task_id (int) – Index of the task to stratify on.
  • k (int) – Number of folds; must be at least 2. Defaults to 5.
  • log (bool) – Whether to print a message at the start of preparing each fold. Defaults to True.
Returns:

Each element of the list represents a fold and is a 2-tuple (train_set, val_set).

Return type:

list of 2-tuples

static train_val_test_split(dataset, labels, task_id, frac_train=0.8, frac_val=0.1, frac_test=0.1, bucket_size=10, random_state=None)[source]

Split the dataset into training, validation and test subsets as stated above.

Parameters:
  • dataset – We assume len(dataset) gives the size of the dataset, dataset[i] gives the ith datapoint and dataset.smiles[i] gives the SMILES string for the ith datapoint.
  • labels (tensor of shape (N, T)) – Dataset labels for all tasks, where N is the number of datapoints and T is the number of tasks.
  • task_id (int) – Index of the task to stratify on.
  • frac_train (float) – Fraction of data to use for training. Defaults to 0.8, i.e. 80% of the dataset is used for training.
  • frac_val (float) – Fraction of data to use for validation. Defaults to 0.1.
  • frac_test (float) – Fraction of data to use for test. Defaults to 0.1.
  • bucket_size (int) – Number of datapoints per bucket. Defaults to 10.
  • random_state (None, int or array_like, optional) – Seed for the pseudo-random number generator. Can be any integer between 0 and 2**32 - 1 inclusive, an array (or other sequence) of such integers, or None (the default). If None, RandomState will try to read data from /dev/urandom (or the Windows analogue) if available, or seed from the clock otherwise.
Returns:

Subsets for training, validation and test, which are all Subset instances.

Return type:

list of length 3
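The bucketed stratification described above can be sketched as a standalone function — a hypothetical helper (`stratified_split`) operating on a 1-D label array for a single task, whereas the actual DGL method takes the full (N, T) label tensor plus a task_id and returns Subset instances. Sorting by label and splitting each bucket proportionally keeps the label distributions of the three subsets similar:

```python
import numpy as np

def stratified_split(labels, frac_train=0.8, frac_val=0.1,
                     bucket_size=10, random_state=0):
    """labels: 1-D array of label values for a single task."""
    rng = np.random.RandomState(random_state)
    order = np.argsort(labels)  # indices sorted by label value
    train, val, test = [], [], []
    for start in range(0, len(order), bucket_size):
        bucket = rng.permutation(order[start:start + bucket_size])
        n_train = int(frac_train * len(bucket))
        n_val = int(frac_val * len(bucket))
        # each bucket contributes proportionally to all three subsets
        train.extend(bucket[:n_train])
        val.extend(bucket[n_train:n_train + n_val])
        test.extend(bucket[n_train + n_val:])
    return train, val, test

labels = np.arange(100, dtype=float)
train, val, test = stratified_split(labels)
```

With 100 datapoints and buckets of 10, every bucket contributes 8 training, 1 validation and 1 test datapoint, so each subset spans the full label range.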