Dataset

Utils

utils.get_download_dir() Get the absolute path to the download directory.
utils.download(url[, path, overwrite, …]) Download a given URL.
utils.check_sha1(filename, sha1_hash) Check whether the sha1 hash of the file content matches the expected hash.
utils.extract_archive(file, target_dir) Extract archive file.
utils.split_dataset(dataset[, frac_list, …]) Split dataset into training, validation and test set.
utils.save_graphs(filename, g_list[, labels]) Save DGLGraphs and graph labels to file
utils.load_graphs(filename[, idx_list]) Load DGLGraphs from file
utils.load_labels(filename) Load label dict from file
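
The hash check described for utils.check_sha1 can be pictured with the standard library alone. The sketch below is not DGL's implementation: the function name sha1_matches and the chunk size are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def sha1_matches(filename, sha1_hash, chunk_size=1 << 20):
    """Return True if the SHA-1 digest of the file content equals sha1_hash."""
    digest = hashlib.sha1()
    with open(filename, "rb") as f:
        # Read in chunks so that large downloads need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == sha1_hash


# Usage: write a small file and compare it against an independently computed digest.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello graph")
    path = tmp.name

expected = hashlib.sha1(b"hello graph").hexdigest()
ok = sha1_matches(path, expected)
bad = sha1_matches(path, "0" * 40)
os.remove(path)
```
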
class dgl.data.utils.Subset(dataset, indices)[source]

Subset of a dataset at specified indices

Code adapted from PyTorch.

Parameters:
  • dataset – dataset[i] should return the ith datapoint
  • indices (list) – List of datapoint indices to construct the subset
__getitem__(item)[source]

Get the datapoint indexed by item

Returns:datapoint
Return type:tuple
__len__()[source]

Get subset size

Returns:Number of datapoints in the subset
Return type:int
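
To make the index mapping concrete, here is a minimal plain-Python sketch of a subset view, together with a fraction-based split in the spirit of utils.split_dataset. The names SubsetSketch and split_by_fraction, the shuffle and seed arguments, and the rounding rule are assumptions for illustration, not DGL's code.

```python
import random


class SubsetSketch:
    """A view of `dataset` restricted to the positions listed in `indices`."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, item):
        # Translate the subset position into a position in the full dataset.
        return self.dataset[self.indices[item]]

    def __len__(self):
        return len(self.indices)


def split_by_fraction(dataset, frac_list=(0.8, 0.1, 0.1), shuffle=False, seed=0):
    """Split a dataset into consecutive subsets sized by frac_list."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    subsets, start = [], 0
    for i, frac in enumerate(frac_list):
        # The last subset takes whatever remains, which avoids rounding gaps.
        if i == len(frac_list) - 1:
            end = len(indices)
        else:
            end = start + int(frac * len(indices))
        subsets.append(SubsetSketch(dataset, indices[start:end]))
        start = end
    return subsets


# Ten toy (name, label) datapoints standing in for (graph, label) pairs.
data = [(f"graph_{i}", i % 2) for i in range(10)]
train, val, test = split_by_fraction(data)
```

Because each subset stores only indices, the underlying datapoints are shared rather than copied.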

Dataset Classes

Stanford sentiment treebank dataset

For more information about the dataset, see Sentiment Analysis.

class dgl.data.SST(mode='train', vocab_file=None)[source]

Stanford Sentiment Treebank dataset.

Each sample is the constituency tree of a sentence. The leaf nodes represent words. Each word is an int value stored in the x feature field. Non-leaf nodes have the special value PAD_WORD in the x field. Each node also has a sentiment annotation with 5 classes (very negative, negative, neutral, positive and very positive). The sentiment label is an int value stored in the y feature field.

Note

This dataset class is compatible with pytorch’s Dataset class.

Note

All the samples will be loaded and preprocessed in the memory first.

Parameters:
  • mode (str, optional) – One of 'train', 'val' or 'test'; specifies which data file to use.
  • vocab_file (str, optional) – Optional vocabulary file.
__getitem__(idx)[source]

Get the tree with index idx.

Parameters:idx (int) – Tree index.
Returns:Tree.
Return type:dgl.DGLGraph
__len__()[source]

Get the number of trees in the dataset.

Returns:Number of trees.
Return type:int
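
The node layout described above (word ids in x on the leaves, a placeholder on internal nodes, a sentiment class in y on every node) can be illustrated without DGL. In this sketch the vocabulary, the numeric value of PAD_WORD, the 0 to 4 label ordering and the child-to-parent edge direction are all assumptions chosen for illustration.

```python
PAD_WORD = -1  # assumed placeholder value for non-leaf nodes (illustration only)

# Hypothetical vocabulary mapping words to integer ids.
vocab = {"not": 0, "bad": 1}

# Constituency tree for "not bad": node 0 is the root, nodes 1 and 2 are leaves.
# Labels use 0..4 for very negative .. very positive (assumed ordering).
nodes = {
    0: {"x": PAD_WORD, "y": 3},      # phrase "not bad": positive
    1: {"x": vocab["not"], "y": 2},  # word "not": neutral
    2: {"x": vocab["bad"], "y": 1},  # word "bad": negative
}
edges = [(1, 0), (2, 0)]  # (child, parent) pairs

# Leaves are the nodes that never appear as a parent.
parents = {parent for _, parent in edges}
leaves = sorted(n for n in nodes if n not in parents)
leaf_words = [nodes[n]["x"] for n in leaves]
```
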

Karate Club dataset

class dgl.data.KarateClub[source]

Zachary’s karate club is a social network of a university karate club, described in the paper “An Information Flow Model for Conflict and Fission in Small Groups” by Wayne W. Zachary. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002.

This dataset contains a single graph. The node feature ‘label’ (stored in ndata) indicates whether the node belongs to the “Mr. Hi” club.

Citation Network dataset

class dgl.data.CitationGraphDataset(name)[source]

The citation graph dataset, including citeseer and pubmed. Nodes represent papers and edges represent citation relationships.

Parameters:name (str) – name can be ‘citeseer’ or ‘pubmed’.

Cora Citation Network dataset

class dgl.data.CoraDataset[source]

Cora citation network dataset. Nodes represent papers and edges represent citation relationships.

CoraFull dataset

class dgl.data.CoraFull[source]

Extended Cora dataset from Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking. Nodes represent papers and edges represent citations.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Amazon Co-Purchase dataset

class dgl.data.AmazonCoBuy(name)[source]

Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Parameters:name (str) – Name of the dataset, has to be ‘computers’ or ‘photo’

Coauthor dataset

class dgl.data.Coauthor(name)[source]

Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Here, nodes are authors, which are connected by an edge if they co-authored a paper; node features represent paper keywords for each author’s papers, and class labels indicate the most active fields of study for each author.

Parameters:name (str) – Name of the dataset, has to be ‘cs’ or ‘physics’

BitcoinOTC dataset

class dgl.data.BitcoinOTC[source]

This is a who-trusts-whom network of people who trade using Bitcoin on a platform called Bitcoin OTC. Since Bitcoin users are anonymous, there is a need to maintain a record of users’ reputation to prevent transactions with fraudulent and risky users. Members of Bitcoin OTC rate other members on a scale of -10 (total distrust) to +10 (total trust) in steps of 1.

Reference:
  • Bitcoin OTC trust weighted signed network
  • EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs

ICEWS18 dataset

class dgl.data.ICEWS18(mode)[source]

Integrated Crisis Early Warning System (ICEWS18) event data consists of coded interactions between socio-political actors (i.e., cooperative or hostile actions between individuals, groups, sectors and nation states). The dataset covers events from 1/1/2018 to 10/31/2018 at a 24-hour time granularity.

Reference:
  • Recurrent Event Network for Reasoning over Temporal Knowledge Graphs
  • ICEWS Coded Event Data

Parameters:mode (str) – Load train/valid/test data. Has to be one of [‘train’, ‘valid’, ‘test’]

QM7b dataset

class dgl.data.QM7b[source]

This dataset consists of 7,211 molecules with 14 regression targets. Nodes represent atoms and edges represent bonds. The edge feature ‘h’ holds the corresponding entry of the Coulomb matrix.

Reference: - QM7b Dataset
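
For orientation, the Coulomb matrix of a molecule is commonly defined by C_ii = 0.5 * Z_i^2.4 on the diagonal and C_ij = Z_i * Z_j / |R_i - R_j| off the diagonal, where Z is the nuclear charge and R the atom position. The plain-Python sketch below computes it for a toy input. The function name is hypothetical, and how the dataset maps these entries onto edges or which units it uses is not covered here.

```python
import math


def coulomb_matrix(charges, positions):
    """Build the Coulomb matrix from nuclear charges and 3D coordinates."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: a fit to the energy of the isolated atom.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: pairwise nuclear repulsion.
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Toy input: two unit charges placed 2 length units apart.
cm = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)])
```
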

GDELT dataset

class dgl.data.GDELT(mode)[source]

The Global Database of Events, Language, and Tone (GDELT) dataset. It contains events that happened all over the world (i.e., every protest held anywhere in Russia on a given day is collapsed to a single entry).

The dataset consists of events collected from 1/1/2018 to 1/31/2018 at a 15-minute time granularity.

Reference:
  • Recurrent Event Network for Reasoning over Temporal Knowledge Graphs
  • The Global Database of Events, Language, and Tone (GDELT)

Parameters:mode (str) – Load train/valid/test data. Has to be one of [‘train’, ‘valid’, ‘test’]

Mini graph classification dataset

class dgl.data.MiniGCDataset(num_graphs, min_num_v, max_num_v)[source]

The dataset class.

The dataset contains 8 different types of graphs.

  • class 0 : cycle graph
  • class 1 : star graph
  • class 2 : wheel graph
  • class 3 : lollipop graph
  • class 4 : hypercube graph
  • class 5 : grid graph
  • class 6 : clique graph
  • class 7 : circular ladder graph

Note

This dataset class is compatible with pytorch’s Dataset class.

Parameters:
  • num_graphs (int) – Number of graphs in this dataset.
  • min_num_v (int) – Minimum number of nodes for graphs
  • max_num_v (int) – Maximum number of nodes for graphs
__getitem__(idx)[source]

Get the i-th sample.

Parameters:idx (int) – The sample index.
Returns:The graph and its label.
Return type:(dgl.DGLGraph, int)
__len__()[source]

Return the number of graphs in the dataset.

num_classes

Number of classes.
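
The idea of a synthetic dataset whose label is the graph family can be sketched in plain Python. Only two of the eight families are generated here (cycle and star), graphs are represented as (num_nodes, edge_list) pairs rather than DGLGraphs, and the function names, the seed argument and the inclusive upper bound on the node count are assumptions for illustration.

```python
import random


def cycle_edges(n):
    """Edges of a cycle on n nodes: i -> (i + 1) mod n."""
    return [(i, (i + 1) % n) for i in range(n)]


def star_edges(n):
    """Edges of a star on n nodes: centre 0 connected to every other node."""
    return [(0, i) for i in range(1, n)]


def make_toy_dataset(num_graphs, min_num_v, max_num_v, seed=0):
    """Return a list of ((num_nodes, edges), label) samples."""
    rng = random.Random(seed)
    builders = [cycle_edges, star_edges]  # class 0 and class 1
    samples = []
    for k in range(num_graphs):
        label = k % len(builders)
        num_v = rng.randint(min_num_v, max_num_v)  # inclusive bounds (assumption)
        samples.append(((num_v, builders[label](num_v)), label))
    return samples


toy = make_toy_dataset(num_graphs=4, min_num_v=5, max_num_v=8)
(num_v, edges), label = toy[0]
```
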

Graph kernel dataset

For more information about the dataset, see Benchmark Data Sets for Graph Kernels.

class dgl.data.TUDataset(name)[source]

TUDataset contains many graph kernel datasets for graph classification. Graphs may have node labels, node attributes, edge labels, and edge attributes, depending on the dataset.

Parameters:name – Dataset name, such as ENZYMES, DD, COLLAB or MUTAG; can be any dataset name listed on https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets.

__getitem__(idx)[source]

Get the i-th sample.

Parameters:idx (int) – The sample index.
Returns:DGLGraph with node features stored in the feat field and node labels in node_label if available, together with the graph label.
Return type:(dgl.DGLGraph, int)

Graph isomorphism network dataset

A compact subset of graph kernel dataset

class dgl.data.GINDataset(name, self_loop, degree_as_nlabel=False)[source]

Datasets for Graph Isomorphism Network (GIN). Adapted from https://github.com/weihua916/powerful-gnns/blob/master/dataset.zip.

The dataset contains the compact format of popular graph kernel datasets, which includes: MUTAG, COLLAB, IMDBBINARY, IMDBMULTI, NCI1, PROTEINS, PTC, REDDITBINARY, REDDITMULTI5K

This dataset class processes all datasets listed above. For more graph kernel datasets, see TUDataset.

Parameters:
  • name (str) – Dataset name, one of ‘MUTAG’, ‘COLLAB’, ‘IMDBBINARY’, ‘IMDBMULTI’, ‘NCI1’, ‘PROTEINS’, ‘PTC’, ‘REDDITBINARY’, ‘REDDITMULTI5K’.
  • self_loop (bool) – Add a self-loop edge to each node if True.
  • degree_as_nlabel (bool) – Use the node degree as node label and feature if True.
__getitem__(idx)[source]

Get the i-th sample.

Parameters:idx (int) – The sample index.
Returns:The graph and its label.
Return type:(dgl.DGLGraph, int)
__len__()[source]

Return the number of graphs in the dataset.
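
The effect of the self_loop and degree_as_nlabel options can be pictured on a plain edge list. This is only a sketch under stated assumptions: the helper names are hypothetical, in-degree is used as "the degree", and the degree is computed before self-loops are added; the source does not specify these details.

```python
from collections import Counter


def add_self_loops(num_nodes, edges):
    """Return edges plus one (v, v) edge per node, skipping loops already present."""
    existing = set(edges)
    return list(edges) + [(v, v) for v in range(num_nodes) if (v, v) not in existing]


def degree_labels(num_nodes, edges):
    """Use each node's in-degree as its label."""
    in_deg = Counter(dst for _, dst in edges)
    return [in_deg[v] for v in range(num_nodes)]


def one_hot(labels):
    """Turn integer labels into one-hot feature rows over the observed label set."""
    classes = sorted(set(labels))
    return [[1 if lab == c else 0 for c in classes] for lab in labels]


# A path 0 - 1 - 2 stored as directed edges in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
labels = degree_labels(3, edges)
features = one_hot(labels)
looped = add_self_loops(3, edges)
```
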

Protein-Protein Interaction dataset

class dgl.data.PPIDataset(mode)[source]

A toy Protein-Protein Interaction network dataset.

Adapted from https://github.com/williamleif/GraphSAGE/tree/master/example_data.

The dataset contains 24 graphs. The average number of nodes per graph is 2372. Each node has 50 features and 121 labels.

We use 20 graphs for training, 2 for validation and 2 for testing.

__getitem__(item)[source]

Get the i-th sample.

Parameters:item (int) – The sample index.
Returns:The graph and its label.
Return type:(dgl.DGLGraph, ndarray)
__len__()[source]

Return number of samples in this dataset.

Molecular Graphs

To work on molecular graphs, make sure you have installed RDKit 2018.09.3.

Featurization Utils

For the use of graph neural networks, we need to featurize nodes (atoms) and edges (bonds).

General utils:

chem.one_hot_encoding(x, allowable_set[, …]) One-hot encoding.
chem.ConcatFeaturizer(func_list) Concatenate the evaluation results of multiple functions as a single feature.
chem.ConcatFeaturizer.__call__(x) Featurize the input data.
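
As a rough picture of how one-hot encoding and featurizer concatenation fit together, consider the plain-Python sketch below. It does not use RDKit or DGL: the dict-based "atom", the names one_hot_sketch and ConcatSketch, and the with_unknown_slot flag are hypothetical stand-ins chosen for illustration.

```python
def one_hot_sketch(x, allowable_set, with_unknown_slot=False):
    """One-hot encode x against allowable_set.

    With with_unknown_slot=True, an extra final slot marks values outside the set.
    """
    encoding = [x == s for s in allowable_set]
    if with_unknown_slot:
        encoding.append(x not in allowable_set)
    return encoding


class ConcatSketch:
    """Apply several featurization functions and concatenate their outputs."""

    def __init__(self, func_list):
        self.func_list = func_list

    def __call__(self, x):
        features = []
        for func in self.func_list:
            features.extend(func(x))
        return features


# Hypothetical atom record standing in for an RDKit atom object.
atom = {"symbol": "N", "degree": 3}

featurizer = ConcatSketch([
    lambda a: one_hot_sketch(a["symbol"], ["C", "N", "O"]),
    lambda a: one_hot_sketch(a["degree"], [0, 1, 2, 3, 4]),
    lambda a: [float(a["degree"])],
])
feature = featurizer(atom)
```

Each function returns a list, so the final feature length is simply the sum of the individual lengths (3 + 5 + 1 in this example).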

Utils for atom featurization:

chem.atom_type_one_hot(atom[, …]) One hot encoding for the type of an atom.
chem.atomic_number_one_hot(atom[, …]) One hot encoding for the atomic number of an atom.
chem.atomic_number(atom) Get the atomic number for an atom.
chem.atom_degree_one_hot(atom[, …]) One hot encoding for the degree of an atom.
chem.atom_degree(atom) Get the degree of an atom.
chem.atom_total_degree_one_hot(atom[, …]) One hot encoding for the degree of an atom including Hs.
chem.atom_total_degree(atom) Get the degree of an atom including Hs. See also atom_degree().

chem.atom_implicit_valence_one_hot(atom[, …]) One hot encoding for the implicit valences of an atom.
chem.atom_implicit_valence(atom) Get the implicit valence of an atom.
chem.atom_hybridization_one_hot(atom[, …]) One hot encoding for the hybridization of an atom.
chem.atom_total_num_H_one_hot(atom[, …]) One hot encoding for the total number of Hs of an atom.
chem.atom_total_num_H(atom) Get the total number of Hs of an atom.
chem.atom_formal_charge_one_hot(atom[, …]) One hot encoding for the formal charge of an atom.
chem.atom_formal_charge(atom) Get formal charge for an atom.
chem.atom_num_radical_electrons_one_hot(atom) One hot encoding for the number of radical electrons of an atom.
chem.atom_num_radical_electrons(atom) Get the number of radical electrons for an atom.
chem.atom_is_aromatic_one_hot(atom[, …]) One hot encoding for whether the atom is aromatic.
chem.atom_is_aromatic(atom) Get whether the atom is aromatic.
chem.atom_chiral_tag_one_hot(atom[, …]) One hot encoding for the chiral tag of an atom.
chem.atom_mass(atom[, coef]) Get the mass of an atom and scale it.
chem.BaseAtomFeaturizer(featurizer_funcs[, …]) An abstract class for atom featurizers.
chem.BaseAtomFeaturizer.feat_size(feat_name) Get the feature size for feat_name.
chem.BaseAtomFeaturizer.__call__(mol) Featurize all atoms in a molecule.
chem.CanonicalAtomFeaturizer([atom_data_field]) A default featurizer for atoms.

Utils for bond featurization:

chem.bond_type_one_hot(bond[, …]) One hot encoding for the type of a bond.
chem.bond_is_conjugated_one_hot(bond[, …]) One hot encoding for whether the bond is conjugated.
chem.bond_is_conjugated(bond) Get whether the bond is conjugated.
chem.bond_is_in_ring_one_hot(bond[, …]) One hot encoding for whether the bond is in a ring of any size.
chem.bond_is_in_ring(bond) Get whether the bond is in a ring of any size.
chem.bond_stereo_one_hot(bond[, …]) One hot encoding for the stereo configuration of a bond.
chem.BaseBondFeaturizer(featurizer_funcs[, …]) An abstract class for bond featurizers.
chem.BaseBondFeaturizer.feat_size(feat_name) Get the feature size for feat_name.
chem.BaseBondFeaturizer.__call__(mol) Featurize all bonds in a molecule.
chem.CanonicalBondFeaturizer([bond_data_field]) A default featurizer for bonds.

Graph Construction

Several methods for constructing DGLGraphs from SMILES/RDKit molecule objects are listed below:

chem.mol_to_graph(mol, graph_constructor, …) Convert an RDKit molecule object into a DGLGraph and featurize for it.
chem.smiles_to_bigraph(smiles[, …]) Convert a SMILES into a bi-directed DGLGraph and featurize for it.
chem.mol_to_bigraph(mol[, add_self_loop, …]) Convert an RDKit molecule object into a bi-directed DGLGraph and featurize for it.
chem.smiles_to_complete_graph(smiles[, …]) Convert a SMILES into a complete DGLGraph and featurize for it.
chem.mol_to_complete_graph(mol[, …]) Convert an RDKit molecule into a complete DGLGraph and featurize for it.
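
The difference between a bi-directed graph and a complete graph can be shown on a bare bond list. The sketch below works on integer atom indices instead of RDKit molecules; the function names and the handling of add_self_loop are assumptions for illustration and may differ from the library's behavior.

```python
def bigraph_edges(bonds, num_atoms, add_self_loop=False):
    """Each undirected bond (u, v) becomes the two directed edges u->v and v->u."""
    edges = []
    for u, v in bonds:
        edges.extend([(u, v), (v, u)])
    if add_self_loop:
        edges.extend((a, a) for a in range(num_atoms))
    return edges


def complete_graph_edges(num_atoms, add_self_loop=False):
    """Every ordered pair of distinct atoms is connected, bonded or not."""
    return [
        (u, v)
        for u in range(num_atoms)
        for v in range(num_atoms)
        if add_self_loop or u != v
    ]


# Toy molecule: three heavy atoms in a chain, with bonds 0-1 and 1-2.
bonds = [(0, 1), (1, 2)]
bi = bigraph_edges(bonds, num_atoms=3)
full = complete_graph_edges(3)
```

A bi-directed graph grows with the number of bonds, whereas a complete graph grows quadratically with the number of atoms, which matters for larger molecules.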

Dataset Classes

If your dataset is stored in a .csv file, you may find it helpful to use

Currently three datasets are supported:

  • Tox21
  • TencentAlchemyDataset
  • PubChemBioAssayAromaticity
class dgl.data.chem.Tox21(smiles_to_graph=<function smiles_to_bigraph>, atom_featurizer=None, bond_featurizer=None)[source]

Tox21 dataset.

The Toxicology in the 21st Century (https://tripod.nih.gov/tox21/challenge/) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. The dataset contains qualitative toxicity measurements for 8014 compounds on 12 different targets, including nuclear receptors and stress response pathways. Each target results in a binary label.

A common issue for multi-task prediction is that some datapoints are not labeled for all tasks. This is also the case for Tox21. In data pre-processing, we set non-existing labels to be 0 so that they can be placed in tensors and used for masking in loss computation. See examples below for more details.
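
A minimal sketch of how such masks can be used in a loss, written in plain Python rather than with tensors: missing labels hold a placeholder 0 and the mask removes them from the average. The function name masked_bce and the choice of binary cross-entropy on logits are assumptions for illustration, not the library's training code.

```python
import math


def masked_bce(logits, labels, masks):
    """Mean binary cross-entropy over labelled entries only.

    logits, labels and masks are lists of rows with one column per task.
    """
    total, count = 0.0, 0.0
    for row_logits, row_labels, row_masks in zip(logits, labels, masks):
        for z, y, m in zip(row_logits, row_labels, row_masks):
            # Numerically stable form of
            # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))].
            loss = max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            total += m * loss
            count += m
    return total / count if count else 0.0


# Two molecules, two tasks. The second task of the first molecule is unlabelled:
# its label is the placeholder 0.0 and its mask is 0.0.
labels = [[1.0, 0.0], [0.0, 0.0]]
masks = [[1.0, 0.0], [1.0, 1.0]]
logits = [[0.0, 5.0], [0.0, 0.0]]
loss = masked_bce(logits, labels, masks)
```

Changing the prediction for the unlabelled entry leaves the loss unchanged, which is the purpose of the mask.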

All molecules are converted into DGLGraphs. After the first-time construction, the DGLGraphs will be saved for reloading so that we do not need to reconstruct them every time.

Parameters:
  • smiles_to_graph (callable, str -> DGLGraph) – A function turning smiles into a DGLGraph. Default to dgl.data.chem.smiles_to_bigraph().
  • atom_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.
  • bond_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.
__getitem__(item)

Get datapoint with index

Parameters:item (int) – Datapoint index
Returns:
  • str – SMILES for the ith datapoint
  • DGLGraph – DGLGraph for the ith datapoint
  • Tensor of dtype float32 – Labels of the datapoint for all tasks
  • Tensor of dtype float32 – Binary masks indicating the existence of labels for all tasks
__len__()

Length of the dataset

Returns:Length of Dataset
Return type:int
task_pos_weights

Get weights for positive samples on each task

Returns:Numpy array giving the weight of positive samples for each task
Return type:numpy.ndarray
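
The source does not state how the positive weights are computed. One common choice, assumed here purely for illustration, is the ratio of labelled negatives to labelled positives per task, which can then be used to up-weight positive samples in an imbalanced binary loss. The function name and the fallback value of 1.0 are hypothetical.

```python
def positive_weights(labels, masks):
    """Per-task ratio (#labelled negatives) / (#labelled positives)."""
    num_tasks = len(labels[0])
    weights = []
    for t in range(num_tasks):
        num_labelled = sum(row[t] for row in masks)
        num_pos = sum(lab[t] * m[t] for lab, m in zip(labels, masks))
        # Fall back to 1.0 when a task has no positive examples.
        weights.append((num_labelled - num_pos) / num_pos if num_pos > 0 else 1.0)
    return weights


# Four molecules, two tasks; the second molecule has no label for task 1.
labels = [[1, 0], [0, 0], [0, 1], [0, 0]]
masks = [[1, 1], [1, 0], [1, 1], [1, 1]]
weights = positive_weights(labels, masks)
```
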
class dgl.data.chem.TencentAlchemyDataset(mode='dev', from_raw=False, mol_to_graph=<function mol_to_complete_graph>, atom_featurizer=<function alchemy_nodes>, bond_featurizer=<function alchemy_edges>)[source]

Developed by the Tencent Quantum Lab, the dataset lists 12 quantum mechanical properties of 130,000+ organic molecules, comprising up to 12 heavy atoms (C, N, O, S, F and Cl), sampled from the GDBMedChem database. These properties have been calculated using the open-source computational chemistry program Python-based Simulation of Chemistry Framework (PySCF).

For more details, check the paper.

Parameters:
  • mode (str) – ‘dev’, ‘valid’ or ‘test’, for training, validation and test respectively. Default to be ‘dev’. Note that ‘test’ is not available as the Alchemy contest is ongoing.
  • from_raw (bool) – Whether to process the dataset from scratch or use a processed one for faster speed. If you use different ways to featurize atoms or bonds, you should set this to be True. Default to be False.
  • mol_to_graph (callable, rdkit.Chem.rdchem.Mol -> DGLGraph) – A function turning an RDKit molecule instance into a DGLGraph. Default to dgl.data.chem.mol_to_complete_graph().
  • atom_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for atoms in a molecule, which can be used to update ndata for a DGLGraph. By default, we store the atom atomic numbers under the name "node_type" and store the atom features under the name "n_feat". The atom features include:
      • One hot encoding for atom types
      • Atomic number of atoms
      • Whether the atom is a donor
      • Whether the atom is an acceptor
      • Whether the atom is aromatic
      • One hot encoding for atom hybridization
      • Total number of Hs on the atom
  • bond_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for bonds in a molecule, which can be used to update edata for a DGLGraph. By default, we store the distance between the end atoms under the name "distance" and store the bond features under the name "e_feat". The bond features are one-hot encodings of the bond type.
__getitem__(item)[source]

Get datapoint with index

Parameters:item (int) – Datapoint index
Returns:
  • str – SMILES for the ith datapoint
  • DGLGraph – DGLGraph for the ith datapoint
  • Tensor of dtype float32 – Labels of the datapoint for all tasks
__len__()[source]

Length of the dataset

Returns:Length of Dataset
Return type:int
set_mean_and_std(mean=None, std=None)[source]

Set mean and std or compute from labels for future normalization.

Parameters:
  • mean (int or float) – Default to be None.
  • std (int or float) – Default to be None.
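
The "set or compute" behavior described above can be sketched as follows. This is a simplified stand-in with a single scalar label per datapoint (the real dataset has 12 targets); the class name, the normalize method and the use of the population standard deviation are assumptions for illustration.

```python
import statistics


class LabelNormalizer:
    """Hold labels and the statistics used to normalize them."""

    def __init__(self, labels):
        self.labels = labels
        self.mean = None
        self.std = None

    def set_mean_and_std(self, mean=None, std=None):
        # Use the supplied statistics if given, otherwise compute them from the labels.
        self.mean = statistics.fmean(self.labels) if mean is None else mean
        self.std = statistics.pstdev(self.labels) if std is None else std

    def normalize(self, value):
        return (value - self.mean) / self.std


normalizer = LabelNormalizer([1.0, 2.0, 3.0, 4.0])
normalizer.set_mean_and_std()            # computed from the labels
computed = (normalizer.mean, normalizer.std)

normalizer.set_mean_and_std(mean=0.0, std=2.0)  # e.g. reuse training-set statistics
scaled = normalizer.normalize(4.0)
```

Passing explicit values is useful when validation labels should be normalized with statistics taken from the training split.
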
class dgl.data.chem.PubChemBioAssayAromaticity(smiles_to_graph=<function smiles_to_bigraph>, atom_featurizer=None, bond_featurizer=None)[source]

Subset of PubChem BioAssay Dataset for aromaticity prediction.

The dataset was constructed in Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism and is accompanied by the task of predicting the number of aromatic atoms in molecules.

The dataset was constructed by sampling 3945 molecules with 0-40 aromatic atoms from the PubChem BioAssay dataset.

__getitem__(item)

Get datapoint with index

Parameters:item (int) – Datapoint index
Returns:
  • str – SMILES for the ith datapoint
  • DGLGraph – DGLGraph for the ith datapoint
  • Tensor of dtype float32 – Labels of the datapoint for all tasks
  • Tensor of dtype float32 – Binary masks indicating the existence of labels for all tasks
__len__()

Length of the dataset

Returns:Length of Dataset
Return type:int