# dgl.data

## Utils

• utils.get_download_dir() – Get the absolute path to the download directory.
• utils.download(url[, path, overwrite, …]) – Download a given URL.
• utils.check_sha1(filename, sha1_hash) – Check whether the sha1 hash of the file content matches the expected hash.
• utils.extract_archive(file, target_dir) – Extract archive file.
• utils.split_dataset(dataset[, frac_list, …]) – Split dataset into training, validation and test set.
• utils.save_graphs(filename, g_list[, labels]) – Save DGLGraphs and graph labels to file.
• utils.load_graphs(filename[, idx_list]) – Load DGLGraphs from file.
• utils.load_labels(filename) – Load label dict from file.

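As an illustration of what a SHA-1 check on a downloaded file involves, here is a minimal sketch using only the Python standard library. The helper name `sha1_matches` and the chunk size are assumptions made for this example; it is not DGL's implementation of `utils.check_sha1`.

```python
import hashlib
import os
import tempfile


def sha1_matches(filename, sha1_hash, chunk_size=1 << 20):
    """Return True if the SHA-1 digest of the file content equals sha1_hash."""
    digest = hashlib.sha1()
    with open(filename, "rb") as f:
        # Read in chunks so that large downloads need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == sha1_hash.lower()


# Demo on a temporary file with known content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello graph")
    tmp_path = tmp.name

expected = hashlib.sha1(b"hello graph").hexdigest()
ok = sha1_matches(tmp_path, expected)   # digest of the same bytes
bad = sha1_matches(tmp_path, "0" * 40)  # deliberately wrong digest
os.remove(tmp_path)
```

A check of this kind is typically run right after a download, so that a corrupted or truncated file is fetched again instead of being extracted.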
class dgl.data.utils.Subset(dataset, indices)[source]

Subset of a dataset at specified indices

Parameters:

• dataset – dataset[i] should return the ith datapoint.
• indices (list) – List of datapoint indices to construct the subset.

__getitem__(item)[source]

Get the datapoint indexed by item

Returns: datapoint

Return type: tuple

__len__()[source]

Get subset size

Returns: Number of datapoints in the subset

Return type: int
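The behaviour documented above can be sketched in a few lines of plain Python. `SubsetSketch` is a hypothetical stand-in written for this example, not the DGL class itself, and the toy dataset is made up.

```python
class SubsetSketch:
    """View onto `dataset` restricted to the positions listed in `indices`."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices

    def __getitem__(self, item):
        # Map the subset-local index to an index in the full dataset.
        return self.dataset[self.indices[item]]

    def __len__(self):
        return len(self.indices)


# Any object supporting dataset[i] works; a list of (sample, label) tuples here.
data = [("a", 0), ("b", 1), ("c", 0), ("d", 1)]
subset = SubsetSketch(data, [3, 1])
first = subset[0]
```

Because only indices are stored, several subsets (for example training and validation) can share one underlying dataset without copying it.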

## Dataset Classes

### Stanford sentiment treebank dataset

class dgl.data.SST(mode='train', vocab_file=None)[source]

Stanford Sentiment Treebank dataset.

Each sample is the constituency tree of a sentence. The leaf nodes represent words. Each word is an int value stored in the x feature field. Non-leaf nodes have a special value PAD_WORD in the x field. Each node also has a sentiment annotation with 5 classes (very negative, negative, neutral, positive and very positive). The sentiment label is an int value stored in the y feature field.

Note

This dataset class is compatible with pytorch’s Dataset class.

Note

All the samples will be loaded and preprocessed in the memory first.

Parameters:

• mode (str, optional) – Can be 'train', 'val', 'test' and specifies which data file to use.
• vocab_file (str, optional) – Optional vocabulary file.

__getitem__(idx)[source]

Get the tree with index idx.

Parameters: idx (int) – Tree index.

Returns: Tree.

Return type: dgl.DGLGraph

__len__()[source]

Get the number of trees in the dataset.

Returns: Number of trees.

Return type: int

### Karate Club dataset

class dgl.data.KarateClub[source]

Zachary’s karate club is a social network of a university karate club, described in the paper “An Information Flow Model for Conflict and Fission in Small Groups” by Wayne W. Zachary. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002.

This dataset has only one graph. Its ndata ‘label’ indicates whether a node belongs to the “Mr. Hi” club.

### Citation Network dataset

class dgl.data.CitationGraphDataset(name)[source]

The citation graph dataset, including citeseer and pubmed. Nodes represent papers and edges represent citation relationships.

Parameters: name (str) – name can be ‘citeseer’ or ‘pubmed’.

### Cora Citation Network dataset

class dgl.data.CoraDataset[source]

Cora citation network dataset. Nodes represent papers and edges represent citation relationships.

### CoraFull dataset

class dgl.data.CoraFull[source]

Extended Cora dataset from “Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking”. Nodes represent papers and edges represent citations.

### Amazon Co-Purchase dataset

class dgl.data.AmazonCoBuy(name)[source]

Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.

Parameters: name (str) – Name of the dataset, has to be ‘computers’ or ‘photo’

### Coauthor dataset

class dgl.data.Coauthor(name)[source]

Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Here, nodes are authors, connected by an edge if they co-authored a paper; node features represent paper keywords for each author’s papers, and class labels indicate the most active fields of study for each author.

Parameters: name (str) – Name of the dataset, has to be ‘cs’ or ‘physics’

### BitcoinOTC dataset

class dgl.data.BitcoinOTC[source]

This is a who-trusts-whom network of people who trade using Bitcoin on a platform called Bitcoin OTC. Since Bitcoin users are anonymous, there is a need to maintain a record of users’ reputation to prevent transactions with fraudulent and risky users. Members of Bitcoin OTC rate other members on a scale of -10 (total distrust) to +10 (total trust) in steps of 1.

### ICEWS18 dataset

class dgl.data.ICEWS18(mode)[source]

Integrated Crisis Early Warning System (ICEWS18) event data consists of coded interactions between socio-political actors (i.e., cooperative or hostile actions between individuals, groups, sectors and nation states). This dataset consists of events from 1/1/2018 to 10/31/2018 (24-hour time granularity).

Parameters: mode (str) – Load train/valid/test data. Has to be one of [‘train’, ‘valid’, ‘test’]

### QM7b dataset

class dgl.data.QM7b[source]

This dataset consists of 7,211 molecules with 14 regression targets. Nodes represent atoms and edges represent bonds. The edge data ‘h’ stores the corresponding entry of the Coulomb matrix.

Reference: QM7b Dataset

### GDELT dataset

class dgl.data.GDELT(mode)[source]

The Global Database of Events, Language, and Tone (GDELT) dataset. It contains events that happened all over the world (e.g., every protest held anywhere in Russia on a given day is collapsed to a single entry).

This dataset consists of events collected from 1/1/2018 to 1/31/2018 (15-minute time granularity).

Parameters: mode (str) – Load train/valid/test data. Has to be one of [‘train’, ‘valid’, ‘test’]

### Mini graph classification dataset

class dgl.data.MiniGCDataset(num_graphs, min_num_v, max_num_v)[source]

The mini graph classification dataset class.

The dataset contains 8 different types of graphs.

• class 0 : cycle graph
• class 1 : star graph
• class 2 : wheel graph
• class 3 : lollipop graph
• class 4 : hypercube graph
• class 5 : grid graph
• class 6 : clique graph
• class 7 : circular ladder graph
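To make the first few classes concrete, the following sketch builds edge lists for a cycle, a star and a wheel graph in plain Python. The helper names are hypothetical and this is not necessarily how MiniGCDataset generates its graphs internally; it only illustrates the shapes.

```python
def cycle_edges(n):
    """Edges of a cycle on nodes 0..n-1: each node links to the next one."""
    return [(i, (i + 1) % n) for i in range(n)]


def star_edges(n):
    """Edges of a star on nodes 0..n-1 with node 0 as the hub."""
    return [(0, i) for i in range(1, n)]


def wheel_edges(n):
    """Edges of a wheel: a hub (node 0) joined to a rim cycle over nodes 1..n-1."""
    rim = [(i, i % (n - 1) + 1) for i in range(1, n)]
    return star_edges(n) + rim


cycle = cycle_edges(4)
star = star_edges(4)
wheel = wheel_edges(5)
```

Each pair is one undirected edge; a graph library would usually store both directions.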

Note

This dataset class is compatible with pytorch’s Dataset class.

Parameters:

• num_graphs (int) – Number of graphs in this dataset.
• min_num_v (int) – Minimum number of nodes for graphs.
• max_num_v (int) – Maximum number of nodes for graphs.

__getitem__(idx)[source]

Get the i-th sample.

Parameters: idx (int) – The sample index.

Returns: The graph and its label.

Return type: (dgl.DGLGraph, int)

__len__()[source]

Return the number of graphs in the dataset.

num_classes

Number of classes.

### Graph kernel dataset

class dgl.data.TUDataset(name)[source]

TUDataset contains many graph kernel datasets for graph classification. Graphs may have node labels, node attributes, edge labels, and edge attributes, depending on the dataset.

Parameters: name – Dataset name, such as ENZYMES, DD, COLLAB or MUTAG. It can be any dataset name listed on https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets.

__getitem__(idx)[source]

Get the i-th sample.

Parameters: idx (int) – The sample index.

Returns: DGLGraph with node feature stored in the feat field and node label in node_label if available, and the graph label.

Return type: (dgl.DGLGraph, int)

### Graph isomorphism network dataset

A compact subset of graph kernel dataset

class dgl.data.GINDataset(name, self_loop, degree_as_nlabel=False)[source]

Datasets for Graph Isomorphism Network (GIN). Adapted from https://github.com/weihua916/powerful-gnns/blob/master/dataset.zip.

The dataset contains the compact format of popular graph kernel datasets, which includes: MUTAG, COLLAB, IMDBBINARY, IMDBMULTI, NCI1, PROTEINS, PTC, REDDITBINARY, REDDITMULTI5K.

This dataset class processes all datasets listed above. For more graph kernel datasets, see TUDataset.

Parameters:

• name (str) – Dataset name, one of (‘MUTAG’, ‘COLLAB’, ‘IMDBBINARY’, ‘IMDBMULTI’, ‘NCI1’, ‘PROTEINS’, ‘PTC’, ‘REDDITBINARY’, ‘REDDITMULTI5K’).
• self_loop (boolean) – Add a self-loop edge to each node if true.
• degree_as_nlabel (boolean) – Take node degree as label and feature if true.

__getitem__(idx)[source]

Get the i-th sample.

Parameters: idx (int) – The sample index.

Returns: The graph and its label.

Return type: (dgl.DGLGraph, int)

__len__()[source]

Return the number of graphs in the dataset.

### Protein-Protein Interaction dataset

class dgl.data.PPIDataset(mode)[source]

A toy Protein-Protein Interaction network dataset.

The dataset contains 24 graphs. The average number of nodes per graph is 2372. Each node has 50 features and 121 labels.

We use 20 graphs for training, 2 for validation and 2 for testing.

__getitem__(item)[source]

Get the i-th sample.

Parameters: idx (int) – The sample index.

Returns: The graph and its label.

Return type: (dgl.DGLGraph, ndarray)

__len__()[source]

Return number of samples in this dataset.

## Molecular Graphs

To work on molecular graphs, make sure you have installed RDKit 2018.09.3.

We adapt several utilities for processing molecules from DeepChem.

• chem.add_hydrogens_to_mol(*args, **kwargs)
• chem.get_mol_3D_coordinates(*args, **kwargs)
• chem.load_molecule(*args, **kwargs)
• chem.multiprocess_load_molecules(*args, **kwargs)

### Featurization Utils for Single Molecule

For the use of graph neural networks, we need to featurize nodes (atoms) and edges (bonds).

General utils:

• chem.one_hot_encoding(*args, **kwargs)
• chem.ConcatFeaturizer(**kwargs) – Concatenate the evaluation results of multiple functions as a single feature.
• chem.ConcatFeaturizer.__call__(x) – Featurize the input data.
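The two general ideas above, one-hot encoding a value and concatenating the outputs of several feature functions, can be sketched in plain Python. The names `one_hot` and `ConcatSketch` are hypothetical, and a dict stands in for an RDKit atom; the actual `chem` utilities operate on RDKit objects and their exact signatures are not shown in this listing.

```python
def one_hot(x, allowable_set, encode_unknown=False):
    """One-hot encode x against allowable_set.

    With encode_unknown=True an extra trailing slot marks values outside the set.
    """
    encoding = [x == s for s in allowable_set]
    if encode_unknown:
        encoding.append(x not in allowable_set)
    return encoding


class ConcatSketch:
    """Apply each function to the input and concatenate the resulting lists."""

    def __init__(self, func_list):
        self.func_list = func_list

    def __call__(self, x):
        feats = []
        for func in self.func_list:
            feats.extend(func(x))
        return feats


# Hypothetical atom record; a real featurizer would receive an RDKit atom.
atom = {"symbol": "N", "degree": 3}
featurize = ConcatSketch([
    lambda a: one_hot(a["symbol"], ["C", "N", "O"]),
    lambda a: one_hot(a["degree"], [0, 1, 2, 3, 4]),
])
features = featurize(atom)
unknown = one_hot("S", ["C", "N", "O"], encode_unknown=True)
```

Composing small functions this way keeps each feature independently testable, and the final feature size is simply the sum of the individual sizes.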

Utils for atom featurization:

• chem.atom_type_one_hot(*args, **kwargs)
• chem.atomic_number_one_hot(*args, **kwargs)
• chem.atomic_number(*args, **kwargs)
• chem.atom_degree_one_hot(*args, **kwargs)
• chem.atom_degree(*args, **kwargs)
• chem.atom_total_degree_one_hot(*args, **kwargs)
• chem.atom_total_degree(*args, **kwargs)
• chem.atom_implicit_valence_one_hot(*args, …)
• chem.atom_implicit_valence(*args, **kwargs)
• chem.atom_hybridization_one_hot(*args, **kwargs)
• chem.atom_total_num_H_one_hot(*args, **kwargs)
• chem.atom_total_num_H(*args, **kwargs)
• chem.atom_formal_charge_one_hot(*args, **kwargs)
• chem.atom_formal_charge(*args, **kwargs)
• chem.atom_num_radical_electrons_one_hot(…)
• chem.atom_num_radical_electrons(*args, **kwargs)
• chem.atom_is_aromatic_one_hot(*args, **kwargs)
• chem.atom_is_aromatic(*args, **kwargs)
• chem.atom_chiral_tag_one_hot(*args, **kwargs)
• chem.atom_mass(*args, **kwargs)
• chem.BaseAtomFeaturizer(**kwargs) – An abstract class for atom featurizers.
• chem.BaseAtomFeaturizer.feat_size(feat_name) – Get the feature size for feat_name.
• chem.BaseAtomFeaturizer.__call__(mol) – Featurize all atoms in a molecule.
• chem.CanonicalAtomFeaturizer(**kwargs) – A default featurizer for atoms.

Utils for bond featurization:

• chem.bond_type_one_hot(*args, **kwargs)
• chem.bond_is_conjugated_one_hot(*args, **kwargs)
• chem.bond_is_conjugated(*args, **kwargs)
• chem.bond_is_in_ring_one_hot(*args, **kwargs)
• chem.bond_is_in_ring(*args, **kwargs)
• chem.bond_stereo_one_hot(*args, **kwargs)
• chem.BaseBondFeaturizer(**kwargs) – An abstract class for bond featurizers.
• chem.BaseBondFeaturizer.feat_size(feat_name) – Get the feature size for feat_name.
• chem.BaseBondFeaturizer.__call__(mol) – Featurize all bonds in a molecule.
• chem.CanonicalBondFeaturizer(**kwargs) – A default featurizer for bonds.

### Graph Construction for Single Molecule

Several methods for constructing DGLGraphs from SMILES/RDKit molecule objects are listed below:

• chem.mol_to_graph(*args, **kwargs)
• chem.smiles_to_bigraph(*args, **kwargs)
• chem.mol_to_bigraph(*args, **kwargs)
• chem.smiles_to_complete_graph(*args, **kwargs)
• chem.mol_to_complete_graph(*args, **kwargs)
• chem.k_nearest_neighbors(*args, **kwargs)
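As a rough illustration of the two graph styles named in the list, the sketch below builds edge lists from a bond list. It assumes that a "bigraph" stores each bond as a pair of directed edges and that a "complete graph" connects every ordered pair of distinct atoms; the helper names are hypothetical and no RDKit or DGL calls are used.

```python
def bidirected_bond_edges(bonds):
    """Turn each undirected bond (u, v) into two directed edges u->v and v->u."""
    src, dst = [], []
    for u, v in bonds:
        src.extend([u, v])
        dst.extend([v, u])
    return src, dst


def complete_graph_edges(num_atoms, add_self_loop=False):
    """Connect every ordered pair of atoms, optionally including self loops."""
    pairs = [
        (i, j)
        for i in range(num_atoms)
        for j in range(num_atoms)
        if add_self_loop or i != j
    ]
    return [i for i, _ in pairs], [j for _, j in pairs]


# A three-atom chain 0-1-2 (made-up connectivity).
bonds = [(0, 1), (1, 2)]
bi_src, bi_dst = bidirected_bond_edges(bonds)
full_src, full_dst = complete_graph_edges(3)
```

The bond-based graph grows linearly with the number of bonds, while the complete graph grows quadratically with the number of atoms, which matters for larger molecules.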

### Graph Construction and Featurization for Ligand-Protein Complex

Utilities for constructing DGLHeteroGraphs and featurizing them.

### Dataset Classes

If your dataset is stored in a .csv file, you may find it helpful to use

Currently four datasets are supported:

• Tox21
• TencentAlchemyDataset
• PubChemBioAssayAromaticity
• PDBBind
class dgl.data.chem.Tox21(**kwargs)[source]

Tox21 dataset.

The Toxicology in the 21st Century (https://tripod.nih.gov/tox21/challenge/) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. The dataset contains qualitative toxicity measurements for 8014 compounds on 12 different targets, including nuclear receptors and stress response pathways. Each target results in a binary label.

A common issue for multi-task prediction is that some datapoints are not labeled for all tasks. This is also the case for Tox21. In data pre-processing, we set non-existing labels to be 0 so that they can be placed in tensors and used for masking in loss computation. See examples below for more details.
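The masking idea described above can be sketched without any deep learning framework. The function name `masked_bce` and the toy numbers are assumptions made for illustration; a real training loop would use tensor operations on the label and mask tensors returned by the dataset.

```python
import math


def masked_bce(logits, labels, masks):
    """Mean binary cross-entropy over entries whose mask is 1.

    logits, labels and masks are lists of rows, one row per datapoint and
    one column per task. Entries with mask 0 contribute nothing to the loss.
    """
    total, count = 0.0, 0.0
    for row_logits, row_labels, row_masks in zip(logits, labels, masks):
        for z, y, m in zip(row_logits, row_labels, row_masks):
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
            total += m * loss
            count += m
    return total / count if count else 0.0


# Two datapoints, two tasks. The second task of the first datapoint has no
# label: its label slot holds the placeholder 0 and its mask is 0.
logits = [[0.0, 2.0], [0.0, -3.0]]
labels = [[1.0, 0.0], [0.0, 0.0]]
masks = [[1.0, 0.0], [1.0, 1.0]]
loss = masked_bce(logits, labels, masks)

# Changing the prediction for the unlabeled entry leaves the loss unchanged.
loss_other = masked_bce([[0.0, 5.0], [0.0, -3.0]], labels, masks)
```

Dividing by the number of labeled entries, not by the full tensor size, keeps the loss scale comparable across batches with different amounts of missing labels.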

All molecules are converted into DGLGraphs. After the first-time construction, the DGLGraphs will be saved for reloading so that we do not need to reconstruct them every time.

Parameters:

• smiles_to_graph (callable, str -> DGLGraph) – A function turning smiles into a DGLGraph. Default to dgl.data.chem.smiles_to_bigraph().
• node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.
• edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.
• load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to True.

__getitem__(item)

Get datapoint with index

Parameters: item (int) – Datapoint index

Returns:

• str – SMILES for the ith datapoint
• DGLGraph – DGLGraph for the ith datapoint
• Tensor of dtype float32 – Labels of the datapoint for all tasks
• Tensor of dtype float32 – Binary masks indicating the existence of labels for all tasks

__len__()

Length of the dataset

Returns: Length of Dataset

Return type: int

task_pos_weights

Get weights for positive samples on each task

Returns: A numpy array giving the weight of positive samples on all tasks

Return type: numpy.ndarray

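One common way to weight positive samples in an imbalanced multi-task setting is the ratio of labeled negatives to labeled positives per task. The sketch below assumes that definition; the source does not state the exact formula behind `task_pos_weights`, and the function name and toy data are hypothetical.

```python
def positive_weights(labels, masks):
    """Per-task ratio of labeled negatives to labeled positives.

    labels and masks are lists of rows (one per datapoint, one column per
    task); masks mark which labels actually exist.
    """
    num_tasks = len(labels[0])
    weights = []
    for t in range(num_tasks):
        num_labeled = sum(row[t] for row in masks)
        num_pos = sum(l[t] * m[t] for l, m in zip(labels, masks))
        # Guard against tasks without any positive label.
        weights.append((num_labeled - num_pos) / num_pos if num_pos else 0.0)
    return weights


labels = [[1, 0], [0, 0], [0, 1], [0, 0]]
masks = [[1, 1], [1, 0], [1, 1], [1, 1]]
weights = positive_weights(labels, masks)
```

Weights of this kind are usually passed to a weighted binary cross-entropy loss so that rare positives are not drowned out by negatives.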
class dgl.data.chem.TencentAlchemyDataset(**kwargs)[source]

Developed by the Tencent Quantum Lab, the dataset lists 12 quantum mechanical properties of 130,000+ organic molecules, comprising up to 12 heavy atoms (C, N, O, S, F and Cl), sampled from the GDBMedChem database. These properties have been calculated using the open-source computational chemistry program Python-based Simulation of Chemistry Framework (PySCF).

For more details, check the paper.

Parameters:

• mode (str) – ‘dev’, ‘valid’ or ‘test’, separately for training, validation and test. Default to be ‘dev’. Note that ‘test’ is not available as the Alchemy contest is ongoing.
• mol_to_graph (callable, str -> DGLGraph) – A function turning an RDKit molecule instance into a DGLGraph. Default to dgl.data.chem.mol_to_complete_graph().
• node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. By default, we construct graphs where nodes represent atoms and node features represent atom features. We store the atomic numbers under the name "node_type" and store the atom features under the name "n_feat". The atom features include:
  • One hot encoding for atom types
  • Atomic number of atoms
  • Whether the atom is a donor
  • Whether the atom is an acceptor
  • Whether the atom is aromatic
  • One hot encoding for atom hybridization
  • Total number of Hs on the atom
• edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. By default, we construct edges between every pair of atoms, excluding the self loops. We store the distance between the end atoms under the name "distance" and store the edge features under the name "e_feat". The edge features represent one hot encoding of edge types (bond types and non-bond edges).
• load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to preprocess from scratch. Default to True.

__getitem__(item)[source]

Get datapoint with index

Parameters: item (int) – Datapoint index

Returns:

• str – SMILES for the ith datapoint
• DGLGraph – DGLGraph for the ith datapoint
• Tensor of dtype float32 – Labels of the datapoint for all tasks

__len__()[source]

Length of the dataset

Returns: Length of Dataset

Return type: int

set_mean_and_std(mean=None, std=None)[source]

Set mean and std or compute from labels for future normalization.

Parameters:

• mean (int or float) – Default to be None.
• std (int or float) – Default to be None.

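The "set or compute" behaviour described for set_mean_and_std can be sketched as follows. `LabelNormalizer` is a hypothetical class written for this example with a single regression target; the use of the population standard deviation is an assumption, since the source does not say which estimator is used.

```python
import statistics


class LabelNormalizer:
    """Hold labels plus the mean/std used to normalize them."""

    def __init__(self, labels):
        self.labels = labels
        self.mean = None
        self.std = None

    def set_mean_and_std(self, mean=None, std=None):
        # Fall back to statistics of the stored labels when a value is omitted.
        if mean is None:
            mean = statistics.fmean(self.labels)
        if std is None:
            std = statistics.pstdev(self.labels)  # population std (assumed)
        self.mean, self.std = mean, std

    def normalize(self, value):
        return (value - self.mean) / self.std


normalizer = LabelNormalizer([1.0, 2.0, 3.0, 4.0, 5.0])
normalizer.set_mean_and_std()          # computed from the labels
scaled = normalizer.normalize(5.0)

shared = LabelNormalizer([10.0, 20.0])
shared.set_mean_and_std(mean=3.0, std=2.0)  # reuse statistics from elsewhere
```

Passing explicit values is useful for validation and test sets, which are normally scaled with the statistics of the training set.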
class dgl.data.chem.PubChemBioAssayAromaticity(**kwargs)[source]

Subset of PubChem BioAssay Dataset for aromaticity prediction.

The dataset was constructed in “Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism” and is accompanied by the task of predicting the number of aromatic atoms in molecules.

The dataset was constructed by sampling 3945 molecules with 0-40 aromatic atoms from the PubChem BioAssay dataset.

Parameters:

• smiles_to_graph (callable, str -> DGLGraph) – A function turning smiles into a DGLGraph. Default to dgl.data.chem.smiles_to_bigraph().
• node_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for nodes like atoms in a molecule, which can be used to update ndata for a DGLGraph. Default to None.
• edge_featurizer (callable, rdkit.Chem.rdchem.Mol -> dict) – Featurization for edges like bonds in a molecule, which can be used to update edata for a DGLGraph. Default to None.
• load (bool) – Whether to load the previously pre-processed dataset or pre-process from scratch. load should be False when we want to try different graph construction and featurization methods and need to pre-process from scratch. Default to True.

__getitem__(item)

Get datapoint with index

Parameters: item (int) – Datapoint index

Returns:

• str – SMILES for the ith datapoint
• DGLGraph – DGLGraph for the ith datapoint
• Tensor of dtype float32 – Labels of the datapoint for all tasks
• Tensor of dtype float32 – Binary masks indicating the existence of labels for all tasks

__len__()

Length of the dataset

Returns: Length of Dataset

Return type: int

class dgl.data.chem.PDBBind(**kwargs)[source]

PDBbind dataset processed by MoleculeNet.

The description below is mainly based on [1]. The PDBBind database consists of experimentally measured binding affinities for bio-molecular complexes [2], [3]. It provides detailed 3D Cartesian coordinates of both ligands and their target proteins derived from experimental (e.g., X-ray crystallography) measurements. The availability of coordinates of the protein-ligand complexes permits structure-based featurization that is aware of the protein-ligand binding geometry. The authors of [1] use the “refined” and “core” subsets of the database [4], more carefully processed for data artifacts, as additional benchmarking targets.

References

• [1] MoleculeNet: a benchmark for molecular machine learning
• [2] The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures
• [3] The PDBbind database: methodologies and updates
• [4] PDB-wide collection of binding data: current status of the PDBbind database

Parameters:

• subset (str) – In MoleculeNet, we can use either the “refined” subset or the “core” subset. We can retrieve them by setting subset to be 'refined' or 'core'. The size of the 'core' set is 195 and the size of the 'refined' set is 3706.
• load_binding_pocket (bool) – Whether to load binding pockets or full proteins. Default to True.
• add_hydrogens (bool) – Whether to add hydrogens via pdbfixer. Default to False.
• sanitize (bool) – Whether sanitization is performed in initializing RDKit molecule instances. See https://www.rdkit.org/docs/RDKit_Book.html for details of the sanitization. Default to False.
• calc_charges (bool) – Whether to add Gasteiger charges via RDKit. Setting this to be True will enforce add_hydrogens and sanitize to be True. Default to False.
• remove_hs (bool) – Whether to remove hydrogens via RDKit. Note that removing hydrogens can be quite slow for large molecules. Default to False.
• use_conformation (bool) – Whether we need to extract molecular conformation from proteins and ligands. Default to True.
• construct_graph_and_featurize (callable) – Construct a DGLHeteroGraph for the use of GNNs. Mapping self.ligand_mols[i], self.protein_mols[i], self.ligand_coordinates[i] and self.protein_coordinates[i] to a DGLHeteroGraph. Default to ACNN_graph_construction_and_featurization().
• zero_padding (bool) – Whether to perform zero padding. While DGL does not necessarily require zero padding, pooling operations for variable length inputs can introduce stochastic behaviour, which is not desired for sensitive scenarios. Default to True.
• num_processes (int or None) – Number of worker processes to use. If None, then we will use the number of CPUs in the system. Default to 64.

__getitem__(item)[source]

Get the datapoint associated with the index.

Parameters: item (int) – Index for the datapoint.

Returns:

• int – Index for the datapoint.
• rdkit.Chem.rdchem.Mol – RDKit molecule instance for the ligand molecule.
• rdkit.Chem.rdchem.Mol – RDKit molecule instance for the protein molecule.
• DGLHeteroGraph – Pre-processed DGLHeteroGraph with features extracted.
• Float32 tensor – Label for the datapoint.

__len__()[source]

Get the size of the dataset.

Returns: Number of valid ligand-protein pairs in the dataset.

Return type: int

### Dataset Splitting

We provide support for some common data splitting methods:

• consecutive split
• random split
• molecular weight split
• Bemis-Murcko scaffold split
class dgl.data.chem.ConsecutiveSplitter[source]

Split datasets with the input order.

The dataset is split without permutation, so the splitting is deterministic.

class dgl.data.chem.RandomSplitter[source]

Randomly reorder datasets and then split them.

The dataset is split with permutation and the splitting is hence random.

class dgl.data.chem.MolecularWeightSplitter[source]

Sort molecules based on their weights and then split them.
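The three splitters documented so far (consecutive, random and molecular weight) can be viewed as choosing an ordering of the datapoints and then cutting that ordering into three contiguous pieces. The following is a minimal sketch under that assumption. All helper names are hypothetical, the weights are made-up numbers, not values computed from real molecules, and the actual interfaces of the DGL classes are not reproduced here.

```python
import random


def split_by_order(order, frac_train=0.8, frac_val=0.1):
    """Cut an ordering of indices into train/val/test pieces."""
    n = len(order)
    train_end = int(frac_train * n)
    val_end = int((frac_train + frac_val) * n)
    return order[:train_end], order[train_end:val_end], order[val_end:]


def consecutive_order(n):
    """Keep the input order: deterministic, no permutation."""
    return list(range(n))


def random_order(n, seed=0):
    """Shuffle the indices; the seed makes the split reproducible."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


def weight_order(weights):
    """Order indices from the lightest to the heaviest molecule."""
    return sorted(range(len(weights)), key=lambda i: weights[i])


# Made-up molecular weights for ten molecules.
weights = [180.2, 46.1, 78.1, 16.0, 342.3, 58.1, 122.1, 32.0, 94.1, 60.1]

cons_train, cons_val, cons_test = split_by_order(consecutive_order(10))
rand_train, rand_val, rand_test = split_by_order(random_order(10, seed=0))
mw_train, mw_val, mw_test = split_by_order(weight_order(weights))
```

With a weight-based ordering the heaviest molecules end up in the test piece, which makes the split a harder generalization test than a random one.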

class dgl.data.chem.ScaffoldSplitter[source]

Group molecules based on their Bemis-Murcko scaffolds and then split the groups.

Group molecules so that all molecules in a group have the same scaffold (see reference). The dataset is then split at the level of groups.

References

Bemis, G. W.; Murcko, M. A. “The Properties of Known Drugs. 1. Molecular Frameworks.” J. Med. Chem. 39:2887-93 (1996).

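Splitting at the level of scaffold groups can be sketched as below. The scaffold keys are hypothetical string labels supplied by hand (in practice they would be computed from the molecules, for example with RDKit), and the policy of assigning the largest groups first is an assumption of this sketch, not a statement about the DGL implementation.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_val=0.1):
    """Assign whole scaffold groups to train, val or test.

    scaffolds[i] is the scaffold key of molecule i. Groups are never
    divided, so molecules sharing a scaffold stay in the same subset.
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = frac_train * n
    val_cap = (frac_train + frac_val) * n
    train, val, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(val) + len(group) <= val_cap:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test


# Hand-written scaffold labels for ten molecules.
scaffolds = ["benzene"] * 5 + ["pyridine"] * 3 + ["indole", "furan"]
train, val, test = scaffold_split(scaffolds)
```

Because groups are kept intact, the realized subset sizes only approximate the requested fractions when group sizes do not divide evenly.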
class dgl.data.chem.SingleTaskStratifiedSplitter[source]

Splits the dataset by stratification on a single task.

We sort the molecules based on their label values for a task and then repeatedly take buckets of datapoints to augment the training, validation and test subsets.
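The bucket-based procedure described above can be sketched in plain Python. The bucket size of 10 and the function name are assumptions made for this example; the source does not specify them.

```python
def stratified_split(values, frac_train=0.8, frac_val=0.1, bucket_size=10):
    """Sort by label value, then deal each bucket out to train/val/test.

    Every bucket of consecutive sorted datapoints contributes to all three
    subsets, so each subset covers the full range of label values.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    train_cut = int(frac_train * bucket_size)
    val_cut = int((frac_train + frac_val) * bucket_size)

    train, val, test = [], [], []
    for start in range(0, len(order), bucket_size):
        bucket = order[start:start + bucket_size]
        train.extend(bucket[:train_cut])
        val.extend(bucket[train_cut:val_cut])
        test.extend(bucket[val_cut:])
    return train, val, test


# Twenty made-up label values in decreasing order: index 19 has the smallest.
values = [float(19 - i) for i in range(20)]
train, val, test = stratified_split(values)
```

Compared with cutting the sorted list once, taking a share from every bucket avoids putting only the largest label values into the test subset.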