Dataset

Utils

utils.get_download_dir() Get the absolute path to the download directory.
utils.download(url[, path, overwrite, …]) Download a given URL.
utils.check_sha1(filename, sha1_hash) Check whether the sha1 hash of the file content matches the expected hash.
utils.extract_archive(file, target_dir) Extract archive file.
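
The kind of check that utils.check_sha1 describes can be sketched with the standard library alone. This is an illustration of the idea, not DGL's implementation; the function name sha1_matches and the chunk_size parameter are hypothetical.

```python
import hashlib


def sha1_matches(filename, sha1_hash, chunk_size=1 << 20):
    """Return True if the SHA-1 of the file content equals sha1_hash.

    Hypothetical stand-in for the comparison utils.check_sha1 describes.
    """
    digest = hashlib.sha1()
    with open(filename, "rb") as f:
        # Read in chunks so a large download need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    # Hex digests are compared case-insensitively.
    return digest.hexdigest() == sha1_hash.lower()
```

A check like this is typically run after a download to decide whether a cached file can be reused or has to be fetched again.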

Dataset Classes

Stanford sentiment treebank dataset

For more information about the dataset, see Sentiment Analysis.

class dgl.data.SST(mode='train', vocab_file=None)[source]

Stanford Sentiment Treebank dataset.

Each sample is the constituency tree of a sentence. Leaf nodes represent words; each word is an int value stored in the x feature field. Non-leaf nodes carry the special value PAD_WORD in the x field. Every node also has a sentiment annotation drawn from 5 classes (very negative, negative, neutral, positive, and very positive), stored as an int value in the y feature field.

Note

This dataset class is compatible with PyTorch’s Dataset class.

Note

All samples are loaded into memory and preprocessed up front.

Parameters:
  • mode (str, optional) – Which data file to use: 'train', 'val', or 'test'. Default: 'train'.
  • vocab_file (str, optional) – Optional vocabulary file.
__getitem__(idx)[source]

Get the tree with index idx.

Parameters:idx (int) – Tree index.
Returns:Tree.
Return type:dgl.DGLGraph
__len__()[source]

Get the number of trees in the dataset.

Returns:Number of trees.
Return type:int
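
The x and y fields described above can be pictured with a small plain-Python tree. This is only a sketch of the layout, not how dgl.data.SST stores its graphs: the value chosen for PAD_WORD, the vocabulary, the sentiment values, and the mapping from label integers to class names are all assumptions made for illustration.

```python
PAD_WORD = -1  # assumed placeholder value; the text above only names the constant

# Hypothetical vocabulary and tree for the phrase "not bad".
VOCAB = {"not": 0, "bad": 1}

# Assumed ordering of the 5 sentiment classes (index 0 = very negative).
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]

# Node 0 is the root (non-leaf); nodes 1 and 2 are leaves.
nodes = [
    {"x": PAD_WORD, "y": 3},      # whole phrase: positive
    {"x": VOCAB["not"], "y": 2},  # "not": neutral
    {"x": VOCAB["bad"], "y": 1},  # "bad": negative
]
children = {0: [1, 2]}


def leaf_ids(tree_nodes):
    """Leaves are exactly the nodes whose x field holds a real word id."""
    return [i for i, node in enumerate(tree_nodes) if node["x"] != PAD_WORD]


for i, node in enumerate(nodes):
    kind = "leaf" if i in leaf_ids(nodes) else "non-leaf"
    print(i, kind, LABELS[node["y"]])
```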

Mini graph classification dataset

class dgl.data.MiniGCDataset(num_graphs, min_num_v, max_num_v)[source]

Mini graph classification dataset.

The dataset contains 8 different types of graphs:

  • class 0 : cycle graph
  • class 1 : star graph
  • class 2 : wheel graph
  • class 3 : lollipop graph
  • class 4 : hypercube graph
  • class 5 : grid graph
  • class 6 : clique graph
  • class 7 : circular ladder graph

Note

This dataset class is compatible with PyTorch’s Dataset class.

Parameters:
  • num_graphs (int) – Number of graphs in this dataset.
  • min_num_v (int) – Minimum number of nodes per graph.
  • max_num_v (int) – Maximum number of nodes per graph.
__getitem__(idx)[source]

Get the i-th sample.

Parameters:idx (int) – The sample index.
Returns:The graph and its label.
Return type:(dgl.DGLGraph, int)
__len__()[source]

Return the number of graphs in the dataset.

num_classes

Number of classes.
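
Because each sample is a (graph, label) pair and the class supports __len__ and __getitem__, simple statistics can be gathered with a plain loop. The sketch below counts samples per class. The helper class_histogram and the stand-in FakeGraphDataset are hypothetical, and the stand-in is used so that the example runs without DGL installed.

```python
from collections import Counter

# Class names in the order listed above (class 0 to class 7).
CLASS_NAMES = [
    "cycle", "star", "wheel", "lollipop",
    "hypercube", "grid", "clique", "circular ladder",
]


def class_histogram(dataset):
    """Count samples per class for any dataset yielding (graph, label) pairs."""
    counts = Counter()
    for idx in range(len(dataset)):
        _graph, label = dataset[idx]
        counts[CLASS_NAMES[int(label)]] += 1
    return dict(counts)


class FakeGraphDataset:
    """Stand-in that follows the same protocol; graphs are just None."""

    def __init__(self, labels):
        self._labels = list(labels)

    def __len__(self):
        return len(self._labels)

    def __getitem__(self, idx):
        return None, self._labels[idx]


print(class_histogram(FakeGraphDataset([0, 0, 1, 7])))
```

With DGL installed, the same helper should in principle accept an instance created as dgl.data.MiniGCDataset(num_graphs, min_num_v, max_num_v), since it only relies on the documented (graph, label) return value.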

Graph kernel dataset

For more information about the dataset, see Benchmark Data Sets for Graph Kernels.

class dgl.data.TUDataset(name, use_pandas=False, hidden_size=10, max_allow_node=None)[source]

TUDataset contains many graph kernel datasets for graph classification. Node features provided with a dataset are used by default. If no features are provided, one-hot encoded node labels are used instead. If node labels are also missing, a constant is used as the node feature.

Parameters:
  • name – Dataset name, such as ENZYMES, DD, or COLLAB.
  • use_pandas – Default: False. NumPy’s file reading function is slow on large files; using pandas can be faster.
  • hidden_size – Default: 10. Some datasets do not contain features; constant node features of size hidden_size are used instead.
__getitem__(idx)[source]

Get the i-th sample.

Parameters:idx (int) – The sample index.
Returns:A DGLGraph with node features stored in the feat field and node labels in node_label if available, together with the label of the graph.
Return type:(dgl.DGLGraph, int)
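
The fallback order for node features (provided features, then one-hot node labels, then a constant) can be sketched in plain Python. This is an illustration of the rule stated above, not the TUDataset code: the function name choose_node_features is hypothetical, node labels are assumed to be non-negative integers starting at 0, and the constant value 1.0 is an assumption.

```python
def choose_node_features(num_nodes, features=None, node_labels=None, hidden_size=10):
    """Pick node features following the documented fallback order."""
    if features is not None:
        # 1. Use the features shipped with the dataset.
        return [list(row) for row in features]
    if node_labels is not None:
        # 2. Otherwise one-hot encode the node labels.
        num_classes = max(node_labels) + 1
        return [
            [1.0 if c == label else 0.0 for c in range(num_classes)]
            for label in node_labels
        ]
    # 3. Otherwise use a constant vector of width hidden_size per node.
    return [[1.0] * hidden_size for _ in range(num_nodes)]


print(choose_node_features(3, node_labels=[0, 2, 1]))
print(choose_node_features(2, hidden_size=4))
```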

Graph isomorphism network dataset

A compact subset of the graph kernel datasets.

class dgl.data.GINDataset(name, self_loop, degree_as_nlabel=False)[source]

Datasets for Graph Isomorphism Network (GIN). Adapted from https://github.com/weihua916/powerful-gnns/blob/master/dataset.zip.

The dataset contains the compact format of popular graph kernel datasets, including MUTAG, COLLAB, IMDBBINARY, IMDBMULTI, NCI1, PROTEINS, PTC, REDDITBINARY, and REDDITMULTI5K.

This dataset class processes all of the datasets listed above. For more graph kernel datasets, see TUDataset.

Parameters:
  • name (str) – Dataset name, one of 'MUTAG', 'COLLAB', 'IMDBBINARY', 'IMDBMULTI', 'NCI1', 'PROTEINS', 'PTC', 'REDDITBINARY', 'REDDITMULTI5K'.
  • self_loop (bool) – If true, add a self-loop edge to every node.
  • degree_as_nlabel (bool) – If true, use the node degree as both node label and feature.
__getitem__(idx)[source]

Get the i-th sample.

Parameters:idx (int) – The sample index.
Returns:The graph and its label.
Return type:(dgl.DGLGraph, int)
__len__()[source]

Return the number of graphs in the dataset.
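
What the self_loop and degree_as_nlabel options mean can be sketched on a plain edge list. The two helpers below are hypothetical and are not the GINDataset code. They assume directed (src, dst) pairs with each undirected edge stored in both directions, and they use in-degree as the node degree. Whether the library counts self-loops toward the degree is not stated above, so the helpers are kept independent.

```python
from collections import Counter


def add_self_loops(edges, num_nodes):
    """Return the edges plus one (v, v) edge per node that lacks one."""
    existing = set(edges)
    loops = [(v, v) for v in range(num_nodes) if (v, v) not in existing]
    return list(edges) + loops


def in_degree_labels(edges, num_nodes):
    """Use the in-degree of each node as its integer label."""
    counts = Counter(dst for _src, dst in edges)
    return [counts[v] for v in range(num_nodes)]


# Path graph 0 - 1 - 2, each undirected edge stored in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
print(in_degree_labels(edges, 3))
print(add_self_loops(edges, 3))
```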

Protein-Protein Interaction dataset

class dgl.data.PPIDataset(mode)[source]

A toy Protein-Protein Interaction network dataset.

Adapted from https://github.com/williamleif/GraphSAGE/tree/master/example_data.

The dataset contains 24 graphs. The average number of nodes per graph is 2372. Each node has 50 features and 121 labels.

We use 20 graphs for training, 2 for validation and 2 for testing.

__getitem__(item)[source]

Get the i-th sample.

Parameters:item (int) – The sample index.
Returns:The graph, its node features, and its node labels.
Return type:(dgl.DGLGraph, ndarray, ndarray)
__len__()[source]

Return the number of samples in this dataset.
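
Since every sample is a (graph, features, labels) triple, per-dataset statistics such as the average number of nodes per graph can be computed from the feature arrays alone. The helper average_num_nodes below is hypothetical, and the samples are small stand-in lists rather than real PPI data. It assumes the features hold one row per node, which matches the 50 features per node described above.

```python
def average_num_nodes(dataset):
    """Average node count over samples shaped (graph, features, labels)."""
    total = 0
    for idx in range(len(dataset)):
        _graph, features, _labels = dataset[idx]
        total += len(features)  # one feature row per node
    return total / len(dataset)


# Stand-in samples: 3 and 5 nodes, 50 features and 121 labels per node.
samples = [
    (None, [[0.0] * 50 for _ in range(3)], [[0] * 121 for _ in range(3)]),
    (None, [[0.0] * 50 for _ in range(5)], [[0] * 121 for _ in range(5)]),
]
print(average_num_nodes(samples))
```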