TUDataset

class dgl.data.TUDataset(name, raw_dir=None, force_reload=False, verbose=False, transform=None)[source]

Bases: dgl.data.dgl_dataset.DGLBuiltinDataset

TUDataset contains lots of graph kernel datasets for graph classification.

Parameters
max_num_node

Maximum number of nodes

Type

int

num_labels

Number of classes

Type

int

Notes

IMPORTANT: Some of the datasets have duplicate edges exist in the graphs, e.g. the edges in IMDB-BINARY are all duplicated. DGL faithfully keeps the duplicates as per the original data. Other frameworks such as PyTorch Geometric removes the duplicates by default. You can remove the duplicate edges with dgl.to_simple().

Graphs may have node labels, node attributes, edge labels, and edge attributes, varing from different dataset.

Labels are mapped to \(\lbrace 0,\cdots,n-1 \rbrace\) where \(n\) is the number of labels (some datasets have raw labels \(\lbrace -1, 1 \rbrace\) which will be mapped to \(\lbrace 0, 1 \rbrace\)). In previous versions, the minimum label was added so that \(\lbrace -1, 1 \rbrace\) was mapped to \(\lbrace 0, 2 \rbrace\).

The dataset sorts graphs by their labels. Shuffle is preferred before manual train/val split.

Examples

>>> data = TUDataset('DD')

The dataset instance is an iterable

>>> len(data)
188
>>> g, label = data[1024]
>>> g
Graph(num_nodes=88, num_edges=410,
      ndata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), 'node_labels': Scheme(shape=(1,), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})
>>> label
tensor([1])

Batch the graphs and labels for mini-batch training

>>> graphs, labels = zip(*[data[i] for i in range(16)])
>>> batched_graphs = dgl.batch(graphs)
>>> batched_labels = torch.tensor(labels)
>>> batched_graphs
Graph(num_nodes=9539, num_edges=47382,
      ndata_schemes={'node_labels': Scheme(shape=(1,), dtype=torch.int64), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})
__getitem__(idx)[source]

Get the idx-th sample.

Parameters

idx (int) – The sample index.

Returns

Graph with node feature stored in feat field and node label in node_label if available. And its label.

Return type

(dgl.DGLGraph, Tensor)

__len__()[source]

Return the number of graphs in the dataset.