Tutorial: Batched graph classification with DGL

Author: Mufei Li, Minjie Wang, Zheng Zhang.

In this tutorial, you learn how to use DGL to batch multiple graphs of variable size and shape. The tutorial also demonstrates training a graph neural network for a simple graph classification task.

Graph classification is an important problem with applications across many fields, such as bioinformatics, chemoinformatics, social network analysis, urban computing, and cybersecurity. Applying graph neural networks to this problem has recently become a popular approach, as seen in the following research references (Ying et al., 2018, Cangea et al., 2018, Knyazev et al., 2018, Bianchi et al., 2019, Liao et al., 2019, Gao et al., 2019).

Simple graph classification task

In this tutorial, you learn how to perform batched graph classification with DGL. The example task objective is to classify eight types of topologies shown here.


DGL provides a synthetic dataset, data.MiniGCDataset. The dataset has eight different types of graphs, and each class has the same number of graph samples.

from dgl.data import MiniGCDataset
import matplotlib.pyplot as plt
import networkx as nx
# A dataset with 80 samples, each graph has
# between 10 and 20 nodes.
dataset = MiniGCDataset(80, 10, 20)
graph, label = dataset[0]
fig, ax = plt.subplots()
nx.draw(graph.to_networkx(), ax=ax)
ax.set_title('Class: {:d}'.format(label))

Form a graph mini-batch

To train neural networks efficiently, a common practice is to batch multiple samples together to form a mini-batch. Batching fixed-shaped tensor inputs is common. For example, batching two images of size 28 x 28 gives a tensor of shape 2 x 28 x 28. By contrast, batching graph inputs has two challenges:

  • Graphs are sparse.
  • Graphs can vary in size, that is, in their number of nodes and edges.

To address this, DGL provides a dgl.batch() API. It leverages the idea that a batch of graphs can be viewed as a large graph that has many disjoint connected components. Below is a visualization that gives the general idea.


Define the following collate function to form a mini-batch from a given list of graph and label pairs.

import dgl
import torch

def collate(samples):
    # The input `samples` is a list of pairs
    #  (graph, label).
    graphs, labels = map(list, zip(*samples))
    batched_graph = dgl.batch(graphs)
    return batched_graph, torch.tensor(labels)

The return type of dgl.batch() is still a graph. In the same way, a batch of tensors is still a tensor. This means that any code that works for one graph immediately works for a batch of graphs. More importantly, because DGL processes messages on all nodes and edges in parallel, this greatly improves efficiency.
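To make the "one large graph with disjoint components" idea concrete, here is a minimal plain-Python sketch. It is not how DGL implements dgl.batch(); the (num_nodes, edge_list) representation and the helper name batch_graphs are assumptions made only for illustration. The key step is shifting node IDs so that nodes from different graphs never collide.

```python
# Plain-Python sketch of batching as a disjoint union (not DGL's implementation).
# Each graph is a hypothetical (num_nodes, edge_list) pair.

def batch_graphs(graphs):
    """Merge graphs into one big graph by shifting node IDs."""
    edges = []
    nodes_per_graph = []
    offset = 0
    for num_nodes, edge_list in graphs:
        # Relabel nodes so that IDs from different graphs never collide.
        edges.extend((u + offset, v + offset) for u, v in edge_list)
        nodes_per_graph.append(num_nodes)
        offset += num_nodes
    return offset, edges, nodes_per_graph

triangle = (3, [(0, 1), (1, 2), (2, 0)])
path = (2, [(0, 1)])
total_nodes, batched_edges, nodes_per_graph = batch_graphs([triangle, path])
print(total_nodes)      # 5
print(batched_edges)    # [(0, 1), (1, 2), (2, 0), (3, 4)]
print(nodes_per_graph)  # [3, 2]
```

Because no edge connects the two components, message passing on the merged graph never mixes information between the original graphs. Keeping the per-graph node counts makes it possible to separate the graphs again later, for example during readout.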

Graph classifier

Graph classification proceeds as follows.


From a batch of graphs, perform message passing and graph convolution for nodes to communicate with others. After message passing, compute a tensor for graph representation from node (and edge) attributes. This step might be called readout or aggregation. Finally, the graph representations are fed into a classifier \(g\) to predict the graph labels.

Graph convolution

The graph convolution operation is basically the same as that for a graph convolutional network (GCN). To learn more, see the GCN tutorial. The only difference is that we replace \(h_{v}^{(l+1)} = \text{ReLU}\left(b^{(l)}+\sum_{u\in\mathcal{N}(v)}h_{u}^{(l)}W^{(l)}\right)\) by \(h_{v}^{(l+1)} = \text{ReLU}\left(b^{(l)}+\frac{1}{|\mathcal{N}(v)|}\sum_{u\in\mathcal{N}(v)}h_{u}^{(l)}W^{(l)}\right)\).

The replacement of summation by average is to balance nodes with different degrees. This gives a better performance for this experiment.

The self edges added in the dataset initialization allow you to include the original node feature \(h_{v}^{(l)}\) when taking the average.
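As a worked example of the averaged update rule, the following plain-Python sketch applies one layer to a three-node path graph with scalar features. The weight w, the bias b, the feature values, and the helper name mean_conv are illustrative assumptions, not values from the tutorial.

```python
# Plain-Python sketch of one mean-aggregation layer with scalar features.
# The weight w, bias b, and toy graph are illustrative values only.

def relu(x):
    return max(0.0, x)

def mean_conv(in_neighbors, h, w, b):
    """Compute h_v' = ReLU(b + mean over in-neighbors u of h_u * w)."""
    new_h = []
    for v in range(len(h)):
        messages = [h[u] * w for u in in_neighbors[v]]
        new_h.append(relu(b + sum(messages) / len(messages)))
    return new_h

# Path graph 0 - 1 - 2 with a self edge on every node, so each node's
# own feature is part of its average.
in_neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
h = [1.0, 2.0, 3.0]
new_h = mean_conv(in_neighbors, h, w=2.0, b=-3.5)
print(new_h)  # [0.0, 0.5, 1.5]
```

Node 1 has three incoming edges while nodes 0 and 2 have two, yet every result stays on the same scale because each sum is divided by the neighborhood size.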

import dgl.function as fn
import torch
import torch.nn as nn

# Sends a message of node feature h.
msg = fn.copy_src(src='h', out='m')

def reduce(nodes):
    """Take an average over all neighbor node features hu and use it to
    overwrite the original node feature."""
    accum = torch.mean(nodes.mailbox['m'], 1)
    return {'h': accum}

class NodeApplyModule(nn.Module):
    """Update the node feature hv with ReLU(Whv+b)."""
    def __init__(self, in_feats, out_feats, activation):
        super(NodeApplyModule, self).__init__()
        self.linear = nn.Linear(in_feats, out_feats)
        self.activation = activation

    def forward(self, node):
        h = self.linear(node.data['h'])
        h = self.activation(h)
        return {'h' : h}

class GCN(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(GCN, self).__init__()
        self.apply_mod = NodeApplyModule(in_feats, out_feats, activation)

    def forward(self, g, feature):
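        # Initialize the node features with h.
        g.ndata['h'] = feature
        # Average the neighbor features, then apply the linear layer
        # and activation on every node.
        g.update_all(msg, reduce)
        g.apply_nodes(func=self.apply_mod)
        return g.ndata.pop('h')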

Readout and classification

For this demonstration, use the node degrees as the initial node features. After two rounds of graph convolution, perform a graph readout by averaging over all node features for each graph in the batch.


In DGL, dgl.mean_nodes() handles this task for a batch of graphs with variable size. You then feed the graph representations into a classifier with one linear layer to obtain pre-softmax logits.
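The per-graph averaging can be sketched in plain Python as follows. This is only an illustration of the idea, not DGL's implementation of dgl.mean_nodes(); the helper name mean_readout and the nodes_per_graph argument are assumptions introduced here.

```python
# Plain-Python sketch of a per-graph mean readout (not DGL's implementation).
# `nodes_per_graph` is a hypothetical list of node counts, one per graph.

def mean_readout(node_feats, nodes_per_graph):
    """Average node feature vectors separately for each graph in the batch."""
    readouts = []
    start = 0
    for n in nodes_per_graph:
        segment = node_feats[start:start + n]
        dim = len(segment[0])
        readouts.append([sum(row[d] for row in segment) / n for d in range(dim)])
        start += n
    return readouts

# Five nodes in total: the first three belong to graph 0, the last two to graph 1.
node_feats = [[1.0, 0.0], [2.0, 0.0], [3.0, 6.0], [4.0, 2.0], [6.0, 4.0]]
graph_feats = mean_readout(node_feats, [3, 2])
print(graph_feats)  # [[2.0, 2.0], [5.0, 3.0]]
```

Each graph ends up with one fixed-size vector regardless of how many nodes it has, which is what lets a single linear classifier handle graphs of different sizes.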

import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super(Classifier, self).__init__()

        self.layers = nn.ModuleList([
            GCN(in_dim, hidden_dim, F.relu),
            GCN(hidden_dim, hidden_dim, F.relu)])
        self.classify = nn.Linear(hidden_dim, n_classes)

    def forward(self, g):
        # For undirected graphs, in_degree is the same as
        # out_degree.
        h = g.in_degrees().view(-1, 1).float()
        for conv in self.layers:
            h = conv(g, h)
        g.ndata['h'] = h
        hg = dgl.mean_nodes(g, 'h')
        return self.classify(hg)

Setup and training

Create a synthetic dataset of \(400\) graphs with \(10\) ~ \(20\) nodes. \(320\) graphs constitute a training set and \(80\) graphs constitute a test set.

import torch.optim as optim
from torch.utils.data import DataLoader

# Create training and test sets.
trainset = MiniGCDataset(320, 10, 20)
testset = MiniGCDataset(80, 10, 20)
# Use PyTorch's DataLoader and the collate function
# defined before.
data_loader = DataLoader(trainset, batch_size=32, shuffle=True,
                         collate_fn=collate)

# Create model
model = Classifier(1, 256, trainset.num_classes)
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

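epoch_losses = []
for epoch in range(80):
    epoch_loss = 0
    for iter, (bg, label) in enumerate(data_loader):
        prediction = model(bg)
        loss = loss_func(prediction, label)
        # Backpropagate and update the model parameters.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach().item()
    # Average the loss over the mini-batches of this epoch.
    epoch_loss /= (iter + 1)
    print('Epoch {}, loss {:.4f}'.format(epoch, epoch_loss))
    epoch_losses.append(epoch_loss)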


Epoch 0, loss 2.1400
Epoch 1, loss 1.9978
Epoch 2, loss 1.9376
Epoch 3, loss 1.8663
Epoch 4, loss 1.8007
Epoch 5, loss 1.7170
Epoch 6, loss 1.6633
Epoch 7, loss 1.5871
Epoch 8, loss 1.5217
Epoch 9, loss 1.4495
Epoch 10, loss 1.3770
Epoch 11, loss 1.3064
Epoch 12, loss 1.2611
Epoch 13, loss 1.2100
Epoch 14, loss 1.1865
Epoch 15, loss 1.1500
Epoch 16, loss 1.0974
Epoch 17, loss 1.0685
Epoch 18, loss 1.0463
Epoch 19, loss 1.0457
Epoch 20, loss 1.0216
Epoch 21, loss 0.9991
Epoch 22, loss 0.9706
Epoch 23, loss 0.9454
Epoch 24, loss 0.9488
Epoch 25, loss 0.9117
Epoch 26, loss 0.9011
Epoch 27, loss 0.9116
Epoch 28, loss 0.8760
Epoch 29, loss 0.8418
Epoch 30, loss 0.8409
Epoch 31, loss 0.8312
Epoch 32, loss 0.8419
Epoch 33, loss 0.8297
Epoch 34, loss 0.8080
Epoch 35, loss 0.8232
Epoch 36, loss 0.8128
Epoch 37, loss 0.7758
Epoch 38, loss 0.7720
Epoch 39, loss 0.7582
Epoch 40, loss 0.7721
Epoch 41, loss 0.7777
Epoch 42, loss 0.8143
Epoch 43, loss 0.8123
Epoch 44, loss 0.7534
Epoch 45, loss 0.7591
Epoch 46, loss 0.7166
Epoch 47, loss 0.7093
Epoch 48, loss 0.6926
Epoch 49, loss 0.7048
Epoch 50, loss 0.7177
Epoch 51, loss 0.6863
Epoch 52, loss 0.6933
Epoch 53, loss 0.6930
Epoch 54, loss 0.6766
Epoch 55, loss 0.6707
Epoch 56, loss 0.6647
Epoch 57, loss 0.6567
Epoch 58, loss 0.6471
Epoch 59, loss 0.6493
Epoch 60, loss 0.6420
Epoch 61, loss 0.6642
Epoch 62, loss 0.6403
Epoch 63, loss 0.6454
Epoch 64, loss 0.6384
Epoch 65, loss 0.6312
Epoch 66, loss 0.6322
Epoch 67, loss 0.6260
Epoch 68, loss 0.6056
Epoch 69, loss 0.6063
Epoch 70, loss 0.6269
Epoch 71, loss 0.5989
Epoch 72, loss 0.6101
Epoch 73, loss 0.6024
Epoch 74, loss 0.6081
Epoch 75, loss 0.5930
Epoch 76, loss 0.6015
Epoch 77, loss 0.5853
Epoch 78, loss 0.5741
Epoch 79, loss 0.5791

The learning curve of a run is presented below.

plt.title('cross entropy averaged over minibatches')
plt.plot(epoch_losses)
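As a rough sanity check on the logged losses, the short calculation below computes the cross entropy of a classifier that assigns equal probability to all eight classes. This is a general property of cross entropy rather than something stated in the tutorial.

```python
import math

# Cross entropy of a uniform prediction over the eight graph classes.
n_classes = 8
uniform_loss = -math.log(1.0 / n_classes)
print('{:.4f}'.format(uniform_loss))  # 2.0794
```

The loss of 2.1400 logged for epoch 0 is close to this value, which is consistent with a model that starts out with little preference among the classes.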

The trained model is evaluated on the test set created. Because the running time is restricted for deploying the tutorial, you are likely to get a higher accuracy (\(80\) % ~ \(90\) %) than the ones printed below if you train for longer.

# Convert a list of tuples to two lists
test_X, test_Y = map(list, zip(*testset))
test_bg = dgl.batch(test_X)
test_Y = torch.tensor(test_Y).float().view(-1, 1)
probs_Y = torch.softmax(model(test_bg), 1)
sampled_Y = torch.multinomial(probs_Y, 1)
argmax_Y = torch.max(probs_Y, 1)[1].view(-1, 1)
print('Accuracy of sampled predictions on the test set: {:.4f}%'.format(
    (test_Y == sampled_Y.float()).sum().item() / len(test_Y) * 100))
print('Accuracy of argmax predictions on the test set: {:.4f}%'.format(
    (test_Y == argmax_Y.float()).sum().item() / len(test_Y) * 100))


Accuracy of sampled predictions on the test set: 72.5000%
Accuracy of argmax predictions on the test set: 71.2500%
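The two accuracy figures differ in how a class is chosen from the predicted probabilities. The plain-Python sketch below uses made-up logits for four graphs and three classes to illustrate the difference; it mirrors the roles of torch.softmax, torch.max, and torch.multinomial in the evaluation code but does not reproduce the tutorial's numbers.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def accuracy(preds, labels):
    """Percentage of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels) * 100

# Made-up logits for four graphs and three classes.
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 3.0], [1.0, 1.2, 0.9]]
labels = [0, 1, 2, 0]
probs = [softmax(row) for row in logits]

# Argmax prediction: always pick the most probable class.
argmax_preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
# Sampled prediction: draw one class per graph according to its probabilities.
rng = random.Random(0)
sampled_preds = [rng.choices(range(len(p)), weights=p)[0] for p in probs]

print('argmax accuracy: {:.4f}%'.format(accuracy(argmax_preds, labels)))  # 75.0000%
print('sampled accuracy: {:.4f}%'.format(accuracy(sampled_preds, labels)))
```

Argmax predictions are deterministic for a given model, whereas sampled predictions change from run to run and can score above or below the argmax accuracy.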

The animation here plots the probability that a trained model predicts the correct graph type.


To understand the node and graph representations that a trained model learned, we use t-SNE for dimensionality reduction and visualization.

https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/batch/tsne_node2.png https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/batch/tsne_graph2.png

The two small figures on the top separately visualize node representations after one and two layers of graph convolution. The figure on the bottom visualizes the pre-softmax logits for graphs as graph representations.

While the visualization does suggest some clustering effects of the node features, you would not expect a perfect result, because these node features are determined entirely by the node degrees. The graph features, by contrast, are much better separated.

What’s next?

Graph classification with graph neural networks is still a new field. It’s waiting for people to bring more exciting discoveries. The work requires mapping different graphs to different embeddings, while preserving their structural similarity in the embedding space. To learn more about it, see "How Powerful Are Graph Neural Networks?", a research paper published at the International Conference on Learning Representations 2019.

For more examples about batched graph processing, see the following:

Total running time of the script: ( 0 minutes 22.708 seconds)
