dgl.dataloading

The dgl.dataloading package contains:

  • Data loader classes for iterating over a set of nodes or edges in a graph and generating their computation dependencies via neighborhood sampling methods.

  • Various sampler classes that perform neighborhood sampling for multi-layer GNNs.

  • Negative samplers for link prediction.

For a holistic explanation of how the different components work together, read the user guide Chapter 6: Stochastic Training on Large Graphs.

Note

This package is experimental, and the interfaces may change in future releases. It currently only has a PyTorch implementation.

DataLoaders

DGL's DataLoaders for mini-batch training work similarly to PyTorch's DataLoader: they provide a generator interface that returns mini-batches sampled from the given graph. DGL provides two DataLoaders: NodeDataLoader for node classification tasks and EdgeDataLoader for edge classification and link prediction tasks.

class dgl.dataloading.pytorch.NodeDataLoader(g, nids, block_sampler, **kwargs)[source]

PyTorch dataloader for batch-iterating over a set of nodes, generating the list of blocks that form the computation dependency of each minibatch.

Parameters
  • g (DGLGraph) – The graph.

  • nids (Tensor or dict[ntype, Tensor]) – The node set to compute outputs.

  • block_sampler (dgl.dataloading.BlockSampler) – The neighborhood sampler.

  • kwargs (dict) – Keyword arguments passed to torch.utils.data.DataLoader.

Examples

To train a 3-layer GNN for node classification on a set of nodes train_nid on a homogeneous graph, where each node gathers messages from up to 15, 10, and 5 sampled neighbors in the first, second, and third layer respectively (assuming the backend is PyTorch):

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
>>> dataloader = dgl.dataloading.NodeDataLoader(
...     g, train_nid, sampler,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for input_nodes, output_nodes, blocks in dataloader:
...     train_on(input_nodes, output_nodes, blocks)
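
For reference, a train_on function for node classification might look roughly like the following (a sketch, not part of DGL; the feature name 'feat', label name 'label', and the model and optimizer opt are assumptions):

>>> import torch.nn.functional as F
>>> def train_on(input_nodes, output_nodes, blocks):
...     # Slice the input features and seed-node labels for this minibatch;
...     # 'feat'/'label' and the model/optimizer are assumed to exist.
...     batch_inputs = g.ndata['feat'][input_nodes]
...     batch_labels = g.ndata['label'][output_nodes]
...     batch_pred = model(blocks, batch_inputs)  # message passing over blocks
...     loss = F.cross_entropy(batch_pred, batch_labels)
...     opt.zero_grad()
...     loss.backward()
...     opt.step()
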
class dgl.dataloading.pytorch.EdgeDataLoader(g, eids, block_sampler, **kwargs)[source]

PyTorch dataloader for batch-iterating over a set of edges, generating the list of blocks that form the computation dependency of each minibatch, for edge classification, edge regression, and link prediction.

Parameters
  • g (DGLGraph) – The graph.

  • eids (Tensor or dict[etype, Tensor]) – The edge set in graph g to compute outputs.

  • block_sampler (dgl.dataloading.BlockSampler) – The neighborhood sampler.

  • g_sampling (DGLGraph, optional) –

    The graph where neighborhood sampling is performed.

    One may wish to iterate over the edges in one graph while performing neighborhood sampling in another. This is the case, for example, when iterating over the validation or test edge set while sampling neighborhoods from the graph formed by only the training edges (see the validation example below).

    If None, it is assumed to be the same as g.

  • exclude (str, optional) –

    Whether and how to exclude dependencies related to the sampled edges in the minibatch. Possible values are:

    • None,

    • 'reverse_id',

    • 'reverse_types'

    See the description of the argument with the same name in the docstring of EdgeCollator for more details.

  • reverse_eids (Tensor or dict[etype, Tensor], optional) –

    The mapping from the original edge IDs to the ID of their reverse edges.

    See the description of the argument with the same name in the docstring of EdgeCollator for more details.

  • reverse_etypes (dict[etype, etype], optional) –

    The mapping from the original edge types to their reverse edge types.

    See the description of the argument with the same name in the docstring of EdgeCollator for more details.

  • negative_sampler (callable, optional) –

    The negative sampler.

    See the description of the argument with the same name in the docstring of EdgeCollator for more details.

  • kwargs (dict) – Keyword arguments passed to torch.utils.data.DataLoader.

Examples

The following example shows how to train a 3-layer GNN for edge classification on a set of edges train_eid on a homogeneous undirected graph, where each node gathers messages from up to 15, 10, and 5 sampled neighbors per layer.

Say that you have an array of source node IDs src and another array of destination node IDs dst. One can make it bidirectional by adding another set of edges that connects from dst to src:

>>> g = dgl.graph((torch.cat([src, dst]), torch.cat([dst, src])))

The ID difference between an edge and its reverse edge is then |E|, where |E| is the length of the source/destination arrays. The reverse edge mapping can be obtained by

>>> E = len(src)
>>> reverse_eids = torch.cat([torch.arange(E, 2 * E), torch.arange(0, E)])

Note that the sampled edges, as well as their reverse edges, are removed from the computation dependencies of the incident nodes. This is a common trick to avoid information leakage.

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
>>> dataloader = dgl.dataloading.EdgeDataLoader(
...     g, train_eid, sampler, exclude='reverse_id',
...     reverse_eids=reverse_eids,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for input_nodes, pair_graph, blocks in dataloader:
...     train_on(input_nodes, pair_graph, blocks)
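
The g_sampling argument lets you iterate over one edge set while sampling neighborhoods from a different graph. For instance, to iterate over validation edges while sampling only from the graph built from the training edges (a sketch; val_eid and train_g are assumptions, not defined above):

>>> dataloader = dgl.dataloading.EdgeDataLoader(
...     g, val_eid, sampler, g_sampling=train_g,
...     batch_size=1024, shuffle=False, drop_last=False, num_workers=4)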

To train a 3-layer GNN for link prediction on a set of edges train_eid on a homogeneous graph with the same neighbor sampling as above (assuming the backend is PyTorch), with 5 uniformly chosen negative samples per edge:

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
>>> neg_sampler = dgl.dataloading.negative_sampler.Uniform(5)
>>> dataloader = dgl.dataloading.EdgeDataLoader(
...     g, train_eid, sampler, exclude='reverse_id',
...     reverse_eids=reverse_eids, negative_sampler=neg_sampler,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for input_nodes, pos_pair_graph, neg_pair_graph, blocks in dataloader:
...     train_on(input_nodes, pos_pair_graph, neg_pair_graph, blocks)
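
Inside the training loop, one common way to score the sampled node pairs is a dot product of the endpoint representations via apply_edges (a sketch; the representations h are assumed to come from running the model on blocks, and the margin loss is just one common choice):

>>> import dgl.function as fn
>>> def compute_loss(pos_pair_graph, neg_pair_graph, h):
...     # Score each positive and negative pair by the dot product of its
...     # endpoint representations.
...     with pos_pair_graph.local_scope(), neg_pair_graph.local_scope():
...         pos_pair_graph.ndata['h'] = h
...         neg_pair_graph.ndata['h'] = h
...         pos_pair_graph.apply_edges(fn.u_dot_v('h', 'h', 'score'))
...         neg_pair_graph.apply_edges(fn.u_dot_v('h', 'h', 'score'))
...         pos_score = pos_pair_graph.edata['score']
...         neg_score = neg_pair_graph.edata['score']
...     # Margin loss: positive pairs should outscore negatives by 1.
...     neg_score = neg_score.view(pos_score.shape[0], -1)
...     return (1 - pos_score + neg_score).clamp(min=0).mean()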

For heterogeneous graphs, the reverse of an edge may have a different edge type from the original edge. For instance, suppose you have an array of user-item clicks, represented by a user array user and an item array item. You may want to build a heterogeneous graph with a user-click-item relation and an item-clicked-by-user relation.

>>> g = dgl.heterograph({
...     ('user', 'click', 'item'): (user, item),
...     ('item', 'clicked-by', 'user'): (item, user)})

To train a 3-layer GNN for edge classification on a set of edges train_eid with type click, you can write

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
>>> dataloader = dgl.dataloading.EdgeDataLoader(
...     g, {'click': train_eid}, sampler, exclude='reverse_types',
...     reverse_etypes={'click': 'clicked-by', 'clicked-by': 'click'},
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for input_nodes, pair_graph, blocks in dataloader:
...     train_on(input_nodes, pair_graph, blocks)

To train a 3-layer GNN for link prediction on a set of edges train_eid with type click, you can write

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
>>> neg_sampler = dgl.dataloading.negative_sampler.Uniform(5)
>>> dataloader = dgl.dataloading.EdgeDataLoader(
...     g, {'click': train_eid}, sampler, exclude='reverse_types',
...     reverse_etypes={'click': 'clicked-by', 'clicked-by': 'click'},
...     negative_sampler=neg_sampler,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for input_nodes, pos_pair_graph, neg_pair_graph, blocks in dataloader:
...     train_on(input_nodes, pos_pair_graph, neg_pair_graph, blocks)

See also

EdgeCollator

For end-to-end usages, please refer to the following tutorial/examples:

  • Edge classification on heterogeneous graph: GCMC

  • Link prediction on homogeneous graph: GraphSAGE for unsupervised learning

  • Link prediction on heterogeneous graph: RGCN for link prediction

Neighbor Sampler

Neighbor samplers are classes that control how DataLoaders sample neighbors. All of them inherit from the base BlockSampler class but implement different neighbor sampling strategies by overriding the sample_frontier() or sample_blocks() methods.

class dgl.dataloading.neighbor.BlockSampler(num_layers, return_eids)[source]

Abstract class specifying the neighborhood sampling strategy for DGL data loaders.

The main method of BlockSampler is sample_blocks(), which generates a list of blocks for a multi-layer GNN given a set of seed nodes whose outputs are to be computed.

The default implementation of sample_blocks() repeats the following procedure num_layers times, from the last layer to the first:

  • Obtain a frontier. The frontier is defined as a graph with the same nodes as the original graph but only the edges involved in message passing on the current layer. Customizable via sample_frontier().

  • Optionally, if the task is link prediction or edge classification, remove the edges connecting the training node pairs. If the graph is undirected, also remove the reverse edges. This is controlled by the exclude_eids argument of the sample_blocks() method.

  • Convert the frontier into a block.

  • Optionally assign the IDs of the edges in the original graph selected in the first step to the block, controlled by the return_eids argument of the constructor.

  • Prepend the block to the block list to be returned.

All subclasses should override the sample_frontier() method and specify the number of layers to sample via the num_layers argument, as illustrated in the example below.

Parameters
  • num_layers (int) – The number of layers to sample.

  • return_eids (bool, default False) – Whether to return the edge IDs involved in message passing in the block. If True, the edge IDs will be stored as an edge feature named dgl.EID.

Notes

For the concepts of frontiers and blocks, please refer to the user guide Chapter 6: Stochastic Training on Large Graphs.
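
Examples

A minimal custom sampler might subclass BlockSampler and override sample_frontier() as follows (a sketch; the class name and the single uniform fanout are illustrative, not part of DGL):

>>> class UniformSampler(dgl.dataloading.BlockSampler):
...     def __init__(self, num_layers, fanout):
...         super().__init__(num_layers, return_eids=False)
...         self.fanout = fanout
...     def sample_frontier(self, block_id, g, seed_nodes):
...         # Uniformly pick `fanout` in-neighbors of every seed node,
...         # using the same fanout at every layer.
...         return dgl.sampling.sample_neighbors(g, seed_nodes, self.fanout)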

sample_blocks(g, seed_nodes, exclude_eids=None)[source]

Generate a list of blocks given the output nodes.

Parameters
  • g (DGLGraph) – The original graph.

  • seed_nodes (Tensor or dict[ntype, Tensor]) –

    The output nodes by node type.

    If the graph only has one node type, one can just specify a single tensor of node IDs.

  • exclude_eids (Tensor or dict[etype, Tensor]) – The edges to exclude from computation dependency.

Returns

The blocks generated for computing the multi-layer GNN output.

Return type

list[DGLGraph]

Notes

For the concepts of frontiers and blocks, please refer to the user guide Chapter 6: Stochastic Training on Large Graphs.
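
One can also call sample_blocks() directly to inspect the sampled computation dependency for a few seed nodes (a sketch; the seed node IDs are arbitrary, and g is a homogeneous graph as above):

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
>>> blocks = sampler.sample_blocks(g, torch.LongTensor([0, 1, 2]))
>>> blocks[0].number_of_src_nodes()  # input nodes required by the first layer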

sample_frontier(block_id, g, seed_nodes)[source]

Generate the frontier given the output nodes.

Subclasses should override this function.

Parameters
  • block_id (int) – Represents which GNN layer the frontier is generated for.

  • g (DGLGraph) – The original graph.

  • seed_nodes (Tensor or dict[ntype, Tensor]) –

    The output nodes by node type.

    If the graph only has one node type, one can just specify a single tensor of node IDs.

Returns

The frontier generated for the current layer.

Return type

DGLGraph

Notes

For the concepts of frontiers and blocks, please refer to the user guide Chapter 6: Stochastic Training on Large Graphs.

class dgl.dataloading.neighbor.MultiLayerNeighborSampler(fanouts, replace=False, return_eids=False)[source]

Bases: dgl.dataloading.dataloader.BlockSampler

Sampler that builds the computational dependency of node representations via neighbor sampling for multi-layer GNNs.

This sampler will make every node gather messages from a fixed number of neighbors per edge type. The neighbors are picked uniformly.

Parameters
  • fanouts (list[int] or list[dict[etype, int] or None]) –

    List of neighbors to sample per edge type for each GNN layer, starting from the first layer.

    If the graph is homogeneous, only an integer is needed for each layer.

    If None is provided for one layer, all neighbors will be included regardless of edge types.

    If -1 is provided for one edge type on one layer, then all inbound edges of that edge type will be included.

  • replace (bool, default False) – Whether to sample with replacement.

  • return_eids (bool, default False) – Whether to return the edge IDs involved in message passing in the block. If True, the edge IDs will be stored as an edge feature named dgl.EID.

Examples

To train a 3-layer GNN for node classification on a set of nodes train_nid on a homogeneous graph, where each node takes messages from 5, 10, and 15 neighbors for the first, second, and third layer respectively (assuming the backend is PyTorch):

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([5, 10, 15])
>>> collator = dgl.dataloading.NodeCollator(g, train_nid, sampler)
>>> dataloader = torch.utils.data.DataLoader(
...     collator.dataset, collate_fn=collator.collate,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for blocks in dataloader:
...     train_on(blocks)

If you are training on a heterogeneous graph and want a different number of neighbors for each edge type, provide a list of dicts instead, one per layer, each specifying the number of neighbors to pick per edge type.

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([
...     {('user', 'follows', 'user'): 5,
...      ('user', 'plays', 'game'): 4,
...      ('game', 'played-by', 'user'): 3}] * 3)
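
Such a sampler plugs into the dataloaders in the same way, for instance with a dict of seed nodes keyed by node type (a sketch; hg and train_nid_dict are assumptions):

>>> dataloader = dgl.dataloading.NodeDataLoader(
...     hg, train_nid_dict, sampler,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
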
sample_frontier(block_id, g, seed_nodes)[source]

Generate the frontier given the output nodes.

Parameters
  • block_id (int) – Represents which GNN layer the frontier is generated for.

  • g (DGLGraph) – The original graph.

  • seed_nodes (Tensor or dict[ntype, Tensor]) –

    The output nodes by node type.

    If the graph only has one node type, one can just specify a single tensor of node IDs.

Returns

The frontier generated for the current layer.

Return type

DGLGraph

Notes

For the concepts of frontiers and blocks, please refer to the user guide Chapter 6: Stochastic Training on Large Graphs.

class dgl.dataloading.neighbor.MultiLayerFullNeighborSampler(n_layers, return_eids=False)[source]

Bases: dgl.dataloading.neighbor.MultiLayerNeighborSampler

Sampler that builds the computational dependency of node representations by taking messages from all neighbors for multi-layer GNNs.

This sampler will make every node gather messages from every single neighbor per edge type.

Parameters
  • n_layers (int) – The number of GNN layers to sample.

  • return_eids (bool, default False) – Whether to return the edge IDs involved in message passing in the block. If True, the edge IDs will be stored as an edge feature named dgl.EID.

Examples

To train a 3-layer GNN for node classification on a set of nodes train_nid on a homogeneous graph, where each node takes messages from all neighbors in each of the three layers (assuming the backend is PyTorch):

>>> sampler = dgl.dataloading.MultiLayerFullNeighborSampler(3)
>>> collator = dgl.dataloading.NodeCollator(g, train_nid, sampler)
>>> dataloader = torch.utils.data.DataLoader(
...     collator.dataset, collate_fn=collator.collate,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for blocks in dataloader:
...     train_on(blocks)