dgl.dataloading¶
The dgl.dataloading
package contains:
Data loader classes for iterating over a set of nodes or edges in a graph and generating computation dependencies via neighborhood sampling methods.
Various sampler classes that perform neighborhood sampling for multi-layer GNNs.
Negative samplers for link prediction.
For a holistic explanation of how the different components work together, read the user guide Chapter 6: Stochastic Training on Large Graphs.
Note
This package is experimental and the interfaces may change in future releases. It currently only has implementations in PyTorch.
DataLoaders¶
DGL DataLoaders for minibatch training work similarly to PyTorch's DataLoader. They have a generator interface that returns minibatches sampled from some given graphs. DGL provides two DataLoaders: a NodeDataLoader for node classification tasks and an EdgeDataLoader for edge/link prediction tasks.

class dgl.dataloading.pytorch.NodeDataLoader(g, nids, block_sampler, **kwargs)[source]¶
PyTorch dataloader for batch-iterating over a set of nodes, generating the list of blocks as computation dependency of the said minibatch.
Parameters
g (DGLGraph) – The graph.
nids (Tensor or dict[ntype, Tensor]) – The node set in graph g to compute outputs.
block_sampler (dgl.dataloading.BlockSampler) – The neighborhood sampler.
kwargs (dict) – Arguments being passed to torch.utils.data.DataLoader.
Examples
To train a 3-layer GNN for node classification on a set of nodes train_nid on a homogeneous graph where each node takes messages from all neighbors (assume the backend is PyTorch):

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
>>> dataloader = dgl.dataloading.NodeDataLoader(
...     g, train_nid, sampler,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for input_nodes, output_nodes, blocks in dataloader:
...     train_on(input_nodes, output_nodes, blocks)

class dgl.dataloading.pytorch.EdgeDataLoader(g, eids, block_sampler, **kwargs)[source]¶
PyTorch dataloader for batch-iterating over a set of edges, generating the list of blocks as computation dependency of the said minibatch for edge classification, edge regression, and link prediction.
Parameters
g (DGLGraph) – The graph.
eids (Tensor or dict[etype, Tensor]) – The edge set in graph g to compute outputs.
block_sampler (dgl.dataloading.BlockSampler) – The neighborhood sampler.
g_sampling (DGLGraph, optional) – The graph where neighborhood sampling is performed. One may wish to iterate over the edges in one graph while performing sampling in another graph. This may be the case for iterating over validation and test edge sets while performing neighborhood sampling on the graph formed by only the training edge set. If None, assumed to be the same as g.
exclude (str, optional) – Whether and how to exclude dependencies related to the sampled edges in the minibatch. Possible values are None, 'reverse_id', and 'reverse_types'. See the description of the argument with the same name in the docstring of EdgeCollator for more details.
reverse_edge_ids (Tensor or dict[etype, Tensor], optional) – The mapping from the original edge IDs to the IDs of their reverse edges. See the description of the argument with the same name in the docstring of EdgeCollator for more details.
reverse_etypes (dict[etype, etype], optional) – The mapping from the original edge types to their reverse edge types. See the description of the argument with the same name in the docstring of EdgeCollator for more details.
negative_sampler (callable, optional) – The negative sampler. See the description of the argument with the same name in the docstring of EdgeCollator for more details.
kwargs (dict) – Arguments being passed to torch.utils.data.DataLoader.
Examples
The following example shows how to train a 3-layer GNN for edge classification on a set of edges train_eid on a homogeneous undirected graph. Each node takes messages from all neighbors.

Say that you have an array of source node IDs src and another array of destination node IDs dst. One can make the graph bidirectional by adding another set of edges that connects from dst to src:

>>> g = dgl.graph((torch.cat([src, dst]), torch.cat([dst, src])))

One can then know that the ID difference between an edge and its reverse edge is E, where E is the length of your source/destination array. The reverse edge mapping can be obtained by

>>> E = len(src)
>>> reverse_eids = torch.cat([torch.arange(E, 2 * E), torch.arange(0, E)])
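As a quick sanity check of this mapping (the concrete src and dst arrays below are placeholders standing in for your real ID arrays):

```python
import torch

# Placeholder arrays; the real src/dst come from your data.
src = torch.tensor([0, 1, 2])
dst = torch.tensor([1, 2, 3])
E = len(src)
reverse_eids = torch.cat([torch.arange(E, 2 * E), torch.arange(0, E)])
# Edge i connects src[i]->dst[i]; its reverse edge (dst[i]->src[i]) received
# ID i + E when the bidirectional graph was built, so reverse_eids[i] == i + E
# for i < E, and reverse_eids[i + E] == i.
# reverse_eids is tensor([3, 4, 5, 0, 1, 2])
```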
Note that the sampled edges as well as their reverse edges are removed from the computation dependencies of the incident nodes. This is a common trick to avoid information leakage.

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
>>> dataloader = dgl.dataloading.EdgeDataLoader(
...     g, train_eid, sampler, exclude='reverse_id',
...     reverse_eids=reverse_eids,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for input_nodes, pair_graph, blocks in dataloader:
...     train_on(input_nodes, pair_graph, blocks)
To train a 3-layer GNN for link prediction on a set of edges train_eid on a homogeneous graph where each node takes messages from all neighbors (assume the backend is PyTorch), with 5 uniformly chosen negative samples per edge:

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
>>> neg_sampler = dgl.dataloading.negative_sampler.Uniform(5)
>>> dataloader = dgl.dataloading.EdgeDataLoader(
...     g, train_eid, sampler, exclude='reverse_id',
...     reverse_eids=reverse_eids, negative_sampler=neg_sampler,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for input_nodes, pos_pair_graph, neg_pair_graph, blocks in dataloader:
...     train_on(input_nodes, pos_pair_graph, neg_pair_graph, blocks)
For heterogeneous graphs, the reverse of an edge may have a different edge type from the original edge. For instance, consider that you have an array of user-item clicks, represented by a user array user and an item array item. You may want to build a heterogeneous graph with a user-click-item relation and an item-clicked-by-user relation.

>>> g = dgl.heterograph({
...     ('user', 'click', 'item'): (user, item),
...     ('item', 'clicked-by', 'user'): (item, user)})
To train a 3-layer GNN for edge classification on a set of edges train_eid with type click, you can write

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
>>> dataloader = dgl.dataloading.EdgeDataLoader(
...     g, {'click': train_eid}, sampler, exclude='reverse_types',
...     reverse_etypes={'click': 'clicked-by', 'clicked-by': 'click'},
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for input_nodes, pair_graph, blocks in dataloader:
...     train_on(input_nodes, pair_graph, blocks)
To train a 3-layer GNN for link prediction on a set of edges train_eid with type click, you can write

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([15, 10, 5])
>>> neg_sampler = dgl.dataloading.negative_sampler.Uniform(5)
>>> dataloader = dgl.dataloading.EdgeDataLoader(
...     g, {'click': train_eid}, sampler, exclude='reverse_types',
...     reverse_etypes={'click': 'clicked-by', 'clicked-by': 'click'},
...     negative_sampler=neg_sampler,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for input_nodes, pos_pair_graph, neg_pair_graph, blocks in dataloader:
...     train_on(input_nodes, pos_pair_graph, neg_pair_graph, blocks)
See also
EdgeCollator
For end-to-end usages, please refer to the following tutorials/examples:
Edge classification on heterogeneous graph: GCMC
Link prediction on homogeneous graph: GraphSAGE for unsupervised learning
Link prediction on heterogeneous graph: RGCN for link prediction.
Neighbor Sampler¶
Neighbor samplers are classes that control the behavior of DataLoaders to sample neighbors. All of them inherit the base BlockSampler class, but implement different neighbor sampling strategies by overriding the sample_frontier or the sample_blocks methods.

class dgl.dataloading.neighbor.BlockSampler(num_layers, return_eids)[source]¶
Abstract class specifying the neighborhood sampling strategy for DGL data loaders.
The main method of BlockSampler is sample_blocks(), which generates a list of blocks for a multi-layer GNN given a set of seed nodes to have their outputs computed.

The default implementation of sample_blocks() repeats the following procedure num_layers times, from the last layer to the first layer:

1. Obtain a frontier. The frontier is defined as a graph with the same nodes as the original graph but only the edges involved in message passing on the current layer. Customizable via sample_frontier().
2. Optionally, if the task is link prediction or edge classification, remove edges connecting training node pairs. If the graph is undirected, also remove the reverse edges. This is controlled by the exclude_eids argument of the sample_blocks() method.
3. Convert the frontier into a block.
4. Optionally assign the IDs of the edges in the original graph selected in the first step to the block, controlled by the return_eids argument of the sample_blocks() method.
5. Prepend the block to the block list to be returned.

All subclasses should override the sample_frontier() method while specifying the number of layers to sample in the num_layers argument.

Parameters
num_layers (int) – The number of layers to sample.
return_eids (bool) – Whether to return the edge IDs involved in message passing in the blocks.
Notes
For the concept of frontiers and blocks, please refer to User Guide Section 6 [TODO].

sample_blocks(g, seed_nodes, exclude_eids=None)[source]¶
Generate a list of blocks given the output nodes.
Parameters
g (DGLGraph) – The original graph.
seed_nodes (Tensor or dict[ntype, Tensor]) – The output nodes of the last layer.
exclude_eids (Tensor or dict[etype, Tensor], optional) – The edges to exclude from the computation dependencies.
 Returns
The blocks generated for computing the multi-layer GNN output.
 Return type
list[DGLGraph]
Notes
For the concept of frontiers and blocks, please refer to User Guide Section 6 [TODO].

sample_frontier(block_id, g, seed_nodes)[source]¶
Generate the frontier given the output nodes.
Subclasses should override this method.
Parameters
block_id (int) – Which layer the generated frontier is for, counting from the first layer.
g (DGLGraph) – The original graph.
seed_nodes (Tensor or dict[ntype, Tensor]) – The output nodes of the current layer.
 Returns
The frontier generated for the current layer.
 Return type
DGLGraph
Notes
For the concept of frontiers and blocks, please refer to User Guide Section 6 [TODO].

class dgl.dataloading.neighbor.MultiLayerNeighborSampler(fanouts, replace=False, return_eids=False)[source]¶
Bases: dgl.dataloading.dataloader.BlockSampler
Sampler that builds computational dependency of node representations via neighbor sampling for multi-layer GNN.
This sampler will make every node gather messages from a fixed number of neighbors per edge type. The neighbors are picked uniformly.
 Parameters
fanouts (list[int] or list[dict[etype, int]] or None) –
List of neighbors to sample per edge type for each GNN layer, starting from the first layer.
If the graph is homogeneous, only an integer is needed for each layer.
If None is provided for one layer, all neighbors will be included regardless of edge types.
If -1 is provided for one edge type on one layer, then all inbound edges of that edge type will be included.
replace (bool, default False) – Whether to sample with replacement.
return_eids (bool, default False) – Whether to return the edge IDs involved in message passing in the block. If True, the edge IDs will be stored as an edge feature named dgl.EID.
Examples
To train a 3-layer GNN for node classification on a set of nodes train_nid on a homogeneous graph where each node takes messages from 5, 10, 15 neighbors for the first, second, and third layer respectively (assuming the backend is PyTorch):

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([5, 10, 15])
>>> collator = dgl.dataloading.NodeCollator(g, train_nid, sampler)
>>> dataloader = torch.utils.data.DataLoader(
...     collator.dataset, collate_fn=collator.collate,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for blocks in dataloader:
...     train_on(blocks)
If training on a heterogeneous graph and you want a different number of neighbors for each edge type, one should instead provide a list of dicts. Each dict specifies the number of neighbors to pick per edge type.

>>> sampler = dgl.dataloading.MultiLayerNeighborSampler([
...     {('user', 'follows', 'user'): 5,
...      ('user', 'plays', 'game'): 4,
...      ('game', 'played-by', 'user'): 3}] * 3)

sample_frontier(block_id, g, seed_nodes)[source]¶
Generate the frontier given the output nodes.
Parameters
block_id (int) – Which layer the generated frontier is for, counting from the first layer.
g (DGLGraph) – The original graph.
seed_nodes (Tensor or dict[ntype, Tensor]) – The output nodes of the current layer.
 Returns
The frontier generated for the current layer.
 Return type
DGLGraph
Notes
For the concept of frontiers and blocks, please refer to User Guide Section 6 [TODO].

class dgl.dataloading.neighbor.MultiLayerFullNeighborSampler(n_layers, return_eids=False)[source]¶
Bases: dgl.dataloading.neighbor.MultiLayerNeighborSampler
Sampler that builds computational dependency of node representations by taking messages from all neighbors for multi-layer GNN.
This sampler will make every node gather messages from every single neighbor per edge type.
Parameters
n_layers (int) – The number of GNN layers to sample.
return_eids (bool, default False) – Whether to return the edge IDs involved in message passing in the block. If True, the edge IDs will be stored as an edge feature named dgl.EID.
Examples
To train a 3-layer GNN for node classification on a set of nodes train_nid on a homogeneous graph where each node takes messages from all neighbors for the first, second, and third layer respectively (assuming the backend is PyTorch):

>>> sampler = dgl.dataloading.MultiLayerFullNeighborSampler(3)
>>> collator = dgl.dataloading.NodeCollator(g, train_nid, sampler)
>>> dataloader = torch.utils.data.DataLoader(
...     collator.dataset, collate_fn=collator.collate,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for blocks in dataloader:
...     train_on(blocks)
Negative Samplers for Link Prediction¶
Negative samplers are classes that control the behavior of the EdgeDataLoader to generate negative edges.

class dgl.dataloading.negative_sampler.Uniform(k)[source]¶
Negative sampler that randomly chooses negative destination nodes for each source node according to a uniform distribution.
For each edge (u, v) of type (srctype, etype, dsttype), DGL generates k pairs of negative edges (u, v'), where v' is chosen uniformly from all the nodes of type dsttype. The resulting edges will also have type (srctype, etype, dsttype).
. Parameters
k (int) – The number of negative examples per edge.
Examples
>>> g = dgl.graph(([0, 1, 2], [1, 2, 3]))
>>> neg_sampler = dgl.dataloading.negative_sampler.Uniform(2)
>>> neg_sampler(g, [0, 1])
(tensor([0, 0, 1, 1]), tensor([1, 0, 2, 3]))