6.7 Using GPU for Neighborhood Sampling

DGL has supported GPU-based neighborhood sampling since version 0.7. It is significantly faster than CPU-based neighborhood sampling. If you estimate that your graph and its features can fit onto the GPU and your model does not take a lot of GPU memory, then it is best to put the graph onto the GPU and use GPU-based neighborhood sampling.

For example, OGB Products has 2.4M nodes and 61M edges, with each node having 100-dimensional features. The node features themselves take less than 1GB of memory, and the graph also takes less than 1GB, since the memory consumption of a graph depends on the number of edges. It is therefore entirely possible to fit the whole graph onto the GPU.
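The estimate above can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation: it assumes float32 features, int64 node IDs, and a single COO copy of the graph (two ID arrays, one entry per edge). None of these storage details are stated in this guide, and the helper names are hypothetical.

```python
# Rough GPU memory estimate for OGB Products, using the rounded sizes quoted above.
# Assumptions (not from the guide): float32 features, int64 node IDs,
# and one COO copy of the graph (source and destination ID arrays).

NUM_NODES = 2_400_000    # about 2.4M nodes
NUM_EDGES = 61_000_000   # about 61M edges
FEAT_DIM = 100           # 100-dimensional node features

BYTES_PER_FLOAT32 = 4
BYTES_PER_INT64 = 8
GB = 10 ** 9


def feature_bytes(num_nodes, feat_dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Bytes needed for a dense num_nodes x feat_dim feature matrix."""
    return num_nodes * feat_dim * bytes_per_value


def coo_graph_bytes(num_edges, bytes_per_id=BYTES_PER_INT64):
    """Bytes needed for the source and destination ID arrays of a COO graph."""
    return 2 * num_edges * bytes_per_id


feat_gb = feature_bytes(NUM_NODES, FEAT_DIM) / GB
graph_gb = coo_graph_bytes(NUM_EDGES) / GB
print(f"features: {feat_gb:.2f} GB, graph: {graph_gb:.2f} GB")
# prints "features: 0.96 GB, graph: 0.98 GB"
```

Both figures stay under 1GB under these assumptions. Actual usage may be higher if additional sparse formats of the graph are materialized, so treat the numbers as a lower bound when budgeting GPU memory.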


This feature is experimental and a work-in-progress. Please stay tuned for further updates.

Using GPU-based neighborhood sampling in DGL data loaders

One can use GPU-based neighborhood sampling with DGL data loaders by:

  • Putting the graph onto the GPU.

  • Setting the num_workers argument to 0, because CUDA does not allow multiple processes to access the same context.

  • Setting the device argument to a GPU device.

All other arguments for the NodeDataLoader can be the same as in the other user guides and tutorials.

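import dgl
import torch

g = g.to('cuda:0')
dataloader = dgl.dataloading.NodeDataLoader(
    g,                                # The graph must be on GPU.
    train_nid,                        # Seed node IDs, defined as in the other guides.
    sampler,                          # Neighbor sampler, defined as in the other guides.
    device=torch.device('cuda:0'),    # The device argument must be GPU.
    num_workers=0)                    # Number of workers must be 0.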

GPU-based neighbor sampling also works for custom neighborhood samplers as long as (1) your sampler is a subclass of BlockSampler, and (2) your sampler works entirely on the GPU.


Currently EdgeDataLoader and heterogeneous graphs are not supported.

Using GPU-based neighbor sampling with DGL functions

The following sampling functions support operating on GPU:

Besides the functions above, dgl.to_block() can also run on GPU.