dgl.sampling

Sampling operators.

This module contains the implementations of various sampling algorithms on graphs.

Random walk sampling functions

  • random_walk(g, nodes, *[, metapath, length, …]) – Generate random walk traces from an array of starting nodes based on the given metapath.
  • pack_traces(traces, types) – Pack the padded traces returned by random_walk() into a concatenated array.
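
A minimal usage sketch (the 4-node cycle graph, seed IDs, and walk length below are made up for illustration; PyTorch backend assumed):

>>> import torch
>>> import dgl
>>> g = dgl.graph(([0, 1, 2, 3], [1, 2, 3, 0]))        # a 4-node directed cycle
>>> traces, types = dgl.sampling.random_walk(
...     g, torch.tensor([0, 1]), length=3)             # one padded trace per seed node
>>> vids, vtypes, lengths, offsets = dgl.sampling.pack_traces(traces, types)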

Neighbor sampling functions

  • sample_neighbors(g, nodes, fanout[, …]) – Sample neighboring edges of the given nodes and return the induced subgraph.
  • select_topk(g, k, weight[, nodes, edge_dir, …]) – Select the neighboring edges with k-largest (or k-smallest) weights of the given nodes and return the induced subgraph.
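
A sketch of both functions on a small homogeneous graph (the graph, seed nodes, and the 'w' edge feature are made up for illustration):

>>> import torch
>>> import dgl
>>> g = dgl.graph(([0, 0, 1, 1, 2, 2], [1, 2, 0, 3, 0, 3]))
>>> sub = dgl.sampling.sample_neighbors(g, torch.tensor([0, 3]), 1)    # keep 1 inbound edge per seed
>>> g.edata['w'] = torch.rand(g.number_of_edges())
>>> top = dgl.sampling.select_topk(g, 1, 'w', nodes=torch.tensor([0, 3]))  # heaviest inbound edge per seed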

PyTorch DataLoaders with neighborhood sampling

Builtin sampler classes for more complicated sampling algorithms

class dgl.sampling.RandomWalkNeighborSampler(G, num_traversals, termination_prob, num_random_walks, num_neighbors, metapath=None, weight_column='weights')[source]

PinSage-like neighbor sampler extended to any heterogeneous graphs.

Given a heterogeneous graph and a list of nodes, this callable will generate a homogeneous graph where the neighbors of each given node are the most commonly visited nodes of the same type by multiple random walks starting from that given node. Each random walk consists of multiple metapath-based traversals, with a probability of termination after each traversal.

The edges of the returned homogeneous graph run from each given node's most commonly visited nodes to that given node, and carry a feature indicating the number of visits.

The metapath must have the same beginning and ending node type to make the algorithm work.

This is a generalization of the PinSAGE sampler, which only works on bidirectional bipartite graphs.

Parameters:
  • G (DGLGraph) – The graph.
  • num_traversals (int) –

    The maximum number of metapath-based traversals for a single random walk.

    Usually considered a hyperparameter.

  • termination_prob (float) –

    Termination probability after each metapath-based traversal.

    Usually considered a hyperparameter.

  • num_random_walks (int) –

    Number of random walks to try for each given node.

    Usually considered a hyperparameter.

  • num_neighbors (int) – Number of neighbors (or most commonly visited nodes) to select for each given node.
  • metapath (list[str] or list[tuple[str, str, str]], optional) –

    The metapath.

    If not given, DGL assumes that the graph is homogeneous and the metapath consists of one step over the single edge type.

  • weight_column (str, default "weights") – The name of the edge feature to be stored on the returned graph with the number of visits.
Inputs:
  • seed_nodes (Tensor) – A tensor of given node IDs of node type ntype to generate neighbors from. The node type ntype is the beginning and ending node type of the given metapath.

Outputs:
  • g (DGLGraph) – A homogeneous graph constructed by selecting neighbors for each given node according to the algorithm above.

Examples

See examples in PinSAGESampler.
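
The examples there use the bipartite-specific wrapper; as an additional hedged sketch, the sampler can also be constructed directly with an explicit metapath (the 'user'/'game' heterograph and edge type names below are made up for illustration):

>>> import torch
>>> import dgl
>>> G = dgl.heterograph({
...     ('user', 'plays', 'game'): ([0, 1, 2], [0, 0, 1]),
...     ('game', 'played-by', 'user'): ([0, 0, 1], [0, 1, 2])})
>>> sampler = dgl.sampling.RandomWalkNeighborSampler(
...     G, num_traversals=2, termination_prob=0.5,
...     num_random_walks=100, num_neighbors=2,
...     metapath=['plays', 'played-by'])     # begins and ends at node type 'user'
>>> frontier = sampler(torch.LongTensor([0, 1]))
>>> visits = frontier.edata['weights']       # number of visits recorded on each edge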

class dgl.sampling.PinSAGESampler(G, ntype, other_type, num_traversals, termination_prob, num_random_walks, num_neighbors, weight_column='weights')[source]

PinSAGE-like neighbor sampler.

This callable works on a bidirectional bipartite graph with edge types (ntype, fwtype, other_type) and (other_type, bwtype, ntype) (where ntype, fwtype, bwtype and other_type could be arbitrary type names). It will generate a homogeneous graph of node type ntype where the neighbors of each given node are the most commonly visited nodes of the same type by multiple random walks starting from that given node. Each random walk consists of multiple metapath-based traversals, with a probability of termination after each traversal. The metapath is always [fwtype, bwtype], walking from node type ntype to node type other_type then back to ntype.

The edges of the returned homogeneous graph run from each given node's most commonly visited nodes to that given node, and carry a feature indicating the number of visits.

Parameters:
  • G (DGLGraph) –

    The bidirectional bipartite graph.

    The graph should only have two node types: ntype and other_type. The graph should only have two edge types, one connecting from ntype to other_type, and another connecting from other_type to ntype.

  • ntype (str) – The node type for which the returned graph is constructed.
  • other_type (str) – The other node type.
  • num_traversals (int) –

    The maximum number of metapath-based traversals for a single random walk.

    Usually considered a hyperparameter.

  • termination_prob (float) –

    Termination probability after each metapath-based traversal.

    Usually considered a hyperparameter.

  • num_random_walks (int) –

    Number of random walks to try for each given node.

    Usually considered a hyperparameter.

  • num_neighbors (int) – Number of neighbors (or most commonly visited nodes) to select for each given node.
  • weight_column (str, default "weights") – The name of the edge feature to be stored on the returned graph with the number of visits.
Inputs:
  • seed_nodes (Tensor) – A tensor of given node IDs of node type ntype to generate neighbors from.

Outputs:
  • g (DGLHeteroGraph) – A homogeneous graph constructed by selecting neighbors for each given node according to the PinSAGE algorithm.

Examples

Generate a random bidirectional bipartite graph with 3000 “A” nodes and 5000 “B” nodes.

>>> g = scipy.sparse.random(3000, 5000, 0.003)
>>> G = dgl.heterograph({
...     ('A', 'AB', 'B'): g,
...     ('B', 'BA', 'A'): g.T})

Then we create a PinSAGE neighbor sampler that samples a graph of node type “A”. Each node will have at most 10 neighbors.

>>> sampler = dgl.sampling.PinSAGESampler(G, 'A', 'B', 3, 0.5, 200, 10)

This is how we select the neighbors for nodes #0, #1 and #2 of type “A” according to the PinSAGE algorithm:

>>> seeds = torch.LongTensor([0, 1, 2])
>>> frontier = sampler(seeds)
>>> frontier.all_edges(form='uv')
(tensor([ 230,    0,  802,   47,   50, 1639, 1533,  406, 2110, 2687, 2408, 2823,
            0,  972, 1230, 1658, 2373, 1289, 1745, 2918, 1818, 1951, 1191, 1089,
         1282,  566, 2541, 1505, 1022,  812]),
 tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2]))

For an end-to-end example of the PinSAGE model, including sampling on multiple layers and computing with the sampled graphs, please refer to our PinSAGE example in examples/pytorch/pinsage.

References

Graph Convolutional Neural Networks for Web-Scale Recommender Systems
Ying et al., 2018, https://arxiv.org/abs/1806.01973

Neighborhood samplers for multilayer GNNs

class dgl.sampling.MultiLayerNeighborSampler(fanouts, replace=False, return_eids=False)[source]

Sampler that builds the computational dependencies of node representations via neighbor sampling for a multilayer GNN.

This sampler will make every node gather messages from a fixed number of neighbors per edge type. The neighbors are picked uniformly.

Parameters:
  • fanouts (list[int] or list[dict[etype, int] or None]) –

    The number of neighbors to sample per edge type for each GNN layer, starting from the first layer.

    If the graph is homogeneous, only an integer is needed for each layer.

    If None is provided for one layer, all neighbors will be included regardless of edge types.

    If -1 is provided for one edge type on one layer, then all inbound edges of that edge type will be included.

  • replace (bool, default False) – Whether to sample with replacement.
  • return_eids (bool, default False) –

    Whether to return edge IDs of the original graph in the sampled blocks.

    If True, the edge IDs will be stored as dgl.EID feature for each edge type.

Examples

To train a 3-layer GNN for node classification on a set of nodes train_nid on a homogeneous graph where each node takes messages from all neighbors (assume the backend is PyTorch):

>>> sampler = dgl.sampling.MultiLayerNeighborSampler([None, None, None])
>>> collator = dgl.sampling.NodeCollator(g, train_nid, sampler)
>>> dataloader = torch.utils.data.DataLoader(
...     collator.dataset, collate_fn=collator.collate,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for input_nodes, output_nodes, blocks in dataloader:
...     train_on(blocks)

If we wish to gather from 5 neighbors on the first layer, 10 neighbors on the second, and 15 neighbors on the third:

>>> sampler = dgl.sampling.MultiLayerNeighborSampler([5, 10, 15])

If training on a heterogeneous graph and you want a different number of neighbors for each edge type, one should instead provide a list of dicts. Each dict specifies the number of neighbors to pick per edge type.

>>> sampler = dgl.sampling.MultiLayerNeighborSampler([
...     {('user', 'follows', 'user'): 5,
...      ('user', 'plays', 'game'): 4,
...      ('game', 'played-by', 'user'): 3}] * 3)
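
Passing -1 for an edge type on a layer keeps every inbound edge of that type, as described above; a sketch reusing the same hypothetical edge types:

>>> sampler = dgl.sampling.MultiLayerNeighborSampler([
...     {('user', 'follows', 'user'): -1,
...      ('user', 'plays', 'game'): 4,
...      ('game', 'played-by', 'user'): 3}] * 3)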

Data loaders for minibatch iteration

class dgl.sampling.NodeCollator(g, nids, block_sampler)[source]

DGL collator that combines nodes and their computational dependencies within a minibatch for training node classification or regression on a single graph with neighborhood sampling.

Parameters:
  • g (DGLHeteroGraph) – The graph.
  • nids (Tensor or dict[ntype, Tensor]) – The node set to compute outputs.
  • block_sampler (BlockSampler) – The neighborhood sampler.

Examples

To train a 3-layer GNN for node classification on a set of nodes train_nid on a homogeneous graph where each node takes messages from all neighbors (assume the backend is PyTorch):

>>> sampler = dgl.sampling.MultiLayerNeighborSampler([None, None, None])
>>> collator = dgl.sampling.NodeCollator(g, train_nid, sampler)
>>> dataloader = torch.utils.data.DataLoader(
...     collator.dataset, collate_fn=collator.collate,
...     batch_size=1024, shuffle=True, drop_last=False, num_workers=4)
>>> for input_nodes, output_nodes, blocks in dataloader:
...     train_on(input_nodes, output_nodes, blocks)

Abstract class for neighborhood sampler

class dgl.sampling.BlockSampler(num_hops)[source]

Abstract class specifying the neighborhood sampling strategy for DGL data loaders.

The main method for BlockSampler is sample_blocks(), which generates a list of blocks for a multi-layer GNN given a set of seed nodes to have their outputs computed.

The default implementation of sample_blocks() repeats the following num_hops times:

  • Obtain a frontier with the same nodes as the original graph but only the edges involved in message passing on the last layer. Customizable via sample_frontier().
  • Optionally, post-process the obtained frontier (e.g. by removing edges connecting training node pairs). One can add such postprocessors via add_frontier_postprocessor().
  • Convert the frontier into a block.
  • Optionally, post-process the block (e.g. by assigning edge IDs). One can add such postprocessors via add_block_postprocessor().
  • Prepend the block to the block list to be returned.

All subclasses should either

  • Override sample_blocks() method, or
  • Override sample_frontier() method while specifying the number of layers to sample in the num_hops argument (see the sketch below).
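
For example, a minimal subclass that keeps every inbound edge at each hop could look like the sketch below (this assumes sample_frontier() receives the hop index, the graph, and the current seed nodes; the exact signature may differ between DGL versions):

>>> import dgl
>>> class FullNeighborSampler(dgl.sampling.BlockSampler):
...     def __init__(self, num_hops):
...         super().__init__(num_hops)
...     def sample_frontier(self, block_id, g, seed_nodes):
...         # Keep all nodes of g and only the edges that reach the current seed nodes.
...         return dgl.in_subgraph(g, seed_nodes)
>>> sampler = FullNeighborSampler(3)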
