Graph samplers

dgl.contrib.sampling.sampler.NeighborSampler(g, batch_size, expand_factor, num_hops=1, neighbor_type='in', node_prob=None, seed_nodes=None, shuffle=False, num_workers=1, prefetch=False, add_self_loop=False)

Create a sampler that samples neighborhoods.

It returns a generator of NodeFlow objects. This is analogous to mini-batch training on graph data: the given graph represents the whole dataset and the returned generator produces mini-batches (in the form of NodeFlow objects).

A NodeFlow grows from sampled nodes. The sampler first draws a set of nodes from the given seed_nodes (or from all nodes if seed_nodes is not given), then samples their neighbors and extracts the subgraph. If the number of hops is \(k (> 1)\), the process is repeated recursively, with the neighbor nodes just sampled becoming the new seed nodes. The result is a graph, called a NodeFlow, that contains \(k+1\) layers. The last layer holds the initial seed nodes. The sampled neighbors of the nodes in layer \(i+1\) are placed in layer \(i\). All edges go from nodes in layer \(i\) to nodes in layer \(i+1\).
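The layer-by-layer growth described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the function name sample_nodeflow_layers and the dict-of-lists adjacency format are assumptions made for this sketch, and it tracks node sets only (no edges, no self loops).

```python
import random


def sample_nodeflow_layers(in_neighbors, seeds, expand_factor, num_hops, rng):
    """Return num_hops + 1 node layers; the last layer holds the seeds.

    in_neighbors is a hypothetical adjacency format for this sketch:
    a dict mapping each node to the list of its in-neighbors.
    """
    layers = [sorted(set(seeds))]
    for _ in range(num_hops):
        frontier = set()
        for v in layers[0]:
            nbrs = in_neighbors.get(v, [])
            # Never sample more neighbors than the node actually has.
            k = min(expand_factor, len(nbrs))
            frontier.update(rng.sample(nbrs, k))
        # The neighbors just sampled become the seeds of the next hop and
        # form the layer directly before the current first layer.
        layers.insert(0, sorted(frontier))
    return layers


# Toy graph: node -> in-neighbors.
in_neighbors = {0: [1, 2], 1: [2, 3], 2: [3], 3: [0]}
layers = sample_nodeflow_layers(
    in_neighbors, seeds=[0], expand_factor=2, num_hops=2, rng=random.Random(0)
)
print(layers)
```

With two hops the sketch yields three layers, and the seed node sits in the last one, matching the \(k+1\) layer count above.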

TODO(minjie): give a figure here.

In the mini-batch analogy, batch_size equals the number of initial seed nodes (the number of nodes in the last layer). The number of NodeFlow objects (the number of batches) is calculated as len(seed_nodes) // batch_size; if seed_nodes is None, seed_nodes is taken to be the set of all nodes in the graph.
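As a quick check of the arithmetic, the batch count can be computed as follows. The helper count_batches is hypothetical and only mirrors the formula stated above; how a trailing partial batch is handled is not covered here.

```python
def count_batches(num_graph_nodes, batch_size, seed_nodes=None):
    """Number of batches according to len(seed_nodes) // batch_size."""
    # With no explicit seeds, every node in the graph is a seed.
    num_seeds = num_graph_nodes if seed_nodes is None else len(seed_nodes)
    return num_seeds // batch_size


# All 1000 nodes are seeds: 1000 // 64
all_nodes_batches = count_batches(1000, 64)
# Only 200 explicit seeds: 200 // 64
subset_batches = count_batches(1000, 64, seed_nodes=list(range(200)))
print(all_nodes_batches, subset_batches)
```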

Parameters:
  • g (DGLGraph) – The DGLGraph from which NodeFlows are sampled.
  • batch_size (int) – The batch size (i.e., the number of nodes in the last layer).
  • expand_factor (int, float, str) –

    The number of neighbors sampled from the neighbor list of a vertex. The value of this parameter can be:

    • int: indicates the number of neighbors sampled from a neighbor list.
    • float: indicates the ratio of the sampled neighbors in a neighbor list.
    • str: indicates some common ways of calculating the number of sampled neighbors, e.g., sqrt(deg).

    Note that no matter how large expand_factor is, the maximum number of sampled neighbors is the neighborhood size.

  • num_hops (int, optional) – The number of hops to sample. A NodeFlow sampled with \(k\) hops contains \(k+1\) layers. Default: 1
  • neighbor_type (str, optional) –

    Indicates the neighbors on different types of edges.

    • "in": the neighbors on the in-edges.
    • "out": the neighbors on the out-edges.
    • "both": the neighbors on both types of edges.

    Default: "in"

  • node_prob (Tensor, optional) – A 1D tensor for the probability that a neighbor node is sampled. None means uniform sampling. Otherwise, the number of elements should be equal to the number of vertices in the graph. Default: None
  • seed_nodes (Tensor, optional) – A 1D tensor of the nodes from which NodeFlows are sampled. If None, the seed vertices are all the vertices in the graph. Default: None
  • shuffle (bool, optional) – Whether the sampled NodeFlows are shuffled. Default: False
  • num_workers (int, optional) – The number of worker threads that sample NodeFlows in parallel. Default: 1
  • prefetch (bool, optional) – If true, prefetch the samples in the next batch. Default: False
  • add_self_loop (bool, optional) – If true, add self-loops to the sampled NodeFlow. The edge IDs of the self-loop edges are -1. Default: False
Returns:

The generator of NodeFlows.

Return type:

generator
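The three accepted forms of expand_factor, together with the cap at the neighborhood size, can be illustrated with a small helper. This is a sketch under stated assumptions, not the sampler's actual code: the name resolve_num_neighbors, the truncation toward zero for float and sqrt values, and the exact string spelling "sqrt(deg)" are all choices made for this example.

```python
import math


def resolve_num_neighbors(expand_factor, degree):
    """Turn expand_factor into a neighbor count for a node of given degree."""
    if isinstance(expand_factor, int):
        # int: an absolute number of neighbors.
        n = expand_factor
    elif isinstance(expand_factor, float):
        # float: a ratio of the neighbor list (rounding is an assumption).
        n = int(expand_factor * degree)
    elif expand_factor == "sqrt(deg)":
        # str: a named rule; only this one spelling is handled here.
        n = int(math.sqrt(degree))
    else:
        raise ValueError(f"unsupported expand_factor: {expand_factor!r}")
    # The count can never exceed the size of the neighborhood.
    return min(n, degree)


print(resolve_num_neighbors(5, 3))             # capped at the degree
print(resolve_num_neighbors(0.5, 10))          # half of the neighbor list
print(resolve_num_neighbors("sqrt(deg)", 16))  # square root of the degree
```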