dgl.sampling.sample_neighbors_biased

dgl.sampling.sample_neighbors_biased(g, nodes, fanout, bias, edge_dir='in', tag_offset_name='_TAG_OFFSET', replace=False, copy_ndata=True, copy_edata=True)[source]

Sample neighboring edges of the given nodes and return the induced subgraph, where each neighbor’s probability to be picked is determined by its tag.

For each node, a number of inbound (or outbound when edge_dir == 'out') edges will be randomly chosen. The graph returned will then contain all the nodes in the original graph, but only the sampled edges.

This version of neighbor sampling can support the scenario where adjacent nodes with different types have different sampling probability. Each node is assigned an integer (called a tag) which represents its type. Tag is an analogue of node type under the framework of homogeneous graphs. Nodes with the same tag share the same probability.

For example, assume a node has \(N+M\) neighbors, and \(N\) of them have tag 0 while \(M\) of them have tag 1. Assume a node of tag 0 has an unnormalized probability \(p\) to be picked while a node of tag 1 has \(q\). This function first chooses a tag according to the unnormalized probability distribution \(\frac{P(tag=0)}{P(tag=1)}=\frac{Np}{Mq}\), and then run a uniform sampling to get a node of the chosen tag.

In order to make sampling more efficient, the input graph must have its CSC matrix (or CSR matrix if edge_dir='out') sorted according to the tag. The API sort_csc_by_tag() and sort_csr_by_tag() are designed for this purpose, which will internally reorder the neighbors by tags so that neighbors of the same tags are stored in a consecutive range. The two APIs will also store the offsets of these ranges in a node feature with tag_offset_name as its name.

Please make sure that the CSR (or CSC) matrix of the graph has been sorted before calling this function. This function itself will not check whether the input graph is sorted. Note that the input tag_offset_name should be consistent with that in the sorting function.

Only homogeneous or bipartite graphs are supported. For bipartite graphs, the tag offsets of the source nodes when edge_dir='in' (or the destination nodes when edge_dir='out') will be used in sampling.

Node/edge features are not preserved. The original IDs of the sampled edges are stored as the dgl.EID feature in the returned graph.

Parameters
  • g (DGLGraph) – The graph. Must be homogeneous or bipartite (only one edge type). Must be on CPU.

  • nodes (tensor or list) – Node IDs to sample neighbors from.

  • fanout (int) –

    The number of edges to be sampled for each node on each edge type.

    If -1 is given, all the neighboring edges will be selected.

  • bias (tensor or list) –

    The (unnormalized) probabilities associated with each tag. Its length should be equal to the number of tags.

    Entries of this array must be non-negative floats, and the sum of the entries must be positive (though they don’t have to sum up to one). Otherwise, the result will be undefined.

  • edge_dir (str, optional) –

    Determines whether to sample inbound or outbound edges.

    Can take either in for inbound edges or out for outbound edges.

  • tag_offset_name (str, optional) –

    The name of the node feature storing tag offsets.

    (Default: “_TAG_OFFSET”)

  • replace (bool, optional) – If True, sample with replacement.

  • copy_ndata (bool, optional) –

    If True, the node features of the new graph are copied from the original graph. If False, the new graph will not have any node features.

    (Default: True)

  • copy_edata (bool, optional) –

    If True, the edge features of the new graph are copied from the original graph. If False, the new graph will not have any edge features.

    (Default: True)

Returns

A sampled subgraph containing only the sampled neighboring edges. It is on CPU.

Return type

DGLGraph

Notes

If copy_ndata or copy_edata is True, same tensors are used as the node or edge features of the original graph and the new graph. As a result, users should avoid performing in-place operations on the node features of the new graph to avoid feature corruption.

Examples

Assume that you have the following graph

>>> g = dgl.graph(([0, 0, 1, 1, 2, 2], [1, 2, 0, 1, 2, 0]))

And the tags

>>> tag = torch.IntTensor([0, 0, 1])

Sort the graph (necessary!)

>>> g_sorted = dgl.transform.sort_csr_by_tag(g, tag)
>>> g_sorted.ndata['_TAG_OFFSET']
tensor([[0, 1, 2],
        [0, 2, 2],
        [0, 1, 2]])

Set the probability of each tag:

>>> bias = torch.tensor([1.0, 0.001])
>>> # node 2 is almost impossible to be sampled because it has tag 1.

To sample one out bound edge for node 0 and node 2:

>>> sg = dgl.sampling.sample_neighbors_biased(g_sorted, [0, 2], 1, bias, edge_dir='out')
>>> sg.edges(order='eid')
(tensor([0, 2]), tensor([1, 0]))
>>> sg.edata[dgl.EID]
tensor([0, 5])

With fanout greater than the number of actual neighbors and without replacement, DGL will take all neighbors instead:

>>> sg = dgl.sampling.sample_neighbors_biased(g_sorted, [0, 2], 3, bias, edge_dir='out')
>>> sg.edges(order='eid')
(tensor([0, 0, 2, 2]), tensor([1, 2, 0, 2]))