dgl.sampling.sample_neighbors_biased¶
-
dgl.sampling.
sample_neighbors_biased
(g, nodes, fanout, bias, edge_dir='in', tag_offset_name='_TAG_OFFSET', replace=False, copy_ndata=True, copy_edata=True, output_device=None)[source]¶ Sample neighboring edges of the given nodes and return the induced subgraph, where each neighbor’s probability to be picked is determined by its tag.
For each node, a number of inbound (or outbound when
edge_dir == 'out'
) edges will be randomly chosen. The graph returned will then contain all the nodes in the original graph, but only the sampled edges.This version of neighbor sampling can support the scenario where adjacent nodes with different types have different sampling probability. Each node is assigned an integer (called a tag) which represents its type. Tag is an analogue of node type under the framework of homogeneous graphs. Nodes with the same tag share the same probability.
For example, assume a node has \(N+M\) neighbors, and \(N\) of them have tag 0 while \(M\) of them have tag 1. Assume a node of tag 0 has an unnormalized probability \(p\) to be picked while a node of tag 1 has \(q\). This function first chooses a tag according to the unnormalized probability distribution \(\frac{P(tag=0)}{P(tag=1)}=\frac{Np}{Mq}\), and then run a uniform sampling to get a node of the chosen tag.
In order to make sampling more efficient, the input graph must have its CSC matrix (or CSR matrix if
edge_dir='out'
) sorted according to the tag. The APIsort_csc_by_tag()
andsort_csr_by_tag()
are designed for this purpose, which will internally reorder the neighbors by tags so that neighbors of the same tags are stored in a consecutive range. The two APIs will also store the offsets of these ranges in a node feature withtag_offset_name
as its name.Please make sure that the CSR (or CSC) matrix of the graph has been sorted before calling this function. This function itself will not check whether the input graph is sorted. Note that the input
tag_offset_name
should be consistent with that in the sorting function.Only homogeneous or bipartite graphs are supported. For bipartite graphs, the tag offsets of the source nodes when
edge_dir='in'
(or the destination nodes whenedge_dir='out'
) will be used in sampling.Node/edge features are not preserved. The original IDs of the sampled edges are stored as the
dgl.EID
feature in the returned graph.- Parameters
g (DGLGraph) – The graph. Must be homogeneous or bipartite (only one edge type). Must be on CPU.
nodes (tensor or list) – Node IDs to sample neighbors from.
fanout (int) –
The number of edges to be sampled for each node on each edge type.
If -1 is given, all the neighboring edges with non-zero probability will be selected.
bias (tensor or list) –
The (unnormalized) probabilities associated with each tag. Its length should be equal to the number of tags.
Entries of this array must be non-negative floats. Otherwise, the result will be undefined.
edge_dir (str, optional) –
Determines whether to sample inbound or outbound edges.
Can take either
in
for inbound edges orout
for outbound edges.tag_offset_name (str, optional) –
The name of the node feature storing tag offsets.
(Default: “_TAG_OFFSET”)
replace (bool, optional) – If True, sample with replacement.
copy_ndata (bool, optional) –
If True, the node features of the new graph are copied from the original graph. If False, the new graph will not have any node features.
(Default: True)
copy_edata (bool, optional) –
If True, the edge features of the new graph are copied from the original graph. If False, the new graph will not have any edge features.
(Default: True)
output_device (Framework-specific device context object, optional) – The output device. Default is the same as the input graph.
- Returns
A sampled subgraph containing only the sampled neighboring edges. It is on CPU.
- Return type
Notes
If
copy_ndata
orcopy_edata
is True, same tensors are used as the node or edge features of the original graph and the new graph. As a result, users should avoid performing in-place operations on the node features of the new graph to avoid feature corruption.See also
Examples
Assume that you have the following graph
>>> g = dgl.graph(([0, 0, 1, 1, 2, 2], [1, 2, 0, 1, 2, 0]))
And the tags
>>> tag = torch.IntTensor([0, 0, 1])
Sort the graph (necessary!)
>>> g_sorted = dgl.transforms.sort_csr_by_tag(g, tag) >>> g_sorted.ndata['_TAG_OFFSET'] tensor([[0, 1, 2], [0, 2, 2], [0, 1, 2]])
Set the probability of each tag:
>>> bias = torch.tensor([1.0, 0.001]) >>> # node 2 is almost impossible to be sampled because it has tag 1.
To sample one out bound edge for node 0 and node 2:
>>> sg = dgl.sampling.sample_neighbors_biased(g_sorted, [0, 2], 1, bias, edge_dir='out') >>> sg.edges(order='eid') (tensor([0, 2]), tensor([1, 0])) >>> sg.edata[dgl.EID] tensor([0, 5])
With
fanout
greater than the number of actual neighbors and without replacement, DGL will take all neighbors instead:>>> sg = dgl.sampling.sample_neighbors_biased(g_sorted, [0, 2], 3, bias, edge_dir='out') >>> sg.edges(order='eid') (tensor([0, 0, 2, 2]), tensor([1, 2, 0, 2]))