dgl.contrib.sampling (Deprecating)

Warning

This module is going to be deprecated in favor of dgl.sampling.

Module for sampling algorithms on graphs. Each algorithm is implemented as a data loader that produces sampled subgraphs (called NodeFlow) at each iteration.

dgl.contrib.sampling.sampler.NeighborSampler(g, batch_size, expand_factor=None, num_hops=1, neighbor_type='in', transition_prob=None, seed_nodes=None, shuffle=False, num_workers=1, prefetch=False, add_self_loop=False)[source]

Create a sampler that samples neighborhoods.

It returns a generator of NodeFlow. This can be viewed as an analogy of mini-batch training on graph data – the given graph represents the whole dataset and the returned generator produces mini-batches (in the form of NodeFlow objects).

A NodeFlow grows from sampled nodes. It first samples a set of nodes from the given seed_nodes (or from all the nodes if not given), then samples their neighbors and extracts the subgraph. If the number of hops is \(k (>1)\), the process is repeated recursively, with the neighbor nodes just sampled becoming the new seed nodes. The result is a graph we call NodeFlow, which contains \(k+1\) layers. The last layer holds the initial seed nodes. The neighbors sampled for the nodes in layer \(i+1\) are placed in layer \(i\), and all the edges point from nodes in layer \(i\) to nodes in layer \(i+1\).

Figure: the layered structure of a NodeFlow (https://data.dgl.ai/tutorial/sampling/NodeFlow.png)

As an analogy to mini-batch training, the batch_size here equals the number of initial seed nodes (the number of nodes in the last layer). The number of NodeFlow objects (the number of batches) is len(seed_nodes) // batch_size (if seed_nodes is None, it defaults to all the nodes in the graph).

Note: NeighborSampler currently only supports immutable graphs.

Parameters
  • g (DGLGraphStale) – The DGLGraphStale where we sample NodeFlows.

  • batch_size (int) – The batch size (i.e., the number of seed nodes in the last layer)

  • expand_factor (int) –

    The number of neighbors sampled from the neighbor list of a vertex.

    Note that no matter how large expand_factor is, the number of sampled neighbors cannot exceed the neighborhood size.

  • num_hops (int, optional) – The number of hops to sample, so the NodeFlow contains num_hops+1 layers. Default: 1

  • neighbor_type (str, optional) –

    Indicates which neighbors to sample, by edge direction.

    • ”in”: the neighbors on the in-edges.

    • ”out”: the neighbors on the out-edges.

    Default: “in”

  • transition_prob (str, optional) –

    A 1D tensor containing the (unnormalized) transition probability.

    The probability that a neighbor u is sampled for a node v is proportional to the edge weight, normalized by the sum of edge weights grouped by the destination node.

    In other words, given a node v, the probability of node u and edge (u, v) included in the NodeFlow layer preceding that of v is given by:

    \[p(u, v) = \frac{w_{u, v}}{\sum_{u', (u', v) \in E} w_{u', v}}\]

    If the neighbor type is “out”, the probability is instead normalized by the sum grouped by the source node:

    \[p(v, u) = \frac{w_{v, u}}{\sum_{u', (v, u') \in E} w_{v, u'}}\]

    If a str is given, the edge weight will be loaded from the edge feature column with the same name. The feature column must be a scalar column in this case.

    Default: None

  • seed_nodes (Tensor, optional) – A 1D tensor of node IDs from which we sample NodeFlows. If None, the seed vertices are all the vertices in the graph. Default: None

  • shuffle (bool, optional) – If True, the seed nodes are shuffled so the sampled NodeFlows come in random order. Default: False

  • num_workers (int, optional) – The number of worker threads that sample NodeFlows in parallel. Default: 1

  • prefetch (bool, optional) – If true, prefetch the samples in the next batch. Default: False

  • add_self_loop (bool, optional) – If true, add self-loops to the sampled NodeFlow. The edge IDs of the self-loop edges are -1. Default: False
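Examples

A minimal usage sketch, assuming g is an immutable DGLGraphStale with a scalar edge feature 'w' (both are illustrative, not part of the API):

>>> # Sample 2-hop NodeFlows with up to 4 in-neighbors per node, weighted
>>> # by the (assumed) scalar edge feature 'w'.
>>> for nf in NeighborSampler(g, batch_size=32, expand_factor=4,
>>>                           num_hops=2, transition_prob='w', shuffle=True):
>>>     nf.copy_from_parent()            # pull node/edge features from g
>>>     seeds = nf.layer_parent_nid(-1)  # parent IDs of the seed (last) layer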

dgl.contrib.sampling.sampler.LayerSampler(g, batch_size, layer_sizes, neighbor_type='in', node_prob=None, seed_nodes=None, shuffle=False, num_workers=1, prefetch=False)[source]

Create a sampler that performs layer-wise neighborhood sampling.

This creates a NodeFlow loader that samples subgraphs from the input graph with layer-wise sampling. This sampling method is implemented in C and can perform sampling very efficiently.

The NodeFlow loader returns a list of NodeFlows. The size of the NodeFlow list is the number of workers.

Note: LayerSampler currently only supports immutable graphs.

Parameters
  • g (DGLGraphStale) – The DGLGraphStale where we sample NodeFlows.

  • batch_size (int) – The batch size (i.e., the number of seed nodes in the last layer)

  • layer_sizes (list of int) – The number of nodes to sample in each layer.

  • neighbor_type (str, optional) –

    Indicates which neighbors to sample, by edge direction.

    • ”in”: the neighbors on the in-edges.

    • ”out”: the neighbors on the out-edges.

    Default: “in”

  • node_prob (Tensor, optional) – A 1D tensor giving the probability that each neighbor node is sampled. None means uniform sampling; otherwise, the number of elements must equal the number of vertices in the graph. Non-uniform sampling is not yet implemented. Default: None

  • seed_nodes (Tensor, optional) – A 1D tensor of node IDs from which we sample NodeFlows. If None, the seed vertices are all the vertices in the graph. Default: None

  • shuffle (bool, optional) – If True, the seed nodes are shuffled so the sampled NodeFlows come in random order. Default: False

  • num_workers (int, optional) – The number of worker threads that sample NodeFlows in parallel. Default: 1

  • prefetch (bool, optional) – If true, prefetch the samples in the next batch. Default: False
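Examples

A minimal usage sketch, assuming g is an immutable DGLGraphStale (the layer sizes are illustrative):

>>> # Per batch of 32 seed nodes, draw two sampled layers of 64 and 128 nodes.
>>> for nf in LayerSampler(g, batch_size=32, layer_sizes=[64, 128]):
>>>     nf.copy_from_parent()  # pull features from the parent graph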

dgl.contrib.sampling.sampler.EdgeSampler(g, batch_size, seed_edges=None, edge_weight=None, node_weight=None, shuffle=False, num_workers=1, prefetch=False, replacement=False, reset=False, negative_mode='', neg_sample_size=0, exclude_positive=False, return_false_neg=False, relations=None, chunk_size=None)[source]

Edge sampler for link prediction.

This samples edges from a given graph. The edges sampled for a batch are placed in a subgraph before being returned.

In many link prediction tasks, negative edges are required to train a model. A negative edge is constructed by corrupting an existing edge in the graph. The current implementation supports two ways of corrupting an edge: corrupting the head node of an edge or corrupting the tail node. When we corrupt the head node of an edge, we randomly sample a node from the entire graph as the new head node. It is possible that the constructed edge already exists in the graph. By default, the implementation does not explicitly check whether a sampled negative edge exists in the graph; however, a user can exclude positive edges from the negative edges by specifying exclude_positive=True. When negative edges are created, each batch of negative edges is also placed in a subgraph.

Currently, negative_mode only supports:

  • ‘head’: the negative edges are generated by corrupting head nodes with uniformly randomly sampled nodes,

  • ‘tail’: the negative edges are generated by corrupting tail nodes with uniformly randomly sampled nodes,

  • ‘chunk-head’: the negative edges are generated for a chunk of positive edges. It first groups positive edges into chunks and corrupts a chunk of edges together by replacing a set of head nodes with the same set of nodes uniformly randomly sampled from the graph.

  • ‘chunk-tail’: the negative edges are generated by corrupting a set of tail nodes with the same set of nodes similar to ‘chunk-head’.

When we use chunked negative sampling, a chunk size needs to be specified. By default, the chunk size is the same as the number of negative edges.

The sampler returns EdgeSubgraph, where a user can access the unique head nodes and tail nodes directly.

This sampler supports non-uniform sampling of both positive and negative edges. To sample positive edges non-uniformly, users provide an array edge_weight of m elements (m is the number of edges), each of which gives the sampling probability of an edge. To sample negative edges non-uniformly, users provide an array node_weight of n elements (n is the number of nodes), and the sampler draws nodes according to these probabilities when corrupting a positive edge. If neither edge_weight nor node_weight is provided, a uniform sampler is used. If only edge_weight is provided, the sampler samples nodes uniformly when corrupting positive edges.

When the flag return_false_neg is turned on, the sampler will also check if the generated negative edges are true negative edges and will return a vector that indicates false negative edges. The vector is stored in the negative graph as false_neg edge data.

When checking false negative edges, a user can provide edge relations for a knowledge graph. A negative edge is considered a false negative edge only if the triple (source node, destination node, and relation) matches one of the edges in the graph.

This sampler samples positive edges without replacement by default, which means it returns a fixed number of batches (i.e., num_edges/batch_size), and the positive edges sampled will not be duplicated. However, one can explicitly request sampling with replacement (replacement=True), in which case the sampler treats each sampling of a single positive edge as an independent event.

To control how many samples the sampler returns, a reset parameter can be used. If it is set to true, the sampler will generate samples indefinitely; for the sampler with replacement, it will reshuffle the seed edges and reset the replacement state each time it consumes all the edges. If it is set to false, the sampler will only generate num_edges/batch_size samples.

Note: If node_weight is extremely imbalanced, the sampler will take much longer to return a minibatch, since the negative nodes sampled for one corrupted positive edge must not be duplicated.

Parameters
  • g (DGLGraphStale) – The DGLGraphStale where we sample edges.

  • batch_size (int) – The batch size (i.e., the number of edges sampled from the graph per batch)

  • seed_edges (tensor, optional) – A 1D tensor of edge IDs from which we sample.

  • edge_weight (tensor, optional) – The weight of each edge, which decides the chance of that edge being sampled.

  • node_weight (tensor, optional) – The weight of each node, which decides the chance of that node being sampled. Used in negative sampling. If not provided, uniform node sampling is used.

  • shuffle (bool, optional) – Whether to randomly shuffle the list of edges from which we sample.

  • num_workers (int, optional) – The number of worker threads that sample edges in parallel.

  • prefetch (bool, optional) – If true, prefetch the samples in the next batch. Default: False

  • replacement (bool, optional) – Whether the sampler samples edges with or without replacement. Default: False

  • reset (bool, optional) – If true, the sampler will generate samples indefinitely; for the sampler with replacement, it will reshuffle the edges and reset the replacement state each time it consumes all the edges. If false, the sampler will only generate num_edges/batch_size samples. Default: False

  • negative_mode (string, optional) – The method used to construct negative edges. Possible values are ‘head’, ‘tail’, ‘chunk-head’, and ‘chunk-tail’.

  • neg_sample_size (int, optional) – The number of negative edges to sample for each edge.

  • chunk_size (int, optional) – The chunk size for chunked negative sampling.

  • exclude_positive (bool, optional) – Whether to exclude positive edges from the negative edges.

  • return_false_neg (bool, optional) – Whether to calculate false negative edges and return them as edge data in negative graphs.

  • relations (tensor, optional) – Relations of the edges if this is a knowledge graph.

Examples

>>> sampler = EdgeSampler(g, batch_size=10, negative_mode='tail',
>>>                       neg_sample_size=10, return_false_neg=True)
>>> for pos_g, neg_g in sampler:
>>>     print(pos_g.head_nid, pos_g.tail_nid)
>>>     print(neg_g.head_nid, neg_g.tail_nid)
>>>     print(neg_g.edata['false_neg'])
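
For the chunk-based modes described above, a hedged sketch of chunked head corruption (the parameter values are illustrative):

>>> # Each chunk of 5 positive edges shares the same 10 uniformly sampled
>>> # replacement head nodes.
>>> sampler = EdgeSampler(g, batch_size=10, negative_mode='chunk-head',
>>>                       neg_sample_size=10, chunk_size=5)
>>> for pos_g, neg_g in sampler:
>>>     print(pos_g.head_nid, neg_g.head_nid)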
immutable_only : bool

Whether the sampler only works on immutable graphs. Subclasses can override this property.

Distributed sampler

class dgl.contrib.sampling.dis_sampler.SamplerPool[source]

SamplerPool is an abstract class in which the worker() method should be implemented by users. SamplerPool will fork() N (N = num_worker) child processes, and each process will run the worker() method independently. Note that fork() shares memory across the N processes and the OS performs copy-on-write only when a process writes to that memory, so forking N processes that all use the graph does not create N copies or increase the memory overhead.

For example, users can use this class like this:

class MySamplerPool(SamplerPool):

    def worker(self, args):
        # Do anything here #
        pass

if __name__ == '__main__':
    ...  # build an argument parser (elided)
    args = parser.parse_args()
    pool = MySamplerPool()
    pool.start(args.num_sender, args)

SamplerPool.start(num_worker, args)

Start the sampler pool

SamplerPool.worker(args)

User-defined worker function

class dgl.contrib.sampling.dis_sampler.SamplerSender(namebook, net_type='socket')[source]

SamplerSender for DGL distributed training.

Users use SamplerSender to send sampled subgraphs (NodeFlow) to a remote SamplerReceiver. Note that a SamplerSender can currently connect to multiple SamplerReceivers. The underlying implementation sends different subgraphs to different SamplerReceivers in parallel via multi-threading.

Parameters
  • namebook (dict) –

    IP address namebook of SamplerReceiver, where the key is the receiver's ID (starting from 0) and the value is the receiver's address, e.g.,

    { 0: '168.12.23.45:50051',
      1: '168.12.23.21:50051',
      2: '168.12.46.12:50051' }

  • net_type (str) – networking type, e.g., ‘socket’ (default) or ‘mpi’.

SamplerSender.send(nodeflow, recv_id)

Send sampled subgraph (NodeFlow) to remote trainer.

SamplerSender.signal(recv_id)

When the sampling of each epoch is finished, users can invoke this API to tell the SamplerReceiver that the sampler has finished its job.

class dgl.contrib.sampling.dis_sampler.SamplerReceiver(graph, addr, num_sender, net_type='socket')[source]

SamplerReceiver for DGL distributed training.

Users use SamplerReceiver to receive sampled subgraphs (NodeFlow) from remote SamplerSenders. Note that a SamplerReceiver can receive messages from multiple SamplerSenders concurrently, as specified by the num_sender parameter. The SamplerReceiver starts its job only after all SamplerSenders have connected to it successfully.

Parameters
  • graph (DGLGraph) – The parent graph

  • addr (str) – address of SamplerReceiver, e.g., ‘127.0.0.1:50051’

  • num_sender (int) – total number of SamplerSender

  • net_type (str) – networking type, e.g., ‘socket’ (default) or ‘mpi’.

SamplerReceiver.__iter__()

Sampler iterator

SamplerReceiver.__next__()

Return sampled NodeFlow object
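
A hedged end-to-end sketch of the distributed pipeline (the graph g and the addresses are placeholders):

>>> # Sender process: sample NodeFlows locally and ship them to receiver 0.
>>> sender = SamplerSender(namebook={0: '127.0.0.1:50051'})
>>> for nf in NeighborSampler(g, batch_size=32, expand_factor=4):
>>>     sender.send(nf, 0)
>>> sender.signal(0)  # tell receiver 0 that this epoch is finished
>>>
>>> # Receiver (trainer) process, typically on another machine:
>>> receiver = SamplerReceiver(graph=g, addr='127.0.0.1:50051', num_sender=1)
>>> for nf in receiver:
>>>     nf.copy_from_parent()  # train on the received NodeFlow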