Graph samplers¶

dgl.contrib.sampling.sampler.NeighborSampler(g, batch_size, expand_factor=None, num_hops=1, neighbor_type='in', node_prob=None, seed_nodes=None, shuffle=False, num_workers=1, prefetch=False, add_self_loop=False)[source]¶

Create a sampler that samples neighborhoods.
It returns a generator of NodeFlow objects. This can be viewed as an analogy of minibatch training on graph data: the given graph represents the whole dataset, and the returned generator produces minibatches (in the form of NodeFlow objects).

A NodeFlow grows from sampled nodes. The sampler first takes a batch of nodes from the given seed_nodes (or all the nodes in the graph if seed_nodes is not given), then samples their neighbors and extracts the subgraph. If the number of hops is \(k (> 1)\), the process is repeated recursively, with the neighbor nodes just sampled becoming the new seed nodes. The result is a graph we call a NodeFlow, which contains \(k+1\) layers. The last layer holds the initial seed nodes; the neighbors sampled for nodes in layer \(i+1\) are placed in layer \(i\); and all edges point from nodes in layer \(i\) to nodes in layer \(i+1\).

As an analogy to minibatch training, the batch_size here equals the number of initial seed nodes (the number of nodes in the last layer). The number of NodeFlow objects (the number of batches) is len(seed_nodes) // batch_size (if seed_nodes is None, it is taken to be the set of all nodes in the graph).

Note: NeighborSampler currently only supports immutable graphs.
Parameters:  g (DGLGraph) – The DGLGraph from which to sample NodeFlows.
 batch_size (int) – The batch size (i.e., the number of nodes in the last layer).
 expand_factor (int, float, str) – The number of neighbors sampled from the neighbor list of a vertex. The value of this parameter can be:
  int: the number of neighbors sampled from a neighbor list.
  float: the ratio of neighbors sampled from a neighbor list.
  str: a common way of computing the number of sampled neighbors, e.g., sqrt(deg).
 Note that no matter how large expand_factor is, the number of sampled neighbors never exceeds the neighborhood size.
 num_hops (int, optional) – The number of hops to sample (i.e., the number of layers in the NodeFlow). Default: 1
 neighbor_type (str, optional) – The type of edges along which neighbors are sampled:
  "in": neighbors on the in-edges.
  "out": neighbors on the out-edges.
 Default: "in"
 node_prob (Tensor, optional) – A 1D tensor of the probabilities that neighbor nodes are sampled. None means uniform sampling; otherwise, the number of elements must equal the number of vertices in the graph. Default: None
 seed_nodes (Tensor, optional) – A 1D tensor of the nodes from which we sample NodeFlows. If None, the seed vertices are all the vertices in the graph. Default: None
 shuffle (bool, optional) – Whether the sampled NodeFlows are shuffled. Default: False
 num_workers (int, optional) – The number of worker threads that sample NodeFlows in parallel. Default: 1
 prefetch (bool, optional) – If true, prefetch the samples of the next batch. Default: False
 add_self_loop (bool, optional) – If true, add self loops to the sampled NodeFlow. The edge IDs of the self loop edges are 1. Default: False
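The layered sampling procedure described above can be illustrated with a small, self-contained Python sketch. This is not DGL's implementation — the adjacency-dict graph and the sample_nodeflow helper are invented for illustration — but it shows how the layers grow back-to-front from the seeds, with expand_factor capped at the neighborhood size:

```python
import random

def sample_nodeflow(in_neighbors, seed_nodes, expand_factor, num_hops):
    """Sample a NodeFlow-like structure: layers[-1] holds the seeds,
    and layers[i] holds the neighbors sampled for layers[i+1]."""
    layers = [list(seed_nodes)]              # built from the last layer backwards
    frontier = list(seed_nodes)
    for _ in range(num_hops):
        sampled = set()
        for v in frontier:
            nbrs = in_neighbors[v]
            k = min(expand_factor, len(nbrs))  # never exceeds neighborhood size
            sampled.update(random.sample(nbrs, k))
        frontier = list(sampled)
        layers.insert(0, frontier)           # earlier hops become earlier layers
    return layers                            # num_hops + 1 layers in total

# Toy graph: node -> list of in-neighbors
graph = {0: [1, 2], 1: [2, 3], 2: [0], 3: [0, 1]}
layers = sample_nodeflow(graph, seed_nodes=[0, 3], expand_factor=1, num_hops=2)
```

With num_hops=2 this produces three layers, mirroring the \(k+1\)-layer structure of a NodeFlow.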

dgl.contrib.sampling.sampler.LayerSampler(g, batch_size, layer_sizes, neighbor_type='in', node_prob=None, seed_nodes=None, shuffle=False, num_workers=1, prefetch=False)[source]¶

Create a sampler that samples neighborhoods.
This creates a NodeFlow loader that samples subgraphs from the input graph with layer-wise sampling. The sampling method is implemented in C and performs sampling very efficiently.

The NodeFlow loader returns a list of NodeFlows; the size of the list is the number of workers.

Note: LayerSampler currently only supports immutable graphs.
Parameters:  g (DGLGraph) – The DGLGraph from which to sample NodeFlows.
 batch_size (int) – The batch size (i.e., the number of nodes in the last layer).
 layer_sizes (list of int) – A list of layer sizes (the number of nodes sampled in each layer).
 neighbor_type (str, optional) – The type of edges along which neighbors are sampled:
  "in": neighbors on the in-edges.
  "out": neighbors on the out-edges.
 Default: "in"
 node_prob (Tensor, optional) – A 1D tensor of the probabilities that neighbor nodes are sampled. None means uniform sampling; otherwise, the number of elements must equal the number of vertices in the graph. This parameter is not yet implemented. Default: None
 seed_nodes (Tensor, optional) – A 1D tensor of the nodes from which we sample NodeFlows. If None, the seed vertices are all the vertices in the graph. Default: None
 shuffle (bool, optional) – Whether the sampled NodeFlows are shuffled. Default: False
 num_workers (int, optional) – The number of worker threads that sample NodeFlows in parallel. Default: 1
 prefetch (bool, optional) – If true, prefetch the samples of the next batch. Default: False
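Layer-wise sampling differs from per-node neighbor sampling in that each layer's size is fixed up front: instead of expanding every frontier node independently, a fixed number of nodes is drawn from the union of the frontier's neighbors. A minimal Python sketch of that idea (not DGL's C implementation; the function and graph here are invented for illustration):

```python
import random

def layerwise_sample(in_neighbors, seed_nodes, layer_sizes):
    """Build layers back-to-front: the seeds form the last layer, and each
    earlier layer is a fixed-size sample from the union of the frontier's
    neighbors."""
    layers = [list(seed_nodes)]
    frontier = list(seed_nodes)
    for size in layer_sizes:
        candidates = set()
        for v in frontier:
            candidates.update(in_neighbors[v])
        k = min(size, len(candidates))   # a layer cannot exceed its candidate pool
        frontier = random.sample(sorted(candidates), k)
        layers.insert(0, frontier)
    return layers

graph = {0: [1, 2], 1: [2, 3], 2: [0], 3: [0, 1]}
layers = layerwise_sample(graph, seed_nodes=[0], layer_sizes=[2, 2])
```

Because the per-layer budget is shared across the whole frontier, the total number of sampled nodes grows with the number of layers rather than exponentially with the number of hops.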
Distributed sampler¶

class dgl.contrib.sampling.dis_sampler.SamplerPool[source]¶

SamplerPool is an abstract class whose worker method should be implemented by users. SamplerPool will fork() N (N = num_worker) child processes, and each process will perform the worker() method independently. Note that the fork() API uses shared memory for the N processes, and the OS performs copy-on-write only when a process writes to that piece of memory. So forking N processes that each load the graph will not increase the memory overhead.
Users can use this class like this:

class MySamplerPool(SamplerPool):

    def worker(self):
        # Do anything here #

if __name__ == '__main__':
    ...
    args = parser.parse_args()
    pool = MySamplerPool()
    pool.start(args.num_sender, args)
SamplerPool.start(num_worker, args)
Start the sampler pool.
SamplerPool.worker(args)
User-defined function.
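The fork-and-work pattern can be sketched with Python's standard multiprocessing module. This is a simplified stand-in for SamplerPool, not DGL's implementation — ToyPool, MyPool, and the queue-based reporting are all invented for illustration:

```python
import multiprocessing as mp

class ToyPool:
    """Minimal SamplerPool-like base: fork num_worker processes,
    each running the user-implemented worker()."""
    def worker(self, queue, worker_id):
        raise NotImplementedError  # users override this, as with SamplerPool

    def start(self, num_worker, queue):
        procs = [mp.Process(target=self.worker, args=(queue, i))
                 for i in range(num_worker)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

class MyPool(ToyPool):
    def worker(self, queue, worker_id):
        # Each child process does its work and reports back to the parent.
        queue.put(worker_id)

queue = mp.Queue()
MyPool().start(4, queue)
results = sorted(queue.get() for _ in range(4))
```

On Linux, mp.Process forks, so a large read-only structure (such as a loaded graph) is shared copy-on-write across the children, which is the memory-saving behavior the class description above relies on.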

class dgl.contrib.sampling.dis_sampler.SamplerSender(namebook)[source]¶

SamplerSender for DGL distributed training.

Users use SamplerSender to send sampled subgraphs (NodeFlow) to remote SamplerReceivers. Note that a SamplerSender can connect to multiple SamplerReceivers.

Parameters: namebook (dict) – The address namebook of the SamplerReceivers, where each key is a receiver's ID and each value is a receiver's address, e.g.,

 { 0:'168.12.23.45:50051',
   1:'168.12.23.21:50051',
   2:'168.12.46.12:50051' }
SamplerSender.send(nodeflow, recv_id)
Send a sampled subgraph (NodeFlow) to a remote trainer.
SamplerSender.signal(recv_id)
When the sampling of an epoch is finished, users can invoke this API to tell the SamplerReceiver that the sender has finished its job.

class dgl.contrib.sampling.dis_sampler.SamplerReceiver(graph, addr, num_sender)[source]¶

SamplerReceiver for DGL distributed training.

Users use SamplerReceiver to receive sampled subgraphs (NodeFlow) from remote SamplerSenders. A SamplerReceiver can receive messages from multiple SamplerSenders concurrently, as specified by the num_sender parameter. Note that the receiver starts its job only after all SamplerSenders have connected to it.
Parameters:  graph (DGLGraph) – The DGLGraph on the receiver side.
 addr (str) – The address of this SamplerReceiver, in the same ip:port form as the namebook entries above.
 num_sender (int) – The total number of SamplerSenders that will connect.
SamplerReceiver.__iter__()
Return the iterator over received NodeFlows.
SamplerReceiver.__next__()
Return the next sampled NodeFlow object.
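The sender/receiver protocol — send NodeFlows, then signal end-of-epoch, while the receiver iterates until every sender has signalled — can be mimicked in-process with a queue. This is only a sketch of the control flow, not DGL's network implementation; ToySender, ToyReceiver, and the string payloads are all invented stand-ins:

```python
import queue

END_SIGNAL = object()  # stands in for SamplerSender.signal()

class ToySender:
    def __init__(self, channel):
        self.channel = channel
    def send(self, nodeflow):
        self.channel.put(nodeflow)
    def signal(self):
        self.channel.put(END_SIGNAL)

class ToyReceiver:
    """Yields received items until all senders have signalled end-of-epoch."""
    def __init__(self, channel, num_sender):
        self.channel = channel
        self.num_sender = num_sender
    def __iter__(self):
        finished = 0
        while finished < self.num_sender:
            item = self.channel.get()
            if item is END_SIGNAL:
                finished += 1
            else:
                yield item

channel = queue.Queue()
senders = [ToySender(channel) for _ in range(2)]
for i, s in enumerate(senders):
    s.send(f"nodeflow-{i}")   # placeholder payloads instead of real NodeFlows
    s.signal()

received = list(ToyReceiver(channel, num_sender=2))
```

The receiver's iteration ends only after counting one signal per sender, which matches the documented requirement that all SamplerSenders participate in each epoch.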