CPU Best Practices๏ƒ

This chapter focus on providing best practises for environment setup to get the best performance during training and inference on the CPU.

Intel๏ƒ

Hyper-threading๏ƒ

For specific workloads as GNNโ€™s domain, suggested default setting for having best performance is to turn off hyperthreading. Turning off the hyper threading feature can be done at BIOS [1] or operating system level [2] [3] .

Alternative memory allocators๏ƒ

Alternative memory allocators, such as tcmalloc, might provide significant performance improvements by more efficient memory usage, reducing overhead on unnecessary memory allocations or deallocations. tcmalloc uses thread-local caches to reduce overhead on thread synchronization, locks contention by using spinlocks and per-thread arenas respectively and categorizes memory allocations by sizes to reduce overhead on memory fragmentation.

To take advantage of optimizations tcmalloc provides, install it on your system (on Ubuntu tcmalloc is included in libgoogle-perftools4 package) and add shared library to the LD_PRELOAD environment variable:

export LD_PRELOAD=/lib/x86_64-linux-gnu/libtcmalloc.so.4:$LD_PRELOAD

OpenMP settings๏ƒ

As OpenMP is the default parallel backend, we could control performance including sampling and training via dgl.utils.set_num_threads().

If number of OpenMP threads is not set and num_workers in dataloader is set to 0, the OpenMP runtime typically use the number of available CPU cores by default. This works well for most cases, and is also the default behavior in DGL.

If num_workers in dataloader is set to greater than 0, the number of OpenMP threads will be set to 1 for each worker process. This is the default behavior in PyTorch. In this case, we can set the number of OpenMP threads to the number of CPU cores in the main process.

Performance tuning is highly dependent on the workload and hardware configuration. We recommend users to try different settings and choose the best one for their own cases.

Dataloader CPU affinity

Note

This feature is available for dgl.dataloading.DataLoader only. Not available for dataloaders in dgl.graphbolt yet.

If number of dataloader workers is more than 0, please consider using use_cpu_affinity() method of DGL Dataloader class, it will generally result in significant performance improvement for training.

use_cpu_affinity will set the proper OpenMP thread count (equal to the number of CPU cores allocated for main process), affinitize dataloader workers for separate CPU cores and restrict the main process to remaining cores

In multiple NUMA nodes setups use_cpu_affinity will only use cores of NUMA node 0 by default with an assumption, that the workload is scaling poorly across multiple NUMA nodes. If you believe your workload will have better performance utilizing more than one NUMA node, you can pass the list of cores to use for dataloading (loader_cores) and for compute (compute_cores).

loader_cores and compute_cores arguments (list of CPU cores) can be passed to enable_cpu_affinity for more control over which cores should be used, e.g. in case a workload scales well across multiple NUMA nodes.

Usage:
dataloader = dgl.dataloading.DataLoader(...)
...
with dataloader.enable_cpu_affinity():
    <training loop or inferencing>

Manual control

For advanced and more fine-grained control over OpenMP settings please refer to Maximize Performance of Intelยฎ Optimization for PyTorch* on CPU [4] article

Footnotes

Total running time of the script: (0 minutes 0.000 seconds)

Gallery generated by Sphinx-Gallery