Benchmark the performance of KNN algorithms

In this document, we benchmark the performance of the K-Nearest Neighbor (KNN) algorithms implemented by dgl.knn_graph().

Given a dataset of N samples with D dimensions, the common use case of KNN algorithms in graph learning is to build a KNN graph by finding the K nearest neighbors for each of the N samples among the dataset.
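To make the task concrete, here is a minimal pure-Python sketch of the exhaustive approach: for every sample, compute the distance to all N samples and keep the K closest. The function name `knn_neighbors`, the Euclidean metric, and the neighbor-to-point edge direction are assumptions made for illustration; this is not how dgl.knn_graph() is implemented.

```python
import math


def knn_neighbors(points, k):
    """Return, for each point, the indices of its k nearest points.

    Illustrative helper (not part of DGL). Uses Euclidean distance and
    counts a point as its own nearest neighbor, since it belongs to the
    dataset being searched.
    """
    neighbors = []
    for p in points:
        # Distance from this point to every sample, itself included.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points)]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
nbrs = knn_neighbors(points, k=2)

# Turn the neighbor lists into directed edges (neighbor -> point).
# The edge direction is a convention chosen for this sketch.
edges = [(j, i) for i, row in enumerate(nbrs) for j in row]
```

The cost of this loop grows with N (every point is compared with every other point), with D (each distance touches D coordinates), and with K (more neighbors to keep), which is why all three parameters are varied in the benchmark.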

Empirically, the three parameters N, D, and K all affect the computation cost. To benchmark the algorithms, we pick a few representative datasets that cover the most common scenarios:

A synthetic dataset with mixed Gaussian samples: N = 1000, D = 3.
A point cloud sample from ModelNet: N = 10000, D = 3.
Subsets of MNIST:
- A small subset: N = 1000, D = 784
- A medium subset: N = 10000, D = 784
- A large subset: N = 50000, D = 784

Some notes:

bruteforce-sharemem is an optimized implementation of bruteforce on GPU.
kd-tree is currently only implemented on CPU.
bruteforce-blas computes distances with matrix multiplication and is therefore memory-inefficient.
nn-descent is an approximate algorithm, so we also report the recall rate of its results.
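As a rough illustration of why a matrix-multiplication approach can be memory-hungry, the sketch below estimates the size of a dense N x N pairwise distance matrix. It assumes the full matrix is held in memory as 32-bit floats; that is an assumption for the sake of the arithmetic, not a statement about DGL's implementation, which may use additional buffers or process the data in chunks. The helper name `pairwise_matrix_gib` is hypothetical.

```python
def pairwise_matrix_gib(n, bytes_per_value=4):
    """Size in GiB of a dense n x n matrix.

    Assumes 4-byte (float32) entries by default; this is an assumption.
    """
    return n * n * bytes_per_value / 2**30


# Dataset sizes used in this benchmark.
for n in (1000, 10000, 50000):
    print(f"N = {n:>5}: {pairwise_matrix_gib(n):.3f} GiB")
```

Under these assumptions the matrix for N = 50000 alone takes more than 9 GiB, before counting the input features or any temporary buffers.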

Results

In this section, we show the runtime and recall rate (where applicable) for the algorithms under various scenarios.
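The text does not spell out how the recall rate is computed. A common definition, assumed here, is the fraction of the exact nearest neighbors that the approximate algorithm also returns, averaged over all samples. The helper `knn_recall` below is a hypothetical sketch of that definition.

```python
def knn_recall(approx, exact):
    """Fraction of exact neighbors that also appear in the approximate result.

    `approx` and `exact` are lists of neighbor-index lists, one per sample.
    Illustrative helper; the definition of recall is an assumption.
    """
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx, exact))
    total = sum(len(e) for e in exact)
    return hits / total


exact = [[0, 1], [1, 0], [2, 3]]
approx = [[0, 1], [1, 2], [2, 3]]  # the second sample misses one true neighbor
recall = knn_recall(approx, exact)
```

With this definition, a recall of 1.000 means the approximate result matches the exact neighbor sets (up to ordering).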

The experiments are run on an Amazon EC2 p3.2xlarge instance, which has 8 vCPUs, 61 GB of RAM, and one Tesla V100 GPU with 16 GB of memory. The software environment is DGL==0.7.0 (64d0f3f), PyTorch==1.8.1, and CUDA==11.1 on Ubuntu 18.04.5 LTS.
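The measurement procedure is not described here. For readers who want to run a similar comparison on their own hardware, the following is one possible timing harness using only the standard library. The helper `time_call`, the warm-up run, and the number of repeats are all assumptions, not the method used to produce the numbers in this document.

```python
import statistics
import time


def time_call(fn, repeats=5, warmup=1):
    """Return the mean wall-clock time of calling fn() over `repeats` runs.

    Illustrative helper. When timing GPU work, fn should block until the
    device has finished (for example by calling torch.cuda.synchronize()
    inside fn), because GPU kernels are launched asynchronously.
    """
    for _ in range(warmup):
        fn()  # warm-up run, not measured
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# Stand-in workload; replace it with a call that builds the KNN graph.
elapsed = time_call(lambda: sum(range(10000)), repeats=3)
```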

Mixed Gaussian

Model	CPU, K = 8	CPU, K = 64	GPU, K = 8	GPU, K = 64
bruteforce-blas	0.010	0.011	0.002	0.003
kd-tree	0.004	0.006	n/a	n/a
bruteforce	0.004	0.006	0.126	0.009
bruteforce-sharemem	n/a	n/a	0.002	0.003
nn-descent	0.014 (R: 0.985)	0.148 (R: 1.000)	0.016 (R: 0.973)	0.077 (R: 1.000)

Point Cloud

Model	CPU, K = 8	CPU, K = 64	GPU, K = 8	GPU, K = 64
bruteforce-blas	0.359	0.432	0.010	0.010
kd-tree	0.007	0.026	n/a	n/a
bruteforce	0.074	0.167	0.008	0.039
bruteforce-sharemem	n/a	n/a	0.004	0.017
nn-descent	0.161 (R: 0.977)	1.345 (R: 0.999)	0.086 (R: 0.966)	0.445 (R: 0.999)

Small MNIST

Model	CPU, K = 8	CPU, K = 64	GPU, K = 8	GPU, K = 64
bruteforce-blas	0.014	0.015	0.002	0.002
kd-tree	0.179	0.182	n/a	n/a
bruteforce	0.173	0.228	0.123	0.170
bruteforce-sharemem	n/a	n/a	0.045	0.054
nn-descent	0.060 (R: 0.878)	1.077 (R: 0.999)	0.030 (R: 0.952)	0.457 (R: 0.999)

Medium MNIST

Model	CPU, K = 8	CPU, K = 64	GPU, K = 8	GPU, K = 64
bruteforce-blas	0.897	0.970	0.019	0.023
kd-tree	18.902	18.928	n/a	n/a
bruteforce	14.495	17.652	2.058	2.588
bruteforce-sharemem	n/a	n/a	2.257	2.524
nn-descent	0.804 (R: 0.755)	14.108 (R: 0.999)	0.158 (R: 0.900)	1.794 (R: 0.999)

Large MNIST

Model	CPU, K = 8	CPU, K = 64	GPU, K = 8	GPU, K = 64
bruteforce-blas	21.829	22.135	Out of Memory	Out of Memory
kd-tree	542.688	573.379	n/a	n/a
bruteforce	373.823	432.963	10.317	12.639
bruteforce-sharemem	n/a	n/a	53.133	58.419
nn-descent	4.995 (R: 0.658)	75.487 (R: 0.999)	1.478 (R: 0.860)	15.698 (R: 0.999)

Conclusion

As long as you have enough memory, bruteforce-blas is a good default choice.
The exception is when D is small and the data is on CPU, in which case kd-tree is the best algorithm.
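The two rules above can be written as a small selection helper. This is only a sketch: the function `pick_knn_algorithm`, the threshold for what counts as a "small" D, and the fallback used when memory is insufficient are all assumptions introduced for illustration, since the conclusion does not specify them.

```python
def pick_knn_algorithm(d, on_gpu, enough_memory=True, small_d=16):
    """Suggest an algorithm name following the conclusion's two rules.

    Hypothetical helper. `small_d` is an arbitrary threshold (assumption):
    the benchmark only covers D = 3 and D = 784.
    """
    if not on_gpu and d <= small_d:
        # Low-dimensional data on CPU: kd-tree.
        return "kd-tree"
    if enough_memory:
        # Default choice when memory allows.
        return "bruteforce-blas"
    # Assumption: the text gives no recommendation for this case. Plain
    # bruteforce is an exact option; nn-descent is an approximate alternative.
    return "bruteforce"


choice_point_cloud_cpu = pick_knn_algorithm(d=3, on_gpu=False)
choice_mnist_cpu = pick_knn_algorithm(d=784, on_gpu=False)
choice_point_cloud_gpu = pick_knn_algorithm(d=3, on_gpu=True)
```

The returned string is meant to be passed as the algorithm name when building the KNN graph.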