FraudDataset

class dgl.data.FraudDataset(name, raw_dir=None, random_seed=717, train_size=0.7, val_size=0.1, force_reload=False, verbose=True, transform=None)[source]

Bases: dgl.data.dgl_dataset.DGLBuiltinDataset

Fraud node prediction dataset.

The dataset includes two multi-relational graphs extracted from Yelp and Amazon where nodes represent fraudulent reviews or fraudulent reviewers.

It was first proposed in a CIKM‘20 paper <https://arxiv.org/pdf/2008.08692.pdf> and has been used by a recent WWW‘21 paper <https://ponderly.github.io/pub/PCGNN_WWW2021.pdf> as a benchmark. Another paper <https://arxiv.org/pdf/2104.01404.pdf> also takes the dataset as an example to study the non-homophilous graphs. This dataset is built upon industrial data and has rich relational information and unique properties like class-imbalance and feature inconsistency, which makes the dataset be a good instance to investigate how GNNs perform on real-world noisy graphs. These graphs are bidirected and not self connected.

Reference: <https://github.com/YingtongDou/CARE-GNN>

Parameters
  • name (str) – Name of the dataset

  • raw_dir (str) – Specifying the directory that will store the downloaded data or the directory that already stores the input data. Default: ~/.dgl/

  • random_seed (int) – Specifying the random seed in splitting the dataset. Default: 717

  • train_size (float) – training set size of the dataset. Default: 0.7

  • val_size (float) – validation set size of the dataset, and the size of testing set is (1 - train_size - val_size) Default: 0.1

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

num_classes

Number of label classes

Type

int

graph

Graph structure, etc.

Type

dgl.DGLGraph

seed

Random seed in splitting the dataset.

Type

int

train_size

Training set size of the dataset.

Type

float

val_size

Validation set size of the dataset

Type

float

Examples

>>> dataset = FraudDataset('yelp')
>>> graph = dataset[0]
>>> num_classes = dataset.num_classes
>>> feat = graph.ndata['feature']
>>> label = graph.ndata['label']
__getitem__(idx)[source]

Get graph object

Parameters

idx (int) – Item index

Returns

graph structure, node features, node labels and masks

  • ndata['feature']: node features

  • ndata['label']: node labels

  • ndata['train_mask']: mask of training set

  • ndata['val_mask']: mask of validation set

  • ndata['test_mask']: mask of testing set

Return type

dgl.DGLGraph

__len__()[source]

number of data examples