dgl.data

The dgl.data package contains datasets hosted by DGL and also utilities for downloading, processing, saving and loading data from external resources.

Base Dataset Class

class dgl.data.DGLDataset(name, url=None, raw_dir=None, save_dir=None, hash_key=(), force_reload=False, verbose=False)[source]

The basic DGL dataset for creating graph datasets. This class defines a basic template for DGL datasets. The following steps are executed automatically:

  1. Check whether there is a dataset cache on disk (already processed and stored on the disk) by invoking has_cache(). If true, go to step 5.

  2. Call download() to download the data.

  3. Call process() to process the data.

  4. Call save() to save the processed dataset on disk and go to step 6.

  5. Call load() to load the processed dataset from disk.

  6. Done.

Users can overwrite these functions with their own data processing logic.

Parameters
  • name (str) – Name of the dataset

  • url (str) – URL to download the raw dataset

  • raw_dir (str) – Directory that will store the downloaded data, or the directory that already contains the input data. Default: ~/.dgl/

  • save_dir (str) – Directory to save the processed dataset. Default: same as raw_dir

  • hash_key (tuple) – A tuple of values as the input for the hash function. Users can distinguish instances (and their caches on the disk) from the same dataset class by comparing the hash values. Default: (), the corresponding hash value is 'f9065fa7'.

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information

url

The URL to download the dataset

Type

str

name

The dataset name

Type

str

raw_dir

Directory that contains the input data folder

Type

str

raw_path

Directory that contains the input data files. Default: os.path.join(self.raw_dir, self.name)

Type

str

save_dir

Directory to save the processed dataset

Type

str

save_path

File path to save the processed dataset

Type

str

verbose

Whether to print information

Type

bool

hash

Hash value for the dataset and the setting.

Type

str

abstract __getitem__(idx)[source]

Gets the data object at index.

abstract __len__()[source]

The number of examples in the dataset.

download()[source]

Overwrite to realize your own logic of downloading data.

It is recommended to download the data to the self.raw_dir folder. Can be ignored if the dataset is already in self.raw_dir.

has_cache()[source]

Overwrite to realize your own logic of deciding whether there exists a cached dataset.

By default False.

load()[source]

Overwrite to realize your own logic of loading the saved dataset from files.

It is recommended to use dgl.data.utils.load_graphs to load DGL graphs from files and dgl.data.utils.load_info to load extra information into a Python dict object.

process()[source]

Overwrite to realize your own logic of processing the input data.

save()[source]

Overwrite to realize your own logic of saving the processed dataset into files.

It is recommended to use dgl.data.utils.save_graphs to save DGL graphs into files and dgl.data.utils.save_info to save extra information into files.
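
To make the template above concrete, here is a minimal sketch of a custom dataset, assuming a single toy graph built in process(); the name 'my_dataset' and the cache file suffix are illustrative choices, not part of the API:

>>> import os
>>> import dgl
>>> from dgl.data import DGLDataset
>>> from dgl.data.utils import save_graphs, load_graphs
>>>
>>> class MyDataset(DGLDataset):
...     def __init__(self, raw_dir=None, force_reload=False, verbose=False):
...         super().__init__(name='my_dataset', raw_dir=raw_dir,
...                          force_reload=force_reload, verbose=verbose)
...     def process(self):
...         # a real dataset would parse files under self.raw_path here
...         src, dst = [0, 1, 2], [1, 2, 0]
...         self._g = dgl.graph((src, dst), num_nodes=3)
...     def has_cache(self):
...         # decide whether a processed copy already exists on disk
...         return os.path.exists(self.save_path + '_dgl_graph.bin')
...     def save(self):
...         save_graphs(self.save_path + '_dgl_graph.bin', [self._g])
...     def load(self):
...         graphs, _ = load_graphs(self.save_path + '_dgl_graph.bin')
...         self._g = graphs[0]
...     def __getitem__(self, idx):
...         return self._g
...     def __len__(self):
...         return 1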

Node Prediction Datasets

DGL hosted datasets for node classification/regression tasks.

Stanford sentiment treebank dataset

class dgl.data.SSTDataset(mode='train', glove_embed_file=None, vocab_file=None, raw_dir=None, force_reload=False, verbose=False)[source]

Stanford Sentiment Treebank dataset.

    Deprecated since version 0.5.0:
  • trees is deprecated, it is replaced by:

    >>> dataset = SSTDataset()
    >>> for tree in dataset:
    ...     # your code here
    
  • num_vocabs is deprecated, it is replaced by vocab_size.

Each sample is the constituency tree of a sentence. The leaf nodes represent words; each word is an int value stored in the x feature field. Non-leaf nodes have a special value PAD_WORD in the x field. Each node also has a sentiment annotation with 5 classes (very negative, negative, neutral, positive and very positive). The sentiment label is an int value stored in the y feature field. Official site: http://nlp.stanford.edu/sentiment/index.html

Statistics:

  • Train examples: 8,544

  • Dev examples: 1,101

  • Test examples: 2,210

  • Number of classes for each node: 5

Parameters
  • mode (str, optional) – Should be one of ['train', 'dev', 'test', 'tiny']. Default: 'train'

  • glove_embed_file (str, optional) – The path to the pretrained GloVe embedding file. Default: None

  • vocab_file (str, optional) – Optional vocabulary file. If not given, the default vocabulary file is used. Default: None

  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

vocab

Vocabulary of the dataset

Type

OrderedDict

trees

A list of DGLGraph objects

Type

list

num_classes

Number of classes for each node

Type

int

pretrained_emb

Pretrained GloVe embedding with respect to the vocabulary.

Type

Tensor

vocab_size

The size of the vocabulary

Type

int

num_vocabs

The size of the vocabulary

Type

int

Notes

All the samples will be loaded and preprocessed in the memory first.

Examples

>>> # get dataset
>>> train_data = SSTDataset()
>>> dev_data = SSTDataset(mode='dev')
>>> test_data = SSTDataset(mode='test')
>>> tiny_data = SSTDataset(mode='tiny')
>>>
>>> len(train_data)
8544
>>> train_data.num_classes
5
>>> glove_embed = train_data.pretrained_emb
>>> train_data.vocab_size
19536
>>> train_data[0]
Graph(num_nodes=71, num_edges=70,
  ndata_schemes={'x': Scheme(shape=(), dtype=torch.int64), 'y': Scheme(shape=(), dtype=torch.int64), 'mask': Scheme(shape=(), dtype=torch.int64)}
  edata_schemes={})
>>> for tree in train_data:
...     input_ids = tree.ndata['x']
...     labels = tree.ndata['y']
...     mask = tree.ndata['mask']
...     # your code here
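
Since ndata['mask'] flags leaf nodes (see __getitem__ below), the word ids at the leaves of a tree can be recovered with a short sketch like the following (assuming the PyTorch backend):

>>> tree = train_data[0]
>>> is_leaf = tree.ndata['mask'] == 1
>>> leaf_word_ids = tree.ndata['x'][is_leaf]  # int ids indexing into train_data.vocab
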
__getitem__(idx)[source]

Get graph by index

Parameters

idx (int) – The sample index.

Returns

graph structure, word id for each node, node labels and masks.

  • ndata['x']: word id of the node

  • ndata['y']: label of the node

  • ndata['mask']: 1 if the node is a leaf, otherwise 0

Return type

dgl.DGLGraph

__len__()[source]

Number of graphs in the dataset.

Karate club dataset

class dgl.data.KarateClubDataset[source]

Karate Club dataset for Node Classification

    Deprecated since version 0.5.0:
  • data is deprecated, it is replaced by:

    >>> dataset = KarateClubDataset()
    >>> g = dataset[0]
    

Zachary’s karate club is a social network of a university karate club, described in the paper “An Information Flow Model for Conflict and Fission in Small Groups” by Wayne W. Zachary. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002. Official website: http://konect.cc/networks/ucidata-zachary/

Karate Club dataset statistics:

  • Nodes: 34

  • Edges: 156

  • Number of Classes: 2

num_classes

Number of node classes

Type

int

data

A list of dgl.DGLGraph objects

Type

list

Examples

>>> dataset = KarateClubDataset()
>>> num_classes = dataset.num_classes
>>> g = dataset[0]
>>> labels = g.ndata['label']
__getitem__(idx)[source]

Get graph object

Parameters

idx (int) – Item index, KarateClubDataset has only one graph object

Returns

graph structure and labels.

  • ndata['label']: ground truth labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

Citation network dataset

class dgl.data.CoraGraphDataset(raw_dir=None, force_reload=False, verbose=True)[source]

Cora citation network dataset.

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = CoraGraphDataset()
    >>> graph = dataset[0]
    
  • train_mask is deprecated, it is replaced by:

    >>> dataset = CoraGraphDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.ndata['train_mask']
    
  • val_mask is deprecated, it is replaced by:

    >>> dataset = CoraGraphDataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.ndata['val_mask']
    
  • test_mask is deprecated, it is replaced by:

    >>> dataset = CoraGraphDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.ndata['test_mask']
    
  • labels is deprecated, it is replaced by:

    >>> dataset = CoraGraphDataset()
    >>> graph = dataset[0]
    >>> labels = graph.ndata['label']
    
  • feat is deprecated, it is replaced by:

    >>> dataset = CoraGraphDataset()
    >>> graph = dataset[0]
    >>> feat = graph.ndata['feat']
    

Nodes represent papers and edges represent citation relationships. Each node has a predefined feature vector with 1433 dimensions. The dataset is designed for the node classification task. The task is to predict the category of a given paper.

Statistics:

  • Nodes: 2708

  • Edges: 10556

  • Number of Classes: 7

  • Label split:

    • Train: 140

    • Valid: 500

    • Test: 1000

Parameters
  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_classes

Number of label classes

Type

int

graph

Graph structure

Type

networkx.DiGraph

train_mask

Mask of training nodes

Type

numpy.ndarray

val_mask

Mask of validation nodes

Type

numpy.ndarray

test_mask

Mask of test nodes

Type

numpy.ndarray

labels

Ground truth labels of each node

Type

numpy.ndarray

features

Node features

Type

Tensor

Notes

The node feature is row-normalized.

Examples

>>> dataset = CoraGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
>>>
>>> # Train, Validation and Test
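
The trailing comment above leaves the training loop to the reader. As an illustrative sketch only (model is a hypothetical node-classification module, e.g. a two-layer GCN, not part of this dataset's API), mask-based training with the PyTorch backend could look like:

>>> import torch.nn.functional as F
>>> logits = model(g, feat)  # hypothetical model, returns one score per class per node
>>> loss = F.cross_entropy(logits[train_mask], label[train_mask])
>>> val_acc = (logits[val_mask].argmax(1) == label[val_mask]).float().mean()
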
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, CoraGraphDataset has only one graph object

Returns

graph structure, node features and labels.

  • ndata['train_mask']: mask for training node set

  • ndata['val_mask']: mask for validation node set

  • ndata['test_mask']: mask for test node set

  • ndata['feat']: node feature

  • ndata['label']: ground truth labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

class dgl.data.CiteseerGraphDataset(raw_dir=None, force_reload=False, verbose=True)[source]

Citeseer citation network dataset.

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = CiteseerGraphDataset()
    >>> graph = dataset[0]
    
  • train_mask is deprecated, it is replaced by:

    >>> dataset = CiteseerGraphDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.ndata['train_mask']
    
  • val_mask is deprecated, it is replaced by:

    >>> dataset = CiteseerGraphDataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.ndata['val_mask']
    
  • test_mask is deprecated, it is replaced by:

    >>> dataset = CiteseerGraphDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.ndata['test_mask']
    
  • labels is deprecated, it is replaced by:

    >>> dataset = CiteseerGraphDataset()
    >>> graph = dataset[0]
    >>> labels = graph.ndata['label']
    
  • feat is deprecated, it is replaced by:

    >>> dataset = CiteseerGraphDataset()
    >>> graph = dataset[0]
    >>> feat = graph.ndata['feat']
    

Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature vector with 3703 dimensions. The dataset is designed for the node classification task. The task is to predict the category of a given publication.

Statistics:

  • Nodes: 3327

  • Edges: 9228

  • Number of Classes: 6

  • Label Split:

    • Train: 120

    • Valid: 500

    • Test: 1000

Parameters
  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_classes

Number of label classes

Type

int

graph

Graph structure

Type

networkx.DiGraph

train_mask

Mask of training nodes

Type

numpy.ndarray

val_mask

Mask of validation nodes

Type

numpy.ndarray

test_mask

Mask of test nodes

Type

numpy.ndarray

labels

Ground truth labels of each node

Type

numpy.ndarray

features

Node features

Type

Tensor

Notes

The node feature is row-normalized.

In the Citeseer dataset, there are some isolated nodes in the graph. These isolated nodes are assigned zero feature vectors at the corresponding positions.

Examples

>>> dataset = CiteseerGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
>>>
>>> # Train, Validation and Test
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, CiteseerGraphDataset has only one graph object

Returns

graph structure, node features and labels.

  • ndata['train_mask']: mask for training node set

  • ndata['val_mask']: mask for validation node set

  • ndata['test_mask']: mask for test node set

  • ndata['feat']: node feature

  • ndata['label']: ground truth labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

class dgl.data.PubmedGraphDataset(raw_dir=None, force_reload=False, verbose=True)[source]

Pubmed citation network dataset.

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = PubmedGraphDataset()
    >>> graph = dataset[0]
    
  • train_mask is deprecated, it is replaced by:

    >>> dataset = PubmedGraphDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.ndata['train_mask']
    
  • val_mask is deprecated, it is replaced by:

    >>> dataset = PubmedGraphDataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.ndata['val_mask']
    
  • test_mask is deprecated, it is replaced by:

    >>> dataset = PubmedGraphDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.ndata['test_mask']
    
  • labels is deprecated, it is replaced by:

    >>> dataset = PubmedGraphDataset()
    >>> graph = dataset[0]
    >>> labels = graph.ndata['label']
    
  • feat is deprecated, it is replaced by:

    >>> dataset = PubmedGraphDataset()
    >>> graph = dataset[0]
    >>> feat = graph.ndata['feat']
    

Nodes represent scientific publications and edges represent citation relationships. Each node has a predefined feature vector with 500 dimensions. The dataset is designed for the node classification task. The task is to predict the category of a given publication.

Statistics:

  • Nodes: 19717

  • Edges: 88651

  • Number of Classes: 3

  • Label Split:

    • Train: 60

    • Valid: 500

    • Test: 1000

Parameters
  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_classes

Number of label classes

Type

int

graph

Graph structure

Type

networkx.DiGraph

train_mask

Mask of training nodes

Type

numpy.ndarray

val_mask

Mask of validation nodes

Type

numpy.ndarray

test_mask

Mask of test nodes

Type

numpy.ndarray

labels

Ground truth labels of each node

Type

numpy.ndarray

features

Node features

Type

Tensor

Notes

The node feature is row-normalized.

Examples

>>> dataset = PubmedGraphDataset()
>>> g = dataset[0]
>>> num_class = dataset.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
>>>
>>> # Train, Validation and Test
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, PubmedGraphDataset has only one graph object

Returns

graph structure, node features and labels.

  • ndata['train_mask']: mask for training node set

  • ndata['val_mask']: mask for validation node set

  • ndata['test_mask']: mask for test node set

  • ndata['feat']: node feature

  • ndata['label']: ground truth labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

CoraFull dataset

class dgl.data.CoraFullDataset(raw_dir=None, force_reload=False, verbose=False)[source]

CORA-Full dataset for node classification task.

    Deprecated since version 0.5.0:
  • data is deprecated, it is replaced by:

>>> dataset = CoraFullDataset()
>>> graph = dataset[0]

Extended Cora dataset. Nodes represent papers and edges represent citations.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Statistics:

  • Nodes: 19,793

  • Edges: 130,622

  • Number of Classes: 70

  • Node feature size: 8,710

Parameters
  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

num_classes

Number of classes for each node.

Type

int

data

A list of DGLGraph objects

Type

list

Examples

>>> data = CoraFullDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)

Get graph by index

Parameters

idx (int) – Item index

Returns

The graph contains:

  • ndata['feat']: node features

  • ndata['label']: node labels

Return type

dgl.DGLGraph

__len__()

Number of graphs in the dataset

RDF datasets

class dgl.data.AIFBDataset(print_every=10000, insert_reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

AIFB dataset for node classification task

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = AIFBDataset()
    >>> graph = dataset[0]
    
  • train_idx is deprecated, it can be replaced by:

    >>> dataset = AIFBDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.nodes[dataset.predict_category].data['train_mask']
    >>> train_idx = th.nonzero(train_mask).squeeze()
    
  • test_idx is deprecated, it can be replaced by:

    >>> dataset = AIFBDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.nodes[dataset.predict_category].data['test_mask']
    >>> test_idx = th.nonzero(test_mask).squeeze()
    

The AIFB dataset is a Semantic Web (RDF) dataset used as a benchmark in data mining. It records the organizational structure of AIFB at the University of Karlsruhe.

AIFB dataset statistics:

  • Nodes: 7262

  • Edges: 48810 (including reverse edges)

  • Target Category: Personen

  • Number of Classes: 4

  • Label Split:

    • Train: 140

    • Test: 36

Parameters
  • print_every (int) – Print a preprocessing log every print_every tuples. Default: 10000.

  • insert_reverse (bool) – If True, add reverse edges and reverse relations to the final graph. Default: True.

  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_classes

Number of classes to predict

Type

int

predict_category

The entity category (node type) that has labels for prediction

Type

str

labels

All the labels of the entities in predict_category

Type

Tensor

graph

Graph structure

Type

dgl.DGLGraph

train_idx

Entity IDs for training. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

test_idx

Entity IDs for testing. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

Examples

>>> dataset = dgl.data.rdf.AIFBDataset()
>>> graph = dataset[0]
>>> category = dataset.predict_category
>>> num_classes = dataset.num_classes
>>>
>>> train_mask = graph.nodes[category].data.pop('train_mask')
>>> test_mask = graph.nodes[category].data.pop('test_mask')
>>> labels = graph.nodes[category].data.pop('labels')
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, AIFBDataset has only one graph object

Returns

The graph contains:

  • ndata['train_mask']: mask for training node set

  • ndata['test_mask']: mask for testing node set

  • ndata['labels']: node labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

Returns

Return type

int

class dgl.data.MUTAGDataset(print_every=10000, insert_reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

MUTAG dataset for node classification task

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = MUTAGDataset()
    >>> graph = dataset[0]
    
  • train_idx is deprecated, it can be replaced by:

    >>> dataset = MUTAGDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.nodes[dataset.predict_category].data['train_mask']
    >>> train_idx = th.nonzero(train_mask).squeeze()
    
  • test_idx is deprecated, it can be replaced by:

    >>> dataset = MUTAGDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.nodes[dataset.predict_category].data['test_mask']
    >>> test_idx = th.nonzero(test_mask).squeeze()
    

MUTAG dataset statistics:

  • Nodes: 27163

  • Edges: 148100 (including reverse edges)

  • Target Category: d

  • Number of Classes: 2

  • Label Split:

    • Train: 272

    • Test: 68

Parameters
  • print_every (int) – Print a preprocessing log every print_every tuples. Default: 10000.

  • insert_reverse (bool) – If True, add reverse edges and reverse relations to the final graph. Default: True.

  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_classes

Number of classes to predict

Type

int

predict_category

The entity category (node type) that has labels for prediction

Type

str

labels

All the labels of the entities in predict_category

Type

Tensor

graph

Graph structure

Type

dgl.DGLGraph

train_idx

Entity IDs for training. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

test_idx

Entity IDs for testing. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

Examples

>>> dataset = dgl.data.rdf.MUTAGDataset()
>>> graph = dataset[0]
>>> category = dataset.predict_category
>>> num_classes = dataset.num_classes
>>>
>>> train_mask = graph.nodes[category].data.pop('train_mask')
>>> test_mask = graph.nodes[category].data.pop('test_mask')
>>> labels = graph.nodes[category].data.pop('labels')
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, MUTAGDataset has only one graph object

Returns

The graph contains:

  • ndata['train_mask']: mask for training node set

  • ndata['test_mask']: mask for testing node set

  • ndata['labels']: node labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

Returns

Return type

int

class dgl.data.BGSDataset(print_every=10000, insert_reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

BGS dataset for node classification task

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = BGSDataset()
    >>> graph = dataset[0]
    
  • train_idx is deprecated, it can be replaced by:

    >>> dataset = BGSDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.nodes[dataset.predict_category].data['train_mask']
    >>> train_idx = th.nonzero(train_mask).squeeze()
    
  • test_idx is deprecated, it can be replaced by:

    >>> dataset = BGSDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.nodes[dataset.predict_category].data['test_mask']
    >>> test_idx = th.nonzero(test_mask).squeeze()
    

BGS namespace convention: http://data.bgs.ac.uk/(ref|id)/<Major Concept>/<Sub Concept>/INSTANCE. We ignored all literal nodes and the relations connecting them in the output graph. We also ignored the relation used to mark whether a term is CURRENT or DEPRECATED.

BGS dataset statistics:

  • Nodes: 94806

  • Edges: 672884 (including reverse edges)

  • Target Category: Lexicon/NamedRockUnit

  • Number of Classes: 2

  • Label Split:

    • Train: 117

    • Test: 29

Parameters
  • print_every (int) – Print a preprocessing log every print_every tuples. Default: 10000.

  • insert_reverse (bool) – If True, add reverse edges and reverse relations to the final graph. Default: True.

  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_classes

Number of classes to predict

Type

int

predict_category

The entity category (node type) that has labels for prediction

Type

str

labels

All the labels of the entities in predict_category

Type

Tensor

graph

Graph structure

Type

dgl.DGLGraph

train_idx

Entity IDs for training. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

test_idx

Entity IDs for testing. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

Examples

>>> dataset = dgl.data.rdf.BGSDataset()
>>> graph = dataset[0]
>>> category = dataset.predict_category
>>> num_classes = dataset.num_classes
>>>
>>> train_mask = graph.nodes[category].data.pop('train_mask')
>>> test_mask = graph.nodes[category].data.pop('test_mask')
>>> labels = graph.nodes[category].data.pop('labels')
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, BGSDataset has only one graph object

Returns

The graph contains:

  • ndata['train_mask']: mask for training node set

  • ndata['test_mask']: mask for testing node set

  • ndata['labels']: node labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

Returns

Return type

int

class dgl.data.AMDataset(print_every=10000, insert_reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

AM dataset for node classification task

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = AMDataset()
    >>> graph = dataset[0]
    
  • train_idx is deprecated, it can be replaced by:

    >>> dataset = AMDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.nodes[dataset.predict_category].data['train_mask']
    >>> train_idx = th.nonzero(train_mask).squeeze()
    
  • test_idx is deprecated, it can be replaced by:

    >>> dataset = AMDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.nodes[dataset.predict_category].data['test_mask']
    >>> test_idx = th.nonzero(test_mask).squeeze()
    

Namespace convention:

  • Instance: http://purl.org/collections/nl/am/<type>-<id>

  • Relation: http://purl.org/collections/nl/am/<name>

We ignored all literal nodes and the relations connecting them in the output graph.

AM dataset statistics:

  • Nodes: 881680

  • Edges: 5668682 (including reverse edges)

  • Target Category: proxy

  • Number of Classes: 11

  • Label Split:

    • Train: 802

    • Test: 198

Parameters
  • print_every (int) – Print a preprocessing log every print_every tuples. Default: 10000.

  • insert_reverse (bool) – If True, add reverse edges and reverse relations to the final graph. Default: True.

  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_classes

Number of classes to predict

Type

int

predict_category

The entity category (node type) that has labels for prediction

Type

str

labels

All the labels of the entities in predict_category

Type

Tensor

graph

Graph structure

Type

dgl.DGLGraph

train_idx

Entity IDs for training. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

test_idx

Entity IDs for testing. All IDs are local IDs w.r.t. predict_category.

Type

Tensor

Examples

>>> dataset = dgl.data.rdf.AMDataset()
>>> graph = dataset[0]
>>> category = dataset.predict_category
>>> num_classes = dataset.num_classes
>>>
>>> train_mask = graph.nodes[category].data.pop('train_mask')
>>> test_mask = graph.nodes[category].data.pop('test_mask')
>>> labels = graph.nodes[category].data.pop('labels')
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, AMDataset has only one graph object

Returns

The graph contains:

  • ndata['train_mask']: mask for training node set

  • ndata['test_mask']: mask for testing node set

  • ndata['labels']: node labels

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

Returns

Return type

int

Amazon Co-Purchase dataset

class dgl.data.AmazonCoBuyComputerDataset(raw_dir=None, force_reload=False, verbose=False)[source]

‘Computer’ part of the AmazonCoBuy dataset for node classification task.

    Deprecated since version 0.5.0:
  • data is deprecated, it is replaced by:

>>> dataset = AmazonCoBuyComputerDataset()
>>> graph = dataset[0]

Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Statistics:

  • Nodes: 13,752

  • Edges: 574,418

  • Number of classes: 5

  • Node feature size: 767

Parameters
  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

num_classes

Number of classes for each node.

Type

int

data

A list of DGLGraph objects

Type

list

Examples

>>> data = AmazonCoBuyComputerDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)

Get graph by index

Parameters

idx (int) – Item index

Returns

The graph contains:

  • ndata['feat']: node features

  • ndata['label']: node labels

Return type

dgl.DGLGraph

__len__()

Number of graphs in the dataset

class dgl.data.AmazonCoBuyPhotoDataset(raw_dir=None, force_reload=False, verbose=False)[source]

‘Photo’ part of the AmazonCoBuy dataset for node classification task.

    Deprecated since version 0.5.0:
  • data is deprecated, it is replaced by:

>>> dataset = AmazonCoBuyPhotoDataset()
>>> graph = dataset[0]

Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Statistics

  • Nodes: 7,650

  • Edges: 287,326

  • Number of classes: 5

  • Node feature size: 745

Parameters
  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

num_classes

Number of classes for each node.

Type

int

data

A list of DGLGraph objects

Type

list

Examples

>>> data = AmazonCoBuyPhotoDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)

Get graph by index

Parameters

idx (int) – Item index

Returns

The graph contains:

  • ndata['feat']: node features

  • ndata['label']: node labels

Return type

dgl.DGLGraph

__len__()

Number of graphs in the dataset

Coauthor dataset

class dgl.data.CoauthorCSDataset(raw_dir=None, force_reload=False, verbose=False)[source]

‘Computer Science (CS)’ part of the Coauthor dataset for node classification task.

    Deprecated since version 0.5.0:
  • data is deprecated, it is replaced by:

>>> dataset = CoauthorCSDataset()
>>> graph = dataset[0]

Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Here, nodes are authors, who are connected by an edge if they co-authored a paper; node features represent paper keywords for each author’s papers, and class labels indicate the most active fields of study for each author.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Statistics:

  • Nodes: 18,333

  • Edges: 327,576

  • Number of classes: 15

  • Node feature size: 6,805

Parameters
  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

num_classes

Number of classes for each node.

Type

int

data

A list of DGLGraph objects

Type

list

Examples

>>> data = CoauthorCSDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)

Get graph by index

Parameters

idx (int) – Item index

Returns

The graph contains:

  • ndata['feat']: node features

  • ndata['label']: node labels

Return type

dgl.DGLGraph

__len__()

Number of graphs in the dataset

class dgl.data.CoauthorPhysicsDataset(raw_dir=None, force_reload=False, verbose=False)[source]

‘Physics’ part of the Coauthor dataset for node classification task.

    Deprecated since version 0.5.0:
  • data is deprecated, it is replaced by:

>>> dataset = CoauthorPhysicsDataset()
>>> graph = dataset[0]

Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Here, nodes are authors, who are connected by an edge if they co-authored a paper; node features represent paper keywords for each author’s papers, and class labels indicate the most active fields of study for each author.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Statistics

  • Nodes: 34,493

  • Edges: 991,848

  • Number of classes: 5

  • Node feature size: 8,415

Parameters
  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

num_classes

Number of classes for each node.

Type

int

data

A list of DGLGraph objects

Type

list

Examples

>>> data = CoauthorPhysicsDataset()
>>> g = data[0]
>>> num_class = data.num_classes
>>> feat = g.ndata['feat']  # get node feature
>>> label = g.ndata['label']  # get node labels
__getitem__(idx)

Get graph by index

Parameters

idx (int) – Item index

Returns

The graph contains:

  • ndata['feat']: node features

  • ndata['label']: node labels

Return type

dgl.DGLGraph

__len__()

Number of graphs in the dataset

Protein-Protein Interaction dataset

class dgl.data.PPIDataset(mode='train', raw_dir=None, force_reload=False, verbose=False)[source]

Protein-Protein Interaction dataset for inductive node classification

    Deprecated since version 0.5.0:
  • labels is deprecated, it is replaced by:

    >>> dataset = PPIDataset()
    >>> for g in dataset:
    ...     labels = g.ndata['label']
    ...
    >>>
    
  • features is deprecated, it is replaced by:

    >>> dataset = PPIDataset()
    >>> for g in dataset:
    ...     features = g.ndata['feat']
    ...
    >>>
    

A toy Protein-Protein Interaction network dataset. The dataset contains 24 graphs. The average number of nodes per graph is 2372. Each node has 50 features and 121 labels. The dataset uses 20 graphs for training, 2 for validation and 2 for testing.

Reference: http://snap.stanford.edu/graphsage/

Statistics:

  • Train examples: 20

  • Valid examples: 2

  • Test examples: 2

Parameters
  • mode (str) – Must be one of (‘train’, ‘valid’, ‘test’). Default: ‘train’

  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

num_labels

Number of labels for each node

Type

int

labels

Node labels

Type

Tensor

features

Node features

Type

Tensor

Examples

>>> dataset = PPIDataset(mode='valid')
>>> num_labels = dataset.num_labels
>>> for g in dataset:
...     feat = g.ndata['feat']
...     label = g.ndata['label']
...     # your code here
>>>
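
Because each node carries 121 binary labels, PPI is a multi-label task, so the usual per-class cross-entropy does not apply. A hedged sketch of the loss computation (model is a hypothetical module, not part of the dataset's API):

>>> import torch.nn.functional as F
>>> logits = model(g, feat)  # hypothetical multi-label model, one logit per label
>>> loss = F.binary_cross_entropy_with_logits(logits, label.float())
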
__getitem__(item)[source]

Get the item^th sample.

Parameters

item (int) – The sample index.

Returns

graph structure, node features and node labels.

  • ndata['feat']: node features

  • ndata['label']: node labels

Return type

dgl.DGLGraph

__len__()[source]

Return number of samples in this dataset.

Reddit dataset

class dgl.data.RedditDataset(self_loop=False, raw_dir=None, force_reload=False, verbose=False)[source]

Reddit dataset for community detection (node classification)

    Deprecated since version 0.5.0:
  • graph is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> graph = dataset[0]
    
  • num_labels is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> num_classes = dataset.num_classes
    
  • train_mask is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.ndata['train_mask']
    
  • val_mask is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.ndata['val_mask']
    
  • test_mask is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.ndata['test_mask']
    
  • features is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> graph = dataset[0]
    >>> features = graph.ndata['feat']
    
  • labels is deprecated, it is replaced by:

    >>> dataset = RedditDataset()
    >>> graph = dataset[0]
    >>> labels = graph.ndata['label']
    

This is a graph dataset from Reddit posts made in the month of September, 2014. The node label in this case is the community, or “subreddit”, that a post belongs to. The authors sampled 50 large communities and built a post-to-post graph, connecting posts if the same user comments on both. In total this dataset contains 232,965 posts with an average degree of 492. We use the first 20 days for training and the remaining days for testing (with 30% used for validation).

Reference: http://snap.stanford.edu/graphsage/

Statistics

  • Nodes: 232,965

  • Edges: 114,615,892

  • Node feature size: 602

  • Number of training samples: 153,431

  • Number of validation samples: 23,831

  • Number of test samples: 55,703

Parameters
  • self_loop (bool) – Whether to load the dataset with self-loop connections. Default: False

  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

num_classes

Number of classes for each node

Type

int

graph

Graph of the dataset

Type

dgl.DGLGraph

num_labels

Number of classes for each node

Type

int

train_mask

Mask of training nodes

Type

numpy.ndarray

val_mask

Mask of validation nodes

Type

numpy.ndarray

test_mask

Mask of test nodes

Type

numpy.ndarray

features

Node features

Type

Tensor

labels

Node labels

Type

Tensor

Examples

>>> data = RedditDataset()
>>> g = data[0]
>>> num_classes = data.num_classes
>>>
>>> # get node feature
>>> feat = g.ndata['feat']
>>>
>>> # get data split
>>> train_mask = g.ndata['train_mask']
>>> val_mask = g.ndata['val_mask']
>>> test_mask = g.ndata['test_mask']
>>>
>>> # get labels
>>> label = g.ndata['label']
>>>
>>> # Train, Validation and Test
__getitem__(idx)[source]

Get graph by index

Parameters

idx (int) – Item index

Returns

graph structure, node labels, node features and splitting masks:

  • ndata['label']: node label

  • ndata['feat']: node feature

  • ndata['train_mask']: mask for training node set

  • ndata['val_mask']: mask for validation node set

  • ndata['test_mask']: mask for test node set

Return type

dgl.DGLGraph

__len__()[source]

Number of graphs in the dataset

Symmetric Stochastic Block Model Mixture dataset

class dgl.data.SBMMixtureDataset(n_graphs, n_nodes, n_communities, k=2, avg_deg=3, pq='Appendix_C', rng=None)[source]

Symmetric Stochastic Block Model Mixture

Reference: Appendix C of Supervised Community Detection with Hierarchical Graph Neural Networks

Parameters
  • n_graphs (int) – Number of graphs.

  • n_nodes (int) – Number of nodes.

  • n_communities (int) – Number of communities.

  • k (int, optional) – Multiplier. Default: 2

  • avg_deg (int, optional) – Average degree. Default: 3

  • pq (list of pair of nonnegative float or str, optional) – Random densities. This parameter is reserved for future extension; for now it always uses the default value. Default: Appendix_C

  • rng (numpy.random.RandomState, optional) – Random number generator. If not given, it’s numpy.random.RandomState() with seed=None, which reads data from /dev/urandom (or the Windows analogue) if available, or seeds from the clock otherwise. Default: None

Raises

RuntimeError – If pq is not a list or a string.

Examples

>>> data = SBMMixtureDataset(n_graphs=16, n_nodes=10000, n_communities=2)
>>> from torch.utils.data import DataLoader
>>> dataloader = DataLoader(data, batch_size=1, collate_fn=data.collate_fn)
>>> for graph, line_graph, graph_degrees, line_graph_degrees, pm_pd in dataloader:
...     # your code here
__getitem__(idx)[source]

Get one example by index

Parameters

idx (int) – Item index

Returns

  • graph (dgl.DGLGraph) – The original graph

  • line_graph (dgl.DGLGraph) – The line graph of graph

  • graph_degree (numpy.ndarray) – In-degrees for each node in graph

  • line_graph_degree (numpy.ndarray) – In-degrees for each node in line_graph

  • pm_pd (numpy.ndarray) – Edge indicator matrices Pm and Pd

__len__()[source]

Number of graphs in the dataset.

collate_fn(x)[source]

The collate function for dataloader

Parameters

x (tuple) –

a batch of data that contains:

  • graph: dgl.DGLGraph

    The original graph

  • line_graph: dgl.DGLGraph

    The line graph of graph

  • graph_degree: numpy.ndarray

    In-degrees for each node in graph

  • line_graph_degree: numpy.ndarray

    In-degrees for each node in line_graph

  • pm_pd: numpy.ndarray

    Edge indicator matrices Pm and Pd

Returns

  • g_batch (dgl.DGLGraph) – Batched graphs

  • lg_batch (dgl.DGLGraph) – Batched line graphs

  • degg_batch (numpy.ndarray) – A batch of in-degrees for each node in g_batch

  • deglg_batch (numpy.ndarray) – A batch of in-degrees for each node in lg_batch

  • pm_pd_batch (numpy.ndarray) – A batch of edge indicator matrices Pm and Pd

Edge Prediction Datasets

DGL hosted datasets for edge classification/regression and link prediction tasks.

Knowledge graph dataset

class dgl.data.FB15k237Dataset(reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

FB15k237 link prediction dataset.

    Deprecated since version 0.5.0:
  • train is deprecated, it is replaced by:

    >>> dataset = FB15k237Dataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.edata['train_mask']
    >>> train_idx = th.nonzero(train_mask).squeeze()
    >>> src, dst = graph.find_edges(train_idx)
    >>> rel = graph.edata['etype'][train_idx]
    
  • valid is deprecated, it is replaced by:

    >>> dataset = FB15k237Dataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.edata['val_mask']
    >>> val_idx = th.nonzero(val_mask).squeeze()
    >>> src, dst = graph.find_edges(val_idx)
    >>> rel = graph.edata['etype'][val_idx]
    
  • test is deprecated, it is replaced by:

    >>> dataset = FB15k237Dataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.edata['test_mask']
    >>> test_idx = th.nonzero(test_mask).squeeze()
    >>> src, dst = graph.find_edges(test_idx)
    >>> rel = graph.edata['etype'][test_idx]
    

FB15k-237 is a subset of FB15k where inverse relations are removed. When creating the dataset, a reverse edge with a reversed relation type is created for each edge by default.

FB15k237 dataset statistics:

  • Nodes: 14541

  • Number of relation types: 237

  • Number of reversed relation types: 237

  • Label Split:

    • Train: 272115

    • Valid: 17535

    • Test: 20466

Parameters
  • reverse (bool) – Whether to add reverse edges. Default: True.

  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_nodes

Number of nodes

Type

int

num_rels

Number of relation types

Type

int

train

A numpy array of triplets (src, rel, dst) for the training graph

Type

numpy.ndarray

valid

A numpy array of triplets (src, rel, dst) for the validation graph

Type

numpy.ndarray

test

A numpy array of triplets (src, rel, dst) for the test graph

Type

numpy.ndarray

Examples

>>> import torch as th
>>> dataset = FB15k237Dataset()
>>> g = dataset[0]
>>> e_type = g.edata['e_type']
>>>
>>> # get data split
>>> train_mask = g.edata['train_mask']
>>> val_mask = g.edata['val_mask']
>>> test_mask = g.edata['test_mask']
>>>
>>> train_set = th.arange(g.number_of_edges())[train_mask]
>>> val_set = th.arange(g.number_of_edges())[val_mask]
>>>
>>> # build train_g
>>> train_edges = train_set
>>> train_g = g.edge_subgraph(train_edges,
                              preserve_nodes=True)
>>> train_g.edata['e_type'] = e_type[train_edges]
>>>
>>> # build val_g
>>> val_edges = th.cat([train_edges, val_set])
>>> val_g = g.edge_subgraph(val_edges,
                            preserve_nodes=True)
>>> val_g.edata['e_type'] = e_type[val_edges]
>>>
>>> # Train, Validation and Test
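
Complementing the deprecation notes above, here is a short sketch for recovering the training triplets (src, rel, dst) from the mask, assuming import torch as th; note these docs show the relation-type key as both 'e_type' and 'etype', so use whichever your DGL version provides:

>>> train_idx = th.nonzero(train_mask).squeeze()
>>> src, dst = g.find_edges(train_idx)
>>> rel = e_type[train_idx]
>>> triplets = th.stack([src, rel, dst], dim=1)  # one (src, rel, dst) row per edge
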
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, FB15k237Dataset has only one graph object

Returns

The graph contains

  • edata['e_type']: edge relation type

  • edata['train_edge_mask']: positive training edge mask

  • edata['val_edge_mask']: positive validation edge mask

  • edata['test_edge_mask']: positive testing edge mask

  • edata['train_mask']: training edge set mask (include reversed training edges)

  • edata['val_mask']: validation edge set mask (include reversed validation edges)

  • edata['test_mask']: testing edge set mask (include reversed testing edges)

  • ndata['ntype']: node type. All 0 in this dataset

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

class dgl.data.FB15kDataset(reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

FB15k link prediction dataset.

    Deprecated since version 0.5.0:
  • train is deprecated, it is replaced by:

    >>> dataset = FB15kDataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.edata['train_mask']
    >>> train_idx = th.nonzero(train_mask).squeeze()
    >>> src, dst = graph.find_edges(train_idx)
    >>> rel = graph.edata['etype'][train_idx]
    
  • valid is deprecated, it is replaced by:

    >>> dataset = FB15kDataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.edata['val_mask']
    >>> val_idx = th.nonzero(val_mask).squeeze()
    >>> src, dst = graph.find_edges(val_idx)
    >>> rel = graph.edata['etype'][val_idx]
    
  • test is deprecated, it is replaced by:

    >>> dataset = FB15kDataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.edata['test_mask']
    >>> test_idx = th.nonzero(test_mask).squeeze()
    >>> src, dst = graph.find_edges(test_idx)
    >>> rel = graph.edata['etype'][test_idx]
    

The FB15k dataset was introduced in Translating Embeddings for Modeling Multi-relational Data. It is a subset of Freebase that contains 14,951 entities with 1,345 different relations. When creating the dataset, a reverse edge with a reversed relation type is created for each edge by default.

FB15k dataset statistics:

  • Nodes: 14,951

  • Number of relation types: 1,345

  • Number of reversed relation types: 1,345

  • Label Split:

    • Train: 483142

    • Valid: 50000

    • Test: 59071

Parameters
  • reverse (bool) – Whether to add reverse edges. Default: True.

  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_nodes

Number of nodes

Type

int

num_rels

Number of relation types

Type

int

train

A numpy array of triplets (src, rel, dst) for the training graph

Type

numpy.ndarray

valid

A numpy array of triplets (src, rel, dst) for the validation graph

Type

numpy.ndarray

test

A numpy array of triplets (src, rel, dst) for the test graph

Type

numpy.ndarray

Examples

>>> import torch as th
>>> dataset = FB15kDataset()
>>> g = dataset[0]
>>> e_type = g.edata['e_type']
>>>
>>> # get data split
>>> train_mask = g.edata['train_mask']
>>> val_mask = g.edata['val_mask']
>>>
>>> train_set = th.arange(g.number_of_edges())[train_mask]
>>> val_set = th.arange(g.number_of_edges())[val_mask]
>>>
>>> # build train_g
>>> train_edges = train_set
>>> train_g = g.edge_subgraph(train_edges,
                              preserve_nodes=True)
>>> train_g.edata['e_type'] = e_type[train_edges]
>>>
>>> # build val_g
>>> val_edges = th.cat([train_edges, val_set])
>>> val_g = g.edge_subgraph(val_edges,
                            preserve_nodes=True)
>>> val_g.edata['e_type'] = e_type[val_edges]
>>>
>>> # Train, Validation and Test
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, FB15kDataset has only one graph object

Returns

The graph contains

  • edata['e_type']: edge relation type

  • edata['train_edge_mask']: positive training edge mask

  • edata['val_edge_mask']: positive validation edge mask

  • edata['test_edge_mask']: positive testing edge mask

  • edata['train_mask']: training edge set mask (include reversed training edges)

  • edata['val_mask']: validation edge set mask (include reversed validation edges)

  • edata['test_mask']: testing edge set mask (include reversed testing edges)

  • ndata['ntype']: node type. All 0 in this dataset

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

class dgl.data.WN18Dataset(reverse=True, raw_dir=None, force_reload=False, verbose=True)[source]

WN18 link prediction dataset.

    Deprecated since version 0.5.0:
  • train is deprecated, it is replaced by:

    >>> dataset = WN18Dataset()
    >>> graph = dataset[0]
    >>> train_mask = graph.edata['train_mask']
    >>> train_idx = th.nonzero(train_mask).squeeze()
    >>> src, dst = graph.find_edges(train_idx)
    >>> rel = graph.edata['etype'][train_idx]
    
  • valid is deprecated, it is replaced by:

    >>> dataset = WN18Dataset()
    >>> graph = dataset[0]
    >>> val_mask = graph.edata['val_mask']
    >>> val_idx = th.nonzero(val_mask).squeeze()
    >>> src, dst = graph.find_edges(val_idx)
    >>> rel = graph.edata['etype'][val_idx]
    
  • test is deprecated, it is replaced by:

    >>> dataset = WN18Dataset()
    >>> graph = dataset[0]
    >>> test_mask = graph.edata['test_mask']
    >>> test_idx = th.nonzero(test_mask).squeeze()
    >>> src, dst = graph.find_edges(test_idx)
    >>> rel = graph.edata['etype'][test_idx]
    

The WN18 dataset was introduced in Translating Embeddings for Modeling Multi-relational Data. It includes the full 18 relations scraped from WordNet for roughly 41,000 synsets. When creating the dataset, a reverse edge with a reversed relation type is created for each edge by default.

WN18 dataset statistics:

  • Nodes: 40943

  • Number of relation types: 18

  • Number of reversed relation types: 18

  • Label Split:

    • Train: 141442

    • Valid: 5000

    • Test: 5000

Parameters
  • reverse (bool) – Whether to add reverse edges. Default: True.

  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_nodes

Number of nodes

Type

int

num_rels

Number of relation types

Type

int

train

A numpy array of triplets (src, rel, dst) for the training graph

Type

numpy.ndarray

valid

A numpy array of triplets (src, rel, dst) for the validation graph

Type

numpy.ndarray

test

A numpy array of triplets (src, rel, dst) for the test graph

Type

numpy.ndarray

Examples

>>> import torch as th
>>> dataset = WN18Dataset()
>>> g = dataset[0]
>>> e_type = g.edata['e_type']
>>>
>>> # get data split
>>> train_mask = g.edata['train_mask']
>>> val_mask = g.edata['val_mask']
>>>
>>> train_set = th.arange(g.number_of_edges())[train_mask]
>>> val_set = th.arange(g.number_of_edges())[val_mask]
>>>
>>> # build train_g
>>> train_edges = train_set
>>> train_g = g.edge_subgraph(train_edges,
                              preserve_nodes=True)
>>> train_g.edata['e_type'] = e_type[train_edges]
>>>
>>> # build val_g
>>> val_edges = th.cat([train_edges, val_set])
>>> val_g = g.edge_subgraph(val_edges,
                            preserve_nodes=True)
>>> val_g.edata['e_type'] = e_type[val_edges]
>>>
>>> # Train, Validation and Test
__getitem__(idx)[source]

Gets the graph object

Parameters

idx (int) – Item index, WN18Dataset has only one graph object

Returns

The graph contains

  • edata['e_type']: edge relation type

  • edata['train_edge_mask']: positive training edge mask

  • edata['val_edge_mask']: positive validation edge mask

  • edata['test_edge_mask']: positive testing edge mask

  • edata['train_mask']: training edge set mask (include reversed training edges)

  • edata['val_mask']: validation edge set mask (include reversed validation edges)

  • edata['test_mask']: testing edge set mask (include reversed testing edges)

  • ndata['ntype']: node type. All 0 in this dataset

Return type

dgl.DGLGraph

__len__()[source]

The number of graphs in the dataset.

BitcoinOTC dataset

class dgl.data.BitcoinOTCDataset(raw_dir=None, force_reload=False, verbose=False)[source]

BitcoinOTC dataset for fraud detection

This is a who-trusts-whom network of people who trade using Bitcoin on a platform called Bitcoin OTC. Since Bitcoin users are anonymous, there is a need to maintain a record of users’ reputation to prevent transactions with fraudulent and risky users.

Official website: https://snap.stanford.edu/data/soc-sign-bitcoin-otc.html

Bitcoin OTC dataset statistics:

  • Nodes: 5,881

  • Edges: 35,592

  • Range of edge weight: -10 to +10

  • Percentage of positive edges: 89%

Parameters
  • raw_dir (str) – Directory that will store the downloaded raw data, or the directory that already contains the input data. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: False

graphs

A list of DGLGraph objects

Type

list

is_temporal

Indicate whether the graphs are temporal graphs

Type

bool

Raises

UserWarning – If the raw data is changed in the remote server by the author.

Examples

>>> dataset = BitcoinOTCDataset()
>>> len(dataset)
136
>>> for g in dataset:
...     # get edge feature
...     edge_weights = g.edata['h']
...     # your code here
>>>
__getitem__(item)[source]

Get graph by index

Parameters

item (int) – Item index

Returns

The graph contains:

  • edata['h'] : edge weights

Return type

dgl.DGLGraph

__len__()[source]

Number of graphs in the dataset.

Returns

Return type

int

ICEWS18 dataset

class dgl.data.ICEWS18Dataset(mode='train', raw_dir=None, force_reload=False, verbose=False)[source]

ICEWS18 dataset for temporal graph

Integrated Crisis Early Warning System (ICEWS18)

Event data consists of coded interactions between socio-political actors (i.e., cooperative or hostile actions between individuals, groups, sectors and nation states). This dataset consists of events from 1/1/2018 to 10/31/2018 (24-hour time granularity).

Reference:

Statistics:

  • Train examples: 240

  • Valid examples: 30

  • Test examples: 34

  • Nodes per graph: 23033

Parameters
  • mode (str) – Load train/valid/test data. Has to be one of [‘train’, ‘valid’, ‘test’]

  • raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

is_temporal

Whether the dataset contains temporal graphs

Type

bool

Examples

>>> # get train, valid, test set
>>> train_data = ICEWS18Dataset()
>>> valid_data = ICEWS18Dataset(mode='valid')
>>> test_data = ICEWS18Dataset(mode='test')
>>>
>>> train_size = len(train_data)
>>> for g in train_data:
...     e_feat = g.edata['rel_type']
...     # your code here
...
>>>
__getitem__(idx)[source]

Get graph by index

Parameters

idx (int) – Item index

Returns

The graph contains:

  • edata['rel_type']: edge type

Return type

dgl.DGLGraph

__len__()[source]

Number of graphs in the dataset.

Returns

Return type

int

GDELT dataset

class dgl.data.GDELTDataset(mode='train', raw_dir=None, force_reload=False, verbose=False)[source]

GDELT dataset for event-based temporal graph

The Global Database of Events, Language, and Tone (GDELT) dataset. This contains events that happened all over the world (e.g., every protest held anywhere in Russia on a given day is collapsed to a single entry). This dataset consists of events collected from 1/1/2018 to 1/31/2018 (15-minute time granularity).

Reference:

Statistics:

  • Train examples: 2,304

  • Valid examples: 288

  • Test examples: 384

Parameters
  • mode (str) – Must be one of (‘train’, ‘valid’, ‘test’). Default: ‘train’

  • raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

start_time

Start time of the temporal graph

Type

int

end_time

End time of the temporal graph

Type

int

is_temporal

Whether the dataset contains temporal graphs

Type

bool

Examples

>>> # get train, valid, test dataset
>>> train_data = GDELTDataset()
>>> valid_data = GDELTDataset(mode='valid')
>>> test_data = GDELTDataset(mode='test')
>>>
>>> # length of train set
>>> train_size = len(train_data)
>>>
>>> for g in train_data:
...     e_feat = g.edata['rel_type']
...     # your code here
...
>>>
__getitem__(t)[source]

Get the graph with events before time t + self.start_time

Parameters

t (int) – Time, its value must be in range [0, self.end_time - self.start_time]

Returns

The graph contains:

  • edata['rel_type']: edge type

Return type

dgl.DGLGraph
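A minimal sketch of time-indexed access; the valid range of t follows from the parameter description above, and start_time / end_time are the attributes documented for this class:

>>> data = GDELTDataset()
>>> span = data.end_time - data.start_time
>>> g_first = data[0]     # graph at offset 0 of the time range
>>> g_last = data[span]   # graph covering events up to the final timestamp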

__len__()[source]

Number of graphs in the dataset.

Returns

Return type

int

Graph Prediction Datasets

DGL hosted datasets for graph classification/regression tasks.

QM7b dataset

class dgl.data.QM7bDataset(raw_dir=None, force_reload=False, verbose=False)[source]

QM7b dataset for graph property prediction (regression)

This dataset consists of 7,211 molecules with 14 regression targets. Nodes correspond to atoms and edges to bonds. The edge feature ‘h’ holds the corresponding entry of the Coulomb matrix.

Reference: http://quantum-machine.org/datasets/

Statistics:

  • Number of graphs: 7,211

  • Number of regression targets: 14

  • Average number of nodes: 15

  • Average number of edges: 245

  • Edge feature size: 1

Parameters
  • raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

num_labels

Number of labels for each graph, i.e. number of prediction tasks

Type

int

Raises

UserWarning – If the raw data is changed in the remote server by the author.

Examples

>>> data = QM7bDataset()
>>> data.num_labels
14
>>>
>>> # iterate over the dataset
>>> for g, label in data:
...     edge_feat = g.edata['h']  # get edge feature
...     # your code here...
...
>>>
__getitem__(idx)[source]

Get graph and label by index

Parameters

idx (int) – Item index

Returns

Return type

(dgl.DGLGraph, Tensor)

__len__()[source]

Number of graphs in the dataset.

Returns

Return type

int

Mini graph classification dataset

class dgl.data.MiniGCDataset(num_graphs, min_num_v, max_num_v, seed=0, save_graph=True, force_reload=False, verbose=False)[source]

The synthetic graph classification dataset class.

The dataset contains 8 different types of graphs.

  • class 0 : cycle graph

  • class 1 : star graph

  • class 2 : wheel graph

  • class 3 : lollipop graph

  • class 4 : hypercube graph

  • class 5 : grid graph

  • class 6 : clique graph

  • class 7 : circular ladder graph

Parameters
  • num_graphs (int) – Number of graphs in this dataset.

  • min_num_v (int) – Minimum number of nodes for graphs

  • max_num_v (int) – Maximum number of nodes for graphs

  • seed (int, default is 0) – Random seed for data generation

num_graphs

Number of graphs

Type

int

min_num_v

The minimum number of nodes

Type

int

max_num_v

The maximum number of nodes

Type

int

num_classes

The number of classes

Type

int

Examples

>>> data = MiniGCDataset(100, 16, 32, seed=0)

The dataset instance is an iterable

>>> len(data)
100
>>> g, label = data[64]
>>> g
Graph(num_nodes=20, num_edges=82,
      ndata_schemes={}
      edata_schemes={})
>>> label
tensor(5)

Batch the graphs and labels for mini-batch training

>>> graphs, labels = zip(*[data[i] for i in range(16)])
>>> batched_graphs = dgl.batch(graphs)
>>> batched_labels = torch.tensor(labels)
>>> batched_graphs
Graph(num_nodes=356, num_edges=1060,
      ndata_schemes={}
      edata_schemes={})
__getitem__(idx)[source]

Get the idx-th sample.

Parameters

idx (int) – The sample index.

Returns

The graph and its label.

Return type

(dgl.DGLGraph, Tensor)

__len__()[source]

Return the number of graphs in the dataset.

TU dataset

class dgl.data.TUDataset(name, raw_dir=None, force_reload=False, verbose=False)[source]

TUDataset contains a collection of graph kernel datasets for graph classification.

Parameters

name (str) – Dataset name, such as ENZYMES, DD, COLLAB or MUTAG; it can be any dataset name listed on https://chrsmrrs.github.io/datasets/docs/datasets/.

max_num_node

Maximum number of nodes

Type

int

num_labels

Number of classes

Type

int

Examples

>>> data = TUDataset('DD')

The dataset instance is an iterable

>>> len(data)
1178
>>> g, label = data[1024]
>>> g
Graph(num_nodes=88, num_edges=410,
      ndata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), 'node_labels': Scheme(shape=(1,), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})
>>> label
tensor([1])

Batch the graphs and labels for mini-batch training

>>> graphs, labels = zip(*[data[i] for i in range(16)])
>>> batched_graphs = dgl.batch(graphs)
>>> batched_labels = torch.tensor(labels)
>>> batched_graphs
Graph(num_nodes=9539, num_edges=47382,
      ndata_schemes={'node_labels': Scheme(shape=(1,), dtype=torch.int64), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})

Notes

Graphs may have node labels, node attributes, edge labels, and edge attributes, varying across datasets.

Labels are mapped to \(\lbrace 0,\cdots,n-1 \rbrace\) where \(n\) is the number of labels (some datasets have raw labels \(\lbrace -1, 1 \rbrace\) which will be mapped to \(\lbrace 0, 1 \rbrace\)). In previous versions, the minimum label was added so that \(\lbrace -1, 1 \rbrace\) was mapped to \(\lbrace 0, 2 \rbrace\).
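The mapping can be checked directly; a short sketch, assuming torch is imported as in the example above (the choice of MUTAG is illustrative):

>>> data = TUDataset('MUTAG')
>>> labels = torch.cat([label for _, label in data])
>>> labels.unique()  # mapped labels are contiguous integers starting at 0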

__getitem__(idx)[source]

Get the idx-th sample.

Parameters

idx (int) – The sample index.

Returns

The graph, with node features stored in the feat field and node labels in node_label if available, and its label.

Return type

(dgl.DGLGraph, Tensor)

__len__()[source]

Return the number of graphs in the dataset.

class dgl.data.LegacyTUDataset(name, use_pandas=False, hidden_size=10, max_allow_node=None, raw_dir=None, force_reload=False, verbose=False)[source]

LegacyTUDataset contains a collection of graph kernel datasets for graph classification.

Parameters
  • name (str) –

Dataset name, such as ENZYMES, DD, COLLAB or MUTAG; it can be any dataset name listed on https://chrsmrrs.github.io/datasets/docs/datasets/.

  • use_pandas (bool) – NumPy’s file reading can be slow for large files; setting use_pandas to True uses pandas instead, which can be faster. Default: False

  • hidden_size (int) – Some datasets don’t contain node features; in that case, constant node features of size hidden_size are used instead. Default: 10

  • max_allow_node (int) – Remove graphs that contain more nodes than max_allow_node. Default: None

max_num_node

Maximum number of nodes

Type

int

num_labels

Number of classes

Type

int

Examples

>>> data = LegacyTUDataset('DD')

The dataset instance is an iterable

>>> len(data)
1178
>>> g, label = data[1024]
>>> g
Graph(num_nodes=88, num_edges=410,
      ndata_schemes={'feat': Scheme(shape=(89,), dtype=torch.float64), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})
>>> label
tensor(1)

Batch the graphs and labels for mini-batch training

>>> graphs, labels = zip(*[data[i] for i in range(16)])
>>> batched_graphs = dgl.batch(graphs)
>>> batched_labels = torch.tensor(labels)
>>> batched_graphs
Graph(num_nodes=9539, num_edges=47382,
      ndata_schemes={'feat': Scheme(shape=(89,), dtype=torch.float64), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})

Notes

LegacyTUDataset uses the provided node features by default. If no node features are provided, it uses one-hot node labels instead. If neither node features nor node labels are provided, it uses constant node features.
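A sketch of the constant-feature fallback; the choice of IMDB-BINARY as a dataset shipping without node features or labels is an assumption:

>>> data = LegacyTUDataset('IMDB-BINARY', hidden_size=10)  # hypothetical dataset choice
>>> g, _ = data[0]
>>> g.ndata['feat'].shape  # if the fallback applies, features have hidden_size columns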

__getitem__(idx)[source]

Get the idx-th sample.

Parameters

idx (int) – The sample index.

Returns

The graph, with node features stored in the feat field and node labels in node_label if available, and its label.

Return type

(dgl.DGLGraph, Tensor)

__len__()[source]

Return the number of graphs in the dataset.

Graph isomorphism network dataset

class dgl.data.GINDataset[source]

A compact subset of the graph kernel datasets.

Members: __getitem__, __len__
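A minimal usage sketch; the constructor arguments shown (name, self_loop) are assumptions modeled on the other graph-kernel dataset classes in this section, so check the GINDataset signature before relying on them:

>>> data = GINDataset(name='MUTAG', self_loop=True)  # hypothetical arguments
>>> g, label = data[0]
>>> len(data)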

Utilities

utils.get_download_dir()

Get the absolute path to the download directory.

utils.download(url[, path, overwrite, …])

Download a given URL.

utils.check_sha1(filename, sha1_hash)

Check whether the sha1 hash of the file content matches the expected hash.

utils.extract_archive(file, target_dir[, …])

Extract archive file.

utils.split_dataset(dataset[, frac_list, …])

Split a dataset into training, validation and test sets.

utils.load_labels(filename)

Load a label dict from a file.

utils.save_info(path, info)

Save dataset-related information to disk.

utils.load_info(path)

Load dataset-related information from disk.
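A minimal sketch combining these helpers, assuming data is a dataset instance such as the QM7bDataset from above; the keyword names follow the signatures listed in this table:

>>> from dgl.data.utils import split_dataset, save_info, load_info
>>>
>>> # split into 80/10/10 train/val/test subsets
>>> train_set, val_set, test_set = split_dataset(data, frac_list=[0.8, 0.1, 0.1])
>>>
>>> # persist and reload extra metadata
>>> save_info('info.pkl', {'num_classes': data.num_labels})
>>> load_info('info.pkl')['num_classes']
14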

class dgl.data.utils.Subset(dataset, indices)[source]

Subset of a dataset at specified indices

Code adapted from PyTorch.

Parameters
  • dataset – dataset[i] should return the ith datapoint

  • indices (list) – List of datapoint indices to construct the subset

__getitem__(item)[source]

Get the datapoint indexed by item

Returns

datapoint

Return type

tuple

__len__()[source]

Get subset size

Returns

Number of datapoints in the subset

Return type

int
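A minimal usage sketch, reusing MiniGCDataset from the examples above:

>>> from dgl.data.utils import Subset
>>> data = MiniGCDataset(100, 16, 32)
>>> sub = Subset(data, [0, 2, 4])  # keep only the listed indices
>>> len(sub)
3
>>> g, label = sub[0]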