OnDiskDataset

class dgl.graphbolt.OnDiskDataset(path: str, include_original_edge_id: bool = False)

Bases: dgl.graphbolt.dataset.Dataset

An on-disk dataset that reads the graph topology, feature data, and the training, validation, and test sets from disk.

Because memory is limited, data that is too large to fit into RAM remains on disk, while the rest is loaded into RAM once OnDiskDataset is initialized. Users can control this behavior through the in_memory field in the YAML file. All paths in the YAML file are relative to the dataset directory.

A full example of the YAML file follows:

dataset_name: graphbolt_test
graph:
  nodes:
    - type: paper # could be omitted for homogeneous graph.
      num: 1000
    - type: author
      num: 1000
  edges:
    - type: author:writes:paper # could be omitted for homogeneous graph.
      format: csv # Can be csv only.
      path: edge_data/author-writes-paper.csv
    - type: paper:cites:paper
      format: csv
      path: edge_data/paper-cites-paper.csv
feature_data:
  - domain: node
    type: paper # could be omitted for homogeneous graph.
    name: feat
    format: numpy
    in_memory: false # If not specified, default to true.
    path: node_data/paper-feat.npy
  - domain: edge
    type: "author:writes:paper"
    name: feat
    format: numpy
    in_memory: false
    path: edge_data/author-writes-paper-feat.npy
tasks:
  - name: "edge_classification"
    num_classes: 10
    train_set:
      - type: paper # could be omitted for homogeneous graph.
        data: # multiple data sources could be specified.
          - name: node_pairs
            format: numpy # Can be numpy or torch.
            in_memory: true # If not specified, default to true.
            path: set/paper-train-node_pairs.npy
          - name: labels
            format: numpy
            path: set/paper-train-labels.npy
    validation_set:
      - type: paper
        data:
          - name: node_pairs
            format: numpy
            path: set/paper-validation-node_pairs.npy
          - name: labels
            format: numpy
            path: set/paper-validation-labels.npy
    test_set:
      - type: paper
        data:
          - name: node_pairs
            format: numpy
            path: set/paper-test-node_pairs.npy
          - name: labels
            format: numpy
            path: set/paper-test-labels.npy
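
To make the in_memory default and the relative-path rule concrete, the following is a minimal sketch in plain Python. It is not GraphBolt's implementation. The dict literal stands in for an already-parsed feature_data section, the helper plan_feature_loading and the directory /data/graphbolt_test are hypothetical, and the second entry omits in_memory on purpose (unlike the YAML above) to show the documented default.

```python
from pathlib import Path

# Hypothetical, already-parsed copy of the feature_data section above.
# The second entry deliberately omits in_memory to exercise the default.
feature_data = [
    {
        "domain": "node",
        "type": "paper",
        "name": "feat",
        "format": "numpy",
        "in_memory": False,
        "path": "node_data/paper-feat.npy",
    },
    {
        "domain": "edge",
        "type": "author:writes:paper",
        "name": "feat",
        "format": "numpy",
        "path": "edge_data/author-writes-paper-feat.npy",
    },
]


def plan_feature_loading(dataset_dir, entries):
    """Resolve each relative path and record whether the data would sit in RAM."""
    root = Path(dataset_dir)
    plan = []
    for entry in entries:
        plan.append(
            {
                # "type" may be absent for a homogeneous graph.
                "key": (entry["domain"], entry.get("type"), entry["name"]),
                # Paths in the YAML file are relative to the dataset directory.
                "path": root / entry["path"],
                # Documented default: in_memory is true when not specified.
                "in_memory": entry.get("in_memory", True),
            }
        )
    return plan


# "/data/graphbolt_test" is an assumed dataset directory, used only for illustration.
plan = plan_feature_loading("/data/graphbolt_test", feature_data)
for item in plan:
    location = "RAM" if item["in_memory"] else "disk"
    print(item["key"], "->", item["path"], f"({location})")
```

In this sketch the paper node features stay on disk and the edge features are marked for RAM, which mirrors how the in_memory field is described above.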

Parameters

path (str) – The YAML file path.
include_original_edge_id (bool, optional) – Whether to include the original edge id in the FusedCSCSamplingGraph.

load(): Load the dataset.

property all_nodes_set: Return the itemset containing all nodes.

property dataset_name: Return the dataset name.

property feature: Return the feature.

property graph: Return the graph.

property tasks: Return the tasks.

property yaml_data: Return the YAML data.
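
The members listed above could be exercised roughly as follows. This is a sketch that assumes DGL with GraphBolt is installed. The helper names load_on_disk_dataset and summarize are hypothetical, and only the constructor, load(), and the properties documented on this page are used.

```python
def load_on_disk_dataset(path, include_original_edge_id=False):
    """Construct an OnDiskDataset, load it, and return it."""
    # Imported inside the function so the sketch can be imported without DGL installed.
    import dgl.graphbolt as gb

    dataset = gb.OnDiskDataset(
        path, include_original_edge_id=include_original_edge_id
    )
    # load() is called as its own statement so the sketch does not rely on
    # whatever it returns.
    dataset.load()
    return dataset


def summarize(dataset):
    """Collect the documented properties into a plain dict for inspection."""
    return {
        "dataset_name": dataset.dataset_name,
        "graph": dataset.graph,
        "feature": dataset.feature,
        "tasks": dataset.tasks,
        "all_nodes_set": dataset.all_nodes_set,
        "yaml_data": dataset.yaml_data,
    }
```

A call such as summarize(load_on_disk_dataset(path)), with path being the argument described under Parameters, would then return a dict holding the dataset name, graph, feature, tasks, node itemset, and YAML data.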