SSTDataset

class dgl.data.SSTDataset(mode='train', glove_embed_file=None, vocab_file=None, raw_dir=None, force_reload=False, verbose=False, transform=None)[source]

Bases: dgl.data.dgl_dataset.DGLBuiltinDataset

Stanford Sentiment Treebank dataset.

    Deprecated since version 0.5.0:
  • trees is deprecated, it is replaced by:

    >>> dataset = SSTDataset()
    >>> for tree in dataset:
    ....    # your code here
    
  • num_vocabs is deprecated, it is replaced by vocab_size.

Each sample is the constituency tree of a sentence. The leaf nodes represent words. The word is a int value stored in the x feature field. The non-leaf node has a special value PAD_WORD in the x field. Each node also has a sentiment annotation: 5 classes (very negative, negative, neutral, positive and very positive). The sentiment label is a int value stored in the y feature field. Official site: http://nlp.stanford.edu/sentiment/index.html

Statistics:

  • Train examples: 8,544

  • Dev examples: 1,101

  • Test examples: 2,210

  • Number of classes for each node: 5

Parameters
  • mode (str, optional) – Should be one of [‘train’, ‘dev’, ‘test’, ‘tiny’] Default: train

  • glove_embed_file (str, optional) – The path to pretrained glove embedding file. Default: None

  • vocab_file (str, optional) – Optional vocabulary file. If not given, the default vacabulary file is used. Default: None

  • raw_dir (str) – Raw file directory to download/contains the input data directory. Default: ~/.dgl/

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.

vocab

Vocabulary of the dataset

Type

OrderedDict

trees

A list of DGLGraph objects

Type

list

num_classes

Number of classes for each node

Type

int

pretrained_emb

Pretrained glove embedding with respect the vocabulary.

Type

Tensor

vocab_size

The size of the vocabulary

Type

int

num_vocabs

The size of the vocabulary

Type

int

Notes

All the samples will be loaded and preprocessed in the memory first.

Examples

>>> # get dataset
>>> train_data = SSTDataset()
>>> dev_data = SSTDataset(mode='dev')
>>> test_data = SSTDataset(mode='test')
>>> tiny_data = SSTDataset(mode='tiny')
>>>
>>> len(train_data)
8544
>>> train_data.num_classes
5
>>> glove_embed = train_data.pretrained_emb
>>> train_data.vocab_size
19536
>>> train_data[0]
Graph(num_nodes=71, num_edges=70,
  ndata_schemes={'x': Scheme(shape=(), dtype=torch.int64), 'y': Scheme(shape=(), dtype=torch.int64), 'mask': Scheme(shape=(), dtype=torch.int64)}
  edata_schemes={})
>>> for tree in train_data:
...     input_ids = tree.ndata['x']
...     labels = tree.ndata['y']
...     mask = tree.ndata['mask']
...     # your code here
__getitem__(idx)[source]

Get graph by index

Parameters

idx (int) –

Returns

graph structure, word id for each node, node labels and masks.

  • ndata['x']: word id of the node

  • ndata['y']: label of the node

  • ndata['mask']: 1 if the node is a leaf, otherwise 0

Return type

dgl.DGLGraph

__len__()[source]

Number of graphs in the dataset.