{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n.. currentmodule:: dgl\n\nWorking with Heterogeneous Graphs in DGL\n========================================\n\n**Author**: Quan Gan, Minjie Wang _, Mufei Li,\nGeorge Karypis, Zheng Zhang\n\nIn this tutorial, you learn about:\n\n* Examples of heterogenous graph data and typical applications.\n\n* Creating and manipulating a heterogenous graph in DGL.\n\n* Implementing Relational-GCN _, a popular GNN model,\n for heterogenous graph input.\n\n* Training a model to solve a node classification task.\n\nHeterogeneous graphs, or *heterographs* for short, are graphs that contain\ndifferent types of nodes and edges. The different types of nodes and edges tend\nto have different types of attributes that are designed to capture the\ncharacteristics of each node and edge type. Within the context of\ngraph neural networks, depending on their complexity, certain node and edge types\nmight need to be modeled with representations that have a different number of dimensions.\n\nDGL supports graph neural network computations on such heterogeneous graphs, by\nusing the heterograph class and its associated API.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examples of heterographs\n-----------------------\nMany graph datasets represent relationships among various types of entities.\nThis section provides an overview for several graph use-cases that show such relationships \nand can have their data represented as heterographs.\n\nCitation graph \n~~~~~~~~~~~~~~~\nThe Association for Computing Machinery publishes an ACM dataset _ that contains two\nmillion papers, their authors, publication venues, and the other papers\nthat were cited. This information can be represented as a heterogeneous graph.\n\nThe following diagram shows several entities in the ACM dataset and the relationships among them \n(taken from Shi et al., 2015 _).\n\n.. figure:: https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/hetero/acm-example.png# \n\nThis graph has three types of entities that correspond to papers, authors, and publication venues.\nIt also contains three types of edges that connect the following:\n\n* Authors with papers corresponding to *written-by* relationships\n\n* Papers with publication venues corresponding to *published-in* relationships\n\n* Papers with other papers corresponding to *cited-by* relationships\n\n\nRecommender systems \n~~~~~~~~~~~~~~~~~~~~ \nThe datasets used in recommender systems often contain\ninteractions between users and items. For example, the data could include the\nratings that users have provided to movies. Such interactions can be modeled\nas heterographs.\n\nThe nodes in these heterographs will have two types, *users* and *movies*. The edges\nwill correspond to the user-movie interactions. Furthermore, if an interaction is\nmarked with a rating, then each rating value could correspond to a different edge type.\nThe following diagram shows an example of user-item interactions as a heterograph.\n\n.. figure:: https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/hetero/recsys-example.png\n\n\nKnowledge graph \n~~~~~~~~~~~~~~~~\nKnowledge graphs are inherently heterogenous. For example, in\nWikidata, Barack Obama (item Q76) is an instance of a human, which could be viewed as\nthe entity class, whose spouse (item P26) is Michelle Obama (item Q13133) and\noccupation (item P106) is politician (item Q82955). The relationships are shown in the following.\ndiagram.\n\n.. figure:: https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/hetero/kg-example.png\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating a heterograph in DGL\n-----------------------------\nYou can create a heterograph in DGL using the :func:dgl.heterograph API.\nThe argument to :func:dgl.heterograph is a dictionary. The keys are tuples\nin the form of (srctype, edgetype, dsttype) specifying the relation name\nand the two entity types it connects. Such tuples are called *canonical edge\ntypes*. The values are data to initialize the graph structures, that is, which\nnodes the edges actually connect.\n\nFor instance, the following code creates the user-item interactions heterograph shown earlier.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Each value of the dictionary is a list of edge tuples.\n# Nodes are integer IDs starting from zero. Nodes IDs of different types have\n# separate countings.\nimport dgl\n\nratings = dgl.heterograph(\n {('user', '+1', 'movie') : [(0, 0), (0, 1), (1, 0)],\n ('user', '-1', 'movie') : [(2, 1)]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DGL supports creating a graph from a variety of data sources. The following\ncode creates the same graph as the above.\n\nCreating from scipy matrix\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import scipy.sparse as sp\nplus1 = sp.coo_matrix(([1, 1, 1], ([0, 0, 1], [0, 1, 0])), shape=(3, 2))\nminus1 = sp.coo_matrix((, (, )), shape=(3, 2))\nratings = dgl.heterograph(\n {('user', '+1', 'movie') : plus1,\n ('user', '-1', 'movie') : minus1})\n\n# Creating from networkx graph\nimport networkx as nx\nplus1 = nx.DiGraph()\nplus1.add_nodes_from(['u0', 'u1', 'u2'], bipartite=0)\nplus1.add_nodes_from(['m0', 'm1'], bipartite=1)\nplus1.add_edges_from([('u0', 'm0'), ('u0', 'm1'), ('u1', 'm0')])\n# To simplify the example, reuse the minus1 object.\n# This also means that you could use different sources of graph data\n# for different relationships.\nratings = dgl.heterograph(\n {('user', '+1', 'movie') : plus1,\n ('user', '-1', 'movie') : minus1})\n\n# Creating from edge indices\nratings = dgl.heterograph(\n {('user', '+1', 'movie') : ([0, 0, 1], [0, 1, 0]),\n ('user', '-1', 'movie') : (, )})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Manipulating heterograph\n------------------------\nYou can create a more realistic heterograph using the ACM dataset. To do this, first \ndownload the dataset as follows:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import scipy.io\nimport urllib.request\n\ndata_url = 'https://s3.us-east-2.amazonaws.com/dgl.ai/dataset/ACM.mat'\ndata_file_path = '/tmp/ACM.mat'\n\nurllib.request.urlretrieve(data_url, data_file_path)\ndata = scipy.io.loadmat(data_file_path)\nprint(list(data.keys()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset stores node information by their types: P for paper, A\nfor author, C for conference, L for subject code, and so on. The relationships\nare stored as SciPy sparse matrix under key XvsY, where X and Y\ncould be any of the node type code.\n\nThe following code prints out some statistics about the paper-author relationships.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(type(data['PvsA']))\nprint('#Papers:', data['PvsA'].shape)\nprint('#Authors:', data['PvsA'].shape)\nprint('#Links:', data['PvsA'].nnz)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting this SciPy matrix to a heterograph in DGL is straightforward.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pa_g = dgl.heterograph({('paper', 'written-by', 'author') : data['PvsA']})\n# equivalent (shorter) API for creating heterograph with two node types:\npa_g = dgl.bipartite(data['PvsA'], 'paper', 'written-by', 'author')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can easily print out the type names and other structural information.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('Node types:', pa_g.ntypes)\nprint('Edge types:', pa_g.etypes)\nprint('Canonical edge types:', pa_g.canonical_etypes)\n\n# Nodes and edges are assigned integer IDs starting from zero and each type has its own counting.\n# To distinguish the nodes and edges of different types, specify the type name as the argument.\nprint(pa_g.number_of_nodes('paper'))\n# Canonical edge type name can be shortened to only one edge type name if it is\n# uniquely distinguishable.\nprint(pa_g.number_of_edges(('paper', 'written-by', 'author')))\nprint(pa_g.number_of_edges('written-by'))\nprint(pa_g.successors(1, etype='written-by')) # get the authors that write paper #1\n\n# Type name argument could be omitted whenever the behavior is unambiguous.\nprint(pa_g.number_of_edges()) # Only one edge type, the edge type argument could be omitted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A homogeneous graph is just a special case of a heterograph with only one type\nof node and edge. In this case, all the APIs are exactly the same as in\n:class:DGLGraph.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Paper-citing-paper graph is a homogeneous graph\npp_g = dgl.heterograph({('paper', 'citing', 'paper') : data['PvsP']})\n# equivalent (shorter) API for creating homogeneous graph\npp_g = dgl.graph(data['PvsP'], 'paper', 'cite')\n\n# All the ntype and etype arguments could be omitted because the behavior is unambiguous.\nprint(pp_g.number_of_nodes())\nprint(pp_g.number_of_edges())\nprint(pp_g.successors(3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a subset of the ACM graph using the paper-author, paper-paper, \nand paper-subject relationships. Meanwhile, also add the reverse\nrelationship to prepare for the later sections.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "G = dgl.heterograph({\n ('paper', 'written-by', 'author') : data['PvsA'],\n ('author', 'writing', 'paper') : data['PvsA'].transpose(),\n ('paper', 'citing', 'paper') : data['PvsP'],\n ('paper', 'cited', 'paper') : data['PvsP'].transpose(),\n ('paper', 'is-about', 'subject') : data['PvsL'],\n ('subject', 'has', 'paper') : data['PvsL'].transpose(),\n })\n\nprint(G)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Metagraph** (or network schema) is a useful summary of a heterograph.\nServing as a template for a heterograph, it tells how many types of objects\nexist in the network and where the possible links exist.\n\nDGL provides easy access to the metagraph, which could be visualized using\nexternal tools.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Draw the metagraph using graphviz.\nimport pygraphviz as pgv\ndef plot_graph(nxg):\n ag = pgv.AGraph(strict=False, directed=True)\n for u, v, k in nxg.edges(keys=True):\n ag.add_edge(u, v, label=k)\n ag.layout('dot')\n ag.draw('graph.png')\n\nplot_graph(G.metagraph)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Learning tasks associated with heterographs\n-------------------------------------------\nSome of the typical learning tasks that involve heterographs include:\n\n* *Node classification and regression* to predict the class of each node or\n estimate a value associated with it.\n\n* *Link prediction* to predict if there is an edge of a certain\n type between a pair of nodes, or predict which other nodes a particular\n node is connected with (and optionally the edge types of such connections).\n\n* *Graph classification/regression* to assign an entire\n heterograph into one of the target classes or to estimate a numerical\n value associated with it.\n\nIn this tutorial, we designed a simple example for the first task.\n\nA semi-supervised node classification example\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nOur goal is to predict the publishing conference of a paper using the ACM\nacademic graph we just created. To further simplify the task, we only focus\non papers published in three conferences: *KDD*, *ICML*, and *VLDB*. All\nthe other papers are not labeled, making it a semi-supervised setting.\n\nThe following code extracts those papers from the raw dataset and prepares \nthe training, validation, testing split.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\npvc = data['PvsC'].tocsr()\n# find all papers published in KDD, ICML, VLDB\nc_selected = [0, 11, 13] # KDD, ICML, VLDB\np_selected = pvc[:, c_selected].tocoo()\n# generate labels\nlabels = pvc.indices\nlabels[labels == 11] = 1\nlabels[labels == 13] = 2\nlabels = torch.tensor(labels).long()\n\n# generate train/val/test split\npid = p_selected.row\nshuffle = np.random.permutation(pid)\ntrain_idx = torch.tensor(shuffle[0:800]).long()\nval_idx = torch.tensor(shuffle[800:900]).long()\ntest_idx = torch.tensor(shuffle[900:]).long()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Relational-GCN on heterograph\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nWe use Relational-GCN _ to learn the\nrepresentation of nodes in the graph. Its message-passing equation is as\nfollows:\n\n\\begin{align}h_i^{(l+1)} = \\sigma\\left(\\sum_{r\\in \\mathcal{R}}\n \\sum_{j\\in\\mathcal{N}_r(i)}W_r^{(l)}h_j^{(l)}\\right)\\end{align}\n\nBreaking down the equation, you see that there are two parts in the\ncomputation.\n\n(i) Message computation and aggregation within each relation $r$\n\n(ii) Reduction that merges the results from multiple relationships\n\nFollowing this intuition, perform message passing on a heterograph in\ntwo steps.\n\n(i) Per-edge-type message passing\n\n(ii) Type wise reduction\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import dgl.function as fn\n\nclass HeteroRGCNLayer(nn.Module):\n def __init__(self, in_size, out_size, etypes):\n super(HeteroRGCNLayer, self).__init__()\n # W_r for each relation\n self.weight = nn.ModuleDict({\n name : nn.Linear(in_size, out_size) for name in etypes\n })\n\n def forward(self, G, feat_dict):\n # The input is a dictionary of node features for each type\n funcs = {}\n for srctype, etype, dsttype in G.canonical_etypes:\n # Compute W_r * h\n Wh = self.weight[etype](feat_dict[srctype])\n # Save it in graph for message passing\n G.nodes[srctype].data['Wh_%s' % etype] = Wh\n # Specify per-relation message passing functions: (message_func, reduce_func).\n # Note that the results are saved to the same destination feature 'h', which\n # hints the type wise reducer for aggregation.\n funcs[etype] = (fn.copy_u('Wh_%s' % etype, 'm'), fn.mean('m', 'h'))\n # Trigger message passing of multiple types.\n # The first argument is the message passing functions for each relation.\n # The second one is the type wise reducer, could be \"sum\", \"max\",\n # \"min\", \"mean\", \"stack\"\n G.multi_update_all(funcs, 'sum')\n # return the updated node feature dictionary\n return {ntype : G.nodes[ntype].data['h'] for ntype in G.ntypes}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a simple GNN by stacking two HeteroRGCNLayer. Since the\nnodes do not have input features, make their embeddings trainable.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class HeteroRGCN(nn.Module):\n def __init__(self, G, in_size, hidden_size, out_size):\n super(HeteroRGCN, self).__init__()\n # Use trainable node embeddings as featureless inputs.\n embed_dict = {ntype : nn.Parameter(torch.Tensor(G.number_of_nodes(ntype), in_size))\n for ntype in G.ntypes}\n for key, embed in embed_dict.items():\n nn.init.xavier_uniform_(embed)\n self.embed = nn.ParameterDict(embed_dict)\n # create layers\n self.layer1 = HeteroRGCNLayer(in_size, hidden_size, G.etypes)\n self.layer2 = HeteroRGCNLayer(hidden_size, out_size, G.etypes)\n\n def forward(self, G):\n h_dict = self.layer1(G, self.embed)\n h_dict = {k : F.leaky_relu(h) for k, h in h_dict.items()}\n h_dict = self.layer2(G, h_dict)\n # get paper logits\n return h_dict['paper']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train and evaluate\n~~~~~~~~~~~~~~~~~~\nTrain and evaluate this network.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create the model. The output has three logits for three classes.\nmodel = HeteroRGCN(G, 10, 10, 3)\n\nopt = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)\n\nbest_val_acc = 0\nbest_test_acc = 0\n\nfor epoch in range(100):\n logits = model(G)\n # The loss is computed only for labeled nodes.\n loss = F.cross_entropy(logits[train_idx], labels[train_idx])\n\n pred = logits.argmax(1)\n train_acc = (pred[train_idx] == labels[train_idx]).float().mean()\n val_acc = (pred[val_idx] == labels[val_idx]).float().mean()\n test_acc = (pred[test_idx] == labels[test_idx]).float().mean()\n\n if best_val_acc < val_acc:\n best_val_acc = val_acc\n best_test_acc = test_acc\n\n opt.zero_grad()\n loss.backward()\n opt.step()\n\n if epoch % 5 == 0:\n print('Loss %.4f, Train Acc %.4f, Val Acc %.4f (Best %.4f), Test Acc %.4f (Best %.4f)' % (\n loss.item(),\n train_acc.item(),\n val_acc.item(),\n best_val_acc.item(),\n test_acc.item(),\n best_test_acc.item(),\n ))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's next?\n------------\n* Check out our full implementation in PyTorch\n here _.\n\n* We also provide the following model examples:\n\n * Graph Convolutional Matrix Completion _,\n which we implement in MXNet\n here _.\n\n * Heterogeneous Graph Attention Network _\n requires transforming a heterograph into a homogeneous graph according to\n a given metapath (i.e. a path template consisting of edge types). We\n provide :func:dgl.transform.metapath_reachable_graph to do this. See full\n implementation\n here _.\n\n * Metapath2vec _ requires\n generating random walk paths according to a given metapath. Please\n refer to the full metapath2vec implementation\n here _.\n\n* :doc:Full heterograph API reference <../../api/python/heterograph>.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 0 }