{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\nLink Prediction using Graph Neural Networks\n===========================================\n\nIn the :doc:introduction <1_introduction>, you have already learned\nthe basic workflow of using GNNs for node classification,\ni.e.\u00a0predicting the category of a node in a graph. This tutorial will\nteach you how to train a GNN for link prediction, i.e.\u00a0predicting the\nexistence of an edge between two arbitrary nodes in a graph.\n\nBy the end of this tutorial you will be able to\n\n- Build a GNN-based link prediction model.\n- Train and evaluate the model on a small DGL-provided dataset.\n\n(Time estimate: 28 minutes)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import dgl\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport itertools\nimport numpy as np\nimport scipy.sparse as sp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overview of Link Prediction with GNN\n------------------------------------\n\nMany applications such as social recommendation, item recommendation,\nknowledge graph completion, etc., can be formulated as link prediction,\nwhich predicts whether an edge exists between two particular nodes. This\ntutorial shows an example of predicting whether a citation relationship,\neither citing or being cited, between two papers exists in a citation\nnetwork.\n\nThis tutorial formulates the link prediction problem as a binary classification\nproblem as follows:\n\n- Treat the edges in the graph as *positive examples*.\n- Sample a number of non-existent edges (i.e.\u00a0node pairs with no edges\n between them) as *negative* examples.\n- Divide the positive examples and negative examples into a training\n set and a test set.\n- Evaluate the model with any binary classification metric such as Area\n Under Curve (AUC).\n\n

#### Note

The practice comes from\n SEAL __,\n although the model here does not use their idea of node labeling.

\n\nIn some domains such as large-scale recommender systems or information\nretrieval, you may favor metrics that emphasize good performance of\ntop-K predictions. In these cases you may want to consider other metrics\nsuch as mean average precision, and use other negative sampling methods,\nwhich are beyond the scope of this tutorial.\n\nLoading graph and features\n--------------------------\n\nFollowing the :doc:introduction <1_introduction>, this tutorial\nfirst loads the Cora dataset.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import dgl.data\n\ndataset = dgl.data.CoraGraphDataset()\ng = dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Prepare training and testing sets\n---------------------------------\n\nThis tutorial randomly picks 10% of the edges for positive examples in\nthe test set, and leave the rest for the training set. It then samples\nthe same number of edges for negative examples in both sets.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Split edge set for training and testing\nu, v = g.edges()\n\neids = np.arange(g.number_of_edges())\neids = np.random.permutation(eids)\ntest_size = int(len(eids) * 0.1)\ntrain_size = g.number_of_edges() - test_size\ntest_pos_u, test_pos_v = u[eids[:test_size]], v[eids[:test_size]]\ntrain_pos_u, train_pos_v = u[eids[test_size:]], v[eids[test_size:]]\n\n# Find all negative edges and split them for training and testing\nadj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy())))\nadj_neg = 1 - adj.todense() - np.eye(g.number_of_nodes())\nneg_u, neg_v = np.where(adj_neg != 0)\n\nneg_eids = np.random.choice(len(neg_u), g.number_of_edges())\ntest_neg_u, test_neg_v = neg_u[neg_eids[:test_size]], neg_v[neg_eids[:test_size]]\ntrain_neg_u, train_neg_v = neg_u[neg_eids[train_size:]], neg_v[neg_eids[train_size:]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When training, you will need to remove the edges in the test set from\nthe original graph. You can do this via dgl.remove_edges.\n\n

#### Note

dgl.remove_edges works by creating a subgraph from the\n original graph, resulting in a copy and therefore could be slow for\n large graphs. If so, you could save the training and test graph to\n disk, as you would do for preprocessing.

\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train_g = dgl.remove_edges(g, eids[:test_size])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a GraphSAGE model\n------------------------\n\nThis tutorial builds a model consisting of two\nGraphSAGE __ layers, each computes\nnew node representations by averaging neighbor information. DGL provides\ndgl.nn.SAGEConv that conveniently creates a GraphSAGE layer.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from dgl.nn import SAGEConv\n\n# ----------- 2. create model -------------- #\n# build a two-layer GraphSAGE model\nclass GraphSAGE(nn.Module):\n def __init__(self, in_feats, h_feats):\n super(GraphSAGE, self).__init__()\n self.conv1 = SAGEConv(in_feats, h_feats, 'mean')\n self.conv2 = SAGEConv(h_feats, h_feats, 'mean')\n \n def forward(self, g, in_feat):\n h = self.conv1(g, in_feat)\n h = F.relu(h)\n h = self.conv2(g, h)\n return h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model then predicts the probability of existence of an edge by\ncomputing a score between the representations of both incident nodes\nwith a function (e.g.\u00a0an MLP or a dot product), which you will see in\nthe next section.\n\n\\begin{align}\\hat{y}_{u\\sim v} = f(h_u, h_v)\\end{align}\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Positive graph, negative graph, and apply_edges\n---------------------------------------------------\n\nIn previous tutorials you have learned how to compute node\nrepresentations with a GNN. However, link prediction requires you to\ncompute representation of *pairs of nodes*.\n\nDGL recommends you to treat the pairs of nodes as another graph, since\nyou can describe a pair of nodes with an edge. In link prediction, you\nwill have a *positive graph* consisting of all the positive examples as\nedges, and a *negative graph* consisting of all the negative examples.\nThe *positive graph* and the *negative graph* will contain the same set\nof nodes as the original graph. This makes it easier to pass node\nfeatures among multiple graphs for computation. As you will see later,\nyou can directly feed the node representations computed on the entire\ngraph to the positive and the negative graphs for computing pair-wise\nscores.\n\nThe following code constructs the positive graph and the negative graph\nfor the training set and the test set respectively.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train_pos_g = dgl.graph((train_pos_u, train_pos_v), num_nodes=g.number_of_nodes())\ntrain_neg_g = dgl.graph((train_neg_u, train_neg_v), num_nodes=g.number_of_nodes())\n\ntest_pos_g = dgl.graph((test_pos_u, test_pos_v), num_nodes=g.number_of_nodes())\ntest_neg_g = dgl.graph((test_neg_u, test_neg_v), num_nodes=g.number_of_nodes())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The benefit of treating the pairs of nodes as a graph is that you can\nuse the DGLGraph.apply_edges method, which conveniently computes new\nedge features based on the incident nodes\u2019 features and the original\nedge features (if applicable).\n\nDGL provides a set of optimized builtin functions to compute new\nedge features based on the original node/edge features. For example,\ndgl.function.u_dot_v computes a dot product of the incident nodes\u2019\nrepresentations for each edge.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import dgl.function as fn\n\nclass DotPredictor(nn.Module):\n def forward(self, g, h):\n with g.local_scope():\n g.ndata['h'] = h\n # Compute a new edge feature named 'score' by a dot-product between the\n # source node feature 'h' and destination node feature 'h'.\n g.apply_edges(fn.u_dot_v('h', 'h', 'score'))\n # u_dot_v returns a 1-element vector for each edge so you need to squeeze it.\n return g.edata['score'][:, 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also write your own function if it is complex.\nFor instance, the following module produces a scalar score on each edge\nby concatenating the incident nodes\u2019 features and passing it to an MLP.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class MLPPredictor(nn.Module):\n def __init__(self, h_feats):\n super().__init__()\n self.W1 = nn.Linear(h_feats * 2, h_feats)\n self.W2 = nn.Linear(h_feats, 1)\n\n def apply_edges(self, edges):\n \"\"\"\n Computes a scalar score for each edge of the given graph.\n\n Parameters\n ----------\n edges :\n Has three members src, dst and data, each of\n which is a dictionary representing the features of the\n source nodes, the destination nodes, and the edges\n themselves.\n\n Returns\n -------\n dict\n A dictionary of new edge features.\n \"\"\"\n h = torch.cat([edges.src['h'], edges.dst['h']], 1)\n return {'score': self.W2(F.relu(self.W1(h))).squeeze(1)}\n\n def forward(self, g, h):\n with g.local_scope():\n g.ndata['h'] = h\n g.apply_edges(self.apply_edges)\n return g.edata['score']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

#### Note

The builtin functions are optimized for both speed and memory.\n We recommend using builtin functions whenever possible.

\n\n

#### Note

If you have read the :doc:message passing\n tutorial <3_message_passing>, you will notice that the\n argument apply_edges takes has exactly the same form as a message\n function in update_all.

\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training loop\n-------------\n\nAfter you defined the node representation computation and the edge score\ncomputation, you can go ahead and define the overall model, loss\nfunction, and evaluation metric.\n\nThe loss function is simply binary cross entropy loss.\n\n\\begin{align}\\mathcal{L} = -\\sum_{u\\sim v\\in \\mathcal{D}}\\left( y_{u\\sim v}\\log(\\hat{y}_{u\\sim v}) + (1-y_{u\\sim v})\\log(1-\\hat{y}_{u\\sim v})) \\right)\\end{align}\n\nThe evaluation metric in this tutorial is AUC.\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model = GraphSAGE(train_g.ndata['feat'].shape, 16)\n# You can replace DotPredictor with MLPPredictor.\n#pred = MLPPredictor(16)\npred = DotPredictor()\n\ndef compute_loss(pos_score, neg_score):\n scores = torch.cat([pos_score, neg_score])\n labels = torch.cat([torch.ones(pos_score.shape), torch.zeros(neg_score.shape)])\n return F.binary_cross_entropy_with_logits(scores, labels)\n\ndef compute_auc(pos_score, neg_score):\n scores = torch.cat([pos_score, neg_score]).numpy()\n labels = torch.cat(\n [torch.ones(pos_score.shape), torch.zeros(neg_score.shape)]).numpy()\n return roc_auc_score(labels, scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training loop goes as follows:\n\n

#### Note

This tutorial does not include evaluation on a validation\n set. In practice you should save and evaluate the best model based on\n performance on the validation set.

\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# ----------- 3. set up loss and optimizer -------------- #\n# in this case, loss will in training loop\noptimizer = torch.optim.Adam(itertools.chain(model.parameters(), pred.parameters()), lr=0.01)\n\n# ----------- 4. training -------------------------------- #\nall_logits = []\nfor e in range(100):\n # forward\n h = model(train_g, train_g.ndata['feat'])\n pos_score = pred(train_pos_g, h)\n neg_score = pred(train_neg_g, h)\n loss = compute_loss(pos_score, neg_score)\n \n # backward\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n \n if e % 5 == 0:\n print('In epoch {}, loss: {}'.format(e, loss))\n\n# ----------- 5. check results ------------------------ #\nfrom sklearn.metrics import roc_auc_score\nwith torch.no_grad():\n pos_score = pred(test_pos_g, h)\n neg_score = pred(test_neg_g, h)\n print('AUC', compute_auc(pos_score, neg_score))\n\n\n# Thumbnail credits: Link Prediction with Neo4j, Mark Needham\n# sphinx_gallery_thumbnail_path = '_static/blitz_4_link_predict.png'" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 0 }