{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n\nUnderstand Graph Attention Network\n=======================================\n\n**Authors:** Hao Zhang _, Mufei Li\n_, Minjie Wang\n_ Zheng Zhang\n_\n\n

#### Warning

The tutorial aims at gaining insights into the paper, with code as a mean\n of explanation. The implementation thus is NOT optimized for running\n efficiency. For recommended implementation, please refer to the official\n examples _.

\n\nIn this tutorial, you learn about a graph attention network (GAT) and how it can be \nimplemented in PyTorch. You can also learn to visualize and understand what the attention \nmechanism has learned.\n\nThe research described in the paper Graph Convolutional Network (GCN) _,\nindicates that combining local graph structure and node-level features yields\ngood performance on node classification tasks. However, the way GCN aggregates\nis structure-dependent, which can hurt its generalizability.\n\nOne workaround is to simply average over all neighbor node features as described in\nthe research paper GraphSAGE\n_.\nHowever, Graph Attention Network _ proposes a\ndifferent type of aggregation. GAT uses weighting neighbor features with feature dependent and\nstructure-free normalization, in the style of attention.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Introducing attention to GCN\n----------------------------\n\nThe key difference between GAT and GCN is how the information from the one-hop neighborhood is aggregated.\n\nFor GCN, a graph convolution operation produces the normalized sum of the node features of neighbors.\n\n\n\\begin{align}h_i^{(l+1)}=\\sigma\\left(\\sum_{j\\in \\mathcal{N}(i)} {\\frac{1}{c_{ij}} W^{(l)}h^{(l)}_j}\\right)\\end{align}\n\n\nwhere $\\mathcal{N}(i)$ is the set of its one-hop neighbors (to include\n$v_i$ in the set, simply add a self-loop to each node),\n$c_{ij}=\\sqrt{|\\mathcal{N}(i)|}\\sqrt{|\\mathcal{N}(j)|}$ is a\nnormalization constant based on graph structure, $\\sigma$ is an\nactivation function (GCN uses ReLU), and $W^{(l)}$ is a shared\nweight matrix for node-wise feature transformation. Another model proposed in\nGraphSAGE\n_\nemploys the same update rule except that they set\n$c_{ij}=|\\mathcal{N}(i)|$.\n\nGAT introduces the attention mechanism as a substitute for the statically\nnormalized convolution operation. Below are the equations to compute the node\nembedding $h_i^{(l+1)}$ of layer $l+1$ from the embeddings of\nlayer $l$.\n\n![](https://data.dgl.ai/tutorial/gat/gat.png)\n\n :width: 450px\n :align: center\n\n\\begin{align}\\begin{align}\n z_i^{(l)}&=W^{(l)}h_i^{(l)},&(1) \\\\\n e_{ij}^{(l)}&=\\text{LeakyReLU}(\\vec a^{(l)^T}(z_i^{(l)}||z_j^{(l)})),&(2)\\\\\n \\alpha_{ij}^{(l)}&=\\frac{\\exp(e_{ij}^{(l)})}{\\sum_{k\\in \\mathcal{N}(i)}^{}\\exp(e_{ik}^{(l)})},&(3)\\\\\n h_i^{(l+1)}&=\\sigma\\left(\\sum_{j\\in \\mathcal{N}(i)} {\\alpha^{(l)}_{ij} z^{(l)}_j }\\right),&(4)\n \\end{align}\\end{align}\n\n\nExplanations:\n\n\n* Equation (1) is a linear transformation of the lower layer embedding $h_i^{(l)}$\n and $W^{(l)}$ is its learnable weight matrix.\n* Equation (2) computes a pair-wise *un-normalized* attention score between two neighbors.\n Here, it first concatenates the $z$ embeddings of the two nodes, where $||$\n denotes concatenation, then takes a dot product of it and a learnable weight vector\n $\\vec a^{(l)}$, and applies a LeakyReLU in the end. This form of attention is\n usually called *additive attention*, contrast with the dot-product attention in the\n Transformer model.\n* Equation (3) applies a softmax to normalize the attention scores on each node's\n incoming edges.\n* Equation (4) is similar to GCN. The embeddings from neighbors are aggregated together,\n scaled by the attention scores.\n\nThere are other details from the paper, such as dropout and skip connections.\nFor the purpose of simplicity, those details are left out of this tutorial. To see more details, \ndownload the full example _.\nIn its essence, GAT is just a different aggregation function with attention\nover features of neighbors, instead of a simple mean aggregation.\n\nGAT in DGL\n----------\n\nDGL provides an off-the-shelf implementation of the GAT layer under the dgl.nn.\nsubpackage. Simply import the GATConv as the follows.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from dgl.nn.pytorch import GATConv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Readers can skip the following step-by-step explanation of the implementation and\njump to the Put everything together_ for training and visualization results.\n\nTo begin, you can get an overall impression about how a GATLayer module is\nimplemented in DGL. In this section, the four equations above are broken down \none at a time.\n\n

#### Note

This is showing how to implement a GAT from scratch. DGL provides a more\n efficient :class:builtin GAT layer module .

\n\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass GATLayer(nn.Module):\n def __init__(self, g, in_dim, out_dim):\n super(GATLayer, self).__init__()\n self.g = g\n # equation (1)\n self.fc = nn.Linear(in_dim, out_dim, bias=False)\n # equation (2)\n self.attn_fc = nn.Linear(2 * out_dim, 1, bias=False)\n self.reset_parameters()\n\n def reset_parameters(self):\n \"\"\"Reinitialize learnable parameters.\"\"\"\n gain = nn.init.calculate_gain('relu')\n nn.init.xavier_normal_(self.fc.weight, gain=gain)\n nn.init.xavier_normal_(self.attn_fc.weight, gain=gain)\n\n def edge_attention(self, edges):\n # edge UDF for equation (2)\n z2 = torch.cat([edges.src['z'], edges.dst['z']], dim=1)\n a = self.attn_fc(z2)\n return {'e': F.leaky_relu(a)}\n\n def message_func(self, edges):\n # message UDF for equation (3) & (4)\n return {'z': edges.src['z'], 'e': edges.data['e']}\n\n def reduce_func(self, nodes):\n # reduce UDF for equation (3) & (4)\n # equation (3)\n alpha = F.softmax(nodes.mailbox['e'], dim=1)\n # equation (4)\n h = torch.sum(alpha * nodes.mailbox['z'], dim=1)\n return {'h': h}\n\n def forward(self, h):\n # equation (1)\n z = self.fc(h)\n self.g.ndata['z'] = z\n # equation (2)\n self.g.apply_edges(self.edge_attention)\n # equation (3) & (4)\n self.g.update_all(self.message_func, self.reduce_func)\n return self.g.ndata.pop('h')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Equation (1)\n^^^^^^^^^^^^\n\n\\begin{align}z_i^{(l)}=W^{(l)}h_i^{(l)},(1)\\end{align}\n\nThe first one shows linear transformation. It's common and can be\neasily implemented in Pytorch using torch.nn.Linear.\n\nEquation (2)\n^^^^^^^^^^^^\n\n\\begin{align}e_{ij}^{(l)}=\\text{LeakyReLU}(\\vec a^{(l)^T}(z_i^{(l)}|z_j^{(l)})),(2)\\end{align}\n\nThe un-normalized attention score $e_{ij}$ is calculated using the\nembeddings of adjacent nodes $i$ and $j$. This suggests that the\nattention scores can be viewed as edge data, which can be calculated by the\napply_edges API. The argument to the apply_edges is an **Edge UDF**,\nwhich is defined as below:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def edge_attention(self, edges):\n # edge UDF for equation (2)\n z2 = torch.cat([edges.src['z'], edges.dst['z']], dim=1)\n a = self.attn_fc(z2)\n return {'e' : F.leaky_relu(a)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, the dot product with the learnable weight vector $\\vec{a^{(l)}}$\nis implemented again using PyTorch's linear transformation attn_fc. Note\nthat apply_edges will **batch** all the edge data in one tensor, so the\ncat, attn_fc here are applied on all the edges in parallel.\n\nEquation (3) & (4)\n^^^^^^^^^^^^^^^^^^\n\n\\begin{align}\\begin{align}\n \\alpha_{ij}^{(l)}&=\\frac{\\exp(e_{ij}^{(l)})}{\\sum_{k\\in \\mathcal{N}(i)}^{}\\exp(e_{ik}^{(l)})},&(3)\\\\\n h_i^{(l+1)}&=\\sigma\\left(\\sum_{j\\in \\mathcal{N}(i)} {\\alpha^{(l)}_{ij} z^{(l)}_j }\\right),&(4)\n \\end{align}\\end{align}\n\nSimilar to GCN, update_all API is used to trigger message passing on all\nthe nodes. The message function sends out two tensors: the transformed z\nembedding of the source node and the un-normalized attention score e on\neach edge. The reduce function then performs two tasks:\n\n\n* Normalize the attention scores using softmax (equation (3)).\n* Aggregate neighbor embeddings weighted by the attention scores (equation(4)).\n\nBoth tasks first fetch data from the mailbox and then manipulate it on the\nsecond dimension (dim=1), on which the messages are batched.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def reduce_func(self, nodes):\n # reduce UDF for equation (3) & (4)\n # equation (3)\n alpha = F.softmax(nodes.mailbox['e'], dim=1)\n # equation (4)\n h = torch.sum(alpha * nodes.mailbox['z'], dim=1)\n return {'h' : h}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multi-head attention\n^^^^^^^^^^^^^^^^^^^^\n\nAnalogous to multiple channels in ConvNet, GAT introduces **multi-head\nattention** to enrich the model capacity and to stabilize the learning\nprocess. Each attention head has its own parameters and their outputs can be\nmerged in two ways:\n\n\\begin{align}\\text{concatenation}: h^{(l+1)}_{i} =||_{k=1}^{K}\\sigma\\left(\\sum_{j\\in \\mathcal{N}(i)}\\alpha_{ij}^{k}W^{k}h^{(l)}_{j}\\right)\\end{align}\n\nor\n\n\\begin{align}\\text{average}: h_{i}^{(l+1)}=\\sigma\\left(\\frac{1}{K}\\sum_{k=1}^{K}\\sum_{j\\in\\mathcal{N}(i)}\\alpha_{ij}^{k}W^{k}h^{(l)}_{j}\\right)\\end{align}\n\nwhere $K$ is the number of heads. You can use\nconcatenation for intermediary layers and average for the final layer.\n\nUse the above defined single-head GATLayer as the building block\nfor the MultiHeadGATLayer below:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class MultiHeadGATLayer(nn.Module):\n def __init__(self, g, in_dim, out_dim, num_heads, merge='cat'):\n super(MultiHeadGATLayer, self).__init__()\n self.heads = nn.ModuleList()\n for i in range(num_heads):\n self.heads.append(GATLayer(g, in_dim, out_dim))\n self.merge = merge\n\n def forward(self, h):\n head_outs = [attn_head(h) for attn_head in self.heads]\n if self.merge == 'cat':\n # concat on the output feature dimension (dim=1)\n return torch.cat(head_outs, dim=1)\n else:\n # merge using average\n return torch.mean(torch.stack(head_outs))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Put everything together\n^^^^^^^^^^^^^^^^^^^^^^^\n\nNow, you can define a two-layer GAT model.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class GAT(nn.Module):\n def __init__(self, g, in_dim, hidden_dim, out_dim, num_heads):\n super(GAT, self).__init__()\n self.layer1 = MultiHeadGATLayer(g, in_dim, hidden_dim, num_heads)\n # Be aware that the input dimension is hidden_dim*num_heads since\n # multiple head outputs are concatenated together. Also, only\n # one attention head in the output layer.\n self.layer2 = MultiHeadGATLayer(g, hidden_dim * num_heads, out_dim, 1)\n\n def forward(self, h):\n h = self.layer1(h)\n h = F.elu(h)\n h = self.layer2(h)\n return h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then load the Cora dataset using DGL's built-in data module.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from dgl import DGLGraph\nfrom dgl.data import citation_graph as citegrh\nimport networkx as nx\n\ndef load_cora_data():\n data = citegrh.load_cora()\n features = torch.FloatTensor(data.features)\n labels = torch.LongTensor(data.labels)\n mask = torch.BoolTensor(data.train_mask)\n g = DGLGraph(data.graph)\n return g, features, labels, mask" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training loop is exactly the same as in the GCN tutorial.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import time\nimport numpy as np\n\ng, features, labels, mask = load_cora_data()\n\n# create the model, 2 heads, each head has hidden size 8\nnet = GAT(g,\n in_dim=features.size(),\n hidden_dim=8,\n out_dim=7,\n num_heads=2)\n\n# create optimizer\noptimizer = torch.optim.Adam(net.parameters(), lr=1e-3)\n\n# main loop\ndur = []\nfor epoch in range(30):\n if epoch >= 3:\n t0 = time.time()\n\n logits = net(features)\n logp = F.log_softmax(logits, 1)\n loss = F.nll_loss(logp[mask], labels[mask])\n\n optimizer.zero_grad()\n loss.backward()\n optimizer.step()\n\n if epoch >= 3:\n dur.append(time.time() - t0)\n\n print(\"Epoch {:05d} | Loss {:.4f} | Time(s) {:.4f}\".format(\n epoch, loss.item(), np.mean(dur)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualizing and understanding attention learned\n----------------------------------------------\n\nCora\n^^^^\n\nThe following table summarizes the model performance on Cora that is reported in\nthe GAT paper _ and obtained with DGL \nimplementations.\n\n.. list-table::\n :header-rows: 1\n\n * - Model\n - Accuracy\n * - GCN (paper)\n - $81.4\\pm 0.5%$\n * - GCN (dgl)\n - $82.05\\pm 0.33%$\n * - GAT (paper)\n - $83.0\\pm 0.7%$\n * - GAT (dgl)\n - $83.69\\pm 0.529%$\n\n*What kind of attention distribution has our model learned?*\n\nBecause the attention weight $a_{ij}$ is associated with edges, you can\nvisualize it by coloring edges. Below you can pick a subgraph of Cora and plot the\nattention weights of the last GATLayer. The nodes are colored according\nto their labels, whereas the edges are colored according to the magnitude of\nthe attention weights, which can be referred with the colorbar on the right.\n\n![](https://data.dgl.ai/tutorial/gat/cora-attention.png)\n\n :width: 600px\n :align: center\n\nYou can see that the model seems to learn different attention weights. To\nunderstand the distribution more thoroughly, measure the entropy\n_) of the\nattention distribution. For any node $i$,\n$\\{\\alpha_{ij}\\}_{j\\in\\mathcal{N}(i)}$ forms a discrete probability\ndistribution over all its neighbors with the entropy given by\n\n\\begin{align}H({\\alpha_{ij}}_{j\\in\\mathcal{N}(i)})=-\\sum_{j\\in\\mathcal{N}(i)} \\alpha_{ij}\\log\\alpha_{ij}\\end{align}\n\nA low entropy means a high degree of concentration, and vice\nversa. An entropy of 0 means all attention is on one source node. The uniform\ndistribution has the highest entropy of $\\log(\\mathcal{N}(i))$.\nIdeally, you want to see the model learns a distribution of lower entropy\n(i.e, one or two neighbors are much more important than the others).\n\nNote that since nodes can have different degrees, the maximum entropy will\nalso be different. Therefore, you plot the aggregated histogram of entropy\nvalues of all nodes in the entire graph. Below are the attention histogram of\nlearned by each attention head.\n\n|image2|\n\nAs a reference, here is the histogram if all the nodes have uniform attention weight distribution.\n\n![](https://data.dgl.ai/tutorial/gat/cora-attention-uniform-hist.png)\n\n :width: 250px\n :align: center\n\nOne can see that **the attention values learned is quite similar to uniform distribution**\n(i.e, all neighbors are equally important). This partially\nexplains why the performance of GAT is close to that of GCN on Cora\n(according to author's reported result\n_, the accuracy difference averaged\nover 100 runs is less than 2 percent). Attention does not matter\nsince it does not differentiate much.\n\n*Does that mean the attention mechanism is not useful?* No! A different\ndataset exhibits an entirely different pattern, as you can see next.\n\nProtein-protein interaction (PPI) networks\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nThe PPI dataset used here consists of $24$ graphs corresponding to\ndifferent human tissues. Nodes can have up to $121$ kinds of labels, so\nthe label of node is represented as a binary tensor of size $121$. The\ntask is to predict node label.\n\nUse $20$ graphs for training, $2$ for validation and $2$\nfor test. The average number of nodes per graph is $2372$. Each node\nhas $50$ features that are composed of positional gene sets, motif gene\nsets, and immunological signatures. Critically, test graphs remain completely\nunobserved during training, a setting called \"inductive learning\".\n\nCompare the performance of GAT and GCN for $10$ random runs on this\ntask and use hyperparameter search on the validation set to find the best\nmodel.\n\n.. list-table::\n :header-rows: 1\n\n * - Model\n - F1 Score(micro)\n * - GAT\n - $0.975 \\pm 0.006$\n * - GCN\n - $0.509 \\pm 0.025$\n * - Paper\n - $0.973 \\pm 0.002$\n\nThe table above is the result of this experiment, where you use micro F1\nscore _ to evaluate the model\nperformance.\n\n

#### Note

Below is the calculation process of F1 score:\n\n .. math::\n\n precision=\\frac{\\sum_{t=1}^{n}TP_{t}}{\\sum_{t=1}^{n}(TP_{t} +FP_{t})}\n\n recall=\\frac{\\sum_{t=1}^{n}TP_{t}}{\\sum_{t=1}^{n}(TP_{t} +FN_{t})}\n\n F1_{micro}=2\\frac{precision*recall}{precision+recall}\n\n * $TP_{t}$ represents for number of nodes that both have and are predicted to have label $t$\n * $FP_{t}$ represents for number of nodes that do not have but are predicted to have label $t$\n * $FN_{t}$ represents for number of output classes labeled as $t$ but predicted as others.\n * $n$ is the number of labels, i.e. $121$ in our case.

\n\nDuring training, use BCEWithLogitsLoss as the loss function. The\nlearning curves of GAT and GCN are presented below; what is evident is the\ndramatic performance adavantage of GAT over GCN.\n\n![](https://data.dgl.ai/tutorial/gat/ppi-curve.png)\n\n :width: 300px\n :align: center\n\nAs before, you can have a statistical understanding of the attentions learned\nby showing the histogram plot for the node-wise attention entropy. Below are\nthe attention histograms learned by different attention layers.\n\n*Attention learned in layer 1:*\n\n|image5|\n\n*Attention learned in layer 2:*\n\n|image6|\n\n*Attention learned in final layer:*\n\n|image7|\n\nAgain, comparing with uniform distribution: \n\n![](https://data.dgl.ai/tutorial/gat/ppi-uniform-hist.png)\n\n :width: 250px\n :align: center\n\nClearly, **GAT does learn sharp attention weights**! There is a clear pattern\nover the layers as well: **the attention gets sharper with a higher\nlayer**.\n\nUnlike the Cora dataset where GAT's gain is minimal at best, for PPI there\nis a significant performance gap between GAT and other GNN variants compared\nin the GAT paper _ (at least 20 percent),\nand the attention distributions between the two clearly differ. While this\ndeserves further research, one immediate conclusion is that GAT's advantage\nlies perhaps more in its ability to handle a graph with more complex\nneighborhood structure.\n\nWhat's next?\n------------\n\nSo far, you have seen how to use DGL to implement GAT. There are some\nmissing details such as dropout, skip connections, and hyper-parameter tuning,\nwhich are practices that do not involve DGL-related concepts. For more information\ncheck out the full example.\n\n* See the optimized full example _.\n* The next tutorial describes how to speedup GAT models by parallelizing multiple attention heads and SPMV optimization.\n\n.. |image2| image:: https://data.dgl.ai/tutorial/gat/cora-attention-hist.png\n.. |image5| image:: https://data.dgl.ai/tutorial/gat/ppi-first-layer-hist.png\n.. |image6| image:: https://data.dgl.ai/tutorial/gat/ppi-second-layer-hist.png\n.. |image7| image:: https://data.dgl.ai/tutorial/gat/ppi-final-layer-hist.png\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 0 }