#### Warning

The tutorial aims at gaining insights into the paper, with code as a means of explanation. The implementation is therefore NOT optimized for running efficiency. For a recommended implementation, please refer to the [official examples](https://github.com/dmlc/dgl/tree/master/examples).

In this tutorial, you learn about a graph attention network (GAT) and how it can be implemented in PyTorch. You can also learn to visualize and understand what the attention mechanism has learned.

The research described in the paper [Graph Convolutional Network (GCN)](https://arxiv.org/abs/1609.02907) indicates that combining local graph structure and node-level features yields good performance on node classification tasks. However, the way GCN aggregates is structure-dependent, which can hurt its generalizability.

One workaround is to simply average over all neighbor node features, as described in the research paper [GraphSAGE](https://www-cs-faculty.stanford.edu/people/jure/pubs/graphsage-nips17.pdf). However, [Graph Attention Network](https://arxiv.org/abs/1710.10903) proposes a different type of aggregation: GAT weights neighbor features with feature-dependent and structure-free normalization, in the style of attention.
"## Introducing attention to GCN\n\nThe key difference between GAT and GCN is how the information from the one-hop neighborhood is aggregated.\n\nFor GCN, a graph convolution operation produces the normalized sum of the node features of neighbors.\n\n\n\\begin{align}h_i^{(l+1)}=\\sigma\\left(\\sum_{j\\in \\mathcal{N}(i)} {\\frac{1}{c_{ij}} W^{(l)}h^{(l)}_j}\\right)\\end{align}\n\n\nwhere $\\mathcal{N}(i)$ is the set of its one-hop neighbors (to include\n$v_i$ in the set, simply add a self-loop to each node),\n$c_{ij}=\\sqrt{|\\mathcal{N}(i)|}\\sqrt{|\\mathcal{N}(j)|}$ is a\nnormalization constant based on graph structure, $\\sigma$ is an\nactivation function (GCN uses ReLU), and $W^{(l)}$ is a shared\nweight matrix for node-wise feature transformation. Another model proposed in\n[GraphSAGE](https://www-cs-faculty.stanford.edu/people/jure/pubs/graphsage-nips17.pdf)\nemploys the same update rule except that they set\n$c_{ij}=|\\mathcal{N}(i)|$.\n\nGAT introduces the attention mechanism as a substitute for the statically\nnormalized convolution operation. Below are the equations to compute the node\nembedding $h_i^{(l+1)}$ of layer $l+1$ from the embeddings of\nlayer $l$.\n\n