#### Note

Below is how the micro-averaged F1 score is calculated:

.. math::

   precision=\frac{\sum_{t=1}^{n}TP_{t}}{\sum_{t=1}^{n}(TP_{t} +FP_{t})}

   recall=\frac{\sum_{t=1}^{n}TP_{t}}{\sum_{t=1}^{n}(TP_{t} +FN_{t})}

   F1_{micro}=2\frac{precision*recall}{precision+recall}

* $TP_{t}$ is the number of nodes that both have and are predicted to have label $t$.
* $FP_{t}$ is the number of nodes that do not have but are predicted to have label $t$.
* $FN_{t}$ is the number of nodes that have label $t$ but are predicted not to have it.
* $n$ is the number of labels, i.e. $121$ in our case.
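The formulas above can be sketched in a few lines of plain Python. This is only an illustration, not the evaluation code used in the tutorial: the function name `micro_f1` and the toy data are hypothetical, and the predictions are assumed to be already binarized (for example by thresholding the model's logits).

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label predictions.

    y_true, y_pred: sequences of rows, one row per node, each row holding
    n binary (0/1) entries, one per label.
    """
    tp = fp = fn = 0
    # Pool the counts over all nodes and all labels (the sums over t).
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1  # has the label and is predicted to have it
            elif t == 0 and p == 1:
                fp += 1  # does not have the label but is predicted to
            elif t == 1 and p == 0:
                fn += 1  # has the label but is predicted not to
    if tp == 0:
        # Precision and recall are zero or undefined; report 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with 3 nodes and 4 labels (hypothetical data, not PPI).
y_true = [[1, 0, 1, 0],
          [0, 1, 0, 0],
          [1, 1, 0, 1]]
y_pred = [[1, 0, 0, 0],
          [0, 1, 1, 0],
          [1, 0, 0, 1]]
score = micro_f1(y_true, y_pred)  # TP=4, FP=1, FN=2, so F1 = 8/11
print(f"micro-F1 = {score:.4f}")
```

Because the counts are pooled before dividing, frequent labels contribute more to the score than rare ones, which is the difference between micro and macro averaging.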

During training, we use ``BCEWithLogitsLoss`` as the loss function. The
learning curves of GAT and GCN are presented below; GAT shows a dramatic
performance advantage over GCN.

![](https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/gat/ppi-curve.png)

As before, we can get a statistical understanding of the attention learnt
by plotting a histogram of the node-wise attention entropy. Below are
the attention histograms learnt by different attention layers.

*Attention learnt in layer 1:*

|image5|

*Attention learnt in layer 2:*

|image6|

*Attention learnt in final layer:*

|image7|

Again, for comparison, the histogram for a uniform attention distribution:

![](https://s3.us-east-2.amazonaws.com/dgl.ai/tutorial/gat/ppi-uniform-hist.png)

Clearly, **GAT does learn sharp attention weights**! There is a clear pattern
over the layers as well: **the attention gets sharper in higher
layers**.

Unlike the Cora dataset where GAT's gain is lukewarm at best, for PPI there
is a significant performance gap between GAT and other GNN variants compared
in `the GAT paper