# Multi-hop Attention Graph Neural Networks

The attention computation in GAT can only attend to the representations of a node's directly connected neighbors; this mechanism ignores nodes that are not directly connected but still carry important information.

• Multi-hop attention is captured via diffusion: the diffused attention $\alpha'_{D,C}$ is expressed as $\alpha'_{D,C} = f([\alpha_{B,C}, \alpha_{D,B}])$
• Using the weights of the graph adjacency matrix, attention diffusion takes all paths between nodes into account, which strengthens graph-structure learning. MAGNA uses node D's features when computing the attention between A and B, which means that two-hop attention in MAGNA is context-dependent (see the sketch below).
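
A minimal sketch of this idea in Python (NumPy). The one-hop attention values, the decay factor `alpha`, and the truncation depth `K` below are made up for illustration, not the paper's actual values: diffusion sums decayed powers of the one-hop attention matrix, so node D ends up attending to the two-hop node C through the intermediate node B even though D and C are not directly connected.

```python
import numpy as np

# Hypothetical one-hop attention matrix over nodes [A, B, C, D]:
# entry [i, j] is the attention node i pays to its direct neighbor j.
# D is connected to B and B to C, but D is NOT directly connected to C.
A_att = np.array([
    [0.0, 1.0, 0.0, 0.0],   # A -> B
    [0.5, 0.0, 0.5, 0.0],   # B -> A, B -> C
    [0.0, 1.0, 0.0, 0.0],   # C -> B
    [0.0, 1.0, 0.0, 0.0],   # D -> B
])

alpha, K = 0.15, 6          # assumed decay factor and truncation depth

# Attention diffusion: a decayed sum of powers of the one-hop attention matrix.
diffused = sum(alpha * (1 - alpha) ** k * np.linalg.matrix_power(A_att, k)
               for k in range(K + 1))

print(diffused[3, 2])       # D now places nonzero attention on the 2-hop node C
```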

## Method

### Parameter Definitions

Each row of the embedding matrix, $x_i = X[i,:]$, is the embedding of node $v_i$ ($1\le i\le N_n$); likewise, $r_j = R[j,:]$ ($1\le j\le N_r$) is the embedding of the $j$-th relation type.

### Multi-hop Attention Diffusion

Attention diffusion is how MAGNA's attention scores are computed in each layer. In the first stage, an attention score is computed on every edge. In the second stage, attention diffusion extends these edge scores to multi-hop neighbors.
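
Concretely, the second stage can be written as a diffusion series over the one-hop attention matrix $A^{(l)}$. The geometric decay $\theta_k = \alpha(1-\alpha)^k$ and the iterative realization below (the "message-passing style" approximation the reviewers refer to as Eqn. 5) follow the personalized-PageRank-style formulation; the symbols $\alpha$, $K$, and $Z^{(k)}$ are this note's notation, not necessarily the paper's:

$$
\mathcal{A}^{(l)} = \sum_{k=0}^{\infty} \theta_k \left(A^{(l)}\right)^k, \qquad \theta_k = \alpha (1-\alpha)^k, \qquad \sum_{k=0}^{\infty} \theta_k = 1,
$$

which is approximated in $K$ propagation steps:

$$
Z^{(0)} = H^{(l)}, \qquad Z^{(k+1)} = (1-\alpha)\, A^{(l)} Z^{(k)} + \alpha\, Z^{(0)}, \qquad \mathcal{A}^{(l)} H^{(l)} \approx Z^{(K)}.
$$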

#### Edge Attention Computation.

$W_h^{(l)}, W_t^{(l)} \in \mathbb{R}^{d^{(l)}\times d^{(l)}}$, $W_r^{(l)} \in \mathbb{R}^{d^{(l)}\times d_r}$, and $v_a^{(l)} \in \mathbb{R}^{1\times 3d^{(l)}}$ are shared trainable parameters.

$h_i^{(l)} \in \mathbb{R}^{d^{(l)}}$ is the embedding of node $i$ at layer $l$, with $h_i^{(0)} = x_i$.

$r_k$ ($1\le k \le N_r$) is the trainable embedding of the $k$-th relation type.

$A^{(l)}_{i,j}$ is defined as the attention value at layer $l$ when aggregating messages from node $j$ to node $i$.
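
A minimal PyTorch sketch of one way to realize this edge attention from the parameters above. The exact scoring function (the concatenation order, the $\tanh$ nonlinearity, and the softmax over a node's incoming edges) is an assumption for illustration, not the paper's verbatim formula:

```python
import torch
import torch.nn.functional as F

d, d_r = 16, 16                      # assumed layer width d^(l) and relation width d_r
W_h = torch.randn(d, d)              # transform for the head (target) node i
W_t = torch.randn(d, d)              # transform for the tail (source) node j
W_r = torch.randn(d, d_r)            # transform for the relation embedding
v_a = torch.randn(1, 3 * d)          # scoring vector over the concatenated features

def edge_score(h_i, h_j, r_k):
    """Unnormalized attention score for one edge (i, r_k, j)."""
    z = torch.cat([W_h @ h_i, W_t @ h_j, W_r @ r_k])   # shape: (3d,)
    return (v_a @ torch.tanh(z)).squeeze()

# A^{(l)}_{i,j} is then the softmax of these scores over node i's incoming edges:
h_i = torch.randn(d)
neighbors = [(torch.randn(d), torch.randn(d_r)) for _ in range(3)]  # (h_j, r_k) pairs
scores = torch.stack([edge_score(h_i, h_j, r_k) for h_j, r_k in neighbors])
A_i = F.softmax(scores, dim=0)       # attention node i places on each neighbor j
```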

## Reviewer

The central question of the reviewers’ discussion was whether the contribution of this paper was significant enough or too incremental. The discussion emphasized relevant literature which already considers multi-hop attention (e.g. https://openreview.net/forum?id=rkKvBAiiz [Cucurull et al.], https://ieeexplore.ieee.org/document/8683050 [Feng et al.], https://arxiv.org/abs/2001.07620 [Isufi et al.]), and which should have served as baselines. In particular, the experiment suggested by R3 was in line with some of these previous works, which consider “a multi-hop adjacency matrix” as a way to increase the GAT’s receptive field. This was as opposed to preserving the 1-hop adjacency matrix used in the original GAT and stacking multiple layers to enlarge the receptive field, which, as noted by the authors, may result in over-smoothed node features. The reviewers acknowledged that there is indeed a slight difference between the formulation proposed in the paper and the one in e.g. [Cucurull et al.]. The difference consists in calculating attention and then computing the powers with a decay factor vs. increasing the receptive field first by using powers of the adjacency matrix and then computing attention. Still, the multi-hop GAT baseline of [Cucurull et al.] could be extended to use a multi-hop adjacency matrix computed with the diffusion process from [Klicpera 2019], as suggested by R3. In light of these works and the above-mentioned missing baselines, the reviewers agreed that the contribution may be viewed as rather incremental (combining multi-hop graph attention with graph diffusion). The discussion also highlighted the potential of the presented spectral analysis, which could be strengthened by developing new insights in order to become a stronger contribution (see R2’s suggestions).

Whether the proposed methodology is more powerful than GAT is arguable:
When the attention scores for indirectly connected neighbors are still computed from the immediate neighbors’ attention scores, it is not convincing to argue that this is more powerful than GAT, which learns attention scores over contextualized immediate neighbors. Also, the approximate realization of the model described in Eqn. 5 follows a message-passing style to propagate attention scores. If the argument is that standard message-passing-based diffusion is not powerful enough to produce immediate-neighbor representations that encode information from far-away neighbors, then it is not immediately clear why a similar diffusion, when used to propagate attention scores from immediate neighbors to neighbors multiple hops away, should be more powerful.