\section{The Graph Distance Network}
We propose the Graph Distance Network (GDN), which combines a Siamese network structure with multiple shared message passing layers, the learnable cost matrix derived above, and the Sinkhorn distance with analytic gradients.
Let $d_h$ be an arbitrary embedding size, $d_{\text{node}}$ the number of node features of the input graphs, and $d_{\text{edge}}$ the number of edge types. Let $W_l \in \mathbb{R}^{d_h \times d_h}$ and $b_l \in \mathbb{R}^{d_h}$ denote a weight matrix and bias, $h_i^l \in \mathbb{R}^{d_h}$ the embedding of node $i$ at layer $l$ for $l > 0$, $\sigma(\cdot)$ an elementwise non-linearity, and $E \in \mathbb{R}^{d_{\text{edge}} \times d_h \times d_h}$ a weight tensor. The input embeddings are the raw node features, $h_i^0 = v_i \in \mathbb{R}^{d_{\text{node}}}$, so the first layer maps into the embedding space with $W_1 \in \mathbb{R}^{d_h \times d_{\text{node}}}$ and $b_1 \in \mathbb{R}^{d_h}$. We then use the following message passing function:
\begin{equation}
h_i^{l} = \sigma(W_{l} h_i^{l-1} + b_l) + \sum_{j \rightarrow i} \sigma(W_{l} h_j^{l-1} + b_l) E_{e_{j \rightarrow i}}
\end{equation}
where $E_{e_{j \rightarrow i}} \in \mathbb{R}^{d_h \times d_h}$ is the slice of the weight tensor indexed by the discrete feature of the edge connecting node $j$ to node $i$. As activation function $\sigma(\cdot)$ we use the $\operatorname{leaky-relu}$ non-linearity.
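For concreteness, the following is a minimal PyTorch sketch of this update, assuming edges are given as a \texttt{[2, M]} index tensor with one discrete type per edge; the class name \texttt{GDNLayer} and all variable names are illustrative choices, not our exact implementation:
\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDNLayer(nn.Module):
    """One message passing step: sigma(W h_i + b) + sum_{j->i} sigma(W h_j + b) E_e."""
    def __init__(self, d_in, d_h, d_edge):
        super().__init__()
        self.lin = nn.Linear(d_in, d_h)                              # W_l and b_l
        self.E = nn.Parameter(0.01 * torch.randn(d_edge, d_h, d_h))  # edge-type tensor E

    def forward(self, h, edge_index, edge_type):
        # h: [N, d_in] node embeddings, edge_index: [2, M] (source j, target i),
        # edge_type: [M] discrete edge labels
        src, dst = edge_index
        act = F.leaky_relu(self.lin(h))            # sigma(W_l h + b_l) for every node
        # right-multiply each incoming message by the matrix of its edge type
        msg = torch.einsum('md,mde->me', act[src], self.E[edge_type])
        out = act.clone()
        out.index_add_(0, dst, msg)                # sum messages over incoming edges
        return out
\end{verbatim}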
To compute the entries of the cost matrix we concatenate, for every node, the embeddings of all layers and take norms of pairwise differences across both graphs; the insertion and deletion costs are norms of the concatenated embeddings scaled elementwise by a trainable $\alpha \in \mathbb{R}^{d_\text{final}}$:
\begin{equation}
\begin{gathered}
h_i^\text{final} = [h_i^L ; \dots ; h_i^1 ; \sigma(W_{1} h_i^{0} + b_1)] \in \mathbb{R}^{d_\text{final}}\\
C_{i,j} = \norm{h_i^\text{final} - h_j^\text{final}} \quad
C_{i, \epsilon} = \norm{ \alpha \odot h_i^\text{final}} \quad
C_{\epsilon, j} = \norm{ \alpha \odot h_j^\text{final}} \quad i \in \{1 \dots N_1\}, j \in \{1 \dots N_2\}
\end{gathered}
\end{equation}
In the following experiment we tested three different choices: the Euclidean norm ($p=2$), the Manhattan norm ($p=1$), and a simple two-layer perceptron (MLP) in place of a norm. The first layer of this MLP preserves the embedding size and is followed by a $\operatorname{relu}$ non-linearity; the second layer reduces the embedding to a single scalar, to which we apply the $\operatorname{softplus}$ non-linearity, $\ln(1 + e^x)$.
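As an illustration of the cost computation, here is a sketch for the two $p$-norm variants (the MLP variant would replace the norm by the two-layer network described above); the function name and argument layout are our own:
\begin{verbatim}
import torch

def cost_entries(h1, h2, alpha, p=2):
    # h1: [N1, d_final], h2: [N2, d_final] concatenated per-node embeddings
    # alpha: [d_final] trainable scaling for insertion/deletion costs
    C = torch.cdist(h1, h2, p=p)                  # C_{i,j} = ||h_i - h_j||_p
    c_del = torch.norm(alpha * h1, p=p, dim=1)    # C_{i,eps}
    c_ins = torch.norm(alpha * h2, p=p, dim=1)    # C_{eps,j}
    return C, c_del, c_ins
\end{verbatim}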
For the matrix $C$ we tested two variants. The first one we call BP-Distance, in reference to bipartite graph matching. For implementation reasons we pad the pairwise distances with rows and columns of $C_{i, \epsilon}$ and $C_{\epsilon, j}$, respectively, instead of the $\infty$-values typically used:
\begin{equation}
C_\text{BP}=
\left[
\begin{array}{ccc|ccc}
C_{1,1} & \dotsi & C_{1, N_2} & C_{1, \epsilon} & \dotsi & C_{1, \epsilon} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
C_{N_1, 1} & \dotsi & C_{N_1, N_2} & C_{N_1, \epsilon} & \dotsi & C_{N_1, \epsilon} \\
\hline
C_{\epsilon, 1} & \dotsi & C_{\epsilon, N_2} & 0 & \dotsi & 0 \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
C_{\epsilon, 1} & \dotsi & C_{\epsilon, N_2} & 0 & \dotsi & 0 \\
\end{array}
\right]
\in \mathbb{R}^{(N_1 + N_2) \times (N_1 + N_2)}
\end{equation}
This padding does lead to different results after the Sinkhorn normalization due to the entropy regularization, but the network can easily compensate by increasing $\norm{\alpha}$, so the modification should not restrict our trainable model. The second variant of the cost matrix is the smallest square sub-matrix of the one above that contains all pairwise distances, such that \mbox{$C_\text{no-BP} \in \mathbb{R}^{\max({N_1, N_2}) \times \max({N_1, N_2})}$}.
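A sketch of how $C_\text{BP}$ can be assembled from these blocks follows (again, names and tensor layout are illustrative):
\begin{verbatim}
import torch

def bp_cost_matrix(C, c_del, c_ins):
    # C: [N1, N2] pairwise costs, c_del: [N1] = C_{i,eps}, c_ins: [N2] = C_{eps,j}
    N1, N2 = C.shape
    top_right = c_del.unsqueeze(1).expand(N1, N1)      # deletion block
    bottom_left = c_ins.unsqueeze(0).expand(N2, N2)    # insertion block
    bottom_right = C.new_zeros(N2, N1)                 # eps-to-eps block
    top = torch.cat([C, top_right], dim=1)
    bottom = torch.cat([bottom_left, bottom_right], dim=1)
    return torch.cat([top, bottom], dim=0)             # [(N1+N2), (N1+N2)]
\end{verbatim}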
The final distance $d$ is computed from the Sinkhorn-normalized matrix, and the model is trained via the squared error between prediction and target distance:
\begin{equation}
\begin{gathered}
M = \operatorname{sinkhorn}\left(e^{-\frac{C}{\lambda}}\right) \\
d = \langle M, C \rangle_\mathrm{F} \\
\text{loss} = (d - d_\text{target})^2
\end{gathered}
\end{equation}
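The following sketch illustrates this computation with a simple Sinkhorn-Knopp loop; for brevity it uses a fixed number of alternating row/column normalizations and differentiates through the iterations rather than using the analytic gradient, so it should be read as an illustration only:
\begin{verbatim}
import torch

def sinkhorn_distance(C, lam, n_iter=50):
    # C: [n, n] cost matrix, lam: entropy regularization constant lambda
    M = torch.exp(-C / lam)
    for _ in range(n_iter):                    # alternate row/column normalization
        M = M / M.sum(dim=1, keepdim=True)
        M = M / M.sum(dim=0, keepdim=True)
    return (M * C).sum()                       # Frobenius inner product <M, C>_F

# training objective for one graph pair:
# loss = (sinkhorn_distance(C, lam) - d_target) ** 2
\end{verbatim}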
\paragraph{Implementation.} Our GNN is implemented with PyTorch (\citealp{pytorch}), PyTorch Scatter, and PyTorch Geometric (\citealp{pytorchgeometric}) and can process sparse mini-batches with parallel GPU acceleration. We provide two implementations of the matching step. The first pads the cost matrices to form dense batches and is suitable when all graphs have a similar number of nodes. The second uses sparse matrices in COOrdinate (COO) format, which is less efficient for evenly sized graphs but can be crucial for parallelizing workloads with skewed graph-size distributions.
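As a rough sketch of the padded variant (not our actual batching code), square cost matrices of varying size can be collected into a dense batch with a validity mask:
\begin{verbatim}
import torch

def pad_batch(cost_matrices):
    # cost_matrices: list of square [n_k, n_k] tensors with varying n_k
    n_max = max(C.shape[0] for C in cost_matrices)
    batch = cost_matrices[0].new_zeros(len(cost_matrices), n_max, n_max)
    mask = torch.zeros(len(cost_matrices), n_max, n_max, dtype=torch.bool)
    for k, C in enumerate(cost_matrices):
        n = C.shape[0]
        batch[k, :n, :n] = C
        mask[k, :n, :n] = True
    return batch, mask    # padded entries are masked out in the Sinkhorn step
\end{verbatim}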
\section{Experiment Setup}
% Dataset
For our first experiment we use synthetic preferential attachment graphs (\citealp{pref_att2002}) and real-world chemical molecules from the AIDS dataset\footnote{\url{https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data}}. Both datasets have discrete node and edge features, and graphs are limited to at most 30 nodes. We provide basic dataset statistics in Appendix~\ref{appendix:GED}.
% Baseline Setup
We compare our results to the implementations of \cite{riba2018}, \cite{bai2019}, and \cite{li2019} in Table~\ref{tab:ex1-baselines}. All methods are described in the related work section. We train for 500 epochs and report the best error on the validation set together with the corresponding test error. We run 3 trials of every setting and search over the cross product of 2 batch sizes (depending on the implementation), 32- and 64-dimensional node embeddings, 1 and 3 message passing layers, 4 learning rates, and 4 degrees of regularization. Implementation details can be found in the respective GitLab repositories\footnote{Riba: \url{https://gitlab.lrz.de/gdn/siamese_ged}, Bai: \url{https://gitlab.lrz.de/gdn/simgnn}, \\Li: \url{https://gitlab.lrz.de/gdn/gmn/tree/gr_report}}.
% Adam/SGD
% GDN Setup
For our model, the Graph Distance Network (GDN), we fix the batch size to 1024, the embedding size to 32, and the number of layers to 3, and additionally search over 3 choices of the Sinkhorn regularization constant $\lambda$. Moreover, we include an ablation study over the choice of the cost matrix and the different norms in Table~\ref{tab:ex1-ablation}.