\appendix
\appendixpage
In this report I summarize the work I did during my guided research. The main part of the report focuses on findings relevant to a potential paper submission; this appendix adds information about the datasets we used, as well as some we considered but did not use (included for completeness).
\section{Graph Edit Distance Datasets}
\label{appendix:GED}
\begin{table}[htbp]
\addtolength{\tabcolsep}{-1pt}
\fontsize{9pt}{10.25pt}\selectfont
\centering
\renewcommand{\arraystretch}{1.2}
\begin{tabular}{|l|c|c|c|c|c|c|c|}
\hline
Dataset & Node Feat & Edge Feat & \#Train & \#Val/Test & Max Nodes & Avg Nodes & Avg Edges \\
\hline
Pref-Att & 6 & 4 & 144 & 48 & 30 & 20.6 & 75.4 \\ % edges mean 75.4
\hline
AIDS & 53 & 4 & 144 & 48 & 30 & 20.6 & 44.6 \\ % edges mean 44.6
\hline
\end{tabular}
\caption{Statistics of the graph edit distance datasets}
\label{tab:ex1-data}
\end{table}
Note that \#Train and \#Val/Test denote numbers of graphs. We train and validate/test on all possible pairs of graphs, excluding pairs of a graph with itself.
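For concreteness, the pair construction can be sketched as follows (a minimal Python sketch using ordered pairs for illustration; the placeholder list stands in for the actual graph objects):
\begin{verbatim}
# Enumerate all pairs of distinct graphs within one split.
train_graphs = list(range(144))   # placeholder for the 144 training graphs
train_pairs = [(g1, g2)
               for g1 in train_graphs
               for g2 in train_graphs
               if g1 != g2]       # skip pairs of a graph with itself
# len(train_pairs) == 144 * 143
\end{verbatim}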
\section{Control Flow Dataset}
\label{appendix:CFG}
\cite{li2019} used control flow graphs to evaluate their model. They applied metric learning (margin loss with pairs and with triplets) so that the same function compiled with different compilers is recognized as similar, while different functions are recognized as dissimilar. They did not share their dataset of FFmpeg functions with us, but referred us to a GitHub repository, which I forked (\url{https://github.com/johannespitz/functionsimsearch/tree/master}) and used to generate the UnRAR datasets, which (at the time of writing) can be found at \texttt{/nfs/students/pitz/cfgs} on the file server. To generate the graph collections I used the notebook \texttt{cfgs.ipynb} in the graph-distance repository. Note: use the \texttt{.pickle} version, because loading it is much faster than the \texttt{.npz} version.
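A minimal sketch of how one of the generated graph collections can be loaded follows; the file name is hypothetical, only the directory is the one mentioned above:
\begin{verbatim}
import pickle

# Hypothetical file name; the actual collections under
# /nfs/students/pitz/cfgs are produced by cfgs.ipynb.
path = "/nfs/students/pitz/cfgs/unrar_collection.pickle"
with open(path, "rb") as f:
    collection = pickle.load(f)   # loads much faster than the .npz variant
\end{verbatim}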
Currently we provide a few different options for train/val/test splits. The simplest is train\_all/val/test, an 80/10/10 split in terms of functions (not graphs); the attraction list contains all pairs of graphs that stem from the same function, and the repulsion list contains an equal number of negative samples. The train\_all split can then be divided further in two ways:
\begin{itemize}
\item Option 1: train\_across/val\_across/test\_across. Out of the $n$ graphs of a function we pick one to be the test graph, and all pairs between it and the remaining $n-1$ graphs are placed in the test attraction list. From the remaining $n-1$ graphs we then draw a validation graph, and again all pairs between it and the remaining $n-2$ graphs are placed in the validation attraction list. The attraction list of the train\_across split contains all possible pairs of the remaining $n-2$ graphs (see the sketch after this list).
\item Option 2: train12/val1/val2/test1/test2. We split train\_all into chunk/val2/test2 (70/15/15) in terms of graphs (not functions), while ensuring that at least one graph of each function is contained in the chunk. We then generate all positive pairs, but the pairs of the chunk are split again into train12/val1/test1 (80/10/10), ensuring that every graph appears in train12 at least once.
\end{itemize}
For both options, each repulsion list contains as many negative samples as the corresponding attraction list. Triplets are only available with the simple train\_all/val/test (80/10/10) split.
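To make Option 1 concrete, the across-split construction can be sketched as follows (a minimal Python sketch; names are illustrative, this is not the actual split code, and every function is assumed to have at least three graphs):
\begin{verbatim}
import random

def across_split(graphs_by_function, seed=0):
    """Per function: hold out one test and one validation graph and
    build the attraction (positive) pair lists of all three splits."""
    rng = random.Random(seed)
    train_pos, val_pos, test_pos = [], [], []
    for graphs in graphs_by_function.values():
        graphs = list(graphs)
        rng.shuffle(graphs)
        test_g, val_g, rest = graphs[0], graphs[1], graphs[2:]
        # test graph paired with the remaining n-1 graphs
        test_pos += [(test_g, g) for g in [val_g] + rest]
        # validation graph paired with the remaining n-2 graphs
        val_pos += [(val_g, g) for g in rest]
        # training: all pairs among the remaining n-2 graphs
        train_pos += [(g1, g2) for i, g1 in enumerate(rest)
                               for g2 in rest[i + 1:]]
    return train_pos, val_pos, test_pos
\end{verbatim}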
Our model can be trained with pairs or triplets as is. I did not include any results on this dataset in the report because the baselines cannot be trained on it without additional work.
\section{Yeast Dataset}
\label{appendix:yest}
The Yeast Dataset (\url{https://www3.nd.edu/~cone/MAGNA++/}), used by \cite{yeast2019}, contains the protein-protein interaction (PPI) network of yeast and noisy versions thereof. The base network contains 1,004 proteins and 4,920 high-confidence interactions. The noisy versions contain additional low-confidence interactions.
The first obstacle with this dataset would have been the generation of train/val/test splits. Moreover, we could not have compared our results directly with \cite{yeast2019} because their method is unsupervised. The second obstacle is that the dataset contains no node or edge features, which matters because GNNs are particularly good at exploiting node and also edge features; it is therefore questionable whether a GNN-based model would perform impressively enough. In the end we decided not to use this dataset, also because we discovered the paper by \cite{fey2020_update}, who use several graph matching datasets that fit our model much better.
\section{Consensus Step}
\label{appendix:consensus}
The main difference between our model and \cite{fey2020_update} is that they normalize the cost matrix $C$ directly instead of the kernel matrix $e^{-\frac{C}{\lambda}}$. Consequently, they report poor gradients with Sinkhorn normalization and use a row-wise $\operatorname{softmax}$ instead. Their cost matrix contains only the pairwise node-to-node entries, without the padding norms; this is unproblematic for tasks in which every node in the first graph has a match in the second graph, but could be problematic in other settings. Table \ref{tab:diff-models} collects the differences between the two models (DGC refers to the model of \cite{fey2020_update}, GDN to ours):
\begin{table}[htbp]
\addtolength{\tabcolsep}{-1pt}
\fontsize{9pt}{10.25pt}\selectfont
\centering
\renewcommand{\arraystretch}{1.2}
\begin{tabular}{|C{1cm}|c|c|}
\hline
& DGC & GDN \\
\hline
$C^0_{i,j} $ & $h_i^0 \cdot h_j^0$ & $\begin{cases}
\norm{h_i^0 - h_j^0}_2 & \text{if}\ 1 \leq i \leq N_1,\ 1 \leq j \leq N_2\\
\norm{\alpha h_j^0}_2 & \text{if}\ N_1 < i \leq N_2,\ 1 \leq j \leq N_2
\end{cases}$ \\
% \hline
$C^l_{i,j} $ & $C^{l - 1}_{i,j} + \text{MLP}(h_i^l - h_j^l)$ & $
\begin{cases}
\beta^l C^0_{i,j} + (1 - \beta^l) \text{MLP}(h_i^l - h_j^l) & \text{if}\ 1 \leq i \leq N_1,\ 1 \leq j \leq N_2\\
\left(\gamma^l\right)^2 C^{l - 1}_{i,j} & \text{if}\ N_1 < i \leq N_2,\ 1 \leq j \leq N_2
\end{cases}
$ \\
% \hline
$M^l $ & $\operatorname{softmax}(C^l)$ & $\operatorname{sinkhorn}\left(e^{-\frac{\text{reg}^l \, C^l}{\lambda}}\right)$ \\
\hline
\end{tabular}
\caption{Differences between the two models}
\label{tab:diff-models}
\end{table}
Note 1: all superscripts indicate the layer of the consensus step.
Note 2: $\gamma^l \in \mathbb{R}$ is a trainable parameter initialized to 1.
Note 3: \cite{fey2020_update} use only pairs in which every node in the source graph has a match in the target graph (hence $N_1 \leq N_2$), and since we use the no-BP cost matrix, the indices range over $1 \leq i, j \leq N_2$.
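For reference, the $\operatorname{sinkhorn}$ operation in the last row of Table \ref{tab:diff-models} can be sketched as follows (a minimal NumPy sketch; the default values of reg and lam and the fixed iteration count are illustrative, not those of our implementation):
\begin{verbatim}
import numpy as np

def sinkhorn(C, lam=1.0, reg=1.0, n_iter=20):
    """Sinkhorn normalization of the kernel matrix K = exp(-reg * C / lam):
    alternate row and column normalization until K is (approximately)
    doubly stochastic."""
    K = np.exp(-reg * C / lam)
    for _ in range(n_iter):
        K = K / K.sum(axis=1, keepdims=True)   # normalize rows
        K = K / K.sum(axis=0, keepdims=True)   # normalize columns
    return K
\end{verbatim}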
% $$ \begin{cases}
% \beta^l C^0_{i,j} + (1 - \beta^l) \text{MLP}(h_i - h_j) & \text{if}\ 1 \leq i \leq N_1,\ 1 \leq j \leq N_2\\
% \gamma^2 C^0_{i,j} & \text{if}\ i > N_1,\ 1 \leq j \leq N_2\\
% \gamma^2 C^0_{i,j} & \text{if}\ 1 \leq i \leq N_1,\ j > N_2
% \end{cases}
% $$
% \begin{alignat*}{3}
% & && \quad \text{\cite{fey2020_update}} && \quad \text{Ours} \\
% &C^0_{i,j} &&= h_i \cdot h_j &&= \norm{h_i - h_j}_2 \qquad \text{(fill with norms)} \\
% &C^l_{i,j} &&= C^{l - 1}_{i,j} + \text{MLP}(h_i - h_j) &&= \beta_l C^0_{i,j} + (1 - \beta_l) \text{MLP}(h_i - h_j) \qquad \text{(keep old norms scaled by } \alpha^2 \text{)} \\
% &M^l &&= \operatorname{softmax}(C^l) &&= \operatorname{sinkhorn}(e^{-\frac{C^l * \text{reg}}{\lambda}}) \\
% \end{alignat*}
% We also implemented a consensus step based on their work. We use instead of
% \\
% $C_0 = h_s \cdot h_t$\\
% $C_0 = \norm{h_s - h_t}_2$ fill with norms\\
% $C_l = C_{l - 1} + \text{MLP}(h_s - h_t)$\\
% $C_l = \beta_l C_0 + (1 - \beta_l) \text{MLP}(h_s - h_t)$ keep old norms scaled by $\alpha^2$\\
% $M_l = \operatorname{softmax}(C_l)$ \\
% $M_l = \operatorname{sinkhorn}(e^{-\frac{C_l * \text{reg}}{\lambda}}) \quad \text{reg} = \frac{\norm{C_l}_1}{\max(n1,n2)^2}$\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \begin{equation}
% \begin{gathered}
% h_i^\text{final} = [h_i^L ; \dots ; h_i^1] \in \mathbb{R}^{d_\text{final}}\\
% C_{i,j} = \norm{h_i^\text{final} - h_j^\text{final}} \\
% C_{i, \epsilon} = \norm{ \alpha h_i^\text{final}} \quad
% C_{\epsilon, j} = \norm{ \alpha h_j^\text{final}} \\ i \in \{1 \dots N_1\}, j \in \{1 \dots N_2\}
% \end{gathered}
% \end{equation}
% \begin{equation}
% C_\text{no-BP} \in \mathbb{R}^{\max({N_1, N_2}) \times \max({N_1, N_2})}
% \end{equation}
% \begin{equation}
% \frac{\partial d}{\partial c_{i,j}} = m^*_{i,j}
% \end{equation}
% \begin{equation}
% d(C) = \min_{M} \langle M, C \rangle_\mathrm{F} + \lambda \text{H}(M) \quad \text{s.t. } \sum_i m_{i,j} = \sum_j m_{i,j} = 1
% \end{equation}
% \begin{equation}
% M^* = \text{Sinkhorn}(e^{-\frac{C}{\lambda}})
% \end{equation}
% \begin{table}[htbp]
% \addtolength{\tabcolsep}{-1pt}
% \fontsize{9pt}{10.25pt}\selectfont
% \centering
% \renewcommand{\arraystretch}{1.2}
% \begin{tabular}{|l|c|c|c|c|}
% \hline
% \multirow{2}{*}{} & \multicolumn{2}{c|}{Pref-Attachment} & \multicolumn{2}{c|}{AIDS} \\ \hhline{|~|-|-|-|-|}
% & Val & Test & Val & Test \\ \hhline{|=|=|=|=|=|}
% Riba et al. (2018) & $12.2 \pm 0.2$ & $12.1 \pm 0.6$ & $15.5 \pm 0.3$ & $15.6 \pm 0.3$ \\ \hline
% Bai et al. (2019) & $7.7 \pm 1.0$ & $9.6 \pm 2.5$ & $4.2 \pm 0.3$ & $8.7 \pm 0.1$ \\ \hline
% Li et al. (2019) & $5.5 \pm 0.1 $ & $7.8 \pm 0.3$ & $10.6 \pm 0.3$ & $11.7 \pm 0.9$ \\ \hline
% GDN & $4.2 \pm 0.1$ & $\boldsymbol{4.5 \pm 0.2}$ & $3.3 \pm 0.04$ & $\boldsymbol{6.2 \pm 1.0}$ \\ \hline
% \end{tabular}
% \caption{RMSE to ground truth GED with standard deviation across 3 runs.}
% \label{tab:ex1-baselines}
% \end{table}
% \begin{table}[ht]
% \centering
% \begin{tabular}{|c|c|c|}
% \hline
% \textbf{Method}&\textbf{Consensus} &\textbf{Mean} \\
% \hline
% DGC & L=0 &67.3\\
% \hline
% GDN & L=0&70.5\\
% \hline
% DGC & L=10&72.8\\
% \hline
% GDN & L=10&70.0\\
% \hline
% \end{tabular}
% \caption{Hits@1 (\%) on PASCALVOC}
% \label{tab:ex2}
% \end{table}