\section{Learning Distances via Matching}
A metric is defined by four properties: non-negativity, symmetry, the identity of indiscernibles, and the triangle inequality. A model for learning metrics should be designed to enforce as many of these properties as possible. However, guaranteeing the identity of indiscernibles would require solving the graph isomorphism problem, which is beyond the expressive power of standard GNNs (\citealp{gin2019}), and the triangle inequality is extremely challenging to enforce on a powerful neural network model. We therefore focus on symmetry and non-negativity. Moreover, the model should be invariant to the representation of the graph, i.e. the order in which its nodes are stored in memory.
While the model should ideally be able to learn any metric on a set of graphs, we focus on the ubiquitous GED for the derivation. The GED between two graphs is defined analogously to the string edit distance:
\begin{equation}
\text{GED}(G_{1},G_{2}) = \min_{(e_{1},\dots,e_{k}) \in \mathcal{P}(G_{1},G_{2})} \sum_{i=1}^{k} c(e_{i})
\end{equation}
where $e_{i}$ are edit operations (edge/node addition, removal, and substitution), $c(e)$ is the cost of an edit operation, and $\mathcal{P}(G_{1},G_{2})$ is the set of all possible edit paths that transform $G_{1}$ into a graph isomorphic to $G_{2}$.
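As a concrete illustration, the following minimal sketch computes the exact GED of two tiny graphs with the NetworkX library under unit edit costs; the exhaustive search over edit paths is only tractable for very small graphs.
\begin{verbatim}
# Minimal sketch: exact GED between two tiny graphs with unit edit costs,
# using NetworkX. The search over edit paths grows exponentially, so this
# is only feasible for very small inputs.
import networkx as nx

G1 = nx.path_graph(4)    # 0-1-2-3
G2 = nx.cycle_graph(4)   # 0-1-2-3-0

# One optimal edit path simply inserts the edge (3, 0), so the GED is 1.
print(nx.graph_edit_distance(G1, G2))  # -> 1.0
\end{verbatim}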
A simple model for predicting the GED is to embed both graphs into a common vector space and compute the distance between the two resulting vectors. Message passing networks yield permutation equivariant node embeddings, which can be aggregated into a single graph embedding; any distance between two vectors (cosine, $p$-norm, etc.) then guarantees non-negative and symmetric predictions. The main problem with this approach is that the embedding size becomes an information bottleneck for larger graphs. Working with graphs of varying size is impractical because increasing or decreasing the embedding size requires retraining the entire model from scratch.
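A minimal sketch of this baseline, assuming node embeddings of shape $N_1 \times D$ and $N_2 \times D$ have already been produced by some message passing network (the GNN itself is omitted):
\begin{verbatim}
# Sketch of the embedding baseline: pool node embeddings into one fixed-size
# vector per graph and compare the two vectors with a p-norm. Any such vector
# distance is non-negative and symmetric by construction, but the pooled
# vector of size D is an information bottleneck for large graphs.
import torch

def graph_distance(h1: torch.Tensor, h2: torch.Tensor, p: float = 2.0):
    # h1: [N1, D], h2: [N2, D] node embeddings from a message passing network
    g1 = h1.mean(dim=0)              # [D] graph embedding
    g2 = h2.mean(dim=0)              # [D]
    return torch.norm(g1 - g2, p=p)  # non-negative, symmetric
\end{verbatim}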
To solve this problem we propose\footnote{\cite{riba2018} already used an implicit matching for predicting the GED with neural networks, but we improve on their work by utilizing a classical cost matrix and modern matching tools.} to match the nodes of both graphs, similar to classical GED algorithms (\citealp{hungarian2009}; \citealp{frankhauser2011}). These algorithms are based on bipartite graph matching: they set up a cost matrix $C \in \mathbb{R}^{N \times N}$, where $N = N_1 + N_2$, and solve the following assignment problem:
\vspace{.2cm}
\noindent
\begin{minipage}{.5\linewidth}
\[
C=
\left[
\begin{array}{ccc|ccc}
C_{1,1} & \dotsi & C_{1, N_2} & C_{1, \epsilon} & \dotsi & \infty \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
C_{N_1, 1} & \dotsi & C_{N_1, N_2} & \infty & \dotsi & C_{N_1, \epsilon} \\
\hline
C_{\epsilon, 1} & \dotsi & \infty & 0 & \dotsi & 0 \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\infty & \dotsi & C_{\epsilon, N_2} & 0 & \dotsi & 0 \\
\end{array}
\right]
\]
\end{minipage}%
\begin{minipage}{.5\linewidth}
\begin{equation}
\begin{gathered}
% \min \sum_{i = 1}^{N} \sum_{j = 1}^{N} M_{ij} C_{ij} = d(M, C) \\
d(C) = \min_M \langle M, C \rangle_\mathrm{F} \\
\text{subject to} \\
\sum_{i = 1}^{N} M_{ij} = 1 \quad \forall j \in \{1 \dots N\} \\
\sum_{j = 1}^{N} M_{ij} = 1 \quad \forall i \in \{1 \dots N\} \\
M_{ij} \in \{0, 1\}
\end{gathered}
\end{equation}
\end{minipage}
\vspace{.2cm}
Here $\langle \cdot, \cdot \rangle_\mathrm{F}$ denotes the Frobenius inner product. $C_{i, j}$ is the cost of replacing node $i$ with node $j$, $C_{\epsilon, j}$ the cost of inserting node $j$, and $C_{i, \epsilon}$ the cost of deleting node $i$. These costs depend on the direct neighborhood of a node and can be extended with information from random walks around it, but they still lack accuracy on graphs with complex global structure (\citealp{hungarian2009}). The assignment problem can be solved exactly by algorithms such as the Hungarian method, also known as the Kuhn-Munkres algorithm (\citealp{hungarian1955}), or the VJ algorithm (\citealp{vj1987}).
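As a point of reference, the following sketch solves the assignment problem for a given cost matrix with SciPy's Hungarian solver; replacing the infinite entries of $C$ by a large constant is an illustrative numerical convenience, not part of the classical formulation.
\begin{verbatim}
# Sketch: solve the node assignment for a cost matrix C of shape [N, N]
# (N = N1 + N2) with the Hungarian method and read off d = <M*, C>_F.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assignment_distance(C: np.ndarray) -> float:
    finite = np.where(np.isfinite(C), C, 1e9)   # large constant instead of inf
    rows, cols = linear_sum_assignment(finite)  # Kuhn-Munkres, roughly O(N^3)
    return float(finite[rows, cols].sum())      # d = <M*, C>_F
\end{verbatim}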
To allow the model to also learn other metrics on graphs, we use a message passing network to learn the entries of the cost matrix $C$. The network generates node embeddings that contain information about each node's local neighborhood. We then compute pairwise distances between the nodes of both graphs, yielding a trainable equivalent of the upper left block of the cost matrix. The corresponding values for insertion and deletion can be learned as well, e.g. by taking some norm of the embeddings. Analogously to the classical approach, we compute the distance as $d = \langle M^*, C \rangle_\mathrm{F}$.
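The following sketch illustrates how such a trainable cost matrix could be assembled from node embeddings; the helper name, the use of Euclidean distances and embedding norms, and the large constant standing in for the infinite entries are illustrative assumptions.
\begin{verbatim}
# Sketch: build a trainable cost matrix C from node embeddings h1 [N1, D]
# and h2 [N2, D] produced by the message passing network.
import torch

BIG = 1e6  # stand-in for the infinite entries of the classical cost matrix

def build_cost_matrix(h1, h2):
    N1, N2 = h1.size(0), h2.size(0)
    sub = torch.cdist(h1, h2)                                        # [N1, N2] substitution
    delete = torch.diag(h1.norm(dim=1)) + BIG * (1 - torch.eye(N1))  # [N1, N1] deletion
    insert = torch.diag(h2.norm(dim=1)) + BIG * (1 - torch.eye(N2))  # [N2, N2] insertion
    zeros = torch.zeros(N2, N1)                                      # epsilon-epsilon block
    top = torch.cat([sub, delete], dim=1)                            # [N1, N1+N2]
    bottom = torch.cat([insert, zeros], dim=1)                       # [N2, N1+N2]
    return torch.cat([top, bottom], dim=0)                           # [N, N], N = N1+N2
\end{verbatim}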
A matching $M^*$ can be obtained in several ways:
\begin{itemize}
\itemsep0em
\item \textbf{Nearest Neighbor.} Taking the sum of the row-wise minima (and, for the sake of symmetry, the sum of the column-wise minima) yields what is known in computer vision as the \textbf{Chamfer Distance}. It can therefore be computed with the $\min$ operator alone, without an explicit matching. An explicit matching, to use in the Frobenius inner product above, can be obtained with the $\argmin$ operator. Either way the process runs in $\mathcal{O}(N^2)$.
\item \textbf{Optimal Assignment.} Applying the Kuhn-Munkres algorithm results in an optimal assignment, $M_{ij} \in \{0,1\}$, in roughly $\mathcal{O}(N^3)$ operations.
\item \textbf{Optimal Transport.} The optimal transport problem is a continuous relaxation of the optimal assignment problem, where $M_{ij} \in \left[ 0,1 \right]$. The resulting distance, usually called the \textbf{Earth Mover's Distance} in this setting, can be computed in roughly $\mathcal{O}(N^3 \log(N))$ operations.
\item \textbf{Regularized Optimal Transport.} \cite{sinkhorn2013} introduced an algorithm for solving an entropy regularized version of the optimal transport problem, minimizing $\langle M, C \rangle_\mathrm{F} - \lambda \mathrm{H}(M)$, where $\mathrm{H}$ denotes the entropy of the matching. One simply computes the kernel matrix $K := e^{-\frac{C}{\lambda}}$ and applies Sinkhorn's iterative matrix scaling (see the sketch after this list). The result is called the \textbf{Sinkhorn Distance} and can be computed extremely fast in practice, at a cost of $\mathcal{O}(N^2)$ per scaling iteration.
\end{itemize}
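The sketch below implements the Sinkhorn distance for a given cost matrix, assuming unit marginals so that the soft matching is approximately doubly stochastic; the default values of the regularization strength and the iteration count are illustrative, and a log-domain implementation would be preferable for numerical stability but is omitted for brevity.
\begin{verbatim}
# Sketch of the Sinkhorn distance for a square cost matrix C [N, N].
import torch

def sinkhorn_distance(C, lam=0.1, iters=50):
    N = C.size(0)
    K = torch.exp(-C / lam)                  # kernel matrix K = exp(-C / lambda)
    r = torch.ones(N)                        # target row sums
    c = torch.ones(N)                        # target column sums
    u, v = torch.ones(N), torch.ones(N)
    for _ in range(iters):                   # Sinkhorn's iterative matrix scaling
        u = r / (K @ v)
        v = c / (K.t() @ u)
    M = u.unsqueeze(1) * K * v.unsqueeze(0)  # soft matching, rows/cols sum to ~1
    return (M * C).sum(), M                  # d = <M, C>_F and the matching
\end{verbatim}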
While the $\min$ operation involves a discrete selection, deep learning frameworks implement its gradient as the piecewise function to which the gradient of the $\operatorname{softmin}$ converges as the temperature goes to zero. This allows for direct backpropagation through the Chamfer distance. The Sinkhorn scaling, and therefore the entire computation of the Sinkhorn distance, is also differentiable. However, backpropagating through the iterative scaling process is not efficient, and for the other methods we cannot compute proper gradients due to discrete operations. Fortunately, we can apply Danskin's theorem (\citealp{danskin1967}) to obtain analytic gradients for the optimization problems above.
\begin{theorem}
If $\phi: {\mathbb R}^n \times Z \rightarrow {\mathbb R}$ is a continuous function of two arguments, where $Z \subset {\mathbb R}^m$ is a compact set, and $\phi(x,z)$ is convex in $x$ for every $z \in Z$,
then the function $f(x) = \max_{z \in Z} \phi(x,z)$ is differentiable wherever the maximizer is unique, with derivative
\begin{equation}
D f(x) = D_x \phi(x,z^*) \text{, where } z^*(x) = \argmax_{z \in Z} \phi(x,z).
\end{equation}
\end{theorem}
Since $\langle M, C \rangle_\mathrm{F}$ is affine in $C$, its negation $-\langle M, C \rangle_\mathrm{F}$ is also affine and hence convex in $C$ for every fixed $M$. Moreover, the set of all possible matchings $M$ is compact. Writing $d(C) = \min_M \langle M, C \rangle_\mathrm{F} = -\max_M \bigl(-\langle M, C \rangle_\mathrm{F}\bigr)$, Danskin's theorem yields that the derivative of $d$ with respect to $C$ is the derivative of the Frobenius inner product between $C$ and the optimal matching $M^*$, which we have already computed in the forward pass. The final gradient is simply the matrix $M^*$ itself. For the discrete assignment of the Kuhn-Munkres algorithm this gradient is a piecewise function, similar to the gradient of the $\min$ operator. Armed with this analytic gradient we can quickly compute gradient updates and learn in an end-to-end fashion.
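In practice this amounts to computing $M^*$ outside the automatic differentiation graph and treating it as a constant, as in the following sketch (using SciPy's Hungarian solver for the exact assignment and assuming a finite cost matrix):
\begin{verbatim}
# Sketch: Danskin-style gradient for the exact assignment. M* is computed
# outside the autograd graph; d = <M*, C>_F then has gradient dd/dC = M*.
import torch
from scipy.optimize import linear_sum_assignment

def matching_distance(C: torch.Tensor) -> torch.Tensor:
    cost = C.detach().cpu().numpy()           # keep M* out of autograd
    rows, cols = linear_sum_assignment(cost)  # Kuhn-Munkres
    M_star = torch.zeros_like(C)              # constant w.r.t. autograd
    M_star[rows, cols] = 1.0
    return (M_star * C).sum()                 # gradient w.r.t. C is exactly M*
\end{verbatim}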
Preliminary experiments indicated that the Sinkhorn distance is not only about as fast to compute as the Chamfer distance, but also yields better results than the other, much slower methods. The regularization results in softer matchings compared to the hard assignments or the still very peaky matchings of the Earth Mover's Distance. We believe these soft matchings propagate more gradient information to the message passing network, allowing it to thoroughly adapt to the given distribution of graphs. Note that the regularization parameter $\lambda$ controls how soft the matchings are.
The final objective function depends on the specific task. For learning the GED we simply use the mean squared error between predicted and ground truth values. More specialized objectives, such as the contrastive loss proposed by \cite{riba2018} for keyword spotting or the pairwise margin and triplet losses proposed by \cite{li2019} for detecting fraudulent binaries, can also be used with our model. In addition to the experiments in this report, we applied our model to a similar dataset of control flow graphs; details can be found in Appendix \ref{appendix:CFG}.
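For the GED regression setting, the objective reduces to a one-liner; the batch layout shown here is an assumption.
\begin{verbatim}
# Sketch of the GED regression objective: the predicted matching distances
# for a batch of graph pairs are regressed onto the ground-truth GED values.
import torch
import torch.nn.functional as F

def ged_loss(pred_dist: torch.Tensor, true_ged: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(pred_dist, true_ged)    # mean squared error
\end{verbatim}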
In response to \cite{fey2020_update}, who used a similar model for graph matching, we decided to evaluate our model on a graph matching task as well. Another graph matching dataset that we investigated briefly can be found in Appendix \ref{appendix:yest}. Because graph matching uses the matching matrix directly in the objective function, we cannot apply Danskin's theorem in this setting and therefore have to backpropagate through the Sinkhorn iterations. However, we found that 50 iterations are sufficient to obtain a good matching, and computing the gradient updates remains reasonably fast.
% Note wikipedia gives best running times as $O(mn + n^2 * \log(n))$ for Munkres
% and $O(N \log(N)^2)$ for EMD