\section{Related Work}
\label{section:related_work}
Researchers have already applied graph neural networks to the task of GED approximation. Bai et al. proposed in 2018 (published as \citealp{bai2019}) to apply the same network to both graphs and to aggregate the resulting node embeddings with an attention layer into two global graph embeddings, from which a distance can be computed. In addition to this fully differentiable and therefore trainable part, they compute a histogram of the pairwise distances between node embeddings to incorporate some of the local node-level information. However, the histogram operation is not differentiable, and it is unclear whether the model can properly exploit this information. Moreover, this process reduces any graph to a fixed-size embedding and is unlikely to scale to larger graphs without hand-tuning the embedding size. Despite these flaws, the authors show that their method significantly outperforms classical methods while being orders of magnitude faster at inference.
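One common form of such attention aggregation, written in our own notation and not necessarily matching the authors' exact parameterization, first computes a global context vector and then weights each node embedding by its agreement with that context:
\begin{equation*}
c = \tanh\Bigl(\frac{1}{N} \sum_{n=1}^{N} W h_n\Bigr), \qquad a_n = \sigma\bigl(h_n^{\top} c\bigr), \qquad h_G = \sum_{n=1}^{N} a_n h_n,
\end{equation*}
where $h_n$ are the node embeddings, $W$ is a learned weight matrix, and $\sigma$ denotes the logistic sigmoid.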
Later, \cite{bai2018_cnn1} proposed to apply a standard 2-D Convolutional Neural Network (CNN) to the matrix of pairwise distances, replacing the histogram. As the authors point out, the main problems with this approach are permutation invariance and spatial locality. Firstly, a CNN depends on the ordering of the nodes in the graph representation, so the model is not invariant to permutations. Secondly, CNNs introduce a strong structural prior by assuming that nearby points are strongly related while points further apart are not, which increases the importance of the node order further. The authors decided to use breadth-first-search node ordering (\citealp{bfs2018}), claiming that it places nearby nodes close to each other. However, this ordering cannot be guaranteed to be unique. It is not yet known whether a polynomial-time algorithm for finding a canonical ordering exists (\citealp{canonical2016}), so the predictions can depend on the representation of the graphs. Nevertheless, the authors draw an insightful connection between convolutional kernels and the optimal assignment problem and report strong empirical results compared to classical algorithms.
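In our notation, and under the simplifying assumption that pairwise similarities are plain inner products, the input to the CNN can be sketched as the matrix
\begin{equation*}
S \in \mathbb{R}^{N_1 \times N_2}, \qquad S_{ij} = \langle a_i, b_j \rangle,
\end{equation*}
where $a_i$ and $b_j$ are the node embeddings of the two graphs in the chosen node order; the CNN then consumes $S$ like a single-channel image, and a final fully connected layer regresses the similarity score.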
Independently, \cite{riba2018} proposed a similar model to estimate graph distances. They apply the same message passing network to both graphs, generating sets $A$ and $B$ of node embeddings (one set per graph, with $N_1$ and $N_2$ nodes respectively). The distance between two graphs is then computed as:
\begin{equation}
d(G_{1}, G_{2}) = \frac{\sum_{a \in A} \inf_{b \in B} d(a, b) + \sum_{b \in B} \inf_{a \in A} d(a, b)}{N_1 + N_2}
\end{equation}
This equation is derived as a soft variant of the Hausdorff distance, also known as the Chamfer distance in computer vision (\citealp{chamfer1977}). Instead of approximating the GED directly, they validate their method on graph classification and keyword spotting datasets, which makes their empirical results difficult to compare to those of Bai et al. (\citealp{bai2019,bai2018_cnn1}). The method is interesting because it generates a complete matching between nodes and might scale to larger graphs without hand-tuning the embedding size. Local node structures are incorporated during the message passing, and the global node correspondence is used for predicting the final distance.
\cite{li2019} used a similar method, called Graph Matching Network (GMN), for graph classification and for detecting vulnerable binaries, which they search for by comparing the control flow graphs of functions across different compilers. The GMN also applies message passing layers to both graphs, but modifies the message passing scheme so that information from one graph flows into the node updates of the other graph.
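In our notation, loosely following \cite{li2019} and not necessarily matching their exact formulation, this cross-graph step can be sketched as an attention-weighted difference: for a node $i$ in $G_1$ with embedding $h_i$,
\begin{equation*}
a_{j \to i} = \frac{\exp\bigl(s(h_i, h_j)\bigr)}{\sum_{j' \in G_2} \exp\bigl(s(h_i, h_{j'})\bigr)}, \qquad \mu_i = \sum_{j \in G_2} a_{j \to i} \, (h_i - h_j),
\end{equation*}
where $s$ is a similarity function (e.g. the inner product) and the matching vector $\mu_i$ enters the node update alongside the usual within-graph messages.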
This process is fully differentiable and makes use of local node-level and global graph-level information, but it still suffers from the problem of reducing the entire graph to a single embedding vector, which is expected to scale unfavorably to larger graphs. Moreover, computing the cross-graph matching vectors costs $\mathcal{O}(N_1 N_2)$. While all previously described models technically have the same complexity ($\mathcal{O}(N_1 N_2 + \vert E_1 \vert + \vert E_2 \vert)$), the GMN incurs this quadratic cross-graph term in every message passing layer.
% Note that in principle one could also apply GMN layers and then use the Soft Hausdorff Distance, making these two methods somewhat orthogonal.
\cite{fey2020_update} apply message passing networks to graph matching, testing their method on keypoint matching of objects between pictures and on cross-lingual knowledge graph alignment. Although this task differs from those mentioned previously, their model closely resembles that of \cite{riba2018}: they also apply the same network to both graphs and compute a matching. Instead of the Chamfer distance, \cite{fey2020_update} apply the Sinkhorn normalization (\citealp{sinkhorn2013}) to the matrix of pairwise distances. The Sinkhorn normalization returns a doubly stochastic matrix, which is interpreted as a soft correspondence between nodes. One could immediately extract a matching by taking the maximum likelihood estimate, but the authors instead apply a newly proposed consensus step. They generate an injective node coloring (random node embeddings in practice) for one graph and use the soft correspondence matrix to map it to the other graph. Using the generated node embeddings, they apply another message passing network to both graphs, yielding, after normalization, a new soft correspondence matrix. This step is repeated multiple times, and the final matching is read off the last soft correspondence matrix. Unfortunately, the authors report that in practice they had to use the $\operatorname{softmax}$ operator on each row instead of the Sinkhorn normalization, because the latter provided too little gradient information. With the row-wise $\operatorname{softmax}$, however, the model is no longer necessarily symmetric, i.e. $M(G_1, G_2) \neq M(G_2, G_1)$.
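A minimal sketch of the Sinkhorn normalization in our notation: starting from a similarity matrix $S \in \mathbb{R}^{N_1 \times N_2}$, set $K^{(0)} = \exp(S)$ (element-wise) and alternately normalize rows and columns,
\begin{equation*}
K^{(t+1)} = \mathcal{N}_{\mathrm{col}}\bigl(\mathcal{N}_{\mathrm{row}}\bigl(K^{(t)}\bigr)\bigr),
\end{equation*}
where $\mathcal{N}_{\mathrm{row}}$ divides each row by its sum and $\mathcal{N}_{\mathrm{col}}$ each column by its sum; for square matrices the iteration converges to a doubly stochastic matrix. The row-wise $\operatorname{softmax}$ used in practice corresponds to stopping after the very first row normalization.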
% This can be a problem if the graphs have different number of nodes.
% In a toy experiment they show that with consensus steps their method is robust to node additions and removal, but it certainly is less principled because the soft correspondence matrix and the resulting matching is not necessarily symmetric, $M(G_1, G_2) \neq M(G_2, G_1)$.
% consensus is orthogonal to our work
% maybe add how everyone computes their pairwise distances
% < a, b >, MLP(a - b), ||a - b||_p
% In the end Sinkhorn > soft hausdorff, and gmn is probably obsolete
% really fast only if we never generate a full matching